liftover_pos

Liftover between human genomes for cpg positions.

Coordinate Processing and Validation

This module provides tools for: - Genomic Liftover: Convert coordinates between different genome builds (e.g., hg19 to hg38). - CpG Validation: Fetch sequences at positions and verify they contain valid CpG sites. - Reference Detection: Automatically determine the most likely reference genome for unknown data.

Import Dependencies

Working with Sample Data

# export
import pandas as pd
import re
from bs_cpg.setup import *
from bs_cpg.download_ref import *
from pathlib import Path
import pysam
Downloading sample data to /tmp/bs-cpg-data/.bs-cpg/sample_cpg_df.parquet...
# Example: Load sample CpG data
df = read_sample_cpg(["chromosome", "pos"])
print(f"Sample data shape: {df.shape}")
print(df.head())
chromosome pos
1594976 chr11 30607691
41581 chr1 10269003
2926242 chrX 55066102
573570 chr3 169781668
1930113 chr14 59950571
... ... ...
2556779 chr19 17438314
452764 chr2 241331303
1554781 chr10 135171502
898587 chr6 31869119
1764855 chr12 70024545

1000 rows × 2 columns

# View first few chromosome values
print("Sample chromosomes:")
print(df["chromosome"].to_list()[:5])
['chr11', 'chr1', 'chrX', 'chr3', 'chr14']
# import pysam
# fasta = pysam.FastaFile("data/hg38.fa")
# def fetch_seq(row):
#     return fasta.fetch(row["new_chrom"], row["Start"], row["End"])
## CpG Site Validation

### Fetching Sequences from Reference Genomes

# Example: Open a reference genome
# fasta = pysam.FastaFile(str(get_ref_genome("hg19")))
✅ Final file '/mnt/idms/home/magyary/.bs-cpg/hg19.fa.bgz' already exists.
# Example: Fetch a 50bp sequence from chr1 (0-based coordinates)
# seq = fasta.fetch(reference="chr1", start=90000, end=90050)
# print(f"Sequence: {seq}")
'tctcttgctgccctggagaccagctgccccacgaaggaaacagagccaac'

source

cpg_reads

 cpg_reads (chromosomes:Sequence, positions:Sequence,
            ref_genome:Path|str|pysam.FastaFile, index_base:int=1)

*Fetch the 2bp sequence starting at each given genomic position from a reference genome.

This function retrieves CpG sites from a reference genome by fetching 2-base sequences at specified positions. It intelligently handles different chromosome naming conventions (e.g., ‘chr1’ vs ‘1’).

Args: chromosomes: list-like of chromosome names (e.g., ‘chr1’ or ‘1’). positions: list-like of genomic positions (ints). If index_base==1, positions are 1-based; if 0, 0-based. ref_genome: genome name (e.g., ‘hg19’), a path to a bgzipped FASTA, or an open pysam.FastaFile. index_base: 0 or 1 (default: 1). Specifies if input positions are 0-based or 1-based.

Returns: pd.Series: The fetched 2-bp sequences (uppercased). None when unavailable.

Examples: >>> # Fetch CpG reads from sample data >>> reads = cpg_reads(df[‘chromosome’], df[‘pos’], ‘hg19’, index_base=1) >>> print(reads.head())*


source

cpg_percent

 cpg_percent (reads)

*Calculate the percentage of valid CpG sites in a dataset.

Returns the percent of entries equal to ‘CG’ (case-insensitive) in reads. Accepts any iterable/Series; ignores NA/None values.

Args: reads: An iterable (list, Series, etc.) of 2-bp sequences.

Returns: float: Percentage of entries that are ‘CG’ (0-100).

Examples: >>> reads = cpg_reads(df[‘chromosome’], df[‘pos’], ‘hg19’) >>> percent_cg = cpg_percent(reads) >>> print(f”CpG sites: {percent_cg:.1f}%“)*


source

guess_ref_and_index_base

 guess_ref_and_index_base (chromosomes, positions,
                           ref_genomes:Sequence[Union[str,ForwardRef('Path
                           ')]], index_bases:Sequence[int]=(0, 1))

*Automatically identify the most likely reference genome and index base.

This function tries combinations of reference genomes and index bases, calculating the percentage of valid CpG sites for each combination. It returns the combination with the highest CpG percentage as the best guess.

Args: chromosomes: Chromosome identifiers (e.g., [‘chr1’, ‘chr2’]). positions: Genomic positions corresponding to chromosomes. ref_genomes: List of genome names or paths to test (e.g., [‘hg19’, ‘hg38’]). index_bases: List of index bases to test (default: [0, 1]).

Returns: dict: Contains: - best_genome: Most likely genome name/path - best_index_base: Most likely index base (0 or 1) - best_percent: Percentage of CpG sites with best parameters - results: List of dicts with all tried combinations

Examples: >>> result = guess_ref_and_index_base( … df[‘chromosome’], … df[‘pos’], … ref_genomes=[‘hg19’, ‘hg38’] … ) >>> print(f”Best genome: {result[‘best_genome’]} ({result[‘best_percent’]:.1f}% CpG)“)*

Genomic Liftover

Converting coordinates between genome builds (e.g., hg19 → hg38) is essential when working with data from different sources or genome versions.


source

liftover_positions

 liftover_positions (chromosomes:Sequence, positions:Sequence,
                     genome_from:Optional[str]=None,
                     genome_to:Optional[str]=None, index_base_from:int=1,
                     index_base_to:int=1, return_df:bool=False)

*Convert genomic coordinates between different genome builds or adjust index bases.

This function performs liftover operations when both genome_from and genome_to are specified, or simple base conversion when both are None. It handles edge cases like unmapped regions, invalid coordinates, and chromosome naming variations (e.g., ‘chr1’ vs ‘1’).

Args: chromosomes: Sequence of chromosome identifiers (e.g., [‘chr1’, ‘chr2’]). positions: Sequence of genomic positions. Must be same length as chromosomes. genome_from: Source genome build (e.g., ‘hg19’). Required for liftover. genome_to: Target genome build (e.g., ‘hg38’). Required for liftover. index_base_from: Input indexing system (0 or 1, default: 1 for 1-based). index_base_to: Output indexing system (0 or 1, default: 1 for 1-based). return_df: If True, return results as a pandas DataFrame; else return tuple of lists.

Returns: Tuple[List, List] or pd.DataFrame: If return_df=False: (new_chromosomes, new_positions) where unmapped positions are None. If return_df=True: DataFrame with detailed input/output columns.

Raises: ValueError: If chromosomes and positions have different lengths, or invalid bases. ImportError: If pandas is required (return_df=True) but not installed.

Examples: >>> # Liftover from hg19 to hg38 (1-based to 1-based) >>> new_chroms, new_pos = liftover_positions( … chromosomes=[‘chr1’, ‘chr1’], … positions=[100001, 200000], … genome_from=‘hg19’, … genome_to=‘hg38’, … index_base_from=1, … index_base_to=1 … ) >>> print(list(zip(new_chroms, new_pos))) >>> >>> # Base conversion only (no liftover) >>> new_chroms, new_pos = liftover_positions( … chromosomes=[‘chr1’], … positions=[100], … index_base_from=0, … index_base_to=1 … )*

# quick invariance checks (smoke tests)
# These won't fail the notebook export but help catch accidental regressions when running interactively.
try:
    # 0->0 should match 1->1 adjusted by conversion on both ends.
    ch0, p0 = liftover_positions(["chr1"], [100000], "hg19", "hg38", index_base_from=0, index_base_to=0)
    ch1, p1 = liftover_positions(["chr1"], [100001], "hg19", "hg38", index_base_from=1, index_base_to=1)
    if ch0[0] is not None and ch1[0] is not None:
        assert ch0[0] == ch1[0]
        assert p0[0] == p1[0]
except Exception:
    # Non-fatal in notebook context
    pass