liftover_pos

Liftover between human genomes for cpg positions.

Coordinate Processing and Validation

This module provides tools for: - Genomic Liftover: Convert coordinates between different genome builds (e.g., hg19 to hg38). - CpG Validation: Fetch sequences at positions and verify they contain valid CpG sites. - Reference Detection: Automatically determine the most likely reference genome for unknown data.

Import Dependencies

Working with Sample Data

# export
import pandas as pd
import re
from bs_cpg.setup import *
from bs_cpg.download_ref import *
from pathlib import Path
import pysam

Downloading sample data to /tmp/bs-cpg-data/.bs-cpg/sample_cpg_df.parquet...

# Example: Load sample CpG data
df = read_sample_cpg(["chromosome", "pos"])
print(f"Sample data shape: {df.shape}")
print(df.head())

	chromosome	pos
1594976	chr11	30607691
41581	chr1	10269003
2926242	chrX	55066102
573570	chr3	169781668
1930113	chr14	59950571
...	...	...
2556779	chr19	17438314
452764	chr2	241331303
1554781	chr10	135171502
898587	chr6	31869119
1764855	chr12	70024545

1000 rows × 2 columns

# View first few chromosome values
print("Sample chromosomes:")
print(df["chromosome"].to_list()[:5])

['chr11', 'chr1', 'chrX', 'chr3', 'chr14']

# import pysam
# fasta = pysam.FastaFile("data/hg38.fa")
# def fetch_seq(row):
#     return fasta.fetch(row["new_chrom"], row["Start"], row["End"])

## CpG Site Validation

### Fetching Sequences from Reference Genomes

# Example: Open a reference genome
# fasta = pysam.FastaFile(str(get_ref_genome("hg19")))

✅ Final file '/mnt/idms/home/magyary/.bs-cpg/hg19.fa.bgz' already exists.

# Example: Fetch a 50bp sequence from chr1 (0-based coordinates)
# seq = fasta.fetch(reference="chr1", start=90000, end=90050)
# print(f"Sequence: {seq}")

'tctcttgctgccctggagaccagctgccccacgaaggaaacagagccaac'

source

cpg_reads

 cpg_reads (chromosomes:Sequence, positions:Sequence,
            ref_genome:Path|str|pysam.FastaFile, index_base:int=1)

*Fetch the 2bp sequence starting at each given genomic position from a reference genome.

This function retrieves CpG sites from a reference genome by fetching 2-base sequences at specified positions. It intelligently handles different chromosome naming conventions (e.g., ‘chr1’ vs ‘1’).

Args: chromosomes: list-like of chromosome names (e.g., ‘chr1’ or ‘1’). positions: list-like of genomic positions (ints). If index_base==1, positions are 1-based; if 0, 0-based. ref_genome: genome name (e.g., ‘hg19’), a path to a bgzipped FASTA, or an open pysam.FastaFile. index_base: 0 or 1 (default: 1). Specifies if input positions are 0-based or 1-based.

Returns: pd.Series: The fetched 2-bp sequences (uppercased). None when unavailable.

Examples: >>> # Fetch CpG reads from sample data >>> reads = cpg_reads(df[‘chromosome’], df[‘pos’], ‘hg19’, index_base=1) >>> print(reads.head())*

source

cpg_percent

 cpg_percent (reads)

*Calculate the percentage of valid CpG sites in a dataset.

Returns the percent of entries equal to ‘CG’ (case-insensitive) in reads. Accepts any iterable/Series; ignores NA/None values.

Args: reads: An iterable (list, Series, etc.) of 2-bp sequences.

Returns: float: Percentage of entries that are ‘CG’ (0-100).

Examples: >>> reads = cpg_reads(df[‘chromosome’], df[‘pos’], ‘hg19’) >>> percent_cg = cpg_percent(reads) >>> print(f”CpG sites: {percent_cg:.1f}%“)*

source

guess_ref_and_index_base

 guess_ref_and_index_base (chromosomes, positions,
                           ref_genomes:Sequence[Union[str,ForwardRef('Path
                           ')]], index_bases:Sequence[int]=(0, 1))

*Automatically identify the most likely reference genome and index base.

This function tries combinations of reference genomes and index bases, calculating the percentage of valid CpG sites for each combination. It returns the combination with the highest CpG percentage as the best guess.

Args: chromosomes: Chromosome identifiers (e.g., [‘chr1’, ‘chr2’]). positions: Genomic positions corresponding to chromosomes. ref_genomes: List of genome names or paths to test (e.g., [‘hg19’, ‘hg38’]). index_bases: List of index bases to test (default: [0, 1]).

Returns: dict: Contains: - best_genome: Most likely genome name/path - best_index_base: Most likely index base (0 or 1) - best_percent: Percentage of CpG sites with best parameters - results: List of dicts with all tried combinations

Examples: >>> result = guess_ref_and_index_base( … df[‘chromosome’], … df[‘pos’], … ref_genomes=[‘hg19’, ‘hg38’] … ) >>> print(f”Best genome: {result[‘best_genome’]} ({result[‘best_percent’]:.1f}% CpG)“)*

Genomic Liftover

Converting coordinates between genome builds (e.g., hg19 → hg38) is essential when working with data from different sources or genome versions.

source

liftover_positions

 liftover_positions (chromosomes:Sequence, positions:Sequence,
                     genome_from:Optional[str]=None,
                     genome_to:Optional[str]=None, index_base_from:int=1,
                     index_base_to:int=1, return_df:bool=False)

*Convert genomic coordinates between different genome builds or adjust index bases.

This function performs liftover operations when both genome_from and genome_to are specified, or simple base conversion when both are None. It handles edge cases like unmapped regions, invalid coordinates, and chromosome naming variations (e.g., ‘chr1’ vs ‘1’).

Args: chromosomes: Sequence of chromosome identifiers (e.g., [‘chr1’, ‘chr2’]). positions: Sequence of genomic positions. Must be same length as chromosomes. genome_from: Source genome build (e.g., ‘hg19’). Required for liftover. genome_to: Target genome build (e.g., ‘hg38’). Required for liftover. index_base_from: Input indexing system (0 or 1, default: 1 for 1-based). index_base_to: Output indexing system (0 or 1, default: 1 for 1-based). return_df: If True, return results as a pandas DataFrame; else return tuple of lists.

Returns: Tuple[List, List] or pd.DataFrame: If return_df=False: (new_chromosomes, new_positions) where unmapped positions are None. If return_df=True: DataFrame with detailed input/output columns.

Raises: ValueError: If chromosomes and positions have different lengths, or invalid bases. ImportError: If pandas is required (return_df=True) but not installed.

Examples: >>> # Liftover from hg19 to hg38 (1-based to 1-based) >>> new_chroms, new_pos = liftover_positions( … chromosomes=[‘chr1’, ‘chr1’], … positions=[100001, 200000], … genome_from=‘hg19’, … genome_to=‘hg38’, … index_base_from=1, … index_base_to=1 … ) >>> print(list(zip(new_chroms, new_pos))) >>> >>> # Base conversion only (no liftover) >>> new_chroms, new_pos = liftover_positions( … chromosomes=[‘chr1’], … positions=[100], … index_base_from=0, … index_base_to=1 … )*

# quick invariance checks (smoke tests)
# These won't fail the notebook export but help catch accidental regressions when running interactively.
try:
    # 0->0 should match 1->1 adjusted by conversion on both ends.
    ch0, p0 = liftover_positions(["chr1"], [100000], "hg19", "hg38", index_base_from=0, index_base_to=0)
    ch1, p1 = liftover_positions(["chr1"], [100001], "hg19", "hg38", index_base_from=1, index_base_to=1)
    if ch0[0] is not None and ch1[0] is not None:
        assert ch0[0] == ch1[0]
        assert p0[0] == p1[0]
except Exception:
    # Non-fatal in notebook context
    pass