# export
import pandas as pd
import re
from bs_cpg.setup import *
from bs_cpg.download_ref import *
from pathlib import Path
import pysamDownloading sample data to /tmp/bs-cpg-data/.bs-cpg/sample_cpg_df.parquet...
This module provides tools for: - Genomic Liftover: Convert coordinates between different genome builds (e.g., hg19 to hg38). - CpG Validation: Fetch sequences at positions and verify they contain valid CpG sites. - Reference Detection: Automatically determine the most likely reference genome for unknown data.
Downloading sample data to /tmp/bs-cpg-data/.bs-cpg/sample_cpg_df.parquet...
| chromosome | pos | |
|---|---|---|
| 1594976 | chr11 | 30607691 |
| 41581 | chr1 | 10269003 |
| 2926242 | chrX | 55066102 |
| 573570 | chr3 | 169781668 |
| 1930113 | chr14 | 59950571 |
| ... | ... | ... |
| 2556779 | chr19 | 17438314 |
| 452764 | chr2 | 241331303 |
| 1554781 | chr10 | 135171502 |
| 898587 | chr6 | 31869119 |
| 1764855 | chr12 | 70024545 |
1000 rows × 2 columns
['chr11', 'chr1', 'chrX', 'chr3', 'chr14']
✅ Final file '/mnt/idms/home/magyary/.bs-cpg/hg19.fa.bgz' already exists.
'tctcttgctgccctggagaccagctgccccacgaaggaaacagagccaac'
cpg_reads (chromosomes:Sequence, positions:Sequence, ref_genome:Path|str|pysam.FastaFile, index_base:int=1)
*Fetch the 2bp sequence starting at each given genomic position from a reference genome.
This function retrieves CpG sites from a reference genome by fetching 2-base sequences at specified positions. It intelligently handles different chromosome naming conventions (e.g., ‘chr1’ vs ‘1’).
Args: chromosomes: list-like of chromosome names (e.g., ‘chr1’ or ‘1’). positions: list-like of genomic positions (ints). If index_base==1, positions are 1-based; if 0, 0-based. ref_genome: genome name (e.g., ‘hg19’), a path to a bgzipped FASTA, or an open pysam.FastaFile. index_base: 0 or 1 (default: 1). Specifies if input positions are 0-based or 1-based.
Returns: pd.Series: The fetched 2-bp sequences (uppercased). None when unavailable.
Examples: >>> # Fetch CpG reads from sample data >>> reads = cpg_reads(df[‘chromosome’], df[‘pos’], ‘hg19’, index_base=1) >>> print(reads.head())*
cpg_percent (reads)
*Calculate the percentage of valid CpG sites in a dataset.
Returns the percent of entries equal to ‘CG’ (case-insensitive) in reads. Accepts any iterable/Series; ignores NA/None values.
Args: reads: An iterable (list, Series, etc.) of 2-bp sequences.
Returns: float: Percentage of entries that are ‘CG’ (0-100).
Examples: >>> reads = cpg_reads(df[‘chromosome’], df[‘pos’], ‘hg19’) >>> percent_cg = cpg_percent(reads) >>> print(f”CpG sites: {percent_cg:.1f}%“)*
guess_ref_and_index_base (chromosomes, positions, ref_genomes:Sequence[Union[str,ForwardRef('Path ')]], index_bases:Sequence[int]=(0, 1))
*Automatically identify the most likely reference genome and index base.
This function tries combinations of reference genomes and index bases, calculating the percentage of valid CpG sites for each combination. It returns the combination with the highest CpG percentage as the best guess.
Args: chromosomes: Chromosome identifiers (e.g., [‘chr1’, ‘chr2’]). positions: Genomic positions corresponding to chromosomes. ref_genomes: List of genome names or paths to test (e.g., [‘hg19’, ‘hg38’]). index_bases: List of index bases to test (default: [0, 1]).
Returns: dict: Contains: - best_genome: Most likely genome name/path - best_index_base: Most likely index base (0 or 1) - best_percent: Percentage of CpG sites with best parameters - results: List of dicts with all tried combinations
Examples: >>> result = guess_ref_and_index_base( … df[‘chromosome’], … df[‘pos’], … ref_genomes=[‘hg19’, ‘hg38’] … ) >>> print(f”Best genome: {result[‘best_genome’]} ({result[‘best_percent’]:.1f}% CpG)“)*
Converting coordinates between genome builds (e.g., hg19 → hg38) is essential when working with data from different sources or genome versions.
liftover_positions (chromosomes:Sequence, positions:Sequence, genome_from:Optional[str]=None, genome_to:Optional[str]=None, index_base_from:int=1, index_base_to:int=1, return_df:bool=False)
*Convert genomic coordinates between different genome builds or adjust index bases.
This function performs liftover operations when both genome_from and genome_to are specified, or simple base conversion when both are None. It handles edge cases like unmapped regions, invalid coordinates, and chromosome naming variations (e.g., ‘chr1’ vs ‘1’).
Args: chromosomes: Sequence of chromosome identifiers (e.g., [‘chr1’, ‘chr2’]). positions: Sequence of genomic positions. Must be same length as chromosomes. genome_from: Source genome build (e.g., ‘hg19’). Required for liftover. genome_to: Target genome build (e.g., ‘hg38’). Required for liftover. index_base_from: Input indexing system (0 or 1, default: 1 for 1-based). index_base_to: Output indexing system (0 or 1, default: 1 for 1-based). return_df: If True, return results as a pandas DataFrame; else return tuple of lists.
Returns: Tuple[List, List] or pd.DataFrame: If return_df=False: (new_chromosomes, new_positions) where unmapped positions are None. If return_df=True: DataFrame with detailed input/output columns.
Raises: ValueError: If chromosomes and positions have different lengths, or invalid bases. ImportError: If pandas is required (return_df=True) but not installed.
Examples: >>> # Liftover from hg19 to hg38 (1-based to 1-based) >>> new_chroms, new_pos = liftover_positions( … chromosomes=[‘chr1’, ‘chr1’], … positions=[100001, 200000], … genome_from=‘hg19’, … genome_to=‘hg38’, … index_base_from=1, … index_base_to=1 … ) >>> print(list(zip(new_chroms, new_pos))) >>> >>> # Base conversion only (no liftover) >>> new_chroms, new_pos = liftover_positions( … chromosomes=[‘chr1’], … positions=[100], … index_base_from=0, … index_base_to=1 … )*
# quick invariance checks (smoke tests)
# These won't fail the notebook export but help catch accidental regressions when running interactively.
try:
# 0->0 should match 1->1 adjusted by conversion on both ends.
ch0, p0 = liftover_positions(["chr1"], [100000], "hg19", "hg38", index_base_from=0, index_base_to=0)
ch1, p1 = liftover_positions(["chr1"], [100001], "hg19", "hg38", index_base_from=1, index_base_to=1)
if ch0[0] is not None and ch1[0] is not None:
assert ch0[0] == ch1[0]
assert p0[0] == p1[0]
except Exception:
# Non-fatal in notebook context
pass