# liftover_pos


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Coordinate Processing and Validation

This module provides tools for:

- **Genomic Liftover**: Convert coordinates between different genome
  builds (e.g., hg19 to hg38).
- **CpG Validation**: Fetch sequences at positions and verify they
  contain valid CpG sites.
- **Reference Detection**: Automatically determine the most likely
  reference genome for unknown data.

### Import Dependencies

``` python
# export
import pandas as pd
import re
from bs_cpg.setup import *
from bs_cpg.download_ref import *
from pathlib import Path
import pysam
```

    Downloading sample data to /tmp/bs-cpg-data/.bs-cpg/sample_cpg_df.parquet...

### Working with Sample Data

``` python
# Example: Load sample CpG data
df = read_sample_cpg(["chromosome", "pos"])
print(f"Sample data shape: {df.shape}")
print(df.head())
```

<div>
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">chromosome</th>
<th data-quarto-table-cell-role="th">pos</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">1594976</td>
<td>chr11</td>
<td>30607691</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">41581</td>
<td>chr1</td>
<td>10269003</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2926242</td>
<td>chrX</td>
<td>55066102</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">573570</td>
<td>chr3</td>
<td>169781668</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1930113</td>
<td>chr14</td>
<td>59950571</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2556779</td>
<td>chr19</td>
<td>17438314</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">452764</td>
<td>chr2</td>
<td>241331303</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1554781</td>
<td>chr10</td>
<td>135171502</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">898587</td>
<td>chr6</td>
<td>31869119</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1764855</td>
<td>chr12</td>
<td>70024545</td>
</tr>
</tbody>
</table>

<p>1000 rows × 2 columns</p>
</div>

``` python
# View first few chromosome values
print("Sample chromosomes:")
print(df["chromosome"].to_list()[:5])
```

    ['chr11', 'chr1', 'chrX', 'chr3', 'chr14']

``` python
# import pysam
# fasta = pysam.FastaFile("data/hg38.fa")
# def fetch_seq(row):
#     return fasta.fetch(row["new_chrom"], row["Start"], row["End"])
```

## CpG Site Validation

### Fetching Sequences from Reference Genomes

``` python
# Example: Open a reference genome
# fasta = pysam.FastaFile(str(get_ref_genome("hg19")))
```

    ✅ Final file '/mnt/idms/home/magyary/.bs-cpg/hg19.fa.bgz' already exists.

``` python
# Example: Fetch a 50bp sequence from chr1 (0-based coordinates)
# seq = fasta.fetch(reference="chr1", start=90000, end=90050)
# print(f"Sequence: {seq}")
```

    'tctcttgctgccctggagaccagctgccccacgaaggaaacagagccaac'

------------------------------------------------------------------------

<a
href="https://github.com/magistak/bs-cpg/blob/main/bs_cpg/liftover_ps.py#L14"
target="_blank" style="float:right; font-size:smaller">source</a>

### cpg_reads

>  cpg_reads (chromosomes:Sequence, positions:Sequence,
>                 ref_genome:Path|str|pysam.FastaFile, index_base:int=1)

Fetch the 2-bp sequence starting at each given genomic position from a
reference genome.

This function retrieves candidate CpG sites from a reference genome by
fetching 2-base sequences at the specified positions. It handles
different chromosome naming conventions (e.g., `chr1` vs `1`).

Args:

- `chromosomes`: list-like of chromosome names (e.g., `'chr1'` or `'1'`).
- `positions`: list-like of genomic positions (ints). If `index_base` is
  1, positions are 1-based; if 0, 0-based.
- `ref_genome`: genome name (e.g., `'hg19'`), a path to a bgzipped FASTA,
  or an open `pysam.FastaFile`.
- `index_base`: 0 or 1 (default: 1). Specifies whether input positions
  are 0-based or 1-based.

Returns:

- `pd.Series`: the fetched 2-bp sequences (uppercased), with `None` where
  a sequence is unavailable.

Examples:

    >>> # Fetch CpG reads from sample data
    >>> reads = cpg_reads(df['chromosome'], df['pos'], 'hg19', index_base=1)
    >>> print(reads.head())
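The coordinate handling described above can be tried without a reference file. The following is a minimal sketch, not the library's implementation: `toy_genome`, `normalize_chrom`, and `fetch_2bp` are hypothetical names, and a plain dict of strings stands in for a `pysam.FastaFile`.

```python
from typing import Optional

# Hypothetical stand-in for a reference genome: contig name -> sequence.
toy_genome = {"chr1": "aacgTTCGga"}


def normalize_chrom(name: str, contigs) -> Optional[str]:
    """Return the contig matching `name`, trying both 'chr1' and '1' styles."""
    if name in contigs:
        return name
    alt = name[3:] if name.startswith("chr") else f"chr{name}"
    return alt if alt in contigs else None


def fetch_2bp(chrom: str, pos: int, genome: dict, index_base: int = 1) -> Optional[str]:
    """Fetch the uppercased 2-bp sequence starting at `pos`, or None if unavailable."""
    contig = normalize_chrom(chrom, genome)
    if contig is None:
        return None
    start = pos - index_base  # convert the input position to a 0-based start
    seq = genome[contig][start:start + 2] if start >= 0 else ""
    return seq.upper() if len(seq) == 2 else None


# Positions are 1-based here; "1" resolves to "chr1", and "chr2" is missing.
reads = [fetch_2bp(c, p, toy_genome) for c, p in [("chr1", 3), ("1", 7), ("chr2", 1)]]
print(reads)
```

Subtracting `index_base` first means the rest of the function only ever deals with 0-based starts, which is also the convention `pysam.FastaFile.fetch` uses for its `start` and `end` arguments.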

------------------------------------------------------------------------

<a
href="https://github.com/magistak/bs-cpg/blob/main/bs_cpg/liftover_ps.py#L118"
target="_blank" style="float:right; font-size:smaller">source</a>

### cpg_percent

>  cpg_percent (reads)

Calculate the percentage of valid CpG sites in a dataset.

Returns the percentage of entries in `reads` that equal `'CG'`
(case-insensitive). Accepts any iterable or Series; NA/None values are
ignored.

Args:

- `reads`: an iterable (list, Series, etc.) of 2-bp sequences.

Returns:

- `float`: percentage of entries that are `'CG'` (0-100).

Examples:

    >>> reads = cpg_reads(df['chromosome'], df['pos'], 'hg19')
    >>> percent_cg = cpg_percent(reads)
    >>> print(f"CpG sites: {percent_cg:.1f}%")
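As a rough illustration of the calculation, here is a pure-Python sketch. It is not the library's code: `percent_cg_sketch` is a hypothetical name, and it makes two assumptions, namely that ignored values are left out of the denominator and that an input with no usable entries yields 0.

```python
def percent_cg_sketch(reads) -> float:
    """Percent of non-missing entries equal to 'CG', ignoring case."""
    # Drop None and float NaN (NaN is the only value not equal to itself).
    valid = [r for r in reads if r is not None and r == r]
    if not valid:
        return 0.0  # assumption: no usable entries counts as 0%
    hits = sum(1 for r in valid if str(r).upper() == "CG")
    return 100.0 * hits / len(valid)


example = ["CG", "cg", "TA", None]
print(f"CpG sites: {percent_cg_sketch(example):.1f}%")
```

A value near 100 suggests the positions point at CpG sites in the chosen reference, while a much lower value may indicate the wrong genome build or index base.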

------------------------------------------------------------------------

<a
href="https://github.com/magistak/bs-cpg/blob/main/bs_cpg/liftover_ps.py#L148"
target="_blank" style="float:right; font-size:smaller">source</a>

### guess_ref_and_index_base

>  guess_ref_and_index_base (chromosomes, positions,
>                                ref_genomes:Sequence[Union[str,ForwardRef('Path')]],
>                                index_bases:Sequence[int]=(0, 1))

Automatically identify the most likely reference genome and index base.

This function tries each combination of reference genome and index
base, calculating the percentage of valid CpG sites for each one. It
returns the combination with the highest CpG percentage as the best
guess.

Args:

- `chromosomes`: chromosome identifiers (e.g., `['chr1', 'chr2']`).
- `positions`: genomic positions corresponding to `chromosomes`.
- `ref_genomes`: list of genome names or paths to test (e.g.,
  `['hg19', 'hg38']`).
- `index_bases`: index bases to test (default: `(0, 1)`).

Returns:

- `dict` with the following keys:
  - `best_genome`: most likely genome name/path
  - `best_index_base`: most likely index base (0 or 1)
  - `best_percent`: percentage of CpG sites with the best parameters
  - `results`: list of dicts with all tried combinations

Examples:

    >>> result = guess_ref_and_index_base(
    ...     df['chromosome'],
    ...     df['pos'],
    ...     ref_genomes=['hg19', 'hg38']
    ... )
    >>> print(f"Best genome: {result['best_genome']} ({result['best_percent']:.1f}% CpG)")
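The search over combinations can be sketched on toy data. This is an illustration under stated assumptions rather than the library's implementation: `toy_builds`, `percent_cg_at`, and `guess_sketch` are hypothetical names, the two "builds" are short made-up strings, and the keys inside each `results` entry are invented for the example.

```python
from itertools import product

# Hypothetical toy builds: the CpGs sit at different offsets in each one.
toy_builds = {
    "build_a": {"chr1": "TTCGTTTTCGTT"},
    "build_b": {"chr1": "TTTTTCGTTTTCGTT"},
}


def percent_cg_at(seq_by_chrom, chroms, positions, index_base):
    """Percent of positions whose 2-bp read is 'CG' under the given index base."""
    hits = 0
    for chrom, pos in zip(chroms, positions):
        start = pos - index_base  # 0-based start of the 2-bp window
        if start >= 0 and seq_by_chrom.get(chrom, "")[start:start + 2].upper() == "CG":
            hits += 1
    return 100.0 * hits / len(positions)


def guess_sketch(chroms, positions, builds, index_bases=(0, 1)):
    """Score every (build, index base) pair and return the highest-scoring one."""
    results = [
        {"genome": name, "index_base": base,
         "percent": percent_cg_at(builds[name], chroms, positions, base)}
        for name, base in product(builds, index_bases)
    ]
    best = max(results, key=lambda r: r["percent"])
    return {"best_genome": best["genome"], "best_index_base": best["index_base"],
            "best_percent": best["percent"], "results": results}


result = guess_sketch(["chr1", "chr1"], [3, 9], toy_builds)
print(result["best_genome"], result["best_index_base"], result["best_percent"])
```

Keeping the full `results` list is useful when two combinations score similarly, since a close second place is a sign that the guess should be checked by other means.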

## Genomic Liftover

Converting coordinates between genome builds (e.g., hg19 → hg38) is
essential when working with data from different sources or genome
versions.
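Conceptually, liftover maps a position through blocks of sequence that align between the two builds, and a position outside every block is unmapped. The sketch below illustrates only that idea. `Block`, `toy_blocks`, and `lift_sketch` are hypothetical, the block coordinates are made up, and real chain files also carry strand and scoring information that is ignored here.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    chrom: str      # chromosome in the source build
    start: int      # 0-based start in the source build
    end: int        # exclusive end in the source build
    new_chrom: str  # chromosome in the target build
    new_start: int  # 0-based start in the target build


# Made-up aligned blocks; the gap between 200 and 300 has no counterpart.
toy_blocks = [
    Block("chr1", 100, 200, "chr1", 150),
    Block("chr1", 300, 400, "chr1", 340),
]


def lift_sketch(chrom: str, pos0: int, blocks: List[Block]) -> Optional[Tuple[str, int]]:
    """Map a 0-based position through the first block containing it, else None."""
    for b in blocks:
        if b.chrom == chrom and b.start <= pos0 < b.end:
            return b.new_chrom, b.new_start + (pos0 - b.start)
    return None


print(lift_sketch("chr1", 120, toy_blocks))  # inside the first block
print(lift_sketch("chr1", 250, toy_blocks))  # falls in the unaligned gap
```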

------------------------------------------------------------------------

<a
href="https://github.com/magistak/bs-cpg/blob/main/bs_cpg/liftover_ps.py#L222"
target="_blank" style="float:right; font-size:smaller">source</a>

### liftover_positions

>  liftover_positions (chromosomes:Sequence, positions:Sequence,
>                          genome_from:Optional[str]=None,
>                          genome_to:Optional[str]=None, index_base_from:int=1,
>                          index_base_to:int=1, return_df:bool=False)

Convert genomic coordinates between different genome builds or adjust
index bases.

This function performs a liftover when both `genome_from` and
`genome_to` are specified, or a simple index-base conversion when both
are `None`. It handles edge cases such as unmapped regions, invalid
coordinates, and chromosome naming variations (e.g., `chr1` vs `1`).

Args:

- `chromosomes`: sequence of chromosome identifiers (e.g.,
  `['chr1', 'chr2']`).
- `positions`: sequence of genomic positions. Must be the same length as
  `chromosomes`.
- `genome_from`: source genome build (e.g., `'hg19'`). Required for
  liftover.
- `genome_to`: target genome build (e.g., `'hg38'`). Required for
  liftover.
- `index_base_from`: input indexing system (0 or 1; default: 1 for
  1-based).
- `index_base_to`: output indexing system (0 or 1; default: 1 for
  1-based).
- `return_df`: if `True`, return results as a pandas DataFrame; otherwise
  return a tuple of lists.

Returns:

- `Tuple[List, List]` or `pd.DataFrame`:
  - If `return_df=False`: `(new_chromosomes, new_positions)`, where
    unmapped positions are `None`.
  - If `return_df=True`: a DataFrame with detailed input/output columns.

Raises:

- `ValueError`: if `chromosomes` and `positions` have different lengths,
  or an index base is invalid.
- `ImportError`: if pandas is required (`return_df=True`) but not
  installed.

Examples:

    >>> # Liftover from hg19 to hg38 (1-based to 1-based)
    >>> new_chroms, new_pos = liftover_positions(
    ...     chromosomes=['chr1', 'chr1'],
    ...     positions=[100001, 200000],
    ...     genome_from='hg19',
    ...     genome_to='hg38',
    ...     index_base_from=1,
    ...     index_base_to=1
    ... )
    >>> print(list(zip(new_chroms, new_pos)))
    >>>
    >>> # Base conversion only (no liftover)
    >>> new_chroms, new_pos = liftover_positions(
    ...     chromosomes=['chr1'],
    ...     positions=[100],
    ...     index_base_from=0,
    ...     index_base_to=1
    ... )
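The index-base conversion on its own is a constant shift. The following sketch shows the arithmetic under the assumption that only bases 0 and 1 are valid. `convert_index_base` is a hypothetical helper, not part of the library.

```python
def convert_index_base(positions, index_base_from=1, index_base_to=1):
    """Shift positions from one index base to another, leaving None untouched."""
    for base in (index_base_from, index_base_to):
        if base not in (0, 1):
            raise ValueError(f"index base must be 0 or 1, got {base!r}")
    shift = index_base_to - index_base_from  # +1, 0, or -1
    return [None if p is None else p + shift for p in positions]


# 0-based position 100 is the same base as 1-based position 101.
print(convert_index_base([100], index_base_from=0, index_base_to=1))
```

Because the shift is applied to the input and to the output independently, lifting a base with 0-based input and output should give a result exactly one less than lifting the same base with 1-based input and output.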

``` python
# Quick invariance check (smoke test).
# A failure is reported but does not abort the notebook export.
try:
    # 0-based position 100000 and 1-based position 100001 are the same base,
    # so both calls must land on the same chromosome, and the 0-based result
    # must be exactly one less than the 1-based result.
    ch0, p0 = liftover_positions(["chr1"], [100000], "hg19", "hg38", index_base_from=0, index_base_to=0)
    ch1, p1 = liftover_positions(["chr1"], [100001], "hg19", "hg38", index_base_from=1, index_base_to=1)
    if ch0[0] is not None and ch1[0] is not None:
        assert ch0[0] == ch1[0]
        assert p0[0] + 1 == p1[0]
except Exception as exc:
    # Non-fatal in notebook context, but make the failure visible
    print(f"Smoke test did not pass: {exc!r}")
```
