setup

Set up directories to store data.

Data Path Management

The package uses a centralized configuration system to manage where data is stored locally. Resolving the path in one place keeps every workflow and environment reading from and writing to the same location.

Priority-based Path Resolution

The get_base_data_path() function determines the data storage location using a three-tier priority system:

1. Environment Variable (BS_CPG_DATA): Highest priority, ideal for CI/CD pipelines and automated workflows.
2. Config File (~/.bs-cpg-config.json): Stores the preferred path for returning users.
3. Interactive Prompt: Last resort, asks the user to specify a path and saves it for future use.

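The same call therefore works unchanged in an automated pipeline (environment variable), on an already configured machine (config file), and on a first interactive run (prompt).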
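The three-tier lookup described above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the function name `resolve_base_path`, its parameters, and the JSON key `data_path` are assumptions made for the example. Only the environment variable name and the config file location come from the description above.

```python
import json
import os
from pathlib import Path

ENV_VAR = "BS_CPG_DATA"
CONFIG_FILE = Path.home() / ".bs-cpg-config.json"
CONFIG_KEY = "data_path"  # assumed key name, not confirmed by the package


def resolve_base_path(env=None, config_file=CONFIG_FILE, prompt=input):
    """Hypothetical sketch of a priority-based path lookup."""
    env = os.environ if env is None else env

    # 1. The environment variable wins if it is set and non-empty.
    value = env.get(ENV_VAR)
    if value:
        return Path(value).expanduser()

    # 2. Otherwise fall back to a path saved in the config file.
    config_file = Path(config_file)
    if config_file.exists():
        try:
            stored = json.loads(config_file.read_text()).get(CONFIG_KEY)
        except (json.JSONDecodeError, AttributeError):
            stored = None  # unreadable or unexpected config: ignore it
        if stored:
            return Path(stored).expanduser()

    # 3. Last resort: ask the user, then save the answer for next time.
    answer = prompt("Where should data be stored? ").strip()
    path = Path(answer).expanduser()
    config_file.write_text(json.dumps({CONFIG_KEY: str(path)}))
    return path
```

Passing `env`, `config_file`, and `prompt` as arguments is a design choice for the sketch only: it lets each tier be exercised in isolation without touching the real environment or home directory.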

get_base_data_path

 get_base_data_path ()

Determines the base data path with a clear priority:
1. BS_CPG_DATA environment variable.
2. Path stored in ~/.bs-cpg-config.json.
3. Prompts the user for the path as a last resort.

read_sample_cpg

 read_sample_cpg (columns:list=None, force_download:bool=False)

Downloads and reads a sample CpG Parquet file.

This function fetches a sample dataset from the project’s GitHub repository. It caches the file locally to avoid re-downloading on subsequent calls.

Args:
    columns (list, optional): A list of columns to read from the file. Defaults to None (all columns).
    force_download (bool, optional): If True, forces a re-download of the file even if it exists locally. Defaults to False.

Returns:
    pd.DataFrame: A DataFrame containing the sample CpG data.

Examples:
    >>> # Load the entire sample dataset
    >>> df = read_sample_cpg()
    >>>
    >>> # Load only specific columns
    >>> df_subset = read_sample_cpg(columns=["chromosome", "pos"])
    >>>
    >>> # Force re-download if needed
    >>> df_fresh = read_sample_cpg(force_download=True)
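The caching behaviour that read_sample_cpg describes (download once, reuse the local copy, re-download only on request) can be illustrated with a small standard-library sketch. The helper name `fetch_cached` and its `url` and `cache_path` parameters are hypothetical; the real function's download URL and cache location are not shown here and are not assumed.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_cached(url, cache_path, force_download=False):
    """Hypothetical helper: download `url` to `cache_path` unless a copy exists."""
    cache_path = Path(cache_path)

    # Download only when asked to, or when no local copy is present yet.
    if force_download or not cache_path.exists():
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(cache_path))

    return cache_path
```

In a function like read_sample_cpg, the returned path would then plausibly be handed to `pd.read_parquet(path, columns=columns)`, which is how pandas restricts a Parquet read to selected columns.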

# Example: Get the base data path
base_path = get_base_data_path()
print(f"Data will be stored in: {base_path}")

Path('/mnt/idms/home/magyary/.bs-cpg')