download_processed

Download processed reduced representation bisulfite sequencing (RRBS) data, which is usually distributed in BED format.

GEO Integration with Resilient Fetching

The module provides a resilient wrapper around geofetch.Geofetcher that includes:

  • Automatic Retries: Uses exponential backoff to handle transient network failures.
  • Smart Caching: Locally caches project metadata to avoid redundant network requests.
  • Clear Interface: Simple methods for listing and downloading datasets.
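The retry and caching behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: the helper names `retry_with_backoff` and `cached_fetch`, the JSON cache layout, and all default values are assumptions made for the example.

```python
import json
import random
import time
from pathlib import Path


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt and retry.

    Hypothetical helper: the wait is capped at max_delay and gets up to
    10% random jitter so that parallel clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))


def cached_fetch(accession, fetch, cache_dir):
    """Return cached metadata for an accession, fetching only on a cache miss.

    Hypothetical helper: assumes fetch(accession) returns JSON-serialisable data.
    """
    path = Path(cache_dir) / f"{accession}.json"
    if path.exists():
        return json.loads(path.read_text())
    data = retry_with_backoff(lambda: fetch(accession))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data))
    return data
```

Passing `sleep` as a parameter keeps the waiting strategy replaceable, which makes the retry logic easy to exercise without real delays.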

Import and Setup


Geofetcher

 Geofetcher (name:str='', metadata_root:str='', metadata_folder:str='',
             just_metadata:bool=False, refresh_metadata:bool=False,
             config_template:str|None=None,
             pipeline_samples:str|None=None,
             pipeline_project:str|None=None, skip:int=0,
             acc_anno:bool=False, use_key_subset:bool=False,
             processed:bool=False, data_source:str='samples',
             filter:str|None=None, filter_size:str|None=None,
             geo_folder:str='.', split_experiments:bool=False,
             bam_folder:str='', fq_folder:str='', sra_folder:str='',
             bam_conversion:bool=False, picard_path:str='',
             input:str|None=None, const_limit_project:int=50,
             const_limit_discard:int=1000, attr_limit_truncate:int=500,
             max_soft_size:str='1GB', discard_soft:bool=False,
             add_dotfile:bool=False, disable_progressbar:bool=False,
             add_convert_modifier:bool=False, opts:object|None=None,
             max_prefetch_size:str|int|None=None, **kwargs:object)

Class for downloading or retrieving projects, metadata, and data from GEO and SRA.

# Example: Create a Geofetcher instance
geo = Geofetcher(just_metadata=True)
acc = 'GSE51239'
print(f"Fetching metadata for project: {acc}")
[INFO] [18:08:41] Metadata folder: /mnt/idms/home/magyary/bs-dna-methyl/nbs/project_name
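For processed RRBS files, the constructor arguments that seem most relevant are `processed`, `data_source`, `filter`, and `just_metadata`, all of which appear in the signature above. The sketch below only assembles such arguments. The helper name `processed_bed_kwargs` is hypothetical, the chosen values are assumptions, and `filter` is assumed here to be a regular expression matched against file names.

```python
import re


def processed_bed_kwargs(metadata_only=True):
    """Build keyword arguments for a Geofetcher aimed at processed BED files.

    Hypothetical helper: every key is taken from the constructor signature,
    but the values are illustrative assumptions.
    """
    return {
        "processed": True,            # ask for processed files, not raw SRA runs
        "data_source": "samples",     # the default shown in the signature
        "filter": r"\.bed(\.gz)?$",   # assumed: regex applied to file names
        "just_metadata": metadata_only,
    }


kwargs = processed_bed_kwargs()
# With geofetch installed, these could be passed on as:
# geo_processed = Geofetcher(**kwargs)
```

Checking the pattern with `re.search` against a few expected file names before starting a large download is a cheap way to catch a filter that is too strict or too loose.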

List Available Projects

Query GEO for available processed files within a project:

# Example: Get project metadata
projects = geo.get_projects(acc)
projects
[INFO] [18:08:41] Metadata folder: /mnt/idms/home/magyary/bs-dna-methyl/nbs/project_name
[INFO] [18:08:41] Trying GSE51239 (not a file) as accession...
[INFO] [18:08:41] Trying GSE51239 (not a file) as accession...
[INFO] [18:08:41] Skipped 0 accessions. Starting now.
[INFO] [18:08:41] Processing accession 1 of 1: 'GSE51239'
[INFO] [18:08:43] Processed 48 samples.
[INFO] [18:08:43] Expanding metadata list...
[INFO] [18:08:43] Found SRA Project accession: SRP030612
[INFO] [18:08:43] Downloading SRP030612 sra metadata
[INFO] [18:08:46] Parsing SRA file to download SRR records
[INFO] [18:08:46] Dry run, no data will be downloaded
[INFO] [18:08:46] Finished processing 1 accession(s)
[INFO] [18:08:46] Cleaning soft files ...
[INFO] [18:08:46] Creating complete project annotation sheets and config file...
{'GSE51239_raw': Project
 48 samples (showing first 20): hsperm-524-90, hsperm-530-90, hsperm-533-90, hsperm-534-90, h8c-1, h8c-2, hblast-1, hblast-2, hblast-3, hblastsingle-2, hblastsingle-5, hicm-1, hicm-2, hte-1, hte-2, hesp0-e1, hesp0-e4, hesp0-e5, hesp1-e1, hesp1-e4
 Sections: name, pep_version, sample_table, experiment_metadata, sample_modifiers, description}
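The returned value is a dictionary keyed by project name, as the output above shows. A small helper can summarise it; this is a sketch that assumes each value exposes a `samples` list whose items carry a `sample_name` attribute, as PEP project objects generally do. The function name `summarize_projects` is hypothetical, and the stand-in objects below only mimic that shape so the example runs without geofetch.

```python
from types import SimpleNamespace


def summarize_projects(projects, preview=3):
    """Map each project key to its sample count and the first few sample names."""
    summary = {}
    for key, project in projects.items():
        names = [sample.sample_name for sample in project.samples]
        summary[key] = {"n_samples": len(names), "first": names[:preview]}
    return summary


# Stand-in for the real return value, using sample names from the output above.
demo_names = ("hsperm-524-90", "hsperm-530-90", "h8c-1", "h8c-2")
demo = {
    "GSE51239_raw": SimpleNamespace(
        samples=[SimpleNamespace(sample_name=name) for name in demo_names]
    )
}
print(summarize_projects(demo))
```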

Download Processed Files

Create a Geofetcher instance configured for downloading processed data:

# projects_files = geof.get_projects(acc, just_metadata=False, ignore_cache=True)
[INFO] [18:09:55] Metadata folder: /mnt/idms/home/magyary/bs-dna-methyl/nbs/project_name
[INFO] [18:09:55] Trying GSE51239 (not a file) as accession...
[INFO] [18:09:55] Trying GSE51239 (not a file) as accession...
[INFO] [18:09:55] Skipped 0 accessions. Starting now.
[INFO] [18:09:55] Processing accession 1 of 1: 'GSE51239'
[INFO] [18:09:57] Processed 48 samples.
[INFO] [18:09:57] Expanding metadata list...
[INFO] [18:09:57] Found SRA Project accession: SRP030612
[INFO] [18:09:57] Downloading SRP030612 sra metadata
[INFO] [18:09:58] Parsing SRA file to download SRR records
[INFO] [18:09:58] Getting SRR: SRR1003182  in (GSE51239)
2025-07-28T16:09:58 prefetch.3.2.1: 1) Resolving 'SRR1003182'...
2025-07-28T16:09:59 prefetch.3.2.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores
[INFO] [18:10:00] Getting SRR: SRR1003183  in (GSE51239)
2025-07-28T16:10:00 prefetch.3.2.1: 1) 'SRR1003182' is found locally 
2025-07-28T16:10:00 prefetch.3.2.1: 1) Resolving 'SRR1003183'...
2025-07-28T16:10:01 prefetch.3.2.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores
2025-07-28T16:10:02 prefetch.3.2.1: 1) Downloading 'SRR1003183'...
2025-07-28T16:10:02 prefetch.3.2.1:  SRA Normalized Format file is being retrieved
2025-07-28T16:10:02 prefetch.3.2.1:  Downloading via HTTPS...
2025-07-28T16:10:02 prefetch.3.2.1:    Continue download of 'SRR1003183' from 154660408

Explore Downloaded Files

Once downloaded, you can explore the sample table to see available processed files:
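One way to do this is to convert the sample table to a list of row dictionaries (for a pandas DataFrame, `sample_table.to_dict("records")`) and keep only the rows that point to BED files. The sketch below assumes the file name is stored under a column called `file`; that column name, the helper name `bed_rows`, and the example rows are all hypothetical, so check them against the real table first.

```python
def bed_rows(rows, name_key="file", suffixes=(".bed", ".bed.gz")):
    """Keep the rows whose file name ends in a BED suffix (case-insensitive).

    `name_key` is an assumed column name; rows without it are skipped.
    """
    return [
        row for row in rows
        if str(row.get(name_key, "")).lower().endswith(suffixes)
    ]


# Hypothetical rows standing in for sample_table.to_dict("records").
example_rows = [
    {"sample_name": "sample_a", "file": "sample_a_methylation.bed.gz"},
    {"sample_name": "sample_b", "file": "sample_b_coverage.txt.gz"},
    {"sample_name": "sample_c", "file": "sample_c_methylation.BED"},
    {"sample_name": "sample_d"},
]

for row in bed_rows(example_rows):
    print(row["sample_name"], row["file"])
```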