Source ETL API
Base
- class gene.etl.base.Base(database, host, data_dir, src_data_dir, seqrepo_dir=PosixPath('/usr/local/share/seqrepo/latest'), *args, **kwargs)
The ETL base class.
- __init__(database, host, data_dir, src_data_dir, seqrepo_dir=PosixPath('/usr/local/share/seqrepo/latest'), *args, **kwargs)
Instantiate Base class.
- Parameters:
database (AbstractDatabase) – Database instance
host (str) – Hostname of FTP site
data_dir (str) – Data directory of FTP site to look at
src_data_dir (Path) – Data directory for source
seqrepo_dir (Path) – Path to seqrepo directory
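As a rough illustration of the constructor signature above, the sketch below uses a stand-in class that accepts the same documented parameters. `BaseSketch` and its attribute names are hypothetical and are not the library's `Base`; a plain object stands in for an `AbstractDatabase`, and the local `data/ncbi` path is an assumption made for the example.

```python
from pathlib import Path

# Documented default for seqrepo_dir, copied from the signature above.
DEFAULT_SEQREPO_DIR = Path("/usr/local/share/seqrepo/latest")


class BaseSketch:
    """Hypothetical stand-in that mirrors the documented Base parameters."""

    def __init__(self, database, host, data_dir, src_data_dir,
                 seqrepo_dir=DEFAULT_SEQREPO_DIR, *args, **kwargs):
        self.database = database                  # AbstractDatabase in the real API
        self.host = host                          # hostname of the FTP site
        self.data_dir = data_dir                  # directory on the FTP site
        self.src_data_dir = Path(src_data_dir)    # local directory for source data
        self.seqrepo_dir = Path(seqrepo_dir)      # local SeqRepo directory


# A plain object stands in for a real database instance (assumption).
etl = BaseSketch(
    database=object(),
    host="ftp.ncbi.nlm.nih.gov",
    data_dir="gene/DATA/",
    src_data_dir=Path("data") / "ncbi",
)
```

Omitting `seqrepo_dir` leaves the documented default path in place, which is the only parameter of `Base` that has a default.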
NCBI
- class gene.etl.ncbi.NCBI(database, host='ftp.ncbi.nlm.nih.gov', data_dir='gene/DATA/', src_data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/gene-normalizer/envs/0.1.37/lib/python3.10/site-packages/gene/data/ncbi'))
Bases: Base
ETL class for NCBI source
- __init__(database, host='ftp.ncbi.nlm.nih.gov', data_dir='gene/DATA/', src_data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/gene-normalizer/envs/0.1.37/lib/python3.10/site-packages/gene/data/ncbi'))
Construct the NCBI ETL instance.
- Parameters:
database (AbstractDatabase) – Gene database for adding new data
host (str) – FTP host name
data_dir (str) – FTP data directory to use
src_data_dir (Path) – Data directory for NCBI
HGNC
- class gene.etl.hgnc.HGNC(database, host='ftp.ebi.ac.uk', data_dir='pub/databases/genenames/hgnc/json/', src_data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/gene-normalizer/envs/0.1.37/lib/python3.10/site-packages/gene/data/hgnc'), fn='hgnc_complete_set.json')
Bases: Base
ETL the HGNC source into the normalized database.
- __init__(database, host='ftp.ebi.ac.uk', data_dir='pub/databases/genenames/hgnc/json/', src_data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/gene-normalizer/envs/0.1.37/lib/python3.10/site-packages/gene/data/hgnc'), fn='hgnc_complete_set.json')
Initialize HGNC ETL class.
- Parameters:
database (AbstractDatabase) – DynamoDB database
host (str) – FTP host name
data_dir (str) – FTP data directory to use
src_data_dir (Path) – Data directory for HGNC
fn (str) – Data file to download
Ensembl
- class gene.etl.ensembl.Ensembl(database, host='ftp.ensembl.org', data_dir='pub/current_gff3/homo_sapiens/', src_data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/gene-normalizer/envs/0.1.37/lib/python3.10/site-packages/gene/data/ensembl'))
Bases: Base
ETL the Ensembl source into the normalized database.
- __init__(database, host='ftp.ensembl.org', data_dir='pub/current_gff3/homo_sapiens/', src_data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/gene-normalizer/envs/0.1.37/lib/python3.10/site-packages/gene/data/ensembl'))
Initialize Ensembl ETL class.
- Parameters:
database (AbstractDatabase) – DynamoDB database
host (str) – FTP host name
data_dir (str) – FTP data directory to use
src_data_dir (Path) – Data directory for Ensembl
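The three source classes differ mainly in their default `host` and `data_dir` values. The sketch below collects those documented defaults and joins each pair into an FTP URL. `SOURCE_DEFAULTS` and `ftp_url` are hypothetical names used only for illustration; how the library itself connects to the FTP site is not shown here.

```python
# Default (host, data_dir) pairs copied from the class signatures above.
SOURCE_DEFAULTS = {
    "NCBI": ("ftp.ncbi.nlm.nih.gov", "gene/DATA/"),
    "HGNC": ("ftp.ebi.ac.uk", "pub/databases/genenames/hgnc/json/"),
    "Ensembl": ("ftp.ensembl.org", "pub/current_gff3/homo_sapiens/"),
}


def ftp_url(host: str, data_dir: str) -> str:
    """Join an FTP host and a directory into a URL (hypothetical helper)."""
    # Strip any leading slash so the URL never contains a double slash.
    return f"ftp://{host}/{data_dir.lstrip('/')}"


urls = {
    name: ftp_url(host, data_dir)
    for name, (host, data_dir) in SOURCE_DEFAULTS.items()
}

for name, url in urls.items():
    print(f"{name}: {url}")
```

Passing a different `host` or `data_dir` to a source class would change only the remote location; `src_data_dir` controls where downloaded files are kept locally.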
Normalized Records
- class gene.etl.merge.Merge(database)
Bases: object
Handles record merging.
- __init__(database)
Initialize Merge instance.
- Parameters:
database (AbstractDatabase) – Database instance to use for record retrieval and creation.
Chromosome Location
- class gene.etl.vrs_locations.chromosome_location.ChromosomeLocation
The class for GA4GH Chromosome Location.
- get_location(location, gene)
Transform a gene’s location into a Chromosome Location.
- Parameters:
location (Dict) – A gene’s location.
gene (Dict) – A transformed gene record.
- Return type:
Optional[Dict]
- Returns:
If location is a valid VRS ChromosomeLocation, return a dictionary containing the ChromosomeLocation. Else, return None.
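Because `get_location` returns `Optional[Dict]`, callers need to handle the `None` case. The sketch below shows one way to do that. `fake_get_location` is a hypothetical stand-in with the same argument shape; its validity rule and the keys it returns are assumptions for the example, not the library's logic or the full VRS schema.

```python
from typing import Dict, List, Optional


def fake_get_location(location: Dict, gene: Dict) -> Optional[Dict]:
    """Hypothetical stand-in: return a dict if a chromosome is present, else None."""
    chrom = location.get("chr")
    if not chrom:
        return None
    return {"type": "ChromosomeLocation", "chr": chrom}


def collect_locations(raw_locations: List[Dict], gene: Dict) -> List[Dict]:
    """Keep only the locations that could be transformed."""
    kept = []
    for raw in raw_locations:
        result = fake_get_location(raw, gene)
        if result is not None:  # skip invalid locations instead of storing None
            kept.append(result)
    return kept


gene_record = {"symbol": "BRAF"}  # illustrative record, not real source data
locations = collect_locations([{"chr": "7"}, {}], gene_record)
```

Checking `is not None` rather than truthiness keeps the caller correct even if a valid result were ever an empty dictionary.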
Sequence Location
- class gene.etl.vrs_locations.sequence_location.SequenceLocation
The class for GA4GH Sequence Location.
- add_location(seqid, gene, params, sr)
Get a gene’s Sequence Location.
- Parameters:
seqid (str) – The sequence ID.
gene (Feature) – A gene from the source file.
params (Dict) – The transformed gene record.
sr (SeqRepo) – Access to the SeqRepo
- Return type:
Dict
- Returns:
A dictionary of a GA4GH VRS SequenceLocation.
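To make the returned dictionary more concrete, the sketch below builds a SequenceLocation-like dictionary from a gene feature's coordinates. It is an assumption-laden illustration, not the library's implementation: `GeneFeature` and `sketch_sequence_location` are hypothetical, the SeqRepo lookup is skipped (the function takes an already resolved identifier), the dictionary layout is a simplified guess, and the coordinates are made up. The one grounded step is the coordinate shift: GFF3 positions are 1-based and inclusive, while VRS uses interbase coordinates, so the start is decremented by one.

```python
from typing import Dict, NamedTuple


class GeneFeature(NamedTuple):
    """Hypothetical stand-in for a gene feature with GFF3-style coordinates."""
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def sketch_sequence_location(seqid: str, gene: GeneFeature) -> Dict:
    """Build a simplified SequenceLocation-like dict (layout is an assumption)."""
    # Reject coordinates that cannot describe a valid range.
    if not (1 <= gene.start <= gene.end):
        return {}
    return {
        "type": "SequenceLocation",
        "sequence_id": seqid,
        "interval": {
            "type": "SequenceInterval",
            "start": gene.start - 1,  # 1-based inclusive -> interbase
            "end": gene.end,          # inclusive end equals the interbase end
        },
    }


# Placeholder identifier and made-up coordinates, for illustration only.
loc = sketch_sequence_location("ga4gh:SQ.placeholder", GeneFeature(start=101, end=200))
```

With these inputs the interval covers interbase positions 100 to 200, which spans the same 100 bases as the 1-based range 101 to 200.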