Source ETL API#

Base#

class gene.etl.base.Base(database, host, data_dir, src_data_dir, seqrepo_dir=SEQREPO_ROOT_DIR, *args, **kwargs)[source]#

The ETL base class.

__init__(database, host, data_dir, src_data_dir, seqrepo_dir=SEQREPO_ROOT_DIR, *args, **kwargs)[source]#

Instantiate Base class.

Parameters:
  • database (AbstractDatabase) – database instance

  • host (str) – Hostname of FTP site

  • data_dir (str) – Data directory of FTP site to look at

  • src_data_dir (Path) – Data directory for source

  • seqrepo_dir (Path) – Path to seqrepo directory

get_seqrepo(seqrepo_dir)[source]#

Return SeqRepo instance if seqrepo_dir exists.

Parameters:

seqrepo_dir (Path) – Path to seqrepo directory

Return type:

SeqRepo

Returns:

SeqRepo instance

abstract perform_etl()[source]#

Extract, transform, and load data into the database.

Return type:

List[str]

Returns:

Concept IDs of concepts successfully loaded
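The contract above (an abstract perform_etl() that returns the concept IDs it loaded) can be illustrated with a minimal sketch. It does not import the gene package: SketchBase, InMemorySource, the "example:" ID prefix, and the dict used as a database are all hypothetical stand-ins, and the real Base constructor takes more arguments than shown here.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchBase(ABC):
    """Hypothetical stand-in for the ETL base class (not gene.etl.base.Base)."""

    def __init__(self, database: Dict[str, dict]) -> None:
        # Assumption: a plain dict plays the role of the database instance.
        self.database = database

    @abstractmethod
    def perform_etl(self) -> List[str]:
        """Extract, transform, and load data; return the loaded concept IDs."""


class InMemorySource(SketchBase):
    """Hypothetical source whose 'download' is a hard-coded list of rows."""

    RAW_ROWS = [
        {"id": "1", "symbol": "A1BG"},
        {"id": "2", "symbol": "A2M"},
    ]

    def perform_etl(self) -> List[str]:
        loaded: List[str] = []
        for row in self.RAW_ROWS:  # extract
            concept_id = f"example:{row['id']}"  # transform
            self.database[concept_id] = {"symbol": row["symbol"]}  # load
            loaded.append(concept_id)
        return loaded


db: Dict[str, dict] = {}
loaded_ids = InMemorySource(db).perform_etl()
```

Because perform_etl() is abstract, instantiating SketchBase directly raises TypeError; only subclasses that implement the method can be created.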

NCBI#

class gene.etl.ncbi.NCBI(database, host='ftp.ncbi.nlm.nih.gov', data_dir='gene/DATA/', src_data_dir=APP_ROOT / 'data' / 'ncbi')[source]#

Bases: Base

ETL class for the NCBI source.

__init__(database, host='ftp.ncbi.nlm.nih.gov', data_dir='gene/DATA/', src_data_dir=APP_ROOT / 'data' / 'ncbi')[source]#

Construct the NCBI ETL instance.

Parameters:
  • database (AbstractDatabase) – gene database for adding new data

  • host (str) – FTP host name

  • data_dir (str) – FTP data directory to use

  • src_data_dir (Path) – Data directory for NCBI

perform_etl()[source]#

Extract, transform, and load NCBI data into the database.

Return type:

List[str]

Returns:

Concept IDs of concepts successfully loaded

HGNC#

class gene.etl.hgnc.HGNC(database, host='ftp.ebi.ac.uk', data_dir='pub/databases/genenames/hgnc/json/', src_data_dir=APP_ROOT / 'data' / 'hgnc', fn='hgnc_complete_set.json')[source]#

Bases: Base

ETL the HGNC source into the normalized database.

__init__(database, host='ftp.ebi.ac.uk', data_dir='pub/databases/genenames/hgnc/json/', src_data_dir=APP_ROOT / 'data' / 'hgnc', fn='hgnc_complete_set.json')[source]#

Initialize HGNC ETL class.

Parameters:
  • database (AbstractDatabase) – DynamoDB database

  • host (str) – FTP host name

  • data_dir (str) – FTP data directory to use

  • src_data_dir (Path) – Data directory for HGNC

  • fn (str) – Data file to download

perform_etl(*args, **kwargs)[source]#

Extract, Transform, and Load data into DynamoDB database.

Return type:

List[str]

Returns:

Concept IDs of concepts successfully loaded

Ensembl#

class gene.etl.ensembl.Ensembl(database, host='ftp.ensembl.org', data_dir='pub/current_gff3/homo_sapiens/', src_data_dir=APP_ROOT / 'data' / 'ensembl')[source]#

Bases: Base

ETL the Ensembl source into the normalized database.

__init__(database, host='ftp.ensembl.org', data_dir='pub/current_gff3/homo_sapiens/', src_data_dir=APP_ROOT / 'data' / 'ensembl')[source]#

Initialize Ensembl ETL class.

Parameters:
  • database (AbstractDatabase) – DynamoDB database

  • host (str) – FTP host name

  • data_dir (str) – FTP data directory to use

  • src_data_dir (Path) – Data directory for Ensembl

perform_etl(*args, **kwargs)[source]#

Extract, Transform, and Load data into DynamoDB database.

Return type:

List[str]

Returns:

Concept IDs of concepts successfully loaded

Normalized Records#

class gene.etl.merge.Merge(database)[source]#

Bases: object

Handles record merging.

__init__(database)[source]#

Initialize Merge instance.

Parameters:

database (AbstractDatabase) – Database instance to use for record retrieval and creation.

create_merged_concepts(record_ids)[source]#

Create concept groups, generate merged concept records, and update database.

Parameters:

record_ids (Set[str]) – concept identifiers from which groups should be generated. Should not include any records from excluded sources.

Return type:

None
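Since each source's perform_etl() returns a list of concept IDs and create_merged_concepts() takes a set of them, the two can plausibly be chained as sketched below. This is an assumption about how the pieces fit together, not the package's own driver code; load_and_merge and the two Protocol classes are hypothetical names. The sketch does not filter out excluded sources, because the text above does not say which sources are excluded; that is left to the caller.

```python
from typing import Iterable, List, Protocol, Set


class EtlSource(Protocol):
    """Anything with a perform_etl() method returning loaded concept IDs."""

    def perform_etl(self) -> List[str]: ...


class Merger(Protocol):
    """Anything with a create_merged_concepts() method taking a set of IDs."""

    def create_merged_concepts(self, record_ids: Set[str]) -> None: ...


def load_and_merge(sources: Iterable[EtlSource], merger: Merger) -> Set[str]:
    """Run each source's ETL, then hand the combined IDs to the merger.

    Hypothetical helper: the caller should pass only sources whose records
    are allowed to take part in merging.
    """
    record_ids: Set[str] = set()
    for source in sources:
        record_ids.update(source.perform_etl())
    merger.create_merged_concepts(record_ids)
    return record_ids
```

With the real classes, sources would be instances such as NCBI, HGNC, or Ensembl and merger would be a Merge instance, all constructed with the same database.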

Chromosome Location#

class gene.etl.vrs_locations.chromosome_location.ChromosomeLocation[source]#

The class for GA4GH Chromosome Location.

get_location(location, gene)[source]#

Transform a gene’s location into a Chromosome Location.

Parameters:
  • location (Dict) – A gene’s location.

  • gene (Dict) – A transformed gene record.

Return type:

Optional[Dict]

Returns:

If location is a valid VRS ChromosomeLocation, return a dictionary containing the ChromosomeLocation. Else, return None.

set_interval_range(loc, arm_ix, location)[source]#

Set the location interval range.

Parameters:
  • loc (str) – A gene location

  • arm_ix (int) – The index of the q or p arm for a given location

  • location (Dict) – A gene’s location

Return type:

None
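As a rough illustration of what get_location and set_interval_range describe, the sketch below finds the index of the first p or q arm in a cytogenetic location string and builds a dictionary shaped like a VRS 1.x ChromosomeLocation. The function name, the default species ID, and the input format (for example "11q22.3-q23.1") are assumptions; the real methods take dictionaries and may handle more cases, such as ordering the interval ends.

```python
import re
from typing import Dict, Optional

ARM_PATTERN = re.compile(r"[pq]")


def sketch_chromosome_location(
    cytoband: str, species_id: str = "taxonomy:9606"
) -> Optional[Dict]:
    """Parse a string such as '11q22.3-q23.1' into a ChromosomeLocation-like dict.

    Hypothetical helper; returns None when no chromosome arm can be found.
    """
    match = ARM_PATTERN.search(cytoband)
    if match is None:
        return None  # no p or q arm, so no band interval can be built
    arm_ix = match.start()  # index of the first p or q arm
    chromosome = cytoband[:arm_ix]
    if not chromosome:
        return None  # the string started with an arm and has no chromosome
    # Everything from the arm onward is the band range, optionally "start-end".
    start, _, end = cytoband[arm_ix:].partition("-")
    return {
        "type": "ChromosomeLocation",
        "species_id": species_id,
        "chr": chromosome,
        "interval": {
            "type": "CytobandInterval",
            "start": start,
            "end": end or start,  # a single band is both start and end
        },
    }
```

A location with a single band, such as "19q13.43", yields an interval whose start and end are the same band.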

Sequence Location#

class gene.etl.vrs_locations.sequence_location.SequenceLocation[source]#

The class for GA4GH Sequence Location.

add_location(seqid, gene, params, sr)[source]#

Get a gene’s Sequence Location.

Parameters:
  • seqid (str) – The sequence ID.

  • gene (Feature) – A gene from the source file.

  • params (Dict) – The transformed gene record.

  • sr (SeqRepo) – SeqRepo instance.

Return type:

Dict

Returns:

A dictionary of a GA4GH VRS SequenceLocation.
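A minimal sketch of the dictionary add_location might return, shaped like a VRS 1.x SequenceLocation. It assumes the source feature uses 1-based, inclusive coordinates (as GFF3 does) and converts them to the inter-residue coordinates VRS uses by subtracting one from the start. The function name, the empty-dict fallback, and the placeholder sequence identifier are assumptions for illustration only.

```python
from typing import Dict


def sketch_sequence_location(sequence_id: str, start: int, end: int) -> Dict:
    """Build a SequenceLocation-like dict from 1-based inclusive coordinates.

    Hypothetical helper; returns an empty dict for invalid coordinates.
    """
    if not 0 < start <= end:
        return {}
    return {
        "type": "SequenceLocation",
        "sequence_id": sequence_id,
        "interval": {
            "type": "SequenceInterval",
            # 1-based inclusive start becomes a 0-based inter-residue start.
            "start": {"type": "Number", "value": start - 1},
            # The inclusive end equals the inter-residue end, so it is unchanged.
            "end": {"type": "Number", "value": end},
        },
    }


# "ga4gh:SQ.placeholder" is a made-up identifier, not a real sequence digest.
location = sketch_sequence_location("ga4gh:SQ.placeholder", start=100, end=200)
```

The interval length under inter-residue coordinates is end minus start, which here gives 101 residues, matching the inclusive range 100 to 200.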

get_aliases(sr, seqid)[source]#

Get aliases for a sequence ID.

Parameters:
  • sr (SeqRepo) – seqrepo instance

  • seqid (str) – Sequence ID accession

Return type:

List[str]

Returns:

List of aliases for seqid
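One way an alias lookup like this could be written is sketched below. It assumes the SeqRepo object exposes a translate_alias method that raises KeyError for an unknown identifier, as biocommons.seqrepo does; whether get_aliases uses exactly that call is not stated above. FakeSeqRepo and the alias strings are hypothetical stand-ins so the example runs without a local SeqRepo installation.

```python
from typing import Any, Dict, List


def sketch_get_aliases(sr: Any, seqid: str) -> List[str]:
    """Return aliases for seqid, or an empty list if the ID is unknown.

    Hypothetical helper; assumes sr.translate_alias raises KeyError on a miss.
    """
    try:
        return list(sr.translate_alias(seqid))
    except KeyError:
        return []


class FakeSeqRepo:
    """Stand-in for a SeqRepo instance, backed by a plain dict."""

    def __init__(self, table: Dict[str, List[str]]) -> None:
        self._table = table

    def translate_alias(self, alias: str) -> List[str]:
        return self._table[alias]  # raises KeyError for unknown aliases


# The ga4gh value below is a placeholder, not a real sequence digest.
fake_sr = FakeSeqRepo({"NC_000001.11": ["GRCh38:1", "ga4gh:SQ.placeholder"]})
aliases = sketch_get_aliases(fake_sr, "NC_000001.11")
ga4gh_ids = [alias for alias in aliases if alias.startswith("ga4gh:")]
```

Filtering for the "ga4gh:" prefix, as in the last line, is one way to pick the identifier a SequenceLocation would reference.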