Loading and updating data#

The primary means of managing Gene Normalizer data is via the included command-line interface.

Note

See the ETL API documentation for information on programmatic access to the data loader classes.

gene-normalizer#

gene-normalizer [OPTIONS] COMMAND [ARGS]...

Manage Gene Normalizer data.

Options

--version#: Show the version and exit.

check-db#

gene-normalizer check-db [OPTIONS]

Perform basic checks on DB health and population. Exits with status code 1 if DB schema is uninitialized or if critical tables appear to be empty.

$ gene-normalizer check-db
$ echo $?
1  # indicates failure

This command is equivalent to the combination of the database classes’ check_schema_initialized() and check_tables_populated() methods:

>>> from gene.database import create_db
>>> db = create_db()
>>> db.check_schema_initialized() and db.check_tables_populated()
True  # DB passes checks
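
For scripted health checks, such as a deployment step, the exit status described above can also be read from Python. The following is a minimal sketch rather than part of the Gene Normalizer API: the helper name `db_is_healthy` and its injectable `runner` parameter are hypothetical, and it assumes that the `gene-normalizer` executable is on the PATH and that an exit status of 0 means the checks passed.

```python
import subprocess
from typing import Callable, List, Optional


def db_is_healthy(
    db_url: Optional[str] = None,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Run `gene-normalizer check-db` and report whether it exited with status 0.

    `db_is_healthy` and `runner` are hypothetical names used for illustration.
    """
    cmd: List[str] = ["gene-normalizer", "check-db", "--silent"]
    if db_url is not None:
        # --db_url is one of the documented check-db options
        cmd += ["--db_url", db_url]
    # check=False: a nonzero status is an expected outcome here, not an exception
    result = runner(cmd, check=False)
    return result.returncode == 0
```

Passing a stand-in `runner` lets the helper be exercised without a database or an installed CLI.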

Options

--db_url <db_url>#: URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g. "http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/gene_normalizer").

--silent#: Suppress output to console.

dump-database#

gene-normalizer dump-database [OPTIONS]

Dump data from database into file.

For example, to export a local DynamoDB instance to an existing dynamodb_local_latest directory:

$ gene-normalizer dump-database -o dynamodb_local_latest --db_url http://localhost:8001

Options

-o, --output_directory <output_directory>#: Output location to write to

--db_url <db_url>#: URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g. "http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/gene_normalizer").

--silent#: Suppress output to console.

dump-mappings#

gene-normalizer dump-mappings [OPTIONS]

Produce JSON Lines file dump of concept referents (e.g. name/label, alias, xrefs) and the associated concept.

By default, produces output for all known referents to a normalized ID. The --scope option can be used to constrain this either to all non-merged identity records:

$ gene-normalizer dump-mappings --scope identity

Or to the identity records of a specific source:

$ gene-normalizer dump-mappings --scope ncbi

The first object in the .jsonl file will contain metadata about parameters used to create the document.
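
Because the first line holds metadata rather than a mapping, a consumer of the dump needs to treat it separately. The following is a minimal sketch of one way to do that. The function name `split_mappings_dump` is hypothetical, and it makes no assumption about the fields inside either the metadata object or the mapping records.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple


def split_mappings_dump(
    lines: Iterable[str],
) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Split JSON Lines content into (metadata, mapping records).

    `split_mappings_dump` is a hypothetical helper, not part of the package.
    """
    # Parse every non-blank line as one JSON object
    objects = [json.loads(line) for line in lines if line.strip()]
    if not objects:
        raise ValueError("dump is empty; expected a leading metadata object")
    # The first object is the metadata; everything after it is a mapping
    return objects[0], objects[1:]
```

An open file handle can be passed directly, for example `with open(outfile) as f: metadata, records = split_mappings_dump(f)`, where `outfile` is the path given to `--outfile`. For very large dumps, a streaming variant that yields records one at a time would use less memory.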

Options

--db_url <db_url>#: URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g. "http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/gene_normalizer").

--scope <scope>#

Scope of mappings: either an item type (merged/normalized vs. base source records), or the base records of an individual source

Options: RecordType.IDENTITY | RecordType.MERGER | SourceName.HGNC | SourceName.ENSEMBL | SourceName.NCBI

-o, --outfile <outfile>#: Output location to write to

--protein-coding-only#: Whether to constrain mappings to only include genes annotated as protein-coding

update#

gene-normalizer update [OPTIONS] [SOURCES]...

Update provided normalizer SOURCES in the gene database.

Valid SOURCES are "HGNC", "NCBI", and "Ensembl" (case is irrelevant).

SOURCES are optional, but if not provided, either --all or --normalize must be used.

For example, the following command will update NCBI and HGNC source records:

$ gene-normalizer update HGNC NCBI

To completely reload all source records and construct normalized concepts, use the --all and --normalize options:

$ gene-normalizer update --all --normalize

The Gene Normalizer will fetch the latest available data from all sources if local data is out-of-date. To suppress this and force usage of local files only, use the --use_existing flag:

$ gene-normalizer update --all --use_existing
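
When updates are launched from a script, the argument rules above can be checked before the command is run. The following is a minimal sketch under stated assumptions: `build_update_command` and `VALID_SOURCES` are hypothetical names, the validation mirrors only the rules described in this section, and the CLI itself remains the authority on which combinations are accepted.

```python
from typing import List, Sequence

# Source names listed in this section; the CLI treats case as irrelevant
VALID_SOURCES = {"hgnc", "ncbi", "ensembl"}


def build_update_command(
    sources: Sequence[str] = (),
    update_all: bool = False,
    normalize: bool = False,
    use_existing: bool = False,
) -> List[str]:
    """Build an argument list for `gene-normalizer update` (hypothetical helper)."""
    unknown = [s for s in sources if s.lower() not in VALID_SOURCES]
    if unknown:
        raise ValueError(f"unrecognized sources: {unknown}")
    # SOURCES are optional only when --all or --normalize is given
    if not sources and not (update_all or normalize):
        raise ValueError("provide SOURCES, or use --all or --normalize")
    cmd = ["gene-normalizer", "update"]
    if update_all:
        cmd.append("--all")
    if normalize:
        cmd.append("--normalize")
    if use_existing:
        cmd.append("--use_existing")
    cmd.extend(sources)
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems.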

Options

--all#: Update records for all sources.

--normalize#: Create normalized records.

--db_url <db_url>#: URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g. "http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/gene_normalizer").

--aws_instance#: Use cloud DynamoDB instance.

--use_existing#: Use most recent locally-available source data instead of fetching latest version

--silent#: Suppress output to console.

Arguments

SOURCES#: Optional argument(s)

update-from-remote#

gene-normalizer update-from-remote [OPTIONS]

Update data from a remotely-hosted DB dump. By default, fetches the latest available dump from the VICC S3 bucket; a specific URL can be provided instead via the --data_url option or the GENE_NORM_REMOTE_DB_URL environment variable.
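
The two ways of supplying a dump URL can be sketched from Python as follows. This is an illustration only: `remote_update_invocation` is a hypothetical helper, and this page does not state which of the option and the environment variable takes precedence when both are set, so the sketch sets only one of them.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple


def remote_update_invocation(
    data_url: Optional[str] = None,
    use_env_var: bool = False,
    environ: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for `gene-normalizer update-from-remote`.

    Hypothetical helper: with no `data_url`, the command is left to use
    its default dump location.
    """
    # Copy the environment so the caller's mapping is never mutated
    env = dict(os.environ if environ is None else environ)
    cmd = ["gene-normalizer", "update-from-remote"]
    if data_url is not None:
        if use_env_var:
            # Supply the URL through the documented environment variable
            env["GENE_NORM_REMOTE_DB_URL"] = data_url
        else:
            # Supply the URL through the documented --data_url option
            cmd += ["--data_url", data_url]
    return cmd, env
```

The pair can then be run with `subprocess.run(cmd, env=env, check=True)`.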

Options

--data_url <data_url>#: URL to data dump

--db_url <db_url>#: URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g. "http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/gene_normalizer").

--silent#: Suppress output to console.