Loading and updating data#
The primary means of managing Gene Normalizer data is via the included command-line interface.
Note
See the ETL API documentation for information on programmatic access to the data loader classes.
gene-normalizer#
gene-normalizer [OPTIONS] COMMAND [ARGS]...
Manage Gene Normalizer data.
Options
- --version#
Show the version and exit.
check-db#
gene-normalizer check-db [OPTIONS]
Perform basic checks on DB health and population. Exits with status code 1 if the DB schema is uninitialized or if critical tables appear to be empty.
$ gene-normalizer check-db
$ echo $?
1 # indicates failure
This command is equivalent to the combination of the database classes’
check_schema_initialized() and check_tables_populated() methods:
>>> from gene.database import create_db
>>> db = create_db()
>>> db.check_schema_initialized() and db.check_tables_populated()
True # DB passes checks
Options
- --db_url <db_url>#
URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g.
"http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/gene_normalizer").
- --silent#
Suppress output to console.
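Because check-db reports failure through its exit status, it can gate later steps in a script. The following is a minimal sketch, assuming the gene-normalizer executable is on the PATH and that a passing check exits with status 0 (the text above only specifies status 1 for failure). The helper name db_is_healthy and its runner parameter are hypothetical, introduced only for illustration.

```python
import subprocess
from typing import Callable, Optional


def db_is_healthy(
    db_url: Optional[str] = None,
    runner: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
) -> bool:
    """Return True if `gene-normalizer check-db` exits with status 0.

    `db_is_healthy` and `runner` are hypothetical names for this sketch;
    `runner` exists only so the subprocess call can be swapped out in tests.
    """
    cmd = ["gene-normalizer", "check-db", "--silent"]
    if db_url is not None:
        cmd += ["--db_url", db_url]
    # check=False: a non-zero exit status is an expected outcome here,
    # so it should not raise CalledProcessError.
    result = runner(cmd, check=False)
    return result.returncode == 0
```

A caller might use the result to decide whether a data load is needed before serving queries.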
dump-database#
gene-normalizer dump-database [OPTIONS]
Dump data from the database into a file.
Options
- -o, --output_directory <output_directory>#
Directory to write the output to.
- --db_url <db_url>#
URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g.
"http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/gene_normalizer").
- --silent#
Suppress output to console.
update#
gene-normalizer update [OPTIONS] [SOURCES]...
Update provided normalizer SOURCES in the gene database.
Valid SOURCES are "HGNC", "NCBI", and "Ensembl" (case-insensitive).
SOURCES are optional, but if not provided, either --all or --normalize must be used.
For example, the following command will update NCBI and HGNC source records:
$ gene-normalizer update HGNC NCBI
To completely reload all source records and construct normalized concepts, use the
--all and --normalize options:
$ gene-normalizer update --all --normalize
The Gene Normalizer will fetch the latest available data from all sources if local
data is out-of-date. To suppress this and force usage of local files only, use the
--use_existing flag:
$ gene-normalizer update --all --use_existing
Options
- --all#
Update records for all sources.
- --normalize#
Create normalized records.
- --db_url <db_url>#
URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g.
"http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/gene_normalizer").
- --aws_instance#
Use cloud DynamoDB instance.
- --use_existing#
Use the most recent locally available source data instead of fetching the latest version.
- --silent#
Suppress output to console.
Arguments
- SOURCES#
Optional argument(s)
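The rules for SOURCES described above (three valid names, matched without regard to case, and a requirement for --all or --normalize when no source is given) can be checked before the command is ever launched. The following is a minimal sketch under those stated rules. The helper build_update_command and the VALID_SOURCES set are hypothetical names for illustration and are not part of the Gene Normalizer API.

```python
from typing import Iterable, List

# Source names as listed in this section, lowercased for comparison.
VALID_SOURCES = {"hgnc", "ncbi", "ensembl"}


def build_update_command(
    sources: Iterable[str] = (),
    update_all: bool = False,
    normalize: bool = False,
    use_existing: bool = False,
) -> List[str]:
    """Build an argument list for `gene-normalizer update` (hypothetical helper)."""
    sources = list(sources)
    # Source names are compared case-insensitively.
    unknown = [s for s in sources if s.lower() not in VALID_SOURCES]
    if unknown:
        raise ValueError(f"Unrecognized source(s): {unknown}")
    # Without explicit sources, --all or --normalize is required.
    if not sources and not (update_all or normalize):
        raise ValueError("Provide SOURCES, or use --all or --normalize")
    cmd = ["gene-normalizer", "update"]
    if update_all:
        cmd.append("--all")
    if normalize:
        cmd.append("--normalize")
    if use_existing:
        cmd.append("--use_existing")
    # Options precede positional SOURCES, matching the usage line.
    return cmd + sources
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems.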
update-from-remote#
gene-normalizer update-from-remote [OPTIONS]
Update data from a remotely hosted DB dump. By default, fetches the latest available dump from the VICC S3 bucket; a specific URL can be provided instead via command line option or the GENE_NORM_REMOTE_DB_URL environment variable.
Options
- --data_url <data_url>#
URL to data dump.
- --db_url <db_url>#
URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g.
"http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/gene_normalizer").
- --silent#
Suppress output to console.
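The two ways of pointing update-from-remote at a specific dump, the --data_url option and the GENE_NORM_REMOTE_DB_URL environment variable, can both be prepared from a script. The following is a minimal sketch. The helper name remote_update_invocation and its use_env_var parameter are hypothetical, and the sketch sets only one of the two mechanisms at a time because this page does not state which takes precedence when both are present.

```python
import os
from typing import Dict, List, Optional, Tuple


def remote_update_invocation(
    data_url: Optional[str] = None,
    use_env_var: bool = False,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for `gene-normalizer update-from-remote`.

    Hypothetical helper: with no `data_url`, the command is left to use
    its default (the latest available dump).
    """
    cmd = ["gene-normalizer", "update-from-remote"]
    env = dict(os.environ)  # copy, so the current process is not modified
    if data_url is not None:
        if use_env_var:
            env["GENE_NORM_REMOTE_DB_URL"] = data_url
        else:
            cmd += ["--data_url", data_url]
    return cmd, env
```

The pair can then be passed along as `subprocess.run(cmd, env=env, check=True)`.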