PostgreSQL

The Gene Normalizer can store and retrieve gene records from a PostgreSQL database. See the “Getting Started” section of the PostgreSQL documentation for basic installation instructions.

Note

See the PostgreSQL handler API reference for information on programmatic access.

Local setup

The Gene Normalizer loads its data into an existing PostgreSQL database, so you must create that database manually during setup. Most PostgreSQL distributions include the createdb utility for this purpose. For example, to create a database named gene_normalizer on a local PostgreSQL server listening on port 5432, connecting as the PostgreSQL user postgres, run the following shell command:

createdb -h localhost -p 5432 -U postgres gene_normalizer

Once the database exists, set the environment variable GENE_NORM_DB_URL to a connection string for it. The following command points to a database named gene_normalizer on a local PostgreSQL instance listening on port 5432, connecting as the user postgres with no password. See the PostgreSQL connection string documentation for more information.

export GENE_NORM_DB_URL=postgres://postgres@localhost:5432/gene_normalizer
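As a quick sanity check before loading data, the connection string can be split into its parts to confirm that it names the intended host, port, user, and database. The following is a minimal sketch using only the Python standard library; the helper name describe_db_url is hypothetical and is not part of the Gene Normalizer, and the fallback URL is simply the example value shown above.

```python
import os
from urllib.parse import urlsplit


def describe_db_url(url: str) -> dict:
    """Split a PostgreSQL connection URL into its components.

    Hypothetical helper for illustration; not part of the Gene Normalizer.
    """
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,  # None when the URL carries no password
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # Fall back to the example value from this page if the variable is unset.
    url = os.environ.get(
        "GENE_NORM_DB_URL",
        "postgres://postgres@localhost:5432/gene_normalizer",
    )
    for key, value in describe_db_url(url).items():
        print(f"{key}: {value}")
```

This only inspects the string; it does not open a connection, so it cannot confirm that the server is reachable or that the database exists.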

Load from remote source

The Gene Normalizer’s PostgreSQL class provides the gene_norm_update_remote shell command to refresh its data directly from a remotely stored SQL dump, instead of acquiring, transforming, and loading source data. This enables data loading on the order of seconds rather than hours. Run gene_norm_update_remote --help for a description of the command.

By default, this command fetches the latest data dump provided by the VICC. To load a different dump, pass its URL with the --data_url option:

gene_norm_update_remote --data_url=https://vicc-normalizers.s3.us-east-2.amazonaws.com/gene_normalization/postgresql/gene_norm_20230322163523.sql.tar.gz
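When choosing among several dump URLs, it can help to know when each dump was produced. The sketch below assumes that the 14 digits in the file name are a YYYYMMDDHHmmss timestamp, matching the naming pattern of gene_norm_dump; this page does not state that explicitly for remote files, so treat it as an assumption. The function name dump_timestamp is hypothetical.

```python
import re
from datetime import datetime


def dump_timestamp(url: str) -> datetime:
    """Extract the timestamp embedded in a dump file name.

    Assumes names of the form gene_norm_<14 digits>.sql, optionally
    followed by an archive suffix such as .tar.gz.
    """
    match = re.search(r"gene_norm_(\d{14})\.sql", url)
    if match is None:
        raise ValueError(f"no timestamp found in {url!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


if __name__ == "__main__":
    url = (
        "https://vicc-normalizers.s3.us-east-2.amazonaws.com/"
        "gene_normalization/postgresql/gene_norm_20230322163523.sql.tar.gz"
    )
    print(dump_timestamp(url).isoformat())
```

The returned value is a naive datetime; the file name alone does not indicate a time zone.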

Create SQL dump from database

The Gene Normalizer’s PostgreSQL class also provides the gene_norm_dump shell command, which writes a SQL dump of the current data to a file. By default, the command creates a file named gene_norm_YYYYMMDDHHmmss.sql in the current directory; use the -o option to specify an alternate location, like so:

gene_norm_dump -o ~/.gene_data/

See gene_norm_dump --help for more information.
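For scripts that need to predict or match the dump file name, the gene_norm_YYYYMMDDHHmmss.sql pattern described above can be reproduced with a standard timestamp format. This is a sketch under stated assumptions: the helper name expected_dump_name is hypothetical, and this page does not say whether the command uses local time or UTC, so the caller supplies the time explicitly.

```python
from datetime import datetime
from pathlib import Path


def expected_dump_name(when: datetime) -> str:
    """Build a file name following the gene_norm_YYYYMMDDHHmmss.sql pattern."""
    return f"gene_norm_{when.strftime('%Y%m%d%H%M%S')}.sql"


if __name__ == "__main__":
    # Illustrative only: the directory matches the -o example above.
    target = Path("~/.gene_data").expanduser() / expected_dump_name(datetime.now())
    print(target)
```

Because the actual timestamp is chosen by gene_norm_dump at run time, a script that needs the real file should list the output directory rather than rely on a predicted name.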