Installation#
The Gene Normalizer can be installed from PyPI. Users who have access to a PostgreSQL database and don’t need to regenerate the Gene Normalizer database can use the quick installation instructions. To use a DynamoDB instance, or to enable local data updates, use the full installation instructions.
Note
The Gene Normalizer defines five optional dependency groups in total:
etl
provides dependencies for regenerating data from sources. It’s necessary for users who don’t intend to rely on existing database dumps.
pg
provides dependencies for connecting to a PostgreSQL database. It’s not necessary for users who are using a DynamoDB backend.
dev
provides development dependencies, such as static code analysis. It’s required for contributing to the Gene Normalizer, but otherwise unnecessary.
test
provides dependencies for running tests. As with dev, it’s mostly relevant for contributors.
docs
provides dependencies for documentation generation. It’s only relevant for contributors.
Quick Installation#
Requirements#
A UNIX-like environment (e.g. macOS, WSL, Ubuntu)
Python 3.8+
A recent version of PostgreSQL (ideally version 11 or newer)
Package installation#
Install the Gene Normalizer and the pg dependency group from PyPI:
pip install "gene-normalizer[pg]"
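To confirm that the package landed in the active environment, the standard library's importlib.metadata can report the installed version. The following is a minimal sketch; the helper name installed_version is an illustrative choice and is not part of the Gene Normalizer itself.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "gene-normalizer" is the PyPI distribution name used in the pip command above.
print(installed_version("gene-normalizer"))
```

A result of None suggests that pip installed into a different environment than the one running this check.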
Database setup#
Create a new PostgreSQL database. For example, using the PostgreSQL createdb utility, and assuming that postgres is a valid user:
createdb -h localhost -p 5432 -U postgres gene_normalizer
Set the environment variable GENE_NORM_DB_URL to a connection string for that database. See the PostgreSQL connection string documentation for more information:
export GENE_NORM_DB_URL=postgres://postgres@localhost:5432/gene_normalizer
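A malformed connection string is an easy mistake to make at this step. As a rough sanity check, the URI can be split into its components with the standard library before it is exported. This is only a sketch: describe_db_url is a hypothetical helper, and it does not attempt to connect to the database.

```python
from urllib.parse import urlsplit


def describe_db_url(url: str) -> dict:
    """Split a PostgreSQL connection URI into its main components."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# The example value from the export command above.
print(describe_db_url("postgres://postgres@localhost:5432/gene_normalizer"))
```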
Load data#
Use the gene_norm_update_remote shell command to load data from the most recent remotely-stored data dump:
gene_norm_update_remote
Start service#
Finally, start an instance of the Gene Normalizer API on port 5000:
uvicorn gene.main:app --port=5000
Point your browser to http://localhost:5000/gene/. You should see the SwaggerUI page demonstrating available REST endpoints.
The beginning of the response to a GET request to http://localhost:5000/gene/normalize?q=braf should look something like this:
{
  "query": "BRAF",
  "warnings": [],
  "match_type": 100,
  "service_meta_": {
    "name": "gene-normalizer",
    "version": "0.3.0-dev1",
    "response_datetime": "2023-09-29 14:53:07.329897",
    "url": "https://github.com/cancervariants/gene-normalization"
  },
  "normalized_id": "hgnc:1097",
  "gene": {
    "id": "normalize.gene.hgnc:1097",
    "label": "BRAF",
    ...
  }
}
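The same request can be issued from Python using only the standard library. The sketch below assumes the service is running locally on port 5000 as described above. The function names are illustrative, and the only response fields it relies on are the ones visible in the sample response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5000/gene"  # assumes the local service started above


def normalize_url(term: str, base_url: str = BASE_URL) -> str:
    """Build the URL for a normalize query."""
    return f"{base_url}/normalize?{urlencode({'q': term})}"


def fetch_normalized(term: str, base_url: str = BASE_URL) -> dict:
    """Send the GET request and decode the JSON body (requires a running service)."""
    with urlopen(normalize_url(term, base_url)) as response:
        return json.load(response)


def summarize(payload: dict) -> tuple:
    """Pull the normalized ID and gene label out of a response."""
    return payload.get("normalized_id"), payload.get("gene", {}).get("label")
```

With the service running, summarize(fetch_normalized("braf")) extracts the two fields. Applied to the sample response shown above, it returns ("hgnc:1097", "BRAF").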
Full Installation#
Requirements#
A UNIX-like environment (e.g. macOS, WSL, Ubuntu) with superuser permissions
Python 3.8+
A recent version of PostgreSQL (ideally version 11 or newer), if using PostgreSQL as the database backend
An available Java runtime (version 8.x or newer), or Docker Desktop, if using DynamoDB as the database backend
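For the DynamoDB requirement, a quick way to see whether either tool is reachable is to look for its executable on the PATH. This is a minimal sketch with a hypothetical helper name. It only checks that a command exists, not that the Java version is 8.x or newer, and it assumes the executables are named java and docker.

```python
import shutil
from typing import Iterable, Optional


def first_available(commands: Iterable[str]) -> Optional[str]:
    """Return the first command found on PATH, or None if none are present."""
    for name in commands:
        if shutil.which(name):
            return name
    return None


# A DynamoDB backend needs either a Java runtime or Docker, per the list above.
print(first_available(["java", "docker"]))
```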
Package installation#
First, install the Gene Normalizer from PyPI:
pip install "gene-normalizer[etl]"
The [etl] option installs dependencies necessary for using the gene.etl package, which performs data loading operations.
Users intending to use PostgreSQL to store source data should also include the pg dependency group:
pip install "gene-normalizer[etl,pg]"
SeqRepo#
Next, acquire SeqRepo sequence and alias data.
sudo mkdir /usr/local/share/seqrepo
sudo chown $USER /usr/local/share/seqrepo
seqrepo pull -i 2021-01-29 # Replace with latest version using `seqrepo list-remote-instances` if outdated
If you encounter an error like the following:
PermissionError: [Errno 13] Permission denied: '/usr/local/share/seqrepo/2021-01-29._fkuefgd' -> '/usr/local/share/seqrepo/2021-01-29'
you may need to finish moving the sequence files manually (replace the XXXXXXX characters in the path below with the temporary suffix created by your instance):
sudo mv /usr/local/share/seqrepo/2021-01-29.XXXXXXX /usr/local/share/seqrepo/2021-01-29
By default, the Gene Normalizer expects SeqRepo data to be located at /usr/local/share/seqrepo/latest. To designate an alternate location, set the SEQREPO_ROOT_DIR environment variable.
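The lookup described above can be reproduced in a few lines to check which directory will be used and whether it exists. This sketch illustrates the documented behavior rather than the Gene Normalizer's own code. The helper name is hypothetical, and treating an empty SEQREPO_ROOT_DIR as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping

DEFAULT_SEQREPO_DIR = "/usr/local/share/seqrepo/latest"


def resolve_seqrepo_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Return SEQREPO_ROOT_DIR if set and non-empty, else the documented default."""
    return Path(env.get("SEQREPO_ROOT_DIR") or DEFAULT_SEQREPO_DIR)


seqrepo_dir = resolve_seqrepo_dir()
print(seqrepo_dir, "exists" if seqrepo_dir.is_dir() else "is missing")
```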
Database setup#
The Gene Normalizer requires a separate database process for data storage and retrieval. See the instructions on database setup and population for the available database options.
By default, the Gene Normalizer will attempt to connect to a DynamoDB instance listening at http://localhost:8000.
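Before loading data, it can help to confirm that something is accepting connections at that address. The following sketch opens a plain TCP connection; is_listening is an illustrative helper, and a successful connection shows only that some process is bound to the port, not that it is DynamoDB.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Default DynamoDB endpoint described above: http://localhost:8000
print(is_listening("localhost", 8000))
```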
Load data#
To load all source data, and then generate normalized records, use the following shell command:
gene_norm_update --update_all --update_merged
This will download the latest available versions of all source data files, extract and transform recognized gene concepts, load them into the database, and construct normalized concept groups. For more specific update commands, see Loading and updating data.
Start service#
Start an instance of the Gene Normalizer API on port 5000:
uvicorn gene.main:app --port=5000
Point your browser to http://localhost:5000/gene/. You should see the SwaggerUI page demonstrating available REST endpoints.