Usage

Overview

The Gene Normalizer provides three different search modes:

  • search: for each source, find the record or records that best match the given search string. Returns gene records.

  • normalize: find the normalized concept that best matches the given search string, and merge the data from all associated source records into a single record. Returns a normalized gene object. See Building normalized concepts and records for more information.

  • normalize_unmerged: return each source record associated with the normalized concept that best matches the given search string. Returns gene records.

REST endpoints

Once the HTTP service is running, OpenAPI documentation for the REST endpoints is available under the /genes path (e.g., with default service parameters, at http://localhost:8000/genes). It describes endpoint parameters and response objects and provides some minimal example queries. A live instance is available at https://normalize.cancervariants.org/gene.

The individual endpoints are:

  • /genes/search

  • /genes/normalize

  • /genes/normalize_unmerged
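As a rough illustration, request URLs for these endpoints can be assembled with the Python standard library. This is a minimal sketch, not part of the Gene Normalizer itself: the helper name build_query_url is hypothetical, the base address is the default local one mentioned above, and the query parameter name q is an assumption that should be checked against the OpenAPI documentation.

```python
from urllib.parse import urlencode

# Assumption: default local service address, as given in the text above.
BASE_URL = "http://localhost:8000/genes"
ENDPOINTS = ("search", "normalize", "normalize_unmerged")


def build_query_url(endpoint: str, term: str, base_url: str = BASE_URL) -> str:
    """Build a GET URL for one of the three endpoints (hypothetical helper)."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    # Assumption: the search string is passed as a query parameter named `q`.
    return f"{base_url}/{endpoint}?{urlencode({'q': term})}"


print(build_query_url("normalize", "HER2"))
# prints: http://localhost:8000/genes/normalize?q=HER2
```

The helper only builds the URL; sending the request is left to whichever HTTP client is already in use.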

Internal Python API

Each search mode can be accessed directly within Python using the query API:

>>> from gene.database import create_db
>>> from gene.query import QueryHandler
>>> q = QueryHandler(create_db())
>>> normalized_response = q.normalize('HER2')
>>> normalized_response.match_type
<MatchType.ALIAS: 60>
>>> normalized_response.gene.label
'ERBB2'

The QueryHandler class must receive a database interface instance as its first argument. The most straightforward way to construct one, as demonstrated above, is with the create_db() function. It tries to build a database connection based on the following conditions, which are checked in order:

  1. if the environment variable GENE_NORM_ENV is set, or if the aws_instance argument is True, try to create a cloud DynamoDB connection

  2. if the db_url argument is not None, try to create a DB connection to that address (if it looks like a PostgreSQL URL, create a PostgreSQL connection, but otherwise try DynamoDB)

  3. if the GENE_NORM_DB_URL environment variable is set, try to create a DB connection to that address (if it looks like a PostgreSQL URL, create a PostgreSQL connection, but otherwise try DynamoDB)

  4. otherwise, attempt a DynamoDB connection to the default URL, http://localhost:8000
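The resolution order above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the library's implementation: resolve_backend and its return labels are hypothetical, an empty environment variable is treated as unset, and "looks like a PostgreSQL URL" is assumed to mean a postgres:// or postgresql:// scheme. The function reports the decision without opening any connection.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_DYNAMODB_URL = "http://localhost:8000"


def resolve_backend(
    db_url: Optional[str] = None,
    aws_instance: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (backend label, URL) following the four rules listed above."""
    env = os.environ if env is None else env

    def classify(url: str) -> Tuple[str, Optional[str]]:
        # Assumption: a PostgreSQL URL is recognized by its scheme prefix.
        is_pg = url.startswith(("postgres://", "postgresql://"))
        return ("postgresql" if is_pg else "dynamodb"), url

    # Rule 1: cloud DynamoDB (assumption: empty string counts as unset).
    if env.get("GENE_NORM_ENV") or aws_instance:
        return "dynamodb-cloud", None
    # Rule 2: explicit db_url argument.
    if db_url is not None:
        return classify(db_url)
    # Rule 3: GENE_NORM_DB_URL environment variable.
    if env.get("GENE_NORM_DB_URL"):
        return classify(env["GENE_NORM_DB_URL"])
    # Rule 4: local DynamoDB at the default URL.
    return "dynamodb", DEFAULT_DYNAMODB_URL
```

Passing env explicitly, e.g. resolve_backend(env={}), makes the decision easy to check without touching the real process environment.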

Users who want to declare the connection explicitly can instead instantiate a database class directly, e.g.:

from gene.database.postgresql import PostgresDatabase
from gene.query import QueryHandler
pg_db = PostgresDatabase(
    user="postgres",
    password="matthew_cannon2",
    db_name="gene_normalizer"
)
q = QueryHandler(pg_db)

See the API documentation for the database, DynamoDB, and PostgreSQL modules for more details.

Inputs

Gene symbols and aliases often contain only a handful of characters, so a search term can be ambiguous or can conflict with terms for other genes (see our lab's research on this topic). As described below, the Gene Normalizer will return the “best available” match when several candidates exist, but users are advised to use concept identifiers or current, approved HGNC symbols where possible.

Match types

The best match for a search string is determined by which fields of a gene record it matches. The Gene Normalizer first tries to match a search string against known concept IDs and gene symbols, then against previous or deprecated symbols, then against aliases, and so on. Matches are case-insensitive but must otherwise be exact.

class gene.schemas.MatchType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: IntEnum

Define string constraints for use in Match Type attributes.

ALIAS = 60
ASSOCIATED_WITH = 60
CONCEPT_ID = 100
FUZZY_MATCH = 20
NO_MATCH = 0
PREV_SYMBOL = 80
SYMBOL = 100
XREF = 60

Note

The FUZZY_MATCH Match Type is not currently used by the Gene Normalizer.
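The matching order described in this section can be illustrated with a toy, self-contained sketch. It is not the Gene Normalizer's implementation: the field names, the match_score helper, and the hand-written record are all hypothetical, and only the numeric scores are copied from the MatchType values listed above.

```python
from typing import Dict, List, Optional, Tuple, Union

# (field name, score) pairs, highest priority first.
# Field names are hypothetical; scores mirror the MatchType values above.
FIELD_PRIORITY: List[Tuple[str, int]] = [
    ("concept_id", 100),
    ("symbol", 100),
    ("previous_symbols", 80),
    ("aliases", 60),
    ("xrefs", 60),
    ("associated_with", 60),
]

Record = Dict[str, Union[str, List[str]]]

# Toy record written by hand for illustration; not output from the normalizer.
TOY_RECORD: Record = {
    "concept_id": "hgnc:3430",
    "symbol": "ERBB2",
    "previous_symbols": ["NGL"],
    "aliases": ["HER2", "NEU"],
}


def match_score(record: Record, term: str) -> Tuple[Optional[str], int]:
    """Return the first (field, score) whose value equals term, ignoring case."""
    needle = term.casefold()
    for field, score in FIELD_PRIORITY:
        value = record.get(field, [])
        candidates = [value] if isinstance(value, str) else value
        # Case-insensitive, but otherwise exact, comparison.
        if any(needle == candidate.casefold() for candidate in candidates):
            return field, score
    return None, 0  # corresponds to NO_MATCH


print(match_score(TOY_RECORD, "her2"))  # prints: ('aliases', 60)
```

Because the comparison is exact apart from case, a partial term such as "erbb" falls through every field and yields the NO_MATCH score of 0.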