gene.database.database

Provide abstract Database class and relevant tools for database initialization.

class gene.database.database.AbstractDatabase(db_url=None, **db_args)

Define the database interface. This class should never be called directly by a user, but should be used as the parent class for all concrete database implementations.

abstract __init__(db_url=None, **db_args)

Initialize database instance.

Generally, implementing classes should be able to construct a connection from something like a libpq URL. Any additional arguments or DB-specific parameters can be passed as keywords.

Parameters:
  • db_url (Optional[str]) – address/connection description for database

  • db_args – any DB implementation-specific parameters

Raises:

DatabaseInitializationException – if initial setup fails
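As a rough illustration of this contract, the sketch below uses stand-in classes rather than the real `gene` package. `SketchAbstractDatabase`, `SketchInitError`, and `InMemoryDatabase` are hypothetical names, and the `memory://` URL scheme is invented for the example. It only shows that the abstract parent cannot be instantiated directly and that a concrete child accepts a URL plus implementation-specific keywords.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class SketchInitError(Exception):
    """Hypothetical stand-in for DatabaseInitializationException."""


class SketchAbstractDatabase(ABC):
    """Hypothetical stand-in for AbstractDatabase (only two methods shown)."""

    @abstractmethod
    def __init__(self, db_url: Optional[str] = None, **db_args: Any) -> None:
        """Initialize database instance."""

    @abstractmethod
    def close_connection(self) -> None:
        """Perform any manual connection closure procedures if necessary."""


class InMemoryDatabase(SketchAbstractDatabase):
    """Toy concrete implementation backed by a dict."""

    def __init__(self, db_url: Optional[str] = None, **db_args: Any) -> None:
        url = db_url or "memory://default"
        # The "memory://" scheme is invented for this sketch.
        if not url.startswith("memory://"):
            raise SketchInitError(f"unsupported URL: {url}")
        self.url = url
        self.db_args = db_args  # implementation-specific options
        self.records: Dict[str, Dict] = {}

    def close_connection(self) -> None:
        # Nothing to close for an in-memory store.
        pass


db = InMemoryDatabase("memory://demo", read_only=False)
```

Calling `SketchAbstractDatabase()` directly raises `TypeError`, which is the standard `abc` behavior for a class with unimplemented abstract methods.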

abstract add_merged_record(record)

Add merged record to database.

Parameters:

record (Dict) – merged record to add

Return type:

None

abstract add_record(record, src_name)

Add new record to database.

Parameters:
  • record (Dict) – record to upload

  • src_name (SourceName) – name of source for record.

Return type:

None

abstract add_source_metadata(src_name, data)

Add new source metadata entry.

Parameters:
  • src_name (SourceName) – name of source

  • data – metadata to add for the source

Raises:

DatabaseWriteException – if write fails

Return type:

None

abstract check_schema_initialized()

Check if database schema is properly initialized.

Return type:

bool

Returns:

True if DB appears to be fully initialized, False otherwise

abstract check_tables_populated()

Perform rudimentary checks to see if tables are populated.

Emphasis is on rudimentary – if some rogueish element has deleted half of the gene aliases, this method won’t pick it up. It just wants to see if a few critical tables have at least a small number of records.

Return type:

bool

Returns:

True if queries succeed, False if DB appears empty
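A minimal sketch of such a rudimentary check, using the standard-library `sqlite3` module as a stand-in backend. The table names in `CRITICAL_TABLES` and the `min_rows` threshold are assumptions made for illustration, not the package's actual schema or logic.

```python
import sqlite3

# Hypothetical table names; the real schema may differ.
CRITICAL_TABLES = ("gene_concepts", "gene_aliases")


def tables_look_populated(conn: sqlite3.Connection, min_rows: int = 1) -> bool:
    """Return True if every critical table has at least ``min_rows`` rows."""
    for table in CRITICAL_TABLES:
        try:
            # Table names come from the constant above, not from user input.
            rows = conn.execute(
                f"SELECT 1 FROM {table} LIMIT ?", (min_rows,)
            ).fetchall()
        except sqlite3.OperationalError:
            return False  # table is missing entirely
        if len(rows) < min_rows:
            return False
    return True


conn = sqlite3.connect(":memory:")
for table in CRITICAL_TABLES:
    conn.execute(f"CREATE TABLE {table} (concept_id TEXT)")
empty_result = tables_look_populated(conn)  # tables exist but hold no rows

for table in CRITICAL_TABLES:
    conn.execute(f"INSERT INTO {table} VALUES ('hgnc:1097')")
populated_result = tables_look_populated(conn)
conn.close()
```

The check stops at the first table that is missing or too small, so it stays cheap even on large databases.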

abstract close_connection()

Perform any manual connection closure procedures if necessary.

Return type:

None

abstract complete_write_transaction()

Conclude transaction or batch writing if relevant.

Return type:

None

abstract delete_normalized_concepts()

Remove merged records from the database. Use when performing a new update of normalized data.

Raises:
Return type:

None

abstract delete_source(src_name)

Delete all data for a source. Use when updating source data.

Parameters:

src_name (SourceName) – name of source to delete

Raises:
Return type:

None

abstract drop_db()

Initiate total teardown of DB. Useful for quickly resetting the entirety of the data. Requires manual confirmation.

Raises:

DatabaseWriteException – if called in a protected setting with confirmation silenced.

Return type:

None

abstract export_db(export_location)

Dump DB to specified location.

Parameters:

export_location (Path) – path to save DB dump at

Raises:

NotImplementedError – if not supported by DB

Return type:

None

abstract get_all_concept_ids()

Retrieve all available concept IDs for use in generating normalized records.

Return type:

Set[str]

Returns:

Set of concept IDs as strings.

abstract get_all_records(record_type)

Retrieve all source or normalized records. Either return all source records, or all records that qualify as “normalized” (i.e., merged groups + source records that are otherwise ungrouped).

For example,

>>> from gene.database import create_db
>>> from gene.schemas import RecordType
>>> db = create_db()
>>> for record in db.get_all_records(RecordType.MERGER):
...     pass  # do something

Parameters:

record_type (RecordType) – type of result to return

Return type:

Generator[Dict, None, None]

Returns:

Generator that lazily provides records as they are retrieved
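To make the "normalized" selection rule concrete, here is a small in-memory sketch. `SketchRecordType` is a hypothetical stand-in for `gene.schemas.RecordType`, and the `item_type` and `merge_ref` field names, like the sample IDs, are assumptions made up for the example.

```python
from enum import Enum
from typing import Dict, Generator, List


class SketchRecordType(str, Enum):
    """Hypothetical stand-in for gene.schemas.RecordType."""

    IDENTITY = "identity"
    MERGER = "merger"


def iter_records(
    store: List[Dict], record_type: SketchRecordType
) -> Generator[Dict, None, None]:
    """Lazily yield source records or "normalized" records from ``store``."""
    for record in store:
        is_merged_group = record["item_type"] == "merger"
        if record_type == SketchRecordType.IDENTITY:
            if not is_merged_group:
                yield record
        elif is_merged_group or record.get("merge_ref") is None:
            # Merged groups, plus source records that belong to no group.
            yield record


# Made-up sample data: two grouped source records, their merged group,
# and one source record that is not part of any group.
store = [
    {"concept_id": "hgnc:1", "item_type": "identity", "merge_ref": "normalize.gene.hgnc:1"},
    {"concept_id": "ncbigene:1", "item_type": "identity", "merge_ref": "normalize.gene.hgnc:1"},
    {"concept_id": "normalize.gene.hgnc:1", "item_type": "merger"},
    {"concept_id": "ensembl:ENSG0", "item_type": "identity", "merge_ref": None},
]
normalized_ids = [r["concept_id"] for r in iter_records(store, SketchRecordType.MERGER)]
source_ids = [r["concept_id"] for r in iter_records(store, SketchRecordType.IDENTITY)]
```

Because `iter_records` is a generator, nothing is read from `store` until the caller starts iterating.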

abstract get_record_by_id(concept_id, case_sensitive=True, merge=False)

Fetch record corresponding to provided concept ID.

Parameters:
  • concept_id (str) – concept ID for gene record

  • case_sensitive (bool) – if true, performs exact lookup, which may be quicker. Otherwise, performs filter operation, which doesn’t require correct casing.

  • merge (bool) – if true, look for merged record; look for identity record otherwise.

Return type:

Optional[Dict]

Returns:

complete gene record, if match is found; None otherwise
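The difference between the two lookup modes can be sketched with a plain dictionary. The `RECORDS` store and the `lookup` helper are hypothetical; a real backend would issue a keyed query or a filtered scan instead.

```python
from typing import Dict, Optional

# Toy store keyed by concept ID; the contents are illustrative only.
RECORDS: Dict[str, Dict] = {
    "hgnc:1097": {"concept_id": "hgnc:1097", "symbol": "BRAF"},
}


def lookup(concept_id: str, case_sensitive: bool = True) -> Optional[Dict]:
    """Return the matching record, or None if there is no match."""
    if case_sensitive:
        # Exact lookup: a single keyed access.
        return RECORDS.get(concept_id)
    # Filter operation: scan all keys, comparing case-insensitively.
    wanted = concept_id.casefold()
    for key, record in RECORDS.items():
        if key.casefold() == wanted:
            return record
    return None


exact = lookup("hgnc:1097")
wrong_case = lookup("HGNC:1097")  # exact lookup misses on casing
relaxed = lookup("HGNC:1097", case_sensitive=False)
```

The exact path touches one key, while the relaxed path may visit every key, which is why the exact lookup may be quicker.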

abstract get_refs_by_type(search_term, ref_type)

Retrieve concept IDs for records matching the user’s query. Other methods are responsible for actually retrieving full records.

Parameters:
  • search_term (str) – string to match against

  • ref_type (RefType) – type of match to look for.

Return type:

List[str]

Returns:

list of associated concept IDs. Empty if lookup fails.

abstract get_source_metadata(src_name)

Get license, versioning, data lookup, and other information for a source.

Parameters:

src_name (Union[str, SourceName]) – name of the source to get data for

Return type:

Dict

abstract initialize_db()

Perform all necessary parts of database setup. Should be tolerant of existing content – i.e., this method is also responsible for checking whether the DB is already set up.

Raises:

DatabaseInitializationException – if initialization fails

Return type:

None

abstract list_tables()

Return names of tables in database.

Return type:

List[str]

Returns:

Table names in database

abstract load_from_remote(url=None)

Load DB from remote dump. Warning: Deletes all existing data.

Parameters:

url (Optional[str]) – remote location to retrieve gzipped dump file from

Raises:

NotImplementedError – if not supported by DB

Return type:

None

abstract update_merge_ref(concept_id, merge_ref)

Update the merged record reference of an individual record to a new value.

Parameters:
  • concept_id (str) – record to update

  • merge_ref (Any) – new ref value

Raises:

DatabaseWriteException – if attempting to update non-existent record

Return type:

None
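A minimal sketch of the update-or-fail behavior for `update_merge_ref`, using a dict as the store. `SketchWriteError` is a hypothetical stand-in for `DatabaseWriteException`, and the `merge_ref` field name and sample IDs are assumptions.

```python
from typing import Any, Dict


class SketchWriteError(Exception):
    """Hypothetical stand-in for DatabaseWriteException."""


def update_merge_ref(store: Dict[str, Dict], concept_id: str, merge_ref: Any) -> None:
    """Point an existing record at a new merged-record reference."""
    if concept_id not in store:
        # Refuse to create a record implicitly; updates need a target.
        raise SketchWriteError(f"no record found for {concept_id}")
    store[concept_id]["merge_ref"] = merge_ref


store = {"hgnc:1097": {"concept_id": "hgnc:1097", "merge_ref": None}}
update_merge_ref(store, "hgnc:1097", "normalize.gene.hgnc:1097")

try:
    update_merge_ref(store, "hgnc:0", "normalize.gene.hgnc:0")
    missing_raised = False
except SketchWriteError:
    missing_raised = True
```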

class gene.database.database.AwsEnvName(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

AWS environment name that is being used.

DEVELOPMENT = 'Dev'
PRODUCTION = 'Prod'
STAGING = 'Staging'

exception gene.database.database.DatabaseException

Create custom class for handling database exceptions.

exception gene.database.database.DatabaseInitializationException

Create custom exception for errors during DB connection initialization.

exception gene.database.database.DatabaseReadException

Create custom exception for lookup/read errors.

exception gene.database.database.DatabaseWriteException

Create custom exception for write errors.

gene.database.database.confirm_aws_db_use(env_name)

Check to ensure that AWS instance should actually be used.

Return type:

None

gene.database.database.create_db(db_url=None, aws_instance=False)

Database factory method. Checks environment variables and provided parameters and creates a DB instance.

Generally prefers to return a DynamoDB instance, unless all DDB-relevant environment variables are unset and a libpq-compliant URI is assigned to db_url. See the Usage section of the documentation for details.

Some examples:

>>> from gene.database import create_db
>>> default_db = create_db()  # by default, creates DynamoDB connection on port 8000
>>>
>>> postgres_url = "postgresql://postgres@localhost:5432/gene_normalizer"
>>> pg_db = create_db(postgres_url)  # creates Postgres connection at port 5432
>>>
>>> import os
>>> os.environ["GENE_NORM_DB_URL"] = "http://localhost:8001"
>>> local_db = create_db()  # creates DynamoDB connection on port 8001
>>>
>>> os.environ["GENE_NORM_ENV"] = "Prod"
>>> prod_db = create_db()  # creates connection to AWS cloud DynamoDB instance,
>>>                        # overruling `GENE_NORM_DB_URL` variable setting

Precedence is handled for connection settings like so:

  1. if environment variable GENE_NORM_ENV is set to a value, or if the aws_instance method argument is True, try to create a cloud DynamoDB connection

  2. if the db_url method argument is given a non-None value, try to create a DB connection to that address (if it looks like a PostgreSQL URL, create a PostgreSQL connection, but otherwise try DynamoDB)

  3. if the GENE_NORM_DB_URL environment variable is set, try to create a DB connection to that address (if it looks like a PostgreSQL URL, create a PostgreSQL connection, but otherwise try DynamoDB)

  4. otherwise, attempt a DynamoDB connection to the default URL, http://localhost:8000
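The four precedence rules above can be sketched as a pure function. This is not the package's implementation: `resolve_backend` and its return labels are hypothetical, the environment is passed in as a mapping instead of being read from `os.environ`, and the check for whether a URL "looks like a PostgreSQL URL" is assumed to be a simple prefix test.

```python
from typing import Mapping, Optional, Tuple

DEFAULT_URL = "http://localhost:8000"


def resolve_backend(
    env: Mapping[str, str],
    db_url: Optional[str] = None,
    aws_instance: bool = False,
) -> Tuple[str, Optional[str]]:
    """Return a (backend, url) pair following the four precedence rules."""
    # 1. GENE_NORM_ENV or aws_instance selects the cloud DynamoDB instance.
    if "GENE_NORM_ENV" in env or aws_instance:
        return ("dynamodb-cloud", None)
    # 2. An explicit db_url argument comes next; 3. then GENE_NORM_DB_URL.
    url = db_url if db_url is not None else env.get("GENE_NORM_DB_URL")
    if url is not None:
        # Assumption: "looks like a PostgreSQL URL" means this prefix check.
        backend = "postgresql" if url.startswith("postgres") else "dynamodb"
        return (backend, url)
    # 4. Fall back to a local DynamoDB instance at the default URL.
    return ("dynamodb", DEFAULT_URL)


pg_url = "postgresql://postgres@localhost:5432/gene_normalizer"
default_choice = resolve_backend({})
pg_choice = resolve_backend({}, db_url=pg_url)
env_choice = resolve_backend({"GENE_NORM_DB_URL": "http://localhost:8001"})
prod_choice = resolve_backend(
    {"GENE_NORM_ENV": "Prod", "GENE_NORM_DB_URL": "http://localhost:8001"}
)
```

Passing the environment as an argument keeps the sketch testable without mutating process-wide state.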

Parameters:
  • db_url (Optional[str]) – address to database instance

  • aws_instance (bool) – use hosted DynamoDB instance, not local DB

Return type:

AbstractDatabase

Returns:

constructed Database instance