Normalization#

Overview#

The Gene Normalizer extracts gene descriptions and related metadata from curated knowledge sources and stores them as gene records. It then maps each stored gene record to a normalized gene concept and produces a combined record for that concept.

Basic information model#

Data resources, such as NCBI Gene, HGNC, and Ensembl, provide descriptions of individual genes, which we refer to as records. Our normalization routines construct mappings between those records and individual normalized concepts. Those concepts are abstract representations of “true” unique entities that exist on the genome. By combining the normalized concept with its associated source records to produce a normalized record, we are able to provide a more comprehensive description of individual genes.

The gene record#

The gene.etl package contains classes for extracting relevant data for each source record. The gene.schemas.BaseGene class demonstrates the kinds of information that the ETL methods attempt to acquire from each source:

class gene.schemas.BaseGene(**data)

Bases: BaseModel

Base gene model. Provide shared resources for records produced by /search and /normalize_unmerged.

aliases: List[Annotated[str]]
associated_with: List[Annotated[str]]
concept_id: Annotated[str]
gene_type: Optional[Annotated[str]]
label: Optional[Annotated[str]]
location_annotations: List[Annotated[str]]
locations: Union[List[SequenceLocation], List[GeneSequenceLocation]]
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

previous_symbols: List[Annotated[str]]
strand: Optional[Strand]
symbol: Annotated[str]
symbol_status: Optional[SymbolStatus]
xrefs: List[Annotated[str]]
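To make the shape of a source record concrete, the sketch below models a handful of the BaseGene fields with a plain dataclass. SourceRecord is a hypothetical stand-in and not the gene.schemas class itself. The values come from the HGNC record for OTX2P1 discussed under "Building normalized concepts and records", and storing its Ensembl reference in xrefs is an assumption about how that field is used.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceRecord:
    """Hypothetical, simplified stand-in for a few BaseGene fields."""

    concept_id: str
    symbol: str
    label: Optional[str] = None
    aliases: List[str] = field(default_factory=list)
    # Assumption: references to records in other sources are kept here.
    xrefs: List[str] = field(default_factory=list)


# The HGNC record for OTX2P1 and its reference to the Ensembl record.
record = SourceRecord(
    concept_id="hgnc:33281",
    symbol="OTX2P1",
    xrefs=["ensembl:ENSG00000234644"],
)
print(record)
```

Optional and list-valued fields default to empty values here, mirroring the Optional and List annotations in the listing above.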

Building normalized concepts and records#

Once all source records have been loaded into the database, normalized concept construction proceeds by grouping source records according to cross-references. Consider the following records referring to genes OTX2P1 and OTX2P2:

  • The NCBI record for OTX2P1, ncbigene:100033409, references HGNC record hgnc:33281

  • The HGNC record for OTX2P1, hgnc:33281, references Ensembl record ensembl:ENSG00000234644

  • The NCBI record for OTX2P2, ncbigene:100419816, references both HGNC record hgnc:54560 and Ensembl record ensembl:ENSG00000227134

  • The HGNC record for OTX2P2, hgnc:54560, references Ensembl record ensembl:ENSG00000227134 and NCBI record ncbigene:100419816

  • The Ensembl record for OTX2P2, ensembl:ENSG00000227134, references HGNC record hgnc:54560

From these cross-references, the Gene Normalizer produces two concept groups, one for each gene, as the following visual shows:

[Figure: graph of the cross-references among the OTX2P1 and OTX2P2 source records]

In practice, gene curation by these sources is quite thorough, and most records for well-understood genes in each source contain cross-references to the corresponding records in the other sources. For normalized concept generation, however, a record needs only one cross-reference, in either direction, joining it to the rest of a concept group in order to be included in that group.
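Grouping by cross-reference in either direction amounts to finding connected components in an undirected graph. The following is a minimal sketch of that idea using the OTX2P1 and OTX2P2 records listed above. The function name group_records and the dictionary layout are illustrative assumptions, not the Gene Normalizer's actual implementation.

```python
from typing import Dict, List, Set

# Cross-references from the OTX2P1/OTX2P2 example (record -> records it references).
XREFS: Dict[str, List[str]] = {
    "ncbigene:100033409": ["hgnc:33281"],
    "hgnc:33281": ["ensembl:ENSG00000234644"],
    "ncbigene:100419816": ["hgnc:54560", "ensembl:ENSG00000227134"],
    "hgnc:54560": ["ensembl:ENSG00000227134", "ncbigene:100419816"],
    "ensembl:ENSG00000227134": ["hgnc:54560"],
}


def group_records(xrefs: Dict[str, List[str]]) -> List[Set[str]]:
    """Group record IDs into connected components of the xref graph."""
    # A reference in either direction links two records, so edges are undirected.
    neighbors: Dict[str, Set[str]] = {}
    for record_id, refs in xrefs.items():
        neighbors.setdefault(record_id, set())
        for ref in refs:
            neighbors[record_id].add(ref)
            neighbors.setdefault(ref, set()).add(record_id)

    groups: List[Set[str]] = []
    seen: Set[str] = set()
    for start in list(neighbors):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        group: Set[str] = set()
        stack = [start]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        groups.append(group)
    return groups


groups = group_records(XREFS)
for group in groups:
    print(sorted(group))
```

Note that ensembl:ENSG00000234644 joins the OTX2P1 group solely because the HGNC record points to it, which is the single-cross-reference case described above.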

After grouping is complete, the concept ID for each normalized concept is taken from the record belonging to the highest-priority source in its group. The SourcePriority class defines this priority ranking:

class gene.schemas.SourcePriority(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: IntEnum

Define priorities for sources when building merged concepts.

ENSEMBL = 2
HGNC = 1
NCBI = 3
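As an illustration of how such a ranking could drive concept ID selection, the sketch below picks one ID from a group of record IDs. It assumes that a lower enum value means higher priority, which is consistent with the HGNC-based ID in the BRAF example on this page but is not stated explicitly. The prefix table and the select_concept_id helper are hypothetical names introduced for this example.

```python
from enum import IntEnum
from typing import Iterable


class SourcePriority(IntEnum):
    """Local copy of the values listed above."""

    HGNC = 1
    ENSEMBL = 2
    NCBI = 3


# Hypothetical mapping from concept ID prefix to source.
PREFIX_TO_SOURCE = {
    "hgnc": SourcePriority.HGNC,
    "ensembl": SourcePriority.ENSEMBL,
    "ncbigene": SourcePriority.NCBI,
}


def select_concept_id(group: Iterable[str]) -> str:
    """Return the ID whose source ranks best in the group.

    Assumption: a lower SourcePriority value means higher priority.
    Ties within one source are broken alphabetically for determinism.
    """
    return min(
        group,
        key=lambda cid: (PREFIX_TO_SOURCE[cid.split(":", 1)[0]], cid),
    )


otx2p1_group = {"ncbigene:100033409", "hgnc:33281", "ensembl:ENSG00000234644"}
print(select_concept_id(otx2p1_group))
```

Under this assumption, a group without an HGNC record would fall back to its Ensembl record, and then to its NCBI record.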

Normalized gene records are constructed by merging known data from all associated gene records. For array-like fields (e.g. aliases, cross-references to entries in other data sources), data from all sources are simply combined. For scalar-like fields (e.g. the gene’s symbol), the value is selected from an individual source record according to the priority assigned to the source.
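The two merge rules can be sketched as follows. Everything here is illustrative: the record dictionaries contain made-up placeholder values, the field lists are a small subset, and falling back to a lower-priority source when a scalar value is missing is an assumption rather than documented behavior.

```python
from typing import Any, Dict, List

# Assumption: lower number means higher priority, as in SourcePriority.
SOURCE_PRIORITY = {"hgnc": 1, "ensembl": 2, "ncbi": 3}
ARRAY_FIELDS = ("aliases", "xrefs")
SCALAR_FIELDS = ("symbol", "label")


def merge_records(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge source records into one dictionary."""
    ordered = sorted(records, key=lambda rec: SOURCE_PRIORITY[rec["source"]])
    merged: Dict[str, Any] = {}

    # Array-like fields: combine values from every source, dropping duplicates.
    for name in ARRAY_FIELDS:
        combined: List[str] = []
        for rec in ordered:
            for item in rec.get(name, []):
                if item not in combined:
                    combined.append(item)
        merged[name] = combined

    # Scalar-like fields: take the first available value in priority order.
    for name in SCALAR_FIELDS:
        merged[name] = next(
            (rec[name] for rec in ordered if rec.get(name) is not None),
            None,
        )
    return merged


# Placeholder records; the values are invented for illustration only.
records = [
    {"source": "ncbi", "symbol": "GENE1", "label": "ncbi label", "aliases": ["A1", "A2"]},
    {"source": "hgnc", "symbol": "GENE1", "label": None, "aliases": ["A2", "A3"]},
]
merged = merge_records(records)
print(merged)
```

Sorting by priority first keeps both rules simple: array values accumulate in priority order, and scalar selection becomes a first-match lookup.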

The normalized record#

Normalized records are structured as Genes per the VRS 2.x schema. Each record gives the normalized gene concept ID along with additional metadata: a label, mappings, aliases, and Extensions (for more complex information such as loci and gene type). For example, the normalized result for the BRAF gene may be described as follows:

Example

{
  "id": "normalize.gene.hgnc:1097",
  "label": "BRAF",
  "extensions": [
    {
      "type": "Extension",
      "name": "symbol_status",
      "value": "approved"
    },
    {
      "type": "Extension",
      "name": "approved_name",
      "value": "B-Raf proto-oncogene, serine/threonine kinase"
    },
    {
      "type": "Extension",
      "name": "strand",
      "value": "-"
    },
    {
      "type": "Extension",
      "name": "ensembl_locations",
      "value": [
        {
          "id": "ga4gh:SL.WJ0hsPzXuK54mQyVysTqUNV5jaCATnRf",
          "type": "SequenceLocation",
          "sequenceReference": {
            "type": "SequenceReference",
            "refgetAccession": "SQ.F-LrLMe1SRpfUZHkQmvkVKFEGaoDeHul"
          },
          "start": 140719326,
          "end": 140924929
        }
      ]
    },
    {
      "type": "Extension",
      "name": "ncbi_locations",
      "value": [
        {
          "id": "ga4gh:SL.uNBZoxhjhohl24VlIut-JxPJAGfJ7EQE",
          "type": "SequenceLocation",
          "sequenceReference": {
            "type": "SequenceReference",
            "refgetAccession": "SQ.F-LrLMe1SRpfUZHkQmvkVKFEGaoDeHul"
          },
          "start": 140713327,
          "end": 140924929
        }
      ]
    },
    {
      "type": "Extension",
      "name": "hgnc_locus_type",
      "value": "gene with protein product"
    },
    {
      "type": "Extension",
      "name": "ncbi_gene_type",
      "value": "protein-coding"
    },
    {
      "type": "Extension",
      "name": "ensembl_biotype",
      "value": "protein_coding"
    }
  ],
  "mappings": [
    {
      "coding": {
        "system": "ncbigene",
        "code": "673"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "ensembl",
        "code": "ENSG00000157764"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "iuphar",
        "code": "1943"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "omim",
        "code": "164757"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "ccds",
        "code": "CCDS94218"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "pubmed",
        "code": "1565476"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "vega",
        "code": "OTTHUMG00000157457"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "ucsc",
        "code": "uc003vwc.5"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "ena.embl",
        "code": "M95712"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "ccds",
        "code": "CCDS87555"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "ccds",
        "code": "CCDS5863"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "cosmic",
        "code": "BRAF"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "pubmed",
        "code": "2284096"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "orphanet",
        "code": "119066"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "refseq",
        "code": "NM_004333"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "uniprot",
        "code": "P15056"
      },
      "relation": "relatedMatch"
    },
    {
      "coding": {
        "system": "ccds",
        "code": "CCDS94219"
      },
      "relation": "relatedMatch"
    }
  ],
  "type": "Gene",
  "aliases": [
    "RAFB1",
    "B-RAF1",
    "BRAF1",
    "BRAF-1",
    "B-raf",
    "NS7"
  ]
}
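A consumer of this record will typically want to look up individual extensions and flatten the mappings. The sketch below does so with the standard library on a trimmed copy of the BRAF example. The helper names get_extension and mapping_curies are hypothetical, and treating "normalize.gene." as a fixed ID prefix is an assumption based on this one example.

```python
import json
from typing import Any, Dict, List, Optional

# Trimmed copy of the BRAF record shown above.
RECORD_JSON = """
{
  "id": "normalize.gene.hgnc:1097",
  "label": "BRAF",
  "type": "Gene",
  "extensions": [
    {"type": "Extension", "name": "symbol_status", "value": "approved"},
    {"type": "Extension", "name": "strand", "value": "-"}
  ],
  "mappings": [
    {"coding": {"system": "ncbigene", "code": "673"}, "relation": "relatedMatch"},
    {"coding": {"system": "ensembl", "code": "ENSG00000157764"}, "relation": "relatedMatch"}
  ],
  "aliases": ["RAFB1", "B-RAF1"]
}
"""


def get_extension(record: Dict[str, Any], name: str) -> Optional[Any]:
    """Return the value of the first extension with the given name, if any."""
    for ext in record.get("extensions", []):
        if ext.get("name") == name:
            return ext.get("value")
    return None


def mapping_curies(record: Dict[str, Any]) -> List[str]:
    """Render each mapping as a 'system:code' string."""
    return [
        f'{m["coding"]["system"]}:{m["coding"]["code"]}'
        for m in record.get("mappings", [])
    ]


record = json.loads(RECORD_JSON)
# Assumption: the concept ID follows the "normalize.gene." prefix.
concept_id = record["id"].split("normalize.gene.", 1)[-1]
print(concept_id, get_extension(record, "strand"), mapping_curies(record))
```

Looking extensions up by name rather than by position avoids depending on the order in which they appear in the list.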