Add coordinates to genomic features #88

ahwagner · 2018-02-16T23:12:26Z

Currently, 32% of our associations in 0.8 have no genomic coordinates assigned to at least one feature.

These fall into the following categories:

Associations harvested from CGI all lack end coordinates, despite ~50% having start coordinates
Associations describing protein changes and indels (e.g. BRCA2 D806H or KIT W557_K558del or BRAF V487_P492delinsA or KIT S501_A502dup) are >30% of missing features coordinates, and primarily come from OncoKB (1565 associations) and Jax CKB (668 associations). This should be handled by our COSMIC / allele registry lookups, might be a bug.
7.2% of associations harvested from MolecularMatch have no feature_names. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).
Association feature_names of the format <gene> amplification or <gene> loss should have one corresponding feature of representative transcript coordinates.
Gene fusions (feature_names of the format <gene>-<gene> or <gene>:Fusions) should have two corresponding features of representative transcript coordinates, one from each constituent gene.
Bucket mutations (e.g. <gene> mutation, <gene> inact mut) should have one corresponding feature of representative transcript coordinates.
exon mutations (containing a numbered exon in the feature_names) should use the exon coordinates for the provided transcript. If no transcript is provided, use the exon coordinates from the representative transcript.

Following these rules, we go from 68.4% of associations with variant start/end coords to 98.8%.

Many of these items refer to a "representative" transcript. This is annotated in ensembl as a "canonical" transcript.

Tagging as review per discussion with @jgoecks.

@bwalsh would you provide an estimate of the effort needed to make these changes?

I uploaded my WIP analysis workbook to the genie-analysis branch: https://github.com/ohsu-comp-bio/g2p-aggregator/blob/genie-analysis/notebooks/knowledgebase_comparison.ipynb

See the Feature coordinate filtering section for regular expressions that can help in selecting features for annotating with start/end coordinates.

The text was updated successfully, but these errors were encountered:

bwalsh · 2018-02-19T17:39:34Z

@ahwagner @jgoecks @mayfielg - can we discuss these changes during Tuesday's call?

ahwagner · 2018-02-19T19:29:35Z

@bwalsh sounds good.

bwalsh · 2018-02-20T15:59:23Z

Notes:

Associations harvested from CGI all lack end coordinates, despite ~50% having start coordinates

Example:

{
  "name": "ABL1:E255K",
  "links": [
    "http://reg.genome.network/refseq/RS000009",
    "http://myvariant.info/v1/variant/chr9:g.133738363G>A?assembly=hg19",
    "http://reg.genome.network/refseq/RS000033",
    "http://www.ncbi.nlm.nih.gov/clinvar/variation/376090",
    "http://reg.genome.network/refseq/RS000057",
    "http://reg.genome.network/refseq/RS002120",
    "http://myvariant.info/v1/variant/chr9:g.130862976G>A?assembly=hg38",
    "http://www.ncbi.nlm.nih.gov/snp/121913448",
    "http://cancer.sanger.ac.uk/cosmic/mutation/overview?id=12573",
    "http://www.ncbi.nlm.nih.gov/clinvar/?term=362969[alleleid]",
    "http://reg.genome.network/allele/CA16602551"
  ],
  "start": 133738363,
  "synonyms": [
    "NC_000009.10:g.132728184G>A",
    "NG_012034.1:g.154096G>A",
    "NC_000009.11:g.133738363G>A",
    "chr9:g.133738363G>A",
    "NC_000009.12:g.130862976G>A",
    "COSM12573",
    "chr9:g.130862976G>A",
    "CM000671.1:g.133738363G>A",
    "CM000671.2:g.130862976G>A"
  ],
  "biomarker_type": "mutant",
  "referenceName": "GRCh37",
  "geneSymbol": "ABL1",
  "alt": "A",
  "ref": "G",
  "chromosome": "9",
  "description": "ABL1:T315A,F317L,F317V,F317I,F317C,F317I,Y253H,E255K,E255V,F359V,F359C,F359I"
}

Rule:

# pseudo code
if feature.start and feature.ref and not feature.end:
  feature.end = feature.start + len(feature.ref) - 1

Associations describing protein changes and indels (e.g. BRCA2 D806H or KIT W557_K558del or BRAF V487_P492delinsA or KIT S501_A502dup) are >30% of missing features coordinates, and primarily come from OncoKB (1565 associations) and Jax CKB (668 associations). This should be handled by our COSMIC / allele registry lookups, might be a bug.

Allele registry does not have a name lookup.
- see http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/landing
How to verify cosmic lookup?

bwalsh · 2018-02-21T23:36:52Z

from #88

Currently, searches of the type Gene Variant (e.g. BRAF V600E) do a global OR search between terms. Ideally, we could rename the features.name field to the mutation name (without the prepended gene symbol) and by default limit the search to a global AND condition.

Subtasks:

remove prepended gene symbol in features.name, propogate changes to feature_names.keyword.
change search behavior to AND by default

bwalsh · 2018-02-22T22:51:17Z

Association feature_names of the format amplification or loss should have one corresponding feature of representative transcript coordinates.

Using mygene.info and ensembl, we can look up gene location transcriptions fairly easily.
Issues:

which transcription(s) to use?
how does 'amplification' or 'loss' enter into the algorithm?

Example:

From cgi

{u'Targeting': u'', u'Alteration': u'AR:amp', u'Source': u'PMID:23589709;PMID:21859989', u'cDNA': u'', u'Primary Tumor type': u'Prostate adenocarcinoma', u'individual_mutation': u'', u'Drug full name': u'AR inhibitor next gens', u'Association': u'Responsive', u'Drug family': u'[AR inhibitor next gen]', u'Biomarker': u'AR amplification', u'Drug': u'[]', u'Curator': u'RDientsmann', u'gDNA': u'', u'Drug status': u'', u'Gene': u'AR', u'transcript': u'', u'strand': u'', u'info': u'', u'Assay type': u'', u'Alteration type': u'CNA', u'region': u'', u'Evidence level': u'Pre-clinical', u'Primary Tumor acronym': u'PRAD', u'Metastatic Tumor Type': u''}

Feature after current processing:

{u'geneSymbol': u'AR', u'referenceName': u'GRCh37', u'biomarker_type': u'amplification', u'name': u'', u'description': u'AR:amp'}

Experiment:

query mygene.info

$ curl -X GET "http://mygene.info/v3/query?q=AR&fields=HGNC%2Censembl.gene" -H "accept: application/json"
  "max_score": 370.9837,
  "took": 24,
  "total": 937,
  "hits": [
    {
      "HGNC": "644",
      "_id": "367",
      "_score": 370.9837,
      "ensembl": {
        "gene": "ENSG00000169083"
      }
    },
    {
      "_id": "11835",
      "_score": 306.0449,
      "ensembl": {
        "gene": "ENSMUSG00000046532"
      }
    },
....

using the ensembl id, formulate the location of the gene

$ curl -s 'http://grch37.rest.ensembl.org/lookup/id/ENSG00000169083?content-type=application/json;expand=1' | jq '{assembly_name: .assembly_name, chromosome: .seq_region_name, start: .start, end: .end  }'
{
  "assembly_name": "GRCh37",
  "chromosome": "X",
  "start": 66764465,
  "end": 66950461
}

Transcripts are also available
exactly how are these transcripts mapped to 'amplification' ?

$ curl -s 'http://grch37.rest.ensembl.org/lookup/id/ENSG00000169083?content-type=application/json;expand=1' | jq '.Transcript'
[
  {
    "source": "ensembl_havana",
    "object_type": "Transcript",
    "logic_name": "ensembl_havana_transcript",
    "Exon": [
      {
        "object_type": "Exon",
        "version": 7,
        "species": "homo_sapiens",
        "assembly_name": "GRCh37",
        "end": 66766604,
        "seq_region_name": "X",
        "db_type": "core",
        "strand": 1,
        "id": "ENSE00001326500",
        "start": 66764465
      },
      {
        "object_type": "Exon",
        "version": 1,
        "species": "homo_sapiens",
        "assembly_name": "GRCh37",
        "end": 66863249,
        "seq_region_name": "X",
        "db_type": "core",
        "strand": 1,
        "id": "ENSE00003606052",
        "start": 66863098
      },
....

bwalsh · 2018-02-22T23:15:45Z

Associations describing protein changes

A quick test shows this was a legitimate miss.

$ grep ERBB2 cosmic_lookup_table.tsv | grep S783P | wc -l
       0

Reviewing all the 187 unique gene/protein items these are the only ones that have grep 'hits'

(FGFR3, Y375C)       1
(FGFR2, M538I)       2
(FGFR2, C383R)       2
(FGFR3, K652E)       1
(FGFR3, G372C)       1

grep example:

$ grep FGFR2 cosmic_lookup_table.tsv | grep M538I
FGFR2_ENST00000369056	c.1614G>T	p.M538I	37	10	123258070	123258070	C	A	-
FGFR2_ENST00000457416	c.1614G>T	p.M538I	37	10	123258070	123258070	C	A	-

unit test

def test_misses():
    MISSES = ['FGFR3 Y375C', 'FGFR2 M538I', 'FGFR2 C383R', 'FGFR3 K652E', 'FGFR3 G372C']
    for profile in MISSES:
        gene_index, mut_index, biomarkers, fusions = jax._parse_profile(profile)
        if not (len(gene_index) == len(mut_index) == len(biomarkers)):
            print(
                "ERROR: This molecular profile has been parsed incorrectly!")
            print(json.dumps(
                {"molecular_profile": profile},
                indent=2, sort_keys=True))
        print gene_index, mut_index, biomarkers, fusions
        matches = LOOKUP_TABLE.get_entries(gene_index[0], mut_index[0])
        print matches

>>>

['FGFR3'] ['Y375C'] [''] []
[]
['FGFR2'] ['M538I'] [''] []
[]
['FGFR2'] ['C383R'] [''] []
[]
['FGFR3'] ['K652E'] [''] []
[]
['FGFR3'] ['G372C'] [''] []
[]

@mayfielg, @jgoecks - at least for these 4 examples, the identifiers exist, but are not being returned

ahwagner · 2018-02-22T23:22:49Z

On selecting canonical transcripts (for GrCh37), while this is mentioned in the Ensembl glossary, the only canonical transcripts I could find are those listed at UCSC. What follows are instructions for selecting those transcripts:

From Ensembl Gene .gtf file, gene symbol -> Gene transcripts:

zgrep 'gene_name "BRAF"' Homo_sapiens.GRCh37.87.gtf.gz | awk '($3 == "transcript")' | perl -pe 's/.*transcript_id "(\w+)".*/$1/' | tee BRAF.txt
ENST00000496384
ENST00000288602
ENST00000479537
ENST00000497784
ENST00000469930

Find UCSC high-quality transcripts:

zgrep -Ff BRAF.txt knownToEnsembl.txt.gz | cut -f1 | tee BRAF.ucsc.txt
uc003vwc.4

If multiple remaining transcripts, filter on the knownIsoform table (schema, file):

zgrep -Ff BRAF.ucsc.txt knownIsoforms.txt.gz | cut -f2 | tee BRAF.canonical.ucsc.txt
uc003vwc.4

Translate to refseq mRNA for mutalyzer using knownToRefseq table:

zgrep -Ff BRAF.canonical.ucsc.txt knownToRefSeq.txt.gz | cut -f2
NM_004333

From there, a query of NM_004333:p.V600E at mutalyzer returns a genomic coordinate!

The last step is to get the most recent version of the transcript before sending to allele registry--will write a response in this thread with those instructions later.

bwalsh · 2018-02-23T00:57:02Z

Protein lookup via myvariant.info

BRAF V600E example

$ curl -s http://myvariant.info/v1/query?q=BRAF%20V600E  | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt  }'
{
  "referenceName": "GRCh37",
  "chromosome": "7",
  "start": 140534535,
  "end": 140534535,
  "ref": "A",
  "alt": "G"
}

others from our "misses"

$ curl -s http://myvariant.info/v1/query?q=FLT3%20N676D | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt  }'
{
  "referenceName": "GRCh37",
  "chromosome": "13",
  "start": 28644637,
  "end": 28644637,
  "ref": "T",
  "alt": "A"
}

$ curl -s http://myvariant.info/v1/query?q=FGFR3%20G372C  | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt  }'
{
  "referenceName": "GRCh37",
  "chromosome": "4",
  "start": 1803650,
  "end": 1803650,
  "ref": "G",
  "alt": "A"
}

bwalsh · 2018-02-23T01:21:26Z

@ahwagner : can you take a look at the output in this file. It should correspond to the ppm_re 'found' items from your notebook.
ppm_re.json.txt

bwalsh · 2018-02-23T01:34:02Z

quick call to get gene location

$ curl -s -X GET "http://mygene.info/v3/query?q=AR&fields=genomic_pos_hg19" -H "accept: application/json" | jq .hits[0].genomic_pos_hg19
{
  "chr": "X",
  "end": 66950461,
  "start": 66764465,
  "strand": 1
}

bwalsh · 2018-02-27T15:03:16Z

Related issue: #92

ahwagner · 2018-03-12T15:36:33Z

@bwalsh the quick call you mentioned in #88 (comment) is sufficient to grab the coordinates for many of the above sets. I have revised the requirements of these sets to take advantage of this API call, avoiding the difficulty of selecting a representative transcript:

7.2% of associations harvested from MolecularMatch have no feature_names. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).
Association feature_names of the format <gene> amplification or <gene> loss should have one corresponding feature of ~~representative transcript~~ gene coordinates.
Gene fusions (feature_names of the format <gene>-<gene> or <gene>:Fusions) should have two corresponding features of ~~representative transcript~~ gene coordinates, one from each constituent gene.
Bucket mutations (e.g. <gene> mutation, <gene> inact mut) should have one corresponding feature of ~~representative transcript~~ gene coordinates.
exon mutations (containing a numbered exon in the feature_names) should use the exon coordinates for the provided transcript. If no transcript is provided, ~~use the exon coordinates from the representative transcript~~ do not normalize.

bwalsh · 2018-03-12T19:49:35Z

7.2% of associations harvested from MolecularMatch have no feature_names. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).

Breakdown

all molecularmatch 2085
no feature_names associations 155
no feature_names and no feature.start 13
of those 13 associations without start [BRCA, PDGFR, PORCN, VEGF, VEGFR]
of those 5 genes 4 have no hits with genomic_pos_hg19; 1 [PORCN] have multiple answers for genomic_pos_hg19

Next steps:

@ahwagner: can you confirm you are seeing the same?
bwalsh: add code to handle multiple genomic_pos_hg19 answers. Strategy - skip chromosome *_PATCH

BRCA

$ curl 'http://mygene.info/v3/query?q=BRCA&fields=genomic_pos_hg19'
{
  "max_score": 27.281399,
  "took": 6,
  "total": 1,
  "hits": [
    {
      "_id": "3376065",
      "_score": 27.281399
    }
  ]
}

PDGFR

$ curl 'http://mygene.info/v3/query?q=PDGFR&fields=genomic_pos_hg19'
{
  "max_score": 10.167263,
  "took": 8,
  "total": 156,
  "hits": [
    {
      "_id": "100487523",
      "_score": 10.167263
    }

PORCN

$ curl 'http://mygene.info/v3/query?q=PORCN&fields=genomic_pos_hg19'
{
  "max_score": 446.9968,
  "took": 12,
  "total": 193,
  "hits": [
    {
      "_id": "64840",
      "_score": 446.9968,
      "genomic_pos_hg19": [
        {
          "chr": "HG1436_HG1432_PATCH",
          "end": 48380213,
          "start": 48368361,
          "strand": 1
        },
        {
          "chr": "X",
          "end": 48379202,
          "start": 48367350,
          "strand": 1
        }
      ]
    },

VEGF

$ curl 'http://mygene.info/v3/query?q=VEGF&fields=genomic_pos_hg19'
{
  "max_score": 354.96655,
  "took": 14,
  "total": 580,
  "hits": [
    {
      "_id": "104948932",
      "_score": 354.96655
    },

VEGFR

$ curl 'http://mygene.info/v3/query?q=VEGFR&fields=genomic_pos_hg19'
{
  "max_score": 12.964587,
  "took": 11,
  "total": 40,
  "hits": [
    {
      "_id": "443177",
      "_score": 12.964587
    },

grmayfie · 2018-03-20T20:18:49Z

@ahwagner Can we close? Please close if appropriate.

ahwagner added the review label Feb 16, 2018

bwalsh self-assigned this Feb 19, 2018

This was referenced Feb 21, 2018

Feature names inconsistent #86

Closed

UI Search Results #89

Closed

bwalsh mentioned this issue Feb 23, 2018

Feature enricher #91

Merged

bwalsh mentioned this issue Mar 2, 2018

Feature enricher + API #94

Closed

grmayfie mentioned this issue Mar 14, 2018

harmonize location information? #55

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add coordinates to genomic features #88

Add coordinates to genomic features #88

ahwagner commented Feb 16, 2018

bwalsh commented Feb 19, 2018

ahwagner commented Feb 19, 2018

bwalsh commented Feb 20, 2018 •

edited

Loading

bwalsh commented Feb 21, 2018

bwalsh commented Feb 22, 2018

exactly how are these transcripts mapped to 'amplification' ?

bwalsh commented Feb 22, 2018 •

edited

Loading

ahwagner commented Feb 22, 2018

bwalsh commented Feb 23, 2018 •

edited

Loading

bwalsh commented Feb 23, 2018

bwalsh commented Feb 23, 2018

bwalsh commented Feb 27, 2018

ahwagner commented Mar 12, 2018

bwalsh commented Mar 12, 2018 •

edited

Loading

grmayfie commented Mar 20, 2018

Add coordinates to genomic features #88

Add coordinates to genomic features #88

Comments

ahwagner commented Feb 16, 2018

bwalsh commented Feb 19, 2018

ahwagner commented Feb 19, 2018

bwalsh commented Feb 20, 2018 • edited Loading

Notes:

bwalsh commented Feb 21, 2018

bwalsh commented Feb 22, 2018

exactly how are these transcripts mapped to 'amplification' ?

bwalsh commented Feb 22, 2018 • edited Loading

ahwagner commented Feb 22, 2018

bwalsh commented Feb 23, 2018 • edited Loading

BRAF V600E example

others from our "misses"

bwalsh commented Feb 23, 2018

bwalsh commented Feb 23, 2018

quick call to get gene location

bwalsh commented Feb 27, 2018

ahwagner commented Mar 12, 2018

bwalsh commented Mar 12, 2018 • edited Loading

BRCA

PDGFR

PORCN

VEGF

VEGFR

grmayfie commented Mar 20, 2018

bwalsh commented Feb 20, 2018 •

edited

Loading

bwalsh commented Feb 22, 2018 •

edited

Loading

bwalsh commented Feb 23, 2018 •

edited

Loading

bwalsh commented Mar 12, 2018 •

edited

Loading