Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add coordinates to genomic features #88

Open
7 tasks
ahwagner opened this issue Feb 16, 2018 · 14 comments
Open
7 tasks

Add coordinates to genomic features #88

ahwagner opened this issue Feb 16, 2018 · 14 comments
Assignees
Labels

Comments

@ahwagner
Copy link
Collaborator

Currently, 32% of our associations in 0.8 have no genomic coordinates assigned to at least one feature.

These fall into the following categories:

  • Associations harvested from CGI all lack end coordinates, despite ~50% having start coordinates
  • Associations describing protein changes and indels (e.g. BRCA2 D806H or KIT W557_K558del or BRAF V487_P492delinsA or KIT S501_A502dup) are >30% of missing features coordinates, and primarily come from OncoKB (1565 associations) and Jax CKB (668 associations). This should be handled by our COSMIC / allele registry lookups, might be a bug.
  • 7.2% of associations harvested from MolecularMatch have no feature_names. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).
  • Association feature_names of the format <gene> amplification or <gene> loss should have one corresponding feature of representative transcript coordinates.
  • Gene fusions (feature_names of the format <gene>-<gene> or <gene>:Fusions) should have two corresponding features of representative transcript coordinates, one from each constituent gene.
  • Bucket mutations (e.g. <gene> mutation, <gene> inact mut) should have one corresponding feature of representative transcript coordinates.
  • exon mutations (containing a numbered exon in the feature_names) should use the exon coordinates for the provided transcript. If no transcript is provided, use the exon coordinates from the representative transcript.

Following these rules, we go from 68.4% of associations with variant start/end coords to 98.8%.

Many of these items refer to a "representative" transcript. This is annotated in ensembl as a "canonical" transcript.

Tagging as review per discussion with @jgoecks.

@bwalsh would you provide an estimate of the effort needed to make these changes?

I uploaded my WIP analysis workbook to the genie-analysis branch: https://github.com/ohsu-comp-bio/g2p-aggregator/blob/genie-analysis/notebooks/knowledgebase_comparison.ipynb

See the Feature coordinate filtering section for regular expressions that can help in selecting features for annotating with start/end coordinates.

@bwalsh bwalsh self-assigned this Feb 19, 2018
@bwalsh
Copy link
Member

bwalsh commented Feb 19, 2018

@ahwagner @jgoecks @mayfielg - can we discuss these changes during Tuesday's call?

@ahwagner
Copy link
Collaborator Author

@bwalsh sounds good.

@bwalsh
Copy link
Member

bwalsh commented Feb 20, 2018

Notes:


Associations harvested from CGI all lack end coordinates, despite ~50% having start coordinates

Example:

{
  "name": "ABL1:E255K",
  "links": [
    "http://reg.genome.network/refseq/RS000009",
    "http://myvariant.info/v1/variant/chr9:g.133738363G>A?assembly=hg19",
    "http://reg.genome.network/refseq/RS000033",
    "http://www.ncbi.nlm.nih.gov/clinvar/variation/376090",
    "http://reg.genome.network/refseq/RS000057",
    "http://reg.genome.network/refseq/RS002120",
    "http://myvariant.info/v1/variant/chr9:g.130862976G>A?assembly=hg38",
    "http://www.ncbi.nlm.nih.gov/snp/121913448",
    "http://cancer.sanger.ac.uk/cosmic/mutation/overview?id=12573",
    "http://www.ncbi.nlm.nih.gov/clinvar/?term=362969[alleleid]",
    "http://reg.genome.network/allele/CA16602551"
  ],
  "start": 133738363,
  "synonyms": [
    "NC_000009.10:g.132728184G>A",
    "NG_012034.1:g.154096G>A",
    "NC_000009.11:g.133738363G>A",
    "chr9:g.133738363G>A",
    "NC_000009.12:g.130862976G>A",
    "COSM12573",
    "chr9:g.130862976G>A",
    "CM000671.1:g.133738363G>A",
    "CM000671.2:g.130862976G>A"
  ],
  "biomarker_type": "mutant",
  "referenceName": "GRCh37",
  "geneSymbol": "ABL1",
  "alt": "A",
  "ref": "G",
  "chromosome": "9",
  "description": "ABL1:T315A,F317L,F317V,F317I,F317C,F317I,Y253H,E255K,E255V,F359V,F359C,F359I"
}

Rule:

# pseudo code
if feature.start and feature.ref and not feature.end:
  feature.end = feature.start + len(feature.ref) - 1

Associations describing protein changes and indels (e.g. BRCA2 D806H or KIT W557_K558del or BRAF V487_P492delinsA or KIT S501_A502dup) are >30% of missing features coordinates, and primarily come from OncoKB (1565 associations) and Jax CKB (668 associations). This should be handled by our COSMIC / allele registry lookups, might be a bug.

This was referenced Feb 21, 2018
@bwalsh
Copy link
Member

bwalsh commented Feb 21, 2018

from #88

Currently, searches of the type Gene Variant (e.g. BRAF V600E) do a global OR search between terms. Ideally, we could rename the features.name field to the mutation name (without the prepended gene symbol) and by default limit the search to a global AND condition.

Subtasks:

remove prepended gene symbol in features.name, propogate changes to feature_names.keyword.
change search behavior to AND by default

@bwalsh
Copy link
Member

bwalsh commented Feb 22, 2018

Association feature_names of the format amplification or loss should have one corresponding feature of representative transcript coordinates.

Using mygene.info and ensembl, we can look up gene location transcriptions fairly easily.
Issues:

  • which transcription(s) to use?
  • how does 'amplification' or 'loss' enter into the algorithm?

Example:

From cgi

{u'Targeting': u'', u'Alteration': u'AR:amp', u'Source': u'PMID:23589709;PMID:21859989', u'cDNA': u'', u'Primary Tumor type': u'Prostate adenocarcinoma', u'individual_mutation': u'', u'Drug full name': u'AR inhibitor next gens', u'Association': u'Responsive', u'Drug family': u'[AR inhibitor next gen]', u'Biomarker': u'AR amplification', u'Drug': u'[]', u'Curator': u'RDientsmann', u'gDNA': u'', u'Drug status': u'', u'Gene': u'AR', u'transcript': u'', u'strand': u'', u'info': u'', u'Assay type': u'', u'Alteration type': u'CNA', u'region': u'', u'Evidence level': u'Pre-clinical', u'Primary Tumor acronym': u'PRAD', u'Metastatic Tumor Type': u''}

Feature after current processing:

{u'geneSymbol': u'AR', u'referenceName': u'GRCh37', u'biomarker_type': u'amplification', u'name': u'', u'description': u'AR:amp'}

Experiment:

  • query mygene.info
$ curl -X GET "http://mygene.info/v3/query?q=AR&fields=HGNC%2Censembl.gene" -H "accept: application/json"
  "max_score": 370.9837,
  "took": 24,
  "total": 937,
  "hits": [
    {
      "HGNC": "644",
      "_id": "367",
      "_score": 370.9837,
      "ensembl": {
        "gene": "ENSG00000169083"
      }
    },
    {
      "_id": "11835",
      "_score": 306.0449,
      "ensembl": {
        "gene": "ENSMUSG00000046532"
      }
    },
....
  • using the ensembl id, formulate the location of the gene
$ curl -s 'http://grch37.rest.ensembl.org/lookup/id/ENSG00000169083?content-type=application/json;expand=1' | jq '{assembly_name: .assembly_name, chromosome: .seq_region_name, start: .start, end: .end  }'
{
  "assembly_name": "GRCh37",
  "chromosome": "X",
  "start": 66764465,
  "end": 66950461
}
  • Transcripts are also available

    exactly how are these transcripts mapped to 'amplification' ?

$ curl -s 'http://grch37.rest.ensembl.org/lookup/id/ENSG00000169083?content-type=application/json;expand=1' | jq '.Transcript'
[
  {
    "source": "ensembl_havana",
    "object_type": "Transcript",
    "logic_name": "ensembl_havana_transcript",
    "Exon": [
      {
        "object_type": "Exon",
        "version": 7,
        "species": "homo_sapiens",
        "assembly_name": "GRCh37",
        "end": 66766604,
        "seq_region_name": "X",
        "db_type": "core",
        "strand": 1,
        "id": "ENSE00001326500",
        "start": 66764465
      },
      {
        "object_type": "Exon",
        "version": 1,
        "species": "homo_sapiens",
        "assembly_name": "GRCh37",
        "end": 66863249,
        "seq_region_name": "X",
        "db_type": "core",
        "strand": 1,
        "id": "ENSE00003606052",
        "start": 66863098
      },
....

@bwalsh
Copy link
Member

bwalsh commented Feb 22, 2018

Associations describing protein changes

image

A quick test shows this was a legitimate miss.

$ grep ERBB2 cosmic_lookup_table.tsv | grep S783P | wc -l
       0

Reviewing all the 187 unique gene/protein items these are the only ones that have grep 'hits'

(FGFR3, Y375C)       1
(FGFR2, M538I)       2
(FGFR2, C383R)       2
(FGFR3, K652E)       1
(FGFR3, G372C)       1

grep example:

$ grep FGFR2 cosmic_lookup_table.tsv | grep M538I
FGFR2_ENST00000369056	c.1614G>T	p.M538I	37	10	123258070	123258070	C	A	-
FGFR2_ENST00000457416	c.1614G>T	p.M538I	37	10	123258070	123258070	C	A	-

unit test

def test_misses():
    MISSES = ['FGFR3 Y375C', 'FGFR2 M538I', 'FGFR2 C383R', 'FGFR3 K652E', 'FGFR3 G372C']
    for profile in MISSES:
        gene_index, mut_index, biomarkers, fusions = jax._parse_profile(profile)
        if not (len(gene_index) == len(mut_index) == len(biomarkers)):
            print(
                "ERROR: This molecular profile has been parsed incorrectly!")
            print(json.dumps(
                {"molecular_profile": profile},
                indent=2, sort_keys=True))
        print gene_index, mut_index, biomarkers, fusions
        matches = LOOKUP_TABLE.get_entries(gene_index[0], mut_index[0])
        print matches

>>>

['FGFR3'] ['Y375C'] [''] []
[]
['FGFR2'] ['M538I'] [''] []
[]
['FGFR2'] ['C383R'] [''] []
[]
['FGFR3'] ['K652E'] [''] []
[]
['FGFR3'] ['G372C'] [''] []
[]

@mayfielg, @jgoecks - at least for these 4 examples, the identifiers exist, but are not being returned

@ahwagner
Copy link
Collaborator Author

On selecting canonical transcripts (for GrCh37), while this is mentioned in the Ensembl glossary, the only canonical transcripts I could find are those listed at UCSC. What follows are instructions for selecting those transcripts:

  1. From Ensembl Gene .gtf file, gene symbol -> Gene transcripts:
zgrep 'gene_name "BRAF"' Homo_sapiens.GRCh37.87.gtf.gz | awk '($3 == "transcript")' | perl -pe 's/.*transcript_id "(\w+)".*/$1/' | tee BRAF.txt
ENST00000496384
ENST00000288602
ENST00000479537
ENST00000497784
ENST00000469930
  1. Find UCSC high-quality transcripts:
zgrep -Ff BRAF.txt knownToEnsembl.txt.gz | cut -f1 | tee BRAF.ucsc.txt
uc003vwc.4
  1. If multiple remaining transcripts, filter on the knownIsoform table (schema, file):
zgrep -Ff BRAF.ucsc.txt knownIsoforms.txt.gz | cut -f2 | tee BRAF.canonical.ucsc.txt
uc003vwc.4
  1. Translate to refseq mRNA for mutalyzer using knownToRefseq table:
zgrep -Ff BRAF.canonical.ucsc.txt knownToRefSeq.txt.gz | cut -f2
NM_004333

From there, a query of NM_004333:p.V600E at mutalyzer returns a genomic coordinate!

The last step is to get the most recent version of the transcript before sending to allele registry--will write a response in this thread with those instructions later.

@bwalsh
Copy link
Member

bwalsh commented Feb 23, 2018

Protein lookup via myvariant.info

BRAF V600E example

$ curl -s http://myvariant.info/v1/query?q=BRAF%20V600E  | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt  }'
{
  "referenceName": "GRCh37",
  "chromosome": "7",
  "start": 140534535,
  "end": 140534535,
  "ref": "A",
  "alt": "G"
}

others from our "misses"

$ curl -s http://myvariant.info/v1/query?q=FLT3%20N676D | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt  }'
{
  "referenceName": "GRCh37",
  "chromosome": "13",
  "start": 28644637,
  "end": 28644637,
  "ref": "T",
  "alt": "A"
}
$ curl -s http://myvariant.info/v1/query?q=FGFR3%20G372C  | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt  }'
{
  "referenceName": "GRCh37",
  "chromosome": "4",
  "start": 1803650,
  "end": 1803650,
  "ref": "G",
  "alt": "A"
}

@bwalsh
Copy link
Member

bwalsh commented Feb 23, 2018

@ahwagner : can you take a look at the output in this file. It should correspond to the ppm_re 'found' items from your notebook.
ppm_re.json.txt

@bwalsh
Copy link
Member

bwalsh commented Feb 23, 2018

quick call to get gene location

$ curl -s -X GET "http://mygene.info/v3/query?q=AR&fields=genomic_pos_hg19" -H "accept: application/json" | jq .hits[0].genomic_pos_hg19
{
  "chr": "X",
  "end": 66950461,
  "start": 66764465,
  "strand": 1
}

@bwalsh bwalsh mentioned this issue Feb 23, 2018
@bwalsh
Copy link
Member

bwalsh commented Feb 27, 2018

Related issue: #92

@ahwagner
Copy link
Collaborator Author

@bwalsh the quick call you mentioned in #88 (comment) is sufficient to grab the coordinates for many of the above sets. I have revised the requirements of these sets to take advantage of this API call, avoiding the difficulty of selecting a representative transcript:

  • 7.2% of associations harvested from MolecularMatch have no feature_names. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).
  • Association feature_names of the format <gene> amplification or <gene> loss should have one corresponding feature of representative transcript gene coordinates.
  • Gene fusions (feature_names of the format <gene>-<gene> or <gene>:Fusions) should have two corresponding features of representative transcript gene coordinates, one from each constituent gene.
  • Bucket mutations (e.g. <gene> mutation, <gene> inact mut) should have one corresponding feature of representative transcript gene coordinates.
  • exon mutations (containing a numbered exon in the feature_names) should use the exon coordinates for the provided transcript. If no transcript is provided, use the exon coordinates from the representative transcript do not normalize.

@bwalsh
Copy link
Member

bwalsh commented Mar 12, 2018

7.2% of associations harvested from MolecularMatch have no feature_names. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).

Breakdown

  • all molecularmatch 2085
  • no feature_names associations 155
  • no feature_names and no feature.start 13
  • of those 13 associations without start [BRCA, PDGFR, PORCN, VEGF, VEGFR]
  • of those 5 genes 4 have no hits with genomic_pos_hg19; 1 [PORCN] have multiple answers for genomic_pos_hg19

Next steps:

  • @ahwagner: can you confirm you are seeing the same?
  • bwalsh: add code to handle multiple genomic_pos_hg19 answers. Strategy - skip chromosome *_PATCH

BRCA

$ curl 'http://mygene.info/v3/query?q=BRCA&fields=genomic_pos_hg19'
{
  "max_score": 27.281399,
  "took": 6,
  "total": 1,
  "hits": [
    {
      "_id": "3376065",
      "_score": 27.281399
    }
  ]
}

PDGFR

$ curl 'http://mygene.info/v3/query?q=PDGFR&fields=genomic_pos_hg19'
{
  "max_score": 10.167263,
  "took": 8,
  "total": 156,
  "hits": [
    {
      "_id": "100487523",
      "_score": 10.167263
    }

PORCN

$ curl 'http://mygene.info/v3/query?q=PORCN&fields=genomic_pos_hg19'
{
  "max_score": 446.9968,
  "took": 12,
  "total": 193,
  "hits": [
    {
      "_id": "64840",
      "_score": 446.9968,
      "genomic_pos_hg19": [
        {
          "chr": "HG1436_HG1432_PATCH",
          "end": 48380213,
          "start": 48368361,
          "strand": 1
        },
        {
          "chr": "X",
          "end": 48379202,
          "start": 48367350,
          "strand": 1
        }
      ]
    },

VEGF

$ curl 'http://mygene.info/v3/query?q=VEGF&fields=genomic_pos_hg19'
{
  "max_score": 354.96655,
  "took": 14,
  "total": 580,
  "hits": [
    {
      "_id": "104948932",
      "_score": 354.96655
    },

VEGFR

$ curl 'http://mygene.info/v3/query?q=VEGFR&fields=genomic_pos_hg19'
{
  "max_score": 12.964587,
  "took": 11,
  "total": 40,
  "hits": [
    {
      "_id": "443177",
      "_score": 12.964587
    },

@grmayfie
Copy link
Contributor

@ahwagner Can we close? Please close if appropriate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants