Add coordinates to genomic features #88
Comments
@bwalsh sounds good. |
Notes:
Example:
Rule:
|
From #88: Currently, searches of the type Gene Variant (e.g. BRAF V600E) do a global OR search between terms. Ideally, we could rename the features.name field to the mutation name (without the prepended gene symbol) and by default limit the search to a global AND condition. Subtasks: remove the prepended gene symbol in features.name, propagate changes to feature_names.keyword. |
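A minimal sketch of the two subtasks above, assuming a Lucene-style query string and a caller-supplied set of known gene symbols. The helper names `split_feature_name` and `and_query` are hypothetical and are not part of the aggregator.

```python
def split_feature_name(name, gene_symbols):
    """Split 'BRAF V600E' into ('BRAF', 'V600E').

    The gene symbol is only stripped when the first token is in
    gene_symbols; otherwise the name is returned unchanged with no gene.
    """
    head, _, rest = name.strip().partition(" ")
    if rest and head.upper() in gene_symbols:
        return head.upper(), rest.strip()
    return None, name.strip()


def and_query(text):
    """Join whitespace-separated search terms with AND instead of OR.

    Assumes the search backend accepts Lucene-style boolean operators.
    """
    return " AND ".join(text.split())
```

With this, `split_feature_name("BRAF V600E", {"BRAF"})` yields the gene and the bare mutation name separately, and `and_query("BRAF V600E")` produces a query in which both terms must match.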
Using mygene.info and Ensembl, we can look up gene locations and transcripts fairly easily.
Example: From cgi
Feature after current processing:
Experiment:
|
A quick test shows this was a legitimate miss.
Reviewing all 187 unique gene/protein items, these are the only ones that have grep 'hits'
grep example:
unit test
@mayfielg, @jgoecks - at least for these 4 examples, the identifiers exist, but are not being returned |
On selecting canonical transcripts (for GRCh37): while this is mentioned in the Ensembl glossary, the only canonical transcripts I could find are those listed at UCSC. What follows are instructions for selecting those transcripts:
zgrep 'gene_name "BRAF"' Homo_sapiens.GRCh37.87.gtf.gz | awk '($3 == "transcript")' | perl -pe 's/.*transcript_id "(\w+)".*/$1/' | tee BRAF.txt
ENST00000496384
ENST00000288602
ENST00000479537
ENST00000497784
ENST00000469930
zgrep -Ff BRAF.txt knownToEnsembl.txt.gz | cut -f1 | tee BRAF.ucsc.txt
uc003vwc.4
zgrep -Ff BRAF.ucsc.txt knownIsoforms.txt.gz | cut -f2 | tee BRAF.canonical.ucsc.txt
uc003vwc.4
zgrep -Ff BRAF.canonical.ucsc.txt knownToRefSeq.txt.gz | cut -f2
NM_004333
From there, a query of NM_004333:p.V600E at mutalyzer returns a genomic coordinate! The last step is to get the most recent version of the transcript before sending it to the allele registry. I will write a response in this thread with those instructions later. |
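The query string sent to mutalyzer can be assembled from the RefSeq ID and the protein change. This is a rough sketch that only handles simple single-residue substitutions such as V600E; the function name `protein_query` and the regular expression are illustrative assumptions, not existing project code.

```python
import re

# One-letter reference residue, position, one-letter alternate residue (or * for stop).
PROTEIN_CHANGE = re.compile(r"^(?:p\.)?([A-Z])(\d+)([A-Z*])$")


def protein_query(refseq_id, change):
    """Build a '<transcript>:p.<change>' string, e.g. NM_004333:p.V600E."""
    match = PROTEIN_CHANGE.match(change.strip())
    if match is None:
        # Deletions, duplications and delins forms are out of scope for this sketch.
        raise ValueError(f"unsupported protein change: {change!r}")
    ref, pos, alt = match.groups()
    return f"{refseq_id}:p.{ref}{pos}{alt}"
```

For the BRAF example, `protein_query("NM_004333", "V600E")` gives the `NM_004333:p.V600E` string used above.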
Protein lookup via myvariant.info BRAF V600E example
others from our "misses"
|
@ahwagner : can you take a look at the output in this file. It should correspond to the ppm_re 'found' items from your notebook. |
quick call to get gene location
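The original call is not preserved in this thread. As a hedged sketch, a gene location lookup against the mygene.info v3 query endpoint might look like the following. The endpoint, the `symbol:` query prefix, and the `genomic_pos_hg19` field (with `chr`, `start`, `end` keys) reflect my understanding of the mygene.info API and should be checked against its documentation; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MYGENE_QUERY = "https://mygene.info/v3/query"  # assumed endpoint


def build_query_url(symbol, field="genomic_pos_hg19"):
    """Build a query URL for a human gene symbol, requesting only the location field."""
    params = {"q": f"symbol:{symbol}", "species": "human", "fields": field}
    return f"{MYGENE_QUERY}?{urlencode(params)}"


def extract_positions(response, field="genomic_pos_hg19"):
    """Return (chr, start, end) tuples from a parsed query response."""
    positions = []
    for hit in response.get("hits", []):
        pos = hit.get(field)
        if pos is None:
            continue
        # The field may hold one location object or a list of them.
        for p in pos if isinstance(pos, list) else [pos]:
            positions.append((str(p["chr"]), int(p["start"]), int(p["end"])))
    return positions


def gene_location(symbol):
    """Fetch and parse locations for a symbol (performs a network request)."""
    with urlopen(build_query_url(symbol), timeout=30) as resp:
        return extract_positions(json.load(resp))
```

Keeping URL construction and response parsing separate from the network call makes the lookup easy to unit test without hitting the service.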
|
Related issue: #92 |
@bwalsh the quick call you mentioned in #88 (comment) is sufficient to grab the coordinates for many of the above sets. I have revised the requirements of these sets to take advantage of this API call, avoiding the difficulty of selecting a representative transcript:
|
Breakdown
Next steps:
BRCA
PDGFR
PORCN
VEGF
VEGFR
|
@ahwagner Can we close? Please close if appropriate. |
Currently, 32% of our associations in 0.8 have no genomic coordinates assigned to at least one feature.
These fall into the following categories:
- Features such as BRCA2 D806H, KIT W557_K558del, BRAF V487_P492delinsA, or KIT S501_A502dup are >30% of missing feature coordinates, and primarily come from OncoKB (1565 associations) and Jax CKB (668 associations). These should be handled by our COSMIC / allele registry lookups, so this might be a bug.
- feature_names with exactly 1 gene name: collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or Ensembl genes for this purpose).
- feature_names of the format <gene> amplification or <gene> loss should have one corresponding feature of representative transcript coordinates.
- feature_names of the format <gene>-<gene> or <gene>:Fusions should have two corresponding features of representative transcript coordinates, one from each constituent gene.
- feature_names of the format <gene> mutation or <gene> inact mut should have one corresponding feature of representative transcript coordinates.
- feature_names that reference an exon should use the exon coordinates for the provided transcript. If no transcript is provided, use the exon coordinates from the representative transcript.

Following these rules, we go from 68.4% of associations with variant start/end coords to 98.8%.
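The gene-level rules above could be applied with a small classifier along these lines. This is a sketch under stated assumptions, not the regular expressions from the analysis notebook: the category labels, the `GENE` pattern (letters and digits only, so hyphenated symbols are not handled), and the function name `classify` are all illustrative.

```python
import re

GENE = r"[A-Za-z0-9]+"  # simplification: no hyphenated gene symbols

# Ordered rules; the first matching pattern wins.
RULES = [
    ("copy_number", re.compile(rf"^({GENE}) (?:amplification|loss)$", re.I)),
    ("fusion", re.compile(rf"^({GENE})-({GENE})$")),
    ("fusion", re.compile(rf"^({GENE}):Fusions$")),
    ("gene_mutation", re.compile(rf"^({GENE}) (?:mutation|inact mut)$", re.I)),
    ("gene_only", re.compile(rf"^({GENE})$")),
]


def classify(feature_name):
    """Return (category, genes) for a feature name, or ('other', [])."""
    name = feature_name.strip()
    for category, pattern in RULES:
        match = pattern.match(name)
        if match:
            # Every capturing group in the rules is a gene symbol.
            return category, list(match.groups())
    return "other", []
```

Each returned gene would then get one feature carrying its representative transcript coordinates, so a `<gene>-<gene>` fusion yields two features. Names that fall through to `other`, such as protein changes, stay with the existing COSMIC / allele registry lookups.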
Many of these items refer to a "representative" transcript. This is annotated in Ensembl as a "canonical" transcript.
Tagging as review per discussion with @jgoecks.
@bwalsh would you provide an estimate of the effort needed to make these changes?
I uploaded my WIP analysis workbook to the genie-analysis branch: https://github.com/ohsu-comp-bio/g2p-aggregator/blob/genie-analysis/notebooks/knowledgebase_comparison.ipynb
See the Feature coordinate filtering section for regular expressions that can help in selecting features for annotating with start/end coordinates.