Develop a potential recipe on conversion of secondary IDs to primary IDs #340
The problem here is this. Many databases delete or deprecate identifiers, including ELIXIR Core Resources. It is currently not easy to figure out whether a data set uses outdated identifiers. If it does, then a string match of its identifiers against the up-to-date database does not work. The BridgeDb project started collecting this kind of information (see BridgeDb Tiwid), but there is no guidance on how to use it yet.
Author would be Lucas, with support from @DeniseSl22 and me.
But there are also some databases that do keep track of their secondary IDs (HMDB and ChEBI, for example). So I would start with those ("the easy example"), and then we can think about how to apply a similar workflow to the removed IDs, which we just don't have in a mapping file. There's also the BED tool, which found a way to keep track of these (using a graph database) for gene/protein identifiers.
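To make the distinction between the "easy" case (secondary IDs with a known mapping) and the removed IDs concrete, here is a minimal Python sketch. It is not an existing BridgeDb function: the `classify` helper and the `EX:` identifiers are made up for illustration, and it assumes the primary IDs, the secondary-to-primary mapping, and the list of withdrawn IDs have already been loaded from the database and from a resource such as Tiwid.

```python
def classify(identifier, primary_ids, secondary_to_primary, withdrawn_ids):
    """Return (status, primary_id_or_None) for one identifier.

    Hypothetical helper: the three lookup structures are assumed to be
    loaded beforehand from the database's own mapping files.
    """
    if identifier in primary_ids:
        return ("primary", identifier)
    if identifier in secondary_to_primary:
        # The "easy" case: the database kept the link to the primary ID.
        return ("secondary", secondary_to_primary[identifier])
    if identifier in withdrawn_ids:
        # Known to have existed, but no mapping is available.
        return ("withdrawn", None)
    return ("unknown", None)


# Made-up placeholder identifiers, for illustration only.
primary_ids = {"EX:0001"}
secondary_to_primary = {"EX:0002": "EX:0001"}
withdrawn_ids = {"EX:0003"}

for identifier in ["EX:0001", "EX:0002", "EX:0003", "EX:0004"]:
    print(identifier, classify(identifier, primary_ids,
                               secondary_to_primary, withdrawn_ids))
```

Only the "secondary" case can be repaired automatically; "withdrawn" and "unknown" identifiers would still need manual follow-up.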
As mentioned today in the call, but written here: for me it is not even clear what secondary and primary identifiers are. But I am looking forward to the abstract to see whether I understand it then. Best!
@robertgiessmann: thanks for asking! The primary IDs are the IDs the database wants you to use when referring to a specific molecular entity. The secondary IDs are IDs in the database that refer to a similar entity as the one meant by the primary ID (so duplicates). These IDs are at some point cleaned up and considered "old/outdated", linked to the primary one (in various ways), or deleted. When there is a link between primary and secondary, we can actually understand which entity is meant in a dataset that is annotated with old IDs. I hope this explains it a bit more; if not, let me know!
Hi @DeniseSl22, sorry to say, I am actually more confused now. Are you talking about cross-refs? Shall we consider a specific example? I noticed https://github.com/bridgedb/tiwid/ and, taking from https://github.com/bridgedb/tiwid/blob/main/data/chebi.csv , there was once a ChEBI identifier "594834" (I guess...), which does not resolve right now: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:594834 returns an error. Doing a Google search for it, and looking at https://bioinformatics.charite.de/supertarget/index.php?site=drug_target&id=9977819-CSK_HUMAN , it seems CHEBI:594834 once referred to https://pubchem.ncbi.nlm.nih.gov/compound/9977819 , which has the status Non-live in PubChem and was thus probably removed from ChEBI. We can only speculate about the reasons for this Non-live status, as there is no information, but that doesn't matter after all. Is there already a primary and a secondary identifier in here?
Hi @robertgiessmann, more confusion was not my intention ;) The example you provide is indeed an example of an ID which has been removed (and we don't really know where it went; finding that out is probably only possible by asking the ChEBI team to check their archived data or so...). Here are two examples, which hopefully put it more in perspective:
Water has the primary ID CHEBI:15377; there's also an item on the ChEBI website called "secondary ChEBI IDs", with the items "CHEBI:5585, CHEBI:42857, CHEBI:42043, CHEBI:44292, CHEBI:44819, CHEBI:43228, CHEBI:44701, CHEBI:10743, CHEBI:13352, CHEBI:27313".
Urobilin in HMDB has the main/primary ID HMDB0004160 (that's also the ID linked in the URL for this item, and the one other databases should use as cross-ref). This compound also has some other IDs connected to it (under "Secondary Accession numbers"): HMDB0004159, HMDB0004161, HMDB04159, HMDB04160, HMDB04161.
So again, at some point the HMDB team realised they had the same compound in their database as individual entries, after which they merged them into one entry and selected one ID as the new main one (the primary). With HMDB, there's also the change in ID structure: the HMDB 3.x version used IDs with the structure HMDBabcde, while HMDB 4.x uses HMDB00abcde (with abcde standing for arbitrary digits). So, the entry for Urobilin was present in HMDB 3.x three times (HMDB04159, HMDB04160, HMDB04161). Then the ID structure itself got changed (to HMDB0004159, HMDB0004160, HMDB0004161). After that, the compounds were considered the same, the three entries got merged, and one ID was selected to be the main one (HMDB0004160).
Lots of databases have these "issues", since duplicate entries need to be dealt with. I think the way ChEBI and HMDB do this makes the changes traceable (in the two examples above, not in the example you provided). Removing duplicate entries without being able to link that information together creates problems for data analysis (as is the case with your example: finding out which compound is meant by "CHEBI:594834" will take quite some time).
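As a rough illustration of the two examples above, the following Python sketch turns the listed secondary IDs into a lookup table and resolves an old ID to its primary one. The helper names `normalize_hmdb` and `to_primary` are hypothetical, the table only contains the IDs quoted in this comment, and the padding rule assumes the HMDB 3.x to 4.x change is purely "HMDB + 5 digits" becoming "HMDB + 7 digits", as described above. A real recipe would load the mappings from the databases' own files.

```python
import re

# Primary ID -> secondary IDs, copied from the water (ChEBI) and
# urobilin (HMDB) examples in this thread.
SECONDARY_IDS = {
    "CHEBI:15377": [
        "CHEBI:5585", "CHEBI:42857", "CHEBI:42043", "CHEBI:44292",
        "CHEBI:44819", "CHEBI:43228", "CHEBI:44701", "CHEBI:10743",
        "CHEBI:13352", "CHEBI:27313",
    ],
    "HMDB0004160": [
        "HMDB0004159", "HMDB0004161",
        "HMDB04159", "HMDB04160", "HMDB04161",
    ],
}

# Invert the table so that each secondary ID points to its primary ID.
SECONDARY_TO_PRIMARY = {
    secondary: primary
    for primary, secondaries in SECONDARY_IDS.items()
    for secondary in secondaries
}


def normalize_hmdb(identifier):
    """Pad an HMDB 3.x style ID (HMDB + 5 digits) to the 4.x style (7 digits).

    Assumption: the format change is only zero-padding of the number.
    """
    match = re.fullmatch(r"HMDB(\d{5})", identifier)
    if match:
        return "HMDB" + match.group(1).zfill(7)
    return identifier


def to_primary(identifier):
    """Return the primary ID if a mapping is known, else the (normalized) input."""
    identifier = normalize_hmdb(identifier)
    return SECONDARY_TO_PRIMARY.get(identifier, identifier)


print(to_primary("CHEBI:5585"))   # CHEBI:15377
print(to_primary("HMDB04159"))    # HMDB0004160
print(to_primary("HMDB04160"))    # HMDB0004160
```

Note that an ID without a known mapping, such as "CHEBI:594834", is returned unchanged, so this sketch alone cannot tell a current primary ID from a removed one.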
Ah, I see now -- also where the wording "secondary identifier" derives from... Cool! Well, yeah, that's a common problem. I would split it into multiple issues, I guess:
Do you see any more intrinsic aspects of this, in this context? Thanks again for the explanation, really helped me a lot -- I guess this can be recycled straight into the recipe! 👍
@proccaserra, it will take some time to wrap up this recipe, but I am working on it.
For the BridgeDbR issue that sparked the idea for this recipe, see here by @DeniseSl22. In this recipe we would provide a workflow to map secondary IDs to primary IDs. The recipe would therefore be mostly hands-on, but it could also include a theoretical part to highlight the importance of the task in the context of improving data interoperability (as long as the content doesn't overlap with the Identifier mapping recipe). I could be in charge of developing the required scripts/Jupyter notebooks and might also be able to work on the theoretical side if we decide to include it.