Develop a potential recipe on conversion of secondary IDs to primary IDs #340

lucas-ubm · 2021-06-04T13:49:45Z

egonw · 2021-06-06T05:00:31Z

The problem here is this. Many databases delete or deprecate identifiers, including ELIXIR Core Resources. It is currently not easy to figure out whether a data set uses outdated identifiers. If it does, a string match of those identifiers against an up-to-date database does not work. The BridgeDb Project started collecting this kind of information (see BridgeDb Tiwid), but there is no guidance on how to use it yet.

egonw · 2021-06-06T05:01:30Z

Author would be Lucas, with support from @DeniseSl22 and me.

DeniseSl22 · 2021-06-07T08:33:34Z

But there are also some databases that do keep track of their secondary IDs (HMDB and ChEBI, for example). So I would start with those ("the easy example"), and then we can think about how to apply a similar workflow to the removed IDs, which we just don't have in a mapping file. There's also the BED tool, which found a way to keep track of these (using a graph database) for gene/protein identifiers.

ghost · 2021-06-09T10:29:42Z

as mentioned today in the call, but written here: for me it is not even clear what secondary and primary identifiers are. But I am looking forward to the abstract to see whether I understand it then. Best!

DeniseSl22 · 2021-06-09T10:46:35Z

@robertgiessmann : thanks for asking! The primary IDs are the IDs the database wants you to use when referring to a specific molecular entity. The secondary IDs are IDs in that same database which refer to the same entity as the primary one (so duplicates). At some point these IDs are cleaned up: they are considered "old/outdated" and linked to the primary one (in various ways), or they are deleted. When there is a link between primary and secondary, we can actually work out which entity is meant in a dataset that is annotated with old IDs. I hope this explains it a bit more, if not let me know!

ghost · 2021-06-09T11:02:23Z

Hi @DeniseSl22 , sorry to say, actually I am more confused now. Do you speak about cross-refs? Shall we consider a specific example?

I noticed https://github.com/bridgedb/tiwid/ -- taking from there:

https://github.com/bridgedb/tiwid/blob/main/data/chebi.csv

there was once a ChEBI identifier "594834" (I guess...) -- which does not resolve right now: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:594834 => returns error

Doing a Google Search for it, and looking at https://bioinformatics.charite.de/supertarget/index.php?site=drug_target&id=9977819-CSK_HUMAN , it seems CHEBI:594834 once referred to https://pubchem.ncbi.nlm.nih.gov/compound/9977819 , which has status Non-live in PubChem, and was thus probably removed from ChEBI.

We can only speculate about the reasons for this Non-live status, as there is no information, but that doesn't matter after all.

Is there a primary and secondary identifier in here, already?

DeniseSl22 · 2021-06-09T12:24:37Z

Hi @robertgiessmann , more confusion was not my intention ;) The example you provide is indeed an example of an ID which has been removed (and we don't really know where it went; finding that out is probably only possible by asking the ChEBI team to check their archived data or so....). Here are two examples, which hopefully put it in more perspective:

Water has as primary ID: CHEBI:15377; there's also an item on the ChEBI website called "secondary ChEBI IDs", with items: "CHEBI:5585, CHEBI:42857, CHEBI:42043, CHEBI:44292, CHEBI:44819, CHEBI:43228, CHEBI:44701, CHEBI:10743, CHEBI:13352, CHEBI:27313" .
So, within the ChEBI database at some point there were 11 entries for water (1x the primary ID, and 10x the secondary IDs). These have been merged into one database entry, where the ID 15377 has been selected as the one to use from now on (the primary one).
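
A minimal sketch of how such a merge record could be used, in Python. The dictionary and the function name `to_primary` are hypothetical, and the table contains only the water IDs listed above; a real table would have to be generated from ChEBI's own data.

```python
# Secondary -> primary lookup built only from the water example above.
# Illustration only: a real table would come from the database itself.
WATER_PRIMARY = "CHEBI:15377"
WATER_SECONDARY = [
    "CHEBI:5585", "CHEBI:42857", "CHEBI:42043", "CHEBI:44292", "CHEBI:44819",
    "CHEBI:43228", "CHEBI:44701", "CHEBI:10743", "CHEBI:13352", "CHEBI:27313",
]
SECONDARY_TO_PRIMARY = {sec: WATER_PRIMARY for sec in WATER_SECONDARY}


def to_primary(identifier, mapping=SECONDARY_TO_PRIMARY):
    """Return the primary ID for a known secondary ID.

    IDs without a known link (primary IDs, but also removed IDs)
    are returned unchanged.
    """
    return mapping.get(identifier, identifier)


# A dataset annotated with a mix of old and current IDs for water.
annotated = ["CHEBI:42857", "CHEBI:15377", "CHEBI:27313"]
converted = [to_primary(i) for i in annotated]
print(converted)
```

Note that an ID without a known link simply passes through unchanged, so this lookup alone cannot tell a valid primary ID apart from a removed one.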

Urobilin in HMDB has the main/primary ID HMDB0004160 (that's also the ID used in the URL for this item, and the one other databases should use as cross-ref). This compound also has some other IDs connected to it (under "Secondary Accession numbers"): HMDB0004159, HMDB0004161, HMDB04159, HMDB04160, HMDB04161. So again, at some point the HMDB team realised they had the same compound in their database as separate entries, after which they merged them into one entry and selected one ID as the new main one (the primary). With HMDB, there's also the change in ID structure: HMDB 3.x used IDs with the structure HMDBabcde, while HMDB 4.x uses HMDB00abcde (with abcde as arbitrary digits). So, the entry for urobilin was present in HMDB 3.x three times (HMDB04159, HMDB04160, HMDB04161). Then the ID structure itself was changed (to HMDB0004159, HMDB0004160, HMDB0004161). After that, the three entries were recognised as the same compound and merged, and one ID was selected to be the main one (HMDB0004160).
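
The two HMDB steps (the change in ID structure, then the merge) could be sketched in Python roughly like this. The function names `pad_hmdb_id` and `hmdb_to_primary` are hypothetical, and the mapping holds only the urobilin accessions listed above:

```python
import re

# Secondary accessions for urobilin as listed above, mapped to the primary ID.
UROBILIN_PRIMARY = "HMDB0004160"
UROBILIN_SECONDARY = [
    "HMDB0004159", "HMDB0004161", "HMDB04159", "HMDB04160", "HMDB04161",
]
HMDB_SECONDARY_TO_PRIMARY = {sec: UROBILIN_PRIMARY for sec in UROBILIN_SECONDARY}

# Old-style IDs: "HMDB" followed by exactly five digits.
_OLD_STYLE = re.compile(r"HMDB(\d{5})")


def pad_hmdb_id(identifier):
    """Rewrite a 5-digit HMDB ID (HMDBabcde) as the 7-digit form (HMDB00abcde)."""
    match = _OLD_STYLE.fullmatch(identifier)
    return f"HMDB00{match.group(1)}" if match else identifier


def hmdb_to_primary(identifier, mapping=HMDB_SECONDARY_TO_PRIMARY):
    """Pad the ID first, then follow the secondary -> primary link if one is known."""
    padded = pad_hmdb_id(identifier)
    return mapping.get(padded, padded)


for old_id in ["HMDB04159", "HMDB04160", "HMDB0004161"]:
    print(old_id, "->", hmdb_to_primary(old_id))
```

Padding before the lookup means the mapping only needs the 7-digit secondary IDs; the 5-digit entries in the table are then redundant but harmless.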

Lots of databases have these "issues", since duplicate entries need to be dealt with. I think the way ChEBI and HMDB do this makes the changes traceable (in the two examples above, not in the example you provided). Removing duplicate entries without being able to link that information together creates problems for data analysis (as is the case with your example: finding out which compound is meant by "CHEBI:594834" will take quite some time).
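
To make the difference between traceable and untraceable changes concrete, here is a rough Python sketch that sorts the IDs in a dataset into four groups. Everything in it is an assumption for illustration: the function `classify`, the tiny hand-made tables, the idea that removed IDs come from a Tiwid-style list, and the made-up ID "CHEBI:0" used as an unknown value.

```python
# Hand-made tables for illustration only; real ones would be built from
# the database's own secondary-ID data and from a list of removed IDs.
PRIMARY = {"CHEBI:15377"}
SECONDARY_TO_PRIMARY = {
    "CHEBI:5585": "CHEBI:15377",
    "CHEBI:42857": "CHEBI:15377",
}
WITHDRAWN = {"CHEBI:594834"}  # assumed to appear in a Tiwid-style list


def classify(identifier):
    """Return (status, resolved_id); resolved_id is None when no link is known."""
    if identifier in PRIMARY:
        return ("primary", identifier)
    if identifier in SECONDARY_TO_PRIMARY:
        # Traceable: the merge is recorded, so the entity can be recovered.
        return ("secondary", SECONDARY_TO_PRIMARY[identifier])
    if identifier in WITHDRAWN:
        # Known to be removed, but there is no link to a current entry.
        return ("withdrawn", None)
    return ("unknown", None)


# "CHEBI:0" is a made-up ID standing in for something not found anywhere.
for identifier in ["CHEBI:15377", "CHEBI:5585", "CHEBI:594834", "CHEBI:0"]:
    print(identifier, classify(identifier))
```

Only the "secondary" case can be fixed automatically; "withdrawn" and "unknown" IDs still need manual follow-up.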

ghost · 2021-06-09T14:52:52Z

Ah, I see now -- also where the wording "secondary identifier" derives from...

Cool! Well, yeah, that's a common problem.

I would split it into multiple issues, I guess:

- "versioning" of identifiers? / identifiers across different versions of one database
- (good) ways of doing (de-)duplication
- does the behavior of xyz (ChEBI, HMDB, ...) fit well to the FAIR guiding principles? (especially: https://www.go-fair.org/fair-principles/a2-metadata-accessible-even-data-no-longer-available/)

Do you see any more intrinsic aspects of this, in this context?

Thanks again for the explanation, really helped me a lot -- I guess this can be recycled straight into the recipe! 👍

tabbassidaloii · 2022-08-31T13:25:22Z

@proccaserra, It will take some time to wrap up this recipe, but I am working on it.

lucas-ubm added issue type: proposal issue contains general proposals on how to proceed forwards author's task: write recipe author has to write the full recipe labels Jun 4, 2021

lucas-ubm assigned egonw and lucas-ubm and unassigned egonw and lucas-ubm Jun 4, 2021

ghost added issue type: proposal - new recipe issue suggests to create one new recipe and removed issue type: proposal issue contains general proposals on how to proceed forwards labels Aug 3, 2021

egonw assigned tabbassidaloii and unassigned lucas-ubm Feb 21, 2022

lucas-ubm commented Jun 4, 2021 • edited by tabbassidaloii
