Fix GAF loading for RNAcentral annotations #1255

Open
kimrutherford opened this issue Jan 24, 2025 · 8 comments

Comments

@kimrutherford
Member

These errors should be fixable. There are about 1000 RNAcentral annotations that are failing because of this:

line 140430: gene feature not found, none of the identifiers (URS000030D4E9_4896 URS000030D4E9_4896) from this annotation match a systematic ID in Chado - skipping
line 140431: gene feature not found, none of the identifiers (URS000030D4E9_4896 URS000030D4E9_4896) from this annotation match a systematic ID in Chado - skipping
line 140435: gene feature not found, none of the identifiers (URS0000314D2B_4896 URS0000314D2B_4896) from this annotation match a systematic ID in Chado - skipping

https://curation.pombase.org/dumps/builds/pombase-build-2025-01-23/logs/log.2025-01-22-23-42-44.gaf-load-output

Probably we just need to remove the _4896 from the IDs like URS000030D4E9_4896 and look up in Chado using the base ID URS000030D4E9.
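A minimal sketch of that suffix removal, assuming the IDs always look like `URS` followed by hex digits, an underscore and a numeric taxon ID. The function name `base_urs_id` and the regular expression are illustrative assumptions, not the actual loader code:

```python
import re

# Assumed shape of an RNAcentral ID with a taxon suffix, e.g. URS000030D4E9_4896.
URS_WITH_TAXON = re.compile(r"^(URS[0-9A-Fa-f]+)_(\d+)$")


def base_urs_id(identifier: str) -> str:
    """Return the ID without its _<taxon> suffix.

    Identifiers that do not match the assumed URS pattern are returned unchanged.
    """
    match = URS_WITH_TAXON.match(identifier)
    return match.group(1) if match else identifier


print(base_urs_id("URS000030D4E9_4896"))
```

The base ID returned here would then be used for the Chado lookup instead of the suffixed form.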

@kimrutherford kimrutherford self-assigned this Jan 24, 2025
@kimrutherford
Member Author

This is a little trickier than I thought. The URS IDs aren't unique so some annotation rows from the GOA GAF file will map to multiple genes. Before I go ahead with implementing it, is it OK to do that?

kimrutherford added a commit that referenced this issue Jan 24, 2025
@ValWood
Member

ValWood commented Jan 24, 2025

I think it's the other way around. One pombe gene will have multiple URS IDs.

If it's the other way around it sounds wrong and we should look into that!

@kimrutherford
Member Author

I think it's the other way around. One pombe gene will have multiple URS IDs

There are two genes in Chado with 2 URS IDs but those have 2 transcripts - each transcript has its own URS.

There are quite a few URS IDs that are attached to more than one gene.

Some examples:

https://rnacentral.org/rna/URS0000314D2B/4896
URS0000314D2B │ SPRRNA.10 ║
URS0000314D2B │ SPRRNA.17 ║
URS0000314D2B │ SPRRNA.20 ║
URS0000314D2B │ SPRRNA.28 ║
URS0000314D2B │ SPRRNA.34 ║
URS0000314D2B │ SPRRNA.39 ║

URS0000415965 │ SPATRNAALA.01 ║
URS0000415965 │ SPATRNAALA.04 ║
URS0000415965 │ SPATRNAALA.05 ║
URS0000415965 │ SPBTRNAALA.07 ║
URS0000415965 │ SPBTRNAALA.08 ║
URS0000415965 │ SPBTRNAALA.09 ║
URS0000415965 │ SPBTRNAALA.10 ║
URS0000415965 │ SPBTRNAALA.11 ║
URS0000415965 │ SPCTRNAALA.12 ║

URS00002BA4D5 │ SPATRNATRP.01 ║
URS00002BA4D5 │ SPBTRNATRP.02 ║
URS00002BA4D5 │ SPBTRNATRP.03 ║
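To illustrate what mapping one annotation row to several genes would look like, here is a rough sketch using some of the pairs listed above. The row layout (`db_object_id`, `go_id`), the placeholder term `GO:XXXXXXX` and both function names are assumptions made for this example; the real loader works against Chado rather than an in-memory list:

```python
from collections import defaultdict

# (URS ID, systematic ID) pairs copied from the examples in this comment.
PAIRS = [
    ("URS0000314D2B", "SPRRNA.10"),
    ("URS0000314D2B", "SPRRNA.17"),
    ("URS0000314D2B", "SPRRNA.20"),
    ("URS0000314D2B", "SPRRNA.28"),
    ("URS0000314D2B", "SPRRNA.34"),
    ("URS0000314D2B", "SPRRNA.39"),
    ("URS00002BA4D5", "SPATRNATRP.01"),
    ("URS00002BA4D5", "SPBTRNATRP.02"),
    ("URS00002BA4D5", "SPBTRNATRP.03"),
]


def genes_by_urs(pairs):
    """Build an index from base URS ID to the list of genes that carry it."""
    index = defaultdict(list)
    for urs, gene in pairs:
        index[urs].append(gene)
    return dict(index)


def expand_annotation(row, index):
    """Return one copy of the row per matching gene (empty list if none match)."""
    urs = row["db_object_id"].split("_", 1)[0]  # drop the _4896 taxon suffix
    return [{**row, "gene": gene} for gene in index.get(urs, [])]


index = genes_by_urs(PAIRS)
# Hypothetical annotation row; GO:XXXXXXX is a placeholder, not a real term.
row = {"db_object_id": "URS00002BA4D5_4896", "go_id": "GO:XXXXXXX"}
for annotation in expand_annotation(row, index):
    print(annotation["gene"], annotation["go_id"])
```

With this approach a single GAF row for `URS00002BA4D5_4896` becomes three annotations, one for each tRNA gene that shares the URS ID.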

@ValWood
Member

ValWood commented Jan 24, 2025

OK, that's a bit weird. We should report that too.

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Jan 24, 2025
@kimrutherford
Member Author

Fix GAF loading for RNAcentral annotations

That's fixed for the next load. Once it's done I'll compare with the previous load to see how many extra annotations we have. I suspect it won't be many.

@kimrutherford
Member Author

Once it's done I'll compare with the previous load to see how many extra annotations we have. I suspect it won't be many.

168 :-)

On Monday I'll double check to make sure we're getting all the possible annotations.

@kimrutherford
Member Author

168 :-)
On Monday I'll double check to make sure we're getting all the possible annotations.

I think it's OK. There are only 630 pombe RNAcentral annotations (I miscounted earlier because I included the japonicus annotations). There are 160 or so URS IDs that don't have corresponding pombe genes, which causes 334 of the 630 annotations to fail to load.

Some of the remaining annotations are filtered because of: pombe-embl/supporting_files/GO_terms_excluded_from_pombase.txt

And then some annotations are filtered because you have a more specific annotation already.
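As a rough illustration of those two filters, here is a toy sketch. Everything in it is assumed for the example: the function name `filter_incoming`, the `TERM:` identifiers (deliberately not real GO IDs), the shape of the ancestor map, and the idea that the excluded-terms file has already been read into a set. It is not the actual loader logic:

```python
def filter_incoming(incoming, existing, excluded_terms, ancestors):
    """Drop incoming (gene, term) pairs that are excluded or redundant.

    ancestors maps a term to the set of its less specific ancestor terms.
    An incoming annotation is redundant when the gene already has an
    existing annotation to a term that has the incoming term as an ancestor.
    """
    kept = []
    for gene, term in incoming:
        if term in excluded_terms:
            continue  # term is on the exclusion list
        existing_terms = {t for g, t in existing if g == gene}
        if any(term in ancestors.get(t, set()) for t in existing_terms):
            continue  # a more specific annotation already exists for this gene
        kept.append((gene, term))
    return kept


# Toy data only: these term IDs are placeholders, not real GO terms.
ancestors = {"TERM:specific": {"TERM:general"}}
excluded = {"TERM:excluded"}
existing = [("SPRRNA.10", "TERM:specific")]
incoming = [
    ("SPRRNA.10", "TERM:general"),   # dropped: more specific term already present
    ("SPRRNA.10", "TERM:excluded"),  # dropped: excluded term
    ("SPRRNA.17", "TERM:general"),   # kept
]
print(filter_incoming(incoming, existing, excluded, ancestors))
```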

@kimrutherford
Member Author

URS0000415965 │ SPATRNAALA.01 ║
URS0000415965 │ SPATRNAALA.04 ║
URS0000415965 │ SPATRNAALA.05 ║
...

OK, that's a bit weird. We should report that too.

It looks like they group identical genes into one entry:
https://rnacentral.org/rna/URS0000415965/4896
