Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changes to maintenance script / ensembl push #1646

Merged
merged 13 commits into from
Jan 17, 2025

Conversation

andrewkern
Copy link
Member

@andrewkern andrewkern commented Jan 7, 2025

This PR addresses (closes?) #1521.

think i've finally cleaned this up a bit. this PR overhauls the update-genome-data portion of the maintenance script such that:

  • species whose genomes were manually created are skipped
  • species where the ensembl Rest API returns mismatching chromosome names with the current release are skipped
  • detailed logging warnings of these issues are created, along with a final report of which species were skipped in the update and why.

the final report looks like this:

=== Species Update Summary ===
2025-01-07 14:34:46,744 [91493] WARNING  maint: The following species were not updated:
2025-01-07 14:34:46,744 [91493] WARNING  maint:   - AnoCar (Ensembl ID: anolis_carolinensis):
2025-01-07 14:34:46,744 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,744 [91493] WARNING  maint:     Existing chromosomes: ['1', '2', '3', '4', '5', '6', 'LGa', 'LGb', 'LGc', 'LGd', 'LGf', 'LGg', 'LGh', 'MT']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '2', '3', '4', '5', '6', 'a', 'b', 'c', 'd', 'f', 'g', 'h']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - CanFam (Ensembl ID: canis_lupus_familiaris):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '4', '5', '6', '7', '8', '9', 'MT', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '4', '5', '6', '7', '8', '9', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - DroSec (Ensembl ID: drosophila_sechellia): Manually created genome data file
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - GasAcu (Ensembl ID: gasterosteus_aculeatus):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '3', '4', '5', '6', '7', '8', '9', 'MT', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['I', 'II', 'III', 'IV', 'IX', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XIX', 'XV', 'XVI', 'XVII', 'XVIII', 'XX', 'XXI', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - HelMel (Ensembl ID: heliconius_melpomene):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '3', '4', '5', '6', '7', '8', '9']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: []
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - PanTro (Ensembl ID: pan_troglodytes):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - PonAbe (Ensembl ID: pongo_abelii):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - StrAga (Ensembl ID: streptococcus_agalactiae_GCA_001017915):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1']
2025-01-07 14:34:46,746 [91493] WARNING  maint:     New chromosomes: []

as only some of the species are being updated here, we should arguably move from writing out ensembl_info.py file which has release data, to the release being held in a species specific slot in each genome_data.py file of the catalog. I'm happy to make that edit if people agree.

@andrewkern
Copy link
Member Author

an additional thought here-- I'm not doing any checking to see if chromosome lengths have changed. We may want to do that as well

@petrelharp
Copy link
Contributor

Okay, over in #1521 we said
Screenshot from 2025-01-07 15-12-21
and IIUC what you're doing here is step (1), as well as updating the maintenance script to be able to do this reasonably? But, this is not actually updating all the genomes, right? Seems like you should be recording here which genomes have been updated - at least in a comment, so we can next easily come along and do step (2)?

What about the other species, for which maintenance failed - what do you think we should do there?

Hm - nothing of consequence seems to be changed here, other than adding two Y chromosomes, right? All the genome lengths are the same, and we don't use assembly_name for anything? Is the only thing of consequence that could change be the chromosome lengths? (besides the chromosome names)

Interestingly, I see that we are maybe already using different ensembl releases:

$ grep ensembl.org stdpopsim/**/*.py
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"

@petrelharp
Copy link
Contributor

petrelharp commented Jan 7, 2025

And I guess we need to also change the tests:


=================================== FAILURES ===================================
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_CanFam.TestSpeciesData object at 0x7f34c0c12110>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "canis_familiaris"
E       AssertionError: assert 'canis_lupus_familiaris' == 'canis_familiaris'
E         
E         - canis_familiaris
E         + canis_lupus_familiaris
E         ?     ++++++

tests/test_CanFam.py:12: AssertionError
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_GasAcu.TestSpeciesData object at 0x7f34c0ba1780>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "9307941"
E       AssertionError: assert 'gasterosteus_aculeatus' == '9307941'
E         
E         - 9307941
E         + gasterosteus_aculeatus

tests/test_GasAcu.py:12: AssertionError
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_StrAga.TestSpeciesData object at 0x7f34c09b0d30>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "NA"
E       AssertionError: assert 'streptococcu...GCA_001017915' == 'NA'
E         
E         - NA
E         + streptococcus_agalactiae_GCA_001017915

tests/test_StrAga.py:12: AssertionError
_________________________________ test_version _________________________________

    def test_version():
        release = stdpopsim.catalog.ensembl_info.release
>       assert release == 103
E       assert 113 == 103

I'll go have a look at the tests now an update them in the PR

@andrewkern andrewkern mentioned this pull request Jan 8, 2025
11 tasks
@andrewkern
Copy link
Member Author

Okay, over in #1521 we said Screenshot from 2025-01-07 15-12-21 and IIUC what you're doing here is step (1), as well as updating the maintenance script to be able to do this reasonably? But, this is not actually updating all the genomes, right? Seems like you should be recording here which genomes have been updated - at least in a comment, so we can next easily come along and do step (2)?

Yes I'd like to add the ensembl_build version as a property of the genome_data -- I can add that to this PR I reckon

What about the other species, for which maintenance failed - what do you think we should do there?

If we include this property, nothing will need to done, other than maintain the current build version I think.

Hm - nothing of consequence seems to be changed here, other than adding two Y chromosomes, right? All the genome lengths are the same, and we don't use assembly_name for anything? Is the only thing of consequence that could change be the chromosome lengths? (besides the chromosome names)

that's correct.

Interestingly, I see that we are maybe already using different ensembl releases:

$ grep ensembl.org stdpopsim/**/*.py
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"

Copy link

codecov bot commented Jan 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.84%. Comparing base (a344665) to head (5753a8a).
Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1646   +/-   ##
=======================================
  Coverage   99.84%   99.84%           
=======================================
  Files         133      132    -1     
  Lines        4569     4584   +15     
  Branches      472      467    -5     
=======================================
+ Hits         4562     4577   +15     
  Misses          3        3           
  Partials        4        4           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@andrewkern
Copy link
Member Author

a simple way forward for including species specific build info would look like this:

  • the data dict in each species genome_data.py would be extended with two slots
    • build_type which would take the values manual, ensembl, or NCBI
    • ensembl_build which would take the values NA or <ensembl_build_version_id, the later which would be auto-populated by the maintenance script

including a build_type property in the dict would also allow for an easy check as to which genomes the maintenance script should attempt to upload

@petrelharp
Copy link
Contributor

We should probably talk about the API, rather than how it gets in there via the helper function that parses genome_data.py?

But, assuming that you are suggesting that we add those slots also to the Genome object, that sounds good.

@petrelharp
Copy link
Contributor

Actually, how about an attribute build, which is then a dict, containing type and optionally other things, like version? Or a NamedTuple or something like that? This is because we'd like to not start adding lots of slots for, say ncbi_build_version, etc etc.

Or, just two attributes, build_type and build_version?

Also, instead of NA I supposed you mean None?

@andrewkern
Copy link
Member Author

andrewkern commented Jan 8, 2025

We should probably talk about the API, rather than how it gets in there via the helper function that parses genome_data.py?

But, assuming that you are suggesting that we add those slots also to the Genome object, that sounds good.

actually i was suggesting that it would go not in the Genome object but in genome_data.data like here, but yes that would eventually get propagated to Genome

@andrewkern
Copy link
Member Author

Actually, how about an attribute build, which is then a dict, containing type and optionally other things, like version? Or a NamedTuple or something like that? This is because we'd like to not start adding lots of slots for, say ncbi_build_version, etc etc.

Or, just two attributes, build_type and build_version?

Also, instead of NA I supposed you mean None?

This is a good suggestion-- genome_data.data could hold this build dict. As yes I mean None

@petrelharp
Copy link
Contributor

The data dictionary in genome_data.py is not visible to end users; it is back-end for how we set up the Genome and Species objects that are user-visible, and are what we should be discussing. Probably everything in that dict is then mirrored as an attribute in Genome, so the distinction is moot, though?

@andrewkern
Copy link
Member Author

okay @petrelharp -- added the attributes we talked about the Genome API. This led to lots of downstream changes as we discussed. Sorry for the large PR

@@ -177,6 +183,10 @@ def black_format(code):


def ensembl_stdpopsim_id(ensembl_id):
if ensembl_id == "canis_lupus_familiaris":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps insert a comment explaining what this is doing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, explaining why this is necessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, actually, I don't understand why it's necessary. I see below that now species.ensembl_id == "canis_lupus_familiaris", so where does ensembl_id equal "canis_familiaris"? Just missing something here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah this is because of the transition that happened when running the maintenance script! now i bet i can take it out

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - could you either take this out, or say in the comment exactly what needs to happen to take it out? (It says "once we have moved to the new names" - but: moved what names? where? I am confused. =)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a larger comment now explaining the context that says:

    # below is do deal with name changes in Ensembl
    # TODO: remove this once we have moved to the new names
    # this was added as a temporary fix to allow the release to be
    # updated to 113, where the names of these species were changed
    # in the future we might keep this code block, but comment it out
    # to show others how maintenance updates were performed in the case
    # where the species name changed in Ensembl

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't what I understood from talking to you - I thought that (a) we could remove this code right now if we wanted, since this if clause will never get triggered; but (b) we were going to leave the code in as an example of how to do this. Is that right?

maintenance/main.py Outdated Show resolved Hide resolved
data = self.ensembl_client.get_genome_data(ensembl_id)

# Preserve existing assembly source or default to "ensembl"
if genome_data_path.exists():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, this duplicates the code above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and, it seems like it's doing the same thing, almost, as the code above?)

maintenance/main.py Outdated Show resolved Hide resolved
data["assembly_build_version"] = None

# Check if existing genome data exists and compare chromosome names
if genome_data_path.exists():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is doing something different to the code above but duplicates the "open genome_data_path if it exists" code; just do that once?

for species_id, eid in embl_ids:
try:
result = writer.write_genome_data(eid)
if result is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how that function could return None?

@@ -391,7 +520,7 @@ def add_species(ensembl_id, force):
"""
writer = DataWriter()
writer.add_species(ensembl_id.lower(), force=force)
writer.write_ensembl_release()
# writer.write_ensembl_release()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

"MT": 1,
}
_ploidy = {str(i): _species_ploidy for i in range(1, 30)}
_ploidy.update({"X": _species_ploidy, "MT": 1})

_chromosomes = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_chromosomes is no longer used

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that's right. forgot to delete these lines. will do

_mutation_rate = 4e-9
_mutation_rate_data = {str(i): _mutation_rate for i in range(1, 39)}
_mutation_rate_data["MT"] = (
_mutation_rate # note this is likely incorrect but consistent with current setup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could have this note on nearly all the species, right? no need here?

@@ -46,11 +45,24 @@
)
)

_genome = stdpopsim.Genome(
chromosomes=_chromosomes,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_chromosomes no longer used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case it's used for chrom ids

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - in that case delete _chromosomes and do like this:

_ploidy_data = {str(i): _species_ploidy for i in genome_data.data["chromosomes"]}

Leaving all that unused stuff around makes it look like you set the mutation rate in _chromosomes, but that's not used?

@@ -45,6 +45,10 @@ class Genome:
:vartype assembly_name: str
:ivar assembly_accession: The ID of the genome assembly accession.
:vartype assembly_accession: str
:ivar assembly_source: The source of the genome assembly data.
:vartype assembly_source: str
:ivar assembly_build_version: The version of the genome assembly build.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:ivar assembly_build_version: The version of the genome assembly build.
:ivar assembly_build_version: The version of the genome assembly build, or `None`.

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good. Some newly extraneous code, and some clarification needed, I think?

@andrewkern andrewkern requested a review from petrelharp January 13, 2025 19:07
@andrewkern
Copy link
Member Author

okay @petrelharp I've made all the requested changes. The biggest change was to alter the maintenance.py script to be less redundant, but I think the older version made a more clear read. Take a look.

@andrewkern
Copy link
Member Author

also codecov is complaining about 99.72% coverage...

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more things, I think...

@@ -3,22 +3,24 @@
"assembly_accession": "GCA_003254395.2",
"assembly_name": "Amel_HAv3.1",
"chromosomes": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is why Codecov is complaining (no more synonyms, so this bit of code in ApiMel/species.py is not needed)
Screenshot from 2025-01-13 21-54-18

# commented this out for now as this case
# is not occurring and it's messing with code coverage
# if syn not in chrom.synonyms:
# chrom.synonyms.append(syn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry - I misled you here. Note that two lines down, this function is called on chr_synonyms_dict, so this function is used. And, codecov is telling us not that this line isn't called it's that the other branch of the if is never evaluated - i.e., that it's always true that syn not in chrom.synonyms. So lines 135-145 are equivalent to

for chrom in _genome.chromosomes:
    chrom.synonyms.append(chr_synonyms_dict[chrom.id])

I think what's going on here is that usually we get synonyms from genome_data.py (ie ensembl); but here we wanted to add some synonyms manually; however some of these overlapped with ensembl.

And, deleting these lines will remove these synonyms (ie, we won't use chr_synonyms_dict any more)! So, we shouldn't do that.

How about this: delete add_if_unique and change lines 144-145 to

for chrom in _genome.chromosomes:
    syns = list(set(chrom.synonyms + chr_synonyms_dict[chrom.id])
    chrom.synonyms = syns

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay made this change. it passed the ApiMel test. Let's see what codecov says

maintenance/main.py Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor

Just clarification. If you agree with my change, accept it and then go ahead and squash & merge.

Co-authored-by: Peter Ralph <[email protected]>
@andrewkern andrewkern merged commit abde1c4 into popsim-consortium:main Jan 17, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants