changes to maintenance script / ensembl push #1646

andrewkern · 2025-01-07T22:48:16Z

This PR addresses (closes?) #1521.

think i've finally cleaned this up a bit. this PR overhauls the update-genome-data portion of the maintenance script such that:

species whose genomes were manually created are skipped
species where the ensembl Rest API returns mismatching chromosome names with the current release are skipped
detailed logging warnings of these issues are created, along with a final report of which species were skipped in the update and why.

the final report looks like this:

=== Species Update Summary ===
2025-01-07 14:34:46,744 [91493] WARNING  maint: The following species were not updated:
2025-01-07 14:34:46,744 [91493] WARNING  maint:   - AnoCar (Ensembl ID: anolis_carolinensis):
2025-01-07 14:34:46,744 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,744 [91493] WARNING  maint:     Existing chromosomes: ['1', '2', '3', '4', '5', '6', 'LGa', 'LGb', 'LGc', 'LGd', 'LGf', 'LGg', 'LGh', 'MT']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '2', '3', '4', '5', '6', 'a', 'b', 'c', 'd', 'f', 'g', 'h']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - CanFam (Ensembl ID: canis_lupus_familiaris):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '4', '5', '6', '7', '8', '9', 'MT', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '4', '5', '6', '7', '8', '9', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - DroSec (Ensembl ID: drosophila_sechellia): Manually created genome data file
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - GasAcu (Ensembl ID: gasterosteus_aculeatus):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '3', '4', '5', '6', '7', '8', '9', 'MT', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['I', 'II', 'III', 'IV', 'IX', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XIX', 'XV', 'XVI', 'XVII', 'XVIII', 'XX', 'XXI', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - HelMel (Ensembl ID: heliconius_melpomene):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '3', '4', '5', '6', '7', '8', '9']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: []
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - PanTro (Ensembl ID: pan_troglodytes):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - PonAbe (Ensembl ID: pongo_abelii):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - StrAga (Ensembl ID: streptococcus_agalactiae_GCA_001017915):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1']
2025-01-07 14:34:46,746 [91493] WARNING  maint:     New chromosomes: []

as only some of the species are being updated here, we should arguably move from writing out ensembl_info.py file which has release data, to the release being held in a species specific slot in each genome_data.py file of the catalog. I'm happy to make that edit if people agree.

andrewkern · 2025-01-07T22:50:00Z

an additional thought here-- I'm not doing any checking to see if chromosome lengths have changed. We may want to do that as well

petrelharp · 2025-01-07T23:20:23Z

Okay, over in #1521 we said

and IIUC what you're doing here is step (1), as well as updating the maintenance script to be able to do this reasonably? But, this is not actually updating all the genomes, right? Seems like you should be recording here which genomes have been updated - at least in a comment, so we can next easily come along and do step (2)?

What about the other species, for which maintenance failed - what do you think we should do there?

Hm - nothing of consequence seems to be changed here, other than adding two Y chromosomes, right? All the genome lengths are the same, and we don't use assembly_name for anything? Is the only thing of consequence that could change be the chromosome lengths? (besides the chromosome names)

Interestingly, I see that we are maybe already using different ensembl releases:

$ grep ensembl.org stdpopsim/**/*.py
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"

petrelharp · 2025-01-07T23:21:27Z

And I guess we need to also change the tests:


=================================== FAILURES ===================================
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_CanFam.TestSpeciesData object at 0x7f34c0c12110>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "canis_familiaris"
E       AssertionError: assert 'canis_lupus_familiaris' == 'canis_familiaris'
E         
E         - canis_familiaris
E         + canis_lupus_familiaris
E         ?     ++++++

tests/test_CanFam.py:12: AssertionError
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_GasAcu.TestSpeciesData object at 0x7f34c0ba1780>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "9307941"
E       AssertionError: assert 'gasterosteus_aculeatus' == '9307941'
E         
E         - 9307941
E         + gasterosteus_aculeatus

tests/test_GasAcu.py:12: AssertionError
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_StrAga.TestSpeciesData object at 0x7f34c09b0d30>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "NA"
E       AssertionError: assert 'streptococcu...GCA_001017915' == 'NA'
E         
E         - NA
E         + streptococcus_agalactiae_GCA_001017915

tests/test_StrAga.py:12: AssertionError
_________________________________ test_version _________________________________

    def test_version():
        release = stdpopsim.catalog.ensembl_info.release
>       assert release == 103
E       assert 113 == 103

I'll go have a look at the tests now an update them in the PR

andrewkern · 2025-01-08T18:46:13Z

Okay, over in #1521 we said and IIUC what you're doing here is step (1), as well as updating the maintenance script to be able to do this reasonably? But, this is not actually updating all the genomes, right? Seems like you should be recording here which genomes have been updated - at least in a comment, so we can next easily come along and do step (2)?

Yes I'd like to add the ensembl_build version as a property of the genome_data -- I can add that to this PR I reckon

What about the other species, for which maintenance failed - what do you think we should do there?

If we include this property, nothing will need to done, other than maintain the current build version I think.

Hm - nothing of consequence seems to be changed here, other than adding two Y chromosomes, right? All the genome lengths are the same, and we don't use assembly_name for anything? Is the only thing of consequence that could change be the chromosome lengths? (besides the chromosome names)

that's correct.

Interestingly, I see that we are maybe already using different ensembl releases:

$ grep ensembl.org stdpopsim/**/*.py
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"

codecov · 2025-01-08T18:59:49Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.84%. Comparing base (a344665) to head (5753a8a).
Report is 3 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1646   +/-   ##
=======================================
  Coverage   99.84%   99.84%           
=======================================
  Files         133      132    -1     
  Lines        4569     4584   +15     
  Branches      472      467    -5     
=======================================
+ Hits         4562     4577   +15     
  Misses          3        3           
  Partials        4        4

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

andrewkern · 2025-01-08T19:05:01Z

a simple way forward for including species specific build info would look like this:

the data dict in each species genome_data.py would be extended with two slots
- build_type which would take the values manual, ensembl, or NCBI
- ensembl_build which would take the values NA or <ensembl_build_version_id, the later which would be auto-populated by the maintenance script

including a build_type property in the dict would also allow for an easy check as to which genomes the maintenance script should attempt to upload

petrelharp · 2025-01-08T22:48:38Z

We should probably talk about the API, rather than how it gets in there via the helper function that parses genome_data.py?

But, assuming that you are suggesting that we add those slots also to the Genome object, that sounds good.

petrelharp · 2025-01-08T22:54:54Z

Actually, how about an attribute build, which is then a dict, containing type and optionally other things, like version? Or a NamedTuple or something like that? This is because we'd like to not start adding lots of slots for, say ncbi_build_version, etc etc.

Or, just two attributes, build_type and build_version?

Also, instead of NA I supposed you mean None?

andrewkern · 2025-01-08T23:46:32Z

We should probably talk about the API, rather than how it gets in there via the helper function that parses genome_data.py?

But, assuming that you are suggesting that we add those slots also to the Genome object, that sounds good.

actually i was suggesting that it would go not in the Genome object but in genome_data.data like here, but yes that would eventually get propagated to Genome

andrewkern · 2025-01-08T23:47:21Z

Actually, how about an attribute build, which is then a dict, containing type and optionally other things, like version? Or a NamedTuple or something like that? This is because we'd like to not start adding lots of slots for, say ncbi_build_version, etc etc.

Or, just two attributes, build_type and build_version?

Also, instead of NA I supposed you mean None?

This is a good suggestion-- genome_data.data could hold this build dict. As yes I mean None

petrelharp · 2025-01-09T14:35:40Z

The data dictionary in genome_data.py is not visible to end users; it is back-end for how we set up the Genome and Species objects that are user-visible, and are what we should be discussing. Probably everything in that dict is then mirrored as an attribute in Genome, so the distinction is moot, though?

andrewkern · 2025-01-10T00:01:29Z

okay @petrelharp -- added the attributes we talked about the Genome API. This led to lots of downstream changes as we discussed. Sorry for the large PR

petrelharp · 2025-01-13T02:46:21Z

maintenance/main.py

@@ -177,6 +183,10 @@ def black_format(code):


 def ensembl_stdpopsim_id(ensembl_id):
+    if ensembl_id == "canis_lupus_familiaris":


perhaps insert a comment explaining what this is doing?

I mean, explaining why this is necessary

And, actually, I don't understand why it's necessary. I see below that now species.ensembl_id == "canis_lupus_familiaris", so where does ensembl_id equal "canis_familiaris"? Just missing something here.

hah this is because of the transition that happened when running the maintenance script! now i bet i can take it out

Okay - could you either take this out, or say in the comment exactly what needs to happen to take it out? (It says "once we have moved to the new names" - but: moved what names? where? I am confused. =)

added a larger comment now explaining the context that says:

# below is do deal with name changes in Ensembl # TODO: remove this once we have moved to the new names # this was added as a temporary fix to allow the release to be # updated to 113, where the names of these species were changed # in the future we might keep this code block, but comment it out # to show others how maintenance updates were performed in the case # where the species name changed in Ensembl

This isn't what I understood from talking to you - I thought that (a) we could remove this code right now if we wanted, since this if clause will never get triggered; but (b) we were going to leave the code in as an example of how to do this. Is that right?

maintenance/main.py

petrelharp · 2025-01-13T14:03:23Z

maintenance/main.py

        data = self.ensembl_client.get_genome_data(ensembl_id)
+
+        # Preserve existing assembly source or default to "ensembl"
+        if genome_data_path.exists():


wait, this duplicates the code above?

(and, it seems like it's doing the same thing, almost, as the code above?)

maintenance/main.py

petrelharp · 2025-01-13T14:09:03Z

maintenance/main.py

+            data["assembly_build_version"] = None
+
+        # Check if existing genome data exists and compare chromosome names
+        if genome_data_path.exists():


This is doing something different to the code above but duplicates the "open genome_data_path if it exists" code; just do that once?

maintenance/main.py

petrelharp · 2025-01-13T14:14:15Z

maintenance/main.py

+    for species_id, eid in embl_ids:
+        try:
+            result = writer.write_genome_data(eid)
+            if result is not None:


I don't see how that function could return None?

petrelharp · 2025-01-13T14:18:01Z

maintenance/main.py

@@ -391,7 +520,7 @@ def add_species(ensembl_id, force):
    """
    writer = DataWriter()
    writer.add_species(ensembl_id.lower(), force=force)
-    writer.write_ensembl_release()
+    # writer.write_ensembl_release()


stdpopsim/catalog/ApiMel/genome_data.py

petrelharp · 2025-01-13T14:33:12Z

stdpopsim/catalog/BosTau/species.py

-    "MT": 1,
-}
+_ploidy = {str(i): _species_ploidy for i in range(1, 30)}
+_ploidy.update({"X": _species_ploidy, "MT": 1})

 _chromosomes = []


_chromosomes is no longer used

yes that's right. forgot to delete these lines. will do

stdpopsim/catalog/CaeEle/species.py

petrelharp · 2025-01-13T14:35:15Z

stdpopsim/catalog/CanFam/species.py

+_mutation_rate = 4e-9
+_mutation_rate_data = {str(i): _mutation_rate for i in range(1, 39)}
+_mutation_rate_data["MT"] = (
+    _mutation_rate  # note this is likely incorrect but consistent with current setup


we could have this note on nearly all the species, right? no need here?

stdpopsim/catalog/CanFam/species.py

petrelharp · 2025-01-13T14:36:31Z

stdpopsim/catalog/EscCol/species.py

@@ -46,11 +45,24 @@
        )
    )

-_genome = stdpopsim.Genome(
-    chromosomes=_chromosomes,


_chromosomes no longer used?

in this case it's used for chrom ids

Okay - in that case delete _chromosomes and do like this:

_ploidy_data = {str(i): _species_ploidy for i in genome_data.data["chromosomes"]}

Leaving all that unused stuff around makes it look like you set the mutation rate in _chromosomes, but that's not used?

stdpopsim/genomes.py

petrelharp · 2025-01-13T14:38:24Z

stdpopsim/genomes.py

@@ -45,6 +45,10 @@ class Genome:
    :vartype assembly_name: str
    :ivar assembly_accession: The ID of the genome assembly accession.
    :vartype assembly_accession: str
+    :ivar assembly_source: The source of the genome assembly data.
+    :vartype assembly_source: str
+    :ivar assembly_build_version: The version of the genome assembly build.


Suggested change

:ivar assembly_build_version: The version of the genome assembly build.

:ivar assembly_build_version: The version of the genome assembly build, or `None`.

petrelharp

Generally looks good. Some newly extraneous code, and some clarification needed, I think?

Co-authored-by: Peter Ralph <[email protected]>

andrewkern · 2025-01-13T19:08:11Z

okay @petrelharp I've made all the requested changes. The biggest change was to alter the maintenance.py script to be less redundant, but I think the older version made a more clear read. Take a look.

andrewkern · 2025-01-13T19:08:48Z

also codecov is complaining about 99.72% coverage...

petrelharp

Two more things, I think...

petrelharp · 2025-01-14T05:55:05Z

stdpopsim/catalog/ApiMel/genome_data.py

@@ -3,22 +3,24 @@
    "assembly_accession": "GCA_003254395.2",
    "assembly_name": "Amel_HAv3.1",
    "chromosomes": {


Oh, this is why Codecov is complaining (no more synonyms, so this bit of code in ApiMel/species.py is not needed)

petrelharp · 2025-01-15T00:21:59Z

stdpopsim/catalog/ApiMel/species.py

+        # commented this out for now as this case
+        # is not occurring and it's messing with code coverage
+        # if syn not in chrom.synonyms:
+        #     chrom.synonyms.append(syn)


Ah, sorry - I misled you here. Note that two lines down, this function is called on chr_synonyms_dict, so this function is used. And, codecov is telling us not that this line isn't called it's that the other branch of the if is never evaluated - i.e., that it's always true that syn not in chrom.synonyms. So lines 135-145 are equivalent to

for chrom in _genome.chromosomes: chrom.synonyms.append(chr_synonyms_dict[chrom.id])

I think what's going on here is that usually we get synonyms from genome_data.py (ie ensembl); but here we wanted to add some synonyms manually; however some of these overlapped with ensembl.

And, deleting these lines will remove these synonyms (ie, we won't use chr_synonyms_dict any more)! So, we shouldn't do that.

How about this: delete add_if_unique and change lines 144-145 to

for chrom in _genome.chromosomes: syns = list(set(chrom.synonyms + chr_synonyms_dict[chrom.id]) chrom.synonyms = syns

okay made this change. it passed the ApiMel test. Let's see what codecov says

maintenance/main.py

petrelharp · 2025-01-16T04:33:21Z

Just clarification. If you agree with my change, accept it and then go ahead and squash & merge.

Co-authored-by: Peter Ralph <[email protected]>

changes to maintenance script / ensembl push

4e03aed

andrewkern requested review from petrelharp and nspope January 7, 2025 22:48

andrewkern mentioned this pull request Jan 8, 2025

Release 0.2.1 checklist #1565

Open

11 tasks

update ensembl tests

2fb9499

added assembly attributes to API; lots of associated changes

71751a0

added test template stubs for new attributes

1a9b881