-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changes to maintenance script / ensembl push #1646
changes to maintenance script / ensembl push #1646
Conversation
an additional thought here-- I'm not doing any checking to see if chromosome lengths have changed. We may want to do that as well |
Okay, over in #1521 we said What about the other species, for which maintenance failed - what do you think we should do there? Hm - nothing of consequence seems to be changed here, other than adding two Interestingly, I see that we are maybe already using different ensembl releases:
|
And I guess we need to also change the tests:
I'll go have a look at the tests now an update them in the PR |
Yes I'd like to add the ensembl_build version as a property of the genome_data -- I can add that to this PR I reckon
If we include this property, nothing will need to done, other than maintain the current build version I think.
that's correct.
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1646 +/- ##
=======================================
Coverage 99.84% 99.84%
=======================================
Files 133 132 -1
Lines 4569 4584 +15
Branches 472 467 -5
=======================================
+ Hits 4562 4577 +15
Misses 3 3
Partials 4 4 ☔ View full report in Codecov by Sentry. |
a simple way forward for including species specific build info would look like this:
including a |
We should probably talk about the API, rather than how it gets in there via the helper function that parses But, assuming that you are suggesting that we add those slots also to the |
Actually, how about an attribute Or, just two attributes, Also, instead of |
actually i was suggesting that it would go not in the |
This is a good suggestion-- |
The |
okay @petrelharp -- added the attributes we talked about the |
maintenance/main.py
Outdated
@@ -177,6 +183,10 @@ def black_format(code): | |||
|
|||
|
|||
def ensembl_stdpopsim_id(ensembl_id): | |||
if ensembl_id == "canis_lupus_familiaris": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps insert a comment explaining what this is doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, explaining why this is necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, actually, I don't understand why it's necessary. I see below that now species.ensembl_id == "canis_lupus_familiaris"
, so where does ensembl_id
equal "canis_familiaris"
? Just missing something here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hah this is because of the transition that happened when running the maintenance script! now i bet i can take it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay - could you either take this out, or say in the comment exactly what needs to happen to take it out? (It says "once we have moved to the new names" - but: moved what names? where? I am confused. =)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a larger comment now explaining the context that says:
# below is do deal with name changes in Ensembl
# TODO: remove this once we have moved to the new names
# this was added as a temporary fix to allow the release to be
# updated to 113, where the names of these species were changed
# in the future we might keep this code block, but comment it out
# to show others how maintenance updates were performed in the case
# where the species name changed in Ensembl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't what I understood from talking to you - I thought that (a) we could remove this code right now if we wanted, since this if
clause will never get triggered; but (b) we were going to leave the code in as an example of how to do this. Is that right?
maintenance/main.py
Outdated
data = self.ensembl_client.get_genome_data(ensembl_id) | ||
|
||
# Preserve existing assembly source or default to "ensembl" | ||
if genome_data_path.exists(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, this duplicates the code above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(and, it seems like it's doing the same thing, almost, as the code above?)
maintenance/main.py
Outdated
data["assembly_build_version"] = None | ||
|
||
# Check if existing genome data exists and compare chromosome names | ||
if genome_data_path.exists(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is doing something different to the code above but duplicates the "open genome_data_path if it exists" code; just do that once?
maintenance/main.py
Outdated
for species_id, eid in embl_ids: | ||
try: | ||
result = writer.write_genome_data(eid) | ||
if result is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how that function could return None
?
maintenance/main.py
Outdated
@@ -391,7 +520,7 @@ def add_species(ensembl_id, force): | |||
""" | |||
writer = DataWriter() | |||
writer.add_species(ensembl_id.lower(), force=force) | |||
writer.write_ensembl_release() | |||
# writer.write_ensembl_release() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove?
stdpopsim/catalog/BosTau/species.py
Outdated
"MT": 1, | ||
} | ||
_ploidy = {str(i): _species_ploidy for i in range(1, 30)} | ||
_ploidy.update({"X": _species_ploidy, "MT": 1}) | ||
|
||
_chromosomes = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_chromosomes
is no longer used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that's right. forgot to delete these lines. will do
stdpopsim/catalog/CanFam/species.py
Outdated
_mutation_rate = 4e-9 | ||
_mutation_rate_data = {str(i): _mutation_rate for i in range(1, 39)} | ||
_mutation_rate_data["MT"] = ( | ||
_mutation_rate # note this is likely incorrect but consistent with current setup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could have this note on nearly all the species, right? no need here?
@@ -46,11 +45,24 @@ | |||
) | |||
) | |||
|
|||
_genome = stdpopsim.Genome( | |||
chromosomes=_chromosomes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_chromosomes
no longer used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case it's used for chrom ids
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay - in that case delete _chromosomes
and do like this:
_ploidy_data = {str(i): _species_ploidy for i in genome_data.data["chromosomes"]}
Leaving all that unused stuff around makes it look like you set the mutation rate in _chromosomes
, but that's not used?
stdpopsim/genomes.py
Outdated
@@ -45,6 +45,10 @@ class Genome: | |||
:vartype assembly_name: str | |||
:ivar assembly_accession: The ID of the genome assembly accession. | |||
:vartype assembly_accession: str | |||
:ivar assembly_source: The source of the genome assembly data. | |||
:vartype assembly_source: str | |||
:ivar assembly_build_version: The version of the genome assembly build. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:ivar assembly_build_version: The version of the genome assembly build. | |
:ivar assembly_build_version: The version of the genome assembly build, or `None`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good. Some newly extraneous code, and some clarification needed, I think?
Co-authored-by: Peter Ralph <[email protected]>
okay @petrelharp I've made all the requested changes. The biggest change was to alter the |
also codecov is complaining about 99.72% coverage... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two more things, I think...
@@ -3,22 +3,24 @@ | |||
"assembly_accession": "GCA_003254395.2", | |||
"assembly_name": "Amel_HAv3.1", | |||
"chromosomes": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stdpopsim/catalog/ApiMel/species.py
Outdated
# commented this out for now as this case | ||
# is not occurring and it's messing with code coverage | ||
# if syn not in chrom.synonyms: | ||
# chrom.synonyms.append(syn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry - I misled you here. Note that two lines down, this function is called on chr_synonyms_dict
, so this function is used. And, codecov is telling us not that this line isn't called it's that the other branch of the if
is never evaluated - i.e., that it's always true that syn not in chrom.synonyms
. So lines 135-145 are equivalent to
for chrom in _genome.chromosomes:
chrom.synonyms.append(chr_synonyms_dict[chrom.id])
I think what's going on here is that usually we get synonyms from genome_data.py
(ie ensembl); but here we wanted to add some synonyms manually; however some of these overlapped with ensembl.
And, deleting these lines will remove these synonyms (ie, we won't use chr_synonyms_dict
any more)! So, we shouldn't do that.
How about this: delete add_if_unique
and change lines 144-145 to
for chrom in _genome.chromosomes:
syns = list(set(chrom.synonyms + chr_synonyms_dict[chrom.id])
chrom.synonyms = syns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay made this change. it passed the ApiMel test. Let's see what codecov says
Just clarification. If you agree with my change, accept it and then go ahead and squash & merge. |
Co-authored-by: Peter Ralph <[email protected]>
This PR addresses (closes?) #1521.
think i've finally cleaned this up a bit. this PR overhauls the
update-genome-data
portion of the maintenance script such that:the final report looks like this:
as only some of the species are being updated here, we should arguably move from writing out
ensembl_info.py
file which has release data, to the release being held in a species specific slot in eachgenome_data.py
file of the catalog. I'm happy to make that edit if people agree.