What to do about shifting ensembl build? #1521

Open
petrelharp opened this issue Oct 10, 2023 · 11 comments

@petrelharp
Contributor

Over in #1517, @ChristianHuber has got a PR for Gorilla (yay!). However, it's failing because the autogenerated file stdpopsim/catalog/ensembl_info.py has been changed:

- release = 103
+ release = 110

and in tests/test_ensembl.py, we have

# Make sure we don't update the release without realising it.
def test_version():
    release = stdpopsim.catalog.ensembl_info.release
    assert release == 103

So, looks like ensembl has moved along, and we need to either

  1. update all our genomes to build 110
  2. make ensembl build a property of Genome/Annotation/Species so that different species are on different ensembl builds
  3. decide it doesn't matter, or
  4. something else?

I suspect the reason for this check is so that we can cite exactly where the info is coming from, but I am not on top of all the ramifications.

@jeromekelleher
Member

It's probably worth trying the build update to 110; like as not, nothing will have changed for most species. The diff it generates will show you whether there's anything to worry about.

The main reason for including the check was to make sure stuff doesn't change without people realising.

@petrelharp
Contributor Author

What do you mean "trying the build update"? Do you mean just run maintenance <species> for each species, then compare the diffs? (good idea!)

@petrelharp
Contributor Author

That won't hit the annotations, though...

@reedacartwright

I don't know if it helps, but Ensembl has archives. 103 is found here: http://feb2021.archive.ensembl.org/index.html Would it be possible to redo the Gorilla PR for 103?

@jeromekelleher
Member

> What do you mean "trying the build update"? Do you mean just run maintenance <species> for each species, then compare the diffs? (good idea!)

Yeah, that's how it was intended: occasionally update to the new build, and let git tell you whether anything important has changed. Running update-genome-data without any arguments should do it for all species in the catalog.
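
As a rough illustration of the "let git tell you" step, here is a minimal sketch that summarises which files changed after an update. It assumes you have already run the update and captured the output of `git diff --numstat` as text. The function name `summarize_numstat` is hypothetical and not part of the maintenance code, and the sample line counts and the `CanFam/genome_data.py` path are made up for illustration.

```python
def summarize_numstat(numstat_text):
    """Parse `git diff --numstat` output into {path: (added, deleted)}.

    Each line is tab-separated: added count, deleted count, path.
    Git reports binary files with "-" for both counts; those map to
    (None, None) here.
    """
    changes = {}
    for line in numstat_text.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        if added == "-" or deleted == "-":
            changes[path] = (None, None)
        else:
            changes[path] = (int(added), int(deleted))
    return changes


# Illustrative input only; these are not real counts from the update.
sample = (
    "1\t1\tstdpopsim/catalog/ensembl_info.py\n"
    "12\t12\tstdpopsim/catalog/CanFam/genome_data.py\n"
)
for path, (added, deleted) in summarize_numstat(sample).items():
    print(f"{path}: +{added} -{deleted}")
```

A species whose files do not appear in the summary was unaffected by the new build, so only the listed files need a closer look.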

@petrelharp
Contributor Author

@andrewkern will give updating to 104 a shot

@andrewkern
Member

ugh running this returns an HTTP error because Ensembl has changed the designation of canis_familiaris to canis_lupus_familiaris 😵‍💫

(annotation) adkern@sesame:~/popSim/stdpopsim$ python -m maintenance update-genome-data
2023-10-25 10:50:34,077 [1048815] INFO     maint: Writing genome data for AedAeg aedes_aegypti_lvpagwg
2023-10-25 10:50:34,077 [1048815] INFO     ensembl: making request to http://rest.ensembl.org/info/assembly/aedes_aegypti_lvpagwg?synonyms=1
2023-10-25 10:50:38,973 [1048815] INFO     maint: Writing genome data for AnaPla anas_platyrhynchos
2023-10-25 10:50:38,973 [1048815] INFO     ensembl: making request to http://rest.ensembl.org/info/assembly/anas_platyrhynchos?synonyms=1
2023-10-25 10:50:46,742 [1048815] INFO     maint: Writing genome data for AnoCar anolis_carolinensis
2023-10-25 10:50:46,743 [1048815] INFO     ensembl: making request to http://rest.ensembl.org/info/assembly/anolis_carolinensis?synonyms=1
2023-10-25 10:51:04,287 [1048815] INFO     maint: Writing genome data for AnoGam anopheles_gambiae
2023-10-25 10:51:04,287 [1048815] INFO     ensembl: making request to http://rest.ensembl.org/info/assembly/anopheles_gambiae?synonyms=1
2023-10-25 10:51:04,939 [1048815] INFO     maint: Writing genome data for ApiMel apis_mellifera
2023-10-25 10:51:04,939 [1048815] INFO     ensembl: making request to http://rest.ensembl.org/info/assembly/apis_mellifera?synonyms=1
2023-10-25 10:51:05,939 [1048815] INFO     maint: Writing genome data for AraTha arabidopsis_thaliana
2023-10-25 10:51:05,939 [1048815] INFO     ensembl: making request to http://rest.ensembl.org/info/assembly/arabidopsis_thaliana?synonyms=1
2023-10-25 10:51:06,323 [1048815] INFO     maint: Writing genome data for BosTau bos_taurus
2023-10-25 10:51:06,323 [1048815] INFO     ensembl: making request to http://rest.ensembl.org/info/assembly/bos_taurus?synonyms=1
2023-10-25 10:51:11,017 [1048815] INFO     maint: Writing genome data for CaeEle caenorhabditis_elegans
2023-10-25 10:51:11,018 [1048815] INFO     ensembl: making request to http://rest.ensembl.org/info/assembly/caenorhabditis_elegans?synonyms=1
2023-10-25 10:51:11,622 [1048815] INFO     maint: Writing genome data for CanFam canis_familiaris
2023-10-25 10:51:11,622 [1048815] INFO     ensembl: making request to http://rest.ensembl.org/info/assembly/canis_familiaris?synonyms=1
2023-10-25 10:51:11,954 [1048815] CRITICAL root: Traceback (most recent call last):
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/adkern/popSim/stdpopsim/maintenance/__main__.py", line 6, in <module>
    main.main()
  File "/home/adkern/popSim/stdpopsim/maintenance/main.py", line 404, in main
    cli()
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/adkern/popSim/stdpopsim/maintenance/main.py", line 362, in update_genome_data
    writer.write_genome_data(eid)
  File "/home/adkern/popSim/stdpopsim/maintenance/main.py", line 282, in write_genome_data
    data = self.ensembl_client.get_genome_data(ensembl_id)
  File "/home/adkern/popSim/stdpopsim/maintenance/ensembl.py", line 99, in get_genome_data
    output = self.get(
  File "/home/adkern/popSim/stdpopsim/maintenance/ensembl.py", line 68, in get
    response = urllib.request.urlopen(request)
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/home/adkern/miniconda3/envs/annotation/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

I will submit a patch along with the update
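
For finding every renamed species in one pass rather than stopping at the first failure, a minimal sketch follows. It reuses the REST URL shape from the log above and catches `urllib.error.HTTPError` per species. The function name `find_failing_ids` and the injectable `fetch` parameter are assumptions for illustration, and this is not how `maintenance/ensembl.py` is actually written.

```python
import urllib.error
import urllib.request


def find_failing_ids(ensembl_ids, fetch=urllib.request.urlopen):
    """Return {ensembl_id: HTTP status code} for IDs the server rejects.

    `fetch` defaults to urllib.request.urlopen; it is a parameter only so
    that the function can be exercised without network access.
    """
    failures = {}
    for eid in ensembl_ids:
        # Same endpoint as in the log output above.
        url = f"http://rest.ensembl.org/info/assembly/{eid}?synonyms=1"
        request = urllib.request.Request(
            url, headers={"Content-Type": "application/json"}
        )
        try:
            fetch(request).close()
        except urllib.error.HTTPError as err:
            # Record the failure and keep going with the next species.
            failures[eid] = err.code
    return failures
```

Calling `find_failing_ids(["bos_taurus", "canis_familiaris"])` against the live server would report only the IDs that Ensembl no longer recognises, which gives the list of names that need patching.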

@petrelharp
Contributor Author

OMG

@andrewkern
Member

...oh boy a whole mess of these have moved...

@andrewkern
Member

okay i've submitted a PR at #1536

@igronau
Contributor

igronau commented Nov 21, 2023

Looks like many of the builds changed. In most cases it's just chromosome names and synonyms. In these cases, we'll need to make sure that the chromosome names we use in the chrom-averaged recombination rate and mutation rate lists work with the new names. Looks like they will, but I'm not 100% sure.

A more serious problem occurs when the chromosome lengths actually change, as happened e.g. in CanFam. New chromosome lengths can conflict with existing recombination maps, so we definitely don't want to automatically update them.
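
To separate the harmless cases from the serious ones, something like the sketch below could flag chromosomes whose lengths differ between two builds. It assumes each build's chromosomes are available as a plain `{name: length}` dict; the function name `classify_chromosome_changes` is hypothetical, and the lengths in the example are invented, not real CanFam values.

```python
def classify_chromosome_changes(old_lengths, new_lengths):
    """Compare {chromosome name: length} dicts from two builds.

    Returns a dict with:
      "length_changed": {name: (old, new)} for names present in both builds
      "removed": names only in the old build (possibly renamed)
      "added": names only in the new build (possibly renamed)
    """
    shared = old_lengths.keys() & new_lengths.keys()
    return {
        "length_changed": {
            name: (old_lengths[name], new_lengths[name])
            for name in sorted(shared)
            if old_lengths[name] != new_lengths[name]
        },
        "removed": sorted(old_lengths.keys() - new_lengths.keys()),
        "added": sorted(new_lengths.keys() - old_lengths.keys()),
    }


# Invented numbers, for illustration only.
old = {"1": 1000, "2": 800, "X": 500}
new = {"1": 1010, "2": 800, "chrX": 500}
report = classify_chromosome_changes(old, new)
print(report["length_changed"])
print(report["removed"], report["added"])
```

Any species with a non-empty `length_changed` entry is one where an existing recombination map may no longer fit, so its update would need to be reviewed by hand rather than merged automatically.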


5 participants