Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding assembly_name attribute #1650

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

silastittes
Copy link
Contributor

@silastittes silastittes commented Jan 9, 2025

Here's a draft solution for #1252

@grahamgower made a patch dealing with genetic map intervals that don't match that of the assembly. All described further in #719

I think the concern remains that there can be mismatches between assemblies and intervals along the assemblies (annotations and genetic maps).

I've started the process of adding the assembly names to the Annotation and GeneticMap classes, which would facilitate checking that the resources all belong on the same assembly.

It's hard to write explicit tests for this using the interval information, especially for annotations, because the tests likely won't fail if the annotation intervals are still within the bounds of the (incorrect) assembly. I think this largely comes down to humans being careful to check that the annotation/genetic maps match the assembly. Adding the attribute really just serves as a helpful reminder during the QC process to verify this, and do a liftover if it's wrong.

If this is the direction we want to go a few more things need to be done.

  • Add assembly_name to stubs for generating new species in the catalogue
  • Finish populating the genetic maps and annotations (I think all annotations are done in this PR, but should verify)
  • Flesh out documentation about making sure the assemblies match between resources (I believe there's some stuff about this already concerning the liftover process).
  • A test or two checking that assembly name matches for Annotation, GeneticMap, and Genome.
    OR
  • Add some warnings when the assembly_name doesn't match between the components.

In this draft PR we can write simple checks like:

species.get_contig("chr20", genetic_map = "HapMapII_GRCh38").genetic_map.assembly_name == species.genome.assembly_name

That could be added as a unit test.

@petrelharp
Copy link
Contributor

Okay - we need @andrewkern's input here, I think. Observations:

  1. An annotation is associated with a given assembly; we ought to be able to know which one.
  2. Same for a genetic map.
  3. Similarly, we ought to be able to know that the genetic map and annotation being used together are with reference to the same assembly!
  4. So, maybe assembly should be a property of Genome (is it already?)?
  5. Right now @andrewkern is putting in a property of Genome that gives (for instance) ensembl build number; this is similar to but not the same as assembly. It seems to me that assembly is what we actually want, not ensembl build?

So: should we have assembly as a property of Genome? Then check that annotations and genetic maps match that assembly? To check, we don't need to have assembly as a property of Annotation and GeneticMap, but we could. Should we?

I feel like I'm over-complicating things, maybe?

@andrewkern
Copy link
Member

so in looking this over, do we need the assembly in the

Okay - we need @andrewkern's input here, I think. Observations:

  1. An annotation is associated with a given assembly; we ought to be able to know which one.
  2. Same for a genetic map.
  3. Similarly, we ought to be able to know that the genetic map and annotation being used together are with reference to the same assembly!
  4. So, maybe assembly should be a property of Genome (is it already?)?
  5. Right now @andrewkern is putting in a property of Genome that gives (for instance) ensembl build number; this is similar to but not the same as assembly. It seems to me that assembly is what we actually want, not ensembl build?

So: should we have assembly as a property of Genome? Then check that annotations and genetic maps match that assembly? To check, we don't need to have assembly as a property of Annotation and GeneticMap, but we could. Should we?

I feel like I'm over-complicating things, maybe?

so the assembly_name is already a property of Genome like here.

the GeneticMap files are named based on their assemblies, as are the Annotations, so maybe this is redundant?

Having said that, I do like the idea of a test that could yield a warning if the user chose to say use a genetic map from an assembly that was mismatched with the genome, but maybe we should tag this as a TODO and put it off until after the release we are working on?

@petrelharp
Copy link
Contributor

petrelharp commented Jan 10, 2025

I think I'm in the same place as @andrewkern here - this would be a nice thing to do, but let's put this off and not do it right now. There's a bunch of other things to think through (and implement) here, and we'd like to do that properly, but it's not urgent. I think the only things that we do want to do now are:

  1. make sure the assembly is mentioned in the description of the PhoSin annotation (as with the other annotations)
  2. make the assembly property of Genome visible in the catalog

I've opened an issue for those. #1651

@petrelharp
Copy link
Contributor

So - if you agree, let's close this, @silastittes? And, I'll close #1252 as well, but if you'd like to write up a summary in another issue for the future (which can refer to this PR), that'd be helpful?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants