Identify influenza subtypes using nextclade sort for all influenza subtypes to not have to rely on fasta descriptions #415

Open
1 of 2 tasks
Tracked by #305
anna-parker opened this issue Dec 11, 2024 · 1 comment

Comments

@anna-parker
Contributor

anna-parker commented Dec 11, 2024

Background:

Currently, H5N1 sequences are identified using the FASTA record description. NCBI annotates this for the majority of sequences, but it is not always available. Additionally, identifying the subtype with nextclade sort means that the same sequences will be found in both the general Influenza A database and the specific subtype databases. Nextclade sort has been shown to identify more subtypes than the NCBI annotations.
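As a rough illustration of the idea above, the sketch below picks one dataset label per sequence from a list of sort hits. The `SortHit` record, its field names, and the example dataset labels are assumptions made for this example and are not the actual nextclade sort output format. It also assumes that a higher score means a better match.

```python
from typing import Iterable, NamedTuple


class SortHit(NamedTuple):
    """One candidate assignment for a sequence (hypothetical record shape)."""
    seq_name: str   # FASTA record identifier
    dataset: str    # hypothetical dataset label, e.g. "h5/ha"
    score: float    # similarity score; higher is assumed to be better


def best_dataset_per_sequence(hits: Iterable[SortHit]) -> dict[str, str]:
    """Return the highest-scoring dataset label for each sequence name."""
    best: dict[str, SortHit] = {}
    for hit in hits:
        current = best.get(hit.seq_name)
        # Keep the first hit seen, or replace it when a better score appears.
        if current is None or hit.score > current.score:
            best[hit.seq_name] = hit
    return {name: hit.dataset for name, hit in best.items()}
```

Sequences with no hits simply do not appear in the result, so callers can still fall back to the FASTA description for them if that is desired.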

Pre-work

I tried this in loculus-project/loculus#3407. A summary of my analysis findings is available at https://github.com/GenSpectrum/dashboards/blob/add_more_influenza/H5N1_annotation_assessment.ipynb. The work is currently blocked by:

  • Insufficient grouping capabilities: subtype annotation relies on segments being grouped so that each group contains an HA and an NA segment. We currently cannot group enough sequences because the required metadata is missing. The grouping information is available in the assemblies, but I am currently unable to download them due to a bug: Unable to download the Influenza A taxon assembly ncbi/datasets#432.
  • Alignment/preprocessing issues: not all segments identified by nextclade sort align to the final subtype reference. This is especially true for segment8, which is highly diverged. The preprocessing pipeline fails a sample if any of its segments does not align, which means that multiple samples will not be uploaded. We need to either:
    1. Add an additional alignment step after nextclade sort (ideally only against the segment/subtype that has been identified). This would remove all segments that do not align from the submission, but it also removes data.
    2. Allow some sequences to not align in preprocessing.
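The grouping requirement and the two options above can be sketched as follows. This is only an illustration, not the pipeline's actual code. The `Segment` class, the segment labels, and the `required` parameter are hypothetical. The sketch also assumes a group needs exactly one typed HA and one typed NA segment, and it reads option 2 as "only certain segments must align", which is one possible interpretation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Segment:
    name: str                    # hypothetical label, e.g. "HA", "NA", "NS"
    subtype_part: Optional[str]  # e.g. "H5" for HA, "N1" for NA, else None
    aligned: bool                # whether the segment aligned to its reference


def group_subtype(segments: Sequence[Segment]) -> Optional[str]:
    """Combine the HA and NA calls of one group, e.g. "H5" + "N1" -> "H5N1"."""
    ha = [s.subtype_part for s in segments if s.name == "HA" and s.subtype_part]
    na = [s.subtype_part for s in segments if s.name == "NA" and s.subtype_part]
    # Assumption: without exactly one HA and one NA call, no subtype is assigned.
    if len(ha) != 1 or len(na) != 1:
        return None
    return ha[0] + na[0]


def drop_unaligned(segments: Sequence[Segment]) -> list[Segment]:
    """Option 1: keep only segments that align (the others are lost)."""
    return [s for s in segments if s.aligned]


def sample_passes(segments: Sequence[Segment],
                  required: Sequence[str] = ("HA", "NA")) -> bool:
    """Option 2: accept a sample as long as the required segments align."""
    aligned_names = {s.name for s in segments if s.aligned}
    return all(name in aligned_names for name in required)
```

With a group made of an aligned H5 HA segment, an aligned N1 NA segment, and an unaligned NS segment, `group_subtype` yields `"H5N1"`, `drop_unaligned` discards the NS segment, and `sample_passes` keeps the whole sample. That is the trade-off between the two options: the first loses data, the second keeps unaligned sequences in the submission.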
@anna-parker anna-parker changed the title Identify influenza subtypes using nextclade sort for all influenza subtypes to not have to rely on fasta descriptions (often information is missing) Identify influenza subtypes using nextclade sort for all influenza subtypes to not have to rely on fasta descriptions Dec 11, 2024
@anna-parker
Contributor Author

Task 1 is now completed here: https://github.com/anna-parker/influenza-a-groupings
