Identify influenza subtypes using nextclade sort for all influenza subtypes to not have to rely on fasta descriptions #415

anna-parker · 2024-12-11T18:25:25Z

Background:

Currently, H5N1 sequences are identified using the FASTA record description. NCBI annotates the subtype there for a majority of sequences, but the annotation is not always available. Additionally, identifying subtypes with nextclade sort means that the same sequences will be found in the general Influenza A database and in the subtype-specific databases. Nextclade sort has also been shown to identify more subtypes than the NCBI annotations.
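To illustrate why relying on the description is fragile, here is a minimal sketch of description-based subtype extraction. It assumes the subtype appears in parentheses (e.g. `(H5N1)`) somewhere in the record description; the regex, the function name, and the example descriptions are illustrative assumptions, not the code currently used in the pipeline.

```python
import re
from typing import Optional

# Assumption: when NCBI annotates the subtype, it shows up in parentheses
# in the description, e.g. "...(A/chicken/Example/1/2022(H5N1)) segment 4...".
SUBTYPE_RE = re.compile(r"\((H\d{1,2}N\d{1,2})\)")


def subtype_from_description(description: str) -> Optional[str]:
    """Return the subtype found in a FASTA description, or None if absent."""
    match = SUBTYPE_RE.search(description)
    return match.group(1) if match else None


# Hypothetical descriptions: one annotated, one without a subtype.
annotated = "XX000000.1 Influenza A virus (A/chicken/Example/1/2022(H5N1)) segment 4 hemagglutinin (HA) gene"
unannotated = "XX000001.1 Influenza A virus segment 4 hemagglutinin (HA) gene"

annotated_subtype = subtype_from_description(annotated)
unannotated_subtype = subtype_from_description(unannotated)
```

The second record yields no subtype at all, which is the case a sequence-based method such as nextclade sort would be expected to cover.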

Pre-work

I tried this in loculus-project/loculus#3407 (summary of my analysis findings: https://github.com/GenSpectrum/dashboards/blob/add_more_influenza/H5N1_annotation_assessment.ipynb), but it is currently blocked due to:

1. Insufficient grouping capabilities: subtype annotation relies on segments being grouped so that each group contains an NA and an HA segment. We currently cannot group enough sequences because metadata is lacking. Grouping information is available in the assembly, but I am currently unable to download the assemblies due to a bug: Unable to download the Influenza A taxon assembly ncbi/datasets#432.
2. Alignment/preprocessing issues: not all segments identified by nextclade sort align to the final subtype reference. This is especially true for segment8, which is highly diverged. The preprocessing pipeline fails a sample if any of its segments does not align, meaning that multiple samples will not be uploaded. We need to either:
   - Add an additional alignment step after nextclade sort (ideally only on the segment/subtype that has been identified). This removes all segments that do not align from the submission, but it also removes data.
   - Allow some sequences to not align in preprocessing.
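The grouping requirement described above can be sketched as follows. This is a rough illustration under stated assumptions, not the actual preprocessing code: `SortHit`, `best_hits`, and `group_subtype` are hypothetical names, and the segment and subtype labels are assumed to have already been derived from whichever dataset nextclade sort matched.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class SortHit(NamedTuple):
    """One candidate match for a sequence (hypothetical structure)."""
    seq_name: str
    segment: str                 # e.g. "HA", "NA", "PB2"
    subtype_part: Optional[str]  # e.g. "H5" for HA, "N1" for NA, else None
    score: float


def best_hits(hits: Iterable[SortHit]) -> Dict[str, SortHit]:
    """Keep only the highest-scoring hit per sequence."""
    best: Dict[str, SortHit] = {}
    for hit in hits:
        current = best.get(hit.seq_name)
        if current is None or hit.score > current.score:
            best[hit.seq_name] = hit
    return best


def group_subtype(group: Iterable[str], best: Dict[str, SortHit]) -> Optional[str]:
    """Combine the HA and NA calls of one group; None if either is missing."""
    ha = na = None
    for name in group:
        hit = best.get(name)
        if hit is None:
            continue
        if hit.segment == "HA":
            ha = hit.subtype_part
        elif hit.segment == "NA":
            na = hit.subtype_part
    return ha + na if ha and na else None


# Made-up hits for three sequences of one isolate.
hits = [
    SortHit("s1", "HA", "H5", 0.9),
    SortHit("s1", "HA", "H3", 0.4),
    SortHit("s2", "NA", "N1", 0.8),
    SortHit("s3", "PB2", None, 0.7),
]
best = best_hits(hits)
complete_group = group_subtype(["s1", "s2", "s3"], best)
incomplete_group = group_subtype(["s1", "s3"], best)
```

A group without an NA (or HA) segment gets no subtype, which is why insufficient grouping directly limits how many sequences can be annotated.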

anna-parker · 2025-01-07T18:40:32Z

Task 1 is now completed here: https://github.com/anna-parker/influenza-a-groupings

anna-parker mentioned this issue Dec 11, 2024: Add more influenza dashboards #305 (Open, 5 tasks)
