Currently, H5N1 sequences are identified using the FASTA record description. NCBI annotates this for a majority of sequences, but it is not always available. Additionally, identifying subtypes using nextclade sort will mean that the same sequences are found in both the general Influenza A database and the specific subtype databases. Nextclade sort has been shown to identify more subtypes than the NCBI annotations.
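To illustrate why description-based identification is fragile, here is a minimal sketch that pulls a subtype out of a FASTA description with a regular expression. The header layout (subtype in parentheses after the strain name) and the name `subtype_from_description` are assumptions for illustration only; this is not the code used in the pipeline.

```python
import re
from typing import Optional

# Assumed layout: the subtype appears in parentheses, e.g. "(A/duck/X/1/2020(H5N1))".
SUBTYPE_PATTERN = re.compile(r"\((H\d{1,2}N\d{1,2})\)")


def subtype_from_description(description: str) -> Optional[str]:
    """Return the subtype found in a FASTA description, or None if it is absent."""
    match = SUBTYPE_PATTERN.search(description)
    return match.group(1) if match else None
```

Any record whose description lacks the parenthesised subtype returns `None`, which is the case where an approach such as nextclade sort would be needed instead.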
Insufficient grouping capabilities: subtype annotation requires segments to be grouped so that each group contains an HA and an NA segment. We currently cannot group enough sequences because metadata is missing. Grouping information is available in the assembly, but I am currently unable to download the assemblies due to a bug: "Unable to download the Influenza A taxon assembly" (ncbi/datasets#432).
Alignment/preprocessing issues: not all segments identified by nextclade sort align to the final subtype reference. This is especially true for segment8, which is highly diverged. The preprocessing pipeline errors a sample if any of its segments does not align, so multiple samples will not be uploaded. We need to either:
Add an additional alignment step after nextclade sort (ideally only against the segment/subtype that has been identified). This removes all segments that do not align from the submission, but it also removes data.
Allow some sequences to not align in preprocessing.
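The two options above can be sketched as follows. This is a hedged illustration, not the actual preprocessing code: the function names `drop_unaligned` and `preprocess`, the `allow_unaligned` flag, and the representation of a sample as a dict from segment name to sequence are all hypothetical.

```python
from typing import Dict, List, Set


def drop_unaligned(segments: Dict[str, str], aligned: Set[str]) -> Dict[str, str]:
    """Option 1: remove segments that did not align before submission (loses data)."""
    return {name: seq for name, seq in segments.items() if name in aligned}


def preprocess(
    segments: Dict[str, str], aligned: Set[str], allow_unaligned: bool = False
) -> Dict[str, object]:
    """Option 2: keep every segment, and only error when unaligned segments are not allowed."""
    unaligned: List[str] = sorted(set(segments) - aligned)
    if unaligned and not allow_unaligned:
        # Mirrors the current behaviour described above: one failed segment fails the sample.
        raise ValueError(f"segments failed to align: {unaligned}")
    return {"segments": dict(segments), "unaligned": unaligned}
```

With option 1 a sample whose segment8 fails to align is uploaded without that segment. With option 2 the sample is uploaded in full and the unaligned segments are only recorded.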
anna-parker changed the title from "Identify influenza subtypes using nextclade sort for all influenza subtypes to not have to rely on fasta descriptions (often information is missing)" to "Identify influenza subtypes using nextclade sort for all influenza subtypes to not have to rely on fasta descriptions" on Dec 11, 2024
Pre-work
I tried this in loculus-project/loculus#3407 (summary of my analysis findings: https://github.com/GenSpectrum/dashboards/blob/add_more_influenza/H5N1_annotation_assessment.ipynb), but it is currently blocked by the grouping and alignment issues described above.