Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CCHFV to loculus #1920

Merged
merged 81 commits into from
Jun 6, 2024
Merged
Changes from 1 commit
Commits
Show all changes
81 commits
Select commit Hold shift + click to select a range
9277b4b
Add ccfv to yaml
anna-parker May 10, 2024
21d2877
Fix ingest for single segment case
anna-parker May 15, 2024
d65a076
Fix: values.yaml - nucleotideSequences need to be a list in prepro co…
anna-parker May 15, 2024
fef7ebe
Add correct genome annotations from NCBI
anna-parker May 15, 2024
c323743
Update configs to use githubusercontent for nextclade_datasets.
anna-parker May 16, 2024
309dfeb
Use new dataset link
anna-parker May 16, 2024
c94ba9f
Fix preprocessing issues after default values.yaml changes.
anna-parker May 22, 2024
4756ffe
Add segmented as a config param
anna-parker May 22, 2024
49ff8e2
Join segments based on isolate name.
anna-parker May 23, 2024
1d9df16
Fix some prepro issues
anna-parker May 23, 2024
e0f8801
Add default config changes
anna-parker May 23, 2024
9dea930
Update silo configs
anna-parker May 23, 2024
b3c7645
Remove preprocessing temp results file.
anna-parker May 23, 2024
530cb30
Fix cchfv table columns as metdata has now been renamed.
anna-parker May 23, 2024
5717cd4
Fix author_affiliations
anna-parker May 23, 2024
61bb4f7
Merge branch 'main' into ccfv
anna-parker May 23, 2024
7069e0b
Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv
anna-parker May 23, 2024
9d1eb2a
Fix merge issues with instanceName.
anna-parker May 23, 2024
f375fb7
Merge branch 'main' into ccfv
anna-parker May 23, 2024
1d915a1
Fix prepare_metdata bug.
anna-parker May 23, 2024
320d1f3
Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv
anna-parker May 23, 2024
9b299d4
Add back missing website metadata.
anna-parker May 23, 2024
303c630
Fix author list sorting, fix displayName.
anna-parker May 23, 2024
6e6e75c
Fix values.yaml
anna-parker May 24, 2024
ddbc4d9
Fix reingest.
anna-parker May 25, 2024
7afc4ed
Add segmented to ingest configs and make use in scripts consistent.
anna-parker May 25, 2024
60f5992
Update README.
anna-parker May 25, 2024
7e70b73
Fix little ingest bug
anna-parker May 25, 2024
1c6d8ea
Refactor ingest to make steps clearer.
anna-parker May 27, 2024
31d1af1
Fix webpage bug.
anna-parker May 27, 2024
b14e4ee
Small prepro fixes
anna-parker May 27, 2024
1c0c841
Remove unnecessary files from gitignore
anna-parker May 27, 2024
0fbb72c
Merge branch 'main' into ccfv
anna-parker May 28, 2024
38778f3
Small fixes
anna-parker May 28, 2024
80c52de
Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv
anna-parker May 28, 2024
4fdfd1b
Clean up preprocessing
anna-parker May 28, 2024
fc6c7e4
add args
anna-parker May 28, 2024
f535691
Use links to sequences instead of full sequences in values.yaml.
anna-parker May 28, 2024
3cbbcc7
Merge branch 'main' into ccfv
anna-parker May 28, 2024
498a88d
Fix little bug
anna-parker May 28, 2024
3e1c6e0
Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv
anna-parker May 28, 2024
8b85432
Fix length bug
anna-parker May 28, 2024
c177584
Merge branch 'main' into ccfv
anna-parker May 28, 2024
aacad98
Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv
anna-parker May 28, 2024
4b48c46
Fix merge bug
anna-parker May 28, 2024
3e5377d
Make check stricter
anna-parker May 28, 2024
0591f09
Update docs
anna-parker May 28, 2024
16bb0aa
Merge remote-tracking branch 'origin/main' into ccfv
anna-parker May 31, 2024
eaba61f
Merge branch 'main' into ccfv
anna-parker May 31, 2024
b16026e
Fix prepro bug introduced by merge
anna-parker May 31, 2024
aab29a1
Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv
anna-parker May 31, 2024
7e512a7
Remove ncbi_length from defaults - this was removed from values.yaml …
anna-parker May 31, 2024
0186f52
Update READMEs with suggestions.
anna-parker May 31, 2024
a29ca19
Resolve some issues
anna-parker Jun 1, 2024
8ede192
Change `segmented` to `per_segment`.
anna-parker Jun 2, 2024
28c9402
Remove the requirement for adding `segmented:True` to the config.yaml
anna-parker Jun 3, 2024
9026276
Fix backend bug
anna-parker Jun 3, 2024
f77b447
Fix bug
anna-parker Jun 3, 2024
9c14af7
Second try to fix bug
anna-parker Jun 3, 2024
22c29dd
Merge branch 'main' into ccfv
corneliusroemer Jun 4, 2024
542606d
Add dag for segmented
corneliusroemer Jun 4, 2024
0914133
Simplify segmentation inference
corneliusroemer Jun 4, 2024
18db327
Remove unnecessary/confusing functions
corneliusroemer Jun 4, 2024
3cb089a
Simplify extraction script, DRYer
corneliusroemer Jun 4, 2024
ef375a4
Reorder to never have rules do forward references
corneliusroemer Jun 4, 2024
94b9706
Remove unused function
corneliusroemer Jun 4, 2024
fe20091
Keep top level dir clean by moving images to folder
corneliusroemer Jun 4, 2024
3fb9060
Review segment parsing script
corneliusroemer Jun 4, 2024
47afdea
Switch default log level to INFO, debug is very verbose and there's n…
corneliusroemer Jun 4, 2024
14d784d
Log a few important lines at INFO, not everything at debug only
corneliusroemer Jun 4, 2024
1fbc79c
Avoid a very broad try/except block, if necessary, use in more locali…
corneliusroemer Jun 4, 2024
c6d0ceb
Mention all config in `params:` blocks, so snakemake can rerun rule o…
corneliusroemer Jun 4, 2024
1012ecb
Use input.script consistently (the advantage of using the script as i…
corneliusroemer Jun 4, 2024
2cfdb12
All config files to be used by Python MUST use snake case, not camel …
corneliusroemer Jun 4, 2024
de3461b
Fix ruff lints and unnecessary indentations
corneliusroemer Jun 4, 2024
fac633e
Update documentation of group_segments
anna-parker Jun 6, 2024
8543f6c
Fix issues raised in get_segment_details
anna-parker Jun 6, 2024
df9c57d
Fix weird error I introduced when merging changes
anna-parker Jun 6, 2024
77119db
Go back to old regex as this catches more cases.
anna-parker Jun 6, 2024
bcefc7c
Update ingest config file
anna-parker Jun 6, 2024
cd19b1c
Merge branch 'main' into ccfv
anna-parker Jun 6, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Add correct genome annotations from NCBI
anna-parker committed May 22, 2024
commit fef7ebe6e6b7fbe6c7829473c6b6cb39f500529d
114 changes: 56 additions & 58 deletions preprocessing/nextclade/src/loculus_preprocessing/prepro.py
Original file line number Diff line number Diff line change
@@ -2,6 +2,7 @@
import dataclasses
import json
import logging
import os
import subprocess # noqa: S404
import sys
import time
@@ -10,12 +11,11 @@
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Any, Optional
from distutils.dir_util import copy_tree
import os

import dpath
import requests
from Bio import SeqIO
from distutils.dir_util import copy_tree

from .backend import get_jwt
from .config import Config
@@ -47,8 +47,7 @@
def fetch_unprocessed_sequences(n: int, config: Config) -> Sequence[UnprocessedEntry]:
url = config.backend_host.rstrip("/") + "/extract-unprocessed-data"
logging.debug(f"Fetching {n} unprocessed sequences from {url}")
params = {"numberOfSequenceEntries": n,
"pipelineVersion": config.pipeline_version}
params = {"numberOfSequenceEntries": n, "pipelineVersion": config.pipeline_version}
headers = {"Authorization": "Bearer " + get_jwt(config)}
response = requests.post(url, data=params, headers=headers, timeout=10)
if not response.ok:
@@ -87,13 +86,16 @@ def parse_ndjson(ndjson_data: str) -> Sequence[UnprocessedEntry]:
def enrich_with_nextclade(
unprocessed: Sequence[UnprocessedEntry], dataset_dir: str, config: Config
) -> dict[AccessionVersion, UnprocessedAfterNextclade]:
unaligned_nucleotide_sequences: dict[AccessionVersion,
dict[str, Optional[NucleotideSequence]]] = {}
unaligned_nucleotide_sequences: dict[
AccessionVersion, dict[str, Optional[NucleotideSequence]]
] = {}
input_metadata: dict[AccessionVersion, dict[str, Any]] = {}
aligned_aminoacid_sequences: dict[AccessionVersion,
dict[GeneName, AminoAcidSequence | None]] = {}
aligned_nucleotide_sequences: dict[AccessionVersion,
dict[str, Optional[NucleotideSequence]]] = {}
aligned_aminoacid_sequences: dict[
AccessionVersion, dict[GeneName, AminoAcidSequence | None]
] = {}
aligned_nucleotide_sequences: dict[
AccessionVersion, dict[str, Optional[NucleotideSequence]]
] = {}
for entry in unprocessed:
id = entry.accessionVersion
input_metadata[id] = entry.data.metadata
@@ -105,32 +107,32 @@ def enrich_with_nextclade(
for segment in config.nucleotideSequences:
aligned_nucleotide_sequences[id][segment] = None
if segment in entry.data.unalignedNucleotideSequences:
unaligned_nucleotide_sequences[id][segment] = entry.data.unalignedNucleotideSequences[segment]
unaligned_nucleotide_sequences[id][segment] = (
entry.data.unalignedNucleotideSequences[segment]
)
else:
unaligned_nucleotide_sequences[id][segment] = None

nextclade_metadata: dict[AccessionVersion, dict[str, Any]] = {}
nucleotide_insertions: dict[AccessionVersion,
dict[str, list[NucleotideInsertion]]] = {}
amino_acid_insertions: dict[AccessionVersion,
dict[GeneName, list[AminoAcidInsertion]]] = {}
with TemporaryDirectory(delete=not config.keep_tmp_dir) as result_dir:
nucleotide_insertions: dict[AccessionVersion, dict[str, list[NucleotideInsertion]]] = {}
amino_acid_insertions: dict[AccessionVersion, dict[GeneName, list[AminoAcidInsertion]]] = {}
with TemporaryDirectory(delete=not config.keep_tmp_dir) as result_dir: # noqa: PLR1702
for segment in config.nucleotideSequences:
result_dir_seg = result_dir + "/" + segment
dataset_dir_seg = dataset_dir + "/" + segment
result_dir_seg = result_dir if segment == "main" else result_dir + "/" + segment
dataset_dir_seg = dataset_dir if segment == "main" else dataset_dir + "/" + segment
input_file = result_dir_seg + "/input.fasta"
os.makedirs(os.path.dirname(input_file), exist_ok=True)
with open(input_file, "w", encoding="utf-8") as f:
for id, seg_dict in unaligned_nucleotide_sequences.items():
if segment in seg_dict.keys():
if segment in seg_dict:
f.write(f">{id}\n")
f.write(f"{seg_dict[segment]}\n")

command = [
"nextclade3",
"run",
f"--output-all={result_dir}",
f"--input-dataset={dataset_dir}",
f"--output-all={result_dir_seg}",
f"--input-dataset={dataset_dir_seg}",
f"--output-translations={
result_dir}/nextclade.cds_translation.{{cds}}.fasta",
"--jobs=1",
@@ -145,26 +147,26 @@ def enrich_with_nextclade(
msg = f"nextclade failed with exit code {exit_code}"
raise Exception(msg)

logging.debug(f"Nextclade results available in {result_dir}")
logging.debug("Nextclade results available in %s", result_dir)

with open(result_dir_seg + "/nextclade.aligned.fasta", encoding="utf-8") as aligned_nucs:
with open(
result_dir_seg + "/nextclade.aligned.fasta", encoding="utf-8"
) as aligned_nucs:
aligned_nuc = SeqIO.parse(aligned_nucs, "fasta")
for aligned_sequence in aligned_nuc:
sequence_id: str = aligned_sequence.id
aligned_nucleotide_sequences[sequence_id][segment] = str(
aligned_sequence.seq)
aligned_nucleotide_sequences[sequence_id][segment] = str(aligned_sequence.seq)

for gene in config.genes:
translation_path = result_dir_seg + \
f"/nextclade.cds_translation.{gene}.fasta"
translation_path = result_dir_seg + f"/nextclade.cds_translation.{gene}.fasta"
try:
with open(translation_path, encoding="utf-8") as aligned_translations:
aligned_translation = SeqIO.parse(
aligned_translations, "fasta")
aligned_translation = SeqIO.parse(aligned_translations, "fasta")
for aligned_sequence in aligned_translation:
sequence_id = aligned_sequence.id
aligned_aminoacid_sequences[sequence_id][gene] = str(
aligned_sequence.seq)
aligned_sequence.seq
)
except FileNotFoundError:
# TODO: Add warning to each sequence
logging.info(
@@ -174,7 +176,8 @@ def enrich_with_nextclade(

parse_nextclade_json(result_dir_seg, nextclade_metadata)
parse_nextclade_tsv(
amino_acid_insertions, nucleotide_insertions, result_dir_seg, config, segment)
amino_acid_insertions, nucleotide_insertions, result_dir_seg, config, segment
)

return {
id: UnprocessedAfterNextclade(
@@ -190,18 +193,20 @@ def enrich_with_nextclade(
}


def parse_nextclade_tsv(amino_acid_insertions: dict[AccessionVersion, dict[GeneName, list[AminoAcidInsertion]]],
nucleotide_insertions: dict[AccessionVersion, dict[str, list[NucleotideInsertion]]],
result_dir: str, config: Config, segment: str
):
def parse_nextclade_tsv(
amino_acid_insertions: dict[AccessionVersion, dict[GeneName, list[AminoAcidInsertion]]],
nucleotide_insertions: dict[AccessionVersion, dict[str, list[NucleotideInsertion]]],
result_dir: str,
config: Config,
segment: str,
):
with open(result_dir + "/nextclade.tsv", encoding="utf-8") as nextclade_tsv:
reader = csv.DictReader(nextclade_tsv, delimiter="\t")
for row in reader:
id = row["seqName"]

nuc_ins_str = list(row["insertions"].split(","))
nucleotide_insertions[id] = {
segment: [] if nuc_ins_str == [""] else nuc_ins_str}
nucleotide_insertions[id] = {segment: [] if nuc_ins_str == [""] else nuc_ins_str}

aa_ins: dict[str, list[str]] = {gene: [] for gene in config.genes}
aa_ins_split = row["aaInsertions"].split(",")
@@ -254,11 +259,12 @@ def process_single(
warnings: list[ProcessingAnnotation] = []
len_dict: dict[str, str | int | float | None] = {}
for segment in config.nucleotideSequences:
theosanderson marked this conversation as resolved.
Show resolved Hide resolved
if unprocessed.unalignedNucleotideSequences[segment] is None:
len_dict["length_" + segment] = None
sequence = unprocessed.unalignedNucleotideSequences[segment]
key = "length" if segment == "main" else "length_" + segment
if sequence:
len_dict[key] = len(sequence)
else:
len_dict["length_" +
segment] = len(unprocessed.unalignedNucleotideSequences[segment])
len_dict[key] = None
output_metadata: ProcessedMetadata = len_dict

for output_field, spec_dict in config.processing_spec.items():
@@ -290,7 +296,7 @@ def process_single(
)
)
continue
sub_path = input_path[len(nextclade_prefix):]
sub_path = input_path[len(nextclade_prefix) :]
input_data[arg_name] = str(
dpath.get(
unprocessed.nextcladeMetadata,
@@ -356,8 +362,7 @@ def process_all(


def submit_processed_sequences(processed: Sequence[ProcessedEntry], config: Config) -> None:
json_strings = [json.dumps(dataclasses.asdict(sequence))
for sequence in processed]
json_strings = [json.dumps(dataclasses.asdict(sequence)) for sequence in processed]
with open("data.json", "w", encoding="utf-8") as f:
for seq in processed:
json.dump(dataclasses.asdict(seq), f)
@@ -368,11 +373,9 @@ def submit_processed_sequences(processed: Sequence[ProcessedEntry], config: Conf
"Authorization": "Bearer " + get_jwt(config),
}
params = {"pipelineVersion": config.pipeline_version}
response = requests.post(url, data=ndjson_string,
headers=headers, params=params, timeout=10)
response = requests.post(url, data=ndjson_string, headers=headers, params=params, timeout=10)
if not response.ok:
Path("failed_submission.json").write_text(
ndjson_string, encoding="utf-8")
Path("failed_submission.json").write_text(ndjson_string, encoding="utf-8")
msg = (
f"Submitting processed data failed. Status code: {
response.status_code}\n"
@@ -395,11 +398,9 @@ def download_nextclade_dataset(dataset_dir: str, config: Config) -> None:
]

if config.nextclade_dataset_tag is not None:
dataset_download_command.append(
f"--tag={config.nextclade_dataset_tag}")
dataset_download_command.append(f"--tag={config.nextclade_dataset_tag}")

logging.info("Downloading Nextclade dataset: %s",
dataset_download_command)
logging.info("Downloading Nextclade dataset: %s", dataset_download_command)
if subprocess.run(dataset_download_command, check=False).returncode != 0: # noqa: S603
msg = "Dataset download failed"
raise RuntimeError(msg)
@@ -414,12 +415,10 @@ def run(config: Config) -> None:
total_processed = 0
while True:
logging.debug("Fetching unprocessed sequences")
unprocessed = fetch_unprocessed_sequences(
config.batch_size, config)
unprocessed = fetch_unprocessed_sequences(config.batch_size, config)
if len(unprocessed) == 0:
# sleep 1 sec and try again
logging.debug(
"No unprocessed sequences found. Sleeping for 1 second.")
logging.debug("No unprocessed sequences found. Sleeping for 1 second.")
time.sleep(1)
continue
# Process the sequences, get result as dictionary
@@ -428,8 +427,7 @@ def run(config: Config) -> None:
try:
submit_processed_sequences(processed, config)
except RuntimeError as e:
logging.exception(
"Submitting processed data failed. Traceback : %s", e)
logging.exception("Submitting processed data failed. Traceback : %s", e)
continue
total_processed += len(processed)
logging.info("Processed %s sequences", len(processed))
8 changes: 7 additions & 1 deletion preprocessing/nextstrain/cchf/L/genome_annotation.gff3
Original file line number Diff line number Diff line change
@@ -1,3 +1,9 @@
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NC_005301.3 1 12108
NC_005301.3 feature gene 77 11914 . + . gene_name="RdRp"
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3052518
NC_005301.3 RefSeq region 1 12108 . + . ID=NC_005301.3:1..12108;Dbxref=taxon:3052518;Name=L;gbkey=Src;genome=genomic;mol_type=genomic RNA;old-name=Crimean-Congo hemorrhagic fever virus;segment=L;strain=IbAr10200
NC_005301.3 RefSeq gene 77 11914 . + . gene_name="RdRp";ID=gene-CCFHV_sLgp1;Dbxref=GeneID:2943075;Name=RdRp;gbkey=Gene;gene=RdRp;gene_biotype=protein_coding;locus_tag=CCFHV_sLgp1
NC_005301.3 RefSeq CDS 77 11914 . + 0 ID=cds-YP_325663.1;Parent=gene-CCFHV_sLgp1;Dbxref=GenBank:YP_325663.1,GeneID:2943075;Name=YP_325663.1;Note=putative polymerase and helicase function%3B putative RNA-dependent RNA polymerase%3B contains OTU-like cysteine protease motif;gbkey=CDS;gene=RdRp;locus_tag=CCFHV_sLgp1;product=putative polyprotein;protein_id=YP_325663.1

9 changes: 8 additions & 1 deletion preprocessing/nextstrain/cchf/M/genome_annotation.gff3
Original file line number Diff line number Diff line change
@@ -1,3 +1,10 @@
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NC_005300.2 1 5366
NC_005300.2 feature gene 93 5147 . + . gene_name="GPC"
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3052518
NC_005300.2 RefSeq region 1 5366 . + . ID=NC_005300.2:1..5366;Dbxref=taxon:3052518;Name=M;country=Nigeria: Sokoto;gbkey=Src;genome=genomic;mol_type=genomic RNA;nat-host=Hyalomma excavatum;note=from YARU%2C New Haven%2C Conn. SMB passed 13 time in suckling mouse brain followed by 2 passes in CDC Vero E6 cells;old-name=Crimean-Congo hemorrhagic fever virus;segment=M;strain=IbAr10200
NC_005300.2 RefSeq primer_binding_site 1 18 . + . ID=id-NC_005300.2:1..18;gbkey=primer_bind
NC_005300.2 RefSeq gene 93 5147 . + . gene_name="GPC";ID=gene-CCFHV_sMgp1;Dbxref=GeneID:2943074;Name=CCFHV_sMgp1;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CCFHV_sMgp1
NC_005300.2 RefSeq CDS 93 5147 . + 0 ID=cds-NP_950235.1;Parent=gene-CCFHV_sMgp1;Dbxref=GenBank:NP_950235.1,GeneID:2943074;Name=NP_950235.1;gbkey=CDS;locus_tag=CCFHV_sMgp1;product=glycoprotein precursor;protein_id=NP_950235.1
NC_005300.2 RefSeq primer_binding_site 5349 5366 . - . ID=id-NC_005300.2:5349..5366;gbkey=primer_bind
9 changes: 7 additions & 2 deletions preprocessing/nextstrain/cchf/S/genome_annotation.gff3
Original file line number Diff line number Diff line change
@@ -1,3 +1,8 @@
##gff-version 3
##sequence-region NC_005302.1 1 1672
NC_005302.1 feature gene 56 1504 . + . gene_name="NP"
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NC_005302.1 1 1672
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3052518
NC_005302.1 RefSeq region 1 1672 . + . ID=NC_005302.1:1..1672;Dbxref=taxon:3052518;Name=S;gbkey=Src;genome=genomic;mol_type=genomic RNA;old-name=Crimean-Congo hemorrhagic fever virus;segment=S;strain=10200
NC_005302.1 RefSeq gene 56 1504 . + . gene_name="NP";ID=gene-CCFHVsSgp1;Dbxref=GeneID:2943076;Name=CCFHVsSgp1;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CCFHVsSgp1
NC_005302.1 RefSeq CDS 56 1504 . + 0 ID=cds-NP_950237.1;Parent=gene-CCFHVsSgp1;Dbxref=GenBank:NP_950237.1,GeneID:2943076;Name=NP_950237.1;gbkey=CDS;locus_tag=CCFHVsSgp1;product=nucleoprotein;protein_id=NP_950237.1