Consensus genome and variant calling from Illumina only #141

pvanheus · 2020-05-05T08:25:48Z

Hi everyone

I have been working with the National Institute for Communicable Diseases (NICD) on some of their SARS-CoV-2 sequencing. I adapted the "variation" workflow to call variants in Illumina data and also produce an inferred consensus genome. The current "Assembly" workflow assumes that you have access to both Illumina and Nanopore genomes for a sample, which is a pretty rare situation. My workflow (with some TODOs in the step annotation) is in a gist. Comments and additions are welcome!

If it is found to be useful perhaps it can be incorporated into the COVID-19 resources page.

P.S. for those with ARTIC Amplicon data I created a workflow for analysing that as did Thanh le Viet. Pasting them here in case they are useful.

mvdbeek · 2020-05-05T09:56:34Z

Can you make the consensus part of you workflow a separate workflow ? Might be interesting to compare this to the samples that also appear in GISAID. I think we can add this to our current workflow (which is a subworkflow with a parallelized variant of the download step, and then runs the SE and PE workflows -- https://usegalaxy.org/u/sars-cov2-bot/w/download-and-sepe-illumina-covid-variation-workflow-imported-from-uploaded-file).

Can you describe the artic workflow in a bit more detail ? Why minimap, what's the bed file used for, why ivar, etc ? A bit like we do in https://covid19.galaxyproject.org/genomics/4-Variation/#how-do-we-call-variants ?

pvanheus · 2020-05-05T17:07:45Z

Hi @mvdbeek the only steps that are not involved in generating the consensus are the final "SnpSift Extract Fields" and "Collapse Collection" steps. Everything else goes into computing the consensus. So in the workflow you linked it would.

use the FASTA from Step 3: SnpEff build
use bedtools genomecov on Step 9: Realign reads
Filter to identify low coverage regions (low coverage should be a configurable parameter)
SnpSift Filter to filter variants by allelle frequency (also should be a configurable parameter)
use (1) and (4) as inputs to bcftools consensus
use (5) and (3) as inputs to bedtools MaskFastaBed to mask out low coverage regions
use sed on (6) to rename the FASTA to final consensus name

So it plugs quite naturally into what you are using. I imagine that you want to keep evolving the workflow to incorporate new insights into filtering (e.g. some of what is being discussed here).

pvanheus · 2020-05-05T17:14:21Z

Commenting separately on the ivar / amplicon story. The majority of genomes out there are being generated using ARTIC amplicon protocol detailed here for Illumina and here for ONT. My focus is the Illumina version.

The BED file is the location of ARTIC primers (there are 2 sets in common use - v.1 and v.3). This feeds into ivar trim to trim these out of the reads. ivar is specifically designed for this kind of amplicon sequencing, thus its use in the workflow. The choice of minimap2 and bwa-mem is somewhat arbitrary - mine was based on an approach Torsten Seemann at Doherty Institute in Melbourne was using, using minimap2. Thanh used bwa-mem in his version of the workflow.

I can write up a section for the web page but thought I should start the discussion here to iron out issues like the ones you've raised first.

wm75 · 2020-06-27T22:26:18Z

@pvanheus just proposed an update (#185) of the ARTIC WF in its dedicated section.
I've extended its core part somewhat, but discarded the consensus calling part to highlight the similarity to the regular variation WFs.
A separate WF showing consensus calling (not necessarily specific to ARTIC data) may still be good to have. We just need to decide on a section to add it to.

Slugger70 · 2020-06-28T11:53:24Z

Hi @wm75, I took the output of zip collection to pass to the rest of the WF instead of the individual read datasets from fastp and also put a flatten collection between Qualimap BAMQC and MultiQC so multiqc only runs once. My version of the WF is at:https://usegalaxy.org.au/u/simongladman/w/covid-19-variation-analysis-on-artic-pe-data

wm75 · 2020-06-28T12:00:23Z

@Slugger70 I think by passing a list:paired collection to bwa-mem, you may run into the same blocking WF scheduling issue that you're trying to avoid at the fastp step. At least that's my recollection.
Have you checked that this parallelizes correctly?

wm75 · 2020-06-28T12:02:57Z

Thanks for spotting that MultiQC issue! I'll look into it for EU.

wm75 · 2020-06-28T15:09:29Z

I've updated the EU WF with the Flatten Collection step of the Qualimap output just as you were doing it for AU. Thanks again @Slugger70!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consensus genome and variant calling from Illumina only #141

Consensus genome and variant calling from Illumina only #141

pvanheus commented May 5, 2020

mvdbeek commented May 5, 2020

pvanheus commented May 5, 2020

pvanheus commented May 5, 2020

wm75 commented Jun 27, 2020 •

edited

Loading

Slugger70 commented Jun 28, 2020 •

edited

Loading

wm75 commented Jun 28, 2020

wm75 commented Jun 28, 2020

wm75 commented Jun 28, 2020

Consensus genome and variant calling from Illumina only #141

Consensus genome and variant calling from Illumina only #141

Comments

pvanheus commented May 5, 2020

mvdbeek commented May 5, 2020

pvanheus commented May 5, 2020

pvanheus commented May 5, 2020

wm75 commented Jun 27, 2020 • edited Loading

Slugger70 commented Jun 28, 2020 • edited Loading

wm75 commented Jun 28, 2020

wm75 commented Jun 28, 2020

wm75 commented Jun 28, 2020

wm75 commented Jun 27, 2020 •

edited

Loading

Slugger70 commented Jun 28, 2020 •

edited

Loading