Skip to content
This repository has been archived by the owner on Feb 7, 2023. It is now read-only.

Consensus genome and variant calling from Illumina only #141

Open
pvanheus opened this issue May 5, 2020 · 8 comments
Open

Consensus genome and variant calling from Illumina only #141

pvanheus opened this issue May 5, 2020 · 8 comments

Comments

@pvanheus
Copy link
Contributor

pvanheus commented May 5, 2020

Hi everyone

I have been working with the National Institute for Communicable Diseases (NICD) on some of their SARS-CoV-2 sequencing. I adapted the "variation" workflow to call variants in Illumina data and also produce an inferred consensus genome. The current "Assembly" workflow assumes that you have access to both Illumina and Nanopore genomes for a sample, which is a pretty rare situation. My workflow (with some TODOs in the step annotation) is in a gist. Comments and additions are welcome!

If it is found to be useful perhaps it can be incorporated into the COVID-19 resources page.

P.S. for those with ARTIC Amplicon data I created a workflow for analysing that as did Thanh le Viet. Pasting them here in case they are useful.

@mvdbeek
Copy link
Member

mvdbeek commented May 5, 2020

Can you make the consensus part of you workflow a separate workflow ? Might be interesting to compare this to the samples that also appear in GISAID. I think we can add this to our current workflow (which is a subworkflow with a parallelized variant of the download step, and then runs the SE and PE workflows -- https://usegalaxy.org/u/sars-cov2-bot/w/download-and-sepe-illumina-covid-variation-workflow-imported-from-uploaded-file).

Can you describe the artic workflow in a bit more detail ? Why minimap, what's the bed file used for, why ivar, etc ? A bit like we do in https://covid19.galaxyproject.org/genomics/4-Variation/#how-do-we-call-variants ?

@pvanheus
Copy link
Contributor Author

pvanheus commented May 5, 2020

Hi @mvdbeek the only steps that are not involved in generating the consensus are the final "SnpSift Extract Fields" and "Collapse Collection" steps. Everything else goes into computing the consensus. So in the workflow you linked it would.

  1. use the FASTA from Step 3: SnpEff build
  2. use bedtools genomecov on Step 9: Realign reads
  3. Filter to identify low coverage regions (low coverage should be a configurable parameter)
  4. SnpSift Filter to filter variants by allelle frequency (also should be a configurable parameter)
  5. use (1) and (4) as inputs to bcftools consensus
  6. use (5) and (3) as inputs to bedtools MaskFastaBed to mask out low coverage regions
  7. use sed on (6) to rename the FASTA to final consensus name

So it plugs quite naturally into what you are using. I imagine that you want to keep evolving the workflow to incorporate new insights into filtering (e.g. some of what is being discussed here).

@pvanheus
Copy link
Contributor Author

pvanheus commented May 5, 2020

Commenting separately on the ivar / amplicon story. The majority of genomes out there are being generated using ARTIC amplicon protocol detailed here for Illumina and here for ONT. My focus is the Illumina version.

The BED file is the location of ARTIC primers (there are 2 sets in common use - v.1 and v.3). This feeds into ivar trim to trim these out of the reads. ivar is specifically designed for this kind of amplicon sequencing, thus its use in the workflow. The choice of minimap2 and bwa-mem is somewhat arbitrary - mine was based on an approach Torsten Seemann at Doherty Institute in Melbourne was using, using minimap2. Thanh used bwa-mem in his version of the workflow.

I can write up a section for the web page but thought I should start the discussion here to iron out issues like the ones you've raised first.

@wm75
Copy link
Collaborator

wm75 commented Jun 27, 2020

@pvanheus just proposed an update (#185) of the ARTIC WF in its dedicated section.
I've extended its core part somewhat, but discarded the consensus calling part to highlight the similarity to the regular variation WFs.
A separate WF showing consensus calling (not necessarily specific to ARTIC data) may still be good to have. We just need to decide on a section to add it to.

@Slugger70
Copy link
Contributor

Slugger70 commented Jun 28, 2020

Hi @wm75, I took the output of zip collection to pass to the rest of the WF instead of the individual read datasets from fastp and also put a flatten collection between Qualimap BAMQC and MultiQC so multiqc only runs once. My version of the WF is at:https://usegalaxy.org.au/u/simongladman/w/covid-19-variation-analysis-on-artic-pe-data

@wm75
Copy link
Collaborator

wm75 commented Jun 28, 2020

@Slugger70 I think by passing a list:paired collection to bwa-mem, you may run into the same blocking WF scheduling issue that you're trying to avoid at the fastp step. At least that's my recollection.
Have you checked that this parallelizes correctly?

@wm75
Copy link
Collaborator

wm75 commented Jun 28, 2020

Thanks for spotting that MultiQC issue! I'll look into it for EU.

@wm75
Copy link
Collaborator

wm75 commented Jun 28, 2020

I've updated the EU WF with the Flatten Collection step of the Qualimap output just as you were doing it for AU. Thanks again @Slugger70!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants