Core Pipeline

Isaac Overcast edited this page Nov 21, 2015 · 6 revisions

This page lists the important files generated, and the specific config used, by each step.

Step 1 - Demultiplex

In - Raw reads from 'raw_fastq_path', must be in fastq format (can be gzip compressed).

Out - Demultiplexed individuals to <work>/fastq/*.gz & info to <work>/fastq/s1_demultiplex_stats.txt
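Since the raw reads may or may not be gzip compressed, a reader has to handle both cases. The following is a minimal sketch of one way to do that, not the pipeline's own code; the helper names `open_fastq` and `read_fastq` are hypothetical, and it assumes plain four-line FASTQ records.

```python
import gzip

def open_fastq(path):
    """Open a FASTQ file as text, whether or not it is gzip compressed."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip files start with these two bytes
        return gzip.open(path, "rt")
    return open(path, "rt")

def read_fastq(path):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    with open_fastq(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:  # end of file
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().rstrip("\n")
            yield header[1:], seq, qual  # drop the leading '@'
```

Detecting compression from the file's first bytes rather than its extension means a mislabeled file is still read correctly.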

Step 2 - Filter based on Phred Q score

Out - <work>/edits/*.fasta & info to <work>/edits/s2_rawedit_stats.txt
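As an illustration of what filtering on Phred Q score involves, here is a minimal sketch. It assumes Phred+33 quality encoding, and the thresholds `min_q` and `max_low` are hypothetical placeholder values, not the pipeline's actual settings or filtering rule.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality]

def passes_filter(quality, min_q=20, max_low=4, offset=33):
    """Return True if at most max_low bases score below min_q.

    min_q and max_low are illustrative values only (assumption).
    """
    low = sum(1 for q in phred_scores(quality, offset) if q < min_q)
    return low <= max_low
```

For example, under Phred+33 the character `I` decodes to Q40 and `#` to Q2, so a read whose quality string has more than `max_low` `#` characters would be rejected.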

Step 3 - Cluster reads within individuals

Out

Dereplicated and sorted reads: <work>/edits/*.derep
<work>/clust_<tolerance>/
*.htemp - FASTA file of unmatched searches (vsearch)
*.utemp - user defined output stats from vsearch
*.clust.gz - unaligned clusters
*.clustS.gz - Aligned clusters (post-muscle)
*s3_cluster_stats.txt
Only if reference sequence mapping is used:
<work>/refmapping/
*.sam - Raw output of smalt mapping
*.<mapped/unmapped>.bam - bam files for mapped and unmapped reads
*.sorted-<mapped/unmapped>.bam - sorted bam files for mapped and unmapped reads
<work>/edits/*.fastq - Updated fastq files in the edits dir to contain only unmapped reads
<work>/clust_<tolerance>/
*.clustS.gz - Merged denovo clusters (post-muscle) and reference sequence aligned pileups.
Info to <work>/edits/clust_<tolerance>/s3_cluster_stats.txt
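To show what "dereplicated and sorted" means for the *.derep files, here is a conceptual sketch in plain Python. The pipeline itself uses vsearch for this, so the function below is only an illustration of the idea; the name `dereplicate` is hypothetical.

```python
from collections import Counter

def dereplicate(sequences):
    """Collapse identical reads and sort them by decreasing abundance.

    Returns a list of (sequence, count) pairs. Ties are broken
    alphabetically so the output order is deterministic.
    """
    counts = Counter(sequences)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Sorting by abundance puts the most common reads first, so they are the first candidates considered when clusters are seeded.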

Step 4 - Joint estimation of heterozygosity (H) and error rate (E)

Out

Info to <work>/edits/clust_<tolerance>/s4_Pi_E_estimate.txt

Step 5 - Consensus sequences and HDF5 database with coverages

Out

consensus reads: <work>/edits/clust_<tolerance>/consens_<outprefix>/
*.consens - FASTA file of consensus reads
*...hdf5... - database storage (maybe) of read depths

Step 6 - Cluster across samples

Out

ordered consensus reads: <work>/edits/clust_<tolerance>/consens_<outprefix>/cat...
vsearch matching output: <work>/edits/clust_<tolerance>/consens_<outprefix>/cat.utemp
database of all clusters containing depth data: ...