TODO Step3
Requirements: All these binaries are included in the default install of iPyrad.
- smalt: This is the aligner we'll use to map reads to the reference sequence.
- samtools: Manipulate sam/bam files. Sorting, indexing, exporting bam to fq (for unmapped reads), and outputting alignments (mpileup) for mapped reads.
- bedtools: We use this to find all overlapping reads per individual so we can output explicit alignments for sequence mapped reads.
- NB: The reference sequence mapping pipeline overwrites <work>/edits/*.fq.
- Here we copy the original post-step2() *.fq files to <work>/edits/*.fq. To recover these and route around refseq mapping, set step3(force=True), which will recover all reads from the original .fq file.
- Index the reference sequence.
- Where is this done?
- Per individual, map all reads to the reference sequence.
- Where is this done?
- smalt outputs a sam file of all reads, with info about mapping success, the genomic region to which reads are mapped, and quality score.
- For unmapped reads
- Where is this done?
- samtools sort, index to bam, and write out to fastq format. Drop these back into the pyrad pipeline at the start of step 3.
- For mapped reads
- Where is this done?
- samtools sort, and index to bam format.
- bedtools merge identifies all reads from overlapping regions. The bedtools output is described by: field 1 is the genomic region (i.e. chromosome), fields 2 and 3 are the start and end positions.
1 45230754 45230783
1 74956568 74956596
1 116202035 116202060
1 122618380 122618408
- samtools mpileup: for each fully overlapping region identified by bedtools merge, make a pileup of all sequences within this region. Output looks like this:
- F1 is genomic region
- F2 is base position
- F3 is reference base
- F4 is # of aligned reads at that position
- F5 is the base at that position in each aligned read
- F6 is base quality
1 116202049 G 22 ,,,,,,,,,,,,,,,,,,,,,, iiiiiiiiiiiiiiiiiiiiii
1 116202050 C 22 ,,,,,,,,,,,,,,,,,,,,,, iiiiiiiiiiiiiiiiiiiiii
1 116202051 C 22 ,,,,,,,,,,,,,,,,,,,,,, iiiiiiiiiiiiiiiiiiiiii
1 116202052 T 22 ,,,,,,,,,,,,,,,,,,,,,, oooooooooooooooooooooo
1 116202053 G 22 ,,,,,,,,,,,,,,,,,,,,,, qqqqqqqqqqqqqqqqqqqqqq
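The two outputs above (merged regions from bedtools merge and per-position rows from samtools mpileup) could be read back into Python roughly as follows. This is only a sketch: the names `Region`, `PileupRow`, `parse_merged_regions`, `region_string`, and `parse_pileup` are hypothetical and not part of ipyrad, and it assumes the bedtools coordinates follow the usual BED convention (0-based start, exclusive end), while the `-r` region string for samtools uses 1-based inclusive coordinates.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # assumed 0-based, inclusive (BED convention)
    end: int    # assumed exclusive (BED convention)


class PileupRow(NamedTuple):
    chrom: str   # F1: genomic region
    pos: int     # F2: base position
    ref: str     # F3: reference base
    depth: int   # F4: number of aligned reads
    bases: str   # F5: read bases, left as the raw mpileup string
    quals: str   # F6: base qualities, left undecoded


def parse_merged_regions(text):
    """Parse whitespace-separated `bedtools merge` output into Region tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        regions.append(Region(fields[0], int(fields[1]), int(fields[2])))
    return regions


def region_string(region):
    """Format a Region as chrom:start-end for `samtools mpileup -r`."""
    return f"{region.chrom}:{region.start + 1}-{region.end}"


def parse_pileup(text):
    """Parse mpileup rows; rows with fewer than six fields are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        chrom, pos, ref, depth, bases, quals = fields[:6]
        rows.append(PileupRow(chrom, int(pos), ref, int(depth), bases, quals))
    return rows
```

Each region string could then be passed to one `samtools mpileup -r <region>` call per merged region. The quality column is kept as a raw string because the offset used to encode it depends on how the reads were produced.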
- Do we generate this somehow or use a fragment from some other assembly? I'm going to test with a moderate sized zebra finch chromosome with hard masked repeats. Might want to think about what to actually do with low-complexity/repetitive regions.
This is how it works:
- Filter reads (demultiplex and rawedit)
- Enter step3
- Take raw fastq
- smalt map -n -o
- convert to bam: samtools view -b wat.sam > wat.bam
- sort bam: samtools sort -T wat -O bam > <same_input>
- index bam: samtools index #by default creates a <input>.bai file
- write out fastq file: samtools bam2fq wat.bam > wat.fq
- This returns each file
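The steps above could be strung together from Python along these lines. This is a sketch under assumptions, not the ipyrad implementation: `mapping_commands` and `run_commands` are hypothetical helpers, the file names are derived from a sample name for illustration, the smalt index is assumed to already exist under `index_prefix`, and the sorted bam is written to a separate file rather than over the input.

```python
import subprocess


def mapping_commands(index_prefix, reads_fq, sample, threads=4):
    """Return (argv, stdout_path) pairs for the map/convert/sort/index/export steps.

    stdout_path is None when the tool writes its own output file.
    """
    sam = f"{sample}.sam"
    bam = f"{sample}.bam"
    sorted_bam = f"{sample}.sorted.bam"
    out_fq = f"{sample}.fq"
    return [
        # map reads against the indexed reference (-n threads, -o output file)
        (["smalt", "map", "-n", str(threads), "-o", sam, index_prefix, reads_fq], None),
        # convert sam to bam; samtools view writes to stdout
        (["samtools", "view", "-b", sam], bam),
        # sort the bam, using the sample name as the temp-file prefix
        (["samtools", "sort", "-T", sample, "-O", "bam", "-o", sorted_bam, bam], None),
        # index the sorted bam
        (["samtools", "index", sorted_bam], None),
        # export reads back to fastq; bam2fq writes to stdout
        (["samtools", "bam2fq", sorted_bam], out_fq),
    ]


def run_commands(commands):
    """Run each command in order, redirecting stdout to a file where requested."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Building the commands as plain lists keeps them easy to log or inspect before anything is executed, e.g. `run_commands(mapping_commands("ref_index", "wat_edits.fq", "wat"))`.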
Currently this writes out all the sequences read in. It should be possible to write out only the reads that map, or only the unmapped reads. It would also be desirable to output some location information; right now only the sample name is written to the fq.
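One way to get both the mapped/unmapped split and the location information is to read the sam records directly and test the unmapped bit (0x4) of the FLAG field. The sketch below is an illustration only: `sam_to_fastq_records` is a hypothetical helper, and the `chrom:pos` suffix it appends to the read name of mapped reads is an assumed header layout, not an existing ipyrad convention.

```python
UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
REVERSE = 0x10   # SAM FLAG bit: SEQ is stored reverse-complemented

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq_records(sam_lines):
    """Split sam records into (mapped, unmapped) lists of fastq-formatted strings."""
    mapped, unmapped = [], []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and sam header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], fields[3]
        seq, qual = fields[9], fields[10]
        if flag & REVERSE:
            # restore the original read orientation
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        if flag & UNMAPPED:
            unmapped.append(f"@{qname}\n{seq}\n+\n{qual}\n")
        else:
            # assumed header layout: read name followed by chrom:pos
            mapped.append(f"@{qname} {rname}:{pos}\n{seq}\n+\n{qual}\n")
    return mapped, unmapped
```

The same split can be made on the command line with `samtools view -f 4` (unmapped only) and `samtools view -F 4` (mapped only), but that route does not add the location to the fastq header.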
This is how the de novo handles PE/GBS: "The 1st and 2nd reads of PE ddRAD will always be the same, since the first Illumina primer will only ligate to the sequence end with the first cutter overhang, and vice-versa. By "GBS", in my terminology, I always mean "two cut sites but only one cutter", and thus the two ends of a sequence are interchangeable, and therefore 1st and second reads are interchangeably forward or reverse stranded. So in PE-ddRAD you just have to revcomp R2 and they will both be on the same strand. In GBS, on the other hand, you should revcomp R2, but then the pair (R1, R2) on this strand could still match with a different pair (R2, R1) on the complementary strand. Does that make sense? For vsearch, this means I concatenate (R1, rv(R2)) and search only for single strand hits in ddRAD, whereas I concatenate (R1, rv(R2)) and search with --strand=both for pairGBS."
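The pairing logic in the quote can be sketched in a few lines. The helper names (`revcomp`, `merge_pair`, `strand_option`) and the datatype labels `"pairddrad"` and `"pairgbs"` are assumptions made for illustration, and the optional `spacer` argument is hypothetical; the quote itself only describes concatenating R1 with the reverse complement of R2.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(r1, r2, spacer=""):
    """Concatenate R1 with the reverse complement of R2 so both lie on one strand."""
    return r1 + spacer + revcomp(r2)


def strand_option(datatype):
    """Pick the vsearch --strand value for a paired datatype.

    Paired GBS reads are interchangeable between strands, so both strands are
    searched; paired ddRAD only needs single-strand hits.
    """
    return "both" if datatype == "pairgbs" else "plus"
```

For example, `merge_pair("AAAA", "AACG")` gives `"AAAACGTT"`, and the result of `strand_option` would be passed to vsearch as the value of `--strand`.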
Manakin Genome (1.12Gb)

| Skip | wordlen | time to index | percent mapped at 85% |
|---|---|---|---|
| -s 13 | -k 13 | 1:00 | 2.04% |
| -s 13 | -k 16 | 2:45 | 2.17% |
| -s 6 | -k 13 | 1:30 | 1.15% |
| -s 2 | -k 13 | 3:17 | 0.00% |
| -s 16 | -k 16 | 2:21 | 3.27% |
| -s 4 | -k 16 | 7:23 | 1.71% |
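A benchmark like the Manakin one above could be re-run with a small script that builds one smalt index command per skip/wordlen combination and times it. This is a sketch only: `smalt_index_command` and `time_index` are hypothetical helpers, and the index prefix and reference file name are placeholders supplied by the caller.

```python
import subprocess
import time

# (skip, wordlen) combinations taken from the Manakin benchmark above
PARAM_GRID = [(13, 13), (13, 16), (6, 13), (2, 13), (16, 16), (4, 16)]


def smalt_index_command(index_prefix, reference_fasta, wordlen, skip):
    """Build the argv for `smalt index` with word length -k and step size -s."""
    return ["smalt", "index", "-k", str(wordlen), "-s", str(skip),
            index_prefix, reference_fasta]


def time_index(index_prefix, reference_fasta, wordlen, skip):
    """Run one indexing job and return the elapsed wall-clock time in seconds."""
    argv = smalt_index_command(index_prefix, reference_fasta, wordlen, skip)
    started = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - started
```

Looping over `PARAM_GRID` and calling `time_index` for each pair would give the "time to index" column; the percent-mapped column still requires a mapping run against each index.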
Zebra Finch (1.22GB)
- Smalt -a flag will output explicit alignments and indicate deviation from reference. This could be useful. It outputs a mangled .sam file with positional information like this: QUERY: 77 AATTGATACAAATATTCC 94 REFERENCE: 181717161 AATTGATACAAATATTCC 181717178. The .sam file it writes is not parseable by samtools, though.
- The reference_sequence_path should be moved up in the params file ordering. I'm thinking it could be param 5. But maybe we should wait on reordering until after refseq branch is merged.
- Step3 could have an assembly_method toggle with three options: de_novo (default), reference, and hybrid (if we make a hybrid approach). Indexing only needs to be done for the second two approaches.
- Step3 reference assembly could create a new dir/ called reference, e.g. data1.dirs.reference
- I suppose the index is a property of the Assembly object, but not of Sample objects... we could make the reference files available as an attribute of the Assembly object, e.g., data1.
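The proposed assembly_method toggle could be validated with something as small as the following. It is a sketch of the idea in the notes above, not existing ipyrad code: `ASSEMBLY_METHODS` and `needs_reference_index` are hypothetical names, and the hybrid option is included only because the notes mention it as a possibility.

```python
# The three options proposed in the notes; de_novo is the default.
ASSEMBLY_METHODS = ("de_novo", "reference", "hybrid")


def needs_reference_index(assembly_method="de_novo"):
    """Return True when step 3 must index the reference before mapping."""
    if assembly_method not in ASSEMBLY_METHODS:
        raise ValueError(
            f"assembly_method must be one of {ASSEMBLY_METHODS}, got {assembly_method!r}"
        )
    # Only the reference and hybrid approaches map reads to a reference.
    return assembly_method != "de_novo"
```

Step 3 could call this once per Assembly and skip indexing entirely for de novo runs.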