TODO Step3

Reference Sequence Mapping

Requirements: All these binaries are included in the default install of iPyrad.

smalt: This is the aligner we'll use to map reads to the reference sequence.
samtools: Manipulate sam/bam files. Sorting, indexing, exporting bam to fq (for unmapped reads), and outputting alignments (mpileup) for mapped reads.
bedtools: We use this to find all overlapping reads per individual so we can output explicit alignments for sequence mapped reads.

Reference Sequence Mapping Process

NB: Reference sequence mapping pipeline overwrites <work>/edits/*.fq.
Here we copy the original post-step2() *.fq files to <work>/edits/*.fq. To recover these, and route around refseq mapping set step3(force=True) which will recover all reads from the original .fq file.

Index the reference sequence.

Where is this done

Per individual, map all reads to the reference sequence

Where is this done
smalt outputs a sam file of all reads, with info about mapping success, genomic region to which reads are mapped, and quality score.

For unmapped reads

where is this done
samtools sort, index to bam, and write out to fastq format. Drop these back into the pyrad pipeline at the start of step 3.

For mapped reads

where is this done
samtools sort, and index to bam format.
bedtools merge identifies all reads from overlapping regions. bedtools output described by: Field 1 genomic region(i.e. chromosome), fields 2 and 3 are start and end regions.

     1       45230754        45230783
     1       74956568        74956596
     1       116202035       116202060
     1       122618380       122618408

samtools mpileup for each fully overlapping region identified by bedtools merge, make a pilelup of all sequences within this region. Output looks like this:

F1 is genomic region
F2 is Base position
F3 is reference base
F4 is # of aligned reads at that position
F5 base at that position of aligned reads
F6 is base quality

1       116202049       G       22      ,,,,,,,,,,,,,,,,,,,,,,  iiiiiiiiiiiiiiiiiiiiii
1       116202050       C       22      ,,,,,,,,,,,,,,,,,,,,,,  iiiiiiiiiiiiiiiiiiiiii
1       116202051       C       22      ,,,,,,,,,,,,,,,,,,,,,,  iiiiiiiiiiiiiiiiiiiiii
1       116202052       T       22      ,,,,,,,,,,,,,,,,,,,,,,  oooooooooooooooooooooo
1       116202053       G       22      ,,,,,,,,,,,,,,,,,,,,,,  qqqqqqqqqqqqqqqqqqqqqq

Test reference sequence file

Do we generate this somehow or use a fragment from some other assembly? I'm going to test with a moderate sized zebra finch chromosome with hard masked repeats. Might want to think about what to actually do with low-complexity/repetitive regions.

This is how it works:

Filter reads (demultiplex and rawedit)
Enter step3
Take raw fastq
smalt map -n -o
convert to bam: samtools view -b wat.sam > wat.bam
sort bam: samtools sort -T wat -O bam > <same_input>
index bam: samtools index #by default creates a <input.idx> file
write out fastq file: samtools bam2fq wat.bam > wat.fq
This returns each file

Currently this writes out all the sequences read in. It should be possible to filter only those reads that map, as well as unmapped reads. It should also be desirable to output some location information, right now it only outputs sample name in the fq.

Handling PE data

This is how the de novo handles PE/GBS: "The 1st and 2nd reads of PE ddRAD will always be the same, since the first Illumina primer will only ligate to the sequence end with the first cutter overhang, and vice-versa. By "GBS", in my terminology, I always mean "two cut sites but only one cutter", and thus the two ends of a sequence are interchangeable, and therefore 1st and second reads are interchangeably forward or reverse stranded. So in PE-ddRAD you just have to revcomp R2 and they will both be on the same strand. In GBS, on the other hand, you should revcomp R2, but then the pair (R1, R2) on this strand could still match with a different pair (R2, R1) on the complementary strand. Does that make sense? For vsearch, this means I concatenate (R1, rv(R2)) and search only for single strand hits in ddRAD, whereas I concatenate (R1, rv(R2)) and search with --strand=both for pairGBS."

Performance

Indexing

Manakin Genome (1.12Gb) Skip - wordlen - time to index - percent mapped at 85% -s 13 -k 13 - 1:00 - 2.04% -s 13 -k 16 - 2:45 - 2.17% -s 6 -k 13 - 1:30 - 1.15% -s 2 -k 13 - 3:17 - 0.00% -s 16 -k 16 - 2:21 - 3.27% -s 4 -k 16 - 7:23 - 1.71%

Zebra Finch (1.22GB)

Mapping

NB

Smalt -a flag will output explicit alignments and indicate deviation from reference. This could be useful. It outputs a mangled .sam file with positional information like this:
```
  QUERY:         77 AATTGATACAAATATTCC 94

  REFERENCE:  181717161 AATTGATACAAATATTCC 181717178
```

.sam file it writes is not parseable by samtools though.

Samtools mpileup debugging

Thoughts on organization/objects

The reference_sequence_path should be moved up in the params file ordering. I'm thinking it could be param 5. But maybe we should wait on reordering until after refseq branch is merged.
Step3 could have an assembly_method toggle with three options: de_novo (default), reference, and hybrid (if we make a hybrid approach). Indexing only needs to be done for the second two approaches.
Step3 reference assembly could create a new dir/ called reference, e.g. data1.dirs.reference
I suppose the index is a property of the Assembly object, but not of Sample objects... we could make the reference files available as an attribute of the Assembly object, e.g., data1.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly