Skip to content
Isaac Overcast edited this page Jan 22, 2016 · 15 revisions

Reference Sequence Mapping

Requirements: All these binaries are included in the default install of iPyrad.

  • smalt: This is the aligner we'll use to map reads to the reference sequence.
  • samtools: Manipulate sam/bam files. Sorting, indexing, exporting bam to fq (for unmapped reads), and outputting alignments (mpileup) for mapped reads.
  • bedtools: We use this to find all overlapping reads per individual so we can output explicit alignments for sequence mapped reads.

Reference Sequence Mapping Process

  • NB: Reference sequence mapping pipeline overwrites <work>/edits/*.fq.
  • Here we copy the original post-step2() *.fq files to <work>/edits/*.fq. To recover these, and route around refseq mapping set step3(force=True) which will recover all reads from the original .fq file.
  1. Index the reference sequence.
  • Where is this done
  1. Per individual, map all reads to the reference sequence
  • Where is this done
  • smalt outputs a sam file of all reads, with info about mapping success, genomic region to which reads are mapped, and quality score.
  1. For unmapped reads
  • where is this done
  • samtools sort, index to bam, and write out to fastq format. Drop these back into the pyrad pipeline at the start of step 3.
  1. For mapped reads
  • where is this done
  • samtools sort, and index to bam format.
  • bedtools merge identifies all reads from overlapping regions. bedtools output described by: Field 1 genomic region(i.e. chromosome), fields 2 and 3 are start and end regions.
     1       45230754        45230783
     1       74956568        74956596
     1       116202035       116202060
     1       122618380       122618408
  • samtools mpileup for each fully overlapping region identified by bedtools merge, make a pilelup of all sequences within this region. Output looks like this:
  • F1 is genomic region
  • F2 is Base position
  • F3 is reference base
  • F4 is # of aligned reads at that position
  • F5 base at that position of aligned reads
  • F6 is base quality
1       116202049       G       22      ,,,,,,,,,,,,,,,,,,,,,,  iiiiiiiiiiiiiiiiiiiiii
1       116202050       C       22      ,,,,,,,,,,,,,,,,,,,,,,  iiiiiiiiiiiiiiiiiiiiii
1       116202051       C       22      ,,,,,,,,,,,,,,,,,,,,,,  iiiiiiiiiiiiiiiiiiiiii
1       116202052       T       22      ,,,,,,,,,,,,,,,,,,,,,,  oooooooooooooooooooooo
1       116202053       G       22      ,,,,,,,,,,,,,,,,,,,,,,  qqqqqqqqqqqqqqqqqqqqqq

Test reference sequence file

  • Do we generate this somehow or use a fragment from some other assembly? I'm going to test with a moderate sized zebra finch chromosome with hard masked repeats. Might want to think about what to actually do with low-complexity/repetitive regions.

This is how it works:

  • Filter reads (demultiplex and rawedit)
  • Enter step3
  • Take raw fastq
  • smalt map -n -o
  • convert to bam: samtools view -b wat.sam > wat.bam
  • sort bam: samtools sort -T wat -O bam > <same_input>
  • index bam: samtools index #by default creates a <input.idx> file
  • write out fastq file: samtools bam2fq wat.bam > wat.fq
  • This returns each file

Currently this writes out all the sequences read in. It should be possible to filter only those reads that map, as well as unmapped reads. It should also be desirable to output some location information, right now it only outputs sample name in the fq.

Handling PE data

This is how the de novo handles PE/GBS: "The 1st and 2nd reads of PE ddRAD will always be the same, since the first Illumina primer will only ligate to the sequence end with the first cutter overhang, and vice-versa. By "GBS", in my terminology, I always mean "two cut sites but only one cutter", and thus the two ends of a sequence are interchangeable, and therefore 1st and second reads are interchangeably forward or reverse stranded. So in PE-ddRAD you just have to revcomp R2 and they will both be on the same strand. In GBS, on the other hand, you should revcomp R2, but then the pair (R1, R2) on this strand could still match with a different pair (R2, R1) on the complementary strand. Does that make sense? For vsearch, this means I concatenate (R1, rv(R2)) and search only for single strand hits in ddRAD, whereas I concatenate (R1, rv(R2)) and search with --strand=both for pairGBS."

Performance

Indexing

Manakin Genome (1.12Gb) Skip - wordlen - time to index - percent mapped at 85% -s 13 -k 13 - 1:00 - 2.04% -s 13 -k 16 - 2:45 - 2.17% -s 6 -k 13 - 1:30 - 1.15% -s 2 -k 13 - 3:17 - 0.00% -s 16 -k 16 - 2:21 - 3.27% -s 4 -k 16 - 7:23 - 1.71%

Zebra Finch (1.22GB)

Mapping

NB

  • Smalt -a flag will output explicit alignments and indicate deviation from reference. This could be useful. It outputs a mangled .sam file with positional information like this:

      QUERY:         77 AATTGATACAAATATTCC 94
    
      REFERENCE:  181717161 AATTGATACAAATATTCC 181717178
    

.sam file it writes is not parseable by samtools though.

Thoughts on organization/objects

  • The reference_sequence_path should be moved up in the params file ordering. I'm thinking it could be param 5. But maybe we should wait on reordering until after refseq branch is merged.
  • Step3 could have an assembly_method toggle with three options: de_novo (default), reference, and hybrid (if we make a hybrid approach). Indexing only needs to be done for the second two approaches.
  • Step3 reference assembly could create a new dir/ called reference, e.g. data1.dirs.reference
  • I suppose the index is a property of the Assembly object, but not of Sample objects... we could make the reference files available as an attribute of the Assembly object, e.g., data1.