You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I came across this bug when using Piranha to analyse eCLIP data. I have already filtered for the second read with samtools (which is the read used to determine the crosslink site), and was using the BAM file from that filtering as input to Piranha. However, I found that the read binning was giving inaccurate results. Through a bit of exploration, I found that this was caused by the SEQ and QUAL fields.
I've uploaded two BAM files, which were generated as follows:
# retrieve the header
samtools view -H data/HepG2_RBFOX2.bam > output/header
# get the first 10,000 reads from chr1
samtools view data/HepG2_RBFOX2.bam | grep "chr1\t" | head -10000 > output/reads
# replace the SEQ and QUAL fields with *
cat output/reads | awk 'BEGIN{OFS="\t"}{$10="*"; $11="*"; print}' > output/reads_no_seq
# combine the header and the reads and convert to BAM files
cat output/header output/reads | samtools view -b > output/reads.bam
cat output/header output/reads_no_seq | samtools view -b > output/reads_no_seq.bam
# run Piranha for each bam file
docker run -v $(pwd):/mount quay.io/biocontainers/piranha:1.2.1--h4f60aae_8 \
Piranha -v -s -b 10 -o /mount/output/reads.bed /mount/output/reads.bam
docker run -v $(pwd):/mount quay.io/biocontainers/piranha:1.2.1--h4f60aae_8 \
Piranha -v -s -b 10 -o /mount/output/reads_no_seq.bed /mount/output/reads_no_seq.bam
$ head output/reads.bed
chr1 630710 630720 X 26 + 2.0749e-06
chr1 632620 632630 X 11 + 0.00152753
chr1 846530 846540 X 33 + 4.3324e-08
chr1 960080 960090 X 16 + 0.000144504
chr1 960420 960430 X 11 + 0.00152753
chr1 960440 960450 X 13 + 0.000637789
chr1 960890 960900 X 16 + 0.000144504
chr1 961020 961030 X 11 + 0.00152753
chr1 961090 961100 X 14 + 0.000398054
chr1 961150 961160 X 12 + 0.000983474
$ head output/reads_no_seq.bed
chr1 630710 630720 X 97 + 1.17462e-14
chr1 633990 634000 X 22 + 0.000243025
chr1 846520 846540 X 29 + 6.64929e-05
chr1 960420 960430 X 18 + 0.000666816
chr1 960440 960450 X 18 + 0.000666816
chr1 960810 960830 X 19 + 0.000536713
chr1 960870 960880 X 20 + 0.000406609
chr1 960890 960920 X 24.6667 + 0.000146072
chr1 962160 962170 X 19 + 0.000529893
chr1 1038700 1038710 X 19 + 0.000529893
The read counts in reads_no_seq.bed are consistently higher, and seem to correspond to the correct number of reads in that window (from eyeballing the reads in a genome browser).
The first window, 630710-20 shows a particularly large disparity, so I looked at the reads in that window in more detail (attached). There were 97, as indicated in the reads_no_seq.bed sample, However, I couldn't work out how 26 reads were counted for that region!
I came across this bug when using Piranha to analyse eCLIP data. I have already filtered for the second read with
samtools
(which is the read used to determine the crosslink site), and was using the BAM file from that filtering as input to Piranha. However, I found that the read binning was giving inaccurate results. Through a bit of exploration, I found that this was caused by the SEQ and QUAL fields.I've uploaded two BAM files, which were generated as follows:
The BAM files look as follows:
And the outputs from Piranha are:
The read counts in
reads_no_seq.bed
are consistently higher, and seem to correspond to the correct number of reads in that window (from eyeballing the reads in a genome browser).The first window, 630710-20 shows a particularly large disparity, so I looked at the reads in that window in more detail (attached). There were 97, as indicated in the
reads_no_seq.bed
sample, However, I couldn't work out how 26 reads were counted for that region!reads_no_seq.bam.gz
reads.bam.gz
chr1_630710_20_reads.txt
The text was updated successfully, but these errors were encountered: