Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent results depending on whether SEQ and QUAL fields are set #17

Open
marc-jones opened this issue Sep 30, 2021 · 0 comments
Open

Comments

@marc-jones
Copy link

I came across this bug when using Piranha to analyse eCLIP data. I have already filtered for the second read with samtools (which is the read used to determine the crosslink site), and was using the BAM file from that filtering as input to Piranha. However, I found that the read binning was giving inaccurate results. Through a bit of exploration, I found that this was caused by the SEQ and QUAL fields.

I've uploaded two BAM files, which were generated as follows:

# retrieve the header
samtools view -H data/HepG2_RBFOX2.bam > output/header
# get the first 10,000 reads from chr1
samtools view data/HepG2_RBFOX2.bam | grep "chr1\t" | head -10000 > output/reads
# replace the SEQ and QUAL fields with *
cat output/reads | awk 'BEGIN{OFS="\t"}{$10="*"; $11="*"; print}' > output/reads_no_seq

# combine the header and the reads and convert to BAM files
cat output/header output/reads | samtools view -b > output/reads.bam
cat output/header output/reads_no_seq | samtools view -b > output/reads_no_seq.bam

# run Piranha for each bam file
docker run -v $(pwd):/mount quay.io/biocontainers/piranha:1.2.1--h4f60aae_8 \
    Piranha -v -s -b 10 -o /mount/output/reads.bed /mount/output/reads.bam
docker run -v $(pwd):/mount quay.io/biocontainers/piranha:1.2.1--h4f60aae_8 \
    Piranha -v -s -b 10 -o /mount/output/reads_no_seq.bed /mount/output/reads_no_seq.bam

The BAM files look as follows:

$ samtools view output/reads.bam | head -2
CCTTG:SN1001:449:HGTN3ADXX:1:1206:8464:69989    147     chr1    14771   255     43M     =       14681   -133    CACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGT     B<FFFFFB<0<<<IIFBF<07FFFBFIFFFFFBB<B<BBFFFB     NH:i:1  HI:i:1  AS:i:80 nM:i:0  NM:i:0  MD:Z:43 jM:B:c,-1       jI:B:i,-1       RG:Z:foo-44A07990
GAGAG:SN1001:449:HGTN3ADXX:2:2103:11297:36292   147     chr1    16211   255     43M     =       16137   -117    TAGCCATGCTCTGACAGTCTCAGTTGTACACACGAGCCAGCAG     FBBFFBBFBFBBFFFFFBFIFFFFFFBFFFFBFFFFFFFFFFB     NH:i:1  HI:i:1  AS:i:76 nM:i:1  NM:i:1  MD:Z:26C16      jM:B:c,-1       jI:B:i,-1       RG:Z:foo-44A07990
$ samtools view reads_no_seq.bam | head -2
CCTTG:SN1001:449:HGTN3ADXX:1:1206:8464:69989    147     chr1    14771   255     43M     =       14681   -133    *       *       NH:i:1  HI:i:1  AS:i:80 nM:i:0  NM:i:0  MD:Z:43 jM:B:c,-1       jI:B:i,-1       RG:Z:foo-44A07990
GAGAG:SN1001:449:HGTN3ADXX:2:2103:11297:36292   147     chr1    16211   255     43M     =       16137   -117    *       *       NH:i:1  HI:i:1  AS:i:76 nM:i:1  NM:i:1  MD:Z:26C16      jM:B:c,-1       jI:B:i,-1       RG:Z:foo-44A07990

And the outputs from Piranha are:

$ head output/reads.bed 
chr1    630710  630720  X       26      +       2.0749e-06
chr1    632620  632630  X       11      +       0.00152753
chr1    846530  846540  X       33      +       4.3324e-08
chr1    960080  960090  X       16      +       0.000144504
chr1    960420  960430  X       11      +       0.00152753
chr1    960440  960450  X       13      +       0.000637789
chr1    960890  960900  X       16      +       0.000144504
chr1    961020  961030  X       11      +       0.00152753
chr1    961090  961100  X       14      +       0.000398054
chr1    961150  961160  X       12      +       0.000983474
$ head output/reads_no_seq.bed 
chr1    630710  630720  X       97      +       1.17462e-14
chr1    633990  634000  X       22      +       0.000243025
chr1    846520  846540  X       29      +       6.64929e-05
chr1    960420  960430  X       18      +       0.000666816
chr1    960440  960450  X       18      +       0.000666816
chr1    960810  960830  X       19      +       0.000536713
chr1    960870  960880  X       20      +       0.000406609
chr1    960890  960920  X       24.6667 +       0.000146072
chr1    962160  962170  X       19      +       0.000529893
chr1    1038700 1038710 X       19      +       0.000529893

The read counts in reads_no_seq.bed are consistently higher, and seem to correspond to the correct number of reads in that window (from eyeballing the reads in a genome browser).

The first window, 630710-20 shows a particularly large disparity, so I looked at the reads in that window in more detail (attached). There were 97, as indicated in the reads_no_seq.bed sample, However, I couldn't work out how 26 reads were counted for that region!

reads_no_seq.bam.gz
reads.bam.gz
chr1_630710_20_reads.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant