Redundant computation - unneccesary "trimfastq.py" step? #218

jmfrank · 2021-03-22T22:59:10Z

Describe the bug

I've been looking at how the pipeline runs, and I noticed something strange. I'm doing something pretty simple, just running chip-seq analysis of existing data on a new reference genome. In the "call-align_R1" folder, I noticed that the pipeline calls "trimfastq.py" to trim the fastq data with an default value of 50bp. This is strange because I didn't see any option to specify this value or step in the input.md file.

This is different than the trimming for trimmomatic. It looks like a script to just trim all reads to some fixed value. The weirdness is that the output file "R1_trimmed/xxxx.trim_50bp.fastq.gz" is not actually trimmed, it's an exact copy of the original fastq file "R1/xxxx.fastq.gz". The pipeline goes on to use the R1_trimmed file for alignment.

It seems really unclear why there's a call to trimfastq with default 50bp, but it doesn't actually trim the file. Maybe it's a bug, or just a left over result of something not used anymore? I was initially surprised to see that file name containing "trim_50bp.fastq.gz" because I didn't specify such trimming. The result itself doesn't seem to be a problem, but this could be a source of other issues or undesired behavior.

I see that there's a parameter: "chip.xcor_trim_bp" with default 50bp. I'm guessing the "call-align_R1" is the step before the cross-correlation? If that's the case, then there is indeed a bug because my reads don't get trimmed!

Any clarification would be helpful! Thanks.

OS/Platform

OS/Platform: Ubuntu 20.04
Conda version: 4.9.2
Pipeline version: v1.7.1
Caper version: 1.4.2

Caper configuration file

Paste contents of ~/.caper/default.conf.
backend=local

local-hash-strat=path+modtime
local-loc-dir=/media/matt/fast_data_storage/ngs_sandbox/caper/

cromwell=/home/matt/.caper/cromwell_jar/cromwell-52.jar
womtool=/home/matt/.caper/womtool_jar/womtool-52.jar

Input JSON file

{
"chip.title" : "H1 hESC primed chip seq.",
"chip.description" : "h3k27ac chip-seq.",

"chip.pipeline_type" : "histone",
"chip.aligner" : "bowtie2",
"chip.align_only" : false,
"chip.true_rep_only" : false,

"chip.genome_tsv" : "/media/ngs/data/genomes/v13/v13.tsv",

"chip.paired_end" : true,
"chip.ctl_paired_end" : true,

"chip.always_use_pooled_ctl" : true,

"chip.fastqs_rep1_R1" : [ "SRR8983693/SRR8983693.1_1.fastq.gz"],
"chip.fastqs_rep1_R2" : [ "SRR8983693/SRR8983693.1_2.fastq.gz"],
"chip.fastqs_rep2_R1" : [ "SRR8983695/SRR8983695.1_1.fastq.gz"],
"chip.fastqs_rep2_R2" : [ "SRR8983695/SRR8983695.1_2.fastq.gz"],

"chip.ctl_fastqs_rep1_R1" : [ "SRR8983705/SRR8983705.1_1.fastq.gz"],
"chip.ctl_fastqs_rep1_R2" : [ "SRR8983705/SRR8983705.1_2.fastq.gz"]

}

The text was updated successfully, but these errors were encountered:

leepc12 · 2021-03-29T17:17:22Z

Yes, such 50bp trimming is an additional alignment step for the cross-correlation analysis only, which is not used for any other analyses. For all other analyses, FASTQs will be trimmed with Trimmomatic (if chip.crop_length is defined > 0)

jmfrank · 2021-03-30T23:07:35Z

Yes, I had seen that after I posted. Sorry, I think I buried the actual problem in my post.

The issue is that the 50bp trimming does not happen. When I compared "R1_trimmed/xxxx.trim_50bp.fastq.gz" and "R1/xxxx.fastq.gz", the files are exactly the same according to size and when I run bash diff.

I ran the individual command: encode_task_trim_fastq.py to confirm my result.
$ python $(which encode_task_trim_fastq.py) SRR8983693.1_1.fastq.gz --trim-bp 50 --out-dir ./
Below is the output. In bold, you can see there's a forced copy of the original fastq onto the trimmed fastq file. This occurs when there's only 1 or more lines in the fastq with a read shorter than the trim length. You can see in the output my fastq only has 1 read less than 50bp, so it performs the copy. This seems like a strange decision. It would make sense if most or many reads were less than the trim length. But if it's one read out of millions, it seems like we should remove it rather than default to the original data. I suppose this is a minor detail, but I wanted to bring it up because it confused me at first.

[2021-03-30 15:49:16,536 INFO] ['/home/matt/anaconda3/envs/encode-chip-seq-pipeline/bin/encode_task_trim_fastq.py', 'SRR8983693.1_1.fastq.gz', '--trim-bp', '50', '--out-dir', './']
[2021-03-30 15:49:16,536 INFO] Initializing and making output directory...
[2021-03-30 15:49:16,536 INFO] Trimming fastqs (50 bp)...
[2021-03-30 15:49:16,537 INFO] run_shell_cmd: PID=610685, PGID=610685, CMD=python $(which trimfastq.py) SRR8983693.1_1.fastq.gz 50 | gzip -nc > ./SRR8983693.1_1.trim_50bp.fastq.gz
[2021-03-30 15:52:17,583 INFO] PID=610685, PGID=610685, RC=0, DURATION_SEC=181.0
STDERR=/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)
STDOUT=
[2021-03-30 15:52:17,584 INFO] run_shell_cmd: PID=611272, PGID=611272, CMD=zcat -f ./SRR8983693.1_1.trim_50bp.fastq.gz | (grep 'sequences shorter than desired length' || true) | wc -l
[2021-03-30 15:52:30,792 INFO] PID=611272, PGID=611272, RC=0, DURATION_SEC=13.2
STDERR=/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)
STDOUT=1
[2021-03-30 15:52:30,793 INFO] run_shell_cmd: PID=611313, PGID=611313, CMD=cp -f SRR8983693.1_1.fastq.gz ./SRR8983693.1_1.trim_50bp.fastq.gz
[2021-03-30 15:52:32,126 INFO] PID=611313, PGID=611313, RC=0, DURATION_SEC=1.3
STDERR=/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)
STDOUT=
[2021-03-30 15:52:32,130 INFO] run_shell_cmd: PID=611315, PGID=611315, CMD=zcat -f ./SRR8983693.1_1.trim_50bp.fastq.gz | wc -l
[2021-03-30 15:52:53,978 INFO] PID=611315, PGID=611315, RC=0, DURATION_SEC=21.8
STDERR=/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)
STDOUT=94946280
[2021-03-30 15:52:53,979 INFO] List all files in output directory...
[2021-03-30 15:52:53,980 INFO] run_shell_cmd: PID=611340, PGID=611340, CMD=ls -l ./
[2021-03-30 15:52:53,982 INFO] PID=611340, PGID=611340, RC=0, DURATION_SEC=0.0
STDERR=/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)
STDOUT=total 6873768
-rw-rw-r-- 1 matt matt 1191878036 Mar 22 02:20 SRR8983693.1_1.fastq.gz
-rw-rw-r-- 1 matt matt 1191878036 Mar 30 15:52 SRR8983693.1_1.trim_50bp.fastq.gz
-rw-rw-r-- 1 matt matt 1212496349 Mar 22 02:20 SRR8983693.1_2.fastq.gz
-rw-rw-r-- 1 matt matt 3442462779 Mar 30 15:39 tmp.fastq
[2021-03-30 15:52:53,982 INFO] All done.

leepc12 · 2021-04-02T22:34:13Z

/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)

Can you install sudo apt-get install libtinfo5 and try again?

Also, can you run trimfastq.py separately and see what happens to outputs?

$ source activate encode-chip-seq-pipeline
$ python $(which trimfastq.py) SRR8983693.1_1.fastq.gz 50 | gzip -nc > ./SRR8983693.1_1.trim_50bp.fastq.gz

Please check contents of the output SRR8983693.1_1.trim_50bp.fastq.gz. e.g. number of lines, first 10 lines.

mxw010 · 2021-05-19T19:52:18Z

I'll continue this because I am running into problems with xcor. on ENCODE 1.7.1 and I do not see any changelog regarding this issue.

Yes, such 50bp trimming is an additional alignment step for the cross-correlation analysis only, which is not used for any other analyses. For all other analyses, FASTQs will be trimmed with Trimmomatic (if chip.crop_length is defined > 0)

I know what call-align_R1 and call-xor are supposed to do, but where are the trimmed fastq/bam/tagAlign files saved? In the call-align_R1 folder, the read length estimation (*trim_50bp.read_length.txt) is still the same as our original read length, not 50. In call-xcor, the input tagAlign file has a similar read length, instead of the 50 it's supposed to have.

Furthermore, our fragment length average is quite often very similar to our sequencing length(both around 150). Without trimming (in the older ENCODE pipeline), xcor will inevitably fail because the phantompeak overlaps the actual peak. Trimming R1 to 50bps should solve that since the read length wouldn't be 150 anymore, but our analysis will always fail to estimate fragment size due to only having one peak which is recognized as the phantom peak. Forcing chip.xcor_exlcusion_range_max to be smaller than 100 will solve this issue and return the correct fragment size, which from my understanding is allowing actual peak to include phantom peaks.

I ran trimfastq.py separately on a raw fastq, and I do obtain a trimmed fastq file with 50bp reads. So the issue isn't there.

Looking closer, why is there this line
[2021-05-18 17:34:22,766 INFO] run_shell_cmd: PID=2170032, PGID=2170032, CMD=cp -f R1/SEG0161_S9_L002_R1_001_val_1.fastq.gz R1_trimmed/SEG0161_S9_L002_R1_001_val_1.trim_50bp.fastq.gz

in stdout inside call-align_R1/execution?

Further down the line,

[2021-05-18 17:34:47,849 INFO] run_shell_cmd: PID=2170057, PGID=2170057, CMD=ls -l R1_trimmed [2021-05-18 17:34:47,872 INFO] PID=2170057, PGID=2170057, RC=0, DURATION_SEC=0.0 STDERR= STDOUT=total 998240 -rw-r--r-- 1 mxw010 gdstantonlab 1022182282 May 18 17:34 SEG0161_S9_L002_R1_001_val_1.trim_50bp.fastq.gz

Our original fastq,

-rw-r--r-- 1 mxw010 gdstantonlab 1023271307 May 18 10:43 SEG0161_S9_L002_R1_001_val_1.fq.gz

Those two are basically the same size. So there is something in the pipeline that broke the trim step.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redundant computation - unneccesary "trimfastq.py" step? #218

Redundant computation - unneccesary "trimfastq.py" step? #218

jmfrank commented Mar 22, 2021 •

edited

Loading

leepc12 commented Mar 29, 2021

jmfrank commented Mar 30, 2021

leepc12 commented Apr 2, 2021

mxw010 commented May 19, 2021 •

edited

Loading

Redundant computation - unneccesary "trimfastq.py" step? #218

Redundant computation - unneccesary "trimfastq.py" step? #218

Comments

jmfrank commented Mar 22, 2021 • edited Loading

Describe the bug

OS/Platform

Caper configuration file

Input JSON file

leepc12 commented Mar 29, 2021

jmfrank commented Mar 30, 2021

leepc12 commented Apr 2, 2021

mxw010 commented May 19, 2021 • edited Loading

jmfrank commented Mar 22, 2021 •

edited

Loading

mxw010 commented May 19, 2021 •

edited

Loading