Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redundant computation - unneccesary "trimfastq.py" step? #218

Open
jmfrank opened this issue Mar 22, 2021 · 4 comments
Open

Redundant computation - unneccesary "trimfastq.py" step? #218

jmfrank opened this issue Mar 22, 2021 · 4 comments

Comments

@jmfrank
Copy link

jmfrank commented Mar 22, 2021

Describe the bug

I've been looking at how the pipeline runs, and I noticed something strange. I'm doing something pretty simple, just running chip-seq analysis of existing data on a new reference genome. In the "call-align_R1" folder, I noticed that the pipeline calls "trimfastq.py" to trim the fastq data with an default value of 50bp. This is strange because I didn't see any option to specify this value or step in the input.md file.

This is different than the trimming for trimmomatic. It looks like a script to just trim all reads to some fixed value. The weirdness is that the output file "R1_trimmed/xxxx.trim_50bp.fastq.gz" is not actually trimmed, it's an exact copy of the original fastq file "R1/xxxx.fastq.gz". The pipeline goes on to use the R1_trimmed file for alignment.

It seems really unclear why there's a call to trimfastq with default 50bp, but it doesn't actually trim the file. Maybe it's a bug, or just a left over result of something not used anymore? I was initially surprised to see that file name containing "trim_50bp.fastq.gz" because I didn't specify such trimming. The result itself doesn't seem to be a problem, but this could be a source of other issues or undesired behavior.

I see that there's a parameter: "chip.xcor_trim_bp" with default 50bp. I'm guessing the "call-align_R1" is the step before the cross-correlation? If that's the case, then there is indeed a bug because my reads don't get trimmed!

Any clarification would be helpful! Thanks.

OS/Platform

  • OS/Platform: Ubuntu 20.04
  • Conda version: 4.9.2
  • Pipeline version: v1.7.1
  • Caper version: 1.4.2

Caper configuration file

Paste contents of ~/.caper/default.conf.
backend=local

local-hash-strat=path+modtime
local-loc-dir=/media/matt/fast_data_storage/ngs_sandbox/caper/

cromwell=/home/matt/.caper/cromwell_jar/cromwell-52.jar
womtool=/home/matt/.caper/womtool_jar/womtool-52.jar

Input JSON file

{
"chip.title" : "H1 hESC primed chip seq.",
"chip.description" : "h3k27ac chip-seq.",

"chip.pipeline_type" : "histone",
"chip.aligner" : "bowtie2",
"chip.align_only" : false,
"chip.true_rep_only" : false,

"chip.genome_tsv" : "/media/ngs/data/genomes/v13/v13.tsv",

"chip.paired_end" : true,
"chip.ctl_paired_end" : true,

"chip.always_use_pooled_ctl" : true,

"chip.fastqs_rep1_R1" : [ "SRR8983693/SRR8983693.1_1.fastq.gz"],
"chip.fastqs_rep1_R2" : [ "SRR8983693/SRR8983693.1_2.fastq.gz"],
"chip.fastqs_rep2_R1" : [ "SRR8983695/SRR8983695.1_1.fastq.gz"],
"chip.fastqs_rep2_R2" : [ "SRR8983695/SRR8983695.1_2.fastq.gz"],

"chip.ctl_fastqs_rep1_R1" : [ "SRR8983705/SRR8983705.1_1.fastq.gz"],
"chip.ctl_fastqs_rep1_R2" : [ "SRR8983705/SRR8983705.1_2.fastq.gz"]

}

@leepc12
Copy link
Contributor

leepc12 commented Mar 29, 2021

Yes, such 50bp trimming is an additional alignment step for the cross-correlation analysis only, which is not used for any other analyses. For all other analyses, FASTQs will be trimmed with Trimmomatic (if chip.crop_length is defined > 0)

@jmfrank
Copy link
Author

jmfrank commented Mar 30, 2021

Yes, I had seen that after I posted. Sorry, I think I buried the actual problem in my post.

The issue is that the 50bp trimming does not happen. When I compared "R1_trimmed/xxxx.trim_50bp.fastq.gz" and "R1/xxxx.fastq.gz", the files are exactly the same according to size and when I run bash diff.

I ran the individual command: encode_task_trim_fastq.py to confirm my result.
$ python $(which encode_task_trim_fastq.py) SRR8983693.1_1.fastq.gz --trim-bp 50 --out-dir ./
Below is the output. In bold, you can see there's a forced copy of the original fastq onto the trimmed fastq file. This occurs when there's only 1 or more lines in the fastq with a read shorter than the trim length. You can see in the output my fastq only has 1 read less than 50bp, so it performs the copy. This seems like a strange decision. It would make sense if most or many reads were less than the trim length. But if it's one read out of millions, it seems like we should remove it rather than default to the original data. I suppose this is a minor detail, but I wanted to bring it up because it confused me at first.

[2021-03-30 15:49:16,536 INFO] ['/home/matt/anaconda3/envs/encode-chip-seq-pipeline/bin/encode_task_trim_fastq.py', 'SRR8983693.1_1.fastq.gz', '--trim-bp', '50', '--out-dir', './']
[2021-03-30 15:49:16,536 INFO] Initializing and making output directory...
[2021-03-30 15:49:16,536 INFO] Trimming fastqs (50 bp)...
[2021-03-30 15:49:16,537 INFO] run_shell_cmd: PID=610685, PGID=610685, CMD=python $(which trimfastq.py) SRR8983693.1_1.fastq.gz 50 | gzip -nc > ./SRR8983693.1_1.trim_50bp.fastq.gz
[2021-03-30 15:52:17,583 INFO] PID=610685, PGID=610685, RC=0, DURATION_SEC=181.0
STDERR=/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)
STDOUT=
[2021-03-30 15:52:17,584 INFO] run_shell_cmd: PID=611272, PGID=611272, CMD=zcat -f ./SRR8983693.1_1.trim_50bp.fastq.gz | (grep 'sequences shorter than desired length' || true) | wc -l
[2021-03-30 15:52:30,792 INFO] PID=611272, PGID=611272, RC=0, DURATION_SEC=13.2
STDERR=/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)
STDOUT=1
[2021-03-30 15:52:30,793 INFO] run_shell_cmd: PID=611313, PGID=611313, CMD=cp -f SRR8983693.1_1.fastq.gz ./SRR8983693.1_1.trim_50bp.fastq.gz
[2021-03-30 15:52:32,126 INFO] PID=611313, PGID=611313, RC=0, DURATION_SEC=1.3
STDERR=/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)
STDOUT=
[2021-03-30 15:52:32,130 INFO] run_shell_cmd: PID=611315, PGID=611315, CMD=zcat -f ./SRR8983693.1_1.trim_50bp.fastq.gz | wc -l
[2021-03-30 15:52:53,978 INFO] PID=611315, PGID=611315, RC=0, DURATION_SEC=21.8
STDERR=/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)
STDOUT=94946280
[2021-03-30 15:52:53,979 INFO] List all files in output directory...
[2021-03-30 15:52:53,980 INFO] run_shell_cmd: PID=611340, PGID=611340, CMD=ls -l ./
[2021-03-30 15:52:53,982 INFO] PID=611340, PGID=611340, RC=0, DURATION_SEC=0.0
STDERR=/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)
STDOUT=total 6873768
-rw-rw-r-- 1 matt matt 1191878036 Mar 22 02:20 SRR8983693.1_1.fastq.gz
-rw-rw-r-- 1 matt matt 1191878036 Mar 30 15:52 SRR8983693.1_1.trim_50bp.fastq.gz
-rw-rw-r-- 1 matt matt 1212496349 Mar 22 02:20 SRR8983693.1_2.fastq.gz
-rw-rw-r-- 1 matt matt 3442462779 Mar 30 15:39 tmp.fastq
[2021-03-30 15:52:53,982 INFO] All done.

@leepc12
Copy link
Contributor

leepc12 commented Apr 2, 2021

/bin/bash: /home/matt/anaconda3/envs/encode-chip-seq-pipeline/lib/libtinfo.so.6: no version information available (required by /bin/bash)

Can you install sudo apt-get install libtinfo5 and try again?

Also, can you run trimfastq.py separately and see what happens to outputs?

$ source activate encode-chip-seq-pipeline
$ python $(which trimfastq.py) SRR8983693.1_1.fastq.gz 50 | gzip -nc > ./SRR8983693.1_1.trim_50bp.fastq.gz

Please check contents of the output SRR8983693.1_1.trim_50bp.fastq.gz. e.g. number of lines, first 10 lines.

@mxw010
Copy link

mxw010 commented May 19, 2021

I'll continue this because I am running into problems with xcor. on ENCODE 1.7.1 and I do not see any changelog regarding this issue.

Yes, such 50bp trimming is an additional alignment step for the cross-correlation analysis only, which is not used for any other analyses. For all other analyses, FASTQs will be trimmed with Trimmomatic (if chip.crop_length is defined > 0)

I know what call-align_R1 and call-xor are supposed to do, but where are the trimmed fastq/bam/tagAlign files saved? In the call-align_R1 folder, the read length estimation (*trim_50bp.read_length.txt) is still the same as our original read length, not 50. In call-xcor, the input tagAlign file has a similar read length, instead of the 50 it's supposed to have.

Furthermore, our fragment length average is quite often very similar to our sequencing length(both around 150). Without trimming (in the older ENCODE pipeline), xcor will inevitably fail because the phantompeak overlaps the actual peak. Trimming R1 to 50bps should solve that since the read length wouldn't be 150 anymore, but our analysis will always fail to estimate fragment size due to only having one peak which is recognized as the phantom peak. Forcing chip.xcor_exlcusion_range_max to be smaller than 100 will solve this issue and return the correct fragment size, which from my understanding is allowing actual peak to include phantom peaks.

I ran trimfastq.py separately on a raw fastq, and I do obtain a trimmed fastq file with 50bp reads. So the issue isn't there.

Looking closer, why is there this line
[2021-05-18 17:34:22,766 INFO] run_shell_cmd: PID=2170032, PGID=2170032, CMD=cp -f R1/SEG0161_S9_L002_R1_001_val_1.fastq.gz R1_trimmed/SEG0161_S9_L002_R1_001_val_1.trim_50bp.fastq.gz

in stdout inside call-align_R1/execution?

Further down the line,

[2021-05-18 17:34:47,849 INFO] run_shell_cmd: PID=2170057, PGID=2170057, CMD=ls -l R1_trimmed [2021-05-18 17:34:47,872 INFO] PID=2170057, PGID=2170057, RC=0, DURATION_SEC=0.0 STDERR= STDOUT=total 998240 -rw-r--r-- 1 mxw010 gdstantonlab 1022182282 May 18 17:34 SEG0161_S9_L002_R1_001_val_1.trim_50bp.fastq.gz

Our original fastq,

-rw-r--r-- 1 mxw010 gdstantonlab 1023271307 May 18 10:43 SEG0161_S9_L002_R1_001_val_1.fq.gz

Those two are basically the same size. So there is something in the pipeline that broke the trim step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants