Redundant computation - unnecessary "trimfastq.py" step? #218
Comments
Yes, this 50bp trimming feeds an additional alignment step that is used only for the cross-correlation analysis, not for any other analysis. For all other analyses, FASTQs will be trimmed with Trimmomatic (if
Yes, I had seen that after I posted. Sorry, I think I buried the actual problem in my post. The issue is that the 50bp trimming does not happen. When I compared "R1_trimmed/xxxx.trim_50bp.fastq.gz" and "R1/xxxx.fastq.gz", the files were exactly the same, both by size and according to diff. I ran the individual command encode_task_trim_fastq.py to confirm my result:
[2021-03-30 15:49:16,536 INFO] ['/home/matt/anaconda3/envs/encode-chip-seq-pipeline/bin/encode_task_trim_fastq.py', 'SRR8983693.1_1.fastq.gz', '--trim-bp', '50', '--out-dir', './']
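A quick way to confirm whether a file named "trim_50bp" really holds 50bp reads is to tabulate read lengths directly, rather than relying on file size or diff. The following is a minimal sketch, not part of the pipeline: the function name `read_length_counts` and the `max_reads` cap are hypothetical, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```python
import gzip
from collections import Counter


def read_length_counts(fastq_gz_path, max_reads=100000):
    """Count read lengths in the first max_reads records of a gzipped FASTQ.

    Assumes each record is exactly 4 lines (header, sequence, '+', quality).
    """
    counts = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 1:  # second line of each record is the sequence
                counts[len(line.rstrip("\n"))] += 1
    return counts
```

Running it on both "R1/xxxx.fastq.gz" and "R1_trimmed/xxxx.trim_50bp.fastq.gz" and comparing the two results would show whether any read was actually shortened.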
Can you install Also, can you run trimfastq.py separately and see what happens to outputs?
Please check contents of the output
I'll continue this thread because I am running into problems with xcor on ENCODE 1.7.1, and I do not see any changelog entry regarding this issue.
I know what call-align_R1 and call-xcor are supposed to do, but where are the trimmed fastq/bam/tagAlign files saved? In the call-align_R1 folder, the read length estimate (*trim_50bp.read_length.txt) is still the same as our original read length, not 50. In call-xcor, the input tagAlign file has a similar read length instead of the 50 it is supposed to have.
Furthermore, our average fragment length is quite often very similar to our sequencing read length (both around 150). Without trimming (as in the older ENCODE pipeline), xcor will inevitably fail because the phantom peak overlaps the actual peak. Trimming R1 to 50bp should solve that, since the read length would no longer be 150, but our analysis always fails to estimate fragment size because there is only one peak, and it is recognized as the phantom peak. Forcing chip.xcor_exclusion_range_max to be smaller than 100 solves this issue and returns the correct fragment size, which from my understanding works by allowing the actual peak to include the phantom peak.
I ran trimfastq.py separately on a raw fastq, and I do obtain a trimmed fastq file with 50bp reads, so the issue isn't there. Looking closer, why is there this line in stdout inside call-align_R1/execution? Further down the line,
Our original fastq,
Those two are basically the same size. So there is something in the pipeline that broke the trim step.
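The same kind of check can be made on the tagAlign file that call-xcor receives. tagAlign is a BED-style, tab-separated format whose second and third columns are the start and end coordinates, so end minus start gives the aligned span of each read. Below is a minimal sketch under that assumption; the function name `tagalign_length_counts` and the `max_lines` cap are hypothetical, and the aligned span only equals the read length for ungapped alignments.

```python
import gzip
from collections import Counter


def tagalign_length_counts(path, max_lines=100000):
    """Count aligned read spans (end - start) in a tagAlign/BED file.

    Handles both plain-text and gzipped input, chosen by file extension.
    """
    counts = Counter()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i >= max_lines:
                break
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            counts[int(fields[2]) - int(fields[1])] += 1
    return counts
```

If the trimming worked, the counts should be concentrated at 50; a peak near the original read length (around 150 here) would suggest untrimmed reads reached the cross-correlation step.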
Describe the bug
I've been looking at how the pipeline runs, and I noticed something strange. I'm doing something pretty simple, just running ChIP-seq analysis of existing data on a new reference genome. In the "call-align_R1" folder, I noticed that the pipeline calls "trimfastq.py" to trim the fastq data with a default value of 50bp. This is strange because I didn't see any option to specify this value or step in the input.md file.
This is different from the Trimmomatic trimming. It looks like a script that just trims all reads to some fixed length. The weird part is that the output file "R1_trimmed/xxxx.trim_50bp.fastq.gz" is not actually trimmed: it is an exact copy of the original fastq file "R1/xxxx.fastq.gz". The pipeline goes on to use the R1_trimmed file for alignment.
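For reference, fixed-length trimming of this kind amounts to truncating the sequence and quality lines of every record. The sketch below only illustrates that idea and is not the implementation of trimfastq.py: the function name `trim_fastq` is hypothetical, it assumes 4-line FASTQ records, and it leaves reads shorter than `trim_bp` unchanged, which the real script may handle differently.

```python
import gzip


def trim_fastq(in_path, out_path, trim_bp=50):
    """Truncate every read and its quality string to at most trim_bp bases."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for i, line in enumerate(src):
            text = line.rstrip("\n")
            if i % 4 in (1, 3):  # sequence line and quality line
                text = text[:trim_bp]
            dst.write(text + "\n")
```

An output produced this way can never be byte-identical to an input containing reads longer than `trim_bp`, which is why an exact copy points to the trimming step being skipped.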
It's really unclear why there's a call to trimfastq with a default of 50bp when it doesn't actually trim the file. Maybe it's a bug, or just a leftover from something that is no longer used? I was initially surprised to see a file name containing "trim_50bp.fastq.gz" because I didn't specify such trimming. The result itself doesn't seem to be a problem, but this could be a source of other issues or undesired behavior.
I see that there's a parameter: "chip.xcor_trim_bp" with default 50bp. I'm guessing the "call-align_R1" is the step before the cross-correlation? If that's the case, then there is indeed a bug because my reads don't get trimmed!
Any clarification would be helpful! Thanks.
OS/Platform
Caper configuration file
Paste contents of ~/.caper/default.conf.
backend=local
local-hash-strat=path+modtime
local-loc-dir=/media/matt/fast_data_storage/ngs_sandbox/caper/
cromwell=/home/matt/.caper/cromwell_jar/cromwell-52.jar
womtool=/home/matt/.caper/womtool_jar/womtool-52.jar
Input JSON file
{
"chip.title" : "H1 hESC primed chip seq.",
"chip.description" : "h3k27ac chip-seq.",
}