fastq processing #17

MardahlM · 2022-06-10T09:49:17Z

Hi, Thanks for a quick tool.
I've been using this to UMICollapse my bam files.
Now I want to utilize it on fastq files.
I am confused by the statement:

"fastq: the input is a FASTQ file. This deduplicates the entire FASTQ file based on each entire read sequence. Note that the "UMI" would be the entire read sequence."

Does this mean that the UMIs are collapsed and the actual reads are not looked at?
For perspective: I have a read depth of 100M with 12 nt UMIs. I want to be sure that reads which are not identical, but happen to share a UMI by chance, are not collapsed together. Can you elaborate?

I am also curious about the --tag option.
I am looking for miRNAs, and my plan is to collapse the UMIs WITHOUT the --tag option and then proceed directly to fastx collapser to get an abundance table.
For this I reckon I don't need to know how many UMIs were part of each group. Is that right?

I hope you can find the time to give me some input.

I plan on using this tool in my future UMI analyses at both the fastq and bam level.

Cheers,
Maibritt

Daniel-Liu-c0deb0t · 2022-06-10T22:06:10Z

For the fastq mode, reads are compared by both the read sequence and the UMI. So two reads will only be grouped together and collapsed if they have the same sequence and UMI. For bam mode, the read sequences do not need to be compared because only reads that map to the same location (they should have similar sequences) are deduplicated by UMIs. Sorry about the confusion.
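To make the fastq-mode behaviour concrete, here is a minimal Python sketch of exact-match deduplication keyed on the entire read sequence, as in the README statement quoted above. It is not UMICollapse's implementation: the names `parse_fastq` and `dedup_by_sequence` are hypothetical, the sketch assumes the UMI is still part of the read sequence, and it only merges identical sequences.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (read name, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").strip()
        next(it, "")  # the '+' separator line is ignored
        qual = next(it, "").strip()
        yield header.lstrip("@"), seq, qual


def dedup_by_sequence(records: Iterable[Record]) -> Tuple[List[Record], Counter]:
    """Keep the first read seen for each distinct sequence and count duplicates.

    The grouping key is the whole read sequence, so two reads are merged
    only if every base (including any UMI bases still in the read) matches.
    """
    counts: Counter = Counter()
    kept: List[Record] = []
    for name, seq, qual in records:
        if counts[seq] == 0:
            kept.append((name, seq, qual))
        counts[seq] += 1
    return kept, counts


if __name__ == "__main__":
    toy = [
        "@r1", "AAAACCGT", "+", "IIIIIIII",
        "@r2", "AAAACCGT", "+", "IIIIIIII",  # identical to r1, so merged
        "@r3", "AAATCCGT", "+", "IIIIIIII",  # one base differs, so kept
    ]
    kept, counts = dedup_by_sequence(parse_fastq(toy))
    for name, seq, _ in kept:
        print(name, seq, counts[seq])
```

In this sketch the key is built from the sequence alone, so if an earlier step has moved the UMI out of the sequence and into the read name, the UMI no longer contributes to the grouping.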

The tag option essentially just outputs the same reads, but with the clusters labelled in the read names. If you just want the deduplicated reads (e.g., you just want to remove all PCR duplicates), you don't need to use this option.

MardahlM · 2022-06-22T20:18:31Z

Hi again,

I tried UMICollapse on fastq files instead of bam files, and the result is a drastic reduction in read numbers.

For one file after UMI extract, Cutadapt, and Bowtie it looks like this:

bowtie -v 2 -a --norc --best --strata --threads 16

Bowtie log

reads processed: 35032773
reads with at least one reported alignment: 14794819 (42.23%)
reads that failed to align: 20237954 (57.77%)
Reported 16539498 alignments to 1 output stream(s)

UMICollapse log

Number of input reads 16539498
Number of removed unmapped reads 0
Number of unremoved reads 16539498
Number of unique alignment positions 357
Average number of UMIs per alignment position 28842.619047619046
Max number of UMIs over all alignment positions 1294837
Number of reads after deduplicating 4475513
UMI collapsing finished in 141.077 seconds!

However, on the fastq file after UMI extract and Cutadapt it looks like:

Arguments [fastq, -i, ~/data/0L1-directUMIextracted-min18max30L.fastq, -o, ~/results/0L1/0L1-UMI_collapsed.fastq]

Done reading input file into memory!
Number of input reads 35032773
Number of unique reads 647457
Number of reads after deduplicating 471837
UMI collapsing finished in 34.516 seconds!

Can you help me uncover what is going on? Why is the number of reads after deduplication so much lower when running UMICollapse on fastq files than on bam files?

If anything, I'd expect a lower number of deduplicated reads after alignment.

Cheers, Maibritt
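For reference, the figures in the two logs above can be put side by side with a few lines of Python. This is only arithmetic on the numbers already posted; the variable names are made up for illustration. Note that the two runs count different things: the bam run starts from the 16539498 reported alignments (more than the 14794819 aligned reads, since `-a` asks Bowtie to report all valid alignments), while the fastq run starts from all 35032773 reads.

```python
# Numbers copied from the Bowtie and UMICollapse logs in this thread.
bowtie_reads_aligned = 14_794_819
bowtie_alignments_reported = 16_539_498
bam_after_dedup = 4_475_513

fastq_input_reads = 35_032_773
fastq_unique_reads = 647_457
fastq_after_dedup = 471_837

# Average number of reported alignments per aligned read.
alignments_per_aligned_read = bowtie_alignments_reported / bowtie_reads_aligned

# Fraction of input records that survive deduplication in each mode.
bam_fraction_kept = bam_after_dedup / bowtie_alignments_reported
fastq_unique_fraction = fastq_unique_reads / fastq_input_reads
fastq_fraction_kept = fastq_after_dedup / fastq_input_reads

print(f"alignments per aligned read: {alignments_per_aligned_read:.3f}")
print(f"bam mode, fraction kept:     {bam_fraction_kept:.4f}")
print(f"fastq mode, unique fraction: {fastq_unique_fraction:.4f}")
print(f"fastq mode, fraction kept:   {fastq_fraction_kept:.4f}")
```

So roughly 27% of the alignments are kept in bam mode, against a little over 1% of the reads in fastq mode.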
