Incorrect compression of N bases #2

fwip · 2014-07-08T01:25:14Z

From the documentation at https://docs.google.com/document/pub?id=1f-8C-ZfCUTEsO-EqvlcTXQ0M5aYM61Aet902dA8QZZk:

Bases are encoded to the .fxb file by first deleting all N’s, and then packing 3 or 4 bases per byte using a variable length code. The N’s can be restored because they always have a quality score of 0, and no other bases do.

This does not hold true for our data. As near as I can tell, N bases always have a quality score of 2 ("#"). Unfortunately, other bases sometimes have a quality score of 2 as well, so a score of 2 cannot be used to identify N bases. No observed bases have a quality below 2.

As-is, the error only becomes evident on decompression:
fastqz error: unexpected end of .fxb

The N bases are left out entirely, which shifts all subsequent bases forward (including those in subsequent reads).
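
To illustrate the failure, here is a toy Python model of the scheme as the documentation describes it. This is not fastqz's actual code: `strip_n` and `restore` are hypothetical helpers, the base stream is a plain string rather than the packed .fxb format, and qualities are assumed to be Phred+33 ("!" = 0, "#" = 2).

```python
def strip_n(reads):
    """Concatenate the bases of all reads, dropping every N (toy base stream)."""
    return "".join(b for seq, _ in reads for b in seq if b != "N")


def restore(stream, quals, n_qual="!"):
    """Rebuild reads: emit N wherever the quality equals n_qual,
    otherwise consume the next stored base."""
    bases = iter(stream)
    out = []
    for qual in quals:
        seq = []
        for q in qual:
            if q == n_qual:
                seq.append("N")
            else:
                base = next(bases, None)
                if base is None:
                    raise ValueError("unexpected end of base stream")
                seq.append(base)
        out.append("".join(seq))
    return out


# N with quality 0 ("!"): the round trip works.
good = [("ACNT", "II!I"), ("GGCA", "IIII")]
print(restore(strip_n(good), [q for _, q in good]))  # ['ACNT', 'GGCA']

# N with quality 2 ("#"), as in our data: the N is never re-inserted,
# later bases shift forward, and the stream runs out early.
bad = [("ACNT", "II#I"), ("GGCA", "IIII")]
try:
    restore(strip_n(bad), [q for _, q in bad])
except ValueError as err:
    print("error:", err)
```

In the second case the first read comes back as "ACTG", borrowing a base from the next read, which matches the shift described above.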

Possible fixes, in order of increasing estimated difficulty:

1. Pre- and post-process our data outside of fastqz: convert N qualities to 0 ("!") before compression, and convert 0s back to 2 ("#") after decompression.
2. Bundle the above into fastqz, possibly with a customizable "offset" value.
3. Change the encoding scheme to store N values.
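
The first option could look roughly like the sketch below. It assumes Phred+33 qualities, four-line FASTQ records already split into tuples, and that every N in the input has quality "#". `remap_n_quality` is a hypothetical helper, not part of fastqz. It only touches quality characters at N positions, so non-N bases with quality 2 are left alone.

```python
def remap_n_quality(record, src, dst):
    """Return a copy of a FASTQ record (header, seq, plus, qual) in which
    every N base whose quality is `src` gets quality `dst` instead."""
    header, seq, plus, qual = record
    new_qual = "".join(
        dst if base == "N" and q == src else q
        for base, q in zip(seq, qual)
    )
    return header, seq, plus, new_qual


record = ("@read1", "ACNT", "+", "I##I")

# Before compression: N qualities 2 ("#") -> 0 ("!").
pre = remap_n_quality(record, "#", "!")
print(pre[3])  # I#!I

# After decompression: N qualities 0 ("!") -> 2 ("#").
post = remap_n_quality(pre, "!", "#")
print(post == record)  # True
```

The round trip is only lossless if the assumption holds that all N bases start out at quality 2; any N with a different quality would need separate handling.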

fwip added the bug label Jul 8, 2014
