Incorrect compression of N bases #2

Open
fwip opened this issue Jul 8, 2014 · 0 comments
fwip commented Jul 8, 2014

From the documentation at https://docs.google.com/document/pub?id=1f-8C-ZfCUTEsO-EqvlcTXQ0M5aYM61Aet902dA8QZZk

Bases are encoded to the .fxb file by first deleting all N’s, and then packing 3 or 4 bases per byte using a variable length code. The N’s can be restored because they always have a quality score of 0, and no other bases do.

This does not hold true for our data. Near as I can tell, N bases always have a quality score of 2 ("#"). Unfortunately, other bases also sometimes have a quality score of 2. No observed bases have a quality below 2.
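One way to check this observation on a given file is to tally quality scores separately for N bases and for everything else. The snippet below is only a sketch: `tally_qualities` is a hypothetical helper (not part of fastqz), and it assumes strict 4-line FASTQ records with Phred+33 quality encoding.

```python
from collections import Counter

PHRED_OFFSET = 33  # assumption: Phred+33 encoding, where "!" is 0 and "#" is 2


def tally_qualities(fastq_lines):
    """Count quality scores seen on N bases and on all other bases."""
    n_quals, other_quals = Counter(), Counter()
    lines = [line.rstrip("\n") for line in fastq_lines]
    # Assumes strict 4-line records: header, bases, "+", qualities.
    for i in range(0, len(lines) - 3, 4):
        bases, quals = lines[i + 1], lines[i + 3]
        for base, qual in zip(bases, quals):
            score = ord(qual) - PHRED_OFFSET
            if base == "N":
                n_quals[score] += 1
            else:
                other_quals[score] += 1
    return n_quals, other_quals
```

If the N tally has a single key of 2 and the other tally also contains 2, the file matches the situation described here: quality alone cannot tell an N apart from a called base.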

As-is, the error only becomes evident on decompression:
fastqz error: unexpected end of .fxb

The N bases are left out entirely, so every subsequent base shifts forward to fill the gap (including bases in subsequent reads).
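The failure mode can be reproduced with a toy model built only from the documentation quoted above. This is not fastqz's actual code: `pack` and `unpack` are hypothetical stand-ins that drop N's on the way in and reinsert them only where the quality character is "!" (score 0).

```python
def pack(bases):
    """Toy encoder: delete every N and keep the remaining bases in order."""
    return [b for b in bases if b != "N"]


def unpack(stored, quals):
    """Toy decoder: emit N where quality is 0 ("!"), else take the next stored base."""
    remaining = iter(stored)
    out = []
    for q in quals:
        if q == "!":
            out.append("N")
            continue
        try:
            out.append(next(remaining))
        except StopIteration:
            # More non-zero qualities than stored bases: the stream is exhausted.
            raise EOFError("unexpected end of stored bases") from None
    return "".join(out)
```

With qualities `II!II` the read `ACNTG` round-trips correctly. With `II#II` the decoder places `T` where the N was, shifts `G` forward as well, and then runs out of stored bases on the last position, which matches the error seen at the end of decompression.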

Possible fixes, in order of increasing estimated difficulty:

  1. Pre- and post-process our data outside of fastqz: convert N qualities to 0 ("!") before compression, and convert 0s back to 2 ("#") after decompression.
  2. Bundle the above into fastqz, possibly with a customizable "offset" value.
  3. Change the encoding scheme to store N values.
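
A minimal sketch of option 1, operating on one read at a time. The function names are hypothetical, and the sketch relies on two assumptions taken from the observations above: every N has quality 2 ("#") in the original data, and no base has quality 0 ("!") before masking.

```python
def mask_n_qualities(bases, quals, zero_char="!"):
    """Before compression: set the quality of every N base to 0."""
    return "".join(
        zero_char if base == "N" else qual
        for base, qual in zip(bases, quals)
    )


def restore_n_qualities(bases, quals, n_char="#", zero_char="!"):
    """After decompression: set zero qualities on N bases back to 2."""
    return "".join(
        n_char if base == "N" and qual == zero_char else qual
        for base, qual in zip(bases, quals)
    )


# Usage on a single read's base and quality strings:
masked = mask_n_qualities("ACNTG", "I##II")           # "I#!II"
restored = restore_n_qualities("ACNTG", masked)       # "I##II"
```

Non-N bases that happen to have quality 2 are left untouched in both directions, so the round trip is lossless as long as the two assumptions hold.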

fwip added the bug label Jul 8, 2014