Skip to content

Commit

Permalink
Add CRAM codecs test data (PR #740)
Browse files Browse the repository at this point in the history
These are mostly CRAM 3.1, bar the original CRAM 3.0 rANS 4x8 method.

The files are intended for validation of codec implementation and form
a prerequisite to whole-CRAM file validation.
  • Loading branch information
jkbonfield committed Mar 21, 2024
1 parent 25ae51f commit d481356
Show file tree
Hide file tree
Showing 214 changed files with 175 additions and 0 deletions.
175 changes: 175 additions & 0 deletions test/cram/codecs/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,175 @@
CRAM codecs test files
======================

The data here corresponds to the new compression codecs implemented in
CRAM 3.0 and 3.1. Rather than entire CRAM files, these tests
represent individual files compressed using a CRAM codec, with none of
the CRAM blocks, slices or container wrappings present.

Each directory corresponds to a codec, with both the CRAM compressed
file and a corresponding gzip compressed version of the same file.
The CRAM encoded version may have multiple variants, demonstrating
different capabilities of that codec. For example the CRAM 3.0 rANS
may have an order-0 and order-1 encoding, while the CRAM 3.1 rANS adds
additional bit-packing and RLE methods.

To validate decoder implementations with these files, you should
attempt to decode the CRAM codec-compressed file and demonstrate
identity with the original ungzipped copy after file specific
processing (see below).

To validate an encoder, you can perform a round-trip using each
specific codec option. While this is not a guarantee of accuracy of
implementation, if the round-trip works while the decoder on the
supplied data also works, then it is a demonstration of correct data
layout. Note it may not be possible to simply compare compression
outputs as two implementations may choose different frequency tables
where rounding is required. Both are valid, but different.

Note the origin of these files is the htscodecs self test directory.

https://github.com/samtools/htscodecs/


Processing
----------

In CRAM the quality values are just encoded together into a single
block with no separators, such as newlines. The rANS codec just
compresses as-is, while the fqzcomp codec may utilise the sequence
length in order to improve compression. The name tokeniser doesn't
care what the separator is, but if BYTE_ARRAY_STOP has been used in
CRAM then they could be newline, nul or tab terminated.

For ease of performing round-trip tests, all the data files here are
textual and newline separated (with the exception of the 32-bit
unsigned integer file "u32").

Therefore some transformations are required when performing equality
checks.

As an example, we will use htscodecs' tests.

$ gzip -cd gzip/q40+dir.gz | perl -ae 'print $F[0]' | md5sum
ea2e88c7a117c3989203f6987058d548 -

$ rans4x16pr -d -r ransNx16/q40+dir.1 | md5sum
ea2e88c7a117c3989203f6987058d548 -

$ fqzcomp_qual -d -r fqzcomp/q40+dir.3 | tr -d '\012' | md5sum
ea2e88c7a117c3989203f6987058d548 -

The perl command outputs the first column in the uncompressed test
data, without newlines. (The htscodecs fqzcomp_qual program does
however output newlines, so we are removing these for consistency,
but this is specific to the htscodecs implementation of that test tool.)

$ gzip -cd gzip/10.names | md5sum
3a45331216b626bce255d12e83a615f3 -

$ tokenise_name3 -d -r < tok3/10.names.9 | tr '\000' '\012' | md5sum
3a45331216b626bce255d12e83a615f3 -

Similarly for the name tokeniser, the htscodecs test tool is
using nul bytes, so we transform these to newlines before comparing.

The important thing here is to compare the contents of the files
rather than the specific layout. Whatever processing you do will be
implementation tool specific.


Specific test data sets
-----------------------

The gzip subdirectory has the original data with names below.

### q4

Fixed length Illumina quality values binned to 4 discrete values.
Good for evaluating the PACK format of rANS-Nx16 and range.

### q8

Fixed length Illumina quality values binned to 8 discrete values.
Good for evaluating the PACK format of rANS-Nx16 and range.

### q40+dir

Fixed length Unbinned Illumina quality values with 40 discrete values.
This file has a second field indicating the strand, 0 for original and
1 for reversed. This ancillary information could be used by fqzcomp
as a selector bit to improve compression ratios.

Note only the first column will be decoded when using the pre-made
CRAM codec compressed files (although this second column may be
embedded in some of the fqzcomp encodings).

### qvar

Long read quality values with variable length records and many
discrete quality values.

### *.names

Multiple sets of read names produced by a variety of platforms and in
a variety of orders, including single and paired data and both name
sorted and position sorted (which essentially randomises names, but
may have pairings close by).

Note some of the name files are in FASTQ format with `@name` and
potentially ending on /1 or /2. These won't be the format used
internally with CRAM, but are still designed as a test for coping with
arbitrary name formatting.

The "rr.names" file contains a merge of two different name styles and
may be a useful test for optimising the selection of the previous name
to delta against for a name tokeniser encoder.

### u32

This is a binary file with 32-bit little endian integers. It is a
useful test of the STRIPE data interleaving capability of the rANS-Nx16
and range codecs.

Note this file is not line-wrapped and should be compared as-is.


Notes on specific codec files
=============================

The fqzcomp, range, rans4x8, ransNx16 and tok3 directories have the
CRAM codec compressed copies of the above files. There will be
multiple versions of each file with a suffix indicating the encoding
used, as below.

### rans4x8

".0" and ".1" used for Order-0 and Order-1 entropy encoding
respectively.

### ransNx16

The "Order" byte is extended to contain additional bits, with +8 for
STRIPE, +64 for RLE and +128 for PACK. The suffixes are the decimal
version of these combinations, so ".192" for example is PACK + RLE +
order-0.

### range

As per ransNx16, encoding Order, STRIPE, RLE and PACK.

### fqzcomp

Fqzcomp is different to the range and rans codecs in that it doesn't
have a fixed set of data transformations. Instead it has an internal
description of various operations and their positions, bit-sizes and
shifts. We don't describe the precise configuration, and instead have
".0" to ".3" for 4 arbitrary configuration settings.

### tok3

".1" to ".9" can be viewed as a compression level, like gzip -1 to
gzip -9, with more exploration of different compression techniques.
All of these use the rANS-Nx16 entropy encoder.

".11" to ".19" are similar, but using the range coder instead.
Binary file added test/cram/codecs/fqzcomp/q4.0
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/q4.1
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/q4.2
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/q4.3
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/q40+dir.0
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/q40+dir.1
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/q40+dir.2
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/q40+dir.3
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/q8.0
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/q8.1
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/q8.2
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/q8.3
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/qvar.0
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/qvar.1
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/qvar.2
Binary file not shown.
Binary file added test/cram/codecs/fqzcomp/qvar.3
Binary file not shown.
Binary file added test/cram/codecs/gzip/01.names.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/02.names.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/03.names.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/05.names.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/08.names.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/09.names.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/10.names.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/20.names.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/nv.names.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/nv2.names.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/q4.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/q40+dir.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/q8.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/qvar.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/rr.names.gz
Binary file not shown.
Binary file added test/cram/codecs/gzip/u32.gz
Binary file not shown.
Binary file added test/cram/codecs/range/q4.0
Binary file not shown.
Binary file added test/cram/codecs/range/q4.1
Binary file not shown.
Binary file added test/cram/codecs/range/q4.128
Binary file not shown.
Binary file added test/cram/codecs/range/q4.129
Binary file not shown.
Binary file added test/cram/codecs/range/q4.192
Binary file not shown.
Binary file added test/cram/codecs/range/q4.193
Binary file not shown.
Binary file added test/cram/codecs/range/q4.64
Binary file not shown.
Binary file added test/cram/codecs/range/q4.65
Binary file not shown.
Binary file added test/cram/codecs/range/q4.8
Binary file not shown.
Binary file added test/cram/codecs/range/q4.9
Binary file not shown.
Binary file added test/cram/codecs/range/q40+dir.0
Binary file not shown.
Binary file added test/cram/codecs/range/q40+dir.1
Binary file not shown.
Binary file added test/cram/codecs/range/q40+dir.64
Binary file not shown.
Binary file added test/cram/codecs/range/q40+dir.65
Binary file not shown.
Binary file added test/cram/codecs/range/q40+dir.8
Binary file not shown.
Binary file added test/cram/codecs/range/q40+dir.9
Binary file not shown.
Binary file added test/cram/codecs/range/q8.0
Binary file not shown.
Binary file added test/cram/codecs/range/q8.1
Binary file not shown.
Binary file added test/cram/codecs/range/q8.128
Binary file not shown.
Binary file added test/cram/codecs/range/q8.129
Binary file not shown.
Binary file added test/cram/codecs/range/q8.192
Binary file not shown.
Binary file added test/cram/codecs/range/q8.193
Binary file not shown.
Binary file added test/cram/codecs/range/q8.64
Binary file not shown.
Binary file added test/cram/codecs/range/q8.65
Binary file not shown.
Binary file added test/cram/codecs/range/qvar.0
Binary file not shown.
Binary file added test/cram/codecs/range/qvar.1
Binary file not shown.
Binary file added test/cram/codecs/range/qvar.64
Binary file not shown.
Binary file added test/cram/codecs/range/qvar.65
Binary file not shown.
Binary file added test/cram/codecs/range/u32.1
Binary file not shown.
Binary file added test/cram/codecs/range/u32.4
Binary file not shown.
Binary file added test/cram/codecs/range/u32.65
Binary file not shown.
Binary file added test/cram/codecs/range/u32.9
Binary file not shown.
Binary file added test/cram/codecs/rans4x8/q4.0
Binary file not shown.
Binary file added test/cram/codecs/rans4x8/q4.1
Binary file not shown.
Binary file added test/cram/codecs/rans4x8/q40+dir.0
Binary file not shown.
Binary file added test/cram/codecs/rans4x8/q40+dir.1
Binary file not shown.
Binary file added test/cram/codecs/rans4x8/q8.0
Binary file not shown.
Binary file added test/cram/codecs/rans4x8/q8.1
Binary file not shown.
Binary file added test/cram/codecs/rans4x8/qvar.0
Binary file not shown.
Binary file added test/cram/codecs/rans4x8/qvar.1
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q4.0
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q4.1
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q4.128
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q4.129
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q4.192
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q4.193
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q4.4
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q4.5
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q4.64
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q4.65
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q40+dir.0
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q40+dir.1
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q40+dir.4
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q40+dir.5
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q40+dir.8
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q8.0
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q8.1
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q8.128
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q8.129
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q8.192
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q8.193
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q8.4
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q8.5
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q8.64
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/q8.65
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/qvar.0
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/qvar.1
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/qvar.4
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/qvar.5
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/u32.1
Binary file not shown.
Binary file added test/cram/codecs/ransNx16/u32.9
Binary file not shown.
Binary file added test/cram/codecs/tok3/01.names.1
Binary file not shown.
Binary file added test/cram/codecs/tok3/01.names.11
Binary file not shown.
Binary file added test/cram/codecs/tok3/01.names.13
Binary file not shown.
Binary file added test/cram/codecs/tok3/01.names.15
Binary file not shown.
Binary file added test/cram/codecs/tok3/01.names.17
Binary file not shown.
Binary file added test/cram/codecs/tok3/01.names.19
Binary file not shown.
Binary file added test/cram/codecs/tok3/01.names.3
Binary file not shown.
Binary file added test/cram/codecs/tok3/01.names.5
Binary file not shown.
Binary file added test/cram/codecs/tok3/01.names.7
Binary file not shown.
Binary file added test/cram/codecs/tok3/01.names.9
Binary file not shown.
Binary file added test/cram/codecs/tok3/02.names.1
Binary file not shown.
Binary file added test/cram/codecs/tok3/02.names.11
Binary file not shown.
Binary file added test/cram/codecs/tok3/02.names.13
Binary file not shown.
Binary file added test/cram/codecs/tok3/02.names.15
Binary file not shown.
Binary file added test/cram/codecs/tok3/02.names.17
Binary file not shown.
Binary file added test/cram/codecs/tok3/02.names.19
Binary file not shown.
Binary file added test/cram/codecs/tok3/02.names.3
Binary file not shown.
Binary file added test/cram/codecs/tok3/02.names.5
Binary file not shown.
Binary file added test/cram/codecs/tok3/02.names.7
Binary file not shown.
Binary file added test/cram/codecs/tok3/02.names.9
Binary file not shown.
Binary file added test/cram/codecs/tok3/03.names.1
Binary file not shown.
Binary file added test/cram/codecs/tok3/03.names.11
Binary file not shown.
Binary file added test/cram/codecs/tok3/03.names.13
Binary file not shown.
Binary file added test/cram/codecs/tok3/03.names.15
Binary file not shown.
Binary file added test/cram/codecs/tok3/03.names.17
Binary file not shown.
Binary file added test/cram/codecs/tok3/03.names.19
Binary file not shown.
Binary file added test/cram/codecs/tok3/03.names.3
Binary file not shown.
Binary file added test/cram/codecs/tok3/03.names.5
Binary file not shown.
Binary file added test/cram/codecs/tok3/03.names.7
Binary file not shown.
Binary file added test/cram/codecs/tok3/03.names.9
Binary file not shown.
Binary file added test/cram/codecs/tok3/05.names.1
Binary file not shown.
Binary file added test/cram/codecs/tok3/05.names.11
Binary file not shown.
Binary file added test/cram/codecs/tok3/05.names.13
Binary file not shown.
Binary file added test/cram/codecs/tok3/05.names.15
Binary file not shown.
Binary file added test/cram/codecs/tok3/05.names.17
Binary file not shown.
Binary file added test/cram/codecs/tok3/05.names.19
Binary file not shown.
Binary file added test/cram/codecs/tok3/05.names.3
Binary file not shown.
Binary file added test/cram/codecs/tok3/05.names.5
Binary file not shown.
Binary file added test/cram/codecs/tok3/05.names.7
Binary file not shown.
Binary file added test/cram/codecs/tok3/05.names.9
Binary file not shown.
Binary file added test/cram/codecs/tok3/08.names.1
Binary file not shown.
Binary file added test/cram/codecs/tok3/08.names.11
Binary file not shown.
Binary file added test/cram/codecs/tok3/08.names.13
Binary file not shown.
Binary file added test/cram/codecs/tok3/08.names.15
Binary file not shown.
Binary file added test/cram/codecs/tok3/08.names.17
Binary file not shown.
Binary file added test/cram/codecs/tok3/08.names.19
Binary file not shown.
Binary file added test/cram/codecs/tok3/08.names.3
Binary file not shown.
Binary file added test/cram/codecs/tok3/08.names.5
Binary file not shown.
Binary file added test/cram/codecs/tok3/08.names.7
Binary file not shown.
Binary file added test/cram/codecs/tok3/08.names.9
Binary file not shown.
Binary file added test/cram/codecs/tok3/09.names.1
Binary file not shown.
Binary file added test/cram/codecs/tok3/09.names.11
Binary file not shown.
Binary file added test/cram/codecs/tok3/09.names.13
Binary file not shown.
Binary file added test/cram/codecs/tok3/09.names.15
Binary file not shown.
Binary file added test/cram/codecs/tok3/09.names.17
Binary file not shown.
Binary file added test/cram/codecs/tok3/09.names.19
Binary file not shown.
Binary file added test/cram/codecs/tok3/09.names.3
Binary file not shown.
Binary file added test/cram/codecs/tok3/09.names.5
Binary file not shown.
Binary file added test/cram/codecs/tok3/09.names.7
Binary file not shown.
Binary file added test/cram/codecs/tok3/09.names.9
Binary file not shown.
Binary file added test/cram/codecs/tok3/10.names.1
Binary file not shown.
Binary file added test/cram/codecs/tok3/10.names.11
Binary file not shown.
Binary file added test/cram/codecs/tok3/10.names.13
Binary file not shown.
Binary file added test/cram/codecs/tok3/10.names.15
Binary file not shown.
Binary file added test/cram/codecs/tok3/10.names.17
Binary file not shown.
Binary file added test/cram/codecs/tok3/10.names.19
Binary file not shown.
Binary file added test/cram/codecs/tok3/10.names.3
Binary file not shown.
Binary file added test/cram/codecs/tok3/10.names.5
Binary file not shown.
Binary file added test/cram/codecs/tok3/10.names.7
Binary file not shown.
Binary file added test/cram/codecs/tok3/10.names.9
Binary file not shown.
Binary file added test/cram/codecs/tok3/20.names.1
Binary file not shown.
Binary file added test/cram/codecs/tok3/20.names.11
Binary file not shown.
Binary file added test/cram/codecs/tok3/20.names.13
Binary file not shown.
Binary file added test/cram/codecs/tok3/20.names.15
Binary file not shown.
Binary file added test/cram/codecs/tok3/20.names.17
Binary file not shown.
Binary file added test/cram/codecs/tok3/20.names.19
Binary file not shown.
Binary file added test/cram/codecs/tok3/20.names.3
Binary file not shown.
Binary file added test/cram/codecs/tok3/20.names.5
Binary file not shown.
Binary file added test/cram/codecs/tok3/20.names.7
Binary file not shown.
Binary file added test/cram/codecs/tok3/20.names.9
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv.names.1
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv.names.11
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv.names.13
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv.names.15
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv.names.17
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv.names.19
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv.names.3
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv.names.5
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv.names.7
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv.names.9
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv2.names.1
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv2.names.11
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv2.names.13
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv2.names.15
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv2.names.17
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv2.names.19
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv2.names.3
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv2.names.5
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv2.names.7
Binary file not shown.
Binary file added test/cram/codecs/tok3/nv2.names.9
Binary file not shown.
Binary file added test/cram/codecs/tok3/rr.names.1
Binary file not shown.
Binary file added test/cram/codecs/tok3/rr.names.11
Binary file not shown.
Binary file added test/cram/codecs/tok3/rr.names.13
Binary file not shown.
Binary file added test/cram/codecs/tok3/rr.names.15
Binary file not shown.
Binary file added test/cram/codecs/tok3/rr.names.17
Binary file not shown.
Binary file added test/cram/codecs/tok3/rr.names.19
Binary file not shown.
Binary file added test/cram/codecs/tok3/rr.names.3
Binary file not shown.
Binary file added test/cram/codecs/tok3/rr.names.5
Binary file not shown.
Binary file added test/cram/codecs/tok3/rr.names.7
Binary file not shown.
Binary file added test/cram/codecs/tok3/rr.names.9
Binary file not shown.

0 comments on commit d481356

Please sign in to comment.