Clarify the name tokeniser uncomp_len calculation (PR samtools#803)
This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes samtools#802
jkbonfield committed Jan 7, 2025
1 parent 836fb61 commit 4982e03
Showing 1 changed file with 8 additions and 4 deletions.
12 changes: 8 additions & 4 deletions CRAMcodecs.tex
@@ -2450,10 +2450,14 @@ \section{Name tokenisation codec}
a format within a format, as the multiple byte streams $B_{pos,type}$
are serialised into a single byte stream.

-The serialised data stream starts with two unsigned little endiand 32-bit
-integers holding the total size of uncompressed name buffer and the
-number of read names. This is followed the array elements
-themselves.
+The serialised data stream starts with two unsigned little endian
+32-bit integers holding the total size of uncompressed name buffer and
+the number of read names. This is followed the array elements
+themselves. Note the uncompressed size is calculated as the sum of
+all name lengths including a termination byte per name (e.g. the nul
+char). This is irrespective of whether the implementation produces
+data in this form or whether it returns separate name and name-length
+arrays.

Token types, $ttype$ holds one of the token ID values listed above
in the list above, plus special values to indicate certain additional
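
For illustration only, the uncompressed length rule described in this commit can be sketched in Python. This is not code from the repository or from any CRAM implementation: the function names name_buffer_size and pack_header are hypothetical, and the sketch assumes read names are held as byte strings without their terminators. It covers only the two leading 32-bit integers, not the array elements that follow them.

```python
import struct


def name_buffer_size(names):
    """Uncompressed size: every visible name byte plus one
    termination byte per name (e.g. '\\0')."""
    return sum(len(name) + 1 for name in names)


def pack_header(names):
    """Pack the two leading fields as unsigned little-endian 32-bit
    integers: uncompressed size first, then the number of names."""
    return struct.pack("<II", name_buffer_size(names), len(names))


# Hypothetical read names, stored without terminators.
names = [b"read1", b"read22", b"r3"]

header = pack_header(names)
uncomp_len, num_names = struct.unpack("<II", header)
print(uncomp_len, num_names)  # (5+1) + (6+1) + (2+1) = 16 bytes, 3 names
```

The size is the same whether an implementation stores names as one nul-separated buffer or as separate name and name-length arrays, since the terminator is counted per name in both cases.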
