Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internal storage order #11

Open
rob-p opened this issue Nov 23, 2021 · 10 comments
Open

Internal storage order #11

rob-p opened this issue Nov 23, 2021 · 10 comments

Comments

@rob-p
Copy link
Contributor

rob-p commented Nov 23, 2021

So after trying to write some code interfacing kmers with some k-mers parsed using needletail, I realized we are storing the k-mers in the opposite order. In kmers the leftmost nucleotide is stored in the highest order bit, in needletail and some of the other (C++) libraries I have used in the past it's the opposite.

Are there any strong arguments for one of these schemes over the other (cc @natir, @Daniel-Liu-c0deb0t, @luizirber)? Is this something we should also consider making part of the policy of how the k-mers are constructed?

@natir
Copy link
Contributor

natir commented Nov 25, 2021

I didn't have any argument to make a choice between big endian or little endian.

We can choose one or let the user choose what they want (another generic parameter, submodule, features). But let user choose what they want can generate a huge work.

@rob-p
Copy link
Contributor Author

rob-p commented Nov 25, 2021

Thanks @natir - I agree it would be a pain to genericize over. How hard would it be to switch the endianness to match the existing library representations in the space?

@natir
Copy link
Contributor

natir commented Nov 26, 2021

I think change could be easy.

My question is should we made this change? Needletail and other C++ libraries choose little endian, why? can we found argument? It's really the most popular choices?

@rob-p
Copy link
Contributor Author

rob-p commented Nov 26, 2021

@natir — I think that's a totally fair question. As you bring up in your initial response, I think the right answer is simply that the choice is arbitrary. I can't think of a compelling (performance or simplicity) reason, offhand, to choose one representation over the other.

Ideally what I'd like to do is to have a survey of how popular the 2 different encodings are. In the lack of a compelling reason to choose one over the other, popularity is an argument in and of itself (to maximize compatibility). Of course, I think getting a widespread and accurate survey will be a challenge.

Do you know of other libraries that use big endian? Do you have any idea of the prevalence of these encodings?

Edit : made a twitter poll in case it might be useful : https://twitter.com/nomad421/status/1464259140767948810.

@d-cameron
Copy link

d-cameron commented Nov 28, 2021

Having both written my own kmer encoding scheme, and been burnt using libraries with different scheme, I thought I'd weigh in. There are actually 4 independent choices one needs to make when encoding kmers:

Base encoding scheme

The first is how to encode each base. 4-bit encoding is possible but the only place I've seen it used is in htslib to encode sequence bases - never in a kmer library.

Of the two bit encoding schemes, the two most common are ACGT=0123, and TCAG=0123. The latter is the scheme used for UCSC 2bit encoding and has the advantage that the two bits are biologically meaningful and complementing is just an XOR.

Base endianness

The second choice is what you're asking here. The conceptual difference is whether we consider the bases as a string, or as a number.

In a string interpretation, the bases are placed based on their string index (kmer offset = 2 * string offset). So TCAG is:

LSbit    01  23  45  67
         T   C   A   G

In a numeric interpretation, the bases are placed by interpreting the sequence string as a number. When representing numbers, the left-most (first) digit is the most significant. Interpreting the nucleotide sequence as a base-4 number we get:

MSbit    76  54  32  01
         T   C   A   G

There are actually two more choices a kmer encoding scheme can make that you haven't addressed:

Padding endianness

If the bases don't fill the entire data type, then you need to decide where the padding goes. If you have the kmer CAG, which bits are meaningful?

These ones?

LSbit    01  23  45  67
             ----------

Or these one?

LSbit    01  23  45  67
         ----------

The padding endianness doesn't necessarily correspond to the base endianness.

Array endianness

I've found that almost every kmer library is also a 2-bit encoding sequence library. Typically they're some variant on [u8], or [u64], but if you want to convert between two libraries, then you need to know the endianness of your array: does the first array element contain the first few bases of the sequence or the last?

I don't think I've encountered a final-bases-in-first-array-element encoding scheme here: at this level the consensus seems to be a string-like representation where the first element of the array corresponds to the first N bases of the sequence.

Actual endianness

One final consideration is the on-disk format. If you're using not using a multi-byte representation format then it's not a concern, but if you're using multi-byte data type and you need to interop with other libraries, then you do need to specify the endianness - even if that answer is just 'platform-native'.

Concluding remarks

In many way, most of these choices don't matter that much. There's some performance and implementation complexity pros and cons for each but they all a few bit manipulation operations away from each other.

For what it's worth, my preferred scheme is the one that most closely corresponds to the .2bit format: USCS base encoding; numeric byte endianness; padding in LSBs; string-like arrays; big-endian on disk. This allows you to directly subset arbitrary length kmers from a .2bit encoded file (conversion to platform endianness when using multi-byte data types notwithstanding).

Edit: rephased to not explicitly state big/little endianness except for on-disk ordering. Endianness is a numeric concept and, a kmer is really a bit-packed string, not a number so the concept of 'endianness' doesn't technically apply.

@d-cameron
Copy link

d-cameron commented Nov 29, 2021

The approach I'm advocating has a few nice properties that aren't present in other kmer encoding schemes.

  • An encoded 2-bit stream/array of length k represents a big-endian number of length 2k
  • The byte sequence can be directly interpreted as a big-endian [u8], [u16], [u32], [u64], or [u128].
    • No extra transforms required to get the kmer at position i. Just grab the integer primitive that contains your key, translate to platform-endianness, and bitshift out the relevant bases in the 1-2 words the kmer overlaps.
    • Works for all word sizes and alignments.
    • No need for the library to chose a word size or have different representation for different word size. They all happily interop (unsafe word-alignment of pointers also works if you're happy to read a few bytes outside your array bounds).
  • Appending to a long 2-bit encoded sequence is just appending to the sequence.
    • If the padding bits were in the MSBs, you'd have to bit shift your entire sequence.

In short, A format compatible with the UCSC .2bit format has a number of advantages when dealing with 2-bit encoded sequences. They're not particularly helpful/concerning if purely with kmers but, in my experience, every kmer library is either secretly or explicitly a library for 2-bit encoding of sequences. It makes a lot of sense to make this explicit in the library scope and actually support arbitrary length sequences. From this perspective, a format that plays nicely with both fixed and variable k makes the most sense.

TDLR: treating the first base as the most signficant bits allows for big-endian consistency across bytes, words, and entire sequences.

@d-cameron
Copy link

d-cameron commented Nov 29, 2021

Performance-wise, u64::from_be() is just bswap(x86)/rev(ARM) so I doubt it'd make any meaningful impact as it's a tiny overhead that hit with 2-bit sequence -> kmer conversions.

@natir
Copy link
Contributor

natir commented Nov 29, 2021

Thanks a lot @d-cameron for all this information.

I will try to give my opinion on all point but I probably miss some stuff.

Base encoding scheme

Kmers already support, with not the best code I write in my all life, any encoding. If user want use specificity of some encoding, user can choose faster encoder, or implement another one. So for me this trouble is solve.

Base endianness

I think we can say the pool didn't help to much, each solution seems share same popularity. But I could add another project to the MSbit list, KFF choose Big Endian.

I think Base endianess and Padding endianness choices must be taken in common.

After some reflection I can say I prefer simplify the indexing of kmer. I think Big Endian base endianness with right align Padding endianness is the best choices to simplify access of one base in kmer. As example we can take 6-mer TCAGTC with encoding ACTG = 0123 in u8 slice will produce (if I didn't made any mistake)

198         | 6
11 00 01 10 | XX XX 01 10
G  A  C  T  | _  _   C  T  

This not natural ordering simplify indexing. if you want access to the i-base of kmer you run a division to found position of the correct u8 block in slice, and with modulo you can get the correct two bits. Function get_nuc could look like:

fn get_nuc<P>(kmer: &[P], index: uslice) -> P
where P: bit_field::BitField
{
    (kmer[index / P::BIT_LENGTH] >> (index % P::BIT_LENGTH * 2)) & 0b11
}

Array endianness

Actual user of library can choose which type of slice (u8, u16, u32, u64, u128) is used to store Kmer, so for me it's solve.

Actual endianness (can we rename it Storing endianness ?)

For me isn't the trouble of Kmers library we didn't write a serialization format. But I agree that we need to pay attention to this issue as well, but it is clearly not a priority for me.

@d-cameron
Copy link

Actual user of library can choose which type of slice (u8, u16, u32, u64, u128) is used to store Kmer, so for me it's solve.

The main advantage of having the base, byte, and padding all for big-endian is that switching between storage words is trivial. That is, a [u8,2] and a u16 are identical, similarly, [u8,4] and u64. You can subset a Vec with any word size and you'll get the kmer starting at that memory position.

The fact that this only works for big-endian and both x86 and ARM are little-endian is somewhat annoying as it means that whatever choice is made, it's a trade-off of something.

@natir
Copy link
Contributor

natir commented Dec 1, 2021

The fact that this only works for big-endian and both x86 and ARM are little-endian is somewhat annoying as it means that whatever choice is made, it's a trade-off of something.

For me there is no good solution but only a less bad solution. (I'm not sure we can define a less bad solution)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants