-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Internal storage order #11
Comments
I didn't have any argument to make a choice between big endian or little endian. We can choose one or let the user choose what they want (another generic parameter, submodule, features). But let user choose what they want can generate a huge work. |
Thanks @natir - I agree it would be a pain to genericize over. How hard would it be to switch the endianness to match the existing library representations in the space? |
I think change could be easy. My question is should we made this change? Needletail and other C++ libraries choose little endian, why? can we found argument? It's really the most popular choices? |
@natir — I think that's a totally fair question. As you bring up in your initial response, I think the right answer is simply that the choice is arbitrary. I can't think of a compelling (performance or simplicity) reason, offhand, to choose one representation over the other. Ideally what I'd like to do is to have a survey of how popular the 2 different encodings are. In the lack of a compelling reason to choose one over the other, popularity is an argument in and of itself (to maximize compatibility). Of course, I think getting a widespread and accurate survey will be a challenge. Do you know of other libraries that use big endian? Do you have any idea of the prevalence of these encodings? Edit : made a twitter poll in case it might be useful : https://twitter.com/nomad421/status/1464259140767948810. |
Having both written my own kmer encoding scheme, and been burnt using libraries with different scheme, I thought I'd weigh in. There are actually 4 independent choices one needs to make when encoding kmers: Base encoding scheme The first is how to encode each base. 4-bit encoding is possible but the only place I've seen it used is in htslib to encode sequence bases - never in a kmer library. Of the two bit encoding schemes, the two most common are ACGT=0123, and TCAG=0123. The latter is the scheme used for UCSC 2bit encoding and has the advantage that the two bits are biologically meaningful and complementing is just an XOR. Base endianness The second choice is what you're asking here. The conceptual difference is whether we consider the bases as a string, or as a number. In a string interpretation, the bases are placed based on their string index (
In a numeric interpretation, the bases are placed by interpreting the sequence string as a number. When representing numbers, the left-most (first) digit is the most significant. Interpreting the nucleotide sequence as a base-4 number we get:
There are actually two more choices a kmer encoding scheme can make that you haven't addressed: Padding endianness If the bases don't fill the entire data type, then you need to decide where the padding goes. If you have the kmer CAG, which bits are meaningful? These ones?
Or these one?
The padding endianness doesn't necessarily correspond to the base endianness. Array endianness I've found that almost every kmer library is also a 2-bit encoding sequence library. Typically they're some variant on I don't think I've encountered a final-bases-in-first-array-element encoding scheme here: at this level the consensus seems to be a string-like representation where the first element of the array corresponds to the first N bases of the sequence. Actual endianness One final consideration is the on-disk format. If you're using not using a multi-byte representation format then it's not a concern, but if you're using multi-byte data type and you need to interop with other libraries, then you do need to specify the endianness - even if that answer is just 'platform-native'. Concluding remarks In many way, most of these choices don't matter that much. There's some performance and implementation complexity pros and cons for each but they all a few bit manipulation operations away from each other. For what it's worth, my preferred scheme is the one that most closely corresponds to the .2bit format: USCS base encoding; numeric byte endianness; padding in LSBs; string-like arrays; big-endian on disk. This allows you to directly subset arbitrary length kmers from a .2bit encoded file (conversion to platform endianness when using multi-byte data types notwithstanding). Edit: rephased to not explicitly state big/little endianness except for on-disk ordering. Endianness is a numeric concept and, a kmer is really a bit-packed string, not a number so the concept of 'endianness' doesn't technically apply. |
The approach I'm advocating has a few nice properties that aren't present in other kmer encoding schemes.
In short, A format compatible with the UCSC .2bit format has a number of advantages when dealing with 2-bit encoded sequences. They're not particularly helpful/concerning if purely with kmers but, in my experience, every kmer library is either secretly or explicitly a library for 2-bit encoding of sequences. It makes a lot of sense to make this explicit in the library scope and actually support arbitrary length sequences. From this perspective, a format that plays nicely with both fixed and variable k makes the most sense. TDLR: treating the first base as the most signficant bits allows for big-endian consistency across bytes, words, and entire sequences. |
Performance-wise, |
Thanks a lot @d-cameron for all this information. I will try to give my opinion on all point but I probably miss some stuff. Base encoding scheme
Base endianness I think we can say the pool didn't help to much, each solution seems share same popularity. But I could add another project to the MSbit list, KFF choose Big Endian. I think Base endianess and Padding endianness choices must be taken in common. After some reflection I can say I prefer simplify the indexing of kmer. I think Big Endian base endianness with right align Padding endianness is the best choices to simplify access of one base in kmer. As example we can take 6-mer TCAGTC with encoding ACTG = 0123 in
This not natural ordering simplify indexing. if you want access to the i-base of kmer you run a division to found position of the correct
Array endianness Actual user of library can choose which type of slice ( Actual endianness (can we rename it Storing endianness ?) For me isn't the trouble of |
The main advantage of having the base, byte, and padding all for big-endian is that switching between storage words is trivial. That is, a [u8,2] and a u16 are identical, similarly, [u8,4] and u64. You can subset a Vec with any word size and you'll get the kmer starting at that memory position. The fact that this only works for big-endian and both x86 and ARM are little-endian is somewhat annoying as it means that whatever choice is made, it's a trade-off of something. |
For me there is no good solution but only a less bad solution. (I'm not sure we can define a less bad solution) |
So after trying to write some code interfacing
kmers
with some k-mers parsed usingneedletail
, I realized we are storing the k-mers in the opposite order. Inkmers
the leftmost nucleotide is stored in the highest order bit, inneedletail
and some of the other (C++) libraries I have used in the past it's the opposite.Are there any strong arguments for one of these schemes over the other (cc @natir, @Daniel-Liu-c0deb0t, @luizirber)? Is this something we should also consider making part of the policy of how the k-mers are constructed?
The text was updated successfully, but these errors were encountered: