-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Encoding Type #3
Comments
I agree there are some nice properties to this encoding. Given that, it sort of bothers me that we went with the in-order encoding in most of our C++ software (e.g. here). Given that we'll have to interact with that it may be useful to have a utility function for (quickly) initializing a k-mer from a bit-packed k-mer using another encoding scheme. Frankly, I don't know if there are others apart from A=0, C=1, G=2, T=3 and the one you proposed, but it's worth discussing. Regarding SIMD capability, I actually think that that may provide some useful benefits when k-mers no longer fit in a single word. For example, in cuttlefish we use a class similar to what I imagine this one being like, and when we want to use something like k=61 or k=75, we would probably be able to fill some SIMD registers. |
Hi @rob-p and @Daniel-Liu-c0deb0t , The proposed encoding here is a nice one. Albeit, the A = 0, C = 1, G = 2, and T = 3 encoding does make the lexicographic comparisons simple and straightforward — facilitating fast canonical k-mers computation, which are heavily used in the lab softwares e.g. in cuttlefish. |
Hi @jamshed-k, Your point is important when we consider interoperating with other encodings. However, for many purposes, all that matters about canonicalization is self-consistency. From that perspective, any bijective mapping from the nucleotides to their encoded integers will work. For example, we could easily define the canonical k-mer to be the max rather than the min of the encoding and it's reverse complement. Likewise, we could define a reversible hash for k-mers and define the canonical k-mer to be the one with the minimum (or maximum) hash value. Really, all that matters is that we chose the canonical form consistently, I think. |
Hi @rob-p, Sure, obviously we can define the canonical k-mer with self-consistency under any bijective mapping from the nucleotide-bases to integers. My point is that if defining the canonicalization function for a k-mer |
Hi all, I think it could be nice to support any encoding, by add an enum type in If I didn't made any mistake we have only 24 possible encoding. |
The UCSC 2bit format uses a different encoding:
https://genome.ucsc.edu/FAQ/FAQformat.html#format7
This would be my preference. The encoding only matters when converting to an unpacked representation or complementing the bases. |
I made some work around this. I create a trait encoding::Encoder, use by Kmer to populate storage. I also created a struct encoding::Naive that can convert a DNA sequence to 2bit with any kind of encoding. I still need work on complement part, before create a PR for this. The trait solution is nice because we can create any Encoder we want and the user can create one for his specific usage. |
I'm assuming we'd be using 2 bits per bp. There's a couple of common encodings, but I want to suggest
There's a couple of benefits:
On a slightly unrelated note, I've worked on some sequence manipulation stuff that use SIMD (eg., here, here for a library that was abandoned). Many of these ideas could be applicable here as well. I'm assuming that we want scalar ops only here because SIMD registers are probably too wide (128 or 256 bits) for handling kmers that are relatively short.
The text was updated successfully, but these errors were encountered: