Encoding Type #3

Daniel-Liu-c0deb0t · 2021-04-28T22:03:01Z

I'm assuming we'd be using 2 bits per bp. There's a couple of common encodings, but I want to suggest

A -> 00
C -> 01
T -> 10
G -> 11

There's a couple of benefits:

These are the 2nd and 3rd bits of the ASCII encoding of the corresponding base pairs. Conversion from byte strings would be easy.
Complement by using XOR ...0101010

On a slightly unrelated note, I've worked on some sequence manipulation stuff that use SIMD (eg., here, here for a library that was abandoned). Many of these ideas could be applicable here as well. I'm assuming that we want scalar ops only here because SIMD registers are probably too wide (128 or 256 bits) for handling kmers that are relatively short.

The text was updated successfully, but these errors were encountered:

rob-p · 2021-04-29T00:44:15Z

Hi @Daniel-Liu-c0deb0t,

I agree there are some nice properties to this encoding. Given that, it sort of bothers me that we went with the in-order encoding in most of our C++ software (e.g. here). Given that we'll have to interact with that it may be useful to have a utility function for (quickly) initializing a k-mer from a bit-packed k-mer using another encoding scheme. Frankly, I don't know if there are others apart from A=0, C=1, G=2, T=3 and the one you proposed, but it's worth discussing.

Regarding SIMD capability, I actually think that that may provide some useful benefits when k-mers no longer fit in a single word. For example, in cuttlefish we use a class similar to what I imagine this one being like, and when we want to use something like k=61 or k=75, we would probably be able to fill some SIMD registers.

jamshed · 2021-04-29T15:35:34Z

Hi @rob-p and @Daniel-Liu-c0deb0t ,

The proposed encoding here is a nice one. Albeit, the A = 0, C = 1, G = 2, and T = 3 encoding does make the lexicographic comparisons simple and straightforward — facilitating fast canonical k-mers computation, which are heavily used in the lab softwares e.g. in cuttlefish.

rob-p · 2021-04-29T16:57:14Z

Hi @jamshed-k,

Your point is important when we consider interoperating with other encodings. However, for many purposes, all that matters about canonicalization is self-consistency. From that perspective, any bijective mapping from the nucleotides to their encoded integers will work. For example, we could easily define the canonical k-mer to be the max rather than the min of the encoding and it's reverse complement. Likewise, we could define a reversible hash for k-mers and define the canonical k-mer to be the one with the minimum (or maximum) hash value. Really, all that matters is that we chose the canonical form consistently, I think.

jamshed · 2021-04-29T17:37:51Z

Hi @rob-p,

Sure, obviously we can define the canonical k-mer with self-consistency under any bijective mapping from the nucleotide-bases to integers. My point is that if defining the canonicalization function for a k-mer x as the lexicographical minimum of x and x_bar (or the max) — as prevalent in the lab tools and literature — then the in-order mapping should be more efficient for the function computation.

natir · 2021-07-31T19:51:14Z

Hi all,

I think it could be nice to support any encoding, by add an enum type in Kmer definition we can support any encoding.

If I didn't made any mistake we have only 24 possible encoding.

d-cameron · 2021-10-10T05:21:25Z

Frankly, I don't know if there are others

The UCSC 2bit format uses a different encoding:

T - 00, C - 01, A - 10, G - 11

https://genome.ucsc.edu/FAQ/FAQformat.html#format7

I think it could be nice to support any encoding

This would be my preference. The encoding only matters when converting to an unpacked representation or complementing the bases.

natir · 2021-10-11T07:50:20Z

I made some work around this. I create a trait encoding::Encoder, use by Kmer to populate storage.

I also created a struct encoding::Naive that can convert a DNA sequence to 2bit with any kind of encoding. I still need work on complement part, before create a PR for this.

The trait solution is nice because we can create any Encoder we want and the user can create one for his specific usage.

rob-p mentioned this issue Oct 10, 2021

Roadmap / TODO #6

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Encoding Type #3

Encoding Type #3

Daniel-Liu-c0deb0t commented Apr 28, 2021 •

edited

Loading

rob-p commented Apr 29, 2021

jamshed commented Apr 29, 2021

rob-p commented Apr 29, 2021

jamshed commented Apr 29, 2021

natir commented Jul 31, 2021

d-cameron commented Oct 10, 2021

natir commented Oct 11, 2021 •

edited

Loading

Encoding Type #3

Encoding Type #3

Comments

Daniel-Liu-c0deb0t commented Apr 28, 2021 • edited Loading

rob-p commented Apr 29, 2021

jamshed commented Apr 29, 2021

rob-p commented Apr 29, 2021

jamshed commented Apr 29, 2021

natir commented Jul 31, 2021

d-cameron commented Oct 10, 2021

natir commented Oct 11, 2021 • edited Loading

Daniel-Liu-c0deb0t commented Apr 28, 2021 •

edited

Loading

natir commented Oct 11, 2021 •

edited

Loading