Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding Type #3

Open
Daniel-Liu-c0deb0t opened this issue Apr 28, 2021 · 7 comments
Open

Encoding Type #3

Daniel-Liu-c0deb0t opened this issue Apr 28, 2021 · 7 comments

Comments

@Daniel-Liu-c0deb0t
Copy link
Collaborator

Daniel-Liu-c0deb0t commented Apr 28, 2021

I'm assuming we'd be using 2 bits per bp. There's a couple of common encodings, but I want to suggest

A -> 00
C -> 01
T -> 10
G -> 11

There's a couple of benefits:

  1. These are the 2nd and 3rd bits of the ASCII encoding of the corresponding base pairs. Conversion from byte strings would be easy.
  2. Complement by using XOR ...0101010

On a slightly unrelated note, I've worked on some sequence manipulation stuff that use SIMD (eg., here, here for a library that was abandoned). Many of these ideas could be applicable here as well. I'm assuming that we want scalar ops only here because SIMD registers are probably too wide (128 or 256 bits) for handling kmers that are relatively short.

@rob-p
Copy link
Contributor

rob-p commented Apr 29, 2021

Hi @Daniel-Liu-c0deb0t,

I agree there are some nice properties to this encoding. Given that, it sort of bothers me that we went with the in-order encoding in most of our C++ software (e.g. here). Given that we'll have to interact with that it may be useful to have a utility function for (quickly) initializing a k-mer from a bit-packed k-mer using another encoding scheme. Frankly, I don't know if there are others apart from A=0, C=1, G=2, T=3 and the one you proposed, but it's worth discussing.

Regarding SIMD capability, I actually think that that may provide some useful benefits when k-mers no longer fit in a single word. For example, in cuttlefish we use a class similar to what I imagine this one being like, and when we want to use something like k=61 or k=75, we would probably be able to fill some SIMD registers.

@jamshed
Copy link
Member

jamshed commented Apr 29, 2021

Hi @rob-p and @Daniel-Liu-c0deb0t ,

The proposed encoding here is a nice one. Albeit, the A = 0, C = 1, G = 2, and T = 3 encoding does make the lexicographic comparisons simple and straightforward — facilitating fast canonical k-mers computation, which are heavily used in the lab softwares e.g. in cuttlefish.

@rob-p
Copy link
Contributor

rob-p commented Apr 29, 2021

Hi @jamshed-k,

Your point is important when we consider interoperating with other encodings. However, for many purposes, all that matters about canonicalization is self-consistency. From that perspective, any bijective mapping from the nucleotides to their encoded integers will work. For example, we could easily define the canonical k-mer to be the max rather than the min of the encoding and it's reverse complement. Likewise, we could define a reversible hash for k-mers and define the canonical k-mer to be the one with the minimum (or maximum) hash value. Really, all that matters is that we chose the canonical form consistently, I think.

@jamshed
Copy link
Member

jamshed commented Apr 29, 2021

Hi @rob-p,

Sure, obviously we can define the canonical k-mer with self-consistency under any bijective mapping from the nucleotide-bases to integers. My point is that if defining the canonicalization function for a k-mer x as the lexicographical minimum of x and x_bar (or the max) — as prevalent in the lab tools and literature — then the in-order mapping should be more efficient for the function computation.

@natir
Copy link
Contributor

natir commented Jul 31, 2021

Hi all,

I think it could be nice to support any encoding, by add an enum type in Kmer definition we can support any encoding.

If I didn't made any mistake we have only 24 possible encoding.

@d-cameron
Copy link

Frankly, I don't know if there are others

The UCSC 2bit format uses a different encoding:

T - 00, C - 01, A - 10, G - 11

https://genome.ucsc.edu/FAQ/FAQformat.html#format7

I think it could be nice to support any encoding

This would be my preference. The encoding only matters when converting to an unpacked representation or complementing the bases.

@rob-p rob-p mentioned this issue Oct 10, 2021
@natir
Copy link
Contributor

natir commented Oct 11, 2021

I made some work around this. I create a trait encoding::Encoder, use by Kmer to populate storage.

I also created a struct encoding::Naive that can convert a DNA sequence to 2bit with any kind of encoding. I still need work on complement part, before create a PR for this.

The trait solution is nice because we can create any Encoder we want and the user can create one for his specific usage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants