Is there a way to define a different k-mer? What way to chop up the reads? #13

Open
yingeddi2008 opened this issue Mar 11, 2016 · 0 comments


Hi,

The RDP classifier paper says the word size is 8. (Just to make sure I understand it correctly: the size here is the length of the word, such as ATCTGGTC, right?) According to the paper's preliminary experiments, 8 is optimal because word sizes of 6, 7, or 9 were less accurate.

- Is there an option for me to pick another word size when I am training my own classifier with a customized database?

Also, how do you chop up the reads in the database? It says all the words should be non-overlapping, which is meant to satisfy the naive Bayes assumption that all features are independent (correct me if I have misunderstood). Say I have a sequence in the database:

SeqA: AAAAAAAA TTTTTTTT GGGGGGGG TTTTTTTT

If I chop from the very first nt, I should get these 8-nt words:

AAAAAAAA x1, TTTTTTTT x2, and GGGGGGGG x1, and these will be recorded as the features for this particular genus.
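
To make the counting above concrete, here is a minimal Python sketch of non-overlapping chopping. The function name `chop_non_overlapping` and the choice to return the leftover separately are assumptions made for illustration; this is not code from the RDP classifier.

```python
from collections import Counter

def chop_non_overlapping(seq, k=8):
    """Split seq into consecutive, non-overlapping words of length k.

    Hypothetical helper for illustration only.
    Returns (words, leftover), where leftover is the tail shorter than k.
    """
    seq = seq.replace(" ", "").upper()   # drop the spaces used for readability
    n_full = len(seq) // k               # number of complete words
    words = [seq[i * k:(i + 1) * k] for i in range(n_full)]
    leftover = seq[n_full * k:]
    return words, leftover

seq_a = "AAAAAAAA TTTTTTTT GGGGGGGG TTTTTTTT"
words_a, leftover_a = chop_non_overlapping(seq_a)
counts_a = Counter(words_a)

for word, count in counts_a.items():
    print(word, count)
print("leftover:", repr(leftover_a))
```

With SeqA the length is an exact multiple of 8, so the leftover is empty and TTTTTTTT is counted twice.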

But what if I have a test sequence:

SeqB: ATTTTTTT TGG

Clearly SeqB is a subsequence of SeqA (it spans the last A, the eight T's, and the first two G's of SeqA), but if I chop from the very first nt, I won't get the same feature words that I got from SeqA. I will get ATTTTTTT, plus whatever is left over: TGG. What do you do with the leftover nt? Just throw them away?
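
The frame problem described above can be checked with a short Python sketch. It compares fixed, non-overlapping chopping with a sliding window that takes a word at every position. Both helper names are hypothetical, and the sliding window is shown only as a common alternative in k-mer methods, not as a confirmed description of what the RDP classifier does.

```python
def non_overlapping_words(seq, k=8):
    """Words taken at positions 0, k, 2k, ...; a short tail is dropped."""
    seq = seq.replace(" ", "").upper()
    return {seq[i:i + k] for i in range(0, len(seq) - k + 1, k)}

def sliding_words(seq, k=8):
    """Words taken at every position (overlapping window of length k)."""
    seq = seq.replace(" ", "").upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

seq_a = "AAAAAAAA TTTTTTTT GGGGGGGG TTTTTTTT"
seq_b = "ATTTTTTT TGG"

# Fixed frames: SeqB starts one nt before a word boundary of SeqA,
# so its only full word never lines up with SeqA's words.
shared_fixed = non_overlapping_words(seq_a) & non_overlapping_words(seq_b)

# Sliding window: every word of SeqB also occurs somewhere in SeqA.
shared_sliding = sliding_words(seq_a) & sliding_words(seq_b)

print("shared with fixed frames:  ", sorted(shared_fixed))
print("shared with sliding window:", sorted(shared_sliding))
```

In this example the fixed frames share no words at all, while the sliding window recovers all four 8-nt words of SeqB and leaves no leftover to discard.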

- I think I need some insight into how the database is chopped up into k-mers, and how you define the features.

I am a beginner in machine learning algorithms and am still learning about the RDP classifier. If my understanding is wrong, I welcome any suggestions.

Thanks a lot!

Eddi
