Markov_DNA version 1 #1
Conversation
I left some comments. No problem with the installation. One important thing could be to allow lower case input sequences. I left a suggestion to fix that. I also suggested we could have a generator, but I'm not sure we want to do that, just an idea.
Looks great! 👍
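The suggested fix itself is not shown in this view; a minimal sketch of the kind of normalization meant here (hypothetical helper, assuming sequences are plain strings):

```python
def normalize_sequence(seq):
    """Accept lower-case or mixed-case DNA input by upper-casing it
    before training, so that k-mer keys are always A/C/G/T."""
    seq = seq.upper()
    # Optionally validate the alphabet after normalization.
    if any(base not in "ACGT" for base in seq):
        raise ValueError("sequence contains non-ACGT characters")
    return seq
```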
src/Markov_DNA/MCM.py
```python
"""
Generates auxiliary transition probabilities with the method
specified by the self.mode parameter.

Args:
    state (string): The current state, used in case the auxiliary
        Markov model is applied.
"""
```
Could you explain the AUXILIAR mode a little bit more?
I added some explanations in the constructor and also in the function; tell me if that is enough.
Thank you so much, Erik!
- Let's see if I'm interpreting things correctly: the auxiliary mode is simply tuning down `n` to `m`, where `m` is the largest order such that there are no missing transitions. The m-order transitions are the "auxiliary transitions". Whenever a "missing-transitions" state is encountered, the auxiliary transitions kick in, i.e., the four probabilities for the next symbol will be based on an m-order model instead of an n-order model. Is that correct?
- It would help to have some lines of comments where you anticipate the data structure of the `transition` and `aux_transition` dictionaries (what the keys and values are, what they are for).
- Why `if len(self.aux_transition[k]) == 4**(k+1):` and not `if len(self.aux_transition[k]) == 4**k:`? If the order is k+1, meaning we are preserving (k+2)-mer frequencies, what is k? Sorry, I always get confused about the order. Maybe you can clarify with some comments (what the model order is, etc.).
BTW, there's a `print` in the line above `if len(self.aux_transition[k]) == 4**(k+1):` that we probably want to comment out in the final version.
You are correct, but `4**(k+1)` was OK because `k` referred to the index of the list where the models were stored, and since I was storing the model of order 1 at index 0, the order of the model was `k+1`. That was because I was not computing the model of order 0. I have modified the code to compute it, so now `k` corresponds to the model order and the check is `4**k`.
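A minimal sketch of the indexing convention described here, assuming `aux_transition` is a list where index `k` holds the order-`k` model as a dict from each k-mer context to its next-base probabilities (the names come from the thread; the exact structure is an assumption):

```python
def is_complete(aux_transition, k):
    # An order-k model conditions on the k previous bases, so there are
    # 4**k possible contexts (4**0 == 1 for the order-0 model).  The model
    # stored at index k is "complete" when every context has been observed.
    return len(aux_transition[k]) == 4 ** k

# Toy example with the new indexing: index k holds the order-k model.
order0 = {"": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}
order1 = {ctx: {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
          for ctx in "ACGT"}
aux_transition = [order0, order1]
assert is_complete(aux_transition, 0)  # 4**0 == 1 context
assert is_complete(aux_transition, 1)  # 4**1 == 4 contexts
```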
```python
a = random.randint()
c = random.randint()
g = random.randint()
t = random.randint()
```
I think this would cause an error, because `randint` requires two arguments (min and max). It's also not ideal for simulating the domain of probabilities, which is continuous. I think you wanted to do:
```diff
- a = random.randint()
- c = random.randint()
- g = random.randint()
- t = random.randint()
+ a = random.random()
+ c = random.random()
+ g = random.random()
+ t = random.random()
```
I also wonder what the advantage of choosing random transition probabilities would be. If I'm not mistaken, this is part of the training. If by chance a certain transition is favored, it will be favored in all the generated sequences. So, when using the generated sequences as a control set, there could be extra bias (e.g., the original sequence stands out from the background just because the background has a preference for certain transitions that the original sequence doesn't have). It would be a signal that is consistently present in all the generated sequences, so it doesn't go away by increasing the sample size. To minimize these effects we may prefer an uninformed prior, like setting them all to 0.25, or setting them to the four observed nucleotide frequencies in the original sequence. What do you think, Erik? Am I misunderstanding your code? @ivanerill, comments?
I think that in both cases, using `randint` or `random`, it would be the same: we would have to normalize the probabilities so that they sum to 1.
This part of the code is not part of the training but of the sequence generation. The function is executed when, during generation, a state without transitions is reached; random transitions are then generated, but only for that specific moment. If by chance the same state is reached again, the transition probabilities are generated anew. Therefore, if at any point a transition is greatly favored, it is not reflected in the rest of the sequences, nor in the rest of the same sequence.
If you think it is better to set them all to 0.25, or to the four nucleotide frequencies observed in the original sequence, we can change it without problems.
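A minimal sketch of the normalization described here (hypothetical helper, assuming four `random.random()` draws rescaled to sum to 1):

```python
import random

def random_transition():
    """Draw one set of transition probabilities for the current state.

    The four draws are independent uniforms; dividing by their sum
    normalizes them into a probability distribution.  A fresh set is
    drawn each time a state without observed transitions is reached.
    """
    draws = [random.random() for _ in "ACGT"]
    total = sum(draws)
    return {base: d / total for base, d in zip("ACGT", draws)}

probs = random_transition()
assert abs(sum(probs.values()) - 1.0) < 1e-9
```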
If you use `randint` you must specify the bounds, otherwise you get:

```
TypeError: randint() missing 2 required positional arguments: 'a' and 'b'
```

Let's say you do `randint(1, 4)`. There are 4 possible outcomes per base, so you are sampling from a set of 4^4 = 256 possible distributions. There are infinitely many, but you are now restricted to only 256 choices. So `randint` would simulate the continuous space of possible distributions as accurately as `random` only in the limit of a really big value for the upper bound you provide to `randint`.
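A toy demonstration of this restriction (illustration only, not project code), counting how many distinct normalized distributions `randint(1, 4)` can ever produce:

```python
from itertools import product

# With randint(1, 4) each base gets one of 4 values, so before
# normalization there are 4**4 == 256 possible draws; normalizing
# collapses some of them, leaving even fewer distinct distributions.
distinct = {tuple(v / sum(draw) for v in draw)
            for draw in product(range(1, 5), repeat=4)}
print(len(distinct), "distinct distributions from randint(1, 4)")

# random() samples the continuous space instead, so the restriction
# disappears only as the randint upper bound grows very large.
```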
Going back to the choice between random probabilities and fixed probabilities.
Are you saying that the choice happens independently for each sequence that is being generated?
If yes:
This means that on average you will introduce A with probability 0.25. It's easy to prove "by symmetry": there's no bias toward any specific letter in your `random` approach, so the expected number of times each of the four letters gets chosen must be the same, and so the frequencies will all be 0.25. If that's the case, the only difference between what you are doing and setting the four probabilities directly to 0.25 is the distribution of the variance over the sequences in the set. As a whole they would approach 0.25 per base, but single sequences can be more biased than expected by chance (because their randomly chosen transitions are far from 0.25 for each base). Even in this case, I don't see good reasons for injecting this type of noise (let me know if I'm not considering some benefits of this approach), especially if the control set of generated sequences is small. In that case, the frequencies may be biased even when looking at the whole set of generated sequences.
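A quick toy simulation of this symmetry argument (illustration only, not project code): over many independently drawn transition distributions, the average probability of each base approaches 0.25, even though any single draw can be far from uniform.

```python
import random

trials = 100_000
avg = {base: 0.0 for base in "ACGT"}
for _ in range(trials):
    draws = [random.random() for _ in "ACGT"]
    total = sum(draws)
    for base, d in zip("ACGT", draws):
        avg[base] += d / total / trials
print(avg)  # each value is close to 0.25
```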
Alternative (using fixed probabilities)
As an alternative algorithm for the "random" mode, I suggest that the next symbol be chosen based on its frequency in the original sequence (not necessarily a probability of 0.25, but a fixed probability nonetheless).
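A minimal sketch of this alternative (hypothetical helper names, assuming the fallback samples the next base from the base frequencies observed in the original sequence):

```python
import random
from collections import Counter

def background_frequencies(seq):
    """Fixed fallback probabilities: the observed base frequencies."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}

def sample_next_base(freqs):
    """Choose the next symbol with the fixed background probabilities."""
    bases = list(freqs)
    return random.choices(bases, weights=[freqs[b] for b in bases], k=1)[0]

freqs = background_frequencies("ACGTACGGGT")
print(sample_next_base(freqs))
```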