Markov_DNA version 1 #1
I think this would cause an error, because `randint` requires arguments (min and max). It is also not ideal for simulating the domain of probabilities, which is continuous. I think you wanted to use `random` instead.
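For concreteness, here is a minimal sketch of the difference, assuming the code uses Python's standard `random` module:

```python
import random

# random.randint needs both bounds; calling it with no arguments raises
# TypeError: randint() missing 2 required positional arguments: 'a' and 'b'
print(random.randint(1, 4))  # discrete draw from {1, 2, 3, 4}

# random.random takes no arguments and returns a float in [0.0, 1.0),
# i.e. a draw from the continuous domain that a probability lives in
print(random.random())
```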
I also wonder what the advantage of choosing random transition probabilities would be. If I'm not mistaken, this is part of the training.

If by chance a certain transition is favored, it will be favored in all the generated sequences. So, when using the generated sequences as a control set, there could be extra bias (e.g. the original sequence stands out from the background just because of a preference for certain transitions that the background has and the original sequence doesn't). It would be a signal that is consistently present in all the generated sequences, so it doesn't go away by increasing the sample size.

I think that to minimize these effects we may prefer an uninformed prior, like setting them all to 0.25, or setting them to the four observed nucleotide frequencies in the original sequence. What do you think, Erik? Am I misunderstanding your code? @ivanerill, comments?
I think that in both cases, whether using `randint` or `random`, it would be the same: we would have to normalize the probabilities so that they sum to 1.

This part of the code is not part of the training, but of the sequence generation. This function is executed when the generation of a sequence reaches a state without transitions, so random transitions are generated, but only for that specific moment; if by chance the same state is reached again, the transition probabilities are generated anew. Therefore, if at any moment a transition is greatly favored, it would not be reflected in the rest of the sequences, nor in the rest of the same sequence.

If you think it is better to set them all to 0.25, or to set them to the four nucleotide frequencies observed in the original sequence, we can change it without problems.
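For clarity, this is roughly what that fallback does (a sketch, not the literal code in this PR; the helper name is illustrative):

```python
import random

NUCLEOTIDES = "ACGT"

def random_transition_probs():
    # Draw one random weight per nucleotide and normalize so they sum to 1.
    # This is called only when generation reaches a state with no observed
    # transitions, and a fresh distribution is drawn on every such call,
    # so no particular bias persists across (or within) sequences.
    weights = [random.random() for _ in NUCLEOTIDES]
    total = sum(weights)
    return {nt: w / total for nt, w in zip(NUCLEOTIDES, weights)}
```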
If you use `randint` you must specify the bounds, otherwise you get an error.

Let's say you do `randint(1, 4)`. There are 4 possible outcomes per base, so you are sampling from a set of 4^4 = 256 possible distributions. There are infinitely many, but you are now restricted to only 256 choices. So `randint` would simulate the continuous space of possible distributions as accurately as `random` only in the limit of a really big value for the upper bound you provide to `randint`.
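A quick sketch of the counting argument (Python, assuming one `randint(1, 4)` draw per base, then normalization):

```python
from fractions import Fraction
from itertools import product

# One randint(1, 4) draw per base: 4 values for each of 4 bases
weight_vectors = list(product(range(1, 5), repeat=4))
print(len(weight_vectors))  # 4**4 = 256

# After normalizing, several vectors collapse to the same distribution
# (e.g. (1, 1, 1, 1) and (2, 2, 2, 2)), so the set of reachable
# distributions is even smaller than 256 -- versus the infinitely many
# distributions reachable with random()
distinct = {tuple(Fraction(w, sum(v)) for w in v) for v in weight_vectors}
print(len(distinct))
```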
Going back to the choice between random probabilities and fixed probabilities: are you saying that the choice happens independently for each sequence that is being generated?

If yes: this means that on average you will introduce A with probability 0.25. It's easy to prove "by symmetry": there's no bias toward any specific letter in your `random` approach, so the expected number of times each of the four letters gets chosen must be the same, and so the frequencies will all be 0.25. If that's the case, the only difference between what you are doing and setting the four probabilities directly to 0.25 is the distribution of the variance over the sequences in the set. As a whole, they would approach 0.25 per base, but single sequences can be more biased than expected by chance (because their randomly chosen transitions are far from 0.25 per base). Even in this case, I don't see good reasons for injecting this type of noise (let me know if I'm not considering some benefit of this approach), especially if the control set of generated sequences is small. In that case, the frequencies may be biased even when looking at the whole set of generated sequences.

Alternative (using fixed probabilities)

As an alternative algorithm for the "random" mode, I suggest that the next symbol be chosen based on its frequency in the original sequence (not necessarily a probability of 0.25, but a fixed probability nonetheless).
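A minimal sketch of what I mean (the function names are illustrative, not from this PR):

```python
import random
from collections import Counter

def background_probs(original_sequence):
    # Fixed background distribution: the four nucleotide frequencies
    # observed in the original sequence, computed once and then reused
    # for every state that has no observed transitions
    counts = Counter(original_sequence)
    total = len(original_sequence)
    return {nt: counts[nt] / total for nt in "ACGT"}

def sample_next(probs):
    # Draw the next symbol according to the fixed background distribution
    nucleotides, weights = zip(*probs.items())
    return random.choices(nucleotides, weights=weights)[0]
```

This keeps the composition of the control set anchored to the original sequence instead of injecting per-sequence noise.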