Markov_DNA version 1 #1

Merged
merged 5 commits into from Jul 9, 2024

Conversation

ErikBlaz
Collaborator

No description provided.

@ErikBlaz ErikBlaz requested a review from eliamascolo June 26, 2024 17:47
Member

@eliamascolo eliamascolo left a comment

I left some comments. No problem with the installation. One important thing could be to allow lower-case input sequences; I left a suggestion to fix that. I also suggested we could have a generator, but I'm not sure we want to do that, just an idea.
Looks great! 👍
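
For reference, a minimal sketch of both ideas (the class and method names here are placeholders I made up, not necessarily the package's actual API):

import random

class MCM:
    def train(self, seq):
        # Accept lower-case input by normalizing before counting transitions.
        self.seq = seq.upper()

    def generate(self, size):
        # Placeholder sampling; the real model would use learned transitions.
        return "".join(random.choice("ACGT") for _ in range(size))

    def generator(self, size, n_seqs):
        # Generator variant: yield sequences lazily instead of returning a list.
        for _ in range(n_seqs):
            yield self.generate(size)

mcm = MCM()
mcm.train("acgtACGT")  # lower case now accepted
for s in mcm.generator(size=8, n_seqs=3):
    print(s)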

Comment on lines 143 to 149
"""
Generates auxiliar transition probabilities with the specified
method by the self.mode parameter.

Args:
state (string): The current state to In case of using the auxiliar Markov model.
"""
Member

Could you explain the AUXILIAR mode in a little more detail?

Collaborator Author

I added some explanations in the constructor and also in the function; let me know if that is enough.

Member

Thank you so much, Erik!

  • Let's see if I'm interpreting things correctly:
    The auxiliary mode simply tunes n down to m, where m is the largest order such that there are no missing transitions. The m-order transitions are the "auxiliary transitions". Whenever a "missing-transitions" state is encountered, the auxiliary transitions kick in; i.e., the four probabilities for the next symbol are based on an m-order model instead of an n-order model. Is that correct? (If so, see the sketch after this list for how I picture it.)
  • It would help to have some lines of comments anticipating the data structure of the transition and aux_transition dictionaries (what the keys and values are, and what they are for).
  • Why if len(self.aux_transition[k]) == 4**(k+1):
    and not if len(self.aux_transition[k]) == 4**(k):? If the order is k+1, meaning we are preserving (k+2)-mer frequencies, what is k? Sorry, I always get confused about the order. Maybe you can clarify with some comments (what the model order is, etc.).
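
If my interpretation above is right, the fallback might look roughly like this (just a sketch of how I picture it; the function names and dictionary layout are my assumptions, not your actual code):

def choose_aux_order(aux_transition, n):
    # Largest order m <= n whose table is complete: all 4**m
    # possible length-m contexts have observed transitions.
    for m in range(n, -1, -1):
        if len(aux_transition.get(m, {})) == 4 ** m:
            return m
    return 0

def next_base_probs(state, transition, aux_transition, m):
    # Use the n-order model when the state was observed; otherwise
    # fall back to the complete m-order auxiliary model.
    if state in transition:
        return transition[state]
    ctx = state[-m:] if m > 0 else ""
    return aux_transition[m][ctx]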

Member

BTW, there's a print in the line above if len(self.aux_transition[k]) == 4**(k+1): that we probably want to comment out in the final version.

Collaborator Author

You are correct, but 4**(k+1) was OK because k referred to the index of the list where the models were stored: since index 0 held the model of order 1, the model at index k had order k+1. That was because I was not computing the order-0 model. I have now modified the code to compute it, so k corresponds directly to the model order, and the check becomes 4**(k).
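
To make the new indexing concrete (a toy check, not code from the PR):

from itertools import product

# With k equal to the model order, a context has length k, so there are
# 4**k possible contexts, and a complete order-k table satisfies
# len(self.aux_transition[k]) == 4**(k).
k = 2
contexts = ["".join(p) for p in product("ACGT", repeat=k)]
assert len(contexts) == 4 ** k  # 16 contexts for an order-2 model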

Comment on lines +183 to +186
a = random.randint()
c = random.randint()
g = random.randint()
t = random.randint()
Member

I think this would cause an error, because randint requires arguments (min and max). It's also not ideal for simulating the domain of probabilities, which is continuous. I think you wanted to do

Suggested change
a = random.randint()
c = random.randint()
g = random.randint()
t = random.randint()
a = random.random()
c = random.random()
g = random.random()
t = random.random()

Member

I also wonder what the advantage of choosing random transition probabilities would be. If I'm not mistaken, this is part of the training. If by chance a certain transition is favored, it will be favored in all the generated sequences. So, when the generated sequences are used as a control set, there could be extra bias (e.g. the original sequence stands out from the background just because the background has a preference for certain transitions that the original sequence doesn't have). It would be a signal consistently present in all the generated sequences, so it wouldn't go away by increasing the sample size. To minimize these effects, I think we may prefer an uninformed prior, like setting them all to 0.25, or setting them to the four nucleotide frequencies observed in the original sequence. What do you think, Erik? Am I misunderstanding your code? @ivanerill , comments?

Collaborator Author

I think that in both cases, using randint or random, it would be the same: we would have to normalize the probabilities so that they sum to 1.

This part of the code is not part of the training but of the sequence generation. This function is executed when the generation of the sequence reaches a state without transitions, so random transition probabilities are generated, but only for that specific moment; if by chance the same state is reached again, the transition probabilities are generated anew. Therefore, if at any time a transition is greatly favored, it is not reflected in the rest of the sequences, nor in the same sequence.

If you think it is better to set them all to 0.25, or to the four nucleotide frequencies observed in the original sequence, we can change it without problem.
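
To illustrate what I mean (a simplified sketch of the idea, not the exact code in this PR):

import random

def random_transition():
    # Draw four random weights and normalize them so they sum to 1.
    w = [random.random() for _ in range(4)]
    s = sum(w)
    return dict(zip("ACGT", (x / s for x in w)))

def next_base_for_unseen_state():
    # Called each time an unseen state is reached, so a fresh
    # distribution is drawn per occurrence and never reused.
    probs = random_transition()
    return random.choices("ACGT", weights=list(probs.values()))[0]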

Member

@eliamascolo eliamascolo Jul 5, 2024

If you use randint you must specify the bounds, otherwise you get

TypeError: randint() missing 2 required positional arguments: 'a' and 'b'

Let's say you do randint(1,4). There are 4 possible outcomes per base, so you are sampling from a set of at most 4^4 = 256 possible weight tuples. There are infinitely many possible distributions, but you are now restricted to only 256 choices. So randint would simulate the continuous space of possible distributions as accurately as random only in the limit of a really big value for the upper bound you provide to randint.
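
A quick way to see the restriction (a toy enumeration, not code from the PR):

from itertools import product

# randint(1, 4) gives 4 outcomes per base: 4**4 = 256 weight tuples.
tuples = list(product(range(1, 5), repeat=4))
# Even fewer distinct distributions survive normalization, since
# proportional tuples such as (1,1,1,1) and (2,2,2,2) coincide.
dists = {tuple(x / sum(t) for x in t) for t in tuples}
print(len(tuples), len(dists))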

Member

Going back to the choice between random probabilities or fixed probabilities.

Are you saying that the choice happens independently for each sequence being generated?
If yes:
This means that on average you will introduce A with probability 0.25. It's easy to prove "by symmetry": there's no bias toward any specific letter in your random approach, so the expected counts for the four letters must all be the same, and hence the expected frequencies are all 0.25. If that's the case, the only difference between what you are doing and setting the four probabilities directly to 0.25 is how the variance is distributed over the sequences in the set: as a whole they would approach 0.25 per base, but single sequences can be more biased than expected by chance (because their randomly chosen transitions can be far from 0.25 per base).

Even so, I don't see good reasons for injecting this type of noise (let me know if I'm not considering some benefit of this approach), especially if the control set of generated sequences is small: in that case, the frequencies may be biased even when looking at the whole set of generated sequences.
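
A toy simulation of the symmetry argument (my own sketch, not package code):

import random
from collections import Counter

totals, seq_bias = Counter(), []
for _ in range(2000):
    # Each sequence gets its own randomly drawn distribution over ACGT.
    w = [random.random() for _ in range(4)]
    counts = Counter(random.choices("ACGT", weights=w, k=100))
    totals.update(counts)
    seq_bias.append(max(counts.values()) / 100)

n = sum(totals.values())
print({b: round(totals[b] / n, 3) for b in "ACGT"})  # each approx. 0.25 overall
print(round(sum(seq_bias) / len(seq_bias), 2))       # single sequences far more biased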

Alternative (using fixed probabilities)

As an alternative algorithm for the "random" mode, I suggest choosing the next symbol based on its frequency in the original sequence (not necessarily a probability of 0.25, but a fixed probability nonetheless).
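
Something along these lines (a sketch; background_probs is a name I'm making up):

import random
from collections import Counter

def background_probs(seq):
    # Fixed fallback: the four nucleotide frequencies observed
    # in the original (training) sequence.
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}

probs = background_probs("ACGTACGGGT")
next_base = random.choices("ACGT", weights=[probs[b] for b in "ACGT"])[0]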

@ErikBlaz ErikBlaz merged commit 2dea8f7 into main Jul 9, 2024