Markov_DNA version 1 #1

Merged
merged 5 commits into from Jul 9, 2024

Conversation

ErikBlaz
Collaborator

No description provided.

@ErikBlaz ErikBlaz requested a review from eliamascolo June 26, 2024 17:47
Member

@eliamascolo eliamascolo left a comment

I left some comments. No problem with the installation. One important thing could be to allow lower-case input sequences; I left a suggestion to fix that. I also suggested we could have a generator, but I'm not sure we want to do that, just an idea.
Looks great! 👍
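
For reference, a minimal sketch of both ideas (the class and method names here are placeholders I made up, not necessarily the package's actual API):

import random

class MCM:
    def train(self, seq):
        # Accept lower-case input by normalizing before counting transitions.
        self.seq = seq.upper()

    def generate(self, size):
        # Placeholder sampling; the real model would use learned transitions.
        return "".join(random.choice("ACGT") for _ in range(size))

    def generator(self, size, n_seqs):
        # Generator variant: yield sequences lazily instead of returning a list.
        for _ in range(n_seqs):
            yield self.generate(size)

mcm = MCM()
mcm.train("acgtACGT")  # lower case now accepted
for s in mcm.generator(size=8, n_seqs=3):
    print(s)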

Comment on lines 143 to 149
"""
Generates auxiliar transition probabilities with the specified
method by the self.mode parameter.

Args:
state (string): The current state to In case of using the auxiliar Markov model.
"""
Member

Could you explain the AUXILIAR mode in a little more detail?

Collaborator Author

I added some explanations in the constructor and also in the function; let me know if that is enough.

Member

Thank you so much, Erik!

  • Let's see if I'm interpreting things correctly:
    The auxiliary mode simply tunes n down to m, where m is the largest order such that there are no missing transitions. The m-order transitions are the "auxiliary transitions". Whenever a "missing-transitions" state is encountered, the auxiliary transitions kick in; i.e., the four probabilities for the next symbol are based on an m-order model instead of an n-order model. Is that correct? (If so, see the sketch after this list for how I picture it.)
  • It would help to have some lines of comments anticipating the data structure of the transition and aux_transition dictionaries (what the keys and values are, and what they are for).
  • Why if len(self.aux_transition[k]) == 4**(k+1):
    and not if len(self.aux_transition[k]) == 4**(k):? If the order is k+1, meaning we are preserving (k+2)-mer frequencies, what is k? Sorry, I always get confused about the order. Maybe you can clarify with some comments (what the model order is, etc.).
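
If my interpretation above is right, the fallback might look roughly like this (just a sketch of how I picture it; the function names and dictionary layout are my assumptions, not your actual code):

def choose_aux_order(aux_transition, n):
    # Largest order m <= n whose table is complete: all 4**m
    # possible length-m contexts have observed transitions.
    for m in range(n, -1, -1):
        if len(aux_transition.get(m, {})) == 4 ** m:
            return m
    return 0

def next_base_probs(state, transition, aux_transition, m):
    # Use the n-order model when the state was observed; otherwise
    # fall back to the complete m-order auxiliary model.
    if state in transition:
        return transition[state]
    ctx = state[-m:] if m > 0 else ""
    return aux_transition[m][ctx]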

Member

BTW, there's a print in the line above if len(self.aux_transition[k]) == 4**(k+1): that we probably want to comment out in the final version.

Collaborator Author

You are correct, but 4**(k+1) was OK because k referred to the index of the list where the models were stored: since index 0 held the model of order 1, the model at index k had order k+1. That was because I was not computing the order-0 model. I have now modified the code to compute it, so k corresponds directly to the model order, and the check becomes 4**(k).
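
To make the new indexing concrete (a toy check, not code from the PR):

from itertools import product

# With k equal to the model order, a context has length k, so there are
# 4**k possible contexts, and a complete order-k table satisfies
# len(self.aux_transition[k]) == 4**(k).
k = 2
contexts = ["".join(p) for p in product("ACGT", repeat=k)]
assert len(contexts) == 4 ** k  # 16 contexts for an order-2 model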

Comment on lines +183 to +186
a = random.randint()
c = random.randint()
g = random.randint()
t = random.randint()
Member

I think this would cause an error, because randint requires arguments (min and max). It's also not ideal for simulating the domain of probabilities, which is continuous. I think you wanted to do

Suggested change
a = random.randint()
c = random.randint()
g = random.randint()
t = random.randint()
a = random.random()
c = random.random()
g = random.random()
t = random.random()

Member

I also wonder what the advantage of choosing random transition probabilities would be. If I'm not mistaken, this is part of the training. If by chance a certain transition is favored, it will be favored in all the generated sequences. So, when the generated sequences are used as a control set, there could be extra bias (e.g. the original sequence stands out from the background just because the background has a preference for certain transitions that the original sequence doesn't have). It would be a signal consistently present in all the generated sequences, so it wouldn't go away by increasing the sample size. To minimize these effects, I think we may prefer an uninformed prior, like setting them all to 0.25, or setting them to the four nucleotide frequencies observed in the original sequence. What do you think, Erik? Am I misunderstanding your code? @ivanerill , comments?

Collaborator Author

I think that in both cases, using randint or random, it would be the same: we would have to normalize the probabilities so that they sum to 1.

This part of the code is not part of the training but of the sequence generation. This function is executed when the generation of the sequence reaches a state without transitions, so random transition probabilities are generated, but only for that specific moment; if by chance the same state is reached again, the transition probabilities are generated anew. Therefore, if at any time a transition is greatly favored, it is not reflected in the rest of the sequences, nor in the same sequence.

If you think it is better to set them all to 0.25, or to the four nucleotide frequencies observed in the original sequence, we can change it without problem.
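
To illustrate what I mean (a simplified sketch of the idea, not the exact code in this PR):

import random

def random_transition():
    # Draw four random weights and normalize them so they sum to 1.
    w = [random.random() for _ in range(4)]
    s = sum(w)
    return dict(zip("ACGT", (x / s for x in w)))

def next_base_for_unseen_state():
    # Called each time an unseen state is reached, so a fresh
    # distribution is drawn per occurrence and never reused.
    probs = random_transition()
    return random.choices("ACGT", weights=list(probs.values()))[0]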

Member

@eliamascolo eliamascolo Jul 5, 2024

If you use randint you must specify the bounds, otherwise you get

TypeError: randint() missing 2 required positional arguments: 'a' and 'b'

Let's say you do randint(1,4). There are 4 possible outcomes per base, so you are sampling from a set of at most 4^4 = 256 possible weight tuples. There are infinitely many possible distributions, but you are now restricted to only 256 choices. So randint would simulate the continuous space of possible distributions as accurately as random only in the limit of a really big value for the upper bound you provide to randint.
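
A quick way to see the restriction (a toy enumeration, not code from the PR):

from itertools import product

# randint(1, 4) gives 4 outcomes per base: 4**4 = 256 weight tuples.
tuples = list(product(range(1, 5), repeat=4))
# Even fewer distinct distributions survive normalization, since
# proportional tuples such as (1,1,1,1) and (2,2,2,2) coincide.
dists = {tuple(x / sum(t) for x in t) for t in tuples}
print(len(tuples), len(dists))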

Member

Going back to the choice between random probabilities or fixed probabilities.

Are you saying that the choice happens independently for each sequence being generated?
If yes:
This means that on average you will introduce A with probability 0.25. It's easy to prove "by symmetry": there's no bias toward any specific letter in your random approach, so the expected counts for the four letters must all be the same, and hence the expected frequencies are all 0.25. If that's the case, the only difference between what you are doing and setting the four probabilities directly to 0.25 is how the variance is distributed over the sequences in the set: as a whole they would approach 0.25 per base, but single sequences can be more biased than expected by chance (because their randomly chosen transitions can be far from 0.25 per base).

Even so, I don't see good reasons for injecting this type of noise (let me know if I'm not considering some benefit of this approach), especially if the control set of generated sequences is small: in that case, the frequencies may be biased even when looking at the whole set of generated sequences.
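
A toy simulation of the symmetry argument (my own sketch, not package code):

import random
from collections import Counter

totals, seq_bias = Counter(), []
for _ in range(2000):
    # Each sequence gets its own randomly drawn distribution over ACGT.
    w = [random.random() for _ in range(4)]
    counts = Counter(random.choices("ACGT", weights=w, k=100))
    totals.update(counts)
    seq_bias.append(max(counts.values()) / 100)

n = sum(totals.values())
print({b: round(totals[b] / n, 3) for b in "ACGT"})  # each approx. 0.25 overall
print(round(sum(seq_bias) / len(seq_bias), 2))       # single sequences far more biased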

Alternative (using fixed probabilities)

As an alternative algorithm for the "random" mode, I suggest choosing the next symbol based on its frequency in the original sequence (not necessarily a probability of 0.25, but a fixed probability nonetheless).
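
Something along these lines (a sketch; background_probs is a name I'm making up):

import random
from collections import Counter

def background_probs(seq):
    # Fixed fallback: the four nucleotide frequencies observed
    # in the original (training) sequence.
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}

probs = background_probs("ACGTACGGGT")
next_base = random.choices("ACGT", weights=[probs[b] for b in "ACGT"])[0]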

@ErikBlaz ErikBlaz merged commit 2dea8f7 into main Jul 9, 2024