Different Selective forces on a Gene with "L" status #201

Open
vinitamehlawat opened this issue Jan 24, 2025 · 4 comments

@vinitamehlawat

Greetings Dr. Hiller @MichaelHiller

I would like to clarify some of my questions regarding L and UL genes, which I also asked in a previous thread (#183).

For UL gene loss, you suggested that I check both the transcriptome and the RELAX test.

I ran the RELAX test (HyPhy) with all in-group species as test branches and all out-group species as reference branches. As input I gave HyPhy the codon alignment after removing all genes that have no sequence at all (---).
codon.fasta has sequences whose headers consist of the transcript ID and gene name, like this: ENSDART0000004453.fgf10a | CODON | REFERENCE (for the reference) and ENSDART0000004453.fgf10a | CODON | QUERY (for the query). I extracted only the query sequences, using resources from the TOGA discussion page (cat codon.fasta | grep "CODON | QUERY" -w -A 1 | grep "^\-\-$" -v | awk '{if ($1 ~ /^>/) printf $1"\t"; else print $0}' | sed 's/-//g' | awk -F "\t" '{if ($2 != "") print $1"\n"$2}' > Sp1Codon.fasta)
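For readers who find the one-liner above hard to follow, here is a minimal Python sketch of what it appears to do: keep only the records whose header contains "CODON | QUERY", strip the gap characters, and drop records that end up empty. The function name `extract_query_codons` and the example IDs are hypothetical, and the header layout is assumed to be exactly as described in the paragraph above.

```python
def extract_query_codons(fasta_text, tag="CODON | QUERY"):
    """Return (id, ungapped_sequence) pairs for records whose header contains `tag`.

    Assumption: headers look like '>TRANSCRIPT.gene | CODON | QUERY'.
    Records that consist only of gaps are skipped, as in the shell pipeline.
    """
    records = []
    header, chunks = None, []

    def flush():
        # Store the record collected so far if it is a non-empty query record.
        if header is not None and tag in header:
            seq = "".join(chunks).replace("-", "")
            if seq:
                # Keep only the first whitespace-separated field, without '>'.
                records.append((header.split()[0].lstrip(">"), seq))

    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()
            header, chunks = line, []
        elif line:
            chunks.append(line)
    flush()
    return records


example = (
    ">T1.geneA | CODON | REFERENCE\nATGAAA\n"
    ">T1.geneA | CODON | QUERY\nATG---AAA\n"
    ">T2.geneB | CODON | QUERY\n------\n"
)
for seq_id, seq in extract_query_codons(example):
    print(f">{seq_id}\n{seq}")
```

Unlike `grep -A 1`, this sketch also copes with sequences wrapped over several lines.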

On the focal branch of my tree, gene X is lost in all in-group species (the TOGA status is "L" in every species). This gene has two transcripts, t1 and t2. For t1, RELAX reports relaxation of selection; for t2, it reports evidence for intensification of selection. The likelihood ratio test gave p = 0.0000 for t1 and p = 0.0007 for t2, both significant at p <= 0.05.

It's a very confusing situation for me.

  1. I checked the loss_sum.tsv files of all focal species, and in every focal species all three levels (PROJECTION, TRANSCRIPT and GENE) are clearly "L". How is it possible that the gene is lost but its two transcripts are under different selective constraints? In such a case, what would the status of the gene be?

I am also adding a picture of the mutation plot of this gene for both transcripts. Strangely, the mutations are the same in all 11 of my focal species (exactly identical mutations shared in the exons).

I am not 100% sure whether I made a mistake somewhere in this process, so please help me understand this case.

Looking forward to hearing from you.

Thank you
Vinita

[Image: mutation plot of the gene for both transcripts]
@MichaelHiller
Collaborator

Hi Vinita,

  1. Please run RELAX 10 times independently with the same input. We have had cases where we got, say, intensification 3 times and relaxation 7 times. RELAX is not very stable, so we now go with the majority vote.
    You can of course also include the reference species (once) in these alignments.

  2. This gene looks pretty much lost, as there are several mutations. But 3 of them are splice site mutations. If you have RNA-seq data for any of the 11 ingroups, I would check whether the gene is expressed and, if so, whether TOGA simply couldn't find the splice sites.
    If the splice sites really seem mutated and these mutations are shared among the 11 species (which also rules out base errors in the assembly), you then have a situation like FBP2 in hummingbirds https://www.science.org/doi/10.1126/science.abn7050 where the gene was lost in the common ancestor.
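The majority vote over repeated RELAX runs described in point 1 can be sketched as follows. This is only an illustration, not the procedure used by the TOGA authors: the function names are hypothetical, and each run is assumed to have been reduced by hand to a pair of the relaxation/intensification parameter K and the likelihood ratio test p-value (in RELAX, K < 1 indicates relaxation and K > 1 intensification).

```python
from collections import Counter


def classify_run(k, p_value, alpha=0.05):
    """Label one RELAX run from its K parameter and LRT p-value."""
    if p_value > alpha:
        return "not significant"
    # K < 1: relaxation; otherwise treated as intensification.
    return "relaxation" if k < 1 else "intensification"


def majority_vote(runs, alpha=0.05):
    """Return (most frequent label, counts) over (K, p-value) pairs.

    Returns the label "tie" if the two most frequent labels occur equally often.
    """
    calls = Counter(classify_run(k, p, alpha) for k, p in runs)
    if not calls:
        raise ValueError("no runs given")
    (top, top_n), *rest = calls.most_common()
    if rest and rest[0][1] == top_n:
        return "tie", calls
    return top, calls


# Made-up values mimicking "3 times intensification and 7 times relaxation".
runs = [(0.4, 0.0)] * 7 + [(1.8, 0.0)] * 3
call, counts = majority_vote(runs)
print(call, dict(counts))
```

Reporting the counts alongside the call keeps a narrow result such as 6 of 10 visibly distinct from a clear one such as 8 of 10.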

@vinitamehlawat
Author

Greetings Dr. Hiller @MichaelHiller

Thank you for your prompt reply.

Following your suggestion:

1. I ran RELAX 10 times independently with the same input, including the reference species.

For the transcript where I initially got relaxation, the 10 runs gave: relaxation significant 8 times, intensification significant 2 times.

For the transcript where I initially got intensification, the 10 runs gave: relaxation significant 6 times, intensification significant 4 times.

In all 10 runs the likelihood ratio test gave p = 0.0000 (except for one run, where it was 0.037), significant at p <= 0.05.

Overall, this suggests that both transcripts are under relaxed selection.

BUT here comes the interesting part:

  2. I do have RNA-seq data for 3 of the 11 in-group species. I blasted the gene against this expression data and found that it is present (both transcripts) in the RNA-seq data of all 3 species, with 98-100% identity. So this means we cannot say the gene is lost. In such cases, where a gene is called lost and has multiple splice site mutations, do I need to check the gene in the expression dataset?

For example, I have determined that on my focal branch there are 30 genes in total with a clear "L" in every single in-group species. Do I need to cross-check every one of these genes against the transcriptome?

By the way, another thing I wanted to point out: when I overlapped the RELAX genes with the gene loss data, this was the only gene with status "L" that appears in both the RELAX list and the loss list. All other genes shared between the gene loss and RELAX datasets have status "UL" in all my in-group species. (I did this because I was curious how many genes are under relaxed selection if our list of lost genes also includes UL.)
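The overlap described above (genes with status "L" in every in-group species, intersected with the genes called relaxed) can be sketched as below. This is a hedged illustration: it assumes each loss summary file is a three-column, tab-separated table of level (PROJECTION, TRANSCRIPT or GENE), identifier and status, and the function names, species labels and gene IDs are hypothetical.

```python
import csv
import io


def gene_status(loss_table_text):
    """Map gene ID -> status using only the GENE rows of one species' table.

    Assumed layout per line: level <TAB> identifier <TAB> status.
    """
    status = {}
    for row in csv.reader(io.StringIO(loss_table_text), delimiter="\t"):
        if len(row) >= 3 and row[0] == "GENE":
            status[row[1]] = row[2]
    return status


def genes_with_status_in_all(per_species, wanted="L"):
    """Return the genes whose status equals `wanted` in every species."""
    sets = [
        {gene for gene, s in table.items() if s == wanted}
        for table in per_species.values()
    ]
    return set.intersection(*sets) if sets else set()


# Made-up tables for two species.
per_species = {
    "sp1": gene_status("GENE\tgeneX\tL\nGENE\tgeneY\tUL\nTRANSCRIPT\tt1\tL\n"),
    "sp2": gene_status("GENE\tgeneX\tL\nGENE\tgeneY\tL\n"),
}
lost_everywhere = genes_with_status_in_all(per_species)
relaxed = {"geneX", "geneZ"}  # genes called relaxed by the RELAX majority vote
print(sorted(lost_everywhere & relaxed))
```

Calling the same helper with `wanted="UL"` would give the corresponding list for UL genes.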

I look forward to your suggestions on how to resolve such cases from TOGA and how to explain them scientifically. Previously we expected this candidate gene to be an adaptive loss on the focal branch, but now it seems confusing to us.

Your guidance will help us move forward.

Thanks again
Vinita

@MichaelHiller
Collaborator

Regarding 1), yes. The first transcript (8 of 10) is obviously better supported than the second (6 of 10).

Regarding 2), no. After a gene has lost its protein-coding capacity, meaning it can no longer be translated into a functional protein, it can still be expressed for some time. After all, if the transcript that is now non-coding (or produces only a crippled protein) doesn't harm the cell, there is no pressure to shut off transcription.
In this case I would ask whether your RNA-seq reads support the frameshifts and stop codons you found. If so, this is transcription of a formerly protein-coding and now non-coding gene. If not, you may have polymorphic inactivating mutations.

Also, you can check whether the RNA-seq supports different splice sites than the ones indicated by TOGA. If so, the splice sites may be intact and the gene may not be lost, as you would then only have a frameshift left in exon 5 (which may also be a splice site mutation).
--> Check whether the exon-intron structure derived from the RNA-seq is translatable. If so, you may still have relaxed selection, but I would then be careful about calling this a loss.
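A first-pass check of whether a coding sequence assembled from the RNA-seq-supported exons is translatable could look like the sketch below. It assumes the standard nuclear genetic code (stop codons TAA, TAG, TGA) and that the exons have already been concatenated in frame, starting at the start codon; the function name `orf_problems` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear genetic code (assumption)


def orf_problems(cds):
    """List reasons why `cds` does not look like an intact open reading frame.

    An empty list means: length is a multiple of 3 and no stop codon occurs
    before the final codon. Gap characters are ignored.
    """
    cds = cds.upper().replace("-", "")
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The last codon may be the regular stop codon, so it is not checked.
    for index, codon in enumerate(codons[:-1], start=1):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {index}")
    return problems


print(orf_problems("ATGAAATAA"))
print(orf_problems("ATGTAAAAATAA"))
```

This only flags frame and stop-codon problems. Whether the individual reads actually carry the frameshifts and stop codons still has to be checked in the read alignments.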

Transcriptome-based validation is only necessary for genes that have few mutations or mostly splice site mutations.

  1. UL can represent early stages of gene loss, or simply the loss of individual exons while the remaining exons are preserved. It makes sense that some of these genes are relaxed. Of course, you will also find relaxation in genes that are intact.

@vinitamehlawat
Author

Thank you very much Dr. @MichaelHiller,

This helps a lot!

I will check the alignments against the RNA-seq data and look at the exon-intron structure. I will write back to you about this issue if we are unable to resolve it at the sequence level.

Again, I appreciate your thoughtful feedback.

Best Regards
Vinita
