Is it possible to add AAI? #185

Closed
nascimentofx opened this issue May 1, 2020 · 8 comments

Comments

@nascimentofx

nascimentofx commented May 1, 2020

Dear widdowquinn

Is it possible to add AAI to the current program?

Just by changing your genbank_get_genomes_by_taxon script, I could easily download faa files from assemblies instead of fna, so an AAI script should be straightforward to use.

Thanks

Francisco

@widdowquinn
Owner

Hi Francisco,

Thank you for your interest in pyani.

It's possible, but it still needs to be coded by someone. There are additional steps to AAI over and above genome alignment, such as identifying (putative) orthologues for comparison, and handling paralogy. That is not entirely trivial (see #16). It's on the list of things to do, but it's not my number one priority. I do take pull requests, if you'd like to get involved.

As this is a duplicate of #16, I'm closing this issue.

Cheers,

L.

@nascimentofx
Author

Thank you very much for your fast response and sorry for the repetition.
Couldn't we just perform BLAST (1 vs 1) of the total proteomes, using an E-value that is strong enough to remove noise (10^-50), and then use an average of that value? Then there could be another value, such as the number of hits vs non-hits.
I know this could be misleading, but there is always an error associated with this kind of analysis, similar to ANI, which currently uses a low threshold for calculating identity. But that could be a start.

@widdowquinn
Owner

I don't agree that the proposal is a good way to use BLAST. E-values are dependent not only on the alignment resulting from a match in the database, but also on the size of the database (in terms of number of letters) used for the search. It is not appropriate to set a single E-value (or, equivalently, bit score) for all searches - proteomes vary in size between organisms. I wouldn't let that through as a method. A rigorous way to proceed would be to identify putative orthologues (e.g. with RBBH) and go from there. If you want to remove "noise" in the form of, say, matches of domains rather than full proteins, then you'd need to do something else, like filter RBBHs with %identity and %coverage, rather than E-value. See, for instance, this blog post.
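To make the database-size dependence concrete, here is a minimal sketch using the standard Karlin-Altschul relation between bit score and E-value, E = m * n * 2^(-S'), where m is the query length, n is the total database length in letters, and S' is the bit score. The function name and the proteome sizes are made up for illustration, and the sketch ignores the effective-length corrections that BLAST applies in practice.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: E = m * n * 2^(-S').

    Assumption: raw lengths are used; BLAST itself uses edge-corrected
    "effective" lengths, so real E-values will differ somewhat.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical proteome sizes in residues (illustrative numbers only).
small_proteome = 500_000
large_proteome = 3_000_000

# The same alignment (same bit score, same query) against each database.
e_small = expected_hits(bit_score=200.0, query_len=300, db_len=small_proteome)
e_large = expected_hits(bit_score=200.0, query_len=300, db_len=large_proteome)

# The E-value scales linearly with database size.
print(e_large / e_small)
```

Under these assumptions, an identical alignment receives a larger E-value in the larger proteome, so a fixed E-value cutoff does not select the same set of alignments across organisms.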

As you're probably aware, there are multiple approaches for calculating ANI - I wouldn't consider any of them to have a low threshold for calculating identity. Homologous regions are identified, and then the average sequence identity of the aligned homologous regions is calculated. The opportunity to parameterise is in the identification of homologous regions; the nature of DNA sequences is that only regions with >70-80% identity are likely to be identified as "homologous", as you can quickly verify by querying dissimilar genomes against each other with BLAST (or another tool) and looking for the lowest %identity in the tabular output.
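The check described above can be sketched in a few lines of Python. This assumes BLAST+ tabular output in the default 12-column `-outfmt 6` order; the function name and the two example hits are invented for illustration, and the length-weighted mean is only a rough stand-in for an ANI calculation, not pyani's implementation.

```python
import csv
import io
from typing import Tuple

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def identity_summary(tabular_text: str) -> Tuple[float, float]:
    """Return (lowest %identity, alignment-length-weighted mean %identity)."""
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    hits = [(float(row["pident"]), int(row["length"])) for row in reader]
    if not hits:
        raise ValueError("no hits in tabular output")
    lowest = min(pident for pident, _ in hits)
    total_length = sum(length for _, length in hits)
    weighted = sum(pident * length for pident, length in hits) / total_length
    return lowest, weighted


# Made-up hits, for illustration only (not real BLAST results).
example = "\n".join([
    "contigA\tcontigB\t92.50\t1000\t75\t0\t1\t1000\t1\t1000\t0.0\t1500",
    "contigA\tcontigB\t78.00\t500\t110\t0\t2000\t2499\t3000\t3499\t1e-90\t330",
])
lowest, mean_identity = identity_summary(example)
print(lowest, round(mean_identity, 2))
```

Running this over real output from two dissimilar genomes would show where the lowest reported %identity falls for that pair.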

Cheers,

L.

@nascimentofx
Author

nascimentofx commented May 1, 2020 via email

@widdowquinn
Owner

widdowquinn commented May 1, 2020

It's as complicated as it needs to be. The premise is simple:

  1. identify things that are equivalent
  2. calculate their sequence identity

But "equivalent" has multiple interpretations. You call them "equal proteins" - but what is your precise, reproducible, and - to be blunt - programmable definition of "equal"? This also needs to be biologically meaningful for the analysis to be worthwhile.
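As one illustration of what a programmable definition could look like, here is a minimal sketch of the reciprocal best BLAST hit (RBBH) idea mentioned earlier in the thread, with %identity and %coverage filters. This is not pyani code: the `Hit` record, the function names, and the threshold values are all hypothetical, and the sketch does nothing about paralogy.

```python
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """Hypothetical record for one protein-vs-proteome BLAST hit."""
    query: str
    subject: str
    pident: float    # percent identity of the alignment
    qcov: float      # percent of the query covered by the alignment
    bitscore: float


def best_hits(hits: List[Hit]) -> Dict[str, Hit]:
    """Keep the highest-bit-score hit for each query."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


def rbbh_pairs(a_vs_b: List[Hit], b_vs_a: List[Hit],
               min_pident: float = 40.0,   # arbitrary illustrative threshold
               min_qcov: float = 70.0      # arbitrary illustrative threshold
               ) -> List[Tuple[str, str, float]]:
    """Return (a_id, b_id, mean %identity) for filtered reciprocal best hits."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = []
    for a_id, fwd in best_ab.items():
        rev = best_ba.get(fwd.subject)
        if rev is None or rev.subject != a_id:
            continue  # best hits are not reciprocal
        if min(fwd.pident, rev.pident) < min_pident:
            continue  # too dissimilar in at least one direction
        if min(fwd.qcov, rev.qcov) < min_qcov:
            continue  # partial (e.g. domain-only) match
        pairs.append((a_id, fwd.subject, (fwd.pident + rev.pident) / 2))
    return pairs


# Made-up hits, for illustration only.
a_vs_b = [Hit("a1", "b1", 90.0, 95.0, 400.0),
          Hit("a1", "b2", 50.0, 80.0, 120.0),
          Hit("a2", "b2", 60.0, 30.0, 90.0)]
b_vs_a = [Hit("b1", "a1", 90.0, 96.0, 398.0),
          Hit("b2", "a2", 60.0, 85.0, 92.0)]
pairs = rbbh_pairs(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data, a2 and b2 are reciprocal best hits but a2 is only 30% covered by the alignment, so the pair is dropped as a likely partial match. Whether any particular thresholds are biologically meaningful is exactly the open question raised above.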

Apologies if this sounds patronising, but you might find this introduction to some of the complexity of identifying whether proteins are "equivalent" to be useful.

@nascimentofx
Author

nascimentofx commented May 2, 2020 via email

@widdowquinn
Owner

Hi Francisco,

It is true that some genera/species have surprisingly few members that share more than 95% genome identity. I would suggest that you may find value in considering the practical details of how taxonomy has been assigned, historically, when interpreting this.

It is not true that actinobacteria have been evolving for longer than proteobacteria. All bacteria have been evolving for the same period of time, since their most recent common ancestor.

You may enjoy reading about LINBase, and the correspondence of ANI to taxonomic classes in bacteria, especially in the context of discontinuities in similarity scores/identities.

L.

@nascimentofx
Author

nascimentofx commented May 2, 2020 via email
