Issue 421: Fix alignment coverage >1.0 and aniM symmetrical behaviour #425
- note that MUMmer's AlignedBases count does not correct for indels which are not aligned to the other sequence in the pairwise comparison.
This commit adds a jupyter notebook and helper script that should facilitate debugging/development. Also, changes were made that add the reverse nucmer run to the joblist for pyani ANIm comparisons. The reverse nucmer run outputs will likely be incorporated into the database, but possibly not correctly, as shortcuts/simplifications were made, assuming a symmetrical matrix.
indent pyani_orm log output to show it belongs to previous log
The sym keyword was removed by @kiepczi as we are now doing forward/reverse comparisons in MUMmer.
The original test still expected two values to be returned from parse_delta() - the tuple returned now has four entries. In addition, the content of the tuple has changed. It now returns the count of aligned bases from reference and query sequences, then the calculated percentage identity, and the count of similarity errors.
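As a rough sketch of what the updated test has to deal with, the four values can be unpacked in the order described above. The `DeltaResult` container, its field names, the `coverage()` helper, and all numbers here are hypothetical illustrations, not the actual pyani API; whether the identity is stored as a fraction or a percentage is also an assumption.

```python
from typing import NamedTuple


class DeltaResult(NamedTuple):
    """Hypothetical container mirroring the four values described above."""

    ref_aligned_bases: int    # aligned bases counted on the reference
    query_aligned_bases: int  # aligned bases counted on the query
    identity: float           # identity (scale is an assumption here)
    similarity_errors: int    # count of similarity errors


def coverage(aligned_bases: int, genome_length: int) -> float:
    """Fraction of a genome covered by aligned bases (illustrative helper)."""
    return aligned_bases / genome_length


# Made-up numbers, for illustration only.
result = DeltaResult(4_000_000, 3_900_000, 0.9875, 50_000)

# A test that previously unpacked two values now needs to unpack four.
ref_aln, qry_aln, identity, sim_errors = result

# Coverage is computed per genome, so the two values generally differ.
ref_cov = coverage(ref_aln, 5_000_000)
qry_cov = coverage(qry_aln, 4_000_000)
```

Keeping separate reference and query counts is what allows coverage to be reported for each genome in the pair.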
The ANIm calculation has been changed to more closely resemble dnadiff.pl output from the MUMmer package. Also, we now run the pairwise comparison in two directions (A v B, B v A), so the resulting matrix is no longer symmetrical. The old deltadir_result.csv file was generated using the former ANIm %ID calculation method, and assumed that the matrix was symmetrical. It has been replaced by results generated under the new method, which have been manually checked to be close to dnadiff.pl output. NOTE: the percentage identities found by pyani anim are not identical to dnadiff.pl output. This is at least in part because dnadiff.pl averages percentage identities from an output .mcoords file, where the percentages are rounded to two decimal places. Over hundreds of fragments, the rounding errors mount up, and we consider that our approach (maintaining a single count across exactly the same alignment fragments and reporting a single identity value) is less prone to these errors.
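The rounding effect described above can be illustrated with a minimal sketch. The fragment values are invented, the function names are hypothetical, and the length-weighted averaging is an assumption made so that rounding is the only source of difference; this is neither the pyani implementation nor a reproduction of dnadiff.pl.

```python
# Hypothetical alignment fragments: (aligned_length, similarity_errors).
fragments = [(1000, 3), (2500, 11), (700, 1)]


def pooled_identity(frags):
    """Single identity from running totals kept across all fragments."""
    total_len = sum(length for length, _ in frags)
    total_err = sum(errors for _, errors in frags)
    return 1 - total_err / total_len


def mean_of_rounded(frags):
    """Length-weighted mean of per-fragment percentages rounded to 2 d.p.

    The weighting scheme is an assumption for this illustration.
    """
    total_len = sum(length for length, _ in frags)
    weighted = sum(
        round(100 * (1 - errors / length), 2) * length
        for length, errors in frags
    )
    return weighted / total_len / 100


pooled = pooled_identity(fragments)
averaged = mean_of_rounded(fragments)
difference = abs(pooled - averaged)
```

With only three fragments the two values already disagree slightly, and the gap comes entirely from rounding each fragment's percentage before averaging.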
After updating the method for calculating %ID and genome coverage, we now run forward and reverse comparisons. However, the logger did not accurately reflect the number of jobs to run. This was addressed in this commit by using permutations instead of combinations for the list of genomes.
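The difference in job counts can be seen with the standard library's `itertools`. This is a minimal sketch with hypothetical genome names, not the code from the commit.

```python
from itertools import combinations, permutations

# Hypothetical genome identifiers.
genomes = ["genome_A", "genome_B", "genome_C"]

# Unordered pairs: one comparison per pair (assumes a symmetrical matrix).
one_way = list(combinations(genomes, 2))

# Ordered pairs: forward and reverse comparisons are separate jobs.
two_way = list(permutations(genomes, 2))

print(f"{len(one_way)} one-way jobs, {len(two_way)} two-way jobs")
```

For n genomes, combinations give n(n-1)/2 pairs while permutations give n(n-1), so a job count reported from combinations is half the number of comparisons actually run.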
Support files for specific issues that are not part of the codebase should not be included in the pull request.
This should avoid some accidental inclusions of non-core files to the repo for PRs, etc.
I have begun review locally.
pyani/anim.py
Outdated
Not sure why l.162 was reintroduced (logger argument) - have removed this.
@@ -255,10 +255,10 @@ def subcmd_report(args: Namespace) -> int:
"Query description",
"Subject ID",
"Subject description",
"% identity",
"% identity (weighted)",
We should remove the "(weighted)" label, as this is not true for all comparison methods. I've done this in a new commit.
"% query coverage",
"% subject coverage",
"alignment length",
"query alignment length",
As above, the alignment length may not be query alignment length for all methods. We should document what these columns mean for different analysis types.
This config skips assert issues in test files
Thanks @kiepczi - and apologies for the delay. That's quite a large number of lines added. These seem mostly to be in the new test output/fixture files. Everything else seems OK and the tests pass.
This was fixed in subcmd_plot.py during widdowquinn#425
See also widdowquinn#425 for subcmd_plot.py and widdowquinn#437 for pyani/scripts/subcommands/subcmd_report.py
This pull request resolves issue #421. After discovering that pyANI occasionally reports genome coverage values exceeding 1 and non-symmetrical behavior in nucmer, it was agreed to revise the method for calculating both the %ID and genome coverage. The changes implemented in the base code include:
nucmer.
Type of change
Action Checklist
pyani repository under your own account (please allow write access for repository maintainers)
CONTRIBUTING.md)
pytest -v non-passing code will not be merged
origin/master
flake8 and black before submission
Pull requests section in the pyani repository