Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 421: Fix alignment coverage >1.0 and aniM symmetrical behaviour #425

Merged
merged 42 commits into from
Apr 8, 2024

Conversation

kiepczi
Copy link
Collaborator

@kiepczi kiepczi commented Mar 25, 2024

This pull request resolves issue #421. After discovering that pyANI occasionally reports genome coverage values exceeding 1 and non-symmetrical behavior in nucmer, it was agreed to revise the method for calculating both the %ID and genome coverage. The changes implemented in the base code include:

  1. Conducting forward and reverse comparisons.
  2. Calculating the weighted average identity and reporting values for the query rather than the subject.
  3. Adjusting the total number of aligned bases by excluding overlapping regions reported by nucmer.

Type of change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality not to work as expected)
  • This change requires a documentation update
  • This is a documentation update

Action Checklist

  • Work on a single issue/concept (if there are multiple separate issues to address, please use a separate pull request for each)
  • Fork the pyani repository under your own account (please allow write access for repository maintainers)
  • Set up an appropriate development environment (please see CONTRIBUTING.md)
  • Create a new branch with a short, descriptive name
  • Work on this branch
    • style guidelines have been followed
    • new code is commented, especially in hard-to-understand areas
    • corresponding changes to documentation have been made
    • tests for the change have been added that demonstrate the fix or feature works
  • Test locally with pytest -v non-passing code will not be merged
  • Rebase against origin/master
  • Check changes with flake8 and black before submission
  • Commit branch
  • Submit pull request, describing the content and intent of the pull request
  • Request a code review
  • Continue the discussion at Pull requests section in the pyani repository

widdowquinn and others added 30 commits February 23, 2024 11:17
- note that MUMmer's AlignedBases count does not correct for indels
which are not aligned to the other sequence in the pairwise comparison.
This commit adds a jupyter notebook and helper script that should
facilitate debugging/development.

Also, changes were made that add the reverse nucmer run to the
joblist for pyani ANIm comparisons.

The reverse nucmer run outputs will likely be incorporated into
the database, but possibly not correctly, as shortcuts/simplifications
were made, assuming a symmetrical matrix.
indent pyani_orm log output to show it belongs to previous log
The sym keyword was removed by @kiepczi as we are now doing
forward/reverse comparisons in MUMmer.
The original test still expected two values to be returned from
parse_delta() - the tuple returned now has four entries.

In addition, the content of the tuple has changed. It now returns
the count of aligned bases from reference and query sequences,
then the calculated percentage identity, and the count of
similarity errors.
The ANIm calculation has been changed to more closely resemble
dnadiff.pl output from the MUMmer package. Also, we now run the
pairwise comparison in two directions (A v B, B v A), so the
resulting matrix is no longer symmatrical.

The old deltadir_result.csv file was generated using the former
ANIm %ID calculation method, and assumed that the matrix was
symmetrical. This has been replaced by results under the new
method, and that have been manually checked to be close to
dnadiff.pl output.

NOTE: the percentage identities found by pyani anim are not
identical to dnadiff.pl output. This is at least in part because
dnadiff.pl averages percentage identities from an output .mcoords
file, where the percentages are rounded to two decimal places.
Over hundreds of fragments, the rounding errors mount up and we
consider that our approach - maintaining a single count (across
exactly the same alignment fragments) and reporting a single
identity value - is less prone to these errors
After updating the method for calculating %ID and genome coverage, we now run
forward and reverse comparisions. However, the logger did not accurately reflect
the number of jobs to run. This was adressed in this commit by using permutations
instead of combinations for the list of genomes.
@kiepczi kiepczi requested a review from widdowquinn as a code owner March 25, 2024 16:25
@widdowquinn widdowquinn self-assigned this Apr 8, 2024
@widdowquinn widdowquinn added bug something isn't working how it should enhancement something we'd like pyani to do that it doesn't already performance the issue relates to making pyani more efficient method the issue relates to how results are calculated VERSION_3 issues relating to version 0.3.x of pyani HIGH PRIORITY high priority issue labels Apr 8, 2024
@widdowquinn widdowquinn added this to the 0.3.0 milestone Apr 8, 2024
Support files for specific issues that are not part of the codebase
should not be included in the pull request.
This should avoid some accidental inclusions of non-core files
to the repo for PRs, etc.
@widdowquinn
Copy link
Owner

I have begun review locally.

  • Tests pass as expected (pytest -v)
  • Subfolders issue_340 and issue_421 were removed from tracking. These do not contain files relevant to the code base so should be removed from the PR.
  • I've added an issue_* entry to .gitignore to avoid accidental inclusion of similar folders in future.

pyani/anim.py Outdated
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why l.162 was reintroduced (logger argument) - have removed this.

@@ -255,10 +255,10 @@ def subcmd_report(args: Namespace) -> int:
"Query description",
"Subject ID",
"Subject description",
"% identity",
"% identity (weighted)",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove the indication of (weighted) as this is not true for all comparison methods. I've done this in a new commit.

"% query coverage",
"% subject coverage",
"alignment length",
"query alignment length",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, the alignment length may not be query alignment length for all methods. We should document what these columns mean for different analysis types.

Copy link
Owner

@widdowquinn widdowquinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kiepczi - and apologies for the delay. That's quite a large number of lines added. These seem mostly to be in the new test output/fixture files. Everything else seems OK and the tests pass.

@widdowquinn widdowquinn merged commit 1f10a6c into master Apr 8, 2024
1 check passed
@widdowquinn widdowquinn deleted the issue_421 branch April 8, 2024 17:09
peterjc added a commit to peterjc/pyani that referenced this pull request Oct 8, 2024
This was fixed in subcmd_plot.py during widdowquinn#425
peterjc added a commit to peterjc/pyani that referenced this pull request Oct 29, 2024
See also widdowquinn#425 for subcmd_plot.py and widdowquinn#437 for
pyani/scripts/subcommands/subcmd_report.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something isn't working how it should enhancement something we'd like pyani to do that it doesn't already HIGH PRIORITY high priority issue method the issue relates to how results are calculated performance the issue relates to making pyani more efficient VERSION_3 issues relating to version 0.3.x of pyani
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants