Issue 421: Fix alignment coverage >1.0 and aniM symmetrical behaviour #425

kiepczi · 2024-03-25T16:25:12Z

This pull request resolves issue #421. After discovering that pyANI occasionally reports genome coverage values exceeding 1 and non-symmetrical behavior in nucmer, it was agreed to revise the method for calculating both the %ID and genome coverage. The changes implemented in the base code include:

- Conducting forward and reverse comparisons.
- Calculating the weighted average identity and reporting values for the query rather than the subject.
- Adjusting the total number of aligned bases by excluding overlapping regions reported by nucmer.
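
For illustration, the overlap adjustment in the last item can be sketched as below. This is a minimal, stdlib-only sketch and not the code in pyani/anim.py: it merges intervals by sorting, whereas the commits in this PR use IntervalTree. The function names and the example coordinates are made up.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end) on one sequence


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping inclusive intervals into disjoint ones."""
    merged: List[Interval] = []
    # Normalise reverse-strand hits (start > end) before sorting
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def aligned_bases(intervals: Iterable[Interval]) -> int:
    """Count positions covered by at least one alignment, each counted once."""
    return sum(end - start + 1 for start, end in merge_intervals(intervals))


def coverage(intervals: Iterable[Interval], genome_length: int) -> float:
    """Fraction of the genome covered by the alignments."""
    return aligned_bases(intervals) / genome_length


# Hypothetical alignments on a 1000 bp genome; the first two overlap
hits = [(1, 600), (500, 900), (950, 1000)]
naive = sum(e - s + 1 for s, e in hits) / 1000  # overlap counted twice, so > 1.0
corrected = coverage(hits, 1000)  # each position counted once, so <= 1.0
```

Summing alignment lengths directly counts the overlapping region twice, which is how a coverage above 1.0 can arise; counting each covered position once keeps the value within bounds.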

Type of change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality not to work as expected)
This change requires a documentation update
This is a documentation update

Action Checklist

- note that MUMmer's AlignedBases count does not correct for indels which are not aligned to the other sequence in the pairwise comparison.

This commit adds a Jupyter notebook and helper script to facilitate debugging and development. It also adds the reverse nucmer run to the joblist for pyani ANIm comparisons. The reverse nucmer run outputs will likely be incorporated into the database, but possibly not correctly, because shortcuts and simplifications were made on the assumption of a symmetrical matrix.

indent pyani_orm log output to show it belongs to previous log

@kiepczi

The sym keyword was removed by @kiepczi as we are now doing forward/reverse comparisons in MUMmer.

The original test still expected two values to be returned from parse_delta() - the tuple returned now has four entries. In addition, the content of the tuple has changed. It now returns the count of aligned bases from reference and query sequences, then the calculated percentage identity, and the count of similarity errors.
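
As a rough illustration of why the old test broke, the sketch below uses a stand-in function that returns four values in the order described above. The names `DeltaResult` and `stub_parse_delta`, the values, and the choice of expressing identity as a fraction are all assumptions made for this example; the real parse_delta() signature is not shown here.

```python
from typing import NamedTuple, Tuple


class DeltaResult(NamedTuple):
    """Hypothetical labels for the four values, in the order described above."""

    reference_aligned_bases: int
    query_aligned_bases: int
    percentage_identity: float
    similarity_errors: int


def stub_parse_delta() -> Tuple[int, int, float, int]:
    """Stand-in for parse_delta(); the values are made up for illustration."""
    return (900, 880, 0.99, 9)


# Unpacking into two names, as the original test did, now fails
try:
    aligned, sim_errors = stub_parse_delta()
except ValueError as err:
    old_style_error = str(err)

# Unpacking all four entries works
result = DeltaResult(*stub_parse_delta())
```

Naming the fields in a test makes it harder to mix up the reference and query counts when the tuple layout changes again.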

The ANIm calculation has been changed to more closely resemble dnadiff.pl output from the MUMmer package. Also, we now run the pairwise comparison in two directions (A v B, B v A), so the resulting matrix is no longer symmetrical. The old deltadir_result.csv file was generated using the former ANIm %ID calculation method, and assumed that the matrix was symmetrical. This has been replaced by results generated under the new method, which have been manually checked to be close to dnadiff.pl output. NOTE: the percentage identities found by pyani anim are not identical to dnadiff.pl output. This is at least in part because dnadiff.pl averages percentage identities from an output .mcoords file, where the percentages are rounded to two decimal places. Over hundreds of fragments, the rounding errors mount up. We consider that our approach, which maintains a single count (across exactly the same alignment fragments) and reports a single identity value, is less prone to these errors.
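
The rounding effect described in the note can be shown with a small sketch. The fragment values and function names are made up, and the length weighting in `identity_from_rounded` is an assumption of this example; the note above only states that the percentages are rounded to two decimal places before being averaged.

```python
from typing import List, Tuple

# (aligned length, similarity errors) per alignment fragment; made-up values
FRAGMENTS: List[Tuple[int, int]] = [(3000, 7), (700, 3), (1500, 11)]


def identity_from_counts(frags: List[Tuple[int, int]]) -> float:
    """Percentage identity from one running total of bases and errors."""
    total_len = sum(length for length, _ in frags)
    total_err = sum(err for _, err in frags)
    return 100 * (total_len - total_err) / total_len


def identity_from_rounded(frags: List[Tuple[int, int]]) -> float:
    """Length-weighted mean of per-fragment percentages rounded to 2 d.p."""
    total_len = sum(length for length, _ in frags)
    weighted = sum(
        round(100 * (length - err) / length, 2) * length for length, err in frags
    )
    return weighted / total_len


exact = identity_from_counts(FRAGMENTS)
rounded = identity_from_rounded(FRAGMENTS)
difference = abs(exact - rounded)  # small but non-zero for these values
```

Without the `round()` call the two functions would agree up to floating-point error, so in this sketch any difference comes from the per-fragment rounding alone.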

After updating the method for calculating %ID and genome coverage, we now run forward and reverse comparisons. However, the logger did not accurately reflect the number of jobs to run. This was addressed in this commit by using permutations instead of combinations for the list of genomes.
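
The difference in job counts can be seen with `itertools`. This is a minimal sketch with hypothetical genome labels, not the logging code itself:

```python
from itertools import combinations, permutations

genomes = ["genome_A", "genome_B", "genome_C"]  # hypothetical labels

# Unordered pairs: one job per pair, as when the matrix was assumed symmetrical
one_way = list(combinations(genomes, 2))

# Ordered pairs: A v B and B v A are separate jobs
both_ways = list(permutations(genomes, 2))

n_one_way = len(one_way)      # n * (n - 1) / 2
n_both_ways = len(both_ways)  # n * (n - 1)
```

For n genomes the ordered count is always twice the unordered one, which matches running each comparison in both directions.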

Support files for specific issues that are not part of the codebase should not be included in the pull request.

This should avoid some accidental inclusions of non-core files to the repo for PRs, etc.

widdowquinn · 2024-04-08T16:04:26Z

I have begun review locally.

Tests pass as expected (pytest -v)
Subfolders issue_340 and issue_421 were removed from tracking. These do not contain files relevant to the code base so should be removed from the PR.
I've added an issue_* entry to .gitignore to avoid accidental inclusion of similar folders in future.

widdowquinn · 2024-04-08T16:12:20Z

pyani/anim.py

Not sure why l.162 was reintroduced (logger argument) - have removed this.

widdowquinn · 2024-04-08T16:21:20Z

pyani/scripts/subcommands/subcmd_report.py

@@ -255,10 +255,10 @@ def subcmd_report(args: Namespace) -> int:
                "Query description",
                "Subject ID",
                "Subject description",
-                "% identity",
+                "% identity (weighted)",


We should remove the indication of (weighted) as this is not true for all comparison methods. I've done this in a new commit.

widdowquinn · 2024-04-08T16:21:56Z

pyani/scripts/subcommands/subcmd_report.py

                "% query coverage",
                "% subject coverage",
-                "alignment length",
+                "query alignment length",


As above, the alignment length may not be query alignment length for all methods. We should document what these columns mean for different analysis types.

This config skips assert issues in test files

widdowquinn

Thanks @kiepczi - and apologies for the delay. That's quite a large number of lines added. These seem mostly to be in the new test output/fixture files. Everything else seems OK and the tests pass.

This was fixed in subcmd_plot.py during widdowquinn#425

See also widdowquinn#425 for subcmd_plot.py and widdowquinn#437 for pyani/scripts/subcommands/subcmd_report.py

widdowquinn and others added 30 commits February 23, 2024 11:17

update issue 340

cbe5b6c

update Markdown summary of issue_340

99f84bb

add explainer for delta-filter

949c315

Progress on issue 340

675204a

add notebook example of how indels/overlaps affect MUMmer AlignedBases

039d235

- note that MUMmer's AlignedBases count does not correct for indels which are not aligned to the other sequence in the pairwise comparison.

add documentation

d14af71

add scripts&pytest for AlignedBases calculations

3913798

incongruency investigation

3679f10

check nucmer symmetry

ff7a3d6

check nucmer symmetry

2cef284

add draft script for A:B and B:A comparisons

801f872

modify anim.py to get 2 comparisons

2e07857

modify anim calculations to exclude overlaps using IntervalTree

ed3093c

move intervaltree from pip to conda requirements

941340d

update types in parse_delta()

56f0438

Update matrices and run tables for forward and reverse comparisons

693d0c2

update parse_delta function

4310ddd

Update matrices to reflect forward and reverse comparisons

23ccff3

update log formatting

6496a11

indent pyani_orm log output to show it belongs to previous log

add additional test commands

6761ddd

fix %ID calculations to weighted %ID

c976a8a

Merge branch 'issue_421' of github.com:widdowquinn/pyani into issue_421

491135d

remove sym keyword from results.add* calls in anib.py

dda0741

The sym keyword was removed by @kiepczi as we are now doing forward/reverse comparisons in MUMmer.

Correct anim calculation

af718bd

remove sym keyword as reverse and forward comparisons for MUMmer are …

fc53772

…now run

Merge branch 'issue_421' of https://github.com/widdowquinn/pyani into…

ccc1175

… issue_421

add temporary debugging print statements/output files

8412d94

kiepczi requested a review from widdowquinn as a code owner March 25, 2024 16:25

widdowquinn self-assigned this Apr 8, 2024

widdowquinn added this to the 0.3.0 milestone Apr 8, 2024

widdowquinn added 2 commits April 8, 2024 16:53

removed issue_340 and issue_421 folders from the PR

724510b

Support files for specific issues that are not part of the codebase should not be included in the pull request.

add all folders beginning issue_* to .gitignore

31f40c5

This should avoid some accidental inclusions of non-core files to the repo for PRs, etc.

widdowquinn reviewed Apr 8, 2024

remove unnecessary logger argument

eec3329

widdowquinn reviewed Apr 8, 2024

widdowquinn added 3 commits April 8, 2024 17:22

remove "query" from alignment length header

01e432d

add custom BANDIT config file

ca3fff5

This config skips assert issues in test files

rename bandit config file for codacy

492fe59

widdowquinn approved these changes Apr 8, 2024

widdowquinn merged commit 1f10a6c into master Apr 8, 2024
1 check passed

widdowquinn deleted the issue_421 branch April 8, 2024 17:09

peterjc added a commit to peterjc/pyani that referenced this pull request Oct 8, 2024

Passing a string to pd.read_json is deprecated

ed183a5

This was fixed in subcmd_plot.py during widdowquinn#425

This was referenced Oct 8, 2024

Passing a string to pd.read_json is deprecated #437

Merged

Discussion - ANIm files using pyani v0.2 vs 0.3 differ pyani-plus/pyani-plus#109

Closed

peterjc added a commit to peterjc/pyani that referenced this pull request Oct 29, 2024

Passing a string to pd.read_json is deprecated

425c9c4

See also widdowquinn#425 for subcmd_plot.py and widdowquinn#437 for pyani/scripts/subcommands/subcmd_report.py

peterjc mentioned this pull request Oct 29, 2024

Passing a string to pd.read_json is deprecated #440

Open

21 tasks
