-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix alignment coverage >1.0 and aniM symmetrical behaviour #421
Comments
Should probably mention that I'm currently handling this under branch |
Update on the issue and its progress: We have decided to change the way we calculate the number of aligned bases and genome coverage by correcting for overlapping regions with We can't replicate Following an investigation into the behavior of nucmer, we have discovered that it is not symmetrical in practice. As a result, we have agreed to run two comparisons (reverse and forward). The values in matrices are now reported for query comparisons rather than the average between the two comparisons. Further details regarding this decision are discussed in issue #151. |
I ran (issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests
================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.19, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: cov-4.1.0, ordering-0.6
collected 120 items
tests/test_anib.py::test_get_version_nonetype PASSED [ 0%]
tests/test_anib.py::test_get_version_no_exe PASSED [ 1%]
tests/test_anib.py::test_get_version_exe_not_executable PASSED [ 2%]
tests/test_anib.py::test_get_version_exe_no_version PASSED [ 3%]
tests/test_anib.py::test_blastall_dbjobdict PASSED [ 4%]
tests/test_anib.py::test_blastall_graph PASSED [ 5%]
tests/test_anib.py::test_blastall_multiple PASSED [ 5%]
tests/test_anib.py::test_blastall_single PASSED [ 6%]
tests/test_anib.py::test_blastn_dbjobdict PASSED [ 7%]
tests/test_anib.py::test_blastn_graph PASSED [ 8%]
tests/test_anib.py::test_blastn_multiple PASSED [ 9%]
tests/test_anib.py::test_blastn_single PASSED [ 10%]
tests/test_anib.py::test_formatdb_multiple PASSED [ 10%]
tests/test_anib.py::test_formatdb_single PASSED [ 11%]
tests/test_anib.py::test_fragment_files PASSED [ 12%]
tests/test_anib.py::test_makeblastdb_multiple PASSED [ 13%]
tests/test_anib.py::test_makeblastdb_single PASSED [ 14%]
tests/test_anib.py::test_parse_legacy_blastdir PASSED [ 15%]
tests/test_anib.py::test_parse_blastdir PASSED [ 15%]
tests/test_anib.py::test_parse_blasttab PASSED [ 16%]
tests/test_anib.py::test_parse_legacy_blasttab PASSED [ 17%]
tests/test_aniblastall.py::test_get_version_nonetype PASSED [ 18%]
tests/test_aniblastall.py::test_get_version_missing_exe PASSED [ 19%]
tests/test_aniblastall.py::test_get_version_not_executable PASSED [ 20%]
tests/test_aniblastall.py::test_get_version_no_version PASSED [ 20%]
tests/test_aniblastall.py::test_get_version_os_incompatible PASSED [ 21%]
tests/test_anim.py::test_get_version_nonetype PASSED [ 22%]
tests/test_anim.py::test_get_version_no_exe PASSED [ 23%]
tests/test_anim.py::test_get_version_exe_not_executable PASSED [ 24%]
tests/test_anim.py::test_get_version_exe_no_version PASSED [ 25%]
tests/test_anim.py::test_get_version_nucmer_3 PASSED [ 25%]
tests/test_anim.py::test_get_version_nucmer_4 PASSED [ 26%]
tests/test_anim.py::test_deltadir_parsing PASSED [ 27%]
tests/test_anim.py::test_deltafile_parsing PASSED [ 28%]
tests/test_anim.py::test_maxmatch_single PASSED [ 29%]
tests/test_anim.py::test_mummer_multiple PASSED [ 30%]
tests/test_anim.py::test_mummer_single PASSED [ 30%]
tests/test_anim.py::test_mummer_job_generation PASSED [ 31%]
tests/test_anim.py::test_genome_sorting SKIPPED (This test currently fails because we no longer sort genomes. This is because we now run ...) [ 32%]
tests/test_cli_parsing.py::test_createdb PASSED [ 33%]
tests/test_cli_parsing.py::test_download_single_genome PASSED [ 34%]
tests/test_concordance.py::test_anim_concordance PASSED [ 35%]
tests/test_concordance.py::test_anib_concordance PASSED [ 35%]
tests/test_concordance.py::test_aniblastall_concordance PASSED [ 36%]
tests/test_concordance.py::test_tetra_concordance PASSED [ 37%]
tests/test_dependencies.py::test_import_biopython PASSED [ 38%]
tests/test_dependencies.py::test_import_matplotlib PASSED [ 39%]
tests/test_dependencies.py::test_import_numpy PASSED [ 40%]
tests/test_dependencies.py::test_import_pandas PASSED [ 40%]
tests/test_dependencies.py::test_import_scipy PASSED [ 41%]
tests/test_dependencies.py::test_blastn_available PASSED [ 42%]
tests/test_dependencies.py::test_run_blastall XPASS (Optional third-party executable (blastall)) [ 43%]
tests/test_dependencies.py::test_run_nucmer PASSED [ 44%]
tests/test_fastani.py::test_get_version_nonetype PASSED [ 45%]
tests/test_fastani.py::test_get_version_no_exe PASSED [ 45%]
tests/test_fastani.py::test_get_version_exe_not_executable PASSED [ 46%]
tests/test_fastani.py::test_get_version_exe_no_version PASSED [ 47%]
tests/test_fastani.py::test_fastanifile_parsing PASSED [ 48%]
tests/test_fastani.py::test_fastani_single PASSED [ 49%]
tests/test_fastani.py::test_fastani_multiple PASSED [ 50%]
tests/test_fastani.py::test_fastani_job_generation PASSED [ 50%]
tests/test_graphics.py::test_png_mpl PASSED [ 51%]
tests/test_graphics.py::test_svg_mpl PASSED [ 52%]
tests/test_graphics.py::test_pdf_mpl PASSED [ 53%]
tests/test_graphics.py::test_png_seaborn PASSED [ 54%]
tests/test_graphics.py::test_svg_seaborn PASSED [ 55%]
tests/test_graphics.py::test_pdf_seaborn PASSED [ 55%]
tests/test_jobs.py::test_create_job PASSED [ 56%]
tests/test_jobs.py::test_create_job_with_command PASSED [ 57%]
tests/test_jobs.py::test_add_dependency PASSED [ 58%]
tests/test_jobs.py::test_remove_dependency PASSED [ 59%]
tests/test_jobs.py::test_create_jobgroup PASSED [ 60%]
tests/test_jobs.py::test_1d_jobgroup PASSED [ 60%]
tests/test_jobs.py::test_2d_jobgroup PASSED [ 61%]
tests/test_jobs.py::test_add_group_dependency PASSED [ 62%]
tests/test_jobs.py::test_remove_group_dependency PASSED [ 63%]
tests/test_legacy_scripts.py::test_legacy_anim_sns PASSED [ 64%]
tests/test_legacy_scripts.py::test_legacy_anim_mpl PASSED [ 65%]
tests/test_legacy_scripts.py::test_legacy_anib_sns PASSED [ 65%]
tests/test_legacy_scripts.py::test_legacy_anib_mpl PASSED [ 66%]
tests/test_legacy_scripts.py::test_legacy_tetra_sns PASSED [ 67%]
tests/test_legacy_scripts.py::test_legacy_tetra_mpl PASSED [ 68%]
tests/test_legacy_scripts.py::test_legacy_genome_downloads PASSED [ 69%]
tests/test_multiprocessing.py::test_multiprocessing_run PASSED [ 70%]
tests/test_multiprocessing.py::test_cmdsets PASSED [ 70%]
tests/test_multiprocessing.py::test_dependency_graph_run PASSED [ 71%]
tests/test_multiprocessing.py::test_failed_comparison PASSED [ 72%]
tests/test_parsing.py::test_anim_delta PASSED [ 73%]
tests/test_subcmd_01_download.py::test_create_hash PASSED [ 74%]
tests/test_subcmd_01_download.py::test_download_dry_run PASSED [ 75%]
tests/test_subcmd_01_download.py::test_download_c_blochmannia PASSED [ 75%]
tests/test_subcmd_01_download.py::test_download_kraken PASSED [ 76%]
tests/test_subcmd_02_index.py::TestIndexSubcommand::test_index PASSED [ 77%]
tests/test_subcmd_02_index.py::TestIndexSubcommand::test_missing_index PASSED [ 78%]
tests/test_subcmd_03_createdb.py::TestCreatedbSubcommand::test_create_newdb PASSED [ 79%]
tests/test_subcmd_03_createdb.py::TestCreatedbSubcommand::test_create_newdb_fail PASSED [ 80%]
tests/test_subcmd_03_createdb.py::TestCreatedbSubcommand::test_create_newdb_force PASSED [ 80%]
tests/test_subcmd_04_anim.py::TestANImSubcommand::test_anim PASSED [ 81%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_genomes PASSED [ 82%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_genomes_runs PASSED [ 83%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_matrices PASSED [ 84%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_no_matrix_labels PASSED [ 85%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_results PASSED [ 85%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_runs PASSED [ 86%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_runs_genomes PASSED [ 87%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_jpg PASSED [ 88%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_pdf PASSED [ 89%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_png PASSED [ 90%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_svg PASSED [ 90%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_jpg PASSED [ 91%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_pdf PASSED [ 92%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_png PASSED [ 93%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_svg PASSED [ 94%]
tests/test_subcmd_07_classify.py::TestClassifySubcommand::test_classify_defaults PASSED [ 95%]
tests/test_subcmd_07_classify.py::TestClassifySubcommand::test_classify_no_thresh PASSED [ 95%]
tests/test_subcmd_08_anib.py::TestANIbsubcommand::test_anib XFAIL (ANIb is not currently fully implemented) [ 96%]
tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani PASSED [ 97%]
tests/test_tetra.py::test_tetraclean PASSED [ 98%]
tests/test_tetra.py::test_zscore PASSED [ 99%]
tests/test_tetra.py::test_correlations PASSED [100%]
==================================================================== warnings summary ====================================================================
tests/test_legacy_scripts.py::test_legacy_anim_sns
/opt/anaconda3/envs/issue_421/lib/python3.8/site-packages/seaborn/matrix.py:715: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
self._figure = plt.figure(figsize=figsize)
tests/test_subcmd_01_download.py::test_download_dry_run
/opt/anaconda3/envs/issue_421/lib/python3.8/site-packages/Bio/Entrez/Parser.py:1075: UserWarning: Failed to save uilist.dtd at /Users/angelikakiepas/.config/biopython/Bio/Entrez/DTDs/uilist.dtd
warnings.warn(f"Failed to save {filename} at {path}")
tests/test_subcmd_01_download.py::test_download_dry_run
/opt/anaconda3/envs/issue_421/lib/python3.8/site-packages/Bio/Entrez/Parser.py:1075: UserWarning: Failed to save esummary_assembly.dtd at /Users/angelikakiepas/.config/biopython/Bio/Entrez/DTDs/esummary_assembly.dtd
warnings.warn(f"Failed to save {filename} at {path}")
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================== 117 passed, 1 skipped, 1 xfailed, 1 xpassed, 3 warnings in 406.99s (0:06:46) ==================================== I also run the new code on examples available here. Matrices are no longer symmetrical, two comparisons are executed (reverse and forward), more sensible numbers are reported (e.g., genome coverage is no longer greater than 1), and no errors are raised. However, no graphical output is generated. Interestingly, no issue is raised either. I was looking into this this morning, and I don't know where to go with this... @widdowquinn, do you think something broke after changing |
This has been fixed at today's meeting by @widdowquinn. It turned out that the plotting function was indexing args.method incorrectly (fixed with the 2cede3a commit). |
Fixed with 1f10a6c. |
@all-contributors please add @kiepczi for code, design, test |
I've put up a pull request to add @kiepczi! 🎉 |
Summary:
There are two big issues that need to be prioritised to improve pyANI comparisions. These are:
Description: Alignment coverage >1.0:
Fixing issue #340 first. Let's consider running pyANI aniM on two genomes and finding that there were 2 alignments: one alignment runs from position 1 to 37636, and the other alignment runs from 17626 to 39176 (in the query). The total length of the query genome is 39594, so we cannot expect the number of aligned bases to be more than this. However, currently pyANI reports that the total number of aligned bases in the query is 59187, which is calculated as follows: (37636-1+1)+(39176-17626+1) = 59187. This discrepancy arises because pyANI does not currently take into consideration that certain regions might overlap and are unfortunately counted twice. In this case, region 17626 to 37636.
Progress:
I attempted to correct for this by using IntervalTree Python module. The code can be found here, and in a short summary it works in the following steps:
delta
file as an input to retrieve alignment information, which is then saved to a dictionary. The dictionary then has the following format:merge_overlaps(strict=False)
function is then called to merge overlaps.I have checked the comprehensiveness of this script by running it on 20 test sets. I also compared the values to those provided by
dnadiff
, specifically theAlignedBases
value. As a quick reminder,dnadiff
can compare genomes, returning a value for aligned bases (AlignedBases
), which, as suggested in previous discussions, can be interpreted as the genome coverage. I have attached the results of this analysis (seecomparisions.csv) In the table, each row represents a single pairwise comparison, with the following columns:
dnadiff_REF
anddnadiff_QRY
:AlignedBases
value returned bydnadiff
for both the reference and query.pyANI_AniM_REF
andpyANI_AniM_QRY
: current values returned by pyANI v0.3.IT_aniM_filter_REF
andIT_aniM_filter_QRY
: values returned by using the approach with IntervalTree and pyANIdelta.filter
files as inputIT_dnadiff_mdelta_REF
andIT_dnadiff_mdelta_QRY
:IntervalTree
approach withdnadiff mdelta
files as inputIT_dnadiff_1delta_REF
andIT_dnadiff_1delta_QRY
:IntervalTree
approach withdnadiff 1delta
files as inputREF_len
andQRY_len
: total length of the reference and queryThere are few interesting observations in this analysis:
AlignedBases
value provided bydnadiff
and our values inIT_dnadiff_mdelta
columns do not match as we would expect. And we would expect them to match, becauseAlignedBases
is calculated based on many-to-many alignments. I investigated this discepcany, and reproduced this issue on much smaller test set (test_1_incongruencies.zip).
Reproducible steps:
AlignedBases
for the attached genomes with dnadiff using:dnadiff mdelta
file as the input.Investigation
Let's look at the
mdelta
file to see what is happening, or what value we would expect:The delta file reveals three alignments. One alignment spans from 4263072 to 4263462, the second from 4258377 to 4258701, and the third from 4263071 to 4263319. From this we can see that there is one region that overlaps, and the aligment looks like this:
As expected, after the
IntervalTree
merges the overlaps, it returns the following regions:This gives a count of aligned bases of 717, calculated as (4258701 - 4258377 + 1) = 325 and (4263462 - 4263071 + 1) = 392, giving us the sum of 717. From this we can conclude that the unagliged regions are: 1-4258376, 4258702-4263070, and 4263463-7787608.
However, the dnadiff AlignedBases value is 716.
To investigate the difference, @widdowquinn looked into the dnadiff source code and how it calculates the
AlignedBases
value. The process involves opening coordinates files containing information about aligned regions, incrementing the AlignedBases value by the length of each aligned sequence (considering all indels and no overlaps), and then adjusting for gaps by subtracting the length of unaligned regions given inrdiff
(column 4) file from the AlignedBases value.Our suspicion led us to examine the rdiff file:
This shows there is a gap discrepancy, where one gap is longer by 1 bp. For example,
dnadiff
considers the gap to run from 4258702 to 4263071, whereas our code considers the gap to run from 4258702 to 4263070. We know that therdiff
file is generated by theshow-diff
program (source code available here). Unfortunately, I have no experience inC++
, so understanding how exactly the gaps are identified is beyond my capability ;). @widdowquinn and I had a conversation about this last week, and I understand thatshow-diff
uses graphs to identify unaligned regions, and this is where the differences might arise.Next steps:
The text was updated successfully, but these errors were encountered: