Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 421: Fix alignment coverage >1.0 and aniM symmetrical behaviour #425

Merged
merged 42 commits into from
Apr 8, 2024
Merged
Show file tree
Hide file tree
Changes from 38 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
cbe5b6c
update issue 340
widdowquinn Dec 20, 2023
99f84bb
update Markdown summary of issue_340
widdowquinn Dec 20, 2023
949c315
add explainer for delta-filter
widdowquinn Dec 20, 2023
675204a
Progress on issue 340
Jan 13, 2024
039d235
add notebook example of how indels/overlaps affect MUMmer AlignedBases
widdowquinn Jan 15, 2024
d14af71
add documentation
Jan 27, 2024
3913798
add scripts&pytest for AlignedBases calculations
Jan 27, 2024
3679f10
incongruency investigation
Feb 5, 2024
ff7a3d6
check nucmer symmetry
Feb 8, 2024
2cef284
check nucmer symmetry
Feb 8, 2024
801f872
add draft scipt for A:B and B:A comparisions
kiepczi Feb 19, 2024
2e07857
modify anim.py to get 2 comparisions
kiepczi Feb 19, 2024
3554087
add reverse comparison to nucmer in pyani 0.3+
widdowquinn Feb 19, 2024
ed3093c
modify anim calculations to exclude overlaps using IntervalTree
kiepczi Feb 22, 2024
941340d
move intervaltree from pip to conda requirements
widdowquinn Feb 23, 2024
56f0438
update types in parse_delta()
widdowquinn Feb 23, 2024
693d0c2
Update matrices and run tables for forward and reverse comparisons
kiepczi Feb 29, 2024
4310ddd
update parse_delta function
kiepczi Mar 1, 2024
23ccff3
Update matrices to reflect forward and revese comparisions
kiepczi Mar 4, 2024
6496a11
update log formatting
widdowquinn Mar 5, 2024
6761ddd
add additional test commands
widdowquinn Mar 5, 2024
c976a8a
fix %ID calculations to weighted %ID
kiepczi Mar 8, 2024
491135d
Merge branch 'issue_421' of github.com:widdowquinn/pyani into issue_421
widdowquinn Mar 11, 2024
dda0741
remove sym keyword from results.add* calls in anib.py
widdowquinn Mar 11, 2024
3dddcc1
modify expected values in test_anim_delta()
widdowquinn Mar 11, 2024
af718bd
Correct anim calculation
kiepczi Mar 11, 2024
fc53772
remove sym keyword as reverse and forward comparsions for MUMmer are …
kiepczi Mar 11, 2024
ccc1175
Merge branch 'issue_421' of https://github.com/widdowquinn/pyani into…
kiepczi Mar 11, 2024
8412d94
add temporary debugging print statments/output files
widdowquinn Mar 11, 2024
97e5570
update test_anim.py reference output for new ANIm calculation
widdowquinn Mar 12, 2024
da8852c
Update test_anim.py tests to refrect recent changes in codebase
kiepczi Mar 12, 2024
7b1827e
Rounding Error Investigation
kiepczi Mar 15, 2024
28c99ba
Update --run_results columns
kiepczi Mar 21, 2024
a818323
add data with no graphical output
kiepczi Mar 21, 2024
2cede3a
fix two blocking errors for plotting
widdowquinn Mar 25, 2024
b6dfa77
Update logger to reflect correct number of jobs
kiepczi Mar 25, 2024
724510b
removed issue_340 and issue_421 folders from the PR
widdowquinn Apr 8, 2024
31f40c5
add all folders beginning issue_* to .gitignore
widdowquinn Apr 8, 2024
eec3329
remove unnecessary logger argument
widdowquinn Apr 8, 2024
01e432d
remove "query" from alignment length header
widdowquinn Apr 8, 2024
ca3fff5
add custom BANDIT config file
widdowquinn Apr 8, 2024
492fe59
rename bandit config file for codacy
widdowquinn Apr 8, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,9 @@ C_blochmannia*/
classes_pyani.pdf
packages_pyani.pdf

# Issue development output
issue_*/

# SGE output
stderr/
stdout/
Expand Down
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -20,4 +20,4 @@ repos:
- repo: https://github.com/pycqa/flake8
rev: 3.9.2
hooks:
- id: flake8
- id: flake8
6 changes: 3 additions & 3 deletions pyani/anib.py
Original file line number Diff line number Diff line change
Expand Up @@ -558,9 +558,9 @@ def process_blast(
# Populate dataframes: when assigning data, we need to note that
# we have asymmetrical data from BLAST output, so only the
# upper triangle is populated
results.add_tot_length(qname, sname, resultvals[0], sym=False)
results.add_sim_errors(qname, sname, resultvals[1], sym=False)
results.add_pid(qname, sname, 0.01 * resultvals[2], sym=False)
results.add_tot_length(qname, sname, resultvals[0])
results.add_sim_errors(qname, sname, resultvals[1])
results.add_pid(qname, sname, 0.01 * resultvals[2])
results.add_coverage(qname, sname, query_cover)
return results

Expand Down
191 changes: 127 additions & 64 deletions pyani/anim.py
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why l.162 was reintroduced (logger argument) - have removed this.

Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# (c) The James Hutton Institute 2016-2019
# (c) University of Strathclyde 2019-2021
# (c) University of Strathclyde 2019-2024
# Author: Leighton Pritchard
#
# Contact:
Expand All @@ -17,7 +17,7 @@
# The MIT License
#
# Copyright (c) 2016-2019 The James Hutton Institute
# Copyright (c) 2019-2021 University of Strathclyde
# Copyright (c) 2019-2024 University of Strathclyde
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -59,6 +59,8 @@
import shutil
import subprocess
import sys
import intervaltree
from collections import defaultdict

from logging import Logger
from pathlib import Path
Expand Down Expand Up @@ -157,6 +159,7 @@ def generate_nucmer_jobs(
filter_exe: Path = pyani_config.FILTER_DEFAULT,
maxmatch: bool = False,
jobprefix: str = "ANINUCmer",
logger: Optional[Logger] = None,
):
"""Return list of Jobs describing NUCmer command-lines for ANIm.

Expand All @@ -170,11 +173,14 @@ def generate_nucmer_jobs(
Loop over all FASTA files, generating Jobs describing NUCmer command lines
for each pairwise comparison.
"""
logger = logging.getLogger(__name__)

ncmds, fcmds = generate_nucmer_commands(
filenames, outdir, nucmer_exe, filter_exe, maxmatch
)
joblist = []
for idx, ncmd in enumerate(ncmds):
logger.info("Working with nucmer command: %s" % ncmd)
njob = pyani_jobs.Job(f"{jobprefix}_{idx:06d}-n", ncmd)
fjob = pyani_jobs.Job(f"{jobprefix}_{idx:06d}-f", fcmds[idx])
fjob.add_dependency(njob)
Expand Down Expand Up @@ -212,11 +218,21 @@ def generate_nucmer_commands(
filenames = sorted(filenames) # enforce ordering of filenames
for idx, fname1 in enumerate(filenames[:-1]):
for fname2 in filenames[idx + 1 :]:

# Comparisions A_vs_B
ncmd, dcmd = construct_nucmer_cmdline(
fname1, fname2, outdir, nucmer_exe, filter_exe, maxmatch
)
nucmer_cmdlines.append(ncmd)
delta_filter_cmdlines.append(dcmd)

# Comparions B_vs_A
ncmd_rvs, dcmd_rvs = construct_nucmer_cmdline(
fname2, fname1, outdir, nucmer_exe, filter_exe, maxmatch
)
nucmer_cmdlines.append(ncmd_rvs)
delta_filter_cmdlines.append(dcmd_rvs)

return (nucmer_cmdlines, delta_filter_cmdlines)


Expand Down Expand Up @@ -248,7 +264,7 @@ def construct_nucmer_cmdline(
outdir, called "nucmer_output".
"""
# Cast path strings to pathlib.Path for safety
fname1, fname2 = sorted([Path(fname1), Path(fname2)])
fname1, fname2 = [Path(fname1), Path(fname2)]

# Compile commands
# Nested output folders to avoid N^2 scaling in files-per-folder
Expand All @@ -274,21 +290,16 @@ def construct_nucmer_cmdline(
return (nucmercmd, filtercmd)


# Parse NUCmer delta file to get total alignment length and total sim_errors
def parse_delta(filename: Path) -> Tuple[int, int]:
"""Return (alignment length, similarity errors) tuple from passed .delta.
def parse_delta(filename: Path) -> Tuple[int, int, float, int]:

:param filename: Path, path to the input .delta file
"""Return (reference alignment length, query alignment length, average identity, similarity erors)

Extracts the aligned length and number of similarity errors for each
aligned uniquely-matched region, and returns the cumulative total for
each as a tuple.
:param filename: Path to the input .delta file

Similarity errors are defined in the .delta file spec (see below) as
non-positive match scores. For NUCmer output, this is identical to the
number of errors (non-identities and indels).
Calculates the aligned lengths for reference and query and average nucleotide
identity, and returns the cumulative total for each as a tuple.

Delta file format has seven numbers in the lines of interest:
The delta file format contains seven numbers in the lines of interest:
see http://mummer.sourceforge.net/manual/ for specification

- start on query
Expand All @@ -300,49 +311,106 @@ def parse_delta(filename: Path) -> Tuple[int, int]:
[NOTE: with PROmer this is equal to error count]
- stop codons (always zero for nucmer)

To calculate alignment length, we take the length of the aligned region of
the reference (no gaps), and process the delta information. This takes the
form of one value per line, following the header sequence. Positive values
indicate an insertion in the reference; negative values a deletion in the
reference (i.e. an insertion in the query). The total length of the alignment
is then:
We report ANIm identity by finding an average across all alignments using
the following formula:

reference_length + insertions - deletions
sum of weighted identical bases / sum of aligned bases from each fragment

For example:

A = ABCDACBDCAC$
B = BCCDACDCAC$
Delta = (1, -3, 4, 0)
A = ABC.DACBDCAC$
B = .BCCDAC.DCAC$

A is the reference and has length 11. There are two insertions (positive delta),
and one deletion (negative delta). Alignment length is then 11 + 1 = 12.
reference.fasta query.fasta
NUCMER
>ref_seq_A ref_seq_B 40 40
1 10 1 11 5 5 0
-1
0
15 20 25 30 0 0 0

The delta file tells us there are two alignments. The first alignment runs from base 1
to base 10 in the reference sequence, and from base 1 to 11 in the query sequence
with a similarity error of 5. The second alignment runs from base 15 to 20 in
the reference, and base 25 to 30 in the query with 0 similarity errors. To calculate
the %ID, we can:

- Find the number of all aligned bases from each sequence:
aligned reference bases region 1 = 10 - 1 + 1 = 10
aligned query bases region 1 = 11 - 1 + 1 = 11
aligned reference bases region 2 = 20 - 15 + 1 = 6
aligned query bases region 2 = 30 - 25 + 1 = 6

- Find weighted identical bases
alignment 1 identity weighted = (10 + 11) - (2 * 5) = 11
alignment 2 identity weighted = (6 + 6) - (2 * 0) = 12

- Calculate %ID
(11 + 12) / (10 + 11 + 6 + 6) = 0.696969696969697

To calculate alignment lengths, we extract the regions of each alignment
(either for query or reference) provided in the .delta file and merge the overlapping
regions with IntervalTree. Then, we calculate the total sum of all aligned regions.
"""
in_aln, aln_length, sim_errors = False, 0, 0

current_ref, current_qry, raln_length, qaln_length, sim_error, avrg_ID = (
None,
None,
0,
0,
0,
0.0,
)

regions_ref = defaultdict(list) # Hold a dictionary for query regions
regions_qry = defaultdict(list) # Hold a dictionary for query regions

aligned_bases = [] # Hold a list for aligned bases for each sequence
weighted_identical_bases = [] # Hold a list for weighted identical bases

for line in [_.strip().split() for _ in filename.open("r").readlines()]:
if line[0] == "NUCMER" or line[0].startswith(">"): # Skip headers
if line[0] == "NUCMER": # Skip headers
continue
# Lines starting with ">" indicate which sequences are aligned
if line[0].startswith(">"):
current_ref = line[0].strip(">")
current_qry = line[1]
# Lines with seven columns are alignment region headers:
if len(line) == 7:
aln_length += abs(int(line[1]) - int(line[0])) + 1 # reference length
sim_errors += int(line[4]) # count of non-identities and indels
in_aln = True
# Lines with a single column (following a header) report numbers of symbols
# until next insertion (+ve) or deletion (-ve) in the reference; one line per
# insertion/deletion; the alignment always ends with 0
if in_aln and line[0].startswith("0"):
in_aln = False
elif in_aln:
# Add one to the alignment length for each reference insertion; subtract
# one for each deletion
val = int(line[0])
if val < 1: # deletion in reference
aln_length += 1
elif val == 0: # ends the alignment entry
in_aln = False
return aln_length, sim_errors
# Obtaining aligned regions needed to check for overlaps
regions_ref[current_ref].append(
tuple(sorted(list([int(line[0]), int(line[1])])))
) # aligned regions reference
regions_qry[current_qry].append(
tuple(sorted(list([int(line[2]), int(line[3])])))
) # aligned regions qry

# Calculate aligned bases for each sequence
ref_aln_lengths = abs(int(line[1]) - int(line[0])) + 1
qry_aln_lengths = abs(int(line[3]) - int(line[2])) + 1
aligned_bases.append(ref_aln_lengths)
aligned_bases.append(qry_aln_lengths)

# Calculate weighted identical bases
sim_error += int(line[4])
weighted_identical_bases.append(
(ref_aln_lengths + qry_aln_lengths) - (2 * int(line[4]))
)

# Calculate average %ID
avrg_ID = sum(weighted_identical_bases) / sum(aligned_bases)

# Calculate total aligned bases (no overlaps)
for seq_id in regions_qry:
qry_tree = intervaltree.IntervalTree.from_tuples(regions_qry[seq_id])
qry_tree.merge_overlaps(strict=False)
for interval in qry_tree:
qaln_length += interval.end - interval.begin + 1

for seq_id in regions_ref:
ref_tree = intervaltree.IntervalTree.from_tuples(regions_ref[seq_id])
ref_tree.merge_overlaps(strict=False)
for interval in ref_tree:
raln_length += interval.end - interval.begin + 1

return (raln_length, qaln_length, avrg_ID, sim_error)


# Parse all the .delta files in the passed directory
Expand Down Expand Up @@ -406,30 +474,25 @@ def process_deltadir(
deltafile,
)
continue
tot_length, tot_sim_error = parse_delta(deltafile)
if tot_length == 0 and logger is not None:
(
query_tot_length,
subject_tot_length,
weighted_identity,
tot_sim_error,
) = parse_delta(deltafile)
if subject_tot_length == 0 and logger is not None:
if logger:
logger.warning(
"Total alignment length reported in %s is zero!", deltafile
)
sys.exit("Zero length alignment!")
query_cover = float(tot_length) / org_lengths[qname]
sbjct_cover = float(tot_length) / org_lengths[sname]

# Calculate percentage ID of aligned length. This may fail if
# total length is zero.
# The ZeroDivisionError that would arise should be handled
# Common causes are that a NUCmer run failed, or that a very
# distant sequence was included in the analysis.
try:
perc_id = 1 - float(tot_sim_error) / tot_length
except ZeroDivisionError:
perc_id = 0 # set arbitrary value of zero identity
results.zero_error = True
query_cover = float(query_tot_length) / org_lengths[qname]
sbjct_cover = float(subject_tot_length) / org_lengths[sname]
perc_id = weighted_identity

# Populate dataframes: when assigning data from symmetrical MUMmer
# output, both upper and lower triangles will be populated
results.add_tot_length(qname, sname, tot_length)
results.add_tot_length(qname, sname, query_tot_length, subject_tot_length)
results.add_sim_errors(qname, sname, tot_sim_error)
results.add_pid(qname, sname, perc_id)
results.add_coverage(qname, sname, query_cover, sbjct_cover)
Expand Down
8 changes: 4 additions & 4 deletions pyani/nucmer.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# (c) The James Hutton Institute 2019
# (c) University of Strathclyde 2019
# (c) The James Hutton Institute 2016-2019
# (c) University of Strathclyde 2019-2024
# Author: Leighton Pritchard
#
# Contact:
Expand All @@ -17,7 +17,7 @@
# The MIT License
#
# Copyright (c) 2016-2019 The James Hutton Institute
# Copyright (c) 2019 University of Strathclyde
# Copyright (c) 2019-2024 University of Strathclyde
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -237,7 +237,7 @@ class DeltaMetadata:

def __init__(self) -> None:
"""Initialise DeltaMetadata object."""
self.reference = None #  type: Optional[Path]
self.reference = None # type: Optional[Path]
self.query = None # type: Optional[Path]
self.program = None # type: Optional[str]

Expand Down
Loading