Updates to FASTQ file validation (#57)

* Updated FASTQ validation function to fix issue with some PacBio file headers * Updates to changelog and Readme * updated changelog
USDA-ARS-GBRU · Nov 14, 2024 · e6bf294 · e6bf294
1 parent 2dcdaab
commit e6bf294
Show file tree

Hide file tree

Showing 4 changed files with 113 additions and 93 deletions.
diff --git a/README.rst b/README.rst
@@ -28,12 +28,15 @@ Author
 Citations
 ---------
 
-Einarsson, SV and Rivers, AR. ITSxpress Version 2: Software to rapidly trim internal 
-transcribed spacer sequences with quality scores for amplicon sequencing. Microbiology Spectrum. In press, 2024.
+Einarsson SV, Rivers AR. 2024. ITSxpress version 2: software to rapidly trim internal transcribed 
+spacer sequences with quality scores for amplicon sequencing. Microbiol Spectr 0:e00601-24.
+(doi: `10.1128/spectrum.00601-24`_)
+
+.. _`10.1128/spectrum.00601-24`: https://doi.org/10.1128/spectrum.00601-24
 
 Rivers AR, Weber KC, Gardner TG, Liu, S, Armstrong, SD. ITSxpress: Software to rapidly trim
 internally transcribed spacer sequences with quality scores for marker gene
-analysis [version 1; referees: awaiting peer review]. F1000Research 2018, 7:1418
+analysis [version 1]. F1000Research 2018, 7:1418
 (doi: `10.12688/f1000research.15704.1`_)
 
 .. _`10.12688/f1000research.15704.1`: https://doi.org/10.12688/f1000research.15704.1

diff --git a/changelog.md b/changelog.md
@@ -1,6 +1,12 @@
+# 2.1.3 (2024-11-14)
+
+- fixed bug in [Issue 56](https://github.com/USDA-ARS-GBRU/itsxpress/issues/56) that caused some FASTQ files generated by PacBio to raise an error during validation.
+  these files now pass validation and the interleave check has been modified from raise a warning not an error. 
+- Updated the citations.bib file to include the new Microbiology Spectrum article for ITSxpress
+
 # 2.1.2 (2024-9-23)
 
-- Fixed bug [Issue 54](https://github.com/USDA-ARS-GBRU/itsxpress/issues/54) that casued `--tempdir` input to be ignored in the ITSxpress CLI and updated q2_itsxpress.py and test code for this change.
+- Fixed bug [Issue 54](https://github.com/USDA-ARS-GBRU/itsxpress/issues/54) that caused `--tempdir` input to be ignored in the ITSxpress CLI and updated q2_itsxpress.py and test code for this change.
 - updates to Dockerfile
 - updates to changelog formatting and readme
 
@@ -22,15 +28,15 @@
     -Version 2 (see https://github.com/USDA-ARS-GBRU/ITS_HMMs)
     -5 April 2024
 
-    New Updates to models A, D, E, F, O, P, R to impove support for oomycetes and fungal taxa with deviant ribosomal genes, notably Wickerhamiella, Calocera, and Mucoromycotina fine root endophytes
+    New Updates to models A, D, E, F, O, P, R to improve support for oomycetes and fungal taxa with deviant ribosomal genes, notably Wickerhamiella, Calocera, and Mucoromycotina fine root endophytes
 
     Addition of Y.hmm to support Parabasalia
 - Added option of Y.hmm to ITSxpress standalone and Qiime2 plugin
 - Added documentation for Apple Silicon chip support.
 
 # 2.0.2 (2024-3-20)
 
-- Fixed a bug where the 3' end of the ITS region was not being trimmed from both forward and reverse reads if the read extended past the ITS region. This was due to the trimming being done at the start of both forward and reverse reads and not the end of each read. Thus if the read overlaped the opposite end of the ITS read, part of the conserved region would still be found on the ends of the forward and reverse read. This was fixed by trimming to just the ITS region for both forward and reverse reads. This bug did not affect the results of ASV calling with Dada2 becasue Dada2 ignored sequecne beyond the ITS region.  This fix will make the output more consistent with expectation.
+- Fixed a bug where the 3' end of the ITS region was not being trimmed from both forward and reverse reads if the read extended past the ITS region. This was due to the trimming being done at the start of both forward and reverse reads and not the end of each read. Thus if the read overlapped the opposite end of the ITS read, part of the conserved region would still be found on the ends of the forward and reverse read. This was fixed by trimming to just the ITS region for both forward and reverse reads. This bug did not affect the results of ASV calling with Dada2 because Dada2 ignored sequences beyond the ITS region.  This fix will make the output more consistent with expectation.
 
 - Fixed a bug for submodule logging, where submodules were not logging to the main log file. This was fixed by passing the log file to the submodules and having them write to the same log file. This issue was introduced in version 2.0.0.
 
@@ -50,7 +56,7 @@ Release Highlights
         - reformat.sh (interleaved files no longer supported)
         - bbmerge.sh (merging of paired-end reads now done with Vsearch --fastq_mergepairs)
              - merging of paired-end reads is different between Vsearch and BBmerge, so results may differ
-- Updated dereplication step for newer versions of Vsearch
+- Updated de-replication step for newer versions of Vsearch
     - Dereplication step now done using Vsearch --fastx_uniques (derep_fulllength command no longer supports fastq files)
 - Qiime2 plugin version of ITSxpress is now part of the standalone package of ITSxpress
     - No longer need to install q2-itsxpress separately and will be installed if Qiime2 is already installed, otherwise only the standalone version will be installed
@@ -105,7 +111,7 @@ New Features:
 # 1.6.1 (2018-7-19)
 
 - Changed the default clustering identity to 99.5%.
-- Experiments with fungal soil samples showed that ITSxpress and ITSx trimmed 99.822% of reads in the ITS1 region within 2 bases of each other and 99.099% of reads in the ITS2 region within 2 bases of each other at 99.5% identity. For higher accuracy, dereplication can be run at at 100% identity.
+- Experiments with fungal soil samples showed that ITSxpress and ITSx trimmed 99.822% of reads in the ITS1 region within 2 bases of each other and 99.099% of reads in the ITS2 region within 2 bases of each other at 99.5% identity. For higher accuracy, de-replication can be run at at 100% identity.
 
 
 # 1.6.0 (2018-7-13)

diff --git a/itsxpress/citations.bib b/itsxpress/citations.bib
@@ -1,10 +1,18 @@
-@article{ Einarsson2024,
-    AUTHOR  = {Einarsson, SV and Rivers, AR},
-    TITLE = {ITSxpress Version 2: Software to rapidly trim internal transcribed spacer sequences with quality scores for amplicon sequencing},
-    YEAR    = "in press 2024",
-    JOURNAL = "Microbiology Spectrum",
-    publisher={American Society for Microbiology 1752 N St., NW, Washington, DC}
-    }
+
+@article{Einarsson2024,
+	abstract = {The nuclear ribosomal RNA (rRNA) internal transcribed spacer (ITS) regions are commonly used to identify fungi and other eukaryotic taxa in amplicon sequencing. The highly conserved rRNA regions flanking the ITS are often trimmed before being used for taxonomic assignment. The Python software package ITSxpress rapidly trims single-end or paired-end sequences in FASTQ format for use in amplicon sequence variant clustering methods like DADA2. This new major release of ITSxpress improves the paired-end merging method, simplifies installation of the QIIME 2 ITSxpress plugin, removes major dependencies, adds use cases, and is compatible with newer compression formats. This article discusses the modifications to ITSxpress that improve the output and user experience, leading to a major version increase.IMPORTANCE ITSxpress is a sequence trimming method applied to internal transcribed spacer (ITS) amplicon sequences before calling amplicon sequence variants (ASVs). The ITS region is used to understand the composition of eukaryotic microbial communities. ITS sequences provide good taxonomic resolution due to their hypervariability, but are flanked by conserved regions that allow their primers to be more universal. Amplicons generated with such primers contain regions with different evolutionary rates, and trimming these conserved regions results in better taxonomic classification and a more valid set of ASVs. This package can be used for most amplicon sequencing methods including for newer long-read sequencing formats, such as PacBio. ITSxpress can be installed from Bioconda, used as a Docker image, or installed from source code. The package works well with high-performance computing clusters or laptops due to its low-resource requirements. },
+	author = {Sveinn V. Einarsson and Adam R. Rivers},
+	doi = {10.1128/spectrum.00601-24},
+	eprint = {https://journals.asm.org/doi/pdf/10.1128/spectrum.00601-24},
+	journal = {Microbiology Spectrum},
+	number = {0},
+	pages = {e00601-24},
+	title = {ITSxpress version 2: software to rapidly trim internal transcribed spacer sequences with quality scores for amplicon sequencing},
+	url = {https://journals.asm.org/doi/abs/10.1128/spectrum.00601-24},
+	volume = {0},
+	bdsk-url-1 = {https://journals.asm.org/doi/abs/10.1128/spectrum.00601-24},
+	bdsk-url-2 = {https://doi.org/10.1128/spectrum.00601-24}}
+
 @Article{ Rivers2018,
     AUTHOR = { Rivers, AR and Weber, KC and Gardner, TG and Liu, S and Armstrong, SD},
     TITLE = {ITSxpress: Software to rapidly trim internally transcribed spacer sequences with quality scores for marker gene analysis},

diff --git a/itsxpress/main.py b/itsxpress/main.py
@@ -36,7 +36,8 @@
 import shutil
 import math
 import tempfile
-from itertools import tee
+from itertools import tee, islice
+import contextlib
 
 from numpy import empty
 
@@ -133,99 +134,101 @@ def _logger_setup(logfile):
         print("An error occurred setting up logging")
         raise e
 
-def _check_fastqs(fastq, fastq2=None):
+@contextlib.contextmanager
+def read_file(filename: str, mode: str ='r') -> contextlib.contextmanager:
+    """ A context manager for opening gzipped, stdz compressed or uncompressed files.
+
+        Args:
+            filename: the filename to open
+            mode: 'r' for read 'w' for write. 't' will be adeed automacially for compressed files.
+        
+        Returns a context manager so the file can be opened using 'with'
+    """
+    try:
+        if filename.endswith('.gz'):
+            file = gzip.open(filename, mode +'t')
+            yield file
+        elif filename.endswith('.zst'):
+            file = zstd.open(filename, mode + 't')
+            yield file
+        else:
+            file = open(filename, mode)
+            yield file
+    except FileNotFoundError as f:
+        logging.error("The input file {} could not be found.".format(filename))
+        raise f
+    except Exception as g:
+        logging.error("There appears to be an issue reading the input file {}.".format(filename))
+        raise g
+    finally:
+        if 'file' in locals():
+            file.close()
+
+def _check_fastqs(fastq: str, fastq2: str=None) -> None:
     """Verifies the input files are valid fastq or fastq.gz files. Also checks for interleaved files which are no longer supported.
     Args:
         fastq (str): The path to a fastq or fastq.gz file
         fastq2 (str): The path to a fastq or fastq.gz file for the reverse sequences
 
     Raises:
         ValueError: If Biopython detected invalid FASTQ files
-        AssertionError: If interleaved files are detected
     """
-    def check_interleaved(file):
-        try:
-            if file.endswith('.gz'):
-                f = gzip.open(file, 'rt')
-            elif file.endswith('.zst'):
-                f = zstd.open(file, 'rt')
-            else:
-                f = open(file, 'r')
-            lines = f.readlines()
-            L1 = lines[0:24:8]
-            L2 = lines[4:24:8]
-            L1_old = [i.strip().split('/', 1)[0] for i in L1]
-            L2_old = [i.strip().split('/', 1)[0] for i in L2]
-            L1_new = [i.strip().split(' ', 1)[0] for i in L1]
-            L2_new = [i.strip().split(' ', 1)[0] for i in L2]
-            assert L1_old != L2_old and L1_new != L2_new
-
-        except AssertionError as a:
-            logging.error("'File may be interleaved. ITSxpress will run with errors. Check BBmap reformat.sh to split interleaved files before using ITSxpress.")
-            raise a 
-        except IOError as f:
-            logging.error("File may be wrong format for interleaved file check.")
-            raise f
+    # Validation method from BBtools (Thanks Brian!)
+    def test_pair_names_str(id1, id2):
+        len1 = len(id1)
+        len2 = len(id2)
+        if len1 != len2:
+            return False  # Can happen in PacBio names, but never Illumina
+        idx_slash1 = id1.rfind('/')
+        idx_slash2 = id2.rfind('/')
+        idx_space1 = id1.find(' ')
+        idx_space2 = id2.find(' ')
 
-
-    def core(file):   
-        try:
-            if file.endswith('.gz'):
-                f = gzip.open(file, 'rt')
-            elif file.endswith('.zst'):
-                f = zstd.open(file, 'rt')
-            else:
-                f = open(file, 'r')
-            n = 0
-            for record in SeqIO.parse(f, 'fastq'):
-                while n < 100:
-                    n += 1
-            check_interleaved(file)
-            f.close()
-            if fastq2:
-                if fastq2.endswith('.gz'):
-                    f = gzip.open(fastq2, 'rt')
-                elif file.endswith('.zst'):
-                    f = zstd.open(file, 'rt')
-                else:
-                    f = open(fastq2, 'r')
-                n = 0
-                for record in SeqIO.parse(f, 'fastq'):
-                    while n < 100:
-                        n += 1
-            f.close()
-        except ValueError as e:
-            logging.error("There appears to be an issue with the format of input file {}.".format(file))
-            raise e
-        except FileNotFoundError as f:
-            logging.error("The input file {} could not be found.".format(file))
-            raise f
-        except Exception as g:
-            logging.error("There appears to be an issue reading the input file {}.".format(file))
-            raise g
-
-    core(fastq)
+        if idx_space1 == idx_space2 and idx_space1 > 0 and len1 >= idx_space1 + 3 and len2 >= idx_space2 + 3:
+            if id1[idx_space1 + 1] == '1' and id1[idx_space1 + 2] == ':' and id2[idx_space2 + 1] == '2' and id2[idx_space2 + 2] == ':':
+                for i in range(idx_space1):
+                    if id1[i] != id2[i]:
+                        return False
+                return True
+
+        if idx_slash1 == idx_slash2 and idx_slash1 > 0 and len1 >= idx_slash1 + 2 and len2 >= idx_slash2 + 2:
+            if id1[idx_slash1 + 1] == '1' and id2[idx_slash2 + 1] == '2':
+                for i in range(idx_slash1):
+                    if id1[i] != id2[i]:
+                        return False
+                for i in range(idx_slash1 + 2, len1):
+                    if id1[i] != id2[i]:
+                        return False
+                return True
+
+        return id1 == id2
+
+    def core(filename):
+        with read_file(filename) as handle:
+            records = SeqIO.parse(handle, 'fastq')
+            reclist = list(islice(records, 2))
+            return test_pair_names_str(reclist[0].id, reclist[1].id)
+
+    def warn_mess(filename):
+        return logging.warning( "The file {} may be interleaved, which is not supported. Please verify your input file manually.")
+
+    if core(fastq):
+        warn_mess(fastq)
     if fastq2:
-        core(fastq2)
+        if core(fastq2):
+            warn_mess(fastq2)
 
 def _check_total_reads(file, file2 = None):
     """Check the total number of reads in the input file(s).
     """
     #Count every fourth line in fastq file.
     def core(file):
-        if file.endswith('.gz'):
-            f = gzip.open(file, 'rt')
-        elif file.endswith('.zst'):
-            f = zstd.open(file, 'rt')
-        else:
-            f = open(file, 'r')
-        n = 0
-
-        for i, line in enumerate(f):
-            if i % 4 == 0:
-                n += 1
-        f.close()
-        return n
+        with read_file(file) as handle:
+            n = 0
+            for i, _ in enumerate(handle):
+                if i % 4 == 0:
+                    n += 1
+            return n
 
     reads = core(file)
     logging.info("Total number of reads in file {} is {}.".format(file, reads))