Releases: dportik/SuperCRUNCH

v1.3.2

03 Jul 16:46
  • Increased speed of Cluster_Blast_Extract.py and Reference_Blast_Extract.py by changing the underlying data structure used to gather blast coordinates per accession.
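The note above does not say which data structure was adopted. As an illustration only, one common way to gather coordinates per accession in a single pass is a dictionary of lists; the function name and the accession values below are hypothetical, not taken from SuperCRUNCH.

```python
from collections import defaultdict


def group_coordinates(blast_rows):
    """Group (start, end) hit coordinates by accession in one pass.

    blast_rows is an iterable of (accession, start, end) tuples.
    This is a sketch, not the code used in Cluster_Blast_Extract.py.
    """
    coords = defaultdict(list)
    for accession, start, end in blast_rows:
        # Store each pair as (low, high); reverse-strand hits arrive swapped.
        coords[accession].append((min(start, end), max(start, end)))
    return dict(coords)


# Hypothetical accessions, for illustration only.
rows = [("KX123", 10, 250), ("MN456", 300, 5), ("KX123", 240, 600)]
grouped = group_coordinates(rows)
```

A dictionary lookup is constant time on average, so collecting hits this way avoids rescanning the full hit list once per accession.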

v1.3.1

16 May 15:46

Fixed issue with voucher identification step, which prevented relabeling and recognition.

v1.3.0

28 Jan 18:01

Changes in v1.3.0:

  • Added a conda environment recipe for SuperCRUNCH, allowing easy installation of all requirements except MACSE.
  • Parse_Loci.py: Added a feature that allows a negative term to be added to the loci search terms; a record is excluded if it matches this term. For example, adding the negative term pseudogene will exclude all records containing that word, even if they match the other abbreviation or description terms. This requires a four-column search terms file, where the fourth column is the negative term (N/A in this column indicates no negative term should be used). The module remains backwards-compatible with the three-column search terms file: if a fourth column is not present, N/A is generated automatically.
  • Filter_Seqs_and_Species.py: Added the --accessions_include flag, which points to a text file of accession numbers (one per line). When used with the --seq_selection oneseq option, an accession from this list that is found among the available seqs for a taxon and gene is always selected. This is not just an "allowed list": it overrides other selection settings, such as length. Also added the --accessions_exclude flag, which points to a text file of accession numbers (one per line). These accessions will NEVER be selected, because they are removed from all searches. This is the equivalent of a "blocked list".
  • Taxa_Assessment.py: Altered the SQL search query for "unmatched" taxa to avoid exceeding the maximum SQL variable limit. Also, for sequence files >5GB the module now invokes the SeqIO.index_db() method rather than SeqIO.index(); index_db() is much more memory efficient for big data and is already used in Parse_Loci.py.
  • Cluster_Blast_Extract.py: Added a feature to remove problematic long sequences if they somehow end up in the main cluster of sequences for a gene. The new filter removes all seqs longer than 1.3x the 95th percentile of all sequence lengths.
  • Added a new Remove_Long_Accessions.py module, which can filter a downloaded GenBank fasta file to remove extremely long sequences (>150kb). This will eliminate whole genome sequencing records, which are not useful for SuperCRUNCH.
  • Updated recognition for file extensions produced by updated blastn tools (.ndb, .not, .ntf, .nto).
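The negative-term behavior described for Parse_Loci.py can be sketched as follows. This is a minimal illustration rather than the module's actual code: it assumes a tab-delimited search terms file, and the function names, dictionary keys, and example loci are hypothetical.

```python
def parse_search_terms(lines):
    """Parse search-term rows, padding three-column rows with 'N/A'.

    Assumes tab-delimited columns: locus name, abbreviation terms,
    description terms, and an optional negative term.
    """
    loci = []
    for line in lines:
        cols = [c.strip() for c in line.rstrip("\n").split("\t")]
        if len(cols) < 3 or not cols[0]:
            continue  # skip blank or malformed rows
        if len(cols) == 3:
            cols.append("N/A")  # backwards compatibility: no negative term
        name, abbrevs, descs, negative = cols[:4]
        loci.append({"name": name, "abbrevs": abbrevs,
                     "descs": descs, "negative": negative})
    return loci


def is_excluded(description, locus):
    """Return True if the record description contains the negative term."""
    negative = locus["negative"]
    return negative.upper() != "N/A" and negative.lower() in description.lower()


# Hypothetical rows: one four-column, one legacy three-column.
rows = [
    "CMOS\tCMOS;C-MOS\toocyte maturation factor\tpseudogene\n",
    "ND2\tND2\tNADH dehydrogenase subunit 2\n",
]
loci = parse_search_terms(rows)
```

Under these assumptions, a record whose description contains "pseudogene" is rejected for CMOS, while ND2, which has no negative term, is unaffected.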

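The long-sequence filter in Cluster_Blast_Extract.py can be illustrated with a short sketch. The percentile method is an assumption (nearest-rank is used here; the module may compute it differently), and the function names are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def drop_long_sequences(records, pct=95, factor=1.3):
    """Keep (name, seq) records no longer than factor x the pct-th percentile."""
    if not records:
        return []
    cutoff = factor * percentile_nearest_rank([len(s) for _, s in records], pct)
    return [(name, seq) for name, seq in records if len(seq) <= cutoff]


# Nineteen 100 bp sequences plus one 1000 bp outlier (synthetic data).
records = [("seq%d" % i, "A" * 100) for i in range(19)] + [("long", "A" * 1000)]
kept = drop_long_sequences(records)
```

Here the 95th percentile length is 100 bp, so the cutoff is 130 bp and only the 1000 bp record is dropped.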
Release for Zenodo archiving

18 Mar 17:43
v1.2.1

Added interleaved nexus output.

Release 1.2

22 Jul 21:27
  • Version 1.2:
    • Made all modules compatible with Python 2.7 and Python 3.7.
    • SQL now implemented in Parse_Loci.py (up to 30x speedup!), Filter_Seqs_and_Species.py (3x speedup), and Taxa_Assessment.py (3x speedup).
    • Added output directory specification to all modules.
    • Two trimming modules now included: Trim_Alignments_Trimal.py and Trim_Alignments_Custom.py. The Trim_Alignments_Custom.py module allows finding start and stop block positions, and row-wise (internal) sliding window trimming based on divergence.
    • Added new module Filter_Fasta_by_Min_Seqs.py to filter fasta files using a minimum number of sequences.
    • Output directory structures improved for all modules.
    • Added --quiet option to Filter_Seqs_and_Species.py for less output on screen (useful when processing large numbers of loci).
    • Added option --numerical to Fasta_Get_Taxa.py to allow non-alphabetical identifiers for subspecies/trinomial name combinations. This allows museum, field, or numerical codes to be discovered.
    • Re-ordered tasks in Cluster_Blast_Extract.py to allow completion of all steps for one fasta file before moving to next fasta file in sequence.
    • Added multithreading for BLAST searches and new --bp_bridge flag for coordinate merging in Cluster_Blast_Extract.py and Reference_Blast_Extract.py.
    • Empty fasta files sometimes produced by Coding_Translation_Tests.py are now removed.
    • Complete code re-write for Align.py, Cluster_Blast_Extract.py, Filter_Seqs_and_Species.py, Parse_Loci.py, Taxa_Assessment.py.
    • Module Relabel_Fasta.py is now Fasta_Relabel_Seqs.py.
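Coordinate merging with a bridge value, as mentioned for the --bp_bridge flag, can be sketched as below. The exact semantics of --bp_bridge are an assumption here: the sketch merges two intervals when the gap between them is at most bp_bridge base pairs. The function name is hypothetical.

```python
def merge_coordinates(intervals, bp_bridge=0):
    """Merge (start, end) intervals that overlap or lie within bp_bridge bp.

    The gap is measured as next_start - previous_end; this definition is an
    assumption for illustration, not taken from the SuperCRUNCH source.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= bp_bridge:
            # Close enough to the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


hits = [(150, 300), (1, 100), (500, 600)]
bridged = merge_coordinates(hits, bp_bridge=60)
unbridged = merge_coordinates(hits, bp_bridge=0)
```

With a bridge of 60 bp, the 50 bp gap between the first two hits is spanned and they merge into one interval, while the 200 bp gap before the third hit is not.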

Release 1.1

15 May 18:56
  • Version 1.1:
    • Added multithreading option for MAFFT and Clustal-O in Align.py
    • Added multithreading option for MAFFT in Adjust_Direction.py
    • Added arg to specify output directory for Concatenation.py
    • Corrected output column labeling in label key output files from Relabel_Fasta.py
    • Added gappyout option for trimming with trimAl in Trim_Alignments.py
    • Output sequences failing similarity searches to own file in Cluster_Blast_Extract.py and Reference_Blast_Extract.py
    • Updated documentation on wiki pages

initial release

02 Feb 21:50

Initial release of SuperCRUNCH.