Releases · dportik/SuperCRUNCH

03 Jul 16:46

dportik

v1.3.2

01e72ff

v1.3.2 Latest

Increased the speed of Cluster_Blast_Extract.py and Reference_Blast_Extract.py by changing the underlying data structure used to gather BLAST coordinates per accession.

16 May 15:46

dportik

v1.3.1

6542014

v1.3.1

Fixed an issue with the voucher identification step that prevented relabeling and recognition.

28 Jan 18:01

dportik

v1.3.0

2518876

v1.3.0

Changes in v1.3.0:

- Added a conda environment recipe for SuperCRUNCH, allowing easy installation of all requirements except MACSE.
- Parse_Loci.py: Added a feature that allows a negative term to be added to the loci search terms; a record is excluded if it matches this term. For example, adding the negative term pseudogene will exclude all records containing that word, even if they match the other abbreviation or description terms. This requires a four-column search terms file, where the fourth column is the negative term (N/A in this column indicates that no negative term should be used). The module is backwards-compatible with the three-column search terms file: if a fourth column is not present, N/A is generated automatically.
- Filter_Seqs_and_Species.py: Added the --accessions_include flag, which points to a text file of accession numbers (one per line). When used with the --seq_selection oneseq option, if an accession in the list is found among the available sequences for a taxon and gene, it must be selected. This is not just an "allowed list": it overrides other selection settings, such as length. Also added the --accessions_exclude flag, which points to a text file of accession numbers (one per line). These accessions will never be selected; they are removed from all searches. This is the equivalent of a "blocked list".
- Taxa_Assessment.py: Altered the SQL search query for "unmatched" taxa to avoid exceeding the SQL variable limit. The module now also invokes the SeqIO.index_db() method for sequence files >5GB rather than the SeqIO.index() method, because SeqIO.index_db() is much more memory efficient for big data. The SeqIO.index_db() method is already used in Parse_Loci.py.
- Cluster_Blast_Extract.py: Added a feature to remove problematic long sequences if they somehow end up in the main cluster of sequences for a gene. The new filter removes all sequences longer than 1.3x the 95th percentile of all lengths.
- Added a new Remove_Long_Accessions.py module, which can filter a downloaded GenBank fasta file to remove extremely long sequences (>150kb). This eliminates whole genome sequencing records, which are not useful for SuperCRUNCH.
- Updated recognition for file extensions produced by updated blastn tools (.ndb, .not, .ntf, .nto).
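
The long-sequence filter described for Cluster_Blast_Extract.py can be pictured with a minimal sketch. It assumes the cutoff is 1.3 times the 95th percentile of sequence lengths and that the percentile is computed by linear interpolation; the function names `percentile` and `drop_long_seqs` are hypothetical, and this is an illustration rather than SuperCRUNCH's actual code.

```python
import math


def percentile(values, pct):
    """Return the pct-th percentile (0-100) of values using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return float(ordered[lower])
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def drop_long_seqs(records, factor=1.3, pct=95):
    """Split (name, sequence) pairs into kept and removed lists.

    A sequence is removed when its length exceeds factor times the
    pct-th percentile of all lengths (assumed rule, see lead-in).
    """
    cutoff = factor * percentile([len(seq) for _, seq in records], pct)
    kept = [rec for rec in records if len(rec[1]) <= cutoff]
    removed = [rec for rec in records if len(rec[1]) > cutoff]
    return kept, removed, cutoff
```

With twenty 100 bp sequences and one 1,000 bp sequence, the 95th percentile is 100 bp, the cutoff is 130 bp, and only the 1,000 bp sequence is removed.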

18 Mar 17:43

dportik

V1.2.1

9ea0885

Release for Zenodo archiving

V1.2.1

Added interleaved NEXUS output.

22 Jul 21:27

dportik

v1.2

0f3dda9

Release 1.2

Version 1.2:
- Made all modules compatible with Python 2.7 and Python 3.7.
- SQL is now implemented in Parse_Loci.py (up to 30x speedup), Filter_Seqs_and_Species.py (3x speedup), and Taxa_Assessment.py (3x speedup).
- Added output directory specification to all modules.
- Two trimming modules now included: Trim_Alignments_Trimal.py and Trim_Alignments_Custom.py. The Trim_Alignments_Custom.py module allows finding start and stop block positions, and row-wise (internal) sliding window trimming based on divergence.
- Added new module Filter_Fasta_by_Min_Seqs.py to filter fasta files using a minimum number of sequences.
- Output directory structures improved for all modules.
- Added --quiet option to Filter_Seqs_and_Species.py for less output on screen (useful when processing large numbers of loci).
- Added option --numerical to Fasta_Get_Taxa.py to allow non-alphabetical identifiers for subspecies/trinomial name combinations. This allows museum, field, or numerical codes to be discovered.
- Re-ordered tasks in Cluster_Blast_Extract.py so that all steps are completed for one fasta file before moving to the next fasta file in the sequence.
- Added multithreading for BLAST searches and a new --bp_bridge flag for coordinate merging in Cluster_Blast_Extract.py and Reference_Blast_Extract.py.
- Removed empty fasta files sometimes produced by Coding_Translation_Tests.py.
- Completely rewrote the code for Align.py, Cluster_Blast_Extract.py, Filter_Seqs_and_Species.py, Parse_Loci.py, and Taxa_Assessment.py.
- Module Relabel_Fasta.py is now Fasta_Relabel_Seqs.py.
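
The coordinate merging controlled by --bp_bridge can be illustrated with a small sketch. It assumes that BLAST hit intervals for one accession are merged when they overlap or when the gap between them is at most the bridge value; the function name `merge_coords` and the exact gap rule are assumptions for illustration, not the modules' actual code.

```python
def merge_coords(coords, bp_bridge=0):
    """Merge (start, end) intervals that overlap or lie within bp_bridge bases."""
    merged = []
    for start, end in sorted(coords):
        if merged and start - merged[-1][1] <= bp_bridge:
            # Close enough to the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Gap is larger than the bridge: start a new interval.
            merged.append([start, end])
    return [tuple(pair) for pair in merged]
```

For example, `merge_coords([(1, 100), (120, 300), (500, 600)], bp_bridge=50)` joins the first two intervals across their 20 bp gap and leaves the third separate, while a bridge of 0 keeps all three apart.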

15 May 18:56

dportik

v1.1

22907f8

Release 1.1

Version 1.1:
- Added multithreading option for MAFFT and Clustal-O in Align.py
- Added multithreading option for MAFFT in Adjust_Direction.py
- Added arg to specify output directory for Concatenation.py
- Corrected output column labeling in label key output files from Relabel_Fasta.py
- Added gappyout option for trimming with trimAl in Trim_Alignments.py
- Sequences that fail similarity searches are now written to their own file in Cluster_Blast_Extract.py and Reference_Blast_Extract.py
- Updated documentation on wiki pages

02 Feb 21:50

dportik

v1.0

04711f2

initial release

Initial release of SuperCRUNCH.
