v1.3.0
Changes in v1.3.0:
- Added a `conda` environment recipe for SuperCRUNCH, allowing easy installation of all requirements except MACSE.
- `Parse_Loci.py`: Added a new feature that allows a negative term to be added to the locus search terms; a record is excluded if it matches this term. For example, adding the negative term `pseudogene` will exclude all records containing that word, even if they match the other abbreviation or description terms. This requires a four-column search terms file, where the fourth column is the negative term (`N/A` in this column indicates no negative term should be used). This module is backwards-compatible with the three-column search terms file: if a fourth column is not present, the `N/A` is generated automatically.
- `Filter_Seqs_and_Species.py`: Added the `--accessions_include` flag. This points to a text file of accession numbers (one per line). When used with the `--seq_selection oneseq` option, if an accession in the list is found among the available sequences for a taxon and gene, it must be selected. This is not just an "allowed list"; the list overrides other selection settings such as length. Also added the `--accessions_exclude` flag, which points to a text file of accession numbers (one per line). These accessions will NEVER be selected; they are removed from all searches. This is the equivalent of a "blocked list".
- `Taxa_Assessment.py`: Altered the SQL search query for "unmatched" taxa to avoid exceeding the SQL variable limit. Also, it now invokes the `SeqIO.index_db()` method for sequence files >5GB rather than the `SeqIO.index()` method; `SeqIO.index_db()` is much more memory efficient for big data. The `SeqIO.index_db()` method is already used in `Parse_Loci.py`.
- `Cluster_Blast_Extract.py`: Added a feature to remove problematic long sequences if they somehow end up in the main cluster of sequences for a gene. The new filter removes all sequences longer than 1.3x the 95th percentile of all lengths.
- Added a new `Remove_Long_Accessions.py` module, which can filter a downloaded GenBank fasta file to remove extremely long sequences (>150kb). This will eliminate whole genome sequencing records, which are not useful for SuperCRUNCH.
- Updated recognition for file extensions produced by updated blastn tools (`.ndb`, `.not`, `.ntf`, `.nto`).
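
The negative-term behaviour in `Parse_Loci.py` can be illustrated with a minimal sketch. This is not the module's actual code: the function names, the tab-delimited layout, the semicolon-separated terms, and the simple case-insensitive substring matching are all assumptions made for illustration.

```python
def parse_search_terms(lines):
    """Parse search-term rows (assumed tab-delimited) into dictionaries.

    A three-column row is padded with "N/A" so that old files still work.
    """
    loci = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        cols = [c.strip() for c in line.split("\t")]
        if len(cols) == 3:
            cols.append("N/A")  # backwards compatibility: no negative term
        name, abbreviations, descriptions, negative = cols[:4]
        loci.append({
            "name": name,
            # Assumption: multiple terms are separated by semicolons.
            "abbreviations": [t.strip().upper() for t in abbreviations.split(";") if t.strip()],
            "descriptions": [t.strip().upper() for t in descriptions.split(";") if t.strip()],
            "negative": negative,
        })
    return loci


def record_matches(description_line, locus):
    """Return True if a record matches the locus and is not vetoed by the negative term."""
    text = description_line.upper()
    positive_hit = any(
        term in text for term in locus["abbreviations"] + locus["descriptions"]
    )
    if not positive_hit:
        return False
    negative = locus["negative"]
    if negative.upper() != "N/A" and negative.upper() in text:
        return False  # the negative term overrides any positive match
    return True
```

In this sketch, a row whose fourth column is `pseudogene` rejects any record line containing that word, while a three-column row behaves exactly as it did before.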
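
The interaction between `--accessions_include`, `--accessions_exclude`, and `--seq_selection oneseq` can be sketched roughly as follows. The function name, the (accession, length) tuples, the longest-sequence default, and the choice to apply exclusions before inclusions are assumptions for illustration, not the actual logic of `Filter_Seqs_and_Species.py`.

```python
def select_one_sequence(candidates, include=frozenset(), exclude=frozenset()):
    """Pick one accession for a single taxon and gene.

    candidates: list of (accession, length) tuples (hypothetical layout).
    include:    accessions that must be chosen when available.
    exclude:    accessions that may never be chosen.
    Returns the chosen accession, or None if nothing is left.
    """
    # Excluded accessions are dropped before any other rule is applied.
    allowed = [(acc, length) for acc, length in candidates if acc not in exclude]
    if not allowed:
        return None
    # An included accession overrides the usual criterion (here: length).
    forced = [(acc, length) for acc, length in allowed if acc in include]
    pool = forced if forced else allowed
    # Assumed default criterion: the longest sequence wins.
    return max(pool, key=lambda item: item[1])[0]
```

With the made-up candidates `[("ACC0001", 900), ("ACC0002", 650), ("ACC0003", 1200)]`, listing `ACC0002` in the include set selects it even though it is the shortest sequence.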
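
The length filter described for `Cluster_Blast_Extract.py` can be approximated as below. The nearest-rank percentile and the reading of the rule as "strictly longer than 1.3x the 95th percentile" are assumptions; the module may compute the percentile differently (for example with interpolation), and the function names are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values supplied")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def drop_overlong(lengths_by_id, pct=95, factor=1.3):
    """Keep only sequences no longer than factor x the pct-th percentile length.

    lengths_by_id: dict mapping a sequence identifier to its length.
    """
    cutoff = factor * percentile_nearest_rank(lengths_by_id.values(), pct)
    return {seq_id: n for seq_id, n in lengths_by_id.items() if n <= cutoff}
```

Under these assumptions, a cluster of nineteen 1,000 bp sequences plus one 5,000 bp sequence has a 95th percentile of 1,000 bp, giving a cutoff of 1,300 bp, so only the 5,000 bp sequence is removed.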