v1.3.0

@dportik dportik released this 28 Jan 18:01
· 23 commits to master since this release

Changes in v1.3.0:

  • Added a conda environment recipe for SuperCRUNCH, allowing easy installation of all requirements except MACSE.
  • Parse_Loci.py: Added a feature that lets a negative term be included with the search terms for a locus; a record is excluded if it matches that term. For example, adding the negative term pseudogene will exclude all records containing that word, even if they match the other abbreviation or description terms. This requires a four-column search terms file, where the fourth column is the negative term (N/A in this column indicates that no negative term should be used). The module remains backwards-compatible with the three-column search terms file: if a fourth column is not present, N/A is filled in automatically.
  • Filter_Seqs_and_Species.py: Added the --accessions_include flag, which points to a text file of accession numbers (one per line). When used with the --seq_selection oneseq option, an accession in this list must be selected if it is found among the available seqs for a taxon and gene. This is not just an "allowed list": it overrides other selection settings, such as length. Also added the --accessions_exclude flag, which points to a text file of accession numbers (one per line). These accessions will NEVER be selected, because they are removed from all searches. This is the equivalent of a "blocked list".
  • Taxa_Assessment.py: Altered the SQL search query for "unmatched" taxa to avoid exceeding the SQL maximum variable limit. Also, the module now invokes the SeqIO.index_db() method for sequence files >5GB rather than the SeqIO.index() method; SeqIO.index_db() is much more memory efficient for big data. The SeqIO.index_db() method is already used in Parse_Loci.py.
  • Cluster_Blast_Extract.py: Added a feature to remove problematic long sequences if they somehow end up in the main cluster of sequences for a gene. The new filter removes all seqs longer than 1.3x the 95th percentile of all sequence lengths.
  • Added a new Remove_Long_Accessions.py module, which can filter a downloaded GenBank fasta file to remove extremely long sequences (>150kb). This will eliminate whole genome sequencing records, which are not useful for SuperCRUNCH.
  • Updated recognition of file extensions produced by newer versions of the blastn tools (.ndb, .not, .ntf, .nto).
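
The negative-term behavior described for Parse_Loci.py can be sketched roughly as follows. This is an illustration only, not the module's actual code: the function names are hypothetical, and the tab-delimited layout and the case-insensitive substring match are assumptions. Only the fourth-column meaning and the N/A fallback come from the notes above.

```python
def parse_search_terms(lines):
    """Parse search-term rows into dicts (hypothetical helper).

    Assumes tab-separated columns: locus name, abbreviation terms,
    description terms, and an optional negative term.
    """
    terms = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        cols = [c.strip() for c in line.rstrip("\n").split("\t")]
        if len(cols) < 3:
            raise ValueError("expected 3 or 4 columns: %r" % line)
        if len(cols) == 3:
            cols.append("N/A")  # backwards-compatible three-column row
        locus, abbreviations, descriptions, negative = cols[:4]
        terms.append({
            "locus": locus,
            "abbreviations": abbreviations,
            "descriptions": descriptions,
            # N/A means "no negative term"
            "negative": None if negative.upper() == "N/A" else negative.lower(),
        })
    return terms


def is_excluded(record_description, negative):
    """True if the record description contains the negative term (assumed substring match)."""
    return negative is not None and negative in record_description.lower()
```

With a row such as `CYTB<TAB>cytb<TAB>cytochrome b<TAB>pseudogene`, a record described as "cytochrome b pseudogene" would be excluded in this sketch, while a three-column row would never exclude anything.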
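
The interaction between the --accessions_include and --accessions_exclude lists described for Filter_Seqs_and_Species.py might look like the sketch below. The function name, the (accession, length) record layout, and the choice to let the excluded list win when an accession appears in both lists are assumptions made for illustration; picking the longest sequence stands in for whatever the other selection settings would do.

```python
def select_one(records, include=(), exclude=()):
    """Pick one accession for a single taxon and gene (hypothetical helper).

    records: list of (accession, length) tuples.
    Returns the chosen accession, or None if nothing is left.
    """
    include, exclude = set(include), set(exclude)
    # Blocked accessions are removed before any selection happens.
    # Assumption: exclusion wins if an accession is in both lists.
    candidates = [r for r in records if r[0] not in exclude]
    if not candidates:
        return None
    # An included accession overrides length-based selection.
    forced = [r for r in candidates if r[0] in include]
    pool = forced if forced else candidates
    # Fallback criterion in this sketch: longest sequence.
    return max(pool, key=lambda r: r[1])[0]
```

In this sketch a shorter sequence is still chosen when its accession is on the include list, which mirrors the note that the list overrides settings such as length.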
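
The long-sequence filter described for Cluster_Blast_Extract.py can be approximated as below. This is a sketch under stated assumptions: the function name and the dict input are hypothetical, the comparison is taken to be "longer than" the cutoff, and the percentile is computed with the standard library's `statistics.quantiles` (inclusive method), which may differ from the method the module actually uses.

```python
import statistics


def drop_long_outliers(seqs, factor=1.3, percentile=95):
    """Remove sequences longer than factor x the given length percentile.

    seqs: dict mapping a sequence name to its sequence string.
    Hypothetical helper; the percentile method is an assumption.
    """
    if len(seqs) < 2:
        return dict(seqs)  # quantiles need at least two data points
    lengths = [len(s) for s in seqs.values()]
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cut_points = statistics.quantiles(lengths, n=100, method="inclusive")
    cutoff = factor * cut_points[percentile - 1]
    return {name: s for name, s in seqs.items() if len(s) <= cutoff}
```

For example, in a cluster of nineteen 1,000 bp sequences and one 5,000 bp sequence, the long sequence falls well above the cutoff and is dropped, while the rest are kept.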