MMseqs2 Release 2-23394 (soedinglab/MMseqs2)

Changes since Release 1-c7a89

New Features

Translated searches (blastx and tblastn like search modes)
Improved splitting of input sequences in kmermatcher (less memory needed for linclust)
linclust supports nucleotide sequences (experimental feature, k-mer length is not yet optimized)
search supports nucleotide-nucleotide searches (preview, not stable yet)
pssm2profile module to print human-readable profiles
msa2profile has a gap match mode to convert multiple sequence alignments without a representative sequence to profile databases
Compute sequence identity in a similar way to BLAST if --alignment-mode 3 is used
apply module to execute an arbitrary program on each entry of an mmseqs database, like map from MapReduce
extractorf can use start/stop codons from alternative translation tables
filterdb can now append entries from other databases by looking them up
proteinaln2nucl maps a protein alignment back to its original nucleotide sequences
taxonomy can now blacklist nodes (by default the unclassified and others nodes)
tmp folder is created automatically; all intermediate workflow results are placed in a subfolder named after the hash of all paths and parameters
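
The subfolder naming in the last item can be pictured with a minimal Python sketch. It is not MMseqs2's actual implementation: the helper names `workflow_subdir` and `prepare_tmp`, the choice of SHA-256, and the 16-character truncation are all assumptions made for illustration.

```python
import hashlib
import os

def workflow_subdir(tmp_dir, paths, params):
    """Return a subfolder of tmp_dir named after a hash of paths and parameters.

    Hypothetical helper: the real hash function and naming scheme may differ.
    """
    hasher = hashlib.sha256()
    for item in list(paths) + list(params):
        hasher.update(item.encode("utf-8"))
        # Separator byte so that ("ab", "c") and ("a", "bc") hash differently.
        hasher.update(b"\0")
    return os.path.join(tmp_dir, hasher.hexdigest()[:16])

def prepare_tmp(tmp_dir, paths, params):
    """Create tmp_dir and the run-specific subfolder if they do not exist yet."""
    sub = workflow_subdir(tmp_dir, paths, params)
    os.makedirs(sub, exist_ok=True)
    return sub
```

Under this scheme, rerunning a workflow with identical paths and parameters lands in the same subfolder, while any changed parameter produces a different one.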

Incremented index version; old precomputed indices have to be regenerated
New profile format; databases generated through convertprofiledb and msa2profile have to be regenerated
Clustering workflow is now cascaded by default. We replaced the --cascaded flag with --single-step-clustering
Max sequence length of 32768 is now actually validated and enforced
Each sequence database now has a dbtype file (AA=0, NUC=1, PROFILE=2)
extractorf was reworked:
* --skip-incomplete was split into two parameters --contig-start-mode and --contig-end-mode
* --longest-orf was reworked into --orf-start-mode
* removed --extend-min parameter
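
The dbtype codes and the sequence length limit listed above can be summarized in a small Python sketch. The numeric values come from these notes; the names `DbType`, `MAX_SEQ_LEN` and `check_length` are hypothetical, and treating a sequence of exactly 32768 residues as valid is an assumption. How the dbtype file stores its value on disk is not specified here, so the sketch does not read it.

```python
from enum import IntEnum

MAX_SEQ_LEN = 32768  # maximum sequence length stated in the release notes

class DbType(IntEnum):
    """Database type codes as listed in the release notes."""
    AA = 0
    NUC = 1
    PROFILE = 2

def check_length(seq, max_len=MAX_SEQ_LEN):
    """Return seq unchanged, or raise if it exceeds max_len.

    Assumption: the limit is inclusive, so a sequence of exactly
    max_len residues is accepted.
    """
    if len(seq) > max_len:
        raise ValueError(
            "sequence length %d exceeds maximum %d" % (len(seq), max_len)
        )
    return seq
```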

Clustering workflow is four times faster
Improve speed of linclust by a factor of two
Remove 'X' from prefilter index (reduces memory and improves speed at the same sensitivity)
Fix bugs in query coverage mode (--cov-mode 2)
Clustering results are now identical between the single-threaded and multi-threaded versions
Speedup of kmermatcher
Fix bug in clusthash; it can now cluster at 1.0 sequence identity
Improve target profile search, set max-seqs to infinite for alignments.
Improve speed of align if the prefilter results fit into memory
Many usability improvements
Improved suggestions of bash completion
Expert modules are hidden by default, use -h flag to show everything
Substantially speed up mergeclusters
Fix a bug in printing the sequence identity when it is below 10%
MPI Runner variable can now correctly contain further parameters (RUNNER="mpirun -np 4" was not working)
Enforcing GCC 4.6 compatibility in our continuous integration
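
The RUNNER fix above concerns a variable that holds a launcher together with its own arguments. The sketch below shows, in Python, why such a value has to be split into separate words before it is put in front of a command. `build_command` is a hypothetical helper and the mmseqs arguments in the usage note are placeholders; the MMseqs2 workflows themselves are shell scripts and handle this differently.

```python
import shlex

def build_command(runner, command):
    """Prefix a command (list of arguments) with the words found in runner.

    runner is a string such as "mpirun -np 4". Passing it on as one single
    argument would make the system look for a program literally named
    "mpirun -np 4", so it is split into words first.
    """
    return shlex.split(runner) + list(command)
```

For example, `build_command("mpirun -np 4", ["mmseqs", "prefilter"])` yields `["mpirun", "-np", "4", "mmseqs", "prefilter"]`, and an empty runner leaves the command unchanged.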