MMseqs2 Release 2-23394

martin-steinegger released this 05 Mar 16:28

Changes since 1-c7a89 Release

New Features

  • Translated searches (blastx- and tblastn-like search modes)
  • Improved splitting of input sequences in kmermatcher (less memory needed for linclust)
  • linclust supports nucleotide sequences (experimental feature, k-mer length is not yet optimized)
  • search supports nucleotide-nucleotide searches (preview, not stable yet)
  • pssm2profile module to print human readable profiles
  • msa2profile has a gap match mode to convert multiple sequence alignments without a representative sequence to profile databases
  • Compute sequence identity in a similar way to BLAST if --alignment-mode 3 is used
  • apply module to execute an arbitrary program on each entry of an mmseqs database, like map from MapReduce
  • extractorf can use start/stop codons from alternative translation tables
  • filterdb can now append entries from other databases by looking them up
  • proteinaln2nucl maps a protein alignment back to its original nucleotide sequences
  • taxonomy can now blacklist nodes (by default, the unclassified and others nodes)
  • The tmp folder is created automatically; all intermediate workflow results are placed in a subfolder named after the hash of all paths and parameters
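
The tmp folder behaviour described in the last item can be illustrated with a minimal Python sketch. It is not MMseqs2's implementation: the function name `workflow_subfolder`, the use of SHA-256, the 16-character folder name, and the way paths and parameters are joined are all assumptions made for illustration. It only shows why hashing the inputs gives each distinct invocation its own subfolder while an identical invocation lands in the same one.

```python
import hashlib
import os


def workflow_subfolder(tmp_root, paths, params):
    """Return (and create) a subfolder of tmp_root named after a hash.

    Hypothetical helper: SHA-256 and the joining scheme are assumptions,
    not the hash MMseqs2 actually uses.
    """
    hasher = hashlib.sha256()
    for item in list(paths) + list(params):
        hasher.update(item.encode("utf-8"))
        # Separator byte so that ["ab", "c"] and ["a", "bc"] hash differently.
        hasher.update(b"\0")
    subfolder = os.path.join(tmp_root, hasher.hexdigest()[:16])
    # Creates tmp_root as well if it does not exist yet.
    os.makedirs(subfolder, exist_ok=True)
    return subfolder
```

Calling the helper twice with the same paths and parameters returns the same directory, so a repeated run finds the intermediate results of the earlier one; changing any path or parameter yields a different directory.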

Performance Regressions Fixed

  • Fixed regression when multiple mmseqs instances were running at the same time

Breaking Command Line Interface Changes

  • Incremented index version; old precomputed indices have to be regenerated
  • New profile format; databases generated through convertprofiledb and msa2profile have to be regenerated
  • Clustering workflow is now by default cascaded. We replaced the --cascaded flag with --single-step-clustering
  • Max sequence length of 32768 is now actually validated and enforced
  • Each sequence database now has a dbtype file (AA=0, NUC=1, PROFILE=2)
  • extractorf was reworked:
    * --skip-incomplete was split into two parameters --contig-start-mode and --contig-end-mode
    * --longest-orf was reworked into --orf-start-mode
    * removed --extend-min parameter
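
For scripts that need to interpret the new dbtype values, the codes listed above can be captured in a small lookup. The following Python sketch is illustrative only: the names `DBTYPE_NAMES` and `dbtype_name` are hypothetical, and because these notes do not say how the code is stored in the dbtype file, the function takes an already decoded integer.

```python
# Codes as listed in these release notes: AA=0, NUC=1, PROFILE=2.
DBTYPE_NAMES = {0: "AA", 1: "NUC", 2: "PROFILE"}


def dbtype_name(code):
    """Map a dbtype code to its name; reject codes not listed in the notes."""
    try:
        return DBTYPE_NAMES[code]
    except KeyError:
        raise ValueError(f"unknown dbtype code: {code!r}") from None
```

Raising on unknown codes, rather than returning a default, keeps a script from silently treating a database of an unexpected type as amino acid sequences.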

Others

  • Clustering workflow is four times faster
  • linclust is two times faster
  • Remove 'X' from prefilter index (reduces memory and improves speed at the same sensitivity)
  • Fix bugs for Query coverage mode (--cov-mode 2)
  • Clustering is now the same between single and multi threaded version
  • Speedup of kmermatcher
  • Fix bug in Clust hash. It can now cluster to 1.0 sequence identity
  • Improve target profile search; max-seqs is set to infinite for alignments
  • Improve speed of align if the prefilter results fit into memory
  • Many usability improvements
  • Improved suggestions of bash completion
  • Expert modules are hidden by default, use -h flag to show everything
  • Substantially faster mergeclusters
  • Fix sequence identity printout bug when the identity is below 10%
  • MPI Runner variable can now correctly contain further parameters (RUNNER="mpirun -np 4" was not working)
  • Enforcing GCC 4.6 compatibility in our continuous integration
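
The RUNNER fix above concerns a common pitfall: a launcher string such as "mpirun -np 4" has to be split into separate arguments, otherwise the whole string is taken as the name of one executable. The Python sketch below only illustrates that pitfall with the standard `shlex` module; it is not how MMseqs2 handles RUNNER, and `build_command` and the positional names in the example call are hypothetical placeholders.

```python
import shlex


def build_command(runner, program_args):
    """Prefix program_args with the runner, split into separate arguments.

    An empty or missing runner leaves the command unchanged.
    """
    return shlex.split(runner or "") + list(program_args)


# Placeholder database names; only the splitting of the runner matters here.
print(build_command("mpirun -np 4", ["mmseqs", "search", "queryDB", "targetDB", "resultDB", "tmp"]))
```

Passing the unsplit string as a single argument would instead ask the operating system for a program literally named "mpirun -np 4", which is the kind of failure the fix addresses.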

Developers

  • MMseqs2 can now be included in other projects as a subproject (framework mode)
  • DBReader has a SHUFFLE mode