Releases: soedinglab/MMseqs2

MMseqs2 Release 7-4e23d

29 Nov 23:43

Changes since release 6-f5a1c

New features

  • Simplified taxonomy. We add the tool createtaxdb to create a taxonomically annotated database. Result databases can be filtered by taxonomy with filtertaxdb, and addtaxonomy appends taxonomy information to result databases.
  • index (createindex) support for translated target database searches
  • add nucleotide search (experimental)
  • support NEON CPU architecture (experimental)
  • improve performance of prefilter if L2 is greater than 256K
  • easy-search automatically computes backtrace if requested by --format-output
  • Create search-2m workflow, similar to 2bLCA but without the LCA computation
  • We add a database preload mode. Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch. The processing time per query with fread is 15% faster, but the read-in is slower. mmap is used for the MMseqs2 webserver; it enables instant searches if the database is already in memory. mmap+touch uses mmap and touches every page.
  • We add a new tool touchdb, which loads the database into memory. This can be useful for --db-load-mode 2.
  • add local hard disk support (--local-tmp) for MPI runs. This reduces pressure on the NFS
  • Introduce sortresult tool to sort an unordered sequence db (e.g. from mergeresult)
  • prefilter now supports indexes with k-mer ranges > 2^31
  • convertkb can read multiple files
  • speed up mmap memory touch function
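
The "mmap+touch" idea from the preload modes above can be sketched in a few lines of Python. This is only an illustration of the general technique (map a file, then read one byte per page so the operating system loads it), not the MMseqs2 implementation; the function name touch_pages is hypothetical.

```python
import mmap
import os


def touch_pages(path):
    """Map a file read-only and read one byte per page.

    Returns the number of pages touched. Illustrative sketch only;
    the name and return value are assumptions, not MMseqs2 code.
    """
    if os.path.getsize(path) == 0:
        return 0  # an empty file cannot be memory-mapped
    touched = 0
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), mmap.PAGESIZE):
                _ = mm[offset]  # reading a byte forces the page to be loaded
                touched += 1
    return touched
```

After such a pass the pages are likely to be in the page cache, so a later mmap-based reader does not pay the page-fault cost on first access.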

Breaking changes

  • new index version. Recomputation of old indexes is needed
  • --format-output is now comma separated
  • changed taxonomy database format, old taxonomy databases are not supported anymore

Default parameter change

  • extractorfs default is now --orf-start-mode 1. This is important for translated searches in organisms with introns.

Bug fixes

  • Fix wrong alignment positions for translated searches
  • Fix off-by-one error in extractalignedregion
  • Fix bug in NcbiTaxonomy tool
  • Fix e-value threshold if -e < --e-profile

Developer

  • Update to newest ALP version

MMseqs2 Release 6-f5a1c

09 Oct 01:40

Changes since release 5-9375b

New features

  • Support user defined output format in convertalis.
  • Add parameters for gap open and gap extension costs.
  • Improve substitution matrix support. Letters of the alphabet can now be chosen freely.
  • Add a few PAM matrices to the data folder. Choose them with the --sub-mat parameter.
  • Support IUPAC codes in translated search.
  • Add parameter to define a spaced k-mer pattern.
  • Add a new module ungappedprefilter. It computes an optimal ungapped score using a vectorized algorithm.

Bug fixes

  • Fix easy-linclust parameter parsing issue.
  • Fix coverage filtering in align when the parameter --realign is set.
  • Fix sequence identity computation in rescorediagonal --rescore-mode 2.
  • Fix apply MPI support.
  • Fix representative sequence output bug in result2repseq.
  • Fix possible MPI issues in modules creating symlinks.
  • Fix slightly wrong E-value computed in alignall module.

Known Issues

  • easy-search output has only one column. Workaround: Add parameter --format-output "".

MMseqs2 Release 5-9375b

04 Sep 23:20

Changes since release 4-0b8cc

Bug fixes

  • bool flag parameters (e.g. -a) work again
  • swapresults will deterministically rank results
  • shellcompletion does not report run time anymore

MMseqs2 Release 4-0b8cc

04 Sep 15:31

Changes since release 3-be8f6

New features

  • Alternative alignments in search (--alt-ali). Find alignments by masking out previously found regions in the target sequence.
  • Added map workflow for fast near-exact mapping of reads
  • Added easy-linclust workflow, that works on FASTA files
  • Sequence lengths longer than 32k are now supported (default sequence length limit is now 65535)
  • createdb shuffles the order of entries by default (--dont-shuffle to disable), useful for database splits, where one split could take much longer than others
  • linclust now supports MPI
  • linclust adds one hash for the whole sequence, to improve exact sequence matching
  • New sequence identity computation modes, where the normalization happens on the query or target length instead of alignment length
  • New --cov-mode that computes the coverage only based on sequence lengths (--cov-mode 3)
  • search/cluster/linclust workflows have learned --alignment-mode 4 for faster ungapped alignments
  • Translated search now sorts results by E-value and aggregates all ORFs under the corresponding contig identifier
  • prefiltering can now sort hits with score > 255 correctly
  • convertalis now works with profiles
  • Added generalized database transposition tool swapdb (swapresults only makes sense for prefiltering/alignment results)
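
The difference between the sequence identity normalizations mentioned above (alignment length versus query or target length) comes down to the choice of denominator. A minimal sketch, assuming identity is the number of identical positions divided by the chosen length; the function name and the "alignment"/"query"/"target" labels are hypothetical and do not correspond to MMseqs2 parameter values.

```python
def sequence_identity(identical, aln_len, query_len, target_len, norm="alignment"):
    """Fraction of identical positions, normalized by the chosen length.

    norm is one of "alignment", "query", "target" (labels are assumptions).
    """
    lengths = {"alignment": aln_len, "query": query_len, "target": target_len}
    if norm not in lengths:
        raise ValueError(f"unknown normalization: {norm}")
    denom = lengths[norm]
    if denom <= 0:
        raise ValueError("length used for normalization must be positive")
    return identical / denom
```

For the same alignment with 50 identical positions over 100 aligned columns, a 200-residue query gives 0.25 under query normalization while the alignment-length normalization gives 0.5, which is why the mode matters for thresholds.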

Performance

  • Speedup extractorf with vectorization
  • Many performance improvements to reduce overhead for web server mode
  • createtsv writes output in parallel
  • Avoid many unnecessary memory allocations in various modules

Bug fixes

  • convertmsa now correctly parses STOCKHOLM files without accession keys
  • In search, when using splits, fewer than --max-seqs sequences would be the limit; the limit is now computed correctly (max-seqs/Splits + 4*sqrt(max-seqs/Splits))
  • Fix bug in MsaFilter where wrong sequences would be filtered
  • swapresults will add an empty entry if a target entry has no corresponding query match, instead of no entry at all
  • createindex now correctly creates a tmp directory if no directory exists already
  • Fix query split runs for small input databases
  • result2stats was reading the wrong first sequence (from query instead of target database)
  • result2repseq now writes the correct .dbtype file
  • convertalis now reads the correct dbtype for the target sequence
  • Fix empty REG_EMPTY bug on macOS
  • Fix possible memory corruption when searching against database indexed by 'createindex'
  • Report error if -DHAVE_MPI was set and MPI is not installed on the system
  • Avoid race condition in kmermatcher (invalid parallel writing to vector)
  • Fix msa2profile header output format
  • msa2profile uses the FASTA readin mode by default now
  • Target profile databases and databases built with --exact-kmer-matching now correctly extract all k-mers
  • Fix identical score computation of alignment if clustering using profiles
  • Nucleotide backtranslation translateaa would produce invalid codons for X
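
The per-split limit quoted in the bug fix list above, max-seqs/Splits + 4*sqrt(max-seqs/Splits), can be written out as a small function. This is a sketch of the formula only; the function name is hypothetical, and how MMseqs2 rounds the value is not stated in the notes, so the result is left as a float.

```python
import math


def per_split_limit(max_seqs, splits):
    """Evaluate max_seqs/splits + 4*sqrt(max_seqs/splits).

    Rounding is not specified in the release notes, so a float is returned.
    """
    if splits <= 0:
        raise ValueError("splits must be positive")
    per_split = max_seqs / splits
    return per_split + 4 * math.sqrt(per_split)
```

For example, with --max-seqs 300 and 3 splits, the even share is 100 and the extra margin is 4*sqrt(100) = 40, giving a limit of 140 per split.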

Others

  • removed --early-exit
  • Output name of program called

Experimental new modules

  • new fast alignment method alignbykmer

Developers

  • Cmake flag -DHAVE_GPROF for profiling MMseqs2 using gprof
  • Fixed most warnings
  • SSTR does not use stringstreams anymore
  • Refactored time measuring
  • Debug::INFO/WARNING/ERROR is now used consistently across the codebase
  • If available, shellcheck (https://github.com/koalaman/shellcheck) will critique shell scripts and fail the compilation

MMseqs2 Release 3-be8f6

28 May 08:11

Changes since 2-23394 Release

New Features

  • Create simple workflows (FASTA/FASTQ in, flat file out) for clustering (easy-cluster) and searching (easy-search)
  • Add a new greedy incremental clustering algorithm to the clust module, which needs less memory
  • Make the new low memory clustering algorithm default if --cov-mode 1 is used in linclust and cluster
  • Add alignall module for all-against-all alignments of e.g. clusters
  • Improved Windows support
  • filterdb learned new modes

Bug fixes

  • Fix wrong merging code in linclust
  • Fix e-value issues in target-split case
  • Fix seg. fault in rescore diagonal if 'z' is used
  • Fix seg. fault when using masking in kmermatcher
  • Fix wrong filterdb default mode
  • prefilter overestimated the required amount of memory and refused to run
  • prefilter scores would saturate too early, now they have the full 2^16 range

Others

  • Profile searches create fewer high-scoring false positives through better compositional bias correction and masking of low complexity regions of profiles
  • Clustering now supports the whole 2^32 range instead of the previous 2^31
  • Speed up clustering when using --cov-mode 1
  • Rework symlinks to the header databases
  • Support profiles on query and target side in result2profile

MMseqs2 Release 2-23394

05 Mar 16:28

Changes since 1-c7a89 Release

New Features

  • Translated searches (blastx and tblastn like search modes)
  • Improved splitting of input sequences in kmermatcher (less memory needed for linclust)
  • linclust supports nucleotide sequences (experimental feature, k-mer length is not yet optimized)
  • search supports nucleotide-nucleotide searches (preview, not stable yet)
  • pssm2profile module to print human readable profiles
  • msa2profile has a gap match mode to convert multiple sequence alignments without representative sequence to profile databases
  • Compute sequence identity in a similar way to BLAST if --alignment-mode 3 is used
  • apply module to execute an arbitrary program on each entry of a mmseqs database. Like map from MapReduce.
  • extractorf can use start/stop codons from alternative translation tables
  • filterdb now can append entries from other databases by looking them up
  • proteinaln2nucl maps a protein alignment back to its original nucleotide sequences
  • taxonomy now can blacklist nodes (per default the unclassified and others nodes)
  • tmp folder is automatically created, all workflow intermediate results are placed in a subfolder based on the hash of all paths and parameters
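
The idea of placing intermediate results in a subfolder derived from a hash of all paths and parameters can be sketched as follows. The hash function (SHA-256), the 16-character truncation, and the function name are all assumptions made for illustration; the release notes do not say how MMseqs2 computes the hash.

```python
import hashlib
import os


def workflow_tmp_dir(tmp_root, paths, params):
    """Derive a tmp subfolder name from the input paths and parameters.

    SHA-256 and the truncation to 16 hex characters are illustrative choices.
    """
    digest = hashlib.sha256()
    for item in list(paths) + list(params):
        digest.update(item.encode("utf-8"))
        digest.update(b"\0")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return os.path.join(tmp_root, digest.hexdigest()[:16])
```

The same call with the same inputs maps to the same subfolder, while changing any path or parameter yields a different one, so runs with different settings do not share intermediate files.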

Performance Regressions Fixed

  • Fixed regression when multiple mmseqs instances were running at the same time

Breaking Command Line Interface Changes

  • Incremented index version, old precomputed indices have to be regenerated
  • New Profile format, databases generated through convertprofiledb and msa2profile have to be regenerated
  • Clustering workflow is now by default cascaded. We replaced the --cascaded flag with --single-step-clustering
  • Max sequence length of 32768 is now actually validated and enforced
  • Each sequence database has now a dbtype file (AA=0, NUC=1, PROFILE=2)
  • extractorf was reworked:
    * --skip-incomplete was split into two parameters --contig-start-mode and --contig-end-mode
    * --longest-orf was reworked into --orf-start-mode
    * removed --extend-min parameter

Others

  • Four times faster clustering workflow
  • Improve speed of linclust by a factor of two
  • Remove 'X' from prefilter index (reduces memory and improves speed at the same sensitivity)
  • Fix bugs for Query coverage mode (--cov-mode 2)
  • Clustering is now the same between single and multi threaded version
  • Speedup of kmermatcher
  • Fix bug in Clust hash. It can now cluster to 1.0 sequence identity
  • Improve target profile search, set max-seqs to infinite for alignments.
  • Improve speed of align if the prefilter result fits into memory
  • Many usability improvements
  • Improved suggestions of bash completion
  • Expert modules are hidden by default, use -h flag to show everything
  • Speed up mergeclusters by a lot
  • Fix sequence identity print out bug if the id is less than 10%
  • MPI Runner variable can now correctly contain further parameters (RUNNER="mpirun -np 4" was not working)
  • Enforcing GCC 4.6 compatibility in our continuous integration

Developers

  • MMseqs2 can now be included in framework mode to subprojects
  • DBReader has a SHUFFLE mode

MMseqs2 Release 1-c7a89

29 Oct 10:04

Changes since vNatBiotech Release

New Features

  • Taxonomy classification workflow with robust 2bLCA computation and fast LCA computation in O(N log N)
  • Support reading .bz2 archives for createdb
  • createdb can now turn multiple FASTA files into one database
  • Extend prefilter score range to improve order of best hits after prefiltering.
  • Automatically split input sequence set based on system RAM in kmermatcher. Linclust can now run with less memory.
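
For readers unfamiliar with the lowest common ancestor (LCA) step in the taxonomy workflow above, here is a deliberately naive sketch on a toy tree. It walks each lineage to the root and intersects them; it is not the O(N log N) method or the 2bLCA procedure named in these notes. The parent map, the node ids, and the convention that the root is its own parent are assumptions of the example.

```python
def lineage(node, parent):
    """Return the path from node up to the root (root is its own parent)."""
    path = [node]
    while parent[node] != node:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(nodes, parent):
    """Naive LCA of a non-empty collection of nodes via lineage intersection."""
    nodes = list(nodes)
    if not nodes:
        raise ValueError("need at least one node")
    common = lineage(nodes[0], parent)  # ordered from deepest to root
    for node in nodes[1:]:
        ancestors = set(lineage(node, parent))
        common = [a for a in common if a in ancestors]
    return common[0]  # deepest node shared by all lineages


# Toy taxonomy: 1 is the root; 2 and 5 are children of 1; 3 and 4 are children of 2.
toy_parent = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}
```

With this toy tree, hits assigned to nodes 3 and 4 resolve to node 2, while hits on 3 and 5 only agree at the root.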

Performance Regressions Fixed

  • Fixed underperforming iterative-sequence-profile search without a precomputed index table

Breaking Command Line Interface Changes

  • Iterative-non-profile-search --sens-step-size changed to --sens-steps (Number of Iterations) (Does not break nested workflows anymore)

Others

  • Query coverage mode (--cov-mode 2) for searching
  • Clustering is now the same between single and multi threaded version
  • Bug fixes in rescorediagonal
  • Speedup of kmermatcher
  • Speedup and memory reduction of swapresults
  • Many usability improvements

Developers

  • MMseqs2 can now be included in framework mode to subprojects

Nature Biotechnology Release

08 Aug 16:06

Release for Nature Biotechnology