MMseqs2 Release 4-0b8cc

@martin-steinegger martin-steinegger released this 04 Sep 15:31
· 2285 commits to master since this release

Changes since release 3-be8f6

New features

  • Alternative alignments in search (--alt-ali). Find alignments by masking out previously found regions in the target sequence.
  • Added map workflow for fast near-exact mapping of reads
  • Added easy-linclust workflow that works directly on FASTA files
  • Sequences longer than 32k are now supported (default sequence length limit is now 65535)
  • createdb shuffles the order of entries by default (--dont-shuffle to disable), useful for database splits, where one split could take much longer than others
  • linclust now supports MPI
  • linclust adds one hash for the whole sequence to improve exact sequence matching
  • New sequence identity computation modes, where the normalization happens on the query or target length instead of alignment length
  • New --cov-mode that computes the coverage only based on sequence lengths (--cov-mode 3)
  • search/cluster/linclust workflows have learned --alignment-mode 4 for faster ungapped alignments
  • Translated search now sorts results by E-value and aggregates all ORFs under the corresponding contig identifier
  • prefiltering can now sort hits with score > 255 correctly
  • convertalis now works with profiles
  • Added generalized database transposition tool swapdb (swapresults only makes sense for prefiltering/alignment results)
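The new sequence identity modes differ only in what the number of identical positions is divided by. A minimal sketch of that idea, assuming identity is the count of identical aligned positions over the chosen length; the function name and the `normalize_by` values are hypothetical and are not MMseqs2 parameters:

```python
def sequence_identity(matches, aln_len, query_len, target_len,
                      normalize_by="alignment"):
    """Return the fraction of identical positions under one normalization.

    normalize_by is a hypothetical selector for this sketch:
    "alignment", "query" or "target".
    """
    denominators = {
        "alignment": aln_len,   # previous behavior: divide by alignment length
        "query": query_len,     # new: divide by query length
        "target": target_len,   # new: divide by target length
    }
    if normalize_by not in denominators:
        raise ValueError(f"unknown normalization: {normalize_by}")
    denom = denominators[normalize_by]
    if denom <= 0:
        raise ValueError("length must be positive")
    return matches / denom
```

For a short local alignment between long sequences, normalizing by query or target length gives a much lower identity than normalizing by alignment length, which is why the choice matters for clustering thresholds.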

Performance

  • Speed up extractorf with vectorization
  • Many performance improvements to reduce overhead for web server mode
  • createtsv writes output in parallel
  • Avoid many unnecessary memory allocations in various modules

Bug fixes

  • convertmsa now correctly parses STOCKHOLM files without accession keys
  • In search, when using splits, the limit would be fewer than --max-seqs sequences; the limit is now computed correctly as max-seqs/Splits + 4*sqrt(max-seqs/Splits)
  • Fix bug in MsaFilter where wrong sequences would be filtered
  • swapresults will add an empty entry if a target entry has no corresponding query match, instead of no entry at all
  • createindex now correctly creates a tmp directory if none exists already
  • Fix query split runs for small input databases
  • result2stats was reading the wrong first sequence (from query instead of target database)
  • result2repseq now writes the correct .dbtype file
  • convertalis now reads the correct dbtype for the target sequence
  • Fix empty REG_EMPTY bug on macOS
  • Fix possible memory corruption when searching against database indexed by 'createindex'
  • Report error if -DHAVE_MPI was set and MPI is not installed on the system
  • Avoid race condition in kmermatcher (invalid parallel writing to vector)
  • Fix msa2profile header output format
  • msa2profile now uses the FASTA read-in mode by default
  • Target profile databases and databases built with --exact-kmer-matching now correctly extract all k-mers
  • Fix identical score computation of alignment if clustering using profiles
  • Nucleotide backtranslation translateaa would produce invalid codons for X
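The corrected split limit from the search fix above can be worked through numerically. This is a sketch of the formula as written in the notes, not the MMseqs2 source; the function name is hypothetical, and both the truncation to an integer and the reading of the value as a per-split limit are assumptions:

```python
import math

def per_split_limit(max_seqs: int, splits: int) -> int:
    """Evaluate max-seqs/Splits + 4*sqrt(max-seqs/Splits)."""
    if max_seqs <= 0 or splits <= 0:
        raise ValueError("max_seqs and splits must be positive")
    per_split = max_seqs / splits
    # Assumption: truncate to an integer; the release notes do not
    # say how the result is rounded.
    return int(per_split + 4 * math.sqrt(per_split))
```

With --max-seqs 400 and 4 splits this gives 100 + 4*10 = 140 per split, so the combined total across splits exceeds 400 and leaves headroom when hits are unevenly distributed.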

Others

  • Removed --early-exit
  • Output the name of the called program

Experimental new modules

  • New fast alignment method alignbykmer

Developers

  • CMake flag -DHAVE_GPROF for profiling MMseqs2 using gprof
  • Fixed most warnings
  • SSTR does not use stringstreams anymore
  • Refactored time measuring
  • Debug::INFO/WARNING/ERROR is now used consistently across the codebase
  • If available, shellcheck (https://github.com/koalaman/shellcheck) will critique shell scripts and fail the compilation