
πŸš€ 0.4.0 - Fast-Fuzzy update

Released by @ChillFish8 · 26 Aug 12:33

0.4 is out! This brings with it a massive set of performance improvements and relevancy options to tune to your liking 😄

What's new

  • Fast-Fuzzy: A hyper-optimised mode for search-as-you-type experiences. It uses pre-computed spell correction for high-speed correction, improving performance by about 10x (both throughput and latency). This is an opt-in feature: start the server with --enable-fast-fuzzy, then set "use_fast_fuzzy": true in the index creation payload (see the first sketch after this list).
  • Stop words: Introduced to improve search relevancy. Previously the system could match 17,000 results out of 20,000 simply because the query included words like "the". Now, if the query contains more than one word and those words are not all stop words, any stop words are removed from the query (see the second sketch after this list). This can be toggled on a per-index basis with strip_stop_words, which defaults to false.
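
As a minimal sketch of the fast-fuzzy opt-in, assuming the server was started with --enable-fast-fuzzy; the endpoint path, port, and index name here are illustrative assumptions, so check the index creation docs for the exact schema:

```python
import requests

# Hypothetical index creation payload; only "use_fast_fuzzy" is the
# flag described above, the other fields are illustrative assumptions.
payload = {
    "name": "movies",         # hypothetical index name
    "use_fast_fuzzy": True,   # opt in to pre-computed spell correction
}

# Assumed endpoint and port, for illustration only.
resp = requests.post("http://localhost:8000/indexes", json=payload)
resp.raise_for_status()
```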
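
And a sketch of the stop-word rule in plain Python (not lnx's actual implementation), using a small illustrative stop-word set:

```python
# Illustrative subset of a stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "who"}

def strip_stop_words(query: str) -> str:
    """Strip stop words only when the query has more than one word
    and the words are not all stop words."""
    words = query.split()
    if len(words) <= 1:
        return query  # single-word queries are left untouched
    if all(w.lower() in STOP_WORDS for w in words):
        return query  # all stop words ("the who"): keep everything
    return " ".join(w for w in words if w.lower() not in STOP_WORDS)

assert strip_stop_words("the quick fox") == "quick fox"
assert strip_stop_words("the who") == "the who"  # all stop words, kept
```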

Breaking behaviour

  • Both fast-fuzzy and more-like-this queries behave quite differently now, so some common results you might rely on for testing will no longer be valid.
  • The system uses considerably more memory when fast-fuzzy is enabled.

Notes on relevancy

  • The fast-fuzzy system's relevancy is almost on par with the current default (Levenshtein distance), and in places a little better, especially for non-English languages.

Details for nerds πŸ€“

  • We use the SymSpell algorithm together with pre-computed frequency dictionaries for spell correction, rather than Levenshtein distance; this corrects entire sentences in the time the traditional method takes to correct one word (a sketch of the idea follows below).
  • The frequency dictionaries are built from traditional word dictionaries and the Google n-gram corpus; merging the two gives us frequency dictionaries containing only correctly spelt words (a merge sketch also follows below).
  • The jump in performance is roughly from 400 searches a second to 4,000 searches a second (measured on the small movies dataset; a larger dataset of around 2 million documents produced a similar improvement).
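
As a rough illustration of why SymSpell is fast, here is a compressed sketch of its delete-only candidate generation against a tiny, made-up frequency dictionary. This is not lnx's implementation: a real SymSpell lookup also verifies the true edit distance of each candidate, while this sketch ranks matches by corpus frequency alone.

```python
from collections import defaultdict
from itertools import combinations

MAX_EDIT = 1  # pre-computation depth; lnx's actual choice is not stated

def deletes(word: str, depth: int = MAX_EDIT) -> set[str]:
    """All strings reachable by deleting up to `depth` characters."""
    out = {word}
    for d in range(1, min(depth, len(word)) + 1):
        for idxs in combinations(range(len(word)), d):
            out.add("".join(c for i, c in enumerate(word) if i not in idxs))
    return out

# Frequency dictionary: word -> corpus frequency (illustrative numbers).
FREQUENCIES = {"hello": 91_300_000, "world": 80_200_000, "help": 53_700_000}

# Pre-computation: map every delete variant back to its source words.
# This up-front work is what makes the per-query lookups cheap.
index: dict[str, set[str]] = defaultdict(set)
for word in FREQUENCIES:
    for variant in deletes(word):
        index[variant].add(word)

def correct(term: str) -> str:
    """Match delete variants of `term` against the pre-computed index,
    then pick the most frequent candidate."""
    candidates = set()
    for variant in deletes(term):
        candidates |= index.get(variant, set())
    if not candidates:
        return term
    return max(candidates, key=FREQUENCIES.get)

print(correct("helo"))  # -> "hello"
```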
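
And a sketch of the dictionary merge, assuming a plain word-list file and a tab-separated "word count" n-gram file; the file names and formats are illustrative assumptions, not the actual build pipeline:

```python
def build_frequency_dict(dictionary_path: str, ngram_path: str) -> dict[str, int]:
    """Keep only n-gram entries whose words appear in a traditional
    (correctly spelt) dictionary, so typos present in the raw corpus
    never enter the frequency dictionary."""
    with open(dictionary_path) as f:
        known_words = {line.strip().lower() for line in f if line.strip()}

    frequencies: dict[str, int] = {}
    with open(ngram_path) as f:
        for line in f:  # assumed format: "word<TAB>count"
            word, count = line.rstrip("\n").split("\t")
            word = word.lower()
            if word in known_words:  # drop misspellings from the corpus
                frequencies[word] = frequencies.get(word, 0) + int(count)
    return frequencies
```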