0.4.0 - Fast-Fuzzy update
0.4 is out! This brings with it a massive set of performance improvements and relevancy options to tune to your liking.
What's new
- Fast-Fuzzy: A hyper-optimised mode for search-as-you-type experiences. It uses pre-computed spell correction to correct queries at high speed, improving performance by about 10x (both throughput and latency). This is an opt-in feature: enable it via `--enable-fast-fuzzy` and then set `"use_fast_fuzzy": true` in the index creation payload.
- Stop words: Introduced to improve search relevancy. Previously the system would match 17,000 results out of 20,000 simply because the query included words like "the". Now, if the system detects more than one word in the query, and the words are not all stop words, any stop words will be removed from the query. (This can be toggled on a per-index basis using `strip_stop_words`, which defaults to `false`.)
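The stop-word rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual implementation: the stop-word list shown is a small assumed subset, and the function name is hypothetical.

```python
# Illustrative subset of an English stop-word list (assumption, not the real list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in"}

def strip_stop_words(query: str) -> str:
    """Apply the rule described above: strip stop words only when the
    query has more than one word AND the words are not all stop words;
    otherwise return the query unchanged."""
    words = query.split()
    if len(words) <= 1:
        return query  # single-word queries are left alone
    if all(w.lower() in STOP_WORDS for w in words):
        return query  # an all-stop-word query would otherwise become empty
    return " ".join(w for w in words if w.lower() not in STOP_WORDS)
```

So a query like `the lord of the rings` would be reduced to `lord rings`, while `the` on its own is left untouched.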
Breaking behaviour
- Both fast-fuzzy and more-like-this queries now have very different performance and matching characteristics, so some common results you might rely on for testing may now be invalid.
- The system uses significantly more memory when fast-fuzzy is enabled.
Notes on relevancy
- The fast-fuzzy system is roughly on par with the current default (Levenshtein distance), and arguably a little better in places, especially for non-English languages.
Details for nerds
- We use the SymSpell algorithm, along with pre-computed frequency dictionaries, to do spell correction instead of raw Levenshtein distance; it corrects entire sentences in the time the traditional method takes to correct one word.
- The frequency dictionaries are built from traditional word dictionaries and the Google n-gram corpus; merging the two gives us frequency dictionaries containing only correctly spelt words.
- The jump in performance is roughly from 400 searches per second to 4,000 searches per second (measured on the small movies dataset; a larger dataset of around 2 million documents produced a similar improvement).
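To make the SymSpell idea concrete, here is a minimal sketch of its core trick: instead of computing Levenshtein distance against every dictionary word at query time, you pre-compute delete-variants of every dictionary word once, then correct a term by intersecting its own delete-variants with that index and picking the most frequent candidate. This is a simplified toy (edit distance 1, frequency-only ranking), not the implementation used here.

```python
from itertools import combinations

def deletes(word: str, max_dist: int = 1) -> set:
    """All strings reachable by deleting up to max_dist characters."""
    variants = {word}
    for d in range(1, max_dist + 1):
        for idxs in combinations(range(len(word)), d):
            variants.add("".join(c for i, c in enumerate(word) if i not in idxs))
    return variants

def build_index(freq_dict: dict, max_dist: int = 1) -> dict:
    """Pre-computation step: map every delete-variant back to the
    dictionary words it was derived from."""
    index = {}
    for word in freq_dict:
        for variant in deletes(word, max_dist):
            index.setdefault(variant, set()).add(word)
    return index

def correct(term: str, index: dict, freq_dict: dict, max_dist: int = 1) -> str:
    """Lookup step: gather candidates via delete-variant intersection,
    then rank purely by corpus frequency (a simplification)."""
    candidates = set()
    for variant in deletes(term, max_dist):
        candidates |= index.get(variant, set())
    if not candidates:
        return term  # nothing close enough; leave the term as typed
    return max(candidates, key=lambda w: freq_dict.get(w, 0))
```

The expensive work (generating delete-variants of the whole dictionary) happens once at index build time, which is why lookups stay fast even for full sentences: each word is a handful of hash-map probes rather than a distance computation against every dictionary entry.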