diff --git a/270_Fuzzy_matching.asciidoc b/270_Fuzzy_matching.asciidoc index 1fccafe7f..01f202e54 100644 --- a/270_Fuzzy_matching.asciidoc +++ b/270_Fuzzy_matching.asciidoc @@ -1,22 +1,14 @@ -[[fuzzy-matching]] -== Fuzzy matching (TODO) +include::270_Fuzzy_matching/10_Intro.asciidoc[] -TODO +include::270_Fuzzy_matching/20_Fuzziness.asciidoc[] -=== What is fuzzy matching? +include::270_Fuzzy_matching/30_Fuzzy_query.asciidoc[] -TODO +include::270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc[] -=== Fuzzy terms +include::270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc[] -fuzzy query +include::270_Fuzzy_matching/60_Phonetic_matching.asciidoc[] -=== Searching typos - -fuzziness in match and query_string - -=== Phonetic matching - -TODO diff --git a/270_Fuzzy_matching/010_Intro.asciidoc b/270_Fuzzy_matching/10_Intro.asciidoc similarity index 67% rename from 270_Fuzzy_matching/010_Intro.asciidoc rename to 270_Fuzzy_matching/10_Intro.asciidoc index f78adb771..041203961 100644 --- a/270_Fuzzy_matching/010_Intro.asciidoc +++ b/270_Fuzzy_matching/10_Intro.asciidoc @@ -1,5 +1,5 @@ [[fuzzy-matching]] -== Fuzzy matching +== Typoes and mispelings We expect a query on structured data like dates and prices to only return documents that match exactly. However, good full text search shouldn't have the @@ -19,17 +19,11 @@ included further down the list. If no documents match exactly, at least we can show the user potential matches -- they may even be what the user originally intended! -There are several lines of attack: - -* Language specific stemmer token filters reduce each word to its root form, - indexing ``foxes'' as `fox`, or ``jumping'', ``jumps'' and ``jumped'' as - `jump`. - -* Synonym token filters can add synonyms into the token stream, allowing a - query for ``quick'' to match ``fast'' or ``rapid'', or a query for ``UK`` - to match ``United Kingdom``. - -* Fuzzy queries -* Phonetic token filters can +We have already looked at diacritic-free matching in <>, +word stemming in <>, and synonyms in <>, but all of those +approaches presuppose that words are spelled correctly, or that there is only +one way to spell each word. +Fuzzy matching allows for query-time matching of misspelled words, while +phonetic token filters at index time can be used for _sounds-like_ matching. diff --git a/270_Fuzzy_matching/20_Fuzziness.asciidoc b/270_Fuzzy_matching/20_Fuzziness.asciidoc new file mode 100644 index 000000000..25d6d065b --- /dev/null +++ b/270_Fuzzy_matching/20_Fuzziness.asciidoc @@ -0,0 +1,53 @@ +[[fuzziness]] +=== Fuzziness + +Fuzzy matching treats two words which are ``fuzzily'' similar as if they were +the same word. First, we need to define what we mean by _fuzziness_. + +In 1965, Vladimir Levenshtein developed the +http://en.wikipedia.org/wiki/Levenshtein_distance[Levenshtein distance], which +measures the number of single character edits which are required to transform +one word into the other. 
+He proposed three types of single-character edits:
+
+* *substitution* of one character for another: +**F**ox+ -> +**B**ox+
+
+* *insertion* of a new character: +sic+ -> +sic**K**+
+
+* *deletion* of a character: +b**L**ack+ -> +back+
+
+http://en.wikipedia.org/wiki/Frederick_J._Damerau[Frederick Damerau]
+later expanded these operations to include:
+
+* *transposition* of two adjacent characters: +**ST**ar+ -> +**TS**ar+
+
+For example, converting the word `bieber` into `beaver` requires the
+following steps:
+
+* Substitute `v` for `b`: +bie**B**er+ -> +bie**V**er+
+* Substitute `a` for `i`: +b**I**ever+ -> +b**A**ever+
+* Transpose `a` and `e`: +b**AE**ver+ -> +b**EA**ver+
+
+These three steps represent a
+http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance[Damerau-Levenshtein edit distance]
+of 3.
+
+Clearly, `bieber` is a long way from `beaver` -- they are too far apart to be
+considered a simple misspelling. Damerau observed that 80% of human
+misspellings have an edit distance of 1. In other words, 80% of misspellings
+could be corrected with a *single edit* to the original string.
+
+Elasticsearch supports a maximum edit distance, specified with the `fuzziness`
+parameter, of 2.
+
+Of course, the impact that a single edit has on a string depends upon the
+length of the string. Two edits to the word `hat` can produce `mad`, so
+allowing two edits on a string of length three is overkill. The `fuzziness`
+parameter can be set to `AUTO`, which results in a maximum edit distance of:
+
+* `0` for strings of 1 or 2 characters.
+* `1` for strings of 3, 4, or 5 characters.
+* `2` for strings of more than 5 characters.
+
+Of course, you may find that an edit distance of `2` is still overkill, and
+returns results that don't appear to be related. You may get better results,
+and better performance, with a maximum `fuzziness` of `1`.
diff --git a/270_Fuzzy_matching/30_Fuzzy_query.asciidoc b/270_Fuzzy_matching/30_Fuzzy_query.asciidoc
new file mode 100644
index 000000000..71115126d
--- /dev/null
+++ b/270_Fuzzy_matching/30_Fuzzy_query.asciidoc
@@ -0,0 +1,88 @@
+[[fuzzy-query]]
+=== Fuzzy query
+
+The {ref}query-dsl-fuzzy-query.html[`fuzzy` query] is the fuzzy equivalent of
+the `term` query. You will seldom use it directly yourself, but understanding
+how it works will help you to use fuzziness in the higher-level `match` query.
+
+To understand how it works, we will first index some documents:
+
+[source,json]
+-----------------------------------
+POST /my_index/my_type/_bulk
+{ "index": { "_id": 1 }}
+{ "text": "Surprise me!"}
+{ "index": { "_id": 2 }}
+{ "text": "That was surprising."}
+{ "index": { "_id": 3 }}
+{ "text": "I wasn't surprised."}
+-----------------------------------
+
+Now we can run a `fuzzy` query for the term `surprize`:
+
+[source,json]
+-----------------------------------
+GET /my_index/my_type/_search
+{
+  "query": {
+    "fuzzy": {
+      "text": "surprize"
+    }
+  }
+}
+-----------------------------------
+
+The `fuzzy` query is a term-level query, so it doesn't do any analysis. It
+takes a single term and finds all terms in the term dictionary that are
+within the specified `fuzziness`. The default `fuzziness` is `AUTO`.
+
+In our example, `surprize` is within an edit distance of 2 from both
+`surprise` and `surprised`, so documents 1 and 3 match.
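+
+If you want to check these distances for yourself, a naive Damerau-Levenshtein
+implementation fits in a few lines of Python. This sketch is purely
+illustrative of the edit operations described in <<fuzziness>> -- it is not
+how Elasticsearch performs fuzzy matching internally (as described below, it
+builds a Levenshtein automaton instead):
+
+[source,python]
+-----------------------------------
+def damerau_levenshtein(a, b):
+    """Count single-character edits (substitution, insertion, deletion,
+    and transposition of adjacent characters) needed to turn a into b."""
+    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
+    for i in range(len(a) + 1):
+        d[i][0] = i                                  # i deletions
+    for j in range(len(b) + 1):
+        d[0][j] = j                                  # j insertions
+    for i in range(1, len(a) + 1):
+        for j in range(1, len(b) + 1):
+            cost = 0 if a[i - 1] == b[j - 1] else 1
+            d[i][j] = min(d[i - 1][j] + 1,           # deletion
+                          d[i][j - 1] + 1,           # insertion
+                          d[i - 1][j - 1] + cost)    # substitution
+            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
+                    and a[i - 2] == b[j - 1]):
+                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
+    return d[len(a)][len(b)]
+
+print(damerau_levenshtein("surprize", "surprise"))    # 1
+print(damerau_levenshtein("surprize", "surprised"))   # 2
+print(damerau_levenshtein("surprize", "surprising"))  # 4
+-----------------------------------
+
+Document 2 contains the term `surprising`, which is four edits away from
+`surprize` -- too far away to be considered a match.
+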
+We could reduce the matches to just `surprise` with the following query:
+
+[source,json]
+-----------------------------------
+GET /my_index/my_type/_search
+{
+  "query": {
+    "fuzzy": {
+      "text": {
+        "value": "surprize",
+        "fuzziness": 1
+      }
+    }
+  }
+}
+-----------------------------------
+
+==== Improving performance
+
+The `fuzzy` query works by taking the original term and building a
+_Levenshtein Automaton_ -- like a big graph representing all of the strings
+that are within the specified edit distance of the original string.
+
+It then uses the automaton to step efficiently through all of the terms in
+the term dictionary to see whether they match. Once it has collected all of
+the matching terms that exist in the term dictionary, it can compute the list
+of matching documents.
+
+Of course, depending on the type of data stored in the index, a fuzzy query
+with an edit distance of two can match a very large number of terms and
+perform very badly. There are two parameters that can be used to limit the
+performance impact:
+
+`prefix_length`::
+
+The number of initial characters which will not be ``fuzzified''. Most
+spelling errors occur towards the end of the word, not towards the beginning.
+By using a `prefix_length` of `3`, for example, you can significantly reduce
+the number of matching terms.
+
+`max_expansions`::
+
+If a fuzzy query expands to three or four fuzzy options, the new options may
+be meaningful. If it produces a thousand options, they are essentially
+meaningless. Use `max_expansions` to limit the total number of options that
+will be produced. The fuzzy query will just collect matching terms until it
+runs out of terms or reaches the `max_expansions` limit.
+
diff --git a/270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc b/270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc
new file mode 100644
index 000000000..12f1e9084
--- /dev/null
+++ b/270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc
@@ -0,0 +1,47 @@
+[[fuzzy-match-query]]
+=== Fuzzy `match` query
+
+The `match` query supports fuzzy matching out of the box:
+
+[source,json]
+-----------------------------------
+GET /my_index/my_type/_search
+{
+  "query": {
+    "match": {
+      "text": {
+        "query": "SURPRIZE ME!",
+        "fuzziness": "AUTO",
+        "operator": "and"
+      }
+    }
+  }
+}
+-----------------------------------
+
+The query string is first analyzed to produce the terms `[surprize, me]`, and
+then each term is fuzzified using the specified `fuzziness`.
+
+Similarly, the `multi_match` query also supports `fuzziness`, but only when
+executing with type `best_fields` or `most_fields`:
+
+[source,json]
+-----------------------------------
+GET /my_index/my_type/_search
+{
+  "query": {
+    "multi_match": {
+      "fields": [ "text", "title" ],
+      "query": "SURPRIZE ME!",
+      "fuzziness": "AUTO"
+    }
+  }
+}
+-----------------------------------
+
+Both the `match` and `multi_match` queries also support the `prefix_length`
+and `max_expansions` parameters.
+
+TIP: Fuzziness works only with the basic `match` and `multi_match` queries. It
+doesn't work with phrase matching, common terms, or `cross_fields` matches.
+
diff --git a/270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc b/270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc
new file mode 100644
index 000000000..3da3bb2e1
--- /dev/null
+++ b/270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc
@@ -0,0 +1,33 @@
+[[fuzzy-scoring]]
+=== Scoring fuzziness
+
+Users love fuzzy queries -- they assume that fuzzy matching will somehow
+magically find the right combination of proper spellings.
+Unfortunately, the truth is somewhat more prosaic.
+
+Imagine that we have 1,000 documents containing ``Schwarzenegger'', and just
+one document with the misspelling ``Schwarzeneger''. According to the theory
+of <>, the misspelling is
+much more relevant than the correct spelling, because it appears in far fewer
+documents!
+
+In other words, if we were to treat fuzzy matches like any other match, we
+would favour misspellings over correct spellings, which would make for grumpy
+users.
+
+TIP: Fuzzy matching should not be used for scoring purposes -- only to widen
+the net of matching terms in case there are misspellings.
+
+By default, the `match` query gives all fuzzy matches the constant score of 1.
+This is sufficient to add potential matches onto the end of the result list,
+without interfering with the relevance scoring of non-fuzzy queries.
+
+.Use suggesters, rather than fuzzy queries
+*************************************
+
+Fuzzy queries alone are much less useful than they initially appear. They are
+better used as part of a ``bigger'' feature, such as the _search-as-you-type_
+{ref}search-suggesters-completion.html[`completion` suggester] or the
+_did-you-mean_ {ref}search-suggesters-phrase.html[`phrase` suggester].
+
+*************************************
diff --git a/270_Fuzzy_matching/60_Phonetic_matching.asciidoc b/270_Fuzzy_matching/60_Phonetic_matching.asciidoc
new file mode 100644
index 000000000..e43789909
--- /dev/null
+++ b/270_Fuzzy_matching/60_Phonetic_matching.asciidoc
@@ -0,0 +1,134 @@
+[[phonetic-matching]]
+=== Phonetic matching
+
+In a last, desperate attempt to match something, anything, we could resort to
+searching for words that sound similar, even if their spelling differs.
+
+A number of algorithms exist for converting words into some phonetic
+representation. The http://en.wikipedia.org/wiki/Soundex[Soundex] algorithm is
+the granddaddy of them all, and most other phonetic algorithms are
+improvements or specializations of Soundex, such as
+http://en.wikipedia.org/wiki/Metaphone[Metaphone] and
+http://en.wikipedia.org/wiki/Metaphone#Double_Metaphone[Double Metaphone]
+(which expands phonetic matching to languages other than English),
+http://en.wikipedia.org/wiki/Caverphone[Caverphone] for matching names in New
+Zealand, the
+http://en.wikipedia.org/wiki/Beider%E2%80%93Morse_Phonetic_Name_Matching_Algorithm[Beider-Morse] algorithm, which adapts the Soundex algorithm
+for better matching of German and Yiddish names, and the
+http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik[Kölner Phonetik] for better
+handling of German words.
+
+The thing to take away from this list is that phonetic algorithms are fairly
+crude, and very specific to the languages they were designed for, usually
+either English or German. This limits their usefulness. Still, for certain
+purposes, and in combination with other techniques, phonetic matching can be a
+useful tool.
+
+First, you will need to install the Phonetic Analysis plugin from
+https://github.com/elasticsearch/elasticsearch-analysis-phonetic on every node
+in the cluster, and restart each node.
+
+Once restarted, you can create a custom analyzer which uses one of the
+phonetic token filters and try it out:
+
+[source,json]
+-----------------------------------
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "dbl_metaphone": { <1>
+          "type": "phonetic",
+          "encoder": "double_metaphone"
+        }
+      },
+      "analyzer": {
+        "dbl_metaphone": {
+          "tokenizer": "standard",
+          "filter": "dbl_metaphone" <2>
+        }
+      }
+    }
+  }
+}
+-----------------------------------
+<1> First, configure a custom `phonetic` token filter which uses the
+    `double_metaphone` encoder.
+<2> Then use the custom token filter in a custom analyzer.
+
+Now we can test it out with the `analyze` API:
+
+[source,json]
+-----------------------------------
+GET /my_index/_analyze?analyzer=dbl_metaphone
+Smith Smythe
+-----------------------------------
+
+Both `Smith` and `Smythe` produce two tokens in the same position: `SM0`
+and `XMT`. Running `John`, `Jon`, and `Johnnie` through the analyzer produces
+the two tokens `JN` and `AN` in each case, while `Jonathon` results in the
+tokens `JN0N` and `ANTN`.
+
+The phonetic analyzer can be used just like any other analyzer. First map a
+field to use it, then index some data:
+
+[source,json]
+-----------------------------------
+PUT /my_index/_mapping/my_type
+{
+  "properties": {
+    "name": {
+      "type": "string",
+      "fields": {
+        "phonetic": { <1>
+          "type": "string",
+          "analyzer": "dbl_metaphone"
+        }
+      }
+    }
+  }
+}
+
+PUT /my_index/my_type/1
+{
+  "name": "John Smith"
+}
+
+PUT /my_index/my_type/2
+{
+  "name": "Jonnie Smythe"
+}
+-----------------------------------
+<1> The `name.phonetic` field uses the custom `dbl_metaphone` analyzer.
+
+The `match` query can be used for searching:
+
+[source,json]
+-----------------------------------
+GET /my_index/my_type/_search
+{
+  "query": {
+    "match": {
+      "name.phonetic": {
+        "query": "Jahnnie Smeeth",
+        "operator": "and"
+      }
+    }
+  }
+}
+-----------------------------------
+
+This query returns both documents, demonstrating just how coarse phonetic
+matching is. Scoring with a phonetic algorithm is pretty much worthless. The
+purpose of phonetic matching is not to increase precision, but to increase
+recall -- to spread the net wide enough to catch any documents which might
+possibly match.
+
+It usually makes more sense to use phonetic algorithms when retrieving results
+which will be consumed and post-processed by another computer, rather than by
+human users.
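+
+To act on that advice, one option -- shown here purely as a sketch, not a
+recipe -- is to let the standard `name` field drive the relevance score while
+`name.phonetic` merely widens the net, and then hand the candidates to
+downstream code. The example uses the Python client (the `elasticsearch`
+package) with the index, type, and field names from the examples above, and
+assumes a node running locally on the default port:
+
+[source,python]
+-----------------------------------
+from elasticsearch import Elasticsearch
+
+es = Elasticsearch()  # assumes http://localhost:9200
+
+response = es.search(
+    index="my_index",
+    doc_type="my_type",
+    body={
+        "query": {
+            "bool": {
+                "should": [
+                    # Normal relevance scoring from the standard field...
+                    {"match": {"name": "Jahnnie Smeeth"}},
+                    # ...while the phonetic field just adds recall.
+                    {"match": {
+                        "name.phonetic": {
+                            "query": "Jahnnie Smeeth",
+                            "operator": "and"
+                        }
+                    }}
+                ]
+            }
+        }
+    }
+)
+
+# Hand the candidates to downstream code for post-processing instead of
+# showing them directly to the user.
+for hit in response["hits"]["hits"]:
+    print(hit["_score"], hit["_source"]["name"])
+-----------------------------------
+
+If the user had spelled the name correctly, the first clause would reward the
+exact match; with the misspelled query, the phonetic clause still pulls both
+documents into the candidate list for post-processing.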