Added fuzzy matching chapter
clintongormley committed Aug 24, 2014
1 parent 224c403 commit 4a71320
Showing 7 changed files with 368 additions and 27 deletions.
20 changes: 6 additions & 14 deletions 270_Fuzzy_matching.asciidoc
@@ -1,22 +1,14 @@
[[fuzzy-matching]]
-== Fuzzy matching (TODO)
+include::270_Fuzzy_matching/10_Intro.asciidoc[]

-TODO
+include::270_Fuzzy_matching/20_Fuzziness.asciidoc[]

-=== What is fuzzy matching?
+include::270_Fuzzy_matching/30_Fuzzy_query.asciidoc[]

-TODO
+include::270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc[]

-=== Fuzzy terms
+include::270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc[]

-fuzzy query
+include::270_Fuzzy_matching/60_Phonetic_matching.asciidoc[]

-=== Searching typos
-
-fuzziness in match and query_string
-
-=== Phonetic matching
-
-TODO


@@ -1,5 +1,5 @@
[[fuzzy-matching]]
-== Fuzzy matching
+== Typoes and mispelings

We expect a query on structured data like dates and prices to only return
documents that match exactly. However, good full text search shouldn't have the
@@ -19,17 +19,11 @@ included further down the list. If no documents match exactly, at least we
can show the user potential matches -- they may even be what the user
originally intended!

-There are several lines of attack:
-
-* Language specific stemmer token filters reduce each word to its root form,
-indexing ``foxes'' as `fox`, or ``jumping'', ``jumps'' and ``jumped'' as
-`jump`.
-
-* Synonym token filters can add synonyms into the token stream, allowing a
-query for ``quick'' to match ``fast'' or ``rapid'', or a query for ``UK``
-to match ``United Kingdom``.
-
-* Fuzzy queries
-* Phonetic token filters can
+We have already looked at diacritic-free matching in <<token-normalization>>,
+word stemming in <<stemming>>, and synonyms in <<synonyms>>, but all of those
+approaches presuppose that words are spelled correctly, or that there is only
+one way to spell each word.
+
+Fuzzy matching allows for query-time matching of misspelled words, while
+phonetic token filters at index time can be used for _sounds-like_ matching.

53 changes: 53 additions & 0 deletions 270_Fuzzy_matching/20_Fuzziness.asciidoc
@@ -0,0 +1,53 @@
[[fuzziness]]
=== Fuzziness

Fuzzy matching treats two words which are ``fuzzily'' similar as if they were
the same word. First, we need to define what we mean by _fuzziness_.

In 1965, Vladimir Levenshtein developed the
http://en.wikipedia.org/wiki/Levenshtein_distance[Levenshtein distance], which
measures the number of single-character edits required to transform one word
into the other. He proposed three types of one-character edits:

* *substitution* of one character for another: +**F**ox+ -> +**B**ox+

* *insertion* of a new character: +sic+ -> +sic**K**+

* *deletion* of a character: +b**L**ack+ -> +back+

http://en.wikipedia.org/wiki/Frederick_J._Damerau[Frederick Damerau]
later expanded these operations to include:

* *transposition* of two adjacent characters: +**ST**ar+ -> +**TS**ar+

For example, to convert the word `bieber` into `beaver` would require the
following steps:

* Substitute `v` for `b`: +bie**B**er+ -> +bie**V**er+
* Substitute `a` for `i`: +b**I**ever+ -> +b**A**ever+
* Transpose `a` and `e`: +b**AE**ver+ -> +b**EA**ver+

These 3 steps represent a
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance[Damerau-Levenshtein edit distance]
of 3.

Clearly, `bieber` is a long way from `beaver` -- they are too far apart to be
considered a simple misspelling. Damerau observed that 80% of human
misspellings have an edit distance of 1. In other words, 80% of misspellings
could be corrected with a *single edit* to the original string.

Elasticsearch supports a maximum edit distance, specified with the `fuzziness`
parameter, of 2.

Of course, the impact that a single edit has on a string depends upon the
length of the string. Two edits to the word `hat` can produce `mad`, so
allowing two edits on a string of length three is overkill. The `fuzziness`
parameter can be set to `AUTO`, which results in a maximum edit distance of:

* `0` for strings of 1 or 2 characters.
* `1` for strings of 3, 4, or 5 characters.
* `2` for strings of more than 5 characters.

Even so, you may find that an edit distance of `2` is still overkill, and
returns results that don't appear to be related. You may get better results,
and better performance, with a maximum `fuzziness` of `1`.
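
For instance, here is a minimal sketch of capping fuzziness at a single edit
in a `match` query (the index, type, and field names are purely illustrative):

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "surprize",
        "fuzziness": 1
      }
    }
  }
}
-----------------------------------

With `fuzziness` set to `1`, `surprize` would still match `surprise`, which is
a single substitution away, but not `surprised`, which requires two edits.
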
88 changes: 88 additions & 0 deletions 270_Fuzzy_matching/30_Fuzzy_query.asciidoc
@@ -0,0 +1,88 @@
[[fuzzy-query]]
=== Fuzzy query

The {ref}query-dsl-fuzzy-query.html[`fuzzy` query] is the fuzzy equivalent of
the `term` query. You will seldom use it directly yourself, but understanding
how it works will help you to use fuzziness in the higher level `match` query.

To understand how it works, we will first index some documents:

[source,json]
-----------------------------------
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Surprise me!"}
{ "index": { "_id": 2 }}
{ "text": "That was surprising."}
{ "index": { "_id": 3 }}
{ "text": "I wasn't surprised."}
-----------------------------------

Now we can run a `fuzzy` query for the term `surprize`:

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
"query": {
"fuzzy": {
"text": "surprize"
}
}
}
-----------------------------------

The `fuzzy` query is a term-level query, so it doesn't do any analysis. It
takes a single term and finds all terms in the term dictionary that are
within the specified `fuzziness`. The default `fuzziness` is `AUTO`.

In our example, `surprize` is within an edit distance of 2 from both
`surprise` and `surprised`, so documents 1 and 3 match. We could reduce the
matches to just `surprise` with the following query:

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
"query": {
"fuzzy": {
"text": {
"value": "surprize",
"fuzziness": 1
}
}
}
}
-----------------------------------

==== Improving performance

The `fuzzy` query works by taking the original term and building a
_Levenshtein Automaton_ -- like a big graph representing all of the strings
that are within the specified edit distance of the original string.

It then uses the automaton to step efficiently through all of the terms
in the term dictionary to see if they match. Once it has collected all of the
matching terms that exist in the term dictionary, it can compute the list of
matching documents.

Of course, depending on the type of data stored in the index, a fuzzy query
with an edit distance of two can match a very large number of terms and
perform very badly. Two parameters can be used to limit the performance
impact, as shown in the sketch after this list:

`prefix_length`::

The number of initial characters that will not be ``fuzzified''. Most
spelling errors occur towards the end of the word, not towards the beginning.
By using a `prefix_length` of `3`, for example, you can significantly reduce
the number of matching terms.

`max_expansions`::

If a fuzzy query expands to three or four fuzzy options, the new options may
be meaningful. If it produces a thousand options, they are essentially
meaningless. Use `max_expansions` to limit the total number of options that
will be produced. The fuzzy query will just collect matching terms until it
runs out of terms or reaches the `max_expansions` limit.
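
As a rough sketch, both parameters could be combined with the `surprize` query
from above (the values here are illustrative, not recommendations):

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "fuzzy": {
      "text": {
        "value": "surprize",
        "fuzziness": 2,
        "prefix_length": 3,
        "max_expansions": 50
      }
    }
  }
}
-----------------------------------

With these settings, only terms beginning with `sur` are considered for fuzzy
expansion, and at most 50 candidate terms will ever be collected.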

47 changes: 47 additions & 0 deletions 270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc
@@ -0,0 +1,47 @@
[[fuzzy-match-query]]
=== Fuzzy `match` query

The `match` query supports fuzzy matching out of the box:

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
"query": {
"match": {
"text": {
"query": "SURPRIZE ME!",
"fuzziness": "AUTO",
"operator": "and"
}
}
}
}
-----------------------------------

The query string is first analyzed to produce the terms `[surprize, me]`, and
then each term is fuzzified using the specified `fuzziness`.

Similarly, the `multi_match` query also supports `fuzziness`, but only when
executing with type `best_fields` or `most_fields`:

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
"query": {
"multi_match": {
"fields": [ "text", "title" ],
"query": "SURPRIZE ME!",
"fuzziness": "AUTO"
}
}
}
-----------------------------------

Both the `match` and `multi_match` queries also support the `prefix_length`
and `max_expansions` parameters.
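
For example, the earlier `match` query could be limited like this (the exact
values are illustrative only):

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "SURPRIZE ME!",
        "fuzziness": "AUTO",
        "operator": "and",
        "prefix_length": 2,
        "max_expansions": 50
      }
    }
  }
}
-----------------------------------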

TIP: Fuzziness works only with the basic `match` and `multi_match` queries. It
doesn't work with phrase matching, common terms or `cross_fields` matches.

33 changes: 33 additions & 0 deletions 270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc
@@ -0,0 +1,33 @@
[[fuzzy-scoring]]
=== Scoring fuzziness

Users love fuzzy queries: they assume that these queries will somehow
magically find the right combination of proper spellings. Unfortunately, the
truth is somewhat more prosaic.

Imagine that we have 1,000 documents containing ``Schwarzenegger'', and just
one document with the misspelling ``Schwarzeneger''. According to the theory
of <<tfidf,Term frequency/Inverse document frequency>>, the misspelling is
much more relevant than the correct spelling, because it appears in far fewer
documents!

In other words, if we were to treat fuzzy matches like any other match, we
would favour misspellings over correct spellings, which would make for grumpy
users.

TIP: Fuzzy matching should not be used for scoring purposes -- only to widen
the net of matching terms in case there are misspellings.

By default, the `match` query gives all fuzzy matches the constant score of 1.
This is sufficient to add potential matches onto the end of the result list,
without interfering with the relevance scoring of non-fuzzy queries.
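
One possible way to apply this advice, sketched here as an illustration rather
than a recipe, is to widen the net with a fuzzy `must` clause while letting a
non-fuzzy `should` clause supply the relevance score for exact matches:

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "text": {
            "query": "surprize",
            "fuzziness": "AUTO"
          }
        }
      },
      "should": {
        "match": {
          "text": "surprize"
        }
      }
    }
  }
}
-----------------------------------

Documents that contain the term exactly as typed pick up the extra `should`
score, so they rank ahead of documents that match only through fuzzy
expansion.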

.Use suggesters, rather than fuzzy queries
*************************************
Fuzzy queries alone are much less useful than they initially appear. They are
better used as part of a ``bigger'' feature, such as the _search-as-you-type_
{ref}search-suggesters-completion.html[`completion` suggester] or the
_did-you-mean_ {ref}search-suggesters-phrase.html[`phrase` suggester].
*************************************