forked from elasticsearch-cn/elasticsearch-definitive-guide
1 parent
224c403
commit 4a71320
Showing
7 changed files
with
368 additions
and
27 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,22 +1,14 @@ | ||
[[fuzzy-matching]] | ||
== Fuzzy matching (TODO) | ||
include::270_Fuzzy_matching/10_Intro.asciidoc[] | ||
|
||
TODO | ||
include::270_Fuzzy_matching/20_Fuzziness.asciidoc[] | ||
|
||
=== What is fuzzy matching? | ||
include::270_Fuzzy_matching/30_Fuzzy_query.asciidoc[] | ||
|
||
TODO | ||
include::270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc[] | ||
|
||
=== Fuzzy terms | ||
include::270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc[] | ||
|
||
fuzzy query | ||
include::270_Fuzzy_matching/60_Phonetic_matching.asciidoc[] | ||
|
||
=== Searching typos | ||
|
||
fuzziness in match and query_string | ||
|
||
=== Phonetic matching | ||
|
||
TODO | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
[[fuzziness]] | ||
=== Fuzziness | ||
|
||
Fuzzy matching treats two words which are ``fuzzily'' similar as if they were | ||
the same word. First, we need to define what we mean by _fuzziness_. | ||
|
||
In 1965, Vladimir Levenshtein developed the | ||
http://en.wikipedia.org/wiki/Levenshtein_distance[Levenshtein distance], which | ||
measures the number of single character edits which are required to transform | ||
one word into the other. He proposed three types of one character edits: | ||
|
||
* *substitution* of one character for another: +**F**ox+ -> +**B**ox+ | ||
|
||
* *insertion* of a new character: +sic+ -> +sic**K**+ | ||
|
||
* *deletion* of a character: +b**L**ack+ -> +back+ | ||
|
||
http://en.wikipedia.org/wiki/Frederick_J._Damerau[Frederick Damerau] | ||
later expanded these operations to include: | ||
|
||
* *transposition* of two adjacent characters: +**ST**ar+ -> +**TS**ar+ | ||
|
||
For example, to convert the word `bieber` into `beaver` would require the | ||
following steps: | ||
|
||
* Substitute `v` for `b`: +bie**B**er+ -> +bie**V**er+ | ||
* Substitute `a` for `i`: +b**I**ever+ -> +b**A**ever+ | ||
* Transpose `a` and `e`: +b**AE**ver+ -> +b**EA**ver+ | ||
|
||
These 3 steps represent a | ||
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance[Damerau-Levenshtein edit distance] | ||
of 3. | ||
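
The edit counts above can be checked with a short script. This is a minimal sketch and not how Elasticsearch computes distances internally: it assumes the optimal string alignment variant of the Damerau-Levenshtein distance, computed with a plain dynamic-programming table. The function name `edit_distance` is made up for this illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: substitution, insertion,
    deletion, and transposition of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for source, target in [("fox", "box"), ("sic", "sick"), ("black", "back"),
                       ("star", "tsar"), ("bieber", "beaver")]:
    print(source, "->", target, edit_distance(source, target))
```

The first four pairs each need a single edit, while `bieber` and `beaver` are three edits apart, matching the steps listed above.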
|
||
Clearly, `bieber` is a long way from `beaver` -- they are too far apart to be | ||
considered a simple misspelling. Damerau observed that 80% of human | ||
misspellings have an edit distance of 1. In other words, 80% of misspellings | ||
could be corrected with a *single edit* to the original string. | ||
|
||
Elasticsearch supports a maximum edit distance of 2, specified with the | ||
`fuzziness` parameter. | ||
|
||
Of course, the impact that a single edit has on a string depends upon the | ||
length of the string. Two edits to the word `hat` can produce `mad`, so | ||
allowing two edits on a string of length three is overkill. The `fuzziness` | ||
parameter can be set to `AUTO`, which results in a maximum edit distance of: | ||
|
||
* `0` for strings of 1 or 2 characters. | ||
* `1` for strings of 3, 4, or 5 characters. | ||
* `2` for strings of more than 5 characters. | ||
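
The `AUTO` thresholds listed above amount to a simple length check. The sketch below only mirrors that list for illustration; the function name `auto_fuzziness` is hypothetical and is not part of any Elasticsearch client.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edit distance that AUTO allows for a term of this length,
    following the thresholds listed in the text."""
    length = len(term)
    if length <= 2:
        return 0  # 1 or 2 characters: exact match only
    if length <= 5:
        return 1  # 3, 4, or 5 characters: one edit
    return 2      # more than 5 characters: two edits


for term in ["me", "hat", "black", "bieber"]:
    print(term, auto_fuzziness(term))
```

With these thresholds, `hat` is allowed only one edit, so it can no longer reach `mad`.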
|
||
Of course, you may find that an edit distance of `2` is still overkill, and | ||
returns results which don't appear to be related. You may get better results, | ||
and better performance, with a maximum `fuzziness` of `1`. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
[[fuzzy-query]] | ||
=== Fuzzy query | ||
|
||
The {ref}query-dsl-fuzzy-query.html[`fuzzy` query] is the fuzzy equivalent of | ||
the `term` query. You will seldom use it directly yourself, but understanding | ||
how it works will help you to use fuzziness in the higher level `match` query. | ||
|
||
To understand how it works, we will first index some documents: | ||
|
||
[source,json] | ||
----------------------------------- | ||
POST /my_index/my_type/_bulk | ||
{ "index": { "_id": 1 }} | ||
{ "text": "Surprise me!"} | ||
{ "index": { "_id": 2 }} | ||
{ "text": "That was surprising."} | ||
{ "index": { "_id": 3 }} | ||
{ "text": "I wasn't surprised."} | ||
----------------------------------- | ||
|
||
Now we can run a `fuzzy` query for the term `surprize`: | ||
|
||
[source,json] | ||
----------------------------------- | ||
GET /my_index/my_type/_search | ||
{ | ||
"query": { | ||
"fuzzy": { | ||
"text": "surprize" | ||
} | ||
} | ||
} | ||
----------------------------------- | ||
|
||
The `fuzzy` query is a term-level query so it doesn't do any analysis. It | ||
takes a single term and finds all terms in the term dictionary which are | ||
within the specified `fuzziness`. The default `fuzziness` is `AUTO`. | ||
|
||
In our example, `surprize` is within an edit distance of 2 from both | ||
`surprise` and `surprised`, so documents 1 and 3 match. We could reduce the | ||
matches to just `surprise` with the following query: | ||
|
||
[source,json] | ||
----------------------------------- | ||
GET /my_index/my_type/_search | ||
{ | ||
"query": { | ||
"fuzzy": { | ||
"text": { | ||
"value": "surprize", | ||
"fuzziness": 1 | ||
} | ||
} | ||
} | ||
} | ||
----------------------------------- | ||
|
||
==== Improving performance | ||
|
||
The `fuzzy` query works by taking the original term and building a | ||
_Levenshtein Automaton_ -- like a big graph representing all of the strings | ||
that are within the specified edit distance of the original string. | ||
|
||
It then uses the automaton to step efficiently through all of the terms | ||
in the term dictionary to see if they match. Once it has collected all of the | ||
matching terms that exist in the term dictionary, it can compute the list of | ||
matching documents. | ||
|
||
Of course, depending on the type of data stored in the index, a fuzzy query | ||
with an edit distance of two can match a very large number of terms and | ||
perform very badly. There are two parameters which can be used to limit the | ||
performance impact: | ||
|
||
`prefix_length`:: | ||
|
||
The number of initial characters which will not be ``fuzzified''. Most | ||
spelling errors occur towards the end of the word, not towards the beginning. | ||
By using a `prefix_length` of `3`, for example, you can significantly reduce | ||
the number of matching terms. | ||
|
||
`max_expansions`:: | ||
|
||
If a fuzzy query expands to 3 or 4 fuzzy options, the new options may be | ||
meaningful. If it produces a thousand options then they are essentially | ||
meaningless. Use `max_expansions` to limit the total number of options that | ||
will be produced. The fuzzy query will just collect matching terms until it | ||
runs out of terms or reaches the `max_expansions` limit. | ||
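
To see how `fuzziness`, `prefix_length` and `max_expansions` interact, the term expansion can be imitated on a toy term dictionary. This is a rough sketch under stated assumptions: it scans a sorted list and uses a plain Levenshtein distance instead of a Levenshtein automaton, it ignores transpositions, and the names `levenshtein` and `expand_terms`, the default values, and the misspelled term `sirprise` are all made up for the illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (no transpositions), two rows at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def expand_terms(term, dictionary, fuzziness=2, prefix_length=0,
                 max_expansions=50):
    """Collect dictionary terms within `fuzziness` edits of `term`."""
    prefix = term[:prefix_length]  # these characters are not fuzzified
    matches = []
    for candidate in sorted(dictionary):
        if not candidate.startswith(prefix):
            continue
        if levenshtein(term, candidate) <= fuzziness:
            matches.append(candidate)
            if len(matches) >= max_expansions:
                break  # stop once the expansion limit is reached
    return matches


# Hypothetical term dictionary: the three indexed terms plus one typo.
terms = ["surprise", "surprising", "surprised", "sirprise"]
print(expand_terms("surprize", terms))
print(expand_terms("surprize", terms, fuzziness=1))
print(expand_terms("surprize", terms, prefix_length=3))
print(expand_terms("surprize", terms, max_expansions=1))
```

Requiring the first three characters to match exactly drops `sirprise` from the expansion, and a `max_expansions` of `1` stops collecting after the first matching term, whether or not it is the most useful one.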
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
[[fuzzy-match-query]] | ||
=== Fuzzy `match` query | ||
|
||
The `match` query supports fuzzy matching out of the box: | ||
|
||
[source,json] | ||
----------------------------------- | ||
GET /my_index/my_type/_search | ||
{ | ||
"query": { | ||
"match": { | ||
"text": { | ||
"query": "SURPRIZE ME!", | ||
"fuzziness": "AUTO", | ||
"operator": "and" | ||
} | ||
} | ||
} | ||
} | ||
----------------------------------- | ||
|
||
The query string is first analyzed, to produce the terms `[surprize, me]`, | ||
then each term is fuzzified using the specified `fuzziness`. | ||
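
The two steps, analysis followed by per-term fuzzification, can be sketched as follows. The `analyze` function is only a crude stand-in for the real analyzer (lowercase, then split on non-word characters), and every name in the block is hypothetical.

```python
import re


def analyze(text: str) -> list:
    """Crude stand-in for an analyzer: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def auto_fuzziness(term: str) -> int:
    """AUTO thresholds: 0 edits up to 2 characters, 1 up to 5, otherwise 2."""
    if len(term) <= 2:
        return 0
    return 1 if len(term) <= 5 else 2


def fuzzify(query: str) -> list:
    """Pair each analyzed term with the edit distance AUTO would allow."""
    return [(term, auto_fuzziness(term)) for term in analyze(query)]


print(fuzzify("SURPRIZE ME!"))
```

Under these assumptions `surprize` is allowed two edits while `me` must match exactly, because `AUTO` is applied to each term separately after analysis.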
|
||
Similarly, the `multi_match` query also supports `fuzziness`, but only when | ||
executing with type `best_fields` or `most_fields`: | ||
|
||
[source,json] | ||
----------------------------------- | ||
GET /my_index/my_type/_search | ||
{ | ||
"query": { | ||
"multi_match": { | ||
"fields": [ "text", "title" ], | ||
"query": "SURPRIZE ME!", | ||
"fuzziness": "AUTO" | ||
} | ||
} | ||
} | ||
----------------------------------- | ||
|
||
Both the `match` and `multi_match` queries also support the `prefix_length` | ||
and `max_expansions` parameters. | ||
|
||
TIP: Fuzziness works only with the basic `match` and `multi_match` queries. It | ||
doesn't work with phrase matching, common terms or `cross_fields` matches. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
[[fuzzy-scoring]] | ||
=== Scoring fuzziness | ||
|
||
Users love fuzzy queries. They assume that these queries will somehow magically find | ||
the right combination of proper spellings. Unfortunately, the truth is | ||
somewhat more prosaic. | ||
|
||
Imagine that we have 1,000 documents containing ``Schwarzenegger'', and just | ||
one document with the misspelling ``Schwarzeneger''. According to the theory | ||
of <<tfidf,Term frequency/Inverse document frequency>>, the misspelling is | ||
much more relevant than the correct spelling, because it appears in far fewer | ||
documents! | ||
|
||
In other words, if we were to treat fuzzy matches like any other match, we | ||
would favour misspellings over correct spellings, which would make for grumpy | ||
users. | ||
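
The imbalance is easy to quantify. The sketch below assumes the inverse document frequency formula of Lucene's classic TF/IDF similarity, `1 + ln(numDocs / (docFreq + 1))`, and an index that holds only the 1,001 documents from the example; both the index size and the function name `idf` are assumptions made for the illustration.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency, assuming the classic Lucene formula."""
    return 1 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 1001  # assumed: 1,000 correct spellings plus 1 misspelling

correct = idf(doc_freq=1000, num_docs=NUM_DOCS)
misspelled = idf(doc_freq=1, num_docs=NUM_DOCS)

print(f"idf(Schwarzenegger) = {correct:.2f}")
print(f"idf(Schwarzeneger)  = {misspelled:.2f}")
```

Under these assumptions the rare misspelling receives an IDF roughly seven times that of the correct spelling, which is exactly the bias described above.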
|
||
TIP: Fuzzy matching should not be used for scoring purposes -- only to widen | ||
the net of matching terms in case there are misspellings. | ||
|
||
By default, the `match` query gives all fuzzy matches the constant score of 1. | ||
This is sufficient to add potential matches on to the end of the result list, | ||
without interfering with the relevance scoring of non-fuzzy queries. | ||
|
||
.Use suggesters, rather than fuzzy queries | ||
************************************* | ||
Fuzzy queries alone are much less useful than they initially appear. They are | ||
better used as part of a ``bigger'' feature, such as the _search-as-you-type_ | ||
{ref}search-suggesters-completion.html[`completion` suggester] or the | ||
_did-you-mean_ {ref}search-suggesters-phrase.html[`phrase` suggester]. | ||
************************************* |