Added fuzzy matching chapter
clintongormley committed Aug 24, 2014
1 parent 224c403 commit 4a71320
Showing 7 changed files with 368 additions and 27 deletions.
20 changes: 6 additions & 14 deletions 270_Fuzzy_matching.asciidoc
@@ -1,22 +1,14 @@
[[fuzzy-matching]]
-== Fuzzy matching (TODO)
+include::270_Fuzzy_matching/10_Intro.asciidoc[]

-TODO
+include::270_Fuzzy_matching/20_Fuzziness.asciidoc[]

-=== What is fuzzy matching?
+include::270_Fuzzy_matching/30_Fuzzy_query.asciidoc[]

-TODO
+include::270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc[]

-=== Fuzzy terms
+include::270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc[]

-fuzzy query
+include::270_Fuzzy_matching/60_Phonetic_matching.asciidoc[]

-=== Searching typos
-
-fuzziness in match and query_string
-
-=== Phonetic matching
-
-TODO


@@ -1,5 +1,5 @@
[[fuzzy-matching]]
-== Fuzzy matching
+== Typoes and mispelings

We expect a query on structured data like dates and prices to only return
documents that match exactly. However, good full text search shouldn't have the
@@ -19,17 +19,11 @@ included further down the list. If no documents match exactly, at least we
can show the user potential matches -- they may even be what the user
originally intended!

-There are several lines of attack:
-
-* Language specific stemmer token filters reduce each word to its root form,
-indexing ``foxes'' as `fox`, or ``jumping'', ``jumps'' and ``jumped'' as
-`jump`.
-
-* Synonym token filters can add synonyms into the token stream, allowing a
-query for ``quick'' to match ``fast'' or ``rapid'', or a query for ``UK``
-to match ``United Kingdom``.
-
-* Fuzzy queries
-* Phonetic token filters can
+We have already looked at diacritic-free matching in <<token-normalization>>,
+word stemming in <<stemming>>, and synonyms in <<synonyms>>, but all of those
+approaches presuppose that words are spelled correctly, or that there is only
+one way to spell each word.
+
+Fuzzy matching allows for query-time matching of misspelled words, while
+phonetic token filters at index time can be used for _sounds-like_ matching.

53 changes: 53 additions & 0 deletions 270_Fuzzy_matching/20_Fuzziness.asciidoc
@@ -0,0 +1,53 @@
[[fuzziness]]
=== Fuzziness

Fuzzy matching treats two words which are ``fuzzily'' similar as if they were
the same word. First, we need to define what we mean by _fuzziness_.

In 1965, Vladimir Levenshtein developed the
http://en.wikipedia.org/wiki/Levenshtein_distance[Levenshtein distance], which
measures the number of single-character edits required to transform one word
into the other. He proposed three types of one-character edits:

* *substitution* of one character for another: +**F**ox+ -> +**B**ox+

* *insertion* of a new character: +sic+ -> +sic**K**+

* *deletion* of a character: +b**L**ack+ -> +back+

http://en.wikipedia.org/wiki/Frederick_J._Damerau[Frederick Damerau]
later expanded these operations to include:

* *transposition* of two adjacent characters: +**ST**ar+ -> +**TS**ar+

For example, to convert the word `bieber` into `beaver` would require the
following steps:

* Substitute `v` for `b`: +bie**B**er+ -> +bie**V**er+
* Substitute `a` for `i`: +b**I**ever+ -> +b**A**ever+
* Transpose `a` and `e`: +b**AE**ver+ -> +b**EA**ver+

These 3 steps represent a
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance[Damerau-Levenshtein edit distance]
of 3.

Clearly, `bieber` is a long way from `beaver` -- they are too far apart to be
considered a simple misspelling. Damerau observed that 80% of human
misspellings have an edit distance of 1. In other words, 80% of misspellings
could be corrected with a *single edit* to the original string.

Elasticsearch supports a maximum edit distance, specified with the `fuzziness`
parameter, of 2.

Of course, the impact that a single edit has on a string depends upon the
length of the string. Two edits to the word `hat` can produce `mad`, so
allowing two edits on a string of length three is overkill. The `fuzziness`
parameter can be set to `AUTO`, which results in a maximum edit distance of:

* `0` for strings of 1 or 2 characters.
* `1` for strings of 3, 4, or 5 characters.
* `2` for strings of more than 5 characters.

Even so, you may find that an edit distance of `2` is still overkill, and
returns results that don't appear to be related. You may get better results,
and better performance, with a maximum `fuzziness` of `1`.
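
For instance, here is a minimal sketch of capping fuzziness at a single edit
in a `match` query (the index, type, and field names are purely illustrative):

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "surprize",
        "fuzziness": 1
      }
    }
  }
}
-----------------------------------

With `fuzziness` set to `1`, `surprize` would still match `surprise`, which is
a single substitution away, but not `surprised`, which requires two edits.
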
88 changes: 88 additions & 0 deletions 270_Fuzzy_matching/30_Fuzzy_query.asciidoc
@@ -0,0 +1,88 @@
[[fuzzy-query]]
=== Fuzzy query

The {ref}query-dsl-fuzzy-query.html[`fuzzy` query] is the fuzzy equivalent of
the `term` query. You will seldom use it directly yourself, but understanding
how it works will help you to use fuzziness in the higher level `match` query.

To understand how it works, we will first index some documents:

[source,json]
-----------------------------------
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Surprise me!"}
{ "index": { "_id": 2 }}
{ "text": "That was surprising."}
{ "index": { "_id": 3 }}
{ "text": "I wasn't surprised."}
-----------------------------------

Now we can run a `fuzzy` query for the term `surprize`:

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
"query": {
"fuzzy": {
"text": "surprize"
}
}
}
-----------------------------------

The `fuzzy` query is a term-level query, so it doesn't do any analysis. It
takes a single term and finds all terms in the term dictionary that are
within the specified `fuzziness`. The default `fuzziness` is `AUTO`.

In our example, `surprize` is within an edit distance of 2 from both
`surprise` and `surprised`, so documents 1 and 3 match. We could reduce the
matches to just `surprise` with the following query:

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
"query": {
"fuzzy": {
"text": {
"value": "surprize",
"fuzziness": 1
}
}
}
}
-----------------------------------

==== Improving performance

The `fuzzy` query works by taking the original term and building a
_Levenshtein Automaton_ -- like a big graph representing all of the strings
that are within the specified edit distance of the original string.

It then uses the automaton to step efficiently through all of the terms
in the term dictionary to see if they match. Once it has collected all of the
matching terms that exist in the term dictionary, it can compute the list of
matching documents.

Of course, depending on the type of data stored in the index, a fuzzy query
with an edit distance of two can match a very large number of terms and
perform very badly. Two parameters can be used to limit the performance
impact, as shown in the sketch after this list:

`prefix_length`::

The number of initial characters that will not be ``fuzzified''. Most
spelling errors occur towards the end of the word, not towards the beginning.
By using a `prefix_length` of `3`, for example, you can significantly reduce
the number of matching terms.

`max_expansions`::

If a fuzzy query expands to three or four fuzzy options, the new options may
be meaningful. If it produces a thousand options, they are essentially
meaningless. Use `max_expansions` to limit the total number of options that
will be produced. The fuzzy query will just collect matching terms until it
runs out of terms or reaches the `max_expansions` limit.
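
As a rough sketch, both parameters could be combined with the `surprize` query
from above (the values here are illustrative, not recommendations):

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "fuzzy": {
      "text": {
        "value": "surprize",
        "fuzziness": 2,
        "prefix_length": 3,
        "max_expansions": 50
      }
    }
  }
}
-----------------------------------

With these settings, only terms beginning with `sur` are considered for fuzzy
expansion, and at most 50 candidate terms will ever be collected.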

47 changes: 47 additions & 0 deletions 270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc
@@ -0,0 +1,47 @@
[[fuzzy-match-query]]
=== Fuzzy `match` query

The `match` query supports fuzzy matching out of the box:

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
"query": {
"match": {
"text": {
"query": "SURPRIZE ME!",
"fuzziness": "AUTO",
"operator": "and"
}
}
}
}
-----------------------------------

The query string is first analyzed to produce the terms `[surprize, me]`, and
then each term is fuzzified using the specified `fuzziness`.

Similarly, the `multi_match` query also supports `fuzziness`, but only when
executing with type `best_fields` or `most_fields`:

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
"query": {
"multi_match": {
"fields": [ "text", "title" ],
"query": "SURPRIZE ME!",
"fuzziness": "AUTO"
}
}
}
-----------------------------------

Both the `match` and `multi_match` queries also support the `prefix_length`
and `max_expansions` parameters.
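
For example, the earlier `match` query could be limited like this (the exact
values are illustrative only):

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "SURPRIZE ME!",
        "fuzziness": "AUTO",
        "operator": "and",
        "prefix_length": 2,
        "max_expansions": 50
      }
    }
  }
}
-----------------------------------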

TIP: Fuzziness works only with the basic `match` and `multi_match` queries. It
doesn't work with phrase matching, common terms or `cross_fields` matches.

33 changes: 33 additions & 0 deletions 270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc
@@ -0,0 +1,33 @@
[[fuzzy-scoring]]
=== Scoring fuzziness

Users love fuzzy queries: they assume that these queries will somehow
magically find the right combination of proper spellings. Unfortunately, the
truth is somewhat more prosaic.

Imagine that we have 1,000 documents containing ``Schwarzenegger'', and just
one document with the misspelling ``Schwarzeneger''. According to the theory
of <<tfidf,Term frequency/Inverse document frequency>>, the misspelling is
much more relevant than the correct spelling, because it appears in far fewer
documents!

In other words, if we were to treat fuzzy matches like any other match, we
would favour misspellings over correct spellings, which would make for grumpy
users.

TIP: Fuzzy matching should not be used for scoring purposes -- only to widen
the net of matching terms in case there are misspellings.

By default, the `match` query gives all fuzzy matches the constant score of 1.
This is sufficient to add potential matches onto the end of the result list,
without interfering with the relevance scoring of non-fuzzy queries.
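
One possible way to apply this advice, sketched here as an illustration rather
than a recipe, is to widen the net with a fuzzy `must` clause while letting a
non-fuzzy `should` clause supply the relevance score for exact matches:

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "text": {
            "query": "surprize",
            "fuzziness": "AUTO"
          }
        }
      },
      "should": {
        "match": {
          "text": "surprize"
        }
      }
    }
  }
}
-----------------------------------

Documents that contain the term exactly as typed pick up the extra `should`
score, so they rank ahead of documents that match only through fuzzy
expansion.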

.Use suggesters, rather than fuzzy queries
*************************************
Fuzzy queries alone are much less useful than they initially appear. They are
better used as part of a ``bigger'' feature, such as the _search-as-you-type_
{ref}search-suggesters-completion.html[`completion` suggester] or the
_did-you-mean_ {ref}search-suggesters-phrase.html[`phrase` suggester].
*************************************