Added the synonyms chapter
clintongormley committed Jul 12, 2014
1 parent f45808f commit 034349a
Showing 10 changed files with 642 additions and 15 deletions.
1 change: 1 addition & 0 deletions 050_Search/20_Query_string.asciidoc
@@ -90,6 +90,7 @@ your search results if you query specific fields instead of the `_all`
field. When the `_all` field is no longer useful to you, you can
disable it, as explained in <<all-field>>.

[[query-string-query]]
==== More complicated queries

The next query searches for tweets:
23 changes: 8 additions & 15 deletions 260_Synonyms.asciidoc
@@ -1,24 +1,17 @@
[[synonyms]]
== Synonyms
:ref: http://foo.com/

TODO

=== Defining synonyms
include::260_Synonyms/10_Intro.asciidoc[]

TODO
include::260_Synonyms/20_Using_synonyms.asciidoc[]

=== Expand or Contract
include::260_Synonyms/30_Synonym_formats.asciidoc[]

Compare two approaches
include::260_Synonyms/40_Expand_contract.asciidoc[]

=== Beware the graph
include::260_Synonyms/50_Analysis_chain.asciidoc[]

TODO
include::260_Synonyms/60_Multi_word_synonyms.asciidoc[]

=== Synonyms and `query_string`
include::260_Synonyms/70_Symbol_synonyms.asciidoc[]

TODO

=== Symbol synonyms

mapping char filter
42 changes: 42 additions & 0 deletions 260_Synonyms/10_Intro.asciidoc
@@ -0,0 +1,42 @@
[[synonyms]]
== Synonyms

While stemming helps to broaden the scope of search by simplifying inflected
words to their root form, synonyms broaden the scope by relating concepts and
ideas. Perhaps no documents match a query for ``English queen'', but documents
which contain ``British monarch'' would probably be considered a good match.

A user might search for ``the US'' and expect to find documents which contain
``United States'', ``USA'', ``U.S.A.'', ``America'', or ``the States''.
However, they wouldn't expect to see results about ``the states of matter'' or
``state machines''.

This example provides a valuable lesson. It demonstrates how simple it is for
a human to distinguish between separate concepts, and how tricky it can be for
mere machines. The natural tendency is to try to provide synonyms for every
word in the language, to ensure that any document is findable with even the
most remotely related terms.

This is a mistake. In the same way that we prefer light or minimal stemming
to aggressive stemming, synonyms should be used only where necessary. Users
understand why their results are limited to the words in their search query.
They are less understanding when their results seem almost random.

Synonyms can be used to conflate words that have pretty much the same meaning,
such as `jump`, `leap`, and `hop`, or `pamphlet`, `leaflet` and `brochure`.
Alternatively, they can be used to make a word more generic. For instance,
`bird` could be used as a more general synonym for `owl` or `pigeon`, `adult`
could be used for `man` or `woman`.

Synonyms appear to be a simple concept but they are quite tricky to get right.
In this chapter we will explain the mechanics of using synonyms and discuss
the limitations and gotchas.

IMPORTANT: Synonyms are used to broaden the scope of what is considered a
matching document. Just like with <<stemming,stemming>> or
<<partial-matching,partial matching>>, synonym fields should not be used alone but
should be combined with a query on a ``main'' field which contains the
original text in unadulterated form. See <<most-fields>> for an explanation
of how to maintain relevance when using synonyms.


3 changes: 3 additions & 0 deletions 260_Synonyms/20_Types_of_synonyms.asciidoc
@@ -0,0 +1,3 @@
[[synonym-types]]
=== Synonym types

87 changes: 87 additions & 0 deletions 260_Synonyms/20_Using_synonyms.asciidoc
@@ -0,0 +1,87 @@
[[using-synonyms]]
=== Using synonyms

Synonyms can replace existing tokens or be added to the token stream using the
{ref}analysis-synonym-tokenfilter.html[`synonym` token filter]:

[source,json]
-------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", <1>
          "synonyms": [ <2>
            "british,english",
            "queen,monarch"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter" <3>
          ]
        }
      }
    }
  }
}
-------------------------------------
<1> First, we define a token filter of type `synonym`.
<2> We will discuss synonym formats in <<synonym-formats>>.
<3> Then we create a custom analyzer which uses the `my_synonym_filter`.

**************************************
Synonyms can be specified inline with the `synonyms` parameter, or in a
synonyms file which must be present on every node in the cluster. The path to
the synonyms file should be specified with the `synonyms_path` parameter, and
should be either absolute or relative to the Elasticsearch `config` directory.
See <<updating-stopwords>> for techniques that can be used to refresh the
synonyms list.
**************************************
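
As a minimal sketch of the file-based alternative, the filter below reads its
rules from a file rather than from an inline list. The file name
`analysis/synonyms.txt` is an assumption made purely for illustration:

[source,json]
-------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt" <1>
        }
      }
    }
  }
}
-------------------------------------
<1> A hypothetical path, relative to the `config` directory. The file would
    need to exist on every node.

The file itself would hold one rule per line, in the same format as the
inline rules:

[source,text]
-------------------------------------
british,english
queen,monarch
-------------------------------------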

Testing out our analyzer with the `analyze` API shows the following:

[source,json]
-------------------------------------
GET /my_index/_analyze?analyzer=my_synonyms
Elizabeth is the English queen
-------------------------------------

[source,text]
------------------------------------
Pos 1: (elizabeth)
Pos 2: (is)
Pos 3: (the)
Pos 4: (british,english) <1>
Pos 5: (queen,monarch) <1>
------------------------------------
<1> All synonyms occupy the same position as the original term.

A document like this will match queries for any of ``English queen'',
``British queen'', ``English monarch'', or ``British monarch''. Even a phrase
query will work, because the position of each term has been preserved.
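
The phrase-query behaviour could be checked with a sketch like the one below.
It assumes the `my_index` index and `my_synonyms` analyzer defined earlier in
this section, a hypothetical type `my_type` with a hypothetical `text` field,
and the `string` field type and `index_analyzer` / `search_analyzer` mapping
parameters of Elasticsearch 1.x:

[source,json]
-------------------------------------
PUT /my_index/_mapping/my_type
{
  "properties": {
    "text": {
      "type": "string",
      "index_analyzer": "my_synonyms", <1>
      "search_analyzer": "standard" <2>
    }
  }
}

PUT /my_index/my_type/1
{ "text": "Elizabeth is the English queen" }

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "text": "British monarch" <3>
    }
  }
}
-------------------------------------
<1> Synonyms are added to the token stream at index time only.
<2> The query string is analyzed without synonyms.
<3> Neither word appears in the original text, but both should have been
    indexed at the positions of `english` and `queen`, so the phrase would be
    expected to match.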

[IMPORTANT]
.Index time vs search time
======================================
Using the same `synonym` token filter at both index time and search time is
redundant. If, at index time, we replace `English` with the two terms
`english` and `british`, then at search time we only need to search for one of
those terms. Alternatively, if we don't use synonyms at index time then, at
search time, we would need to convert a query for `English` into a query for
`english OR british`.

Whether to do synonym expansion at search or index time can be a difficult
choice. We will explore the options further in <<synonyms-expand-or-contract>>.
======================================
47 changes: 47 additions & 0 deletions 260_Synonyms/30_Synonym_formats.asciidoc
@@ -0,0 +1,47 @@
[[synonym-formats]]
=== Formatting synonyms

In their simplest form, synonyms are listed as comma-separated values, such
as:

    "jump,leap,hop"

If any of these terms is encountered it is replaced by all of the listed
synonyms. For instance:

[source,text]
--------------------------
Original terms:   Replaced by:
────────────────────────────────
jump            → (jump,leap,hop)
leap            → (jump,leap,hop)
hop             → (jump,leap,hop)
--------------------------

Alternatively, with the `=>` syntax, it is possible to specify a list of terms
to match (on the left-hand side), and a list of one or more replacements (on
the right-hand side):

    "u s a,united states,united states of america => usa"
    "g b,gb,great britain => britain,england,scotland,wales"

[source,text]
--------------------------
Original terms:   Replaced by:
────────────────────────────────
u s a           → (usa)
united states   → (usa)
great britain   → (britain,england,scotland,wales)
--------------------------

If multiple rules for the same synonyms are specified, they are merged
together. The order of rules is not respected. Instead, the longest matching
rule wins. Take the following rules as an example:

    "united states => usa",
    "united states of america => usa"

If these rules were simply applied in order, the first rule would turn
``United States of America'' into the terms `(usa),(of),(america)`. Instead,
the longest matching sequence wins and we end up with just the term `(usa)`.
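
To observe this behaviour, both rules could be placed in one filter and the
phrase passed through the `analyze` API. This is a sketch only: the filter and
analyzer names are hypothetical.

[source,json]
-------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "usa_synonym_filter": { <1>
          "type": "synonym",
          "synonyms": [
            "united states => usa",
            "united states of america => usa"
          ]
        }
      },
      "analyzer": {
        "usa_synonyms": { <1>
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "usa_synonym_filter"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=usa_synonyms
United States of America
-------------------------------------
<1> Hypothetical names chosen for this example.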

129 changes: 129 additions & 0 deletions 260_Synonyms/40_Expand_contract.asciidoc
@@ -0,0 +1,129 @@
[[synonyms-expand-or-contract]]
=== Expand or contract

In <<synonym-formats>> we have seen that it is possible to replace synonyms by
_simple expansion_, _simple contraction_ or _genre expansion_. We will look
at the tradeoffs of each of these techniques below.

IMPORTANT: This section deals with single-word synonyms only. Multi-word
synonyms add another layer of complexity and are discussed later in
<<multi-word-synonyms>>.

[[synonyms-expansion]]
==== Simple expansion

With simple expansions, any of the listed synonyms is expanded into *all* of
the listed synonyms:

    "jump,hop,leap"

It can be applied either at index time or at query time. Each has advantages
(⬆)︎ and disadvantages (⬇)︎. When to use which comes down to performance vs
flexibility:

[options="header",cols="h,d,d"]
|===================================================
| | Index time | Query time

| Index size |
⬇︎ Bigger index because all synonyms must be indexed. |
⬆︎ Normal.

| Relevance |
⬇︎ All synonyms will have the same IDF (see <<relevance-intro>>) meaning
that more commonly used words will have the same weight as less commonly
used words. |
⬆︎ The IDF for each synonym will be correct.

| Performance |
⬆︎ A query only needs to find the single term specified in the query string. |
⬇︎ A query for a single term is rewritten to look up all synonyms, which
decreases performance.

| Flexibility |
⬇︎ The synonym rules can't be changed for existing documents. For the new rules
to have effect, existing documents have to be reindexed. |
⬆︎ Synonym rules can be updated without reindexing documents.
|===================================================
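
As an illustration of the query-time column, the following sketch indexes the
field without synonyms and expands them only when searching. The filter,
analyzer, type, and field names are hypothetical, and the mapping assumes the
`index_analyzer` and `search_analyzer` parameters of Elasticsearch 1.x:

[source,json]
-------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "jump_synonym_filter": {
          "type": "synonym",
          "synonyms": [ "jump,hop,leap" ]
        }
      },
      "analyzer": {
        "jump_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "jump_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "index_analyzer": "standard", <1>
          "search_analyzer": "jump_synonyms" <2>
        }
      }
    }
  }
}
-------------------------------------
<1> Only the original terms are indexed, so the index stays at its normal size.
<2> A query for `jump` is expanded to `jump`, `hop` and `leap` at search time.

Swapping the two analyzers would give the index-time variant instead.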

[[synonyms-contraction]]
==== Simple contraction

Simple contraction maps a group of synonyms on the left-hand side to a single
value on the right-hand side:

    "leap,hop => jump"

It must be applied both at index time and at query time, to ensure that query
terms are mapped to the same single value that exists in the index.

This approach has some advantages and some disadvantages compared to the simple expansion approach:

Index size::

⬆︎ The index size is normal as only a single term is indexed.

Relevance::

⬇︎ The IDF for all terms is the same, so you can't distinguish between more
commonly used words and less commonly used words.

Performance::

⬆︎ A query only needs to find the single term that appears in the index.

Flexibility::
+
--

⬆︎ New synonyms can be added to the left-hand side of the rule and applied at
query time. For instance, imagine that we wanted to add the word `bound` to
the rule specified above. The following rule would work for queries which
contain `bound` or for newly added documents which contain `bound`:

    "leap,hop,bound => jump"

But we could expand the effect to also take into account *existing* documents
which contain `bound` by writing the rule as follows:

    "leap,hop,bound => jump,bound"

When you reindex your documents, you could revert to the previous rule to gain
the performance benefit of querying only a single term.

--
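
A simple contraction setup might look like the sketch below. Because the same
analyzer has to run at index time and at query time, it is assigned with the
single `analyzer` mapping parameter. All names are hypothetical, and the
`string` field type assumes Elasticsearch 1.x:

[source,json]
-------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "jump_contraction_filter": {
          "type": "synonym",
          "synonyms": [ "leap,hop => jump" ] <1>
        }
      },
      "analyzer": {
        "jump_contraction": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "jump_contraction_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "jump_contraction" <2>
        }
      }
    }
  }
}
-------------------------------------
<1> `leap` and `hop` are both replaced by the single term `jump`.
<2> Used for both indexing and searching, so that query terms are mapped to
    the same value that exists in the index.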

[[synonyms-genres]]
==== Genre expansion

Genre expansion is quite different from simple contraction or expansion.
Instead of treating all synonyms as equal, genre expansion widens the meaning
of a term to be more generic. Take these rules for example:

    "cat => cat,pet",
    "kitten => kitten,cat,pet",
    "dog => dog,pet",
    "puppy => puppy,dog,pet"

By applying genre expansion at index time:

* a query for `kitten` would find just documents about kittens.
* a query for `cat` would find documents about kittens and cats.
* a query for `pet` would find documents about kittens, cats, puppies, dogs
or pets.

Alternatively, by applying genre expansion at query time, a query for `kitten`
would be expanded to return documents which mention kittens, cats or pets
specifically.

You could also have the best of both worlds by applying expansion at index
time to ensure that the genres are present in the index. Then, at query time,
you can choose to not apply synonyms (so that a query for `kitten` only
returns documents about kittens) or to apply synonyms in order to match
kittens, cats and pets (including the canine variety).

With the example rules above, the IDF for `kitten` will be correct, while the
IDF for `cat` and `pet` will be artificially deflated. However, this actually
works in your favour -- a genre-expanded query for `kitten OR cat OR pet` will
rank documents with `kitten` highest, followed by documents with `cat`, and
documents with `pet` would be right at the bottom.
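
The ``best of both worlds'' setup could be sketched as follows. The filter,
analyzer, type, and field names are hypothetical, and the mapping assumes the
`index_analyzer` and `search_analyzer` parameters of Elasticsearch 1.x:

[source,json]
-------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "genre_filter": {
          "type": "synonym",
          "synonyms": [
            "cat => cat,pet",
            "kitten => kitten,cat,pet",
            "dog => dog,pet",
            "puppy => puppy,dog,pet"
          ]
        }
      },
      "analyzer": {
        "genre_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "genre_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "index_analyzer": "genre_synonyms", <1>
          "search_analyzer": "standard" <2>
        }
      }
    }
  }
}

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "kitten",
        "analyzer": "genre_synonyms" <3>
      }
    }
  }
}
-------------------------------------
<1> Genres are always added at index time.
<2> By default, queries are not expanded, so `kitten` matches only kittens.
<3> Overriding the analyzer for a single query expands `kitten` to
    `kitten`, `cat` and `pet`. Omit this parameter to search without synonyms.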