forked from elasticsearch-cn/elasticsearch-definitive-guide
Commit
1 parent
f45808f
commit 034349a
Showing
10 changed files
with
642 additions
and
15 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,24 +1,17 @@ | ||
[[synonyms]] | ||
== Synonyms | ||
:ref: http://foo.com/ | ||
|
||
TODO | ||
|
||
=== Defining synonyms | ||
include::260_Synonyms/10_Intro.asciidoc[] | ||
|
||
TODO | ||
include::260_Synonyms/20_Using_synonyms.asciidoc[] | ||
|
||
=== Expand or Contract | ||
include::260_Synonyms/30_Synonym_formats.asciidoc[] | ||
|
||
Compare two approaches | ||
include::260_Synonyms/40_Expand_contract.asciidoc[] | ||
|
||
=== Beware the graph | ||
include::260_Synonyms/50_Analysis_chain.asciidoc[] | ||
|
||
TODO | ||
include::260_Synonyms/60_Multi_word_synonyms.asciidoc[] | ||
|
||
=== Synonyms and `query_string` | ||
include::260_Synonyms/70_Symbol_synonyms.asciidoc[] | ||
|
||
TODO | ||
|
||
=== Symbol synonyms | ||
|
||
mapping char filter |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
[[synonyms]] | ||
== Synonyms | ||
|
||
While stemming helps to broaden the scope of search by simplifying inflected | ||
words to their root form, synonyms broaden the scope by relating concepts and | ||
ideas. Perhaps no documents match a query for ``English queen'', but documents | ||
which contain ``British monarch'' would probably be considered a good match. | ||
|
||
A user might search for ``the US'' and expect to find documents which contain | ||
``United States'', ``USA'', ``U.S.A.'', ``America'', or ``the States''. | ||
However, they wouldn't expect to see results about ``the states of matter'' or | ||
``state machines''. | ||
|
||
This example provides a valuable lesson. It demonstrates how simple it is for | ||
a human to distinguish between separate concepts, and how tricky it can be for | ||
mere machines. The natural tendency is to try to provide synonyms for every | ||
word in the language, to ensure that any document is findable with even the | ||
most remotely related terms. | ||
|
||
This is a mistake. In the same way that we prefer light or minimal stemming | ||
to aggressive stemming, synonyms should be used only where necessary. Users | ||
understand why their results are limited to the words in their search query. | ||
They are less understanding when their results seem almost random. | ||
|
||
Synonyms can be used to conflate words that have pretty much the same meaning, | ||
such as `jump`, `leap`, and `hop`, or `pamphlet`, `leaflet` and `brochure`. | ||
Alternatively, they can be used to make a word more generic. For instance, | ||
`bird` could be used as a more general synonym for `owl` or `pigeon`, `adult` | ||
could be used for `man` or `woman`. | ||
|
||
Synonyms appear to be a simple concept but they are quite tricky to get right. | ||
In this chapter we will explain the mechanics of using synonyms and discuss | ||
the limitations and gotchas. | ||
|
||
IMPORTANT: Synonyms are used to broaden the scope of what is considered a | ||
matching document. Just like with <<stemming,stemming>> or | ||
<<partial-matching,partial matching>>, synonym fields should not be used alone but | ||
should be combined with a query on a ``main'' field which contains the | ||
original text in unadulterated form. See <<most-fields>> for an explanation | ||
of how to maintain relevance when using synonyms. | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
[[synonym-types]] | ||
=== Synonym types | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
[[using-synonyms]] | ||
=== Using synonyms | ||
|
||
Synonyms can replace existing tokens or be added to the token stream using the | ||
{ref}analysis-synonym-tokenfilter.html[`synonym` token filter]: | ||
|
||
[source,json] | ||
------------------------------------- | ||
PUT /my_index | ||
{ | ||
"settings": { | ||
"analysis": { | ||
"filter": { | ||
"my_synonym_filter": { | ||
"type": "synonym", <1> | ||
"synonyms": [ <2> | ||
"british,english", | ||
"queen,monarch" | ||
] | ||
} | ||
}, | ||
"analyzer": { | ||
"my_synonyms": { | ||
"tokenizer": "standard", | ||
"filter": [ | ||
"lowercase", | ||
"my_synonym_filter" <3> | ||
] | ||
} | ||
} | ||
} | ||
} | ||
} | ||
------------------------------------- | ||
<1> First, we define a token filter of type `synonym`. | ||
<2> We will discuss synonym formats in <<synonym-formats>>. | ||
<3> Then we create a custom analyzer which uses the `my_synonym_filter`. | ||
|
||
************************************** | ||
Synonyms can be specified inline with the `synonyms` parameter, or in a | ||
synonyms file which must be present on every node in the cluster. The path to | ||
the synonyms file should be specified with the `synonyms_path` parameter, and | ||
should be either absolute or relative to the Elasticsearch `config` directory. | ||
See <<updating-stopwords>> for techniques that can be used to refresh the | ||
synonyms list. | ||
************************************** | ||
|
||
Testing out our analyzer with the `analyze` API shows the following: | ||
|
||
[source,json] | ||
------------------------------------- | ||
GET /my_index/_analyze?analyzer=my_synonyms | ||
Elizabeth is the English queen | ||
------------------------------------- | ||
|
||
[source,text] | ||
------------------------------------ | ||
Pos 1: (elizabeth) | ||
Pos 2: (is) | ||
Pos 3: (the) | ||
Pos 4: (british,english) <1> | ||
Pos 5: (queen,monarch) <1> | ||
------------------------------------ | ||
<1> All synonyms occupy the same position as the original term. | ||
|
||
A document like this will match queries for any of ``English queen'', | ||
``British queen'', ``English monarch'', or ``British monarch''. Even a phrase | ||
query will work, because the position of each term has been preserved. | ||
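The effect of stacking synonyms at the same position can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements the `synonym` token filter; the function names `build_lookup`, `analyze`, and `phrase_matches` are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
# Illustrative sketch only; not Elasticsearch's implementation.
SYNONYM_GROUPS = [["british", "english"], ["queen", "monarch"]]


def build_lookup(groups):
    """Map each term to the full list of terms in its synonym group."""
    lookup = {}
    for group in groups:
        for term in group:
            lookup[term] = list(group)
    return lookup


def analyze(text, lookup):
    """Lowercase, split on whitespace, and expand each token in place.

    Returns (position, terms) pairs, with positions starting at 1.
    """
    tokens = text.lower().split()
    return [(pos, lookup.get(tok, [tok])) for pos, tok in enumerate(tokens, start=1)]


def phrase_matches(phrase, analyzed):
    """Return True if the phrase's words occur at consecutive positions."""
    words = phrase.lower().split()
    terms_at = [set(terms) for _, terms in analyzed]
    for start in range(len(terms_at) - len(words) + 1):
        if all(words[i] in terms_at[start + i] for i in range(len(words))):
            return True
    return False


doc = analyze("Elizabeth is the English queen", build_lookup(SYNONYM_GROUPS))
for position, terms in doc:
    print(f"Pos {position}: ({','.join(terms)})")
```

Because both `british` and `english` sit at position 4, `phrase_matches("British monarch", doc)` succeeds in this sketch just as it does for the original wording.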
|
||
[IMPORTANT] | ||
.Index time vs search time | ||
====================================== | ||
Using the same `synonym` token filter at both index time and search time is | ||
redundant. If, at index time, we replace `English` with the two terms | ||
`english` and `british`, then at search time we only need to search for one of | ||
those terms. Alternatively, if we don't use synonyms at index time then, at | ||
search time, we would need to convert a query for `English` into a query for | ||
`english OR british`. | ||
Whether to do synonym expansion at search or index time can be a difficult | ||
choice. We will explore the options more in <<synonyms-expand-or-contract>>. | ||
====================================== |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
[[synonym-formats]] | ||
=== Formatting synonyms | ||
|
||
In their simplest form, synonyms are listed as comma-separated values, such | ||
as: | ||
|
||
"jump,leap,hop" | ||
|
||
If any of these terms is encountered it is replaced by all of the listed | ||
synonyms. For instance: | ||
|
||
[source,text] | ||
-------------------------- | ||
Original terms: Replaced by: | ||
──────────────────────────────── | ||
jump → (jump,leap,hop) | ||
leap → (jump,leap,hop) | ||
hop → (jump,leap,hop) | ||
-------------------------- | ||
|
||
Alternatively, with the `=>` syntax, it is possible to specify a list of terms | ||
to match (on the left-hand side), and a list of one or more replacements (on | ||
the right-hand side): | ||
|
||
"u s a,united states,united states of america => usa" | ||
"g b,gb,great britain => britain,england,scotland,wales" | ||
|
||
[source,text] | ||
-------------------------- | ||
Original terms: Replaced by: | ||
──────────────────────────────── | ||
u s a → (usa) | ||
united states → (usa) | ||
great britain → (britain,england,scotland,wales) | ||
-------------------------- | ||
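To make the two rule formats concrete, here is a minimal Python sketch that turns a rule string into a mapping from each input phrase to its replacements. The `parse_rule` helper is hypothetical and only mirrors the behavior described above; the real filter also handles details, such as escaping and analysis of the rule text, that are ignored here.

```python
# Illustrative sketch only; parse_rule is a hypothetical helper.
def parse_rule(rule):
    """Parse one synonym rule into {input phrase: [replacement terms]}."""

    def split_terms(part):
        return [term.strip() for term in part.split(",") if term.strip()]

    if "=>" in rule:
        # Explicit mapping: left-hand terms are replaced by right-hand terms.
        left, right = rule.split("=>", 1)
        inputs, outputs = split_terms(left), split_terms(right)
    else:
        # Plain list: every term is replaced by all of the listed terms.
        inputs = outputs = split_terms(rule)
    return {term: list(outputs) for term in inputs}


print(parse_rule("jump,leap,hop"))
print(parse_rule("g b,gb,great britain => britain,england,scotland,wales"))
```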
|
||
If multiple rules for the same synonyms are specified, they are merged | ||
together. The order of rules is not respected. Instead, the longest matching | ||
rule wins. Take the following rules as an example: | ||
|
||
"united states => usa", | ||
"united states of america => usa" | ||
|
||
If these rules conflicted, Elasticsearch would turn ``United States of | ||
America'' into the terms `(usa),(of),(america)`. Instead, the longest | ||
sequence wins and we end up with just the term `(usa)`. | ||
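The longest-match behavior can be illustrated with a small greedy matcher in Python. This is a sketch under simplifying assumptions (whitespace tokens, a hypothetical `apply_rules` function, rules keyed by token tuples); it is not the algorithm used by the token filter itself.

```python
# Illustrative sketch only; apply_rules is a hypothetical helper.
RULES = {
    ("united", "states"): ["usa"],
    ("united", "states", "of", "america"): ["usa"],
}


def apply_rules(tokens, rules):
    """Scan left to right, always preferring the longest matching rule."""
    max_len = max(len(key) for key in rules)
    output, i = [], 0
    while i < len(tokens):
        # Try the longest possible window first, then shrink it.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in rules:
                output.append(rules[candidate])
                i += size
                break
        else:
            # No rule matched at this position: keep the token unchanged.
            output.append([tokens[i]])
            i += 1
    return output


tokens = "united states of america".split()
print(apply_rules(tokens, RULES))
print(apply_rules(tokens, {("united", "states"): ["usa"]}))
```

With both rules present the four tokens collapse into a single `usa` term; with only the shorter rule, `of` and `america` are left behind.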
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,129 @@ | ||
[[synonyms-expand-or-contract]] | ||
=== Expand or contract | ||
|
||
In <<synonym-formats>> we have seen that it is possible to replace synonyms by | ||
_simple expansion_, _simple contraction_ or _genre expansion_. We will look | ||
at the tradeoffs of each of these techniques below. | ||
|
||
IMPORTANT: This section deals with single-word synonyms only. Multi-word | ||
synonyms add another layer of complexity and are discussed later in | ||
<<multi-word-synonyms>>. | ||
|
||
[[synonyms-expansion]] | ||
==== Simple expansion | ||
|
||
With simple expansions, any of the listed synonyms is expanded into *all* of | ||
the listed synonyms: | ||
|
||
"jump,hop,leap" | ||
|
||
It can be applied either at index time or at query time. Each has advantages | ||
(⬆︎) and disadvantages (⬇︎). When to use which comes down to performance vs | ||
flexibility: | ||
|
||
[options="header",cols="h,d,d"] | ||
|=================================================== | ||
| | Index time | Query time | ||
|
||
| Index size | | ||
⬇︎ Bigger index because all synonyms must be indexed. | | ||
⬆︎ Normal. | ||
|
||
| Relevance | | ||
⬇︎ All synonyms will have the same IDF (see <<relevance-intro>>) meaning | ||
that more commonly used words will have the same weight as less commonly | ||
used words. | | ||
⬆︎ The IDF for each synonym will be correct. | ||
|
||
| Performance | | ||
⬆︎ A query only needs to find the single term specified in the query string. | | ||
⬇︎ A query for a single term is rewritten to look up all synonyms, which | ||
decreases performance. | ||
|
||
| Flexibility | | ||
⬇︎ The synonym rules can't be changed for existing documents. For the new rules | ||
to have effect, existing documents have to be reindexed. | | ||
⬆︎ Synonym rules can be updated without reindexing documents. | ||
|=================================================== | ||
|
||
[[synonyms-contraction]] | ||
==== Simple contraction | ||
|
||
Simple contraction maps a group of synonyms on the left-hand side to a single | ||
value on the right-hand side: | ||
|
||
"leap,hop => jump" | ||
|
||
It must be applied both at index time and at query time, to ensure that query | ||
terms are mapped to the same single value that exists in the index. | ||
|
||
This approach has some advantages and some disadvantages compared to the simple expansion approach: | ||
|
||
Index size:: | ||
|
||
⬆︎ The index size is normal as only a single term is indexed. | ||
|
||
Relevance:: | ||
|
||
⬇︎ The IDF for all terms is the same, so you can't distinguish between more | ||
commonly used words and less commonly used words. | ||
|
||
Performance:: | ||
|
||
⬆︎ A query only needs to find the single term that appears in the index. | ||
|
||
Flexibility:: | ||
+ | ||
-- | ||
|
||
⬆︎ New synonyms can be added to the left-hand side of the rule and applied at | ||
query time. For instance, imagine that we wanted to add the word `bound` to | ||
the rule specified above. The following rule would work for queries which | ||
contain `bound` or for newly added documents which contain `bound`: | ||
|
||
"leap,hop,bound => jump" | ||
|
||
But we could expand the effect to also take into account *existing* documents | ||
which contain `bound` by writing the rule as follows: | ||
|
||
"leap,hop,bound => jump,bound" | ||
|
||
When you reindex your documents, you could revert to the previous rule to gain | ||
the performance benefit of querying only a single term. | ||
|
||
-- | ||
|
||
[[synonyms-genres]] | ||
==== Genre expansion | ||
|
||
Genre expansion is quite different from simple contraction or expansion. | ||
Instead of treating all synonyms as equal, genre expansion widens the meaning | ||
of a term to be more generic. Take these rules for example: | ||
|
||
"cat => cat,pet", | ||
"kitten => kitten,cat,pet", | ||
"dog => dog,pet", | ||
"puppy => puppy,dog,pet" | ||
|
||
By applying genre expansion at index time: | ||
|
||
* a query for `kitten` would find just documents about kittens. | ||
* a query for `cat` would find documents about kittens and cats. | ||
* a query for `pet` would find documents about kittens, cats, puppies, dogs | ||
or pets. | ||
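The three cases listed above can be checked with a toy inverted index in Python. The one-word documents in `DOCS` and the `build_index` function are hypothetical, chosen only to show which documents each term ends up pointing to when the genre rules are applied at index time.

```python
# Illustrative sketch only; DOCS and build_index are hypothetical.
from collections import defaultdict

GENRE_RULES = {
    "cat": ["cat", "pet"],
    "kitten": ["kitten", "cat", "pet"],
    "dog": ["dog", "pet"],
    "puppy": ["puppy", "dog", "pet"],
}

# One-word documents keep the example easy to follow.
DOCS = {1: "kitten", 2: "cat", 3: "puppy", 4: "dog", 5: "pet"}


def build_index(docs, rules):
    """Index every token together with its genre expansions."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for term in rules.get(token, [token]):
                index[term].add(doc_id)
    return index


index = build_index(DOCS, GENRE_RULES)
for query in ("kitten", "cat", "pet"):
    print(query, sorted(index[query]))
```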
|
||
Alternatively, by applying genre expansion at query time, a query for `kitten` | ||
would be expanded to return documents which mention kittens, cats or pets | ||
specifically. | ||
|
||
You could also have the best of both worlds by applying expansion at index | ||
time to ensure that the genres are present in the index. Then, at query time, | ||
you can choose to not apply synonyms (so that a query for `kitten` only | ||
returns documents about kittens) or to apply synonyms in order to match | ||
kittens, cats and pets (including the canine variety). | ||
|
||
With the example rules above, the IDF for `kitten` will be correct, while the | ||
IDF for `cat` and `pet` will be artificially deflated. However, this actually | ||
works in your favour -- a genre-expanded query for `kitten OR cat OR pet` will | ||
rank documents with `kitten` highest, followed by documents with `cat`, and | ||
documents with `pet` would be right at the bottom. |
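A rough calculation shows why the more generic terms end up with a lower IDF. The sketch below uses the textbook form `log(N / df)`, which is a simplification and not the exact formula Elasticsearch applies, and the document frequencies are hypothetical: a five-document index in which each of `kitten`, `cat`, `puppy`, `dog` and `pet` appears in exactly one document before expansion.

```python
# Illustrative sketch only; the frequencies are made-up example values.
import math


def idf(total_docs, doc_freq):
    """Textbook inverse document frequency (simplified)."""
    return math.log(total_docs / doc_freq)


TOTAL_DOCS = 5
# Document frequency of each term without synonyms at index time.
df_plain = {"kitten": 1, "cat": 1, "pet": 1}
# Document frequency after genre expansion at index time.
df_expanded = {"kitten": 1, "cat": 2, "pet": 5}

for term in ("kitten", "cat", "pet"):
    before = idf(TOTAL_DOCS, df_plain[term])
    after = idf(TOTAL_DOCS, df_expanded[term])
    print(f"{term}: idf before={before:.2f}, after={after:.2f}")
```

In this toy index the IDF of `kitten` is unchanged, while `cat` and `pet` drop, which is what pushes `kitten` matches to the top of a `kitten OR cat OR pet` query.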