Added the synonyms chapter

zhaofanfan2019 · Jul 12, 2014 · 034349a
1 parent f45808f
commit 034349a
Showing 10 changed files with 642 additions and 15 deletions.
diff --git a/050_Search/20_Query_string.asciidoc b/050_Search/20_Query_string.asciidoc
@@ -90,6 +90,7 @@ your search results if you query specific fields instead of the `_all`
 field.  When the `_all` field is no longer useful to you, you can
 disable it, as explained in <<all-field>>.
 
+[[query-string-query]]
 ==== More complicated queries
 
 The next query searches for tweets:

diff --git a/260_Synonyms.asciidoc b/260_Synonyms.asciidoc
@@ -1,24 +1,17 @@
-[[synonyms]]
-== Synonyms
+:ref: http://foo.com/
 
-TODO
 
-=== Defining synonyms
+include::260_Synonyms/10_Intro.asciidoc[]
 
-TODO
+include::260_Synonyms/20_Using_synonyms.asciidoc[]
 
-=== Expand or Contract
+include::260_Synonyms/30_Synonym_formats.asciidoc[]
 
-Compare two approaches
+include::260_Synonyms/40_Expand_contract.asciidoc[]
 
-=== Beware the graph
+include::260_Synonyms/50_Analysis_chain.asciidoc[]
 
-TODO
+include::260_Synonyms/60_Multi_word_synonyms.asciidoc[]
 
-=== Synonyms and `query_string`
+include::260_Synonyms/70_Symbol_synonyms.asciidoc[]
 
-TODO
-
-=== Symbol synonyms
-
-mapping char filter
diff --git a/260_Synonyms/10_Intro.asciidoc b/260_Synonyms/10_Intro.asciidoc
@@ -0,0 +1,42 @@
+[[synonyms]]
+== Synonyms
+
+While stemming helps to broaden the scope of search by simplifying inflected
+words to their root form, synonyms broaden the scope by relating concepts and
+ideas. Perhaps no documents match a query for ``English queen'', but documents
+which contain ``British monarch'' would probably be considered a good match.
+
+A user might search for ``the US'' and expect to find documents which contain
+``United States'', ``USA'', ``U.S.A.'', ``America'', or ``the States''.
+However, they wouldn't expect to see results about ``the states of matter'' or
+``state machines''.
+
+This example provides a valuable lesson. It demonstrates how simple it is for
+a human to distinguish between separate concepts, and how tricky it can be for
+mere machines. The natural tendency is to try to provide synonyms for every
+word in the language, to ensure that any document is findable with even the
+most remotely related terms.
+
+This is a mistake.  In the same way that we prefer light or minimal stemming
+to aggressive stemming, synonyms should be used only where necessary. Users
+understand why their results are limited to the words in their search query.
+They are less understanding when their results seem almost random.
+
+Synonyms can be used to conflate words that have pretty much the same meaning,
+such as `jump`, `leap`, and `hop`, or `pamphlet`, `leaflet` and `brochure`.
+Alternatively, they can be used to make a word more generic.  For instance,
+`bird` could be used as a more general synonym for `owl` or `pigeon`, `adult`
+could be used for `man` or `woman`.
+
+Synonyms appear to be a simple concept but they are quite tricky to get right.
+In this chapter we will explain the mechanics of using synonyms and discuss
+the limitations and gotchas.
+
+IMPORTANT: Synonyms are used to broaden the scope of what is considered a
+matching document.  Just like with <<stemming,stemming>> or
+<<partial-matching,partial matching>>, synonym fields should not be used alone
+but should be combined with a query on a ``main'' field which contains the
+original text in unadulterated form.  See <<most-fields>> for an explanation
+of how to maintain relevance when using synonyms.
+
+
diff --git a/260_Synonyms/20_Types_of_synonyms.asciidoc b/260_Synonyms/20_Types_of_synonyms.asciidoc
@@ -0,0 +1,3 @@
+[[synonym-types]]
+=== Synonym types
+
diff --git a/260_Synonyms/20_Using_synonyms.asciidoc b/260_Synonyms/20_Using_synonyms.asciidoc
@@ -0,0 +1,87 @@
+[[using-synonyms]]
+=== Using synonyms
+
+Synonyms can replace existing tokens or be added to the token stream using the
+{ref}analysis-synonym-tokenfilter.html[`synonym` token filter]:
+
+[source,json]
+-------------------------------------
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_synonym_filter": {
+          "type": "synonym", <1>
+          "synonyms": [ <2>
+            "british,english",
+            "queen,monarch"
+          ]
+        }
+      },
+      "analyzer": {
+        "my_synonyms": {
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "my_synonym_filter" <3>
+          ]
+        }
+      }
+    }
+  }
+}
+-------------------------------------
+<1> First, we define a token filter of type `synonym`.
+<2> We will discuss synonym formats in <<synonym-formats>>.
+<3> Then we create a custom analyzer which uses the `my_synonym_filter`.
+
+**************************************
+
+Synonyms can be specified inline with the `synonyms` parameter, or in a
+synonyms file which must be present on every node in the cluster. The path to
+the synonyms file should be specified with the `synonyms_path` parameter, and
+should be either absolute or relative to the Elasticsearch `config` directory.
+See <<updating-stopwords>> for techniques that can be used to refresh the
+synonyms list.
+
+**************************************
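+
+As a minimal sketch of the file-based variant, the filter below reads its
+rules from a file instead of listing them inline.  The path
+`analysis/synonyms.txt` is a hypothetical example, assumed to exist relative
+to the `config` directory on every node; the rest of the request mirrors the
+inline definition shown earlier:
+
+[source,json]
+-------------------------------------
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_synonym_filter": {
+          "type": "synonym",
+          "synonyms_path": "analysis/synonyms.txt"
+        }
+      }
+    }
+  }
+}
+-------------------------------------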
+
+Testing out our analyzer with the `analyze` API shows the following:
+
+[source,json]
+-------------------------------------
+GET /my_index/_analyze?analyzer=my_synonyms
+Elizabeth is the English queen
+-------------------------------------
+
+[source,text]
+------------------------------------
+Pos 1: (elizabeth)
+Pos 2: (is)
+Pos 3: (the)
+Pos 4: (british,english) <1>
+Pos 5: (queen,monarch) <1>
+------------------------------------
+<1> All synonyms occupy the same position as the original term.
+
+A document like this will match queries for any of: ``English queen'',
+``British queen'', ``English monarch'' or ``British monarch''.  Even a phrase
+query will work, because the position of each term has been preserved.
+
+[IMPORTANT]
+.Index time vs search time
+======================================
+
+Using the same `synonym` token filter at both index time and search time is
+redundant.  If, at index time, we replace `English` with the two terms
+`english` and `british`, then at search time we only need to search for one of
+those terms.  Alternatively, if we don't use synonyms at index time then, at
+search time, we would need to convert a query for `English` into a query for
+`english OR british`.
+
+Whether to do synonym expansion at search or index time can be a difficult
+choice.  We will explore the options more in <<synonyms-expand-or-contract>>.
+
+======================================
diff --git a/260_Synonyms/30_Synonym_formats.asciidoc b/260_Synonyms/30_Synonym_formats.asciidoc
@@ -0,0 +1,47 @@
+[[synonym-formats]]
+=== Formatting synonyms
+
+In their simplest form, synonyms are listed as comma-separated values, such
+as:
+
+    "jump,leap,hop"
+
+If any of these terms is encountered it is replaced by all of the listed
+synonyms.  For instance:
+
+[source,text]
+--------------------------
+Original terms:   Replaced by:
+────────────────────────────────
+jump            → (jump,leap,hop)
+leap            → (jump,leap,hop)
+hop             → (jump,leap,hop)
+--------------------------
+
+Alternatively, with the `=>` syntax, it is possible to specify a list of terms
+to match (on the left-hand side), and a list of one or more replacements (on
+the right-hand side):
+
+    "u s a,united states,united states of america => usa"
+    "g b,gb,great britain => britain,england,scotland,wales"
+
+[source,text]
+--------------------------
+Original terms:   Replaced by:
+────────────────────────────────
+u s a           → (usa)
+united states   → (usa)
+great britain   → (britain,england,scotland,wales)
+--------------------------
+
+If multiple rules for the same synonyms are specified, they are merged
+together.  The order of rules is not respected.  Instead, the longest matching
+rule wins.  Take the following rules as an example:
+
+    "united states            => usa",
+    "united states of america => usa"
+
+If these rules were applied in the order listed, Elasticsearch would turn
+``United States of America'' into the terms `(usa),(of),(america)`.  Instead,
+the longest sequence wins and we end up with just the term `(usa)`.
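+
+To experiment with these formats, the rules can be placed in a `synonym` token
+filter like the one defined in <<using-synonyms>>.  The request below is only
+a sketch: the index name `my_index` and the filter name `my_synonym_filter`
+are placeholders, and the rules are the ones listed above:
+
+[source,json]
+-------------------------------------
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_synonym_filter": {
+          "type": "synonym",
+          "synonyms": [
+            "jump,leap,hop",
+            "u s a,united states,united states of america => usa",
+            "g b,gb,great britain => britain,england,scotland,wales"
+          ]
+        }
+      }
+    }
+  }
+}
+-------------------------------------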
+
diff --git a/260_Synonyms/40_Expand_contract.asciidoc b/260_Synonyms/40_Expand_contract.asciidoc
@@ -0,0 +1,129 @@
+[[synonyms-expand-or-contract]]
+=== Expand or contract
+
+In <<synonym-formats>> we have seen that it is possible to replace synonyms by
+_simple expansion_, _simple contraction_ or _genre expansion_.  We will look
+at the tradeoffs of each of these techniques below.
+
+IMPORTANT: This section deals with single-word synonyms only.  Multi-word
+synonyms add another layer of complexity and are discussed later in
+<<multi-word-synonyms>>.
+
+[[synonyms-expansion]]
+==== Simple expansion
+
+With simple expansions, any of the listed synonyms is expanded into *all* of
+the listed synonyms:
+
+    "jump,hop,leap"
+
+It can be applied either at index time or at query time.  Each has advantages
+(⬆︎) and disadvantages (⬇︎). When to use which comes down to performance vs
+flexibility:
+
+[options="header",cols="h,d,d"]
+|===================================================
+|                   | Index time             | Query time
+
+| Index size        |
+      ⬇︎ Bigger index because all synonyms must be indexed. |
+      ⬆︎ Normal.
+
+| Relevance         |
+      ⬇︎ All synonyms will have the same IDF (see <<relevance-intro>>) meaning
+      that more commonly used words will have the same weight as less commonly
+      used words. |
+      ⬆︎ The IDF for each synonym will be correct.
+
+| Performance |
+      ⬆︎ A query only needs to find the single term specified in the query string. |
+      ⬇︎ A query for a single term is rewritten to look up all synonyms, which
+      decreases performance.
+
+| Flexibility       |
+      ⬇︎ The synonym rules can't be changed for existing documents. For the new
+      rules to have effect, existing documents have to be reindexed. |
+      ⬆︎ Synonym rules can be updated without reindexing documents.
+|===================================================
+
+[[synonyms-contraction]]
+==== Simple contraction
+
+Simple contraction maps a group of synonyms on the left-hand side to a single
+value on the right-hand side:
+
+    "leap,hop => jump"
+
+It must be applied both at index time and at query time, to ensure that query
+terms are mapped to the same single value that exists in the index.
+
+This approach has some advantages and some disadvantages compared to the
+simple expansion approach:
+
+Index size::
+
+⬆︎ The index size is normal as only a single term is indexed.
+
+Relevance::
+
+⬇︎ The IDF for all terms is the same, so you can't distinguish between more
+commonly used words and less commonly used words.
+
+Performance::
+
+⬆︎ A query only needs to find the single term that appears in the index.
+
+Flexibility::
++
+--
+
+⬆︎ New synonyms can be added to the left-hand side of the rule and applied at
+query time. For instance, imagine that we wanted to add the word `bound` to
+the rule specified above. The following rule would work for queries which
+contain `bound` or for newly added documents which contain `bound`:
+
+    "leap,hop,bound => jump"
+
+But we could expand the effect to also take into account *existing* documents
+which contain `bound` by writing the rule as follows:
+
+    "leap,hop,bound => jump,bound"
+
+When you reindex your documents, you could revert to the previous rule to gain
+the performance benefit of querying only a single term.
+
+--
+
+[[synonyms-genres]]
+==== Genre expansion
+
+Genre expansion is quite different from simple contraction or expansion.
+Instead of treating all synonyms as equal, genre expansion widens the meaning
+of a term to be more generic. Take these rules for example:
+
+    "cat    => cat,pet",
+    "kitten => kitten,cat,pet",
+    "dog    => dog,pet",
+    "puppy  => puppy,dog,pet"
+
+By applying genre expansion at index time:
+
+* a query for `kitten` would find just documents about kittens.
+* a query for `cat` would find documents about kittens and cats.
+* a query for `pet` would find documents about kittens, cats, puppies, dogs
+  or pets.
+
+Alternatively, by applying genre expansion at query time, a query for `kitten`
+would be expanded to return documents which mention kittens, cats or pets
+specifically.
+
+You could also have the best of both worlds by applying expansion at index
+time to ensure that the genres are present in the index. Then, at query time,
+you can choose to not apply synonyms (so that a query for `kitten` only
+returns documents about kittens) or to apply synonyms in order to match
+kittens, cats and pets (including the canine variety).
+
+With the example rules above, the IDF for `kitten` will be correct, while the
+IDF for `cat` and `pet` will be artificially deflated.  However, this actually
+works in your favour -- a genre-expanded query for `kitten OR cat OR pet` will
+rank documents with `kitten` highest, followed by documents with `cat`, and
+documents with `pet` would be right at the bottom.
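+
+As a rough sketch, the genre rules above could be packaged as a token filter
+and a custom analyzer for use at index time.  The names `my_index`,
+`genre_filter` and `genre_analyzer` are hypothetical placeholders, not names
+used elsewhere in this chapter:
+
+[source,json]
+-------------------------------------
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "genre_filter": {
+          "type": "synonym",
+          "synonyms": [
+            "cat => cat,pet",
+            "kitten => kitten,cat,pet",
+            "dog => dog,pet",
+            "puppy => puppy,dog,pet"
+          ]
+        }
+      },
+      "analyzer": {
+        "genre_analyzer": {
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "genre_filter"
+          ]
+        }
+      }
+    }
+  }
+}
+-------------------------------------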