Commit

Added Stopwords intro and using stopwords

clintongormley committed Jun 21, 2014
1 parent 9c88cf7 commit ae0fbfb
Showing 4 changed files with 337 additions and 15 deletions.
14 changes: 3 additions & 11 deletions 230_Stemming/50_Controlling_stemming.asciidoc
@@ -71,18 +71,10 @@ sky skies skiing skis <1>
While the language analyzers only allow us to specify an array of words in the
`stem_exclusion` parameter, the `keyword_marker` token filter also accepts a
-`keyword_path` parameter which allows us to store all of our keywords in a
+`keywords_path` parameter which allows us to store all of our keywords in a
file. The file should contain one word per line, and must be present on every
-node in the cluster.
-
-This file can be updated later on, adding or removing keywords. However, it
-is important to note that:
-
-* changes to the file will not take effect until either each node has been
-  restarted, or the index has been closed and reopened.
-  (see {ref}indices-open-close.html[open/close index])
-* changing the file will not have any effect on documents that have already
-  been indexed.
+node in the cluster. See <<updating-stopwords>> for tips on how to update this
+file.
==========================================
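
As a rough sketch of what such a configuration might look like, here is a
`keyword_marker` token filter that reads its keywords from a file; the filter
name `no_stem` and the file path are illustrative only:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords_path": "config/keywords.txt"
        }
      }
    }
  }
}
---------------------------------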

14 changes: 10 additions & 4 deletions 240_Stopwords.asciidoc
@@ -1,8 +1,14 @@
-[[stopwords]]
-== Stopwords: performance vs precision

-stop token filter
-elision token filter
+include::240_Stopwords/10_Intro.asciidoc[]
+
+include::240_Stopwords/20_Using_stopwords.asciidoc[]
+

common terms query
match query
+
+relevance
+
+bm25
+
+common grams token filter
85 changes: 85 additions & 0 deletions 240_Stopwords/10_Intro.asciidoc
@@ -0,0 +1,85 @@
[[stopwords]]
== Stopwords: performance vs precision

Back in the early days of information retrieval, disk space and memory were
limited to a tiny fraction of what we are accustomed to today. It was
essential to make your index as small as possible. Every kilobyte saved meant
a significant improvement in performance. Stemming (see <<stemming>>) was
important, not just for making searches broader and increasing retrieval in
the same way that we use it today, but also as a tool for compressing index
size.

Another way to reduce index size is simply to *index fewer words*. For search
purposes, some words are more important than others. A significant reduction
in index size can be achieved by only indexing the more important terms.

So which terms can be left out? We can divide terms roughly into two groups:

Low frequency terms::

Words that appear in relatively few documents in the corpus. Because of their
rarity, they have a high value or _weight_.

High frequency terms::

Common words that appear in many documents in the index, like `the`, `and` and
`is`. These words have a low weight and contribute little to the relevance
score.

**********************************************
Of course, frequency is really a scale rather than just two points labelled
_low_ and _high_. We just draw a line at some arbitrary point and say that any
terms below that line are low frequency and above the line are high frequency.
**********************************************

Which terms are low or high frequency depends on the documents themselves. The
word `and` may be a low frequency term if all of the documents are in Chinese.
In a collection of documents about databases, the word `database` may be a
high frequency term with little value as a search term for that particular
corpus.

That said, for any language there are a number of words which occur very
commonly and which seldom add value to a search. The default English
stopwords used in Elasticsearch are:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with

These _stopwords_ can usually be filtered out before indexing with little
negative impact on retrieval. But is it a good idea to do so?

[float]
=== Pros and cons of stopwords

We have more disk space, more RAM, and better compression algorithms than
existed back in the day. Excluding the above 33 common words from the index
will only save about 4MB per million documents. Reducing index size is no
longer a valid reason for using stopwords.

On top of that, by removing words from the index we are reducing our ability
to perform certain types of search. Filtering out the above stopwords
prevents us from:

* distinguishing ``happy'' from ``not happy''.
* searching for the band ``The The''.
* finding Shakespeare's play ``To be or not to be''.
* using the country code for Norway: `no`.

The primary advantage of removing stopwords is performance. Imagine that we
search an index with 1 million documents for the word `fox`. Perhaps `fox`
appears in only 20 of them, which means that Elasticsearch has to calculate the
relevance `_score` for 20 documents in order to return the top 10. Now, we
change that to a search for `the OR fox`. The word `the` probably occurs in
almost all of the documents, which means that Elasticsearch has to calculate
the `_score` for all 1 million documents. This second query simply cannot
perform as well as the first.
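
To make this concrete, the second search could be an ordinary `match` query,
which defaults to `OR` semantics; the field name `text` here is just an
assumption:

[source,json]
---------------------------------
GET /my_index/_search
{
  "query": {
    "match": {
      "text": "the fox"
    }
  }
}
---------------------------------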

Fortunately, there are techniques which we can use to keep common words
searchable, while benefiting from the performance gain of stopwords. First,
let's start with how to use stopwords.



239 changes: 239 additions & 0 deletions 240_Stopwords/20_Using_stopwords.asciidoc
@@ -0,0 +1,239 @@
:ref: http://foo.com/

[[using-stopwords]]
=== Using stopwords

The removal of stopwords is handled by the
{ref}analysis-stop-tokenfilter.html[`stop` token filter] which can be used
when creating a `custom` analyzer, as described below in <<stop-token-filter>>.
However, some out-of-the-box analyzers have the `stop` filter integrated
already:

{ref}analysis-lang-analyzer.html[Language analyzers]::

Each language analyzer defaults to using the appropriate stopwords list
for that language. For instance, the `english` analyzer uses the
`_english_` stopwords list.

{ref}analysis-standard-analyzer.html[`standard` analyzer]::

Defaults to the empty stopwords list: `_none_`, essentially disabling
stopwords.

{ref}analysis-pattern-analyzer.html[`pattern` analyzer]::

Defaults to `_none_`, like the `standard` analyzer.

==== Stopwords and the `standard` analyzer

To use custom stopwords in conjunction with the `standard` analyzer, all we
need to do is create a configured version of the analyzer and pass in the
list of stopwords that we require:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { <1>
          "type": "standard", <2>
          "stopwords": [ <3>
            "and",
            "the"
          ]
        }
      }
    }
  }
}
---------------------------------
<1> This is a custom analyzer called `my_analyzer`.
<2> This analyzer is the `standard` analyzer with some custom configuration.
<3> The stopwords to filter out are `and` and `the`.

TIP: The same technique can be used to configure custom stopword lists for
any of the language analyzers.
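
For instance, a sketch of the `english` analyzer configured with the same
custom stopword list might look like this:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": [ "and", "the" ]
        }
      }
    }
  }
}
---------------------------------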

==== Maintaining positions

The output from the `analyze` API is quite interesting:

[source,json]
---------------------------------
GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead
---------------------------------

[source,json]
---------------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2 <1>
    },
    {
      "token": "dead",
      "start_offset": 18,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 5 <1>
    }
  ]
}
---------------------------------
<1> Note the `position` of each token.

The stopwords have been filtered out, as expected, but the interesting part is
that the `position` of the two remaining terms is unchanged: `quick` is the
second word in the original sentence, and `dead` is the fifth. This is
important for phrase queries -- if the positions of each term had been
adjusted, then a phrase query for `"quick dead"` would have matched the above
example incorrectly.
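
For instance, assuming the analyzed field is called `text`, a `match_phrase`
query for `"quick dead"` would correctly fail to match the sentence above,
because `quick` is at position 2 and `dead` at position 5. With a `slop` of
2 or more, the query would tolerate the gap between the two words:

[source,json]
---------------------------------
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "text": {
        "query": "quick dead",
        "slop": 2
      }
    }
  }
}
---------------------------------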

==== Specifying stopwords

Stopwords can be passed inline, as we did in the previous example, by
specifying an array:

[source,json]
---------------------------------
"stopwords": [ "and", "the" ]
---------------------------------

The default stopword list for a particular language can be specified using the
`_lang_` notation:

[source,json]
---------------------------------
"stopwords": "_english_"
---------------------------------

TIP: The predefined language-specific stopword lists available in
Elasticsearch can be found in the
{ref}analysis-stop-tokenfilter.html[`stop` token filter] documentation.

Stopwords can be disabled by specifying the special list: `_none_`. For
instance, to use the `english` analyzer without stopwords, you can do the
following:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english", <1>
          "stopwords": "_none_" <2>
        }
      }
    }
  }
}
---------------------------------
<1> The `my_english` analyzer is based on the `english` analyzer.
<2> But stopwords are disabled.

Finally, stopwords can also be listed in a file with one word per line. The
file must be present on all nodes in the cluster, and the path can be
specified with the `stopwords_path` parameter:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords_path": "config/stopwords/english.txt" <1>
        }
      }
    }
  }
}
---------------------------------
<1> The path to the stopwords file, relative to the Elasticsearch directory.
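
The file itself is plain text with one word per line. A purely illustrative
`english.txt` might begin:

[source,text]
---------------------------------
a
an
and
are
---------------------------------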

[[stop-token-filter]]
==== Using the `stop` token filter

The {ref}analysis-stop-tokenfilter.html[`stop` token filter] can be used
directly when you need to create a `custom` analyzer. For instance, let's say
that we wanted to create a Spanish analyzer with a custom stopwords list
and the `light_spanish` stemmer, and which also
<<asciifolding-token-filter,removes diacritics>>.

We could set that up as follows:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": [ "si", "esta", "el", "la" ] <1>
        },
        "light_spanish": { <2>
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "standard",
          "filter": [ <3>
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}
---------------------------------
<1> The `stop` token filter takes the same `stopwords` and `stopwords_path`
parameters as the `standard` analyzer.
<2> See <<using-an-algorithmic-stemmer>>.
<3> The order of token filters is important, as explained below.

The `spanish_stop` filter comes after the `asciifolding` filter. This means
that `esta`, `èsta` and ++està++ will first have their diacritics removed to
become just `esta`, which is removed as a stopword. If, instead, we wanted to
remove `esta` and `èsta`, but not ++està++, then we would have to put the
`spanish_stop` filter *before* the `asciifolding` filter, and specify both
words in the stopwords list.
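
A sketch of that alternative setup, showing just the parts that change (both
the unfolded and folded forms now appear explicitly in the stopwords list):

[source,json]
---------------------------------
"filter": {
  "spanish_stop": {
    "type": "stop",
    "stopwords": [ "esta", "èsta", "si", "el", "la" ]
  }
},
"analyzer": {
  "my_spanish": {
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "spanish_stop",
      "asciifolding",
      "light_spanish"
    ]
  }
}
---------------------------------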

[[updating-stopwords]]
==== Updating stopwords

There are a few techniques which can be used to update the list of stopwords
in use. Analyzers are instantiated at index creation time, when a node is
restarted, or when a closed index is reopened.

If you specify stopwords inline with the `stopwords` parameter, then your
only option is to close the index, update the analyzer configuration with the
{ref}indices-update-settings.html[update index settings API], then reopen
the index.
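
A rough sketch of that sequence, reusing the `my_english` analyzer from the
earlier examples and an illustrative new stopword list:

[source,json]
---------------------------------
POST /my_index/_close

PUT /my_index/_settings
{
  "analysis": {
    "analyzer": {
      "my_english": {
        "type": "english",
        "stopwords": [ "and", "the", "an" ]
      }
    }
  }
}

POST /my_index/_open
---------------------------------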

Updating stopwords is easier if you specify them in a file with the
`stopwords_path` parameter. You can just update the file (on every node in
the cluster) and then force the analyzers to be recreated by:

* closing and reopening the index
(see {ref}indices-open-close.html[open/close index]), or
* restarting each node in the cluster, one by one.

Of course, updating the stopwords list will not change any documents that have
already been indexed. It will only apply to searches and to new or updated
documents. To apply the changes to existing documents you will need to
reindex your data. See <<reindex>>.
