Commit

Added Stopwords intro and using stopwords

clintongormley committed Jun 21, 2014
1 parent 9c88cf7 commit ae0fbfb
Showing 4 changed files with 337 additions and 15 deletions.
14 changes: 3 additions & 11 deletions 230_Stemming/50_Controlling_stemming.asciidoc
@@ -71,18 +71,10 @@ sky skies skiing skis <1>
While the language analyzers only allow us to specify an array of words in the
`stem_exclusion` parameter, the `keyword_marker` token filter also accepts a
-`keyword_path` parameter which allows us to store all of our keywords in a
+`keywords_path` parameter which allows us to store all of our keywords in a
file. The file should contain one word per line, and must be present on every
-node in the cluster.
-
-This file can be updated later on, adding or removing keywords. However, it
-is important to note that:
-
-* changes to the file will not take effect until either each node has been
-  restarted, or the index has been closed and reopened.
-  (see {ref}indices-open-close.html[open/close index])
-* changing the file will not have any effect on documents that have already
-  been indexed.
+node in the cluster. See <<updating-stopwords>> for tips on how to update this
+file.
==========================================
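
As a rough sketch of what such a configuration might look like, here is a
`keyword_marker` token filter that reads its keywords from a file; the filter
name `no_stem` and the file path are illustrative only:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords_path": "config/keywords.txt"
        }
      }
    }
  }
}
---------------------------------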

14 changes: 10 additions & 4 deletions 240_Stopwords.asciidoc
@@ -1,8 +1,14 @@
-[[stopwords]]
-== Stopwords: performance vs precision

-stop token filter
-elision token filter
+include::240_Stopwords/10_Intro.asciidoc[]
+
+include::240_Stopwords/20_Using_stopwords.asciidoc[]
+

common terms query
match query
+
+relevance
+
+bm25
+
+common grams token filter
85 changes: 85 additions & 0 deletions 240_Stopwords/10_Intro.asciidoc
@@ -0,0 +1,85 @@
[[stopwords]]
== Stopwords: performance vs precision

Back in the early days of information retrieval, disk space and memory were
limited to a tiny fraction of what we are accustomed to today. It was
essential to make your index as small as possible. Every kilobyte saved meant
a significant improvement in performance. Stemming (see <<stemming>>) was
important, not just for making searches broader and increasing retrieval in
the same way that we use it today, but also as a tool for compressing index
size.

Another way to reduce index size is simply to *index fewer words*. For search
purposes, some words are more important than others. A significant reduction
in index size can be achieved by only indexing the more important terms.

So which terms can be left out? We can divide terms roughly into two groups:

Low frequency terms::

Words that appear in relatively few documents in the corpus. Because of their
rarity, they have a high value or _weight_.

High frequency terms::

Common words that appear in many documents in the index, like `the`, `and` and
`is`. These words have a low weight and contribute little to the relevance
score.

**********************************************
Of course, frequency is really a scale rather than just two points labelled
_low_ and _high_. We just draw a line at some arbitrary point and say that any
terms below that line are low frequency and above the line are high frequency.
**********************************************

Which terms are low or high frequency depends on the documents themselves. The
word `and` may be a low frequency term if all of the documents are in Chinese.
In a collection of documents about databases, the word `database` may be a
high frequency term with little value as a search term for that particular
corpus.

That said, for any language there are a number of words which occur very
commonly and which seldom add value to a search. The default English
stopwords used in Elasticsearch are:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with

These _stopwords_ can usually be filtered out before indexing with little
negative impact on retrieval. But is it a good idea to do so?

[float]
=== Pros and cons of stopwords

We have more disk space, more RAM, and better compression algorithms than
existed back in the day. Excluding the above 33 common words from the index
will only save about 4MB per million documents. Reducing index size is no
longer a valid reason for using stopwords.

On top of that, by removing words from the index we are reducing our ability
to perform certain types of search. Filtering out the above stopwords
prevents us from:

* distinguishing ``happy'' from ``not happy''.
* searching for the band ``The The''.
* finding Shakespeare's play ``To be or not to be''.
* using the country code for Norway: `no`.

The primary advantage of removing stopwords is performance. Imagine that we
search an index with 1 million documents for the word `fox`. Perhaps `fox`
appears in only 20 of them, which means that Elasticsearch has to calculate the
relevance `_score` for 20 documents in order to return the top 10. Now, we
change that to a search for `the OR fox`. The word `the` probably occurs in
almost all of the documents, which means that Elasticsearch has to calculate
the `_score` for all 1 million documents. This second query simply cannot
perform as well as the first.
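
To make this concrete, the second search could be an ordinary `match` query,
which defaults to `OR` semantics; the field name `text` here is just an
assumption:

[source,json]
---------------------------------
GET /my_index/_search
{
  "query": {
    "match": {
      "text": "the fox"
    }
  }
}
---------------------------------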

Fortunately, there are techniques which we can use to keep common words
searchable, while benefiting from the performance gain of stopwords. First,
let's start with how to use stopwords.



239 changes: 239 additions & 0 deletions 240_Stopwords/20_Using_stopwords.asciidoc
@@ -0,0 +1,239 @@
:ref: http://foo.com/

[[using-stopwords]]
=== Using stopwords

The removal of stopwords is handled by the
{ref}analysis-stop-tokenfilter.html[`stop` token filter] which can be used
when creating a `custom` analyzer, as described below in <<stop-token-filter>>.
However, some out-of-the-box analyzers have the `stop` filter integrated
already:

{ref}analysis-lang-analyzer.html[Language analyzers]::

Each language analyzer defaults to using the appropriate stopwords list
for that language. For instance, the `english` analyzer uses the
`_english_` stopwords list.

{ref}analysis-standard-analyzer.html[`standard` analyzer]::

Defaults to the empty stopwords list: `_none_`, essentially disabling
stopwords.

{ref}analysis-pattern-analyzer.html[`pattern` analyzer]::

Defaults to `_none_`, like the `standard` analyzer.

==== Stopwords and the `standard` analyzer

To use custom stopwords in conjunction with the `standard` analyzer, all we
need to do is create a configured version of the analyzer and pass in the
list of stopwords that we require:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { <1>
          "type": "standard", <2>
          "stopwords": [ <3>
            "and",
            "the"
          ]
        }
      }
    }
  }
}
---------------------------------
<1> This is a custom analyzer called `my_analyzer`.
<2> This analyzer is the `standard` analyzer with some custom configuration.
<3> The stopwords to filter out are `and` and `the`.

TIP: The same technique can be used to configure custom stopword lists for
any of the language analyzers.
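
For instance, a sketch of the `english` analyzer configured with the same
custom stopword list might look like this:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": [ "and", "the" ]
        }
      }
    }
  }
}
---------------------------------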

==== Maintaining positions

The output from the `analyze` API is quite interesting:

[source,json]
---------------------------------
GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead
---------------------------------

[source,json]
---------------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2 <1>
    },
    {
      "token": "dead",
      "start_offset": 18,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 5 <1>
    }
  ]
}
---------------------------------
<1> Note the `position` of each token.

The stopwords have been filtered out, as expected, but the interesting part is
that the `position` of the two remaining terms is unchanged: `quick` is the
second word in the original sentence, and `dead` is the fifth. This is
important for phrase queries -- if the positions of each term had been
adjusted, then a phrase query for `"quick dead"` would have matched the above
example incorrectly.
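
For instance, assuming the analyzed field is called `text`, a `match_phrase`
query for `"quick dead"` would correctly fail to match the sentence above,
because `quick` is at position 2 and `dead` at position 5. With a `slop` of
2 or more, the query would tolerate the gap between the two words:

[source,json]
---------------------------------
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "text": {
        "query": "quick dead",
        "slop": 2
      }
    }
  }
}
---------------------------------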

==== Specifying stopwords

Stopwords can be passed inline, as we did in the previous example, by
specifying an array:

[source,json]
---------------------------------
"stopwords": [ "and", "the" ]
---------------------------------

The default stopword list for a particular language can be specified using the
`_lang_` notation:

[source,json]
---------------------------------
"stopwords": "_english_"
---------------------------------

TIP: The predefined language-specific stopword lists available in
Elasticsearch can be found in the
{ref}analysis-stop-tokenfilter.html[`stop` token filter] documentation.

Stopwords can be disabled by specifying the special list: `_none_`. For
instance, to use the `english` analyzer without stopwords, you can do the
following:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english", <1>
          "stopwords": "_none_" <2>
        }
      }
    }
  }
}
---------------------------------
<1> The `my_english` analyzer is based on the `english` analyzer.
<2> But stopwords are disabled.

Finally, stopwords can also be listed in a file with one word per line. The
file must be present on all nodes in the cluster, and the path can be
specified with the `stopwords_path` parameter:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords_path": "config/stopwords/english.txt" <1>
        }
      }
    }
  }
}
---------------------------------
<1> The path to the stopwords file, relative to the Elasticsearch directory.
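
The file itself is plain text with one word per line. A purely illustrative
`english.txt` might begin:

[source,text]
---------------------------------
a
an
and
are
---------------------------------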

[[stop-token-filter]]
==== Using the `stop` token filter

The {ref}analysis-stop-tokenfilter.html[`stop` token filter] can be used
directly when you need to create a `custom` analyzer. For instance, let's say
that we wanted to create a Spanish analyzer with a custom stopwords list
and the `light_spanish` stemmer, and which also
<<asciifolding-token-filter,removes diacritics>>.

We could set that up as follows:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": [ "si", "esta", "el", "la" ] <1>
        },
        "light_spanish": { <2>
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "standard",
          "filter": [ <3>
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}
---------------------------------
<1> The `stop` token filter takes the same `stopwords` and `stopwords_path`
parameters as the `standard` analyzer.
<2> See <<using-an-algorithmic-stemmer>>.
<3> The order of token filters is important, as explained below.

The `spanish_stop` filter comes after the `asciifolding` filter. This means
that `esta`, `èsta` and ++està++ will first have their diacritics removed to
become just `esta`, which is removed as a stopword. If, instead, we wanted to
remove `esta` and `èsta`, but not ++està++, then we would have to put the
`spanish_stop` filter *before* the `asciifolding` filter, and specify both
words in the stopwords list.
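
A sketch of that alternative setup, showing just the parts that change (both
the unfolded and folded forms now appear explicitly in the stopwords list):

[source,json]
---------------------------------
"filter": {
  "spanish_stop": {
    "type": "stop",
    "stopwords": [ "esta", "èsta", "si", "el", "la" ]
  }
},
"analyzer": {
  "my_spanish": {
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "spanish_stop",
      "asciifolding",
      "light_spanish"
    ]
  }
}
---------------------------------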

[[updating-stopwords]]
==== Updating stopwords

There are a few techniques which can be used to update the list of stopwords
in use. Analyzers are instantiated at index creation time, when a node is
restarted, or when a closed index is reopened.

If you specify stopwords inline with the `stopwords` parameter, then your
only option is to close the index, update the analyzer configuration with the
{ref}indices-update-settings.html[update index settings API], then reopen
the index.
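
A rough sketch of that sequence, reusing the `my_english` analyzer from the
earlier examples and an illustrative new stopword list:

[source,json]
---------------------------------
POST /my_index/_close

PUT /my_index/_settings
{
  "analysis": {
    "analyzer": {
      "my_english": {
        "type": "english",
        "stopwords": [ "and", "the", "an" ]
      }
    }
  }
}

POST /my_index/_open
---------------------------------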

Updating stopwords is easier if you specify them in a file with the
`stopwords_path` parameter. You can just update the file (on every node in
the cluster) and then force the analyzers to be recreated by:

* closing and reopening the index
(see {ref}indices-open-close.html[open/close index]), or
* restarting each node in the cluster, one by one.

Of course, updating the stopwords list will not change any documents that have
already been indexed. It will only apply to searches and to new or updated
documents. To apply the changes to existing documents you will need to
reindex your data. See <<reindex>>.
