Updated stemming to incorporate comments from Robert

zhaofanfan2019 · Jun 19, 2014 · ed891a3
1 parent 8d06ba4
commit ed891a3

Showing 7 changed files with 129 additions and 64 deletions.
diff --git a/230_Stemming.asciidoc b/230_Stemming.asciidoc
@@ -6,6 +6,8 @@ include::230_Stemming/20_Dictionary_stemmers.asciidoc[]
 
 include::230_Stemming/30_Hunspell_stemmer.asciidoc[]
 
-include::230_Stemming/40_Controlling_stemming.asciidoc[]
+include::230_Stemming/40_Choosing_a_stemmer.asciidoc[]
 
-include::230_Stemming/50_Stemming_in_situ.asciidoc[]
+include::230_Stemming/50_Controlling_stemming.asciidoc[]
+
+include::230_Stemming/60_Stemming_in_situ.asciidoc[]
diff --git a/230_Stemming/00_Intro.asciidoc b/230_Stemming/00_Intro.asciidoc
@@ -62,13 +62,16 @@ them.
 
 Lemmatisation is a much more complicated and expensive process that needs to
 understand the context in which words appear in order to make decisions
-about what they mean. For now, stemmers are the best tools that we have
-available.
+about what they mean. In practice, stemming appears to be just as effective
+as lemmatisation, but with a much lower cost.
 
 **********************************************
 
-There are two types of stemmers available: algorithmic stemmers and dictionary
-stemmers.
+First we will discuss the two classes of stemmers available in Elasticsearch
+-- <<algorithmic-stemmers>> and <<dictionary-stemmers>> -- then look at how to
+choose the right stemmer for your needs in <<choosing-a-stemmer>>.  Finally,
+we will discuss options for tailoring stemming in <<controlling-stemming>> and
+<<stemming-in-situ>>.
 
 
 
diff --git a/230_Stemming/10_Algorithmic_stemmers.asciidoc b/230_Stemming/10_Algorithmic_stemmers.asciidoc
@@ -148,44 +148,3 @@ PUT /my_index
     `light_english` stemmer.
 <2> Added the `asciifolding` token filter.
 
-==== Choosing an algorithmic stemmer
-
-The documentation for the
-{ref}analysis-stemmer-tokenfilter.html[`stemmer` token filter]
-lists multiple stemmers for some languages.  For Portuguese we have:
-
-* `portuguese`
-* `light_portuguese`
-* `minimal_portuguese`
-* `portuguese_rslp`
-
-For English we have:
-
-* `english`
-* `light_english`
-* `minimal_english`
-* `lovins`
-* `porter`
-* `porter2`
-* `possessive_english`
-
-One thing is for sure: whenever more than one solution exists for a problem,
-it means that none of the solutions solves the problem adequately. This
-certainly applies to stemming -- each stemmer is based on a different
-algorithm which overstems and understems words to a different degree.
-
-The {ref}analysis-stemmer-tokenfilter.html[`stemmer` token filter] reference
-documentation highlights the recommended choice for each language in bold,
-but the recommended stemmer may not be appropriate for all use cases. It is
-usually chosen because it offers a reasonable compromise between performance
-and accuracy.  You may find that, for your particular use case, the
-recommended stemmer is either too aggressive or not aggressive enough, in
-which case you may want to try a different stemmer.
-
-The `light_` stemmers are less aggressive than the standard stemmers, and the
-`minimal_` stemmers are less aggressive still. The Snowball-based stemmers
-tend to be slower than the other hand-coded stemmers, although that very much
-depends upon the implementation.
-
-Choosing the ``best'' stemmer is largely a case of trying each one out and
-selecting the one that seems to produce the best results for your documents.
diff --git a/230_Stemming/30_Hunspell_stemmer.asciidoc b/230_Stemming/30_Hunspell_stemmer.asciidoc
@@ -176,23 +176,6 @@ shards which use the same Hunspell analyzer share the same instance.
 
 ***********************************************
 
-==== When to use Hunspell
-
-In theory, the Hunspell stemmer promises accurate, configurable stemming.  The
-reality, sadly, falls short of the theory. The main problem is the difficulty
-of finding high quality, up to date dictionaries, with friendly licenses. Most
-dictionaries are incomplete and out of date.
-
-Hunspell tends to stem quite aggressively, reducing every word to the shortest
-form possible.  While this does increase recall, it also reduces precision. Of
-course, you can control the stemming process if you are willing to customize
-your own dictionary, but that requires a lot of effort and research.
-
-In practice, if a good algorithmic stemmer is available for your language, it
-makes more sense to use that rather than Hunspell.  It will be faster, consume
-less memory and the results will generally be as good or better than with
-Hunspell.
-
 [[hunspell-dictionary-format]]
 ==== Hunspell dictionary format
 

diff --git a/230_Stemming/40_Choosing_a_stemmer.asciidoc b/230_Stemming/40_Choosing_a_stemmer.asciidoc
@@ -0,0 +1,118 @@
+:ref: http://foo.com/
+[[choosing-a-stemmer]]
+=== Choosing a stemmer
+
+The documentation for the
+{ref}analysis-stemmer-tokenfilter.html[`stemmer`] token filter
+lists multiple stemmers for some languages.  For English we have:
+
+[horizontal]
+`english`::
+    The {ref}analysis-porterstem-tokenfilter.html[`porter_stem`] token filter.
+
+`light_english`::
+    The {ref}analysis-kstem-tokenfilter.html[`kstem`] token filter.
+
+`minimal_english`::
+    The `EnglishMinimalStemmer` in Lucene, which removes plurals.
+
+`lovins`::
+    The {ref}analysis-snowball-tokenfilter.html[Snowball] based
+    http://snowball.tartarus.org/algorithms/lovins/stemmer.html[Lovins]
+    stemmer, the first stemmer ever produced.
+
+`porter`::
+    The {ref}analysis-snowball-tokenfilter.html[Snowball] based
+    http://snowball.tartarus.org/algorithms/porter/stemmer.html[Porter] stemmer.
+
+`porter2`::
+    The {ref}analysis-snowball-tokenfilter.html[Snowball] based
+    http://snowball.tartarus.org/algorithms/english/stemmer.html[Porter2] stemmer.
+
+`possessive_english`::
+    The `EnglishPossessiveFilter` in Lucene, which removes `'s`.
+
+Add to that list the Hunspell stemmer with the various English dictionaries
+which are available.
+
+One thing is for sure: whenever more than one solution exists for a problem,
+it means that none of the solutions solves the problem adequately. This
+certainly applies to stemming -- each stemmer uses a different approach which
+overstems and understems words to a different degree.
+
+The `stemmer` documentation page highlights the ``recommended'' stemmer for
+each language in bold, usually because it offers a reasonable compromise
+between performance and quality. That said, the recommended stemmer may not be
+appropriate for all use cases. There is no single right answer to the question
+of which is the ``best'' stemmer -- it depends very much on your requirements.
+There are three factors to take into account when making a choice:
+performance, quality and degree.
+
+[[stemmer-performance]]
+==== Stemmer performance
+
+Algorithmic stemmers are typically four or five times faster than Hunspell
+stemmers. ``Hand crafted'' algorithmic stemmers are usually, but not always,
+faster than their Snowball equivalents.  For instance, the `porter_stem` token
+filter is significantly faster than the Snowball implementation of the Porter
+stemmer.
+
+Hunspell stemmers have to load all words, prefixes and suffixes into memory,
+which can consume a few megabytes of RAM.  Algorithmic stemmers, on the other
+hand, consist of a small amount of code and consume very little memory.
+
+[[stemmer-quality]]
+==== Stemmer quality
+
+All languages, except Esperanto, are irregular. While more formal words tend
+to follow a regular pattern, the most commonly used words often follow their
+own irregular rules. Some stemming algorithms have been developed over years of
+research and produce reasonably high quality results. Others have been
+assembled more quickly with less research and deal only with the most common
+cases.
+
+While Hunspell offers the promise of dealing precisely with irregular words,
+it often falls short in practice. A dictionary stemmer is only as good as its
+dictionary.  If Hunspell comes across a word which isn't in its dictionary, it
+can do nothing with it. Hunspell requires an extensive, high quality, up to
+date dictionary in order to produce good results -- dictionaries of this
+calibre are few and far between. An algorithmic stemmer, on the other hand,
+will happily deal with new words that didn't exist when the designer created
+the algorithm.
+
+If a good algorithmic stemmer is available for your language, it makes sense
+to use it rather than Hunspell.  It will be faster, consume less memory and
+will generally be as good or better than the Hunspell equivalent.
+
+If accuracy and customizability are very important to you, and you need (and
+have the resources) to maintain a custom dictionary, then Hunspell gives you
+greater flexibility than the algorithmic stemmers. (See
+<<controlling-stemming>> for customization techniques which can be used with
+any stemmer.)
+
+[[stemmer-degree]]
+==== Stemmer degree
+
+Different stemmers overstem and understem to a different degree.  The `light_`
+stemmers stem less aggressively than the standard stemmers, and the `minimal_`
+stemmers less aggressively still.  Hunspell stems aggressively.
+
+Whether you want aggressive or light stemming depends on your use case.  If
+your search results are being consumed by a clustering algorithm, you may
+prefer to match more widely (and, thus, stem more aggressively).  If your
+search results are intended for human consumption, lighter stemming usually
+produces better results.  Stemming nouns and adjectives is more important for
+search than stemming verbs, but this also depends on the language.
+
+The other factor to take into account is the size of your document corpus.
+With a small corpus such as a catalog of 10,000 products, you probably want to
+stem more aggressively to ensure that you match at least some documents.  If
+your corpus is large, then it is likely you will get good matches with lighter
+stemming.
+
+==== Making a choice
+
+Start out with a recommended stemmer.  If it works well enough, then there is
+no need to change it.  If it doesn't, then you will need to spend some time
+investigating and comparing the stemmers available for your language in order to
+find the one that suits your purposes best.
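
Note on the "Making a choice" step above: comparing stemmers is easier with one test index per candidate. The sketch below is not part of this commit. It only builds the request bodies, assuming the `stemmer` token filter takes a `language` parameter as in the algorithmic stemmers chapter. The index names (`stem_test_*`), the filter name `candidate_stemmer`, and the analyzer name `candidate_analyzer` are hypothetical and chosen for illustration.

```python
import json

# English stemmer names from the list in this section.
# `possessive_english` is left out here because it only strips `'s`.
ENGLISH_STEMMERS = [
    "english",
    "light_english",
    "minimal_english",
    "lovins",
    "porter",
    "porter2",
]


def build_settings(stemmer_name):
    """Return an index-creation body whose analyzer uses the given stemmer."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name; `language` selects the stemmer.
                    "candidate_stemmer": {
                        "type": "stemmer",
                        "language": stemmer_name,
                    }
                },
                "analyzer": {
                    # Hypothetical analyzer name.
                    "candidate_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "candidate_stemmer"],
                    }
                },
            }
        }
    }


def build_all(names=ENGLISH_STEMMERS):
    """Map a hypothetical test index name to its settings body."""
    return {f"stem_test_{name}": build_settings(name) for name in names}


if __name__ == "__main__":
    # Print each body in the PUT style used elsewhere in the book.
    for index_name, body in build_all().items():
        print(f"PUT /{index_name}")
        print(json.dumps(body, indent=2))
```

Each body can be sent when creating its test index. Analyzing the same sample text with every `candidate_analyzer` then shows how aggressively each stemmer treats your own vocabulary.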
diff --git a/...Stemming/40_Controlling_stemming.asciidoc → ...Stemming/50_Controlling_stemming.asciidoc b/...Stemming/40_Controlling_stemming.asciidoc → ...Stemming/50_Controlling_stemming.asciidoc
diff --git a/230_Stemming/50_Stemming_in_situ.asciidoc → 230_Stemming/60_Stemming_in_situ.asciidoc b/230_Stemming/50_Stemming_in_situ.asciidoc → 230_Stemming/60_Stemming_in_situ.asciidoc