Updated stemming to incorporate comments from Robert
clintongormley committed Jun 19, 2014
1 parent 8d06ba4 commit ed891a3
Showing 7 changed files with 129 additions and 64 deletions.
6 changes: 4 additions & 2 deletions 230_Stemming.asciidoc
@@ -6,6 +6,8 @@ include::230_Stemming/20_Dictionary_stemmers.asciidoc[]

include::230_Stemming/30_Hunspell_stemmer.asciidoc[]

include::230_Stemming/40_Controlling_stemming.asciidoc[]
include::230_Stemming/40_Choosing_a_stemmer.asciidoc[]

include::230_Stemming/50_Stemming_in_situ.asciidoc[]
include::230_Stemming/50_Controlling_stemming.asciidoc[]

include::230_Stemming/60_Stemming_in_situ.asciidoc[]
11 changes: 7 additions & 4 deletions 230_Stemming/00_Intro.asciidoc
@@ -62,13 +62,16 @@ them.
Lemmatisation is a much more complicated and expensive process that needs to
understand the context in which words appear in order to make decisions
about what they mean. For now, stemmers are the best tools that we have
available.
about what they mean. In practice, stemming appears to be just as effective
as lemmatisation, but with a much lower cost.
**********************************************

There are two types of stemmers available: algorithmic stemmers and dictionary
stemmers.
First we will discuss the two classes of stemmers available in Elasticsearch
-- <<algorithmic-stemmers>> and <<dictionary-stemmers>> -- then look at how to
choose the right stemmer for your needs in <<choosing-a-stemmer>>. Finally,
we will discuss options for tailoring stemming in <<controlling-stemming>> and
<<stemming-in-situ>>.



41 changes: 0 additions & 41 deletions 230_Stemming/10_Algorithmic_stemmers.asciidoc
@@ -148,44 +148,3 @@ PUT /my_index
`light_english` stemmer.
<2> Added the `asciifolding` token filter.

==== Choosing an algorithmic stemmer

The documentation for the
{ref}analysis-stemmer-tokenfilter.html[`stemmer` token filter]
lists multiple stemmers for some languages. For Portuguese we have:

* `portuguese`
* `light_portuguese`
* `minimal_portuguese`
* `portuguese_rslp`

For English we have:

* `english`
* `light_english`
* `minimal_english`
* `lovins`
* `porter`
* `porter2`
* `possessive_english`

One thing is for sure: whenever more than one solution exists for a problem,
it means that none of the solutions solves the problem adequately. This
certainly applies to stemming -- each stemmer is based on a different
algorithm which overstems and understems words to a different degree.

The {ref}analysis-stemmer-tokenfilter.html[`stemmer` token filter] reference
documentation highlights the recommended choice for each language in bold,
but the recommended stemmer may not be appropriate for all use cases. It is
usually chosen because it offers a reasonable compromise between performance
and accuracy. You may find that, for your particular use case, the
recommended stemmer is either too aggressive or not aggressive enough, in
which case you may want to try a different stemmer.

The `light_` stemmers are less aggressive than the standard stemmers, and the
`minimal_` stemmers are less aggressive still. The Snowball-based stemmers
tend to be slower than the other hand-coded stemmers, although that very much
depends upon the implementation.

Choosing the ``best'' stemmer is largely a case of trying each one out and
selecting the one that seems to produce the best results for your documents.
17 changes: 0 additions & 17 deletions 230_Stemming/30_Hunspell_stemmer.asciidoc
@@ -176,23 +176,6 @@ shards which use the same Hunspell analyzer share the same instance.
***********************************************

==== When to use Hunspell

In theory, the Hunspell stemmer promises accurate, configurable stemming. The
reality, sadly, falls short of the theory. The main problem is the difficulty
of finding high quality, up to date dictionaries, with friendly licenses. Most
dictionaries are incomplete and out of date.

Hunspell tends to stem quite aggressively, reducing every word to the shortest
form possible. While this does increase recall, it also reduces precision. Of
course, you can control the stemming process if you are willing to customize
your own dictionary, but that requires a lot of effort and research.

In practice, if a good algorithmic stemmer is available for your language, it
makes more sense to use that rather than Hunspell. It will be faster, consume
less memory and the results will generally be as good or better than with
Hunspell.

[[hunspell-dictionary-format]]
==== Hunspell dictionary format

118 changes: 118 additions & 0 deletions 230_Stemming/40_Choosing_a_stemmer.asciidoc
@@ -0,0 +1,118 @@
:ref: http://foo.com/
[[choosing-a-stemmer]]
=== Choosing a stemmer

The documentation for the
{ref}analysis-stemmer-tokenfilter.html[`stemmer`] token filter
lists multiple stemmers for some languages. For English we have:

[horizontal]
`english`::
The {ref}analysis-porterstem-tokenfilter.html[`porter_stem`] token filter.

`light_english`::
The {ref}analysis-kstem-tokenfilter.html[`kstem`] token filter.

`minimal_english`::
The `EnglishMinimalStemmer` in Lucene, which removes plurals.

`lovins`::
The {ref}analysis-snowball-tokenfilter.html[Snowball] based
http://snowball.tartarus.org/algorithms/lovins/stemmer.html[Lovins]
stemmer, the first stemmer ever produced.

`porter`::
The {ref}analysis-snowball-tokenfilter.html[Snowball] based
http://snowball.tartarus.org/algorithms/porter/stemmer.html[Porter] stemmer.

`porter2`::
The {ref}analysis-snowball-tokenfilter.html[Snowball] based
http://snowball.tartarus.org/algorithms/english/stemmer.html[Porter2] stemmer.

`possessive_english`::
The `EnglishPossessiveFilter` in Lucene, which removes `'s`.

Add to that list the Hunspell stemmer with the various English dictionaries
which are available.

One thing is for sure: whenever more than one solution exists for a problem,
it means that none of the solutions solves the problem adequately. This
certainly applies to stemming -- each stemmer uses a different approach which
overstems and understems words to a different degree.

The `stemmer` documentation page highlights the ``recommended'' stemmer for
each language in bold, usually because it offers a reasonable compromise
between performance and quality. That said, the recommended stemmer may not be
appropriate for all use cases. There is no single right answer to the question
of which is the ``best'' stemmer -- it depends very much on your requirements.
There are three factors to take into account when making a choice:
performance, quality, and degree.

[[stemmer-performance]]
==== Stemmer performance

Algorithmic stemmers are typically four or five times faster than Hunspell
stemmers. ``Hand crafted'' algorithmic stemmers are usually, but not always,
faster than their Snowball equivalents. For instance, the `porter_stem` token
filter is significantly faster than the Snowball implementation of the Porter
stemmer.

Hunspell stemmers have to load all words, prefixes and suffixes into memory,
which can consume a few megabytes of RAM. Algorithmic stemmers, on the other
hand, consist of a small amount of code and consume very little memory.

[[stemmer-quality]]
==== Stemmer quality

All languages, except Esperanto, are irregular. While more formal words tend
to follow a regular pattern, the most commonly used words often follow
irregular rules of their own. Some stemming algorithms have been developed over years of
research and produce reasonably high quality results. Others have been
assembled more quickly with less research and deal only with the most common
cases.

While Hunspell offers the promise of dealing precisely with irregular words,
it often falls short in practice. A dictionary stemmer is only as good as its
dictionary. If Hunspell comes across a word which isn't in its dictionary it
can do nothing with it. Hunspell requires an extensive, high quality, up to
date dictionary in order to produce good results -- dictionaries of this
calibre are few and far between. An algorithmic stemmer, on the other hand,
will happily deal with new words that didn't exist when the designer created
the algorithm.
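
The difference between the two approaches can be sketched with a toy example.
This is only an illustration under stated assumptions: `TOY_DICTIONARY`,
`dictionary_stem` and `algorithmic_stem` are hypothetical names, and neither
function reproduces Hunspell or any Lucene stemmer. The point is simply that a
lookup handles irregular forms it knows about but leaves unknown words
untouched, while a rule still applies to words it has never seen.

```python
# Toy illustration only: not Hunspell, and not any stemmer shipped with Lucene.

# Hypothetical, deliberately tiny dictionary mapping inflected forms to stems.
TOY_DICTIONARY = {
    "mice": "mouse",
    "feet": "foot",
    "foxes": "fox",
}


def dictionary_stem(word):
    """Look the word up; a word missing from the dictionary is returned as-is."""
    return TOY_DICTIONARY.get(word, word)


def algorithmic_stem(word):
    """Strip a plural ending by rule, with no knowledge of real words."""
    if word.endswith("es") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


# "mice" is irregular, "foxes" is regular, "blogs" stands in for a new word
# that the dictionary does not contain.
for word in ["mice", "foxes", "blogs"]:
    print(word, dictionary_stem(word), algorithmic_stem(word))
```

In this sketch the dictionary resolves `mice` but leaves `blogs` unchanged,
while the rule handles `blogs` but cannot do anything useful with `mice`.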

If a good algorithmic stemmer is available for your language, it makes sense
to use it rather than Hunspell. It will be faster, consume less memory and
will generally be as good or better than the Hunspell equivalent.

If accuracy and customizability are very important to you, and you need (and
have the resources) to maintain a custom dictionary, then Hunspell gives you
greater flexibility than the algorithmic stemmers. (See
<<controlling-stemming>> for customization techniques which can be used with
any stemmer.)

[[stemmer-degree]]
==== Stemmer degree

Different stemmers overstem and understem to a different degree. The `light_`
stemmers stem less aggressively than the standard stemmers, and the `minimal_`
stemmers less aggressively still. Hunspell stems aggressively.

Whether you want aggressive or light stemming depends on your use case. If
your search results are being consumed by a clustering algorithm, you may
prefer to match more widely (and, thus, stem more aggressively). If your
search results are intended for human consumption, lighter stemming usually
produces better results. Stemming nouns and adjectives is more important for
search than stemming verbs, but this also depends on the language.

The other factor to take into account is the size of your document corpus.
With a small corpus such as a catalog of 10,000 products, you probably want to
stem more aggressively to ensure that you match at least some documents. If
your corpus is large, then it is likely you will get good matches with lighter
stemming.
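
To make the trade-off concrete, here is a minimal sketch that counts how many
distinct terms survive under a light and an aggressive stemmer. Both stemmers
are hypothetical toys written for this example (`light_stem` strips only a
plural `s`, `aggressive_stem` strips a hand-picked list of suffixes); they are
not the `light_` or Hunspell stemmers themselves, and the word list is made up.

```python
# Toy stemmers for illustration only; they do not reproduce any real algorithm.

def light_stem(word):
    """Remove a trailing plural 's' and nothing else."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


# Hypothetical suffix list, longest forms first so they are tried first.
AGGRESSIVE_SUFFIXES = ("izations", "ization", "izers", "izer", "ics", "ic", "s")


def aggressive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in AGGRESSIVE_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def distinct_terms(stemmer, words):
    """Return the sorted set of terms that would end up in the index."""
    return sorted({stemmer(word) for word in words})


WORDS = ["organ", "organs", "organic", "organization", "organizations", "organizer"]

print("light:     ", distinct_terms(light_stem, WORDS))
print("aggressive:", distinct_terms(aggressive_stem, WORDS))
```

The aggressive toy collapses every word to a single term, so a search for
`organ` would also match documents about organizations: more matches, but
less precise ones. The light toy keeps four separate terms, which matches
fewer documents but keeps unrelated meanings apart.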

==== Making a choice

Start out with a recommended stemmer. If it works well enough, then there is
no need to change it. If it doesn't, then you will need to spend some time
investigating and comparing the stemmers available for your language in order to
find the one that suits your purposes best.
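
One way to set up such a comparison is to define one analyzer per candidate
stemmer in a scratch index, then run the same sample text through each
analyzer and compare the tokens. The sketch below only builds the settings
body; it does not talk to a cluster. The helper name
`build_comparison_settings`, the `_stemmer` and `_test` naming scheme, and the
choice of candidates are assumptions made for this example. It relies on the
`stemmer` token filter accepting a `language` parameter, as used elsewhere in
this chapter.

```python
import json

# Candidate English stemmers taken from the list earlier in this section.
CANDIDATES = ["english", "light_english", "minimal_english", "porter2"]


def build_comparison_settings(languages):
    """Build index settings with one custom analyzer per candidate stemmer."""
    filters = {}
    analyzers = {}
    for language in languages:
        filter_name = language + "_stemmer"  # hypothetical naming scheme
        filters[filter_name] = {"type": "stemmer", "language": language}
        analyzers[language + "_test"] = {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", filter_name],
        }
    return {"settings": {"analysis": {"filter": filters, "analyzer": analyzers}}}


body = build_comparison_settings(CANDIDATES)

# The printed JSON could be used as the request body when creating a test index.
print(json.dumps(body, indent=2))
```

With the index in place, each `*_test` analyzer can be exercised with the
`analyze` API against text drawn from your own documents.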
File renamed without changes.
