forked from elasticsearch-cn/elasticsearch-definitive-guide
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Updated stemming to incorporate comments from Robert
1 parent
8d06ba4
commit ed891a3
Showing
7 changed files
with
129 additions
and
64 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,118 @@ | ||
:ref: http://foo.com/ | ||
[[choosing-a-stemmer]] | ||
=== Choosing a stemmer | ||
|
||
The documentation for the | ||
{ref}analysis-stemmer-tokenfilter.html[`stemmer`] token filter | ||
lists multiple stemmers for some languages. For English we have: | ||
|
||
[horizontal] | ||
`english`:: | ||
The {ref}analysis-porterstem-tokenfilter.html[`porter_stem`] token filter. | ||
|
||
`light_english`:: | ||
The {ref}analysis-kstem-tokenfilter.html[`kstem`] token filter. | ||
|
||
`minimal_english`:: | ||
The `EnglishMinimalStemmer` in Lucene, which removes plurals. | ||
|
||
`lovins`:: | ||
The {ref}analysis-snowball-tokenfilter.html[Snowball] based | ||
http://snowball.tartarus.org/algorithms/lovins/stemmer.html[Lovins] | ||
stemmer, the first stemmer ever produced. | ||
|
||
`porter`:: | ||
The {ref}analysis-snowball-tokenfilter.html[Snowball] based | ||
http://snowball.tartarus.org/algorithms/porter/stemmer.html[Porter] stemmer. | ||
|
||
`porter2`:: | ||
The {ref}analysis-snowball-tokenfilter.html[Snowball] based | ||
http://snowball.tartarus.org/algorithms/english/stemmer.html[Porter2] stemmer. | ||
|
||
`possessive_english`:: | ||
The `EnglishPossessiveFilter` in Lucene, which removes `'s`. | ||
|
||
Add to that list the Hunspell stemmer with the various English dictionaries | ||
which are available. | ||
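
As a minimal sketch of how one of these names is used, the snippet below builds index settings that wire a `stemmer` token filter into a custom analyzer. The helper `stemmer_settings` and the names `my_stemmer` and `my_english` are hypothetical, chosen only for illustration; the `type` and `language` keys are the parameters of the `stemmer` token filter.

```python
import json


def stemmer_settings(language, filter_name="my_stemmer", analyzer_name="my_english"):
    """Build index settings for a custom analyzer that uses the `stemmer` token filter.

    `filter_name` and `analyzer_name` are arbitrary, illustrative names.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # `language` selects one of the stemmers listed above.
                    filter_name: {"type": "stemmer", "language": language}
                },
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first, then stem.
                        "filter": ["lowercase", filter_name],
                    }
                },
            }
        }
    }


body = stemmer_settings("light_english")
print(json.dumps(body, indent=2))
```

The printed JSON could be used as the body of an index creation request. Switching to another stemmer only requires changing the `language` value.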
|
||
One thing is for sure: whenever more than one solution exists for a problem, | ||
it means that none of the solutions solves the problem adequately. This | ||
certainly applies to stemming -- each stemmer uses a different approach which | ||
overstems and understems words to a different degree. | ||
|
||
The `stemmer` documentation page highlights the ``recommended'' stemmer for | ||
each language in bold, usually because it offers a reasonable compromise | ||
between performance and quality. That said, the recommended stemmer may not be | ||
appropriate for all use cases. There is no single right answer to the question | ||
of which is the ``best'' stemmer -- it depends very much on your requirements. | ||
There are three factors to take into account when making a choice: | ||
performance, quality, and degree. | ||
|
||
[[stemmer-performance]] | ||
==== Stemmer performance | ||
|
||
Algorithmic stemmers are typically four or five times faster than Hunspell | ||
stemmers. ``Hand crafted'' algorithmic stemmers are usually, but not always, | ||
faster than their Snowball equivalents. For instance, the `porter_stem` token | ||
filter is significantly faster than the Snowball implementation of the Porter | ||
stemmer. | ||
|
||
Hunspell stemmers have to load all words, prefixes and suffixes into memory, | ||
which can consume a few megabytes of RAM. Algorithmic stemmers, on the other | ||
hand, consist of a small amount of code and consume very little memory. | ||
|
||
[[stemmer-quality]] | ||
==== Stemmer quality | ||
|
||
All languages, except Esperanto, are irregular. While more formal words tend | ||
to follow a regular pattern, the most commonly used words often follow | ||
irregular rules. Some stemming algorithms have been developed over years of | ||
research and produce reasonably high quality results. Others have been | ||
assembled more quickly with less research and deal only with the most common | ||
cases. | ||
|
||
While Hunspell offers the promise of dealing precisely with irregular words, | ||
it often falls short in practice. A dictionary stemmer is only as good as its | ||
dictionary. If Hunspell comes across a word that isn't in its dictionary, it | ||
can do nothing with it. Hunspell requires an extensive, high-quality, | ||
up-to-date dictionary in order to produce good results -- dictionaries of this | ||
calibre are few and far between. An algorithmic stemmer, on the other hand, | ||
will happily deal with new words that didn't exist when the designer created | ||
the algorithm. | ||
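
The difference can be sketched with two toy stemmers. Neither is a real implementation: `TOY_DICTIONARY` stands in for a Hunspell dictionary, and `algorithmic_stem` applies a single crude rule that is far simpler than any stemmer shipped with Lucene. Both names are made up for this illustration.

```python
# Toy dictionary mapping known inflected forms to their root.
# It stands in for a (much larger) Hunspell dictionary.
TOY_DICTIONARY = {"mice": "mouse", "feet": "foot", "cats": "cat"}


def dictionary_stem(word):
    # A dictionary stemmer can only stem words it knows;
    # anything else passes through unchanged.
    return TOY_DICTIONARY.get(word, word)


def algorithmic_stem(word):
    # One crude rule: strip a trailing "s" (but not "ss")
    # from words longer than three letters.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


for word in ["mice", "cats", "selfies"]:
    print(word, "->", dictionary_stem(word), "/", algorithmic_stem(word))
```

The dictionary handles the irregular `mice` correctly but leaves the unknown word `selfies` untouched, while the rule handles `selfies` but has no way of knowing about `mice`.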
|
||
If a good algorithmic stemmer is available for your language, it makes sense | ||
to use it rather than Hunspell. It will be faster, consume less memory, and | ||
will generally be as good or better than the Hunspell equivalent. | ||
|
||
If accuracy and customizability are very important to you, and you need (and | ||
have the resources) to maintain a custom dictionary, then Hunspell gives you | ||
greater flexibility than the algorithmic stemmers. (See | ||
<<controlling-stemming>> for customization techniques which can be used with | ||
any stemmer.) | ||
|
||
[[stemmer-degree]] | ||
==== Stemmer degree | ||
|
||
Different stemmers overstem and understem to a different degree. The `light_` | ||
stemmers stem less aggressively than the standard stemmers, and the `minimal_` | ||
stemmers less aggressively still. Hunspell stems aggressively. | ||
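
To make the idea of degree concrete, here is a toy comparison. The suffix lists `MINIMAL` and `AGGRESSIVE` are invented for this sketch and do not reproduce the rules of `minimal_english`, Porter, or any other real stemmer; they only show how stripping more suffixes merges more words into the same term.

```python
from collections import defaultdict

# Hypothetical rule sets, ordered longest suffix first.
MINIMAL = ["s"]
AGGRESSIVE = ["izations", "ization", "ized", "izes", "ize", "s"]


def strip_suffixes(word, suffixes, min_stem=3):
    # Remove the first matching suffix, provided at least
    # `min_stem` letters remain afterwards.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def group_by_stem(words, suffixes):
    # Collect the words that end up indexed as the same term.
    groups = defaultdict(list)
    for word in words:
        groups[strip_suffixes(word, suffixes)].append(word)
    return dict(groups)


WORDS = ["organs", "organize", "organization", "organizations"]

print("minimal:   ", group_by_stem(WORDS, MINIMAL))
print("aggressive:", group_by_stem(WORDS, AGGRESSIVE))
```

With the minimal rules the four words produce three distinct terms. With the aggressive rules they all collapse to `organ`, so a search for `organs` would also match documents about organizations, which is a typical case of overstemming.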
|
||
Whether you want aggressive or light stemming depends on your use case. If | ||
your search results are being consumed by a clustering algorithm, you may | ||
prefer to match more widely (and, thus, stem more aggressively). If your | ||
search results are intended for human consumption, lighter stemming usually | ||
produces better results. Stemming nouns and adjectives is more important for | ||
search than stemming verbs, but this also depends on the language. | ||
|
||
The other factor to take into account is the size of your document corpus. | ||
With a small corpus such as a catalog of 10,000 products, you probably want to | ||
stem more aggressively to ensure that you match at least some documents. If | ||
your corpus is large, then it is likely you will get good matches with lighter | ||
stemming. | ||
|
||
==== Making a choice | ||
|
||
Start out with a recommended stemmer. If it works well enough, then there is | ||
no need to change it. If it doesn't, then you will need to spend some time | ||
investigating and comparing the stemmers available for your language in order to | ||
find the one that suits your purposes best. |
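
One way to compare candidates is to run the same sample text through each stemmer and inspect the resulting tokens. The sketch below only builds the request bodies; it assumes an Elasticsearch version whose `_analyze` API accepts inline token filter definitions in the request body, and it leaves sending the requests to whatever client you use. `CANDIDATES`, `SAMPLE`, and `analyze_body` are hypothetical names.

```python
import json

# Stemmers to compare; adjust to the ones available for your language.
CANDIDATES = ["english", "light_english", "minimal_english", "porter2"]

# Sample text; in practice, use vocabulary taken from your own documents.
SAMPLE = "organizations organizing organized organs"


def analyze_body(language, text):
    """Build an `_analyze` request body that applies one stemmer to `text`."""
    return {
        "tokenizer": "standard",
        "filter": [
            "lowercase",
            # Inline definition of a `stemmer` token filter (assumed to be
            # supported by the target Elasticsearch version).
            {"type": "stemmer", "language": language},
        ],
        "text": text,
    }


for language in CANDIDATES:
    print("GET /_analyze")
    print(json.dumps(analyze_body(language, SAMPLE)))
```

Comparing the tokens returned for each request shows which words each stemmer treats as the same term, which is usually more informative than reading the algorithm descriptions.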
File renamed without changes.
File renamed without changes.