From ed891a378746513069cb1ea58617d6b204a04548 Mon Sep 17 00:00:00 2001
From: Clinton Gormley
Date: Thu, 19 Jun 2014 15:01:16 +0200
Subject: [PATCH] Updated stemming to incorporate comments from Robert

---
 230_Stemming.asciidoc                         |   6 +-
 230_Stemming/00_Intro.asciidoc                |  11 +-
 230_Stemming/10_Algorithmic_stemmers.asciidoc |  41 ------
 230_Stemming/30_Hunspell_stemmer.asciidoc     |  17 ---
 230_Stemming/40_Choosing_a_stemmer.asciidoc   | 118 ++++++++++++++++++
 ...iidoc => 50_Controlling_stemming.asciidoc} |   0
 ....asciidoc => 60_Stemming_in_situ.asciidoc} |   0
 7 files changed, 129 insertions(+), 64 deletions(-)
 create mode 100644 230_Stemming/40_Choosing_a_stemmer.asciidoc
 rename 230_Stemming/{40_Controlling_stemming.asciidoc => 50_Controlling_stemming.asciidoc} (100%)
 rename 230_Stemming/{50_Stemming_in_situ.asciidoc => 60_Stemming_in_situ.asciidoc} (100%)

diff --git a/230_Stemming.asciidoc b/230_Stemming.asciidoc
index 5704099f6..94a7aab51 100644
--- a/230_Stemming.asciidoc
+++ b/230_Stemming.asciidoc
@@ -6,6 +6,8 @@ include::230_Stemming/20_Dictionary_stemmers.asciidoc[]

 include::230_Stemming/30_Hunspell_stemmer.asciidoc[]

-include::230_Stemming/40_Controlling_stemming.asciidoc[]
+include::230_Stemming/40_Choosing_a_stemmer.asciidoc[]

-include::230_Stemming/50_Stemming_in_situ.asciidoc[]
+include::230_Stemming/50_Controlling_stemming.asciidoc[]
+
+include::230_Stemming/60_Stemming_in_situ.asciidoc[]
diff --git a/230_Stemming/00_Intro.asciidoc b/230_Stemming/00_Intro.asciidoc
index 8612c363f..95000c1d2 100644
--- a/230_Stemming/00_Intro.asciidoc
+++ b/230_Stemming/00_Intro.asciidoc
@@ -62,13 +62,16 @@ them.

 Lemmatisation is a much more complicated and expensive process that needs
 to understand the context in which words appear in order to make decisions
-about what they mean. For now, stemmers are the best tools that we have
-available.
+about what they mean. In practice, stemming appears to be just as effective
+as lemmatisation, but with a much lower cost.

 **********************************************

-There are two types of stemmers available: algorithmic stemmers and dictionary
-stemmers.
+First we will discuss the two classes of stemmers available in Elasticsearch
+-- <<algorithmic-stemmers>> and <<dictionary-stemmers>> -- then look at how to
+choose the right stemmer for your needs in <<choosing-a-stemmer>>. Finally,
+we will discuss options for tailoring stemming in <<controlling-stemming>> and
+<<stemming-in-situ>>.
diff --git a/230_Stemming/10_Algorithmic_stemmers.asciidoc b/230_Stemming/10_Algorithmic_stemmers.asciidoc
index 9bb9b0837..c31e9443b 100644
--- a/230_Stemming/10_Algorithmic_stemmers.asciidoc
+++ b/230_Stemming/10_Algorithmic_stemmers.asciidoc
@@ -148,44 +148,3 @@ PUT /my_index
     `light_english` stemmer.
 <2> Added the `asciifolding` token filter.

-==== Choosing an algorithmic stemmer
-
-The documentation for the
-{ref}analysis-stemmer-tokenfilter.html[`stemmer` token filter]
-lists multiple stemmers for some languages. For Portuguese we have:
-
-* `portuguese`
-* `light_portuguese`
-* `minimal_portuguese`
-* `portuguese_rslp`
-
-For English we have:
-
-* `english`
-* `light_english`
-* `minimal_english`
-* `lovins`
-* `porter`
-* `porter2`
-* `possessive_english`
-
-One thing is for sure: whenever more than one solution exists for a problem,
-it means that none of the solutions solves the problem adequately. This
-certainly applies to stemming -- each stemmer is based on a different
-algorithm which overstems and understems words to a different degree.
-
-The {ref}analysis-stemmer-tokenfilter.html[`stemmer` token filter] reference
-documentation highlights the recommended choice for each language in bold,
-but the recommended stemmer may not be appropriate for all use cases. It is
-usually chosen because it offers a reasonable compromise between performance
-and accuracy. You may find that, for your particular use case, the
-recommended stemmer is either too aggressive or not aggressive enough, in
-which case you may want to try a different stemmer.
-
-The `light_` stemmers are less aggressive than the standard stemmers, and the
-`minimal_` stemmers are less aggressive still. The Snowball-based stemmers
-tend to be slower than the other hand-coded stemmers, although that very much
-depends upon the implementation.
-
-Choosing the ``best'' stemmer is largely a case of trying each one out and
-selecting the one that seems to produce the best results for your documents.
diff --git a/230_Stemming/30_Hunspell_stemmer.asciidoc b/230_Stemming/30_Hunspell_stemmer.asciidoc
index fc41fcbe1..86a90a59b 100644
--- a/230_Stemming/30_Hunspell_stemmer.asciidoc
+++ b/230_Stemming/30_Hunspell_stemmer.asciidoc
@@ -176,23 +176,6 @@ shards which use the same Hunspell analyzer share the same instance.

 ***********************************************

-==== When to use Hunspell
-
-In theory, the Hunspell stemmer promises accurate, configurable stemming. The
-reality, sadly, falls short of the theory. The main problem is the difficulty
-of finding high quality, up to date dictionaries, with friendly licenses. Most
-dictionaries are incomplete and out of date.
-
-Hunspell tends to stem quite aggressively, reducing every word to the shortest
-form possible. While this does increase recall, it also reduces precision. Of
-course, you can control the stemming process if you are willing to customize
-your own dictionary, but that requires a lot of effort and research.
-
-In practice, if a good algorithmic stemmer is available for your language, it
-makes more sense to use that rather than Hunspell. It will be faster, consume
-less memory and the results will generally be as good or better than with
-Hunspell.
-
 [[hunspell-dictionary-format]]
 ==== Hunspell dictionary format
diff --git a/230_Stemming/40_Choosing_a_stemmer.asciidoc b/230_Stemming/40_Choosing_a_stemmer.asciidoc
new file mode 100644
index 000000000..66ed8d8b6
--- /dev/null
+++ b/230_Stemming/40_Choosing_a_stemmer.asciidoc
@@ -0,0 +1,118 @@
+:ref: http://foo.com/
+[[choosing-a-stemmer]]
+=== Choosing a stemmer
+
+The documentation for the
+{ref}analysis-stemmer-tokenfilter.html[`stemmer`] token filter
+lists multiple stemmers for some languages. For English we have:
+
+[horizontal]
+`english`::
+    The {ref}analysis-porterstem-tokenfilter.html[`porter_stem`] token filter.
+
+`light_english`::
+    The {ref}analysis-kstem-tokenfilter.html[`kstem`] token filter.
+
+`minimal_english`::
+    The `EnglishMinimalStemmer` in Lucene, which removes plurals.
+
+`lovins`::
+    The {ref}analysis-snowball-tokenfilter.html[Snowball] based
+    http://snowball.tartarus.org/algorithms/lovins/stemmer.html[Lovins]
+    stemmer, the first stemmer ever produced.
+
+`porter`::
+    The {ref}analysis-snowball-tokenfilter.html[Snowball] based
+    http://snowball.tartarus.org/algorithms/porter/stemmer.html[Porter] stemmer.
+
+`porter2`::
+    The {ref}analysis-snowball-tokenfilter.html[Snowball] based
+    http://snowball.tartarus.org/algorithms/english/stemmer.html[Porter2] stemmer.
+
+`possessive_english`::
+    The `EnglishPossessiveFilter` in Lucene which removes `'s`.
+
+Add to that list the Hunspell stemmer with the various English dictionaries
+which are available.
+
+One thing is for sure: whenever more than one solution exists for a problem,
+it means that none of the solutions solves the problem adequately. This
+certainly applies to stemming -- each stemmer uses a different approach which
+overstems and understems words to a different degree.
+
+The `stemmer` documentation page highlights the ``recommended'' stemmer for
+each language in bold, usually because it offers a reasonable compromise
+between performance and quality. That said, the recommended stemmer may not be
+appropriate for all use cases. There is no single right answer to the question
+of which is the ``best'' stemmer -- it depends very much on your requirements.
+There are three factors to take into account when making a choice:
+performance, quality, and degree.
+
+[[stemmer-performance]]
+==== Stemmer performance
+
+Algorithmic stemmers are typically four or five times faster than Hunspell
+stemmers. ``Hand-crafted'' algorithmic stemmers are usually, but not always,
+faster than their Snowball equivalents. For instance, the `porter_stem` token
+filter is significantly faster than the Snowball implementation of the Porter
+stemmer.
+
+Hunspell stemmers have to load all words, prefixes and suffixes into memory,
+which can consume a few megabytes of RAM. Algorithmic stemmers, on the other
+hand, consist of a small amount of code and consume very little memory.
+
+[[stemmer-quality]]
+==== Stemmer quality
+
+All languages, except Esperanto, are irregular. While more formal words tend
+to follow a regular pattern, the most commonly used words often have their
+own irregular rules. Some stemming algorithms have been developed over years
+of research and produce reasonably high quality results. Others have been
+assembled more quickly with less research and deal only with the most common
+cases.
+
+While Hunspell offers the promise of dealing precisely with irregular words,
+it often falls short in practice. A dictionary stemmer is only as good as its
+dictionary. If Hunspell comes across a word which isn't in its dictionary, it
+can do nothing with it. Hunspell requires an extensive, high quality, up to
+date dictionary in order to produce good results -- dictionaries of this
+calibre are few and far between. An algorithmic stemmer, on the other hand,
+will happily deal with new words that didn't exist when the designer created
+the algorithm.
+
+If a good algorithmic stemmer is available for your language, it makes sense
+to use it rather than Hunspell. It will be faster, consume less memory and
+will generally be as good as or better than the Hunspell equivalent.
+
+If accuracy and customizability are very important to you, and you need (and
+have the resources) to maintain a custom dictionary, then Hunspell gives you
+greater flexibility than the algorithmic stemmers. (See
+<<controlling-stemming>> for customization techniques which can be used with
+any stemmer.)
+
+[[stemmer-degree]]
+==== Stemmer degree
+
+Different stemmers overstem and understem to a different degree. The `light_`
+stemmers stem less aggressively than the standard stemmers, and the `minimal_`
+stemmers less aggressively still. Hunspell stems aggressively.
+
+Whether you want aggressive or light stemming depends on your use case. If
+your search results are being consumed by a clustering algorithm, you may
+prefer to match more widely (and, thus, stem more aggressively). If your
+search results are intended for human consumption, lighter stemming usually
+produces better results. Stemming nouns and adjectives is more important for
+search than stemming verbs, but this also depends on the language.
+
+The other factor to take into account is the size of your document corpus.
+With a small corpus such as a catalog of 10,000 products, you probably want to
+stem more aggressively to ensure that you match at least some documents. If
+your corpus is large, then it is likely you will get good matches with lighter
+stemming.
+
+==== Making a choice
+
+Start out with a recommended stemmer. If it works well enough, then there is
+no need to change it. If it doesn't, then you will need to spend some time
+investigating and comparing the stemmers available for your language in order
+to find the one that suits your purposes best.
diff --git a/230_Stemming/40_Controlling_stemming.asciidoc b/230_Stemming/50_Controlling_stemming.asciidoc
similarity index 100%
rename from 230_Stemming/40_Controlling_stemming.asciidoc
rename to 230_Stemming/50_Controlling_stemming.asciidoc
diff --git a/230_Stemming/50_Stemming_in_situ.asciidoc b/230_Stemming/60_Stemming_in_situ.asciidoc
similarity index 100%
rename from 230_Stemming/50_Stemming_in_situ.asciidoc
rename to 230_Stemming/60_Stemming_in_situ.asciidoc
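
As a companion to the ``Choosing a stemmer'' section added above, here is a minimal sketch of how one of the listed stemmers -- `light_english` in this example -- can be wired into a custom analyzer with the `stemmer` token filter. The index name `my_index` and the filter and analyzer names are illustrative assumptions, not examples taken from the chapter itself:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "light_english_stemmer": {
          "type":     "stemmer",
          "language": "light_english" <1>
        }
      },
      "analyzer": {
        "my_english": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase", <2>
            "light_english_stemmer"
          ]
        }
      }
    }
  }
}
--------------------------------------------------
<1> Any other stemmer from the list -- `english`, `porter2`, `minimal_english`, and so forth -- could be substituted here to compare its degree of stemming.
<2> Stemmers expect lowercased tokens, so the `lowercase` filter runs first.

To compare candidates, re-create a scratch index with a different `language` value and inspect the resulting tokens with the analyze API, using a request along the lines of:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=my_english
The quick brown foxes jumped over the lazy dogs
--------------------------------------------------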