Moved the multilingual chapter to the beginning of languages
clintongormley committed May 23, 2014
1 parent cefced4 commit d7d3c8e
Showing 34 changed files with 462 additions and 73 deletions.
17 changes: 10 additions & 7 deletions 02_Dealing_with_language.asciidoc
@@ -7,11 +7,13 @@
[quote,Matt Groening]
``I know all those words, but that sentence makes no sense to me.''

In <<search-in-depth>> we covered the mechanics of search, but we didn't pay
much attention to the words themselves. It is not enough for full text search
to just match the exact words that the user has queried. Instead, we need to
Full text search is a battle between _precision_ -- returning as few
irrelevant documents as possible -- and _recall_ -- returning as many relevant
documents as possible. While matching only the exact words that the user has
queried would be precise, it is not enough. We would miss out on many
documents that the user would consider to be relevant. Instead, we need to
spread the net wider, to also search for words that are not exactly the same
as the original, but are related.
as the original but are related.

Wouldn't you expect a search for ``quick brown fox'' to match a document
containing ``fast brown foxes'', ``Johnny Walker'' to match ``Johnnie
@@ -46,8 +48,9 @@ There are several lines of attack:
that we know exists in the index, and a _did-you-mean_ suggester to
redirect users who may have mistyped a search term. See <<suggesters>>.

But before we can manipulate individual words, we need to divide text up into
words, which means that we need to know what constitutes a _word_, which we
will tackle in <<identifying-words>>.
Before we can manipulate individual words, we need to divide text up into
words, which means that we need to know what constitutes a _word_. We will
tackle this in <<identifying-words>>.

But first, let's take a look at how to get started quickly and easily.
--
12 changes: 0 additions & 12 deletions 200_Identifying_words.asciidoc

This file was deleted.

10 changes: 10 additions & 0 deletions 200_Language_analyzers.asciidoc
@@ -0,0 +1,10 @@
include::200_Language_analyzers/00_Intro.asciidoc[]

include::200_Language_analyzers/10_Using.asciidoc[]

include::200_Language_analyzers/20_Configuring.asciidoc[]

include::200_Language_analyzers/30_Multiple_languages.asciidoc[]



47 changes: 47 additions & 0 deletions 200_Language_analyzers/00_Intro.asciidoc
@@ -0,0 +1,47 @@
[[language-analyzers]]
== Language analyzers

Elasticsearch ships with a collection of language analyzers which provide
good, basic, out-of-the-box support for a number of the world's most common
languages:

Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese,
Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek,
Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian,
Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.

These analyzers typically perform four roles:

* Tokenize text into individual words:
+
`The quick brown foxes` -> [`The`, `quick`, `brown`, `foxes`]

* Lowercase tokens:
+
`The` -> `the`

* Remove common _stopwords_:
+
&#91;`The`, `quick`, `brown`, `foxes`] -> [`quick`, `brown`, `foxes`]

* Stem tokens to their root form:
+
`foxes` -> `fox`

Each analyzer may also apply other transformations specific to its language in
order to make words from that language more searchable:

* the `english` analyzer removes the possessive `'s`:
+
`John's` -> `john`

* the `french` analyzer removes _elisions_ like `l'` and `qu'` and
_diacritics_ like `¨` or `^`:
+
`l'église` -> `eglis`

* the `german` analyzer normalizes terms, replacing `ä` and `ae` with `a`, or
`ß` with `ss`, among others:
+
`äußerst` -> `ausserst`
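
As a quick sketch of the combined effect, we could pass our earlier example
through the `english` analyzer with the `_analyze` API. Given the rules
described above, we would expect the following tokens:

[source,js]
--------------------------------------------------
GET /_analyze?analyzer=english <1>
The quick brown foxes
--------------------------------------------------
<1> Emits tokens: `quick`, `brown`, `fox`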

101 changes: 101 additions & 0 deletions 200_Language_analyzers/10_Using.asciidoc
@@ -0,0 +1,101 @@
[[using-language-analyzers]]
=== Using language analyzers

The built-in language analyzers are available globally and don't need to be
configured before being used. They can be specified directly in the field
mapping:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "english" <1>
        }
      }
    }
  }
}
--------------------------------------------------
<1> The `title` field will use the `english` analyzer instead of the default
`standard` analyzer.

Of course, by passing text through the `english` analyzer, we lose
information:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?field=title <1>
I'm not happy about the foxes
--------------------------------------------------
<1> Emits tokens: `i'm`, `happi`, `about`, `fox`

We can't tell if the document mentions one `fox` or many `foxes`; the word
`not` is a stopword and is removed, so we can't tell whether the document is
happy about foxes or *not*. By using the `english` analyzer, we have increased
recall as we can match more loosely, but we have reduced our ability to rank
documents accurately.
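
For comparison, here is a sketch of the same sentence passed through the
`standard` analyzer, which leaves the stopwords and the plural `foxes`
untouched:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=standard <1>
I'm not happy about the foxes
--------------------------------------------------
<1> Emits tokens: `i'm`, `not`, `happy`, `about`, `the`, `foxes`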

To get the best of both worlds, we can use <<multi-fields,multi-fields>> to
index the `title` field twice: once with the `english` analyzer and once with
the `standard` analyzer:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": { <1>
          "type": "string",
          "fields": {
            "english": { <2>
              "type":     "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> The main `title` field uses the `standard` analyzer.
<2> The `title.english` sub-field uses the `english` analyzer.

With this mapping in place, we can index some test documents to demonstrate
how to use both fields at query time:

[source,js]
--------------------------------------------------
PUT /my_index/blog/1
{ "title": "I'm happy for this fox" }

PUT /my_index/blog/2
{ "title": "I'm not happy about my fox problem" }

GET /_search
{
  "query": {
    "multi_match": {
      "type":   "most_fields", <1>
      "query":  "not happy foxes",
      "fields": [ "title", "title.english" ]
    }
  }
}
--------------------------------------------------
<1> Use the <<most-fields,`most_fields`>> query type to match the
same text in as many fields as possible.

Even though neither of our documents contains the word `foxes`, both documents
are returned as results thanks to the word stemming on the `title.english`
field. The second document is ranked as more relevant, because the word `not`
matches on the `title` field.
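
To see why, we can sketch how the query string itself would be analyzed for
each field, assuming the mapping above: the `title` field keeps `not` and
`foxes` unchanged, while `title.english` removes the stopword and stems the
plural:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?field=title <1>
not happy foxes

GET /my_index/_analyze?field=title.english <2>
not happy foxes
--------------------------------------------------
<1> Emits tokens: `not`, `happy`, `foxes`
<2> Emits tokens: `happi`, `fox`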


65 changes: 65 additions & 0 deletions 200_Language_analyzers/20_Configuring.asciidoc
@@ -0,0 +1,65 @@
[[configuring-language-analyzers]]
=== Configuring language analyzers

While the language analyzers can be used out of the box without any
configuration, most of them do allow you to control aspects of their
behaviour, specifically:

Stem word exclusion::
+
Imagine, for instance, that users searching for the ``World Health
Organization'' are instead getting results for ``organ health''. The reason
for this confusion is that both ``organ'' and ``organization'' are stemmed to
the same root word: `organ`. Often this isn't a problem, but in this
particular collection of documents this leads to confusing results. We would
like to prevent the words `organization` and `organizations` from being
stemmed.

Custom stopwords::

The default list of stopwords used in English is:
+
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with
+
The unusual thing about `no` and `not` is that they invert the meaning of the
words that follow them. Perhaps we decide that these two words are important
and that we shouldn't treat them as stopwords.

In order to customize the behaviour of the `english` analyzer, we need to
create a custom analyzer which uses the `english` analyzer as its base, but
adds some configuration:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stem_exclusion": [ "organization", "organizations" ], <1>
          "stopwords": [ <2>
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
            "the", "their", "then", "there", "these", "they", "this", "to",
            "was", "will", "with"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english <3>
The World Health Organization does not sell organs.
--------------------------------------------------
<1> Prevents `organization` and `organizations` from being stemmed.
<2> Specifies a custom list of stopwords.
<3> Emits tokens `world`, `health`, `organization`, `doe`, `not`, `sell`, `organ`.

We will discuss stemming and stopwords in much more detail in <<stemming>> and
<<stopwords>> respectively.

12 changes: 12 additions & 0 deletions 200_Language_analyzers/30_Multiple_languages.asciidoc
@@ -0,0 +1,12 @@
[[multiple-languages]]
=== Handling multiple languages

If you only have to deal with a single language, count yourself lucky.
Finding the right strategy for handling documents written in several languages
can be challenging. There are three possible scenarios:

* each document contains text in a single language
* a document may contain more than one language, but each field contains
text in a single language
* a single field may contain a mixture of languages
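
In the second scenario, for instance, one possible approach is to give each
language its own field, each with the appropriate language analyzer. The
mapping below is just a sketch, with hypothetical `title_en` and `title_fr`
fields:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title_en": {
          "type":     "string",
          "analyzer": "english" <1>
        },
        "title_fr": {
          "type":     "string",
          "analyzer": "french" <2>
        }
      }
    }
  }
}
--------------------------------------------------
<1> English titles are indexed with the `english` analyzer.
<2> French titles are indexed with the `french` analyzer.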

21 changes: 21 additions & 0 deletions 210_Identifying_words.asciidoc
@@ -0,0 +1,21 @@
include::210_Identifying_words/00_Intro.asciidoc[]

include::210_Identifying_words/10_Standard_analyzer.asciidoc[]

include::210_Identifying_words/20_Standard_tokenizer.asciidoc[]

include::210_Identifying_words/30_ICU_plugin.asciidoc[]

include::210_Identifying_words/40_ICU_tokenizer.asciidoc[]

include::210_Identifying_words/50_Tidying_text.asciidoc[]

//////////////////

Compound words

language specific
- kuromoji
- chinese

//////////////////
@@ -15,7 +15,7 @@ constituent parts.
Asian languages are even more complex: some have no whitespace between words,
sentences or even paragraphs. Some words can be represented by a single
character, but the same single character, when placed next to other
characters, can form just a part of a longer word with a quite different
characters, can form just one part of a longer word with a quite different
meaning.

It should be obvious that there is no ``silver bullet'' analyzer that will
14 changes: 0 additions & 14 deletions 210_Token_normalization.asciidoc

This file was deleted.

20 changes: 0 additions & 20 deletions 220_Stemming.asciidoc

This file was deleted.

17 changes: 17 additions & 0 deletions 220_Token_normalization.asciidoc
@@ -0,0 +1,17 @@
include::220_Token_normalization/00_Intro.asciidoc[]

include::220_Token_normalization/10_Lowercasing.asciidoc[]

include::220_Token_normalization/20_Removing_diacritics.asciidoc[]

include::220_Token_normalization/30_Unicode_world.asciidoc[]

include::220_Token_normalization/40_Case_folding.asciidoc[]

include::220_Token_normalization/50_Character_folding.asciidoc[]

// TODO: Add normalization character filter with ngram tokenizer for decompounding german
// German ngrams should be 4, not 3

include::220_Token_normalization/60_Sorting_and_collations.asciidoc[]

File renamed without changes.
@@ -44,7 +44,7 @@ http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[_UnicodeSet
characters may be folded. For instance, to exclude the Swedish letters `å`,
`ä`, `ö`, ++Å++, `Ä` and `Ö` from folding, you would specify a character class
representing all Unicode characters, except for those letters: `[^åäöÅÄÖ]`
(`^` means ``except'').
(`^` means ``everything except'').

[source,js]
--------------------------------------------------
