forked from elasticsearch-cn/elasticsearch-definitive-guide
Commit
Moved the multilingual chapter to the beginning of languages
1 parent cefced4 · commit d7d3c8e
Showing 34 changed files with 462 additions and 73 deletions.
This file was deleted.
@@ -0,0 +1,10 @@
include::200_Language_analyzers/00_Intro.asciidoc[]

include::200_Language_analyzers/10_Using.asciidoc[]

include::200_Language_analyzers/20_Configuring.asciidoc[]

include::200_Language_analyzers/30_Multiple_languages.asciidoc[]

@@ -0,0 +1,47 @@
[[language-analyzers]]
== Language analyzers

Elasticsearch ships with a collection of language analyzers which provide
good, basic, out-of-the-box support for a number of the world's most common
languages:

Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese,
Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek,
Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian,
Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.

These analyzers typically perform four roles (all four are shown together in
the sketch after this list):

* Tokenize text into individual words:
+
`The quick brown foxes` -> [`The`, `quick`, `brown`, `foxes`]

* Lowercase tokens:
+
`The` -> `the`

* Remove common _stopwords_:
+
[`The`, `quick`, `brown`, `foxes`] -> [`quick`, `brown`, `foxes`]

* Stem tokens to their root form:
+
`foxes` -> `fox`

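A quick way to see all four roles at once is to pass a sentence through the
`_analyze` API with the `english` analyzer, the same API used later in this
chapter. This is a minimal sketch; the callout shows the tokens you should
expect rather than a captured response:

[source,js]
--------------------------------------------------
GET /_analyze?analyzer=english <1>
The quick brown foxes
--------------------------------------------------
<1> Emits tokens: `quick`, `brown`, `fox`. `The` is lowercased and then
    removed as a stopword, and `foxes` is stemmed to `fox`.
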
Each analyzer may also apply other transformations specific to its language in
order to make words from that language more searchable:

* the `english` analyzer removes the possessive `'s`:
+
`John's` -> `john`

* the `french` analyzer removes _elisions_ like `l'` and `qu'` and
  _diacritics_ like `¨` or `^`:
+
`l'église` -> `eglis`

* the `german` analyzer normalizes terms, replacing `ä` and `ae` with `a`, or
  `ß` with `ss`, among others:
+
`äußerst` -> `ausserst`

@@ -0,0 +1,101 @@
[[using-language-analyzers]]
=== Using language analyzers

The built-in language analyzers are available globally and don't need to be
configured before being used. They can be specified directly in the field
mapping:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english" <1>
        }
      }
    }
  }
}
--------------------------------------------------
<1> The `title` field will use the `english` analyzer instead of the default
    `standard` analyzer.

Of course, by passing text through the `english` analyzer, we lose
information:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?field=title <1>
I'm not happy about the foxes
--------------------------------------------------
<1> Emits tokens: `i'm`, `happi`, `about`, `fox`

We can't tell if the document mentions one `fox` or many `foxes`; the word
`not` is a stopword and is removed, so we can't tell whether the document is
happy about foxes or *not*. By using the `english` analyzer, we have increased
recall as we can match more loosely, but we have reduced our ability to rank
documents accurately.

To get the best of both worlds, we can use <<multi-fields,multi-fields>> to
index the `title` field twice: once with the `english` analyzer and once with
the `standard` analyzer:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": { <1>
          "type": "string",
          "fields": {
            "english": { <2>
              "type": "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> The main `title` field uses the `standard` analyzer.
<2> The `title.english` sub-field uses the `english` analyzer.

With this mapping in place, we can index some test documents to demonstrate
how to use both fields at query time:

[source,js]
--------------------------------------------------
PUT /my_index/blog/1
{ "title": "I'm happy for this fox" }

PUT /my_index/blog/2
{ "title": "I'm not happy about my fox problem" }

GET /_search
{
  "query": {
    "multi_match": {
      "type":   "most_fields", <1>
      "query":  "not happy foxes",
      "fields": [ "title", "title.english" ]
    }
  }
}
--------------------------------------------------
<1> Use the <<most-fields,`most_fields`>> query type to match the
    same text in as many fields as possible.

Even though neither of our documents contains the word `foxes`, both documents
are returned as results thanks to the word stemming on the `title.english`
field. The second document is ranked as more relevant, because the word `not`
matches on the `title` field.
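
One rough way to see how much of that ranking difference comes from the
`title` field, not part of the original example and with scores that will
vary by setup, is to run the same query against the stemmed sub-field on its
own:

[source,js]
--------------------------------------------------
GET /_search
{
  "query": {
    "multi_match": {
      "type":   "most_fields",
      "query":  "not happy foxes",
      "fields": [ "title.english" ] <1>
    }
  }
}
--------------------------------------------------
<1> With only `title.english` queried, `not` is removed as a stopword on both
    the query and the document side, so both documents match on `happi` and
    `fox` alone and the second document no longer gets a boost for `not`.
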
@@ -0,0 +1,65 @@
[[configuring-language-analyzers]]
=== Configuring language analyzers

While the language analyzers can be used out of the box without any
configuration, most of them do allow you to control aspects of their
behaviour, specifically:

Stem word exclusion::
+
Imagine, for instance, that users searching for the ``World Health
Organization'' are instead getting results for ``organ health''. The reason
for this confusion is that both ``organ'' and ``organization'' are stemmed to
the same root word: `organ`. Often this isn't a problem, but in this
particular collection of documents this leads to confusing results. We would
like to prevent the words `organization` and `organizations` from being
stemmed.

Custom stopwords::

The default list of stopwords used in English is:
+
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with
+
The unusual thing about `no` and `not` is that they invert the meaning of the
words that follow them. Perhaps we decide that these two words are important
and that we shouldn't treat them as stopwords.

In order to customize the behaviour of the `english` analyzer, we need to
create a custom analyzer which uses the `english` analyzer as its base, but
adds some configuration:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ], <1>
          "stopwords": [ <2>
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
            "the", "their", "then", "there", "these", "they", "this", "to",
            "was", "will", "with"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english <3>
The World Health Organization does not sell organs.
--------------------------------------------------
<1> Prevents `organization` and `organizations` from being stemmed.
<2> Specifies a custom list of stopwords, with `no` and `not` left out.
<3> Emits tokens `world`, `health`, `organization`, `doe`, `not`, `sell`, `organ`.
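
For comparison, analyzing the same sentence with the unconfigured `english`
analyzer shows what the two settings change. This is a sketch of the expected
output rather than a captured response, so verify it against your own cluster:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=english <1>
The World Health Organization does not sell organs.
--------------------------------------------------
<1> Should emit tokens `world`, `health`, `organ`, `doe`, `sell`, `organ`:
    `organization` and `organs` collapse to the same stem, and `not` is
    dropped as a stopword, which is exactly what the `stem_exclusion` and
    `stopwords` settings above are there to prevent.
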
We will discuss stemming and stopwords in much more detail in <<stemming>> and
<<stopwords>> respectively.

@@ -0,0 +1,12 @@
[[multiple-languages]]
=== Handling multiple languages

If you only have to deal with a single language, count yourself lucky.
Finding the right strategy for handling documents written in several languages
can be challenging. There are three possible scenarios:

* each document contains text in a single language
* a document may contain more than one language, but each field contains
  text in a single language (sketched briefly below)
* a single field may contain a mixture of languages
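
For the second scenario, one natural starting point is simply to give each
field its own analyzer in the mapping. The field names below (`title_en`,
`title_fr`) are hypothetical and not taken from the chapters that follow;
this is only a minimal sketch of the idea:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title_en": {
          "type": "string",
          "analyzer": "english" <1>
        },
        "title_fr": {
          "type": "string",
          "analyzer": "french" <2>
        }
      }
    }
  }
}
--------------------------------------------------
<1> English text is indexed into `title_en` with the `english` analyzer.
<2> French text is indexed into `title_fr` with the `french` analyzer.
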
@@ -0,0 +1,21 @@
include::210_Identifying_words/00_Intro.asciidoc[]

include::210_Identifying_words/10_Standard_analyzer.asciidoc[]

include::210_Identifying_words/20_Standard_tokenizer.asciidoc[]

include::210_Identifying_words/30_ICU_plugin.asciidoc[]

include::210_Identifying_words/40_ICU_tokenizer.asciidoc[]

include::210_Identifying_words/50_Tidying_text.asciidoc[]

//////////////////

Compound words

language specific
- kuromoji
- chinese

//////////////////
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
This file was deleted.
This file was deleted.
@@ -0,0 +1,17 @@
include::220_Token_normalization/00_Intro.asciidoc[]

include::220_Token_normalization/10_Lowercasing.asciidoc[]

include::220_Token_normalization/20_Removing_diacritics.asciidoc[]

include::220_Token_normalization/30_Unicode_world.asciidoc[]

include::220_Token_normalization/40_Case_folding.asciidoc[]

include::220_Token_normalization/50_Character_folding.asciidoc[]

// TODO: Add normalization character filter with ngram tokenizer for decompounding german
// German ngrams should be 4, not 3

include::220_Token_normalization/60_Sorting_and_collations.asciidoc[]
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.