Moved the multilingual chapter to the beginning of languages
clintongormley committed May 23, 2014
1 parent cefced4 commit d7d3c8e
Showing 34 changed files with 462 additions and 73 deletions.
17 changes: 10 additions & 7 deletions 02_Dealing_with_language.asciidoc
@@ -7,11 +7,13 @@
[quote,Matt Groening]
``I know all those words, but that sentence makes no sense to me.''

In <<search-in-depth>> we covered the mechanics of search, but we didn't pay
much attention to the words themselves. It is not enough for full text search
to just match the exact words that the user has queried. Instead, we need to
Full text search is a battle between _precision_ -- returning as few
irrelevant documents as possible -- and _recall_ -- returning as many relevant
documents as possible. While matching only the exact words that the user has
queried would be precise, it is not enough. We would miss out on many
documents that the user would consider to be relevant. Instead, we need to
spread the net wider, to also search for words that are not exactly the same
as the original, but are related.
as the original but are related.

Wouldn't you expect a search for ``quick brown fox'' to match a document
containing ``fast brown foxes'', ``Johnny Walker'' to match ``Johnnie
@@ -46,8 +48,9 @@ There are several lines of attack:
that we know exists in the index, and a _did-you-mean_ suggester to
redirect users who may have mistyped a search term. See <<suggesters>>.

But before we can manipulate individual words, we need to divide text up into
words, which means that we need to know what constitutes a _word_, which we
will tackle in <<identifying-words>>.
Before we can manipulate individual words, we need to divide text up into
words, which means that we need to know what constitutes a _word_. We will
tackle this in <<identifying-words>>.

But first, let's take a look at how to get started quickly and easily.
--
12 changes: 0 additions & 12 deletions 200_Identifying_words.asciidoc

This file was deleted.

10 changes: 10 additions & 0 deletions 200_Language_analyzers.asciidoc
@@ -0,0 +1,10 @@
include::200_Language_analyzers/00_Intro.asciidoc[]

include::200_Language_analyzers/10_Using.asciidoc[]

include::200_Language_analyzers/20_Configuring.asciidoc[]

include::200_Language_analyzers/30_Multiple_languages.asciidoc[]



47 changes: 47 additions & 0 deletions 200_Language_analyzers/00_Intro.asciidoc
@@ -0,0 +1,47 @@
[[language-analyzers]]
== Language analyzers

Elasticsearch ships with a collection of language analyzers which provide
good, basic, out-of-the-box support for a number of the world's most common
languages:

Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese,
Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek,
Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian,
Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.

These analyzers typically perform four roles:

* Tokenize text into individual words:
+
`The quick brown foxes` -> [`The`, `quick`, `brown`, `foxes`]

* Lowercase tokens:
+
`The` -> `the`

* Remove common _stopwords_:
+
&#91;`The`, `quick`, `brown`, `foxes`] -> [`quick`, `brown`, `foxes`]

* Stem tokens to their root form:
+
`foxes` -> `fox`

Each analyzer may also apply other transformations specific to its language in
order to make words from that language more searchable:

* the `english` analyzer removes the possessive `'s`:
+
`John's` -> `john`

* the `french` analyzer removes _elisions_ like `l'` and `qu'` and
_diacritics_ like `¨` or `^`:
+
`l'église` -> `eglis`

* the `german` analyzer normalizes terms, replacing `ä` and `ae` with `a`, or
`ß` with `ss`, among others:
+
`äußerst` -> `ausserst`
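
As a quick sketch of the combined effect, we could pass our earlier example
through the `english` analyzer with the `_analyze` API. Given the rules
described above, we would expect the following tokens:

[source,js]
--------------------------------------------------
GET /_analyze?analyzer=english <1>
The quick brown foxes
--------------------------------------------------
<1> Emits tokens: `quick`, `brown`, `fox`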

101 changes: 101 additions & 0 deletions 200_Language_analyzers/10_Using.asciidoc
@@ -0,0 +1,101 @@
[[using-language-analyzers]]
=== Using language analyzers

The built-in language analyzers are available globally and don't need to be
configured before being used. They can be specified directly in the field
mapping:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "english" <1>
        }
      }
    }
  }
}
--------------------------------------------------
<1> The `title` field will use the `english` analyzer instead of the default
`standard` analyzer.

Of course, by passing text through the `english` analyzer, we lose
information:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?field=title <1>
I'm not happy about the foxes
--------------------------------------------------
<1> Emits tokens: `i'm`, `happi`, `about`, `fox`

We can't tell if the document mentions one `fox` or many `foxes`; the word
`not` is a stopword and is removed, so we can't tell whether the document is
happy about foxes or *not*. By using the `english` analyzer, we have increased
recall as we can match more loosely, but we have reduced our ability to rank
documents accurately.
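
For comparison, here is a sketch of the same sentence passed through the
`standard` analyzer, which leaves the stopwords and the plural `foxes`
untouched:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=standard <1>
I'm not happy about the foxes
--------------------------------------------------
<1> Emits tokens: `i'm`, `not`, `happy`, `about`, `the`, `foxes`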

To get the best of both worlds, we can use <<multi-fields,multi-fields>> to
index the `title` field twice: once with the `english` analyzer and once with
the `standard` analyzer:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": { <1>
          "type": "string",
          "fields": {
            "english": { <2>
              "type":     "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> The main `title` field uses the `standard` analyzer.
<2> The `title.english` sub-field uses the `english` analyzer.

With this mapping in place, we can index some test documents to demonstrate
how to use both fields at query time:

[source,js]
--------------------------------------------------
PUT /my_index/blog/1
{ "title": "I'm happy for this fox" }

PUT /my_index/blog/2
{ "title": "I'm not happy about my fox problem" }

GET /_search
{
  "query": {
    "multi_match": {
      "type":   "most_fields", <1>
      "query":  "not happy foxes",
      "fields": [ "title", "title.english" ]
    }
  }
}
--------------------------------------------------
<1> Use the <<most-fields,`most_fields`>> query type to match the
same text in as many fields as possible.

Even though neither of our documents contains the word `foxes`, both documents
are returned as results thanks to the word stemming on the `title.english`
field. The second document is ranked as more relevant, because the word `not`
matches on the `title` field.
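
To see why, we can sketch how the query string itself would be analyzed for
each field, assuming the mapping above: the `title` field keeps `not` and
`foxes` unchanged, while `title.english` removes the stopword and stems the
plural:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?field=title <1>
not happy foxes

GET /my_index/_analyze?field=title.english <2>
not happy foxes
--------------------------------------------------
<1> Emits tokens: `not`, `happy`, `foxes`
<2> Emits tokens: `happi`, `fox`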


65 changes: 65 additions & 0 deletions 200_Language_analyzers/20_Configuring.asciidoc
@@ -0,0 +1,65 @@
[[configuring-language-analyzers]]
=== Configuring language analyzers

While the language analyzers can be used out of the box without any
configuration, most of them do allow you to control aspects of their
behaviour, specifically:

Stem word exclusion::
+
Imagine, for instance, that users searching for the ``World Health
Organization'' are instead getting results for ``organ health''. The reason
for this confusion is that both ``organ'' and ``organization'' are stemmed to
the same root word: `organ`. Often this isn't a problem, but in this
particular collection of documents this leads to confusing results. We would
like to prevent the words `organization` and `organizations` from being
stemmed.

Custom stopwords::

The default list of stopwords used in English is:
+
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with
+
The unusual thing about `no` and `not` is that they invert the meaning of the
words that follow them. Perhaps we decide that these two words are important
and that we shouldn't treat them as stopwords.

In order to customize the behaviour of the `english` analyzer, we need to
create a custom analyzer which uses the `english` analyzer as its base, but
adds some configuration:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stem_exclusion": [ "organization", "organizations" ], <1>
          "stopwords": [ <2>
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
            "the", "their", "then", "there", "these", "they", "this", "to",
            "was", "will", "with"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english <3>
The World Health Organization does not sell organs.
--------------------------------------------------
<1> Prevents `organization` and `organizations` from being stemmed.
<2> Specifies a custom list of stopwords.
<3> Emits tokens `world`, `health`, `organization`, `doe`, `not`, `sell`, `organ`.

We will discuss stemming and stopwords in much more detail in <<stemming>> and
<<stopwords>> respectively.

12 changes: 12 additions & 0 deletions 200_Language_analyzers/30_Multiple_languages.asciidoc
@@ -0,0 +1,12 @@
[[multiple-languages]]
=== Handling multiple languages

If you only have to deal with a single language, count yourself lucky.
Finding the right strategy for handling documents written in several languages
can be challenging. There are three possible scenarios:

* each document contains text in a single language
* a document may contain more than one language, but each field contains
text in a single language
* a single field may contain a mixture of languages
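
In the second scenario, for instance, one possible approach is to give each
language its own field, each with the appropriate language analyzer. The
mapping below is just a sketch, with hypothetical `title_en` and `title_fr`
fields:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title_en": {
          "type":     "string",
          "analyzer": "english" <1>
        },
        "title_fr": {
          "type":     "string",
          "analyzer": "french" <2>
        }
      }
    }
  }
}
--------------------------------------------------
<1> English titles are indexed with the `english` analyzer.
<2> French titles are indexed with the `french` analyzer.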

21 changes: 21 additions & 0 deletions 210_Identifying_words.asciidoc
@@ -0,0 +1,21 @@
include::210_Identifying_words/00_Intro.asciidoc[]

include::210_Identifying_words/10_Standard_analyzer.asciidoc[]

include::210_Identifying_words/20_Standard_tokenizer.asciidoc[]

include::210_Identifying_words/30_ICU_plugin.asciidoc[]

include::210_Identifying_words/40_ICU_tokenizer.asciidoc[]

include::210_Identifying_words/50_Tidying_text.asciidoc[]

//////////////////

Compound words

language specific
- kuromoji
- chinese

//////////////////
@@ -15,7 +15,7 @@ constituent parts.
Asian languages are even more complex: some have no whitespace between words,
sentences or even paragraphs. Some words can be represented by a single
character, but the same single character, when placed next to other
characters, can form just a part of a longer word with a quite different
characters, can form just one part of a longer word with a quite different
meaning.

It should be obvious that there is no ``silver bullet'' analyzer that will
14 changes: 0 additions & 14 deletions 210_Token_normalization.asciidoc

This file was deleted.

20 changes: 0 additions & 20 deletions 220_Stemming.asciidoc

This file was deleted.

17 changes: 17 additions & 0 deletions 220_Token_normalization.asciidoc
@@ -0,0 +1,17 @@
include::220_Token_normalization/00_Intro.asciidoc[]

include::220_Token_normalization/10_Lowercasing.asciidoc[]

include::220_Token_normalization/20_Removing_diacritics.asciidoc[]

include::220_Token_normalization/30_Unicode_world.asciidoc[]

include::220_Token_normalization/40_Case_folding.asciidoc[]

include::220_Token_normalization/50_Character_folding.asciidoc[]

// TODO: Add normalization character filter with ngram tokenizer for decompounding german
// German ngrams should be 4, not 3

include::220_Token_normalization/60_Sorting_and_collations.asciidoc[]

File renamed without changes.
@@ -44,7 +44,7 @@ http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[_UnicodeSet
characters may be folded. For instance, to exclude the Swedish letters `å`,
`ä`, `ö`, ++Å++, `Ä` and `Ö` from folding, you would specify a character class
representing all Unicode characters, except for those letters: `[^åäöÅÄÖ]`
(`^` means ``except'').
(`^` means ``everything except'').

[source,js]
--------------------------------------------------
