forked from elasticsearch-cn/elasticsearch-definitive-guide
Commit 641b303 (1 parent: 5701e7c). Showing 9 changed files with 440 additions and 7 deletions.

@@ -0,0 +1,39 @@
[[user-based]]
=== User-based data

Often, users start using Elasticsearch because they need to add full text
search or analytics to an existing application. They create a single index
which holds all of their documents. Gradually, others in the company realise
how much benefit Elasticsearch brings, and they want to add their data to
Elasticsearch as well.

Fortunately, Elasticsearch supports
http://en.wikipedia.org/wiki/Multitenancy[multitenancy] so each new user can
have their own index in the same cluster. Occasionally, somebody will want to
search across the documents for all users, which they can do by searching
across all indices, but most of the time, users are only interested in their
own documents.

Some users have more documents than others and some users will have heavier
search loads than others, so the ability to specify how many primary shards
and replica shards each index should have fits well with the index-per-user
model. Similarly, busier indices can be allocated to stronger boxes with shard
allocation filtering. (See <<migrate-indices>>.)

TIP: Don't just use the default number of primary shards for every index.
Think about how much data that index needs to hold. It may be that all you
need is one shard -- any more is a waste of resources.

Most users of Elasticsearch can stop here. A simple index-per-user approach
is sufficient for the majority of cases.

In exceptional cases, you may find that you need to support a large number of
users, all with similar needs. An example might be hosting a search engine
for thousands of email forums. Some forums may have a huge amount of traffic,
but the majority of forums are quite small. Dedicating an index with a single
shard to a small forum is overkill -- a single shard could hold the data for
many forums.

What we need is a way to share resources across users, to give the impression
that each user has their own index without wasting resources on small users.

@@ -0,0 +1,152 @@
[[shared-index]]
=== Shared index

We can use a large shared index for the many smaller forums by indexing
the forum identifier in a field and using it as a filter:

[source,json]
------------------------------
PUT /forums
{
    "settings": {
        "number_of_shards": 10 <1>
    },
    "mappings": {
        "post": {
            "properties": {
                "forum_id": { <2>
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}

PUT /forums/post/1
{
    "forum_id": "baking", <2>
    "title": "Easy recipe for ginger nuts",
    ...
}
------------------------------
<1> Create an index large enough to hold thousands of smaller forums.
<2> Each post must include a `forum_id` to identify which forum it belongs
to.

We can use the `forum_id` as a filter to search within a single forum. The
filter will exclude most of the documents in the index (those from other
forums) and filter caching will ensure that responses are fast:

[source,json]
------------------------------
GET /forums/post/_search
{
    "query": {
        "filtered": {
            "query": {
                "match": {
                    "title": "ginger nuts"
                }
            },
            "filter": {
                "term": { <1>
                    "forum_id": "baking"
                }
            }
        }
    }
}
------------------------------
<1> The `term` filter is cached by default.

This approach works, but we can do better. The posts from a single forum
would fit easily onto one shard but currently they are scattered across all 10
shards in the index. This means that every search request has to be forwarded
to a primary or replica of all 10 shards. What would be ideal is to ensure
that all the posts from a single forum are stored on the same shard.

In <<routing-value>>, we explained that a document is allocated to a
particular shard using this formula:

    shard = hash(routing) % number_of_primary_shards

The `routing` value defaults to the document's `_id`, but we can override that
and provide our own custom routing value, such as the `forum_id`. All
documents with the same `routing` value will be stored on the same shard:

[source,json]
------------------------------
PUT /forums/post/1?routing=baking <1>
{
    "forum_id": "baking", <1>
    "title": "Easy recipe for ginger nuts",
    ...
}
------------------------------
<1> Using the `forum_id` as the routing value ensures that all posts from the
same forum are stored on the same shard.
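
The routing formula can be illustrated with a minimal Python sketch. The function name `shard_for` is hypothetical, and `zlib.crc32` is only a stand-in hash chosen for illustration; it is not the hash function Elasticsearch uses internally, so the shard numbers it produces will not match a real cluster.

```python
import zlib


def shard_for(routing: str, number_of_primary_shards: int) -> int:
    """Apply shard = hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash for illustration only (assumption);
    Elasticsearch uses its own hash function internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# (post _id, forum_id) pairs; the forum_id is used as the routing value.
posts = [("1", "baking"), ("2", "baking"), ("3", "cooking")]

# Posts sharing a routing value always map to the same shard number.
placement = {post_id: shard_for(forum_id, 10) for post_id, forum_id in posts}
print(placement)
```

Because the result depends only on the routing value and the number of primary shards, posts 1 and 2 are always placed together, whatever their `_id` values are.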

When we search for posts in a particular forum, we can pass the same `routing`
value to ensure that the search request is only run on the single shard that
holds our documents:

[source,json]
------------------------------
GET /forums/post/_search?routing=baking <1>
{
    "query": {
        "filtered": {
            "query": {
                "match": {
                    "title": "ginger nuts"
                }
            },
            "filter": {
                "term": { <2>
                    "forum_id": "baking"
                }
            }
        }
    }
}
------------------------------
<1> The query is only run on the shard that corresponds to this `routing` value.
<2> We still need the filter, as a single shard can hold posts from many forums.

Multiple forums can be queried by passing a comma-separated list of `routing`
values, and including each `forum_id` in a `terms` filter:

[source,json]
------------------------------
GET /forums/post/_search?routing=baking,cooking,recipes
{
    "query": {
        "filtered": {
            "query": {
                "match": {
                    "title": "ginger nuts"
                }
            },
            "filter": {
                "terms": {
                    "forum_id": [ "baking", "cooking", "recipes" ]
                }
            }
        }
    }
}
------------------------------
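
An application would typically wrap this pattern in a helper so that the routing values and the filter cannot drift apart. The following Python sketch only builds the request path and body; `forum_search` is a hypothetical name, and sending the request is left to whatever HTTP client is in use.

```python
import json


def forum_search(forum_ids, text):
    """Build the path and body for a search limited to the given forums.

    Hypothetical helper: the same forum IDs feed both the `routing`
    parameter and the `terms` filter, so they always stay in sync.
    Forum IDs are assumed to need no URL encoding.
    """
    forum_ids = list(forum_ids)
    if not forum_ids:
        raise ValueError("at least one forum_id is required")
    path = "/forums/post/_search?routing=" + ",".join(forum_ids)
    body = {
        "query": {
            "filtered": {
                "query": {"match": {"title": text}},
                "filter": {"terms": {"forum_id": forum_ids}},
            }
        }
    }
    return path, body


path, body = forum_search(["baking", "cooking", "recipes"], "ginger nuts")
print("GET", path)
print(json.dumps(body, indent=4))
```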

While this approach is technically efficient, it looks a bit clumsy because of
the need to specify `routing` values and `terms` filters on every query or
indexing request. Things look a lot better once we add index aliases into the
mix.

@@ -0,0 +1,72 @@
[[faking-it]]
=== Faking index-per-user with aliases

To keep things simple and clean, we would like our application to believe that
we have a dedicated index per user -- or per forum in our example -- even if
the reality is that we are using one big <<shared-index,shared index>>. To do
that, we need some way to hide the `routing` value and the filter on
`forum_id`.

Index aliases allow us to do just that. When you associate an alias with an
index, you can also specify a filter and routing values:

[source,json]
------------------------------
PUT /forums/_alias/baking
{
    "routing": "baking",
    "filter": {
        "term": {
            "forum_id": "baking"
        }
    }
}
------------------------------

Now, we can treat the `baking` alias as if it were its own index. Documents
indexed into the `baking` alias automatically get the custom routing value
applied:

[source,json]
------------------------------
PUT /baking/post/1 <1>
{
    "forum_id": "baking", <1>
    "title": "Easy recipe for ginger nuts",
    ...
}
------------------------------
<1> We still need the `forum_id` field for the filter to work, but
the custom routing value is now implicit.

Queries run against the `baking` alias are run just on the shard associated
with the custom routing value, and the results are automatically filtered by
the filter we specified:

[source,json]
------------------------------
GET /baking/post/_search
{
    "query": {
        "match": {
            "title": "ginger nuts"
        }
    }
}
------------------------------

Multiple aliases can be specified when searching across multiple forums:

[source,json]
------------------------------
GET /baking,recipes/post/_search <1>
{
    "query": {
        "match": {
            "title": "ginger nuts"
        }
    }
}
------------------------------
<1> Both `routing` values are applied, and results can match either filter.
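
With thousands of forums, the aliases would normally be generated rather than written by hand. The Python sketch below builds a single `POST /_aliases` request body that adds one filtered, routed alias per forum. The helper name `alias_actions` is hypothetical, and the sketch assumes that each `add` action accepts the same `routing` and `filter` keys as the `PUT /forums/_alias/baking` body shown earlier.

```python
import json


def alias_actions(index, forum_ids):
    """Build a POST /_aliases body with one alias per forum.

    Hypothetical helper. Each alias is named after its forum and carries
    the forum's routing value plus a term filter on forum_id (assumes the
    `add` action accepts `routing` and `filter`).
    """
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": forum_id,
                    "routing": forum_id,
                    "filter": {"term": {"forum_id": forum_id}},
                }
            }
            for forum_id in forum_ids
        ]
    }


body = alias_actions("forums", ["baking", "recipes"])
print("POST /_aliases")
print(json.dumps(body, indent=4))
```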

@@ -0,0 +1,67 @@
[[one-big-user]]
=== One big user

Big popular forums start out as small forums. One day we will find that one
shard in our shared index is doing a lot more work than the other shards,
because it holds the documents for a forum which has become very popular. That
forum now needs its own index.

The index aliases that we're using to fake an index-per-user give us a clean
migration path for the big forum.

The first step is to create a new index dedicated to the forum, and with the
appropriate number of shards to allow for expected growth:

[source,json]
------------------------------
PUT /baking_v1
{
    "settings": {
        "number_of_shards": 3
    }
}
------------------------------

The next step is to migrate the data from the shared index into the dedicated
index, which can be done using <<scan-scroll,scan and scroll>> and the
<<bulk,`bulk` API>>. As soon as the migration is finished, the index alias
can be updated to point to the new index:

[source,json]
------------------------------
POST /_aliases
{
    "actions": [
        { "remove": { "alias": "baking", "index": "forums"    }},
        { "add":    { "alias": "baking", "index": "baking_v1" }}
    ]
}
------------------------------

Updating the alias is atomic; it's like throwing a switch. Your application
continues talking to the `baking` alias and is completely unaware that it now
points to a new dedicated index.

The dedicated index no longer needs the filter or the routing values. We can
just rely on the default sharding that Elasticsearch does using each
document's `_id` field.
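
The copy step of the migration can be sketched in Python as a function that turns one page of scroll hits into a `bulk` request body for the dedicated index. The helper name `bulk_lines` and the sample hits are hypothetical; the sketch assumes each hit carries `_type`, `_id` and `_source`, and it leaves out the HTTP calls that fetch the hits and send the body.

```python
import json


def bulk_lines(hits, target_index):
    """Convert search hits into a newline-delimited bulk body.

    Hypothetical helper. Each hit becomes an action line followed by a
    source line. No routing is set, because the dedicated index relies
    on default routing by _id.
    """
    lines = []
    for hit in hits:
        action = {
            "index": {
                "_index": target_index,
                "_type": hit["_type"],
                "_id": hit["_id"],
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(hit["_source"]))
    # The bulk format requires every line, including the last, to end in \n.
    return "".join(line + "\n" for line in lines)


# Made-up hits, shaped like one page of scroll results for the baking forum.
sample_hits = [
    {"_type": "post", "_id": "1",
     "_source": {"forum_id": "baking", "title": "Easy recipe for ginger nuts"}},
    {"_type": "post", "_id": "2",
     "_source": {"forum_id": "baking", "title": "Shortbread tips"}},
]

print(bulk_lines(sample_hits, "baking_v1"), end="")
```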

The last step is to remove the old documents from the shared index, which can
be done with a `delete-by-query` request, using the original routing value and
forum ID:

[source,json]
------------------------------
DELETE /forums/post/_query?routing=baking
{
    "query": {
        "term": {
            "forum_id": "baking"
        }
    }
}
------------------------------

The beauty of this index-per-user model is that it allows you to reduce
resources, keeping costs low, while still giving you the flexibility to scale
out when necessary, and with zero downtime.