Finished scaling chapter
clintongormley committed Aug 3, 2014
1 parent 5701e7c commit 641b303
Showing 9 changed files with 440 additions and 7 deletions.
7 changes: 7 additions & 0 deletions 410_Scaling.asciidoc
@@ -18,5 +18,12 @@ include::410_Scaling/50_Index_templates.asciidoc[]

include::410_Scaling/55_Retiring_data.asciidoc[]

include::410_Scaling/60_Index_per_user.asciidoc[]

include::410_Scaling/65_Shared_index.asciidoc[]

include::410_Scaling/70_Faking_it.asciidoc[]

include::410_Scaling/75_One_big_user.asciidoc[]

include::410_Scaling/80_Scale_is_not_infinite.asciidoc[]
7 changes: 7 additions & 0 deletions 410_Scaling/10_Intro.asciidoc
@@ -18,5 +18,12 @@ are aware of those limitations and work with them, the growing process will be
pleasant. If you treat Elasticsearch badly, you could be in for a world of
pain.

The default settings in Elasticsearch will take you a long way but, to get the
most bang for your buck, you need to think about how data flows through your
system. We will talk about two common data flows: <<time-based>> like log
events or social network streams where relevance is driven by recency, and
<<user-based>> where a large document corpus can be subdivided by user or
customer.

This chapter will help you to make the right decisions up front, to avoid
nasty surprises later on.
12 changes: 7 additions & 5 deletions 410_Scaling/45_Index_per_timeframe.asciidoc
@@ -1,5 +1,5 @@
[[time-based]]
=== Time-based documents
=== Time-based data

One of the most common use cases for Elasticsearch is for logging, so common
in fact that Elasticsearch provides an integrated logging platform called the
@@ -48,16 +48,18 @@ DELETE /logs/event/_query
<1> Deletes all documents where Logstash's `@timestamp` field is
older than 90 days.

But this approach is *very inefficient*. Remember that when you delete a
document, it is only marked as deleted (see <<deletes-and-updates>>). It won't
be physically deleted until the segment containing it is merged away.

Instead, use an _index per-timeframe_. You could start out with an index per
year -- `logs_2014` -- or per month -- `logs_2014-10`. Perhaps, when your
website gets really busy, you need to switch to an index per day --
`logs_2014-10-24`. Purging old data is easy -- just delete old indices.

This approach has the advantage of allowing you to scale as and when you need
to. You don't have to make any difficult decisions up front. Every day is a
new opportunity to change your indexing timeframes to suit the current demand.
Apply the same logic to how big you make each index. Perhaps all you need is
one primary shard per week initially. Later, maybe you need 5 primary shards
per day. It doesn't matter -- you can adjust to new circumstances at any
time.
39 changes: 39 additions & 0 deletions 410_Scaling/60_Index_per_user.asciidoc
@@ -0,0 +1,39 @@
[[user-based]]
=== User-based data

Often, users start using Elasticsearch because they need to add full text
search or analytics to an existing application. They create a single index
which holds all of their documents. Gradually, others in the company realise
how much benefit Elasticsearch brings, and they want to add their data to
Elasticsearch as well.

Fortunately, Elasticsearch supports
http://en.wikipedia.org/wiki/Multitenancy[multitenancy] so each new user can
have their own index in the same cluster. Occasionally, somebody will want to
search across the documents for all users, which they can do by searching
across all indices, but most of the time, users are only interested in their
own documents.

Some users have more documents than others and some users will have heavier
search loads than others, so the ability to specify how many primary shards
and replica shards each index should have fits well with the index-per-user
model. Similarly, busier indices can be allocated to stronger boxes with shard
allocation filtering. (See <<migrate-indices>>.)

TIP: Don't just use the default number of primary shards for every index.
Think about how much data that index needs to hold. It may be that all you
need is one shard -- any more is a waste of resources.

Most users of Elasticsearch can stop here. A simple index-per-user approach
is sufficient for the majority of cases.

In exceptional cases, you may find that you need to support a large number of
users, all with similar needs. An example might be hosting a search engine
for thousands of email forums. Some forums may have a huge amount of traffic,
but the majority of forums are quite small. Dedicating an index with a single
shard to a small forum is overkill -- a single shard could hold the data for
many forums.

What we need is a way to share resources across users, to give the impression
that each user has their own index without wasting resources on small users.

152 changes: 152 additions & 0 deletions 410_Scaling/65_Shared_index.asciidoc
@@ -0,0 +1,152 @@
[[shared-index]]
=== Shared index

We can use a large shared index for the many smaller forums by indexing
the forum identifier in a field and using it as a filter:

[source,json]
------------------------------
PUT /forums
{
  "settings": {
    "number_of_shards": 10 <1>
  },
  "mappings": {
    "post": {
      "properties": {
        "forum_id": { <2>
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

PUT /forums/post/1
{
  "forum_id": "baking", <2>
  "title": "Easy recipe for ginger nuts",
  ...
}
------------------------------
<1> Create an index large enough to hold thousands of smaller forums.
<2> Each post must include a `forum_id` to identify which forum it belongs
to.

We can use the `forum_id` as a filter to search within a single forum. The
filter will exclude most of the documents in the index (those from other
forums) and filter caching will ensure that responses are fast:

[source,json]
------------------------------
GET /forums/post/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "term": { <1>
          "forum_id": "baking"
        }
      }
    }
  }
}
------------------------------
<1> The `term` filter is cached by default.

This approach works, but we can do better. The posts from a single forum
would fit easily onto one shard but currently they are scattered across all 10
shards in the index. This means that every search request has to be forwarded
to a primary or replica of all 10 shards. What would be ideal is to ensure
that all the posts from a single forum are stored on the same shard.

In <<routing-value>>, we explained that a document is allocated to a
particular shard using this formula:

shard = hash(routing) % number_of_primary_shards
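This formula can be sketched in a few lines of Python. A sketch only:
Python's built-in `hash` stands in for the Murmur3 hash that Elasticsearch
actually uses, so the shard numbers will not match a real cluster, but the
key property is the same -- equal routing values always select the same
primary shard.

```python
# Toy model of the routing formula above. Python's hash() stands in
# for Elasticsearch's Murmur3 hash, so the shard numbers won't match
# a real cluster, but the property shown here holds either way:
# equal routing values always map to the same primary shard.

def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    return hash(routing) % number_of_primary_shards

# All posts routed by the same forum_id land on one shard:
assert pick_shard("baking", 10) == pick_shard("baking", 10)

# By default the routing value is the document _id, so a forum's
# posts may be scattered across any of the primary shards:
doc_shards = {pick_shard(doc_id, 10) for doc_id in ("1", "2", "3", "4")}
assert all(0 <= shard < 10 for shard in doc_shards)
```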

The `routing` value defaults to the document's `_id`, but we can override that
and provide our own custom routing value, such as the `forum_id`. All
documents with the same `routing` value will be stored on the same shard:

[source,json]
------------------------------
PUT /forums/post/1?routing=baking <1>
{
  "forum_id": "baking", <1>
  "title": "Easy recipe for ginger nuts",
  ...
}
------------------------------
<1> Using the `forum_id` as the routing value ensures that all posts from the
same forum are stored on the same shard.

When we search for posts in a particular forum, we can pass the same `routing`
value to ensure that the search request is only run on the single shard that
holds our documents:

[source,json]
------------------------------
GET /forums/post/_search?routing=baking <1>
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "term": { <2>
          "forum_id": "baking"
        }
      }
    }
  }
}
------------------------------
<1> The query is only run on the shard that corresponds to this `routing` value.
<2> We still need the filter, as a single shard can hold posts from many forums.

Multiple forums can be queried by passing a comma-separated list of `routing`
values, and including each `forum_id` in a `terms` filter:

[source,json]
------------------------------
GET /forums/post/_search?routing=baking,cooking,recipes <1>
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "terms": {
          "forum_id": [ "baking", "cooking", "recipes" ]
        }
      }
    }
  }
}
------------------------------

While this approach is technically efficient, it looks a bit clumsy because of
the need to specify `routing` values and `terms` filters on every query or
indexing request. Things look a lot better once we add index aliases into the
mix.


72 changes: 72 additions & 0 deletions 410_Scaling/70_Faking_it.asciidoc
@@ -0,0 +1,72 @@
[[faking-it]]
=== Faking index-per-user with aliases

To keep things simple and clean, we would like our application to believe that
we have a dedicated index per user -- or per forum in our example -- even if
the reality is that we are using one big <<shared-index,shared index>>. To do
that, we need some way to hide the `routing` value and the filter on
`forum_id`.

Index aliases allow us to do just that. When you associate an alias with an
index, you can also specify a filter and routing values:

[source,json]
------------------------------
PUT /forums/_alias/baking
{
  "routing": "baking",
  "filter": {
    "term": {
      "forum_id": "baking"
    }
  }
}
------------------------------

Now, we can treat the `baking` alias as if it were its own index. Documents
indexed into the `baking` alias automatically get the custom routing value
applied:

[source,json]
------------------------------
PUT /baking/post/1 <1>
{
  "forum_id": "baking", <1>
  "title": "Easy recipe for ginger nuts",
  ...
}
------------------------------
<1> We still need the `forum_id` field for the filter to work, but
the custom routing value is now implicit.

Queries run against the `baking` alias are run just on the shard associated
with the custom routing value, and the results are automatically filtered by
the filter we specified:

[source,json]
------------------------------
GET /baking/post/_search
{
  "query": {
    "match": {
      "title": "ginger nuts"
    }
  }
}
------------------------------

Multiple aliases can be specified when searching across multiple forums:

[source,json]
------------------------------
GET /baking,recipes/post/_search <1>
{
  "query": {
    "match": {
      "title": "ginger nuts"
    }
  }
}
------------------------------
<1> Both `routing` values are applied, and results can match either filter.
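The alias behaviour described in this section can be modelled as a toy in
plain Python. This is an illustration only, not how Elasticsearch implements
aliases: each alias is reduced to a routing value plus a filter over one
shared store, and the `index_doc` and `search` helpers below are our own
hypothetical names, not part of any client library.

```python
# Toy model of filtered, routed aliases over one shared index.
# Not how Elasticsearch works internally -- just the observable
# behaviour: indexing via an alias applies its routing value, and
# searching via one or more aliases applies their filters.

shared_index = []  # stand-in for the shared `forums` index

aliases = {
    "baking":  {"routing": "baking",  "filter": {"forum_id": "baking"}},
    "recipes": {"routing": "recipes", "filter": {"forum_id": "recipes"}},
}

def index_doc(alias, doc):
    doc["_routing"] = aliases[alias]["routing"]  # implicit custom routing
    shared_index.append(doc)

def search(alias_names, words):
    # A document matches if its title matches AND any alias filter accepts it.
    filters = [aliases[name]["filter"] for name in alias_names]
    return [
        doc for doc in shared_index
        if words in doc["title"].lower()
        and any(all(doc.get(k) == v for k, v in f.items()) for f in filters)
    ]

index_doc("baking",  {"forum_id": "baking",  "title": "Easy recipe for ginger nuts"})
index_doc("recipes", {"forum_id": "recipes", "title": "Ginger nuts three ways"})

assert len(search(["baking"], "ginger nuts")) == 1             # one alias, one filter
assert len(search(["baking", "recipes"], "ginger nuts")) == 2  # either filter may match
```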

67 changes: 67 additions & 0 deletions 410_Scaling/75_One_big_user.asciidoc
@@ -0,0 +1,67 @@
[[one-big-user]]
=== One big user

Big popular forums start out as small forums. One day we will find that one
shard in our shared index is doing a lot more work than the other shards,
because it holds the documents for a forum which has become very popular. That
forum now needs its own index.

The index aliases that we're using to fake an index-per-user give us a clean
migration path for the big forum.

The first step is to create a new index dedicated to the forum, and with the
appropriate number of shards to allow for expected growth:

[source,json]
------------------------------
PUT /baking_v1
{
  "settings": {
    "number_of_shards": 3
  }
}
------------------------------

The next step is to migrate the data from the shared index into the dedicated
index, which can be done using <<scan-scroll,scan and scroll>> and the
<<bulk,`bulk` API>>. As soon as the migration is finished, the index alias
can be updated to point to the new index:

[source,json]
------------------------------
POST /_aliases
{
  "actions": [
    { "remove": { "alias": "baking", "index": "forums" }},
    { "add":    { "alias": "baking", "index": "baking_v1" }}
  ]
}
------------------------------

Updating the alias is atomic; it's like throwing a switch. Your application
continues talking to the `baking` alias, completely unaware that it now
points to a new dedicated index.
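The whole flow of this section -- scan the shared index, bulk-index into the
dedicated index, swap the alias, then clean up -- can be sketched with
in-memory stand-ins. A toy model under obvious assumptions: plain lists play
the part of indices, and the `scan` generator merely mimics the batching of
scan-and-scroll; none of this is the real Elasticsearch API.

```python
# Toy model of migrating one big forum out of the shared index.
# Lists stand in for indices; none of this is the real API.

forums = [  # shared index, holding posts from several forums
    {"_id": "1", "forum_id": "baking", "title": "Easy recipe for ginger nuts"},
    {"_id": "2", "forum_id": "diving", "title": "Best wreck dives"},
    {"_id": "3", "forum_id": "baking", "title": "Sourdough starters"},
]
baking_v1 = []                   # new dedicated index
aliases = {"baking": "forums"}   # the alias currently targets the shared index

def scan(index, forum_id, batch_size=2):
    """Mimic scan-and-scroll: yield matching docs in fixed-size batches."""
    matches = [doc for doc in index if doc["forum_id"] == forum_id]
    for i in range(0, len(matches), batch_size):
        yield matches[i:i + batch_size]

# 1. Migrate: scan the shared index, "bulk"-index into the new one.
for batch in scan(forums, "baking"):
    baking_v1.extend(batch)

# 2. Swap the alias -- in the real cluster this one step is atomic.
aliases["baking"] = "baking_v1"

# 3. Clean up: delete-by-query removes the old copies.
forums = [doc for doc in forums if doc["forum_id"] != "baking"]

assert len(baking_v1) == 2
assert aliases["baking"] == "baking_v1"
assert len(forums) == 1
```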

The dedicated index no longer needs the filter or the routing values. We can
just rely on the default sharding that Elasticsearch does using each
document's `_id` field.

The last step is to remove the old documents from the shared index, which can
be done with a `delete-by-query` request, using the original routing value and
forum ID:

[source,json]
------------------------------
DELETE /forums/post/_query?routing=baking
{
  "query": {
    "term": {
      "forum_id": "baking"
    }
  }
}
------------------------------

The beauty of this index-per-user model is that it keeps resource usage and
costs low, while still giving you the flexibility to scale out when
necessary, with zero downtime.