diff --git a/410_Scaling.asciidoc b/410_Scaling.asciidoc
index 447fddb7f..1c2db61f3 100644
--- a/410_Scaling.asciidoc
+++ b/410_Scaling.asciidoc
@@ -18,5 +18,12 @@ include::410_Scaling/50_Index_templates.asciidoc[]
 
 include::410_Scaling/55_Retiring_data.asciidoc[]
 
+include::410_Scaling/60_Index_per_user.asciidoc[]
+include::410_Scaling/65_Shared_index.asciidoc[]
+include::410_Scaling/70_Faking_it.asciidoc[]
+
+include::410_Scaling/75_One_big_user.asciidoc[]
+
+include::410_Scaling/80_Scale_is_not_infinite.asciidoc[]
diff --git a/410_Scaling/10_Intro.asciidoc b/410_Scaling/10_Intro.asciidoc
index b4ba5e613..c8f78edc6 100644
--- a/410_Scaling/10_Intro.asciidoc
+++ b/410_Scaling/10_Intro.asciidoc
@@ -18,5 +18,12 @@ are aware of those limitations and work with them, the growing
 process will be pleasant. If you treat Elasticsearch badly, you could be in
 for a world of pain.
 
+The default settings in Elasticsearch will take you a long way but, to get the
+most bang for your buck, you need to think about how data flows through your
+system. We will talk about two common data flows: <<time-based,time-based data>>
+like log events or social network streams, where relevance is driven by recency,
+and <<user-based,user-based data>>, where a large document corpus can be
+subdivided by user or customer.
+
 This chapter will help you to make the right decisions up front, to avoid
 nasty surprises later on.
diff --git a/410_Scaling/45_Index_per_timeframe.asciidoc b/410_Scaling/45_Index_per_timeframe.asciidoc
index 473263fc9..0b5de2c0a 100644
--- a/410_Scaling/45_Index_per_timeframe.asciidoc
+++ b/410_Scaling/45_Index_per_timeframe.asciidoc
@@ -1,5 +1,5 @@
 [[time-based]]
-=== Time-based documents
+=== Time-based data
 
 One of the most common use cases for Elasticsearch is for logging, so common
 in fact that Elasticsearch provides an integrated logging platform called the
@@ -48,16 +48,18 @@ DELETE /logs/event/_query
 
 <1> Deletes all documents where Logstash's `@timestamp` field is older
     than 90 days.
 
-But this approach is very inefficient. Remember that when you delete a
+But this approach is *very inefficient*. Remember that when you delete a
 document, it is only marked as deleted (see <>). It won't be physically
 deleted until the segment containing it is merged away.
 
 Instead, use an _index per-timeframe_. You could start out with an index per
-year -- `logs_2014` -- or per month -- `logs_2014-10`. Perhaps, when your
+year -- `logs_2014` -- or per month -- `logs_2014-10`. Perhaps, when your
 website gets really busy, you need to switch to an index per day --
-`logs_2014-10-24`. The point is: you can adjust your indexing capacity as you
-go. There is no need to make a final decision up front.
+`logs_2014-10-24`. Purging old data is easy -- just delete old indices.
+This approach has the advantage of allowing you to scale as and when you need
+to. You don't have to make any difficult decisions up front. Every day is a
+new opportunity to change your indexing timeframes to suit the current demand.
 
 Apply the same logic to how big you make each index. Perhaps all you need is
 one primary shard per week initially. Later, maybe you need 5 primary shards
 per day.
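+
+For example -- the index name and shard count below are illustrative only --
+the next day's index might be created with more primary shards than the day
+before, or an index template, as discussed earlier in the chapter, could apply
+these settings automatically when the new index is first created:
+
+[source,json]
+------------------------------
+PUT /logs_2014-10-25
+{
+  "settings": {
+    "number_of_shards": 5
+  }
+}
+------------------------------
+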
 It doesn't matter -- you can adjust to new circumstances at any
diff --git a/410_Scaling/60_Index_per_user.asciidoc b/410_Scaling/60_Index_per_user.asciidoc
new file mode 100644
index 000000000..45dec5098
--- /dev/null
+++ b/410_Scaling/60_Index_per_user.asciidoc
@@ -0,0 +1,39 @@
+[[user-based]]
+=== User-based data
+
+Often, users start using Elasticsearch because they need to add full text
+search or analytics to an existing application. They create a single index
+which holds all of their documents. Gradually, others in the company realise
+how much benefit Elasticsearch brings, and they want to add their data to
+Elasticsearch as well.
+
+Fortunately, Elasticsearch supports
+http://en.wikipedia.org/wiki/Multitenancy[multitenancy] so each new user can
+have their own index in the same cluster. Occasionally, somebody will want to
+search across the documents for all users, which they can do by searching
+across all indices, but most of the time, users are only interested in their
+own documents.
+
+Some users have more documents than others and some users will have heavier
+search loads than others, so the ability to specify how many primary shards
+and replica shards each index should have fits well with the index-per-user
+model. Similarly, busier indices can be allocated to stronger boxes with shard
+allocation filtering. (See <>.)
+
+TIP: Don't just use the default number of primary shards for every index.
+Think about how much data that index needs to hold. It may be that all you
+need is one shard -- any more is a waste of resources.
+
+Most users of Elasticsearch can stop here. A simple index-per-user approach
+is sufficient for the majority of cases.
+
+In exceptional cases, you may find that you need to support a large number of
+users, all with similar needs. An example might be hosting a search engine
+for thousands of email forums. Some forums may have a huge amount of traffic,
+but the majority of forums are quite small. Dedicating an index with a single
+shard to a small forum is overkill -- a single shard could hold the data for
+many forums.
+
+What we need is a way to share resources across users, to give the impression
+that each user has their own index without wasting resources on small users.
+
diff --git a/410_Scaling/65_Shared_index.asciidoc b/410_Scaling/65_Shared_index.asciidoc
new file mode 100644
index 000000000..d793b29c7
--- /dev/null
+++ b/410_Scaling/65_Shared_index.asciidoc
@@ -0,0 +1,152 @@
+[[shared-index]]
+=== Shared index
+
+We can use a large shared index for the many smaller forums by indexing
+the forum identifier in a field and using it as a filter:
+
+[source,json]
+------------------------------
+PUT /forums
+{
+  "settings": {
+    "number_of_shards": 10 <1>
+  },
+  "mappings": {
+    "post": {
+      "properties": {
+        "forum_id": { <2>
+          "type": "string",
+          "index": "not_analyzed"
+        }
+      }
+    }
+  }
+}
+
+PUT /forums/post/1
+{
+  "forum_id": "baking", <2>
+  "title": "Easy recipe for ginger nuts",
+  ...
+}
+------------------------------
+<1> Create an index large enough to hold thousands of smaller forums.
+<2> Each post must include a `forum_id` to identify which forum it belongs
+    to.
+
+We can use the `forum_id` as a filter to search within a single forum.
+The filter will exclude most of the documents in the index (those from other
+forums) and filter caching will ensure that responses are fast:
+
+[source,json]
+------------------------------
+GET /forums/post/_search
+{
+  "query": {
+    "filtered": {
+      "query": {
+        "match": {
+          "title": "ginger nuts"
+        }
+      },
+      "filter": {
+        "term": { <1>
+          "forum_id": "baking"
+        }
+      }
+    }
+  }
+}
+------------------------------
+<1> The `term` filter is cached by default.
+
+This approach works, but we can do better. The posts from a single forum
+would fit easily onto one shard but currently they are scattered across all 10
+shards in the index. This means that every search request has to be forwarded
+to a primary or replica of all 10 shards. What would be ideal is to ensure
+that all the posts from a single forum are stored on the same shard.
+
+In <>, we explained that a document is allocated to a
+particular shard using this formula:
+
+    shard = hash(routing) % number_of_primary_shards
+
+The `routing` value defaults to the document's `_id`, but we can override that
+and provide our own custom routing value, such as the `forum_id`. All
+documents with the same `routing` value will be stored on the same shard:
+
+[source,json]
+------------------------------
+PUT /forums/post/1?routing=baking <1>
+{
+  "forum_id": "baking", <1>
+  "title": "Easy recipe for ginger nuts",
+  ...
+}
+------------------------------
+<1> Using the `forum_id` as the routing value ensures that all posts from the
+    same forum are stored on the same shard.
+
+When we search for posts in a particular forum, we can pass the same `routing`
+value to ensure that the search request is only run on the single shard that
+holds our documents:
+
+[source,json]
+------------------------------
+GET /forums/post/_search?routing=baking <1>
+{
+  "query": {
+    "filtered": {
+      "query": {
+        "match": {
+          "title": "ginger nuts"
+        }
+      },
+      "filter": {
+        "term": { <2>
+          "forum_id": "baking"
+        }
+      }
+    }
+  }
+}
+------------------------------
+<1> The query is only run on the shard that corresponds to this `routing` value.
+<2> We still need the filter, as a single shard can hold posts from many forums.
+
+Multiple forums can be queried by passing a comma-separated list of `routing`
+values, and including each `forum_id` in a `terms` filter:
+
+[source,json]
+------------------------------
+GET /forums/post/_search?routing=baking,cooking,recipes
+{
+  "query": {
+    "filtered": {
+      "query": {
+        "match": {
+          "title": "ginger nuts"
+        }
+      },
+      "filter": {
+        "terms": {
+          "forum_id": [ "baking", "cooking", "recipes" ]
+        }
+      }
+    }
+  }
+}
+------------------------------
+
+While this approach is technically efficient, it looks a bit clumsy because of
+the need to specify `routing` values and `terms` filters on every query or
+indexing request. Things look a lot better once we add index aliases into the
+mix.
+
+
diff --git a/410_Scaling/70_Faking_it.asciidoc b/410_Scaling/70_Faking_it.asciidoc
new file mode 100644
index 000000000..b64da7b88
--- /dev/null
+++ b/410_Scaling/70_Faking_it.asciidoc
@@ -0,0 +1,72 @@
+[[faking-it]]
+=== Faking index-per-user with aliases
+
+To keep things simple and clean, we would like our application to believe that
+we have a dedicated index per user -- or per forum in our example -- even if
+the reality is that we are using one big <<shared-index,shared index>>. To do
+that, we need some way to hide the `routing` value and the filter on
+`forum_id`.
+
+Index aliases allow us to do just that.
+When you associate an alias with an index, you can also specify a filter and
+routing values:
+
+[source,json]
+------------------------------
+PUT /forums/_alias/baking
+{
+  "routing": "baking",
+  "filter": {
+    "term": {
+      "forum_id": "baking"
+    }
+  }
+}
+------------------------------
+
+Now, we can treat the `baking` alias as if it were its own index. Documents
+indexed into the `baking` alias automatically get the custom routing value
+applied:
+
+[source,json]
+------------------------------
+PUT /baking/post/1 <1>
+{
+  "forum_id": "baking", <1>
+  "title": "Easy recipe for ginger nuts",
+  ...
+}
+------------------------------
+<1> We still need the `forum_id` field for the filter to work, but
+    the custom routing value is now implicit.
+
+Queries run against the `baking` alias are run just on the shard associated
+with the custom routing value, and the results are automatically filtered by
+the filter we specified:
+
+[source,json]
+------------------------------
+GET /baking/post/_search
+{
+  "query": {
+    "match": {
+      "title": "ginger nuts"
+    }
+  }
+}
+------------------------------
+
+Multiple aliases can be specified when searching across multiple forums:
+
+[source,json]
+------------------------------
+GET /baking,recipes/post/_search <1>
+{
+  "query": {
+    "match": {
+      "title": "ginger nuts"
+    }
+  }
+}
+------------------------------
+<1> Both `routing` values are applied, and results can match either filter.
+
diff --git a/410_Scaling/75_One_big_user.asciidoc b/410_Scaling/75_One_big_user.asciidoc
new file mode 100644
index 000000000..423ce5bba
--- /dev/null
+++ b/410_Scaling/75_One_big_user.asciidoc
@@ -0,0 +1,67 @@
+[[one-big-user]]
+=== One big user
+
+Big popular forums start out as small forums. One day we will find that one
+shard in our shared index is doing a lot more work than the other shards,
+because it holds the documents for a forum which has become very popular. That
+forum now needs its own index.
+
+The index aliases that we're using to fake an index-per-user give us a clean
+migration path for the big forum.
+
+The first step is to create a new index dedicated to the forum, and with the
+appropriate number of shards to allow for expected growth:
+
+[source,json]
+------------------------------
+PUT /baking_v1
+{
+  "settings": {
+    "number_of_shards": 3
+  }
+}
+------------------------------
+
+The next step is to migrate the data from the shared index into the dedicated
+index, which can be done using <> and the
+<>. As soon as the migration is finished, the index alias
+can be updated to point to the new index:
+
+[source,json]
+------------------------------
+POST /_aliases
+{
+  "actions": [
+    { "remove": { "alias": "baking", "index": "forums"    }},
+    { "add":    { "alias": "baking", "index": "baking_v1" }}
+  ]
+}
+------------------------------
+
+Updating the alias is atomic; it's like throwing a switch. Your application
+continues talking to the `baking` alias and is completely unaware that it now
+points to a new dedicated index.
+
+The dedicated index no longer needs the filter or the routing values. We can
+just rely on the default sharding that Elasticsearch does using each
+document's `_id` field.
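+
+As a rough sketch of the migration step mentioned above -- the document ID,
+batch size, and scroll timeout here are only examples -- scan-and-scroll pulls
+the `baking` posts out of the shared index, and the `bulk` API pushes them into
+the new index. Note that no routing value is needed when writing to
+`baking_v1`:
+
+[source,json]
+------------------------------
+GET /forums/post/_search?search_type=scan&scroll=1m&routing=baking <1>
+{
+  "query": {
+    "term": {
+      "forum_id": "baking"
+    }
+  },
+  "size": 1000
+}
+
+GET /_search/scroll?scroll=1m <2>
+<scroll_id from the previous response>
+
+POST /baking_v1/post/_bulk <3>
+{ "index": { "_id": "1" }}
+{ "forum_id": "baking", "title": "Easy recipe for ginger nuts", ... }
+...
+------------------------------
+<1> The `routing` value and the `forum_id` filter limit the scan to posts
+    from the `baking` forum.
+<2> Keep fetching batches of results until the scroll returns no more hits.
+<3> Reindex each batch into the dedicated index with the `bulk` API.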
+
+The last step is to remove the old documents from the shared index, which can
+be done with a `delete-by-query` request, using the original routing value and
+forum ID:
+
+[source,json]
+------------------------------
+DELETE /forums/post/_query?routing=baking
+{
+  "query": {
+    "term": {
+      "forum_id": "baking"
+    }
+  }
+}
+------------------------------
+
+The beauty of this index-per-user model is that it allows you to reduce
+resources, keeping costs low, while still giving you the flexibility to scale
+out when necessary, and with zero downtime.
diff --git a/410_Scaling/80_Scale_is_not_infinite.asciidoc b/410_Scaling/80_Scale_is_not_infinite.asciidoc
new file mode 100644
index 000000000..67ea9e452
--- /dev/null
+++ b/410_Scaling/80_Scale_is_not_infinite.asciidoc
@@ -0,0 +1,89 @@
+[[finite-scale]]
+=== Scale is not infinite
+
+Throughout this chapter we have spoken about many of the ways that
+Elasticsearch can scale. Most scaling problems can be solved by adding more
+nodes. But there is one resource which is finite and should be treated with
+respect: the _cluster state_.
+
+The cluster state is a data structure which holds cluster-level information,
+such as:
+
+* Cluster-level settings.
+* Nodes that are part of the cluster.
+* Indices, plus their settings, mappings, analyzers, warmers, and aliases.
+* The shards associated with each index, plus the node on which they are
+  allocated.
+
+You can view the current cluster state with this request:
+
+[source,json]
+------------------------------
+GET /_cluster/state
+------------------------------
+
+The cluster state exists on every node in the cluster, including client nodes.
+This is how any node can forward a request directly to the node that holds the
+requested data -- every node knows where every document lives.
+
+Only the master node is allowed to update the cluster state. Imagine that an
+indexing request introduces a previously unknown field. The node holding the
+primary shard for the document must forward the new mapping to the master
+node. The master node incorporates the changes in the cluster state, and
+publishes a new version to all of the nodes in the cluster.
+
+Search requests *use* the cluster state, but they don't change it. The same
+applies to document-level CRUD requests unless, of course, they introduce a
+new field which requires a mapping update. By and large the cluster state is
+static and is not a bottleneck.
+
+However, remember that this same data structure has to exist in memory on
+every node, and must be published to every node whenever it is updated. The
+bigger it is, the longer that process will take.
+
+The most common problem that we see with the cluster state is the introduction
+of too many fields. A user might decide to use a separate field for every IP
+address, or every referer URL. The example below keeps track of the number of
+times a page has been visited by using a different field name for every unique
+referer:
+
+[source,json]
+------------------------------
+POST /counters/pageview/home_page/_update
+{
+  "script": "ctx._source[referer]++",
+  "params": {
+    "referer": "http://www.foo.com/links?bar=baz"
+  }
+}
+------------------------------
+
+This approach is catastrophically bad! It will result in millions of fields,
+all of which have to be stored in the cluster state. Every time a new referer
+is seen, a new field is added to the already bloated cluster state, which then
+has to be published to every node in the cluster.
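+
+To make the problem concrete, here is a sketch of roughly what the mapping for
+that `counters` index would look like after only a handful of page views (the
+response is abbreviated, and dynamic mapping is assumed to have chosen the
+`long` type for the new fields):
+
+[source,json]
+------------------------------
+GET /counters/_mapping/pageview
+
+{
+  "counters": {
+    "mappings": {
+      "pageview": {
+        "properties": {
+          "http://www.foo.com/links?bar=baz":  { "type": "long" },
+          "http://www.linkbait.com/article_3": { "type": "long" },
+          ...
+        }
+      }
+    }
+  }
+}
+------------------------------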
+
+A much better approach is to use <<nested-objects,nested objects>>, with one
+field for the parameter name -- `referer` -- and another field for its
+associated value -- `count`:
+
+[source,json]
+------------------------------
+  "counters": [
+    { "referer": "http://www.foo.com/links?bar=baz", "count": 2 },
+    { "referer": "http://www.linkbait.com/article_3", "count": 10 },
+    ...
+  ]
+------------------------------
+
+The nested approach may increase the number of documents, but Elasticsearch is
+built to handle that. The important thing is that it keeps the cluster state
+small and agile.
+
+Eventually, despite your best intentions, you may find that the number of
+nodes and indices and mappings that you have is just too much for one cluster.
+At this stage, it is probably worth dividing the problem up between multiple
+clusters. Thanks to {ref}modules-tribe.html[`tribe` nodes], you can even run
+searches across multiple clusters as if they were one big cluster.
+
+
diff --git a/book.asciidoc b/book.asciidoc
index c44282268..0b60ae800 100644
--- a/book.asciidoc
+++ b/book.asciidoc
@@ -101,8 +101,6 @@ include::400_Relationships.asciidoc[]
 
 include::410_Scaling.asciidoc[]
 
-include::430_Index_Per_User.asciidoc[]
-
 // Part 7
 
 [[administration]]