diff --git a/010_Intro/10_Installing_ES.asciidoc b/010_Intro/10_Installing_ES.asciidoc index 5daeaa3e5..8a9408d1a 100644 --- a/010_Intro/10_Installing_ES.asciidoc +++ b/010_Intro/10_Installing_ES.asciidoc @@ -73,7 +73,10 @@ start experimenting with it. A _node_ is a running instance of Elasticsearch. ((("nodes", "defined"))) A _cluster_ is ((("clusters", "defined")))a group of nodes with the same `cluster.name` that are working together to share data and to provide failover and scale. (A single node, however, can form a cluster -all by itself.) +all by itself.) You can change the `cluster.name` in the `elasticsearch.yml` configuration +file that's loaded when you start a node. More information about this and other +<> is provided +in the Production Deployment section at the end of this book. TIP: See that View in Sense link at the bottom of the example? <> to run the examples in this book against your own Elasticsearch cluster and view the results. diff --git a/020_Distributed_Cluster/20_Add_failover.asciidoc b/020_Distributed_Cluster/20_Add_failover.asciidoc index de176ad35..817335903 100644 --- a/020_Distributed_Cluster/20_Add_failover.asciidoc +++ b/020_Distributed_Cluster/20_Add_failover.asciidoc @@ -13,11 +13,10 @@ in exactly the same way as you started the first one (see share the same directory. When you run a second node on the same machine, it automatically discovers -and joins the cluster as long as it has the same `cluster.name` as the first node (see -the `./config/elasticsearch.yml` file). However, for nodes running on different machines +and joins the cluster as long as it has the same `cluster.name` as the first node. +However, for nodes running on different machines to join the same cluster, you need to configure a list of unicast hosts the nodes can contact -to join the cluster. For more information about how Elasticsearch nodes find eachother, see https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery] -in the Elasticsearch Reference. +to join the cluster. For more information, see <>. 
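A quick way to check that a second node really has joined the same cluster is the cluster health API. This is only an illustrative sketch of such a check, assuming the node is reachable on its default HTTP port; the endpoint itself is the standard `_cluster/health` API:

[source,js]
--------------------------------------------------
GET /_cluster/health <1>
--------------------------------------------------
<1> The response includes the `cluster_name` and `number_of_nodes`, so you can
    confirm that each node picked up the intended `cluster.name` and joined the
    cluster you expected.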
*************************************** diff --git a/052_Mapping_Analysis/25_Data_type_differences.asciidoc b/052_Mapping_Analysis/25_Data_type_differences.asciidoc index 0a06f4276..7f3c69444 100644 --- a/052_Mapping_Analysis/25_Data_type_differences.asciidoc +++ b/052_Mapping_Analysis/25_Data_type_differences.asciidoc @@ -44,7 +44,7 @@ This gives us the following: "properties": { "date": { "type": "date", - "format": "dateOptionalTime" + "format": "strict_date_optional_time||epoch_millis" }, "name": { "type": "string" diff --git a/052_Mapping_Analysis/45_Mapping.asciidoc b/052_Mapping_Analysis/45_Mapping.asciidoc index 98fefb322..71cae0442 100644 --- a/052_Mapping_Analysis/45_Mapping.asciidoc +++ b/052_Mapping_Analysis/45_Mapping.asciidoc @@ -75,7 +75,7 @@ Elasticsearch generated dynamically from the documents that we indexed: "properties": { "date": { "type": "date", - "format": "dateOptionalTime" + "format": "strict_date_optional_time||epoch_millis" }, "name": { "type": "string" diff --git a/060_Distributed_Search.asciidoc b/060_Distributed_Search.asciidoc index 7efc0136b..fc885386e 100644 --- a/060_Distributed_Search.asciidoc +++ b/060_Distributed_Search.asciidoc @@ -6,5 +6,5 @@ include::060_Distributed_Search/10_Fetch_phase.asciidoc[] include::060_Distributed_Search/15_Search_options.asciidoc[] -include::060_Distributed_Search/20_Scan_and_scroll.asciidoc[] +include::060_Distributed_Search/20_Scroll.asciidoc[] diff --git a/060_Distributed_Search/10_Fetch_phase.asciidoc b/060_Distributed_Search/10_Fetch_phase.asciidoc index 8758a67b1..dd5c4ffc3 100644 --- a/060_Distributed_Search/10_Fetch_phase.asciidoc +++ b/060_Distributed_Search/10_Fetch_phase.asciidoc @@ -57,7 +57,7 @@ culprits are usually bots or web spiders that tirelessly keep fetching page after page until your servers crumble at the knees. If you _do_ need to fetch large numbers of docs from your cluster, you can -do so efficiently by disabling sorting with the `scan` search type, +do so efficiently by disabling sorting with the `scroll` query, which we discuss <>. **** diff --git a/060_Distributed_Search/20_Scan_and_scroll.asciidoc b/060_Distributed_Search/20_Scan_and_scroll.asciidoc deleted file mode 100644 index 317b53b19..000000000 --- a/060_Distributed_Search/20_Scan_and_scroll.asciidoc +++ /dev/null @@ -1,81 +0,0 @@ -[[scan-scroll]] -=== scan and scroll - -The `scan` search type and the `scroll` API((("scroll API", "scan and scroll"))) are used together to retrieve -large numbers of documents from Elasticsearch efficiently, without paying the -penalty of deep pagination. - -`scroll`:: -+ --- -A _scrolled search_ allows us to((("scrolled search"))) do an initial search and to keep pulling -batches of results from Elasticsearch until there are no more results left. -It's a bit like a _cursor_ in ((("cursors")))a traditional database. - -A scrolled search takes a snapshot in time. It doesn't see any changes that -are made to the index after the initial search request has been made. It does -this by keeping the old data files around, so that it can preserve its ``view'' -on what the index looked like at the time it started. - --- - -`scan`:: - -The costly part of deep pagination is the global sorting of results, but if we -disable sorting, then we can return all documents quite cheaply. To do this, we -use the `scan` search type.((("scan search type"))) Scan instructs Elasticsearch to do no sorting, but -to just return the next batch of results from every shard that still has -results to return. 
- -To use _scan-and-scroll_, we execute a search((("scan-and-scroll"))) request setting `search_type` to((("search_type", "scan and scroll"))) -`scan`, and passing a `scroll` parameter telling Elasticsearch how long it -should keep the scroll open: - -[source,js] --------------------------------------------------- -GET /old_index/_search?search_type=scan&scroll=1m <1> -{ - "query": { "match_all": {}}, - "size": 1000 -} --------------------------------------------------- -<1> Keep the scroll open for 1 minute. - -The response to this request doesn't include any hits, but does include a -`_scroll_id`, which is a long Base-64 encoded((("scroll_id"))) string. Now we can pass the -`_scroll_id` to the `_search/scroll` endpoint to retrieve the first batch of -results: - -[source,js] --------------------------------------------------- -GET /_search/scroll?scroll=1m <1> -c2Nhbjs1OzExODpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExOTpRNV9aY1VyUVM4U0 <2> -NMd2pjWlJ3YWlBOzExNjpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExNzpRNV9aY1Vy -UVM4U0NMd2pjWlJ3YWlBOzEyMDpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzE7dG90YW -xfaGl0czoxOw== --------------------------------------------------- -<1> Keep the scroll open for another minute. -<2> The `_scroll_id` can be passed in the body, in the URL, or as a - query parameter. - -Note that we again specify `?scroll=1m`. The scroll expiry time is refreshed -every time we run a scroll request, so it needs to give us only enough time -to process the current batch of results, not all of the documents that match -the query. - -The response to this scroll request includes the first batch of results. -Although we specified a `size` of 1,000, we get back many more -documents.((("size parameter", "in scanning"))) When scanning, the `size` is applied to each shard, so you will -get back a maximum of `size * number_of_primary_shards` documents in each -batch. - -NOTE: The scroll request also returns a _new_ `_scroll_id`. Every time -we make the next scroll request, we must pass the `_scroll_id` returned by the -_previous_ scroll request. - -When no more hits are returned, we have processed all matching documents. - -TIP: Some of the http://www.elastic.co/guide[official Elasticsearch clients] -provide _scan-and-scroll_ helpers that provide an easy wrapper around this -functionality.((("clients", "providing scan-and-scroll helpers"))) - diff --git a/060_Distributed_Search/20_Scroll.asciidoc b/060_Distributed_Search/20_Scroll.asciidoc new file mode 100644 index 000000000..15ef7170c --- /dev/null +++ b/060_Distributed_Search/20_Scroll.asciidoc @@ -0,0 +1,74 @@ +[[scroll]] +=== Scroll + +A `scroll` query ((("scroll API"))) is used to retrieve +large numbers of documents from Elasticsearch efficiently, without paying the +penalty of deep pagination. + +Scrolling allows us to((("scrolled search"))) do an initial search and to keep pulling +batches of results from Elasticsearch until there are no more results left. +It's a bit like a _cursor_ in ((("cursors")))a traditional database. + +A scrolled search takes a snapshot in time. It doesn't see any changes that +are made to the index after the initial search request has been made. It does +this by keeping the old data files around, so that it can preserve its ``view'' +on what the index looked like at the time it started. + +The costly part of deep pagination is the global sorting of results, but if we +disable sorting, then we can return all documents quite cheaply. To do this, we +sort by `_doc`. 
This instructs Elasticsearch to just return the next batch of +results from every shard that still has results to return. + +To scroll through results, we execute a search request and set the `scroll` value to +the length of time we want to keep the scroll window open. The scroll expiry +time is refreshed every time we run a scroll request, so it only needs to be long enough +to process the current batch of results, not all of the documents that match +the query. The timeout is important because keeping the scroll window open +consumes resources and we want to free them as soon as they are no longer needed. +Setting the timeout enables Elasticsearch to automatically free the resources +after a small period of inactivity. + +[source,js] +-------------------------------------------------- +GET /old_index/_search?scroll=1m <1> +{ + "query": { "match_all": {}}, + "sort" : ["_doc"], <2> + "size": 1000 +} +-------------------------------------------------- +<1> Keep the scroll window open for 1 minute. +<2> `_doc` is the most efficient sort order. + +The response to this request includes a +`_scroll_id`, which is a long Base-64 encoded((("scroll_id"))) string. Now we can pass the +`_scroll_id` to the `_search/scroll` endpoint to retrieve the next batch of +results: + +[source,js] +-------------------------------------------------- +GET /_search/scroll +{ + "scroll": "1m", <1> + "scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs=" +} +-------------------------------------------------- +<1> Note that we again set the scroll expiration to `1m`. + +The response to this scroll request includes the next batch of results. +Although we specified a `size` of 1,000, we get back many more +documents.((("size parameter", "in scanning"))) When scanning, the `size` is applied to each shard, so you will +get back a maximum of `size * number_of_primary_shards` documents in each +batch. + +NOTE: The scroll request also returns a _new_ `_scroll_id`. Every time +we make the next scroll request, we must pass the `_scroll_id` returned by the +_previous_ scroll request. + +When no more hits are returned, we have processed all matching documents. + +TIP: Some of the official Elasticsearch clients, such as the +http://elasticsearch-py.readthedocs.org/en/master/helpers.html#scan[Python client] and the +https://metacpan.org/pod/Search::Elasticsearch::Scroll[Perl client], provide scroll helpers that +are easy-to-use wrappers around this functionality. + diff --git a/070_Index_Mgmt/50_Reindexing.asciidoc b/070_Index_Mgmt/50_Reindexing.asciidoc index aa7d57233..ed5322379 100644 --- a/070_Index_Mgmt/50_Reindexing.asciidoc +++ b/070_Index_Mgmt/50_Reindexing.asciidoc @@ -15,7 +15,7 @@ whole document available to you in Elasticsearch itself. You don't have to rebuild your index from the database, which is usually much slower. To reindex all of the documents from the old index efficiently, use -<> to retrieve batches((("scan-and-scroll", "using in reindexing documents"))) of documents from the old index, +<> to retrieve batches((("scroll", "using in reindexing documents"))) of documents from the old index, and the <> to push them into the new index. 
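The bulk half of that retrieve-and-reindex loop might look something like the following sketch. The target index `new_index`, the type name, the IDs, and the document bodies here are placeholders; in practice, each pair of lines would be built from the `_type`, `_id`, and `_source` of a hit returned by the scroll request:

[source,js]
--------------------------------------------------
POST /new_index/_bulk
{ "index": { "_type": "my_type", "_id": 1 }} <1>
{ "title": "First document from the current scroll batch" } <2>
{ "index": { "_type": "my_type", "_id": 2 }}
{ "title": "Second document from the current scroll batch" }
--------------------------------------------------
<1> Reusing the original `_type` and `_id` keeps each document's identity in
    the new index.
<2> The body line is simply the hit's `_source`, unchanged.

We keep alternating scroll and bulk requests until the scroll returns no more hits.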
.Reindexing in Batches @@ -27,7 +27,7 @@ jobs by filtering on a date or timestamp field: [source,js] -------------------------------------------------- -GET /old_index/_search?search_type=scan&scroll=1m +GET /old_index/_search?scroll=1m { "query": { "range": { @@ -37,6 +37,7 @@ GET /old_index/_search?search_type=scan&scroll=1m } } }, + "sort": ["_doc"], "size": 1000 } -------------------------------------------------- diff --git a/400_Relationships/25_Concurrency.asciidoc b/400_Relationships/25_Concurrency.asciidoc index d5de8f754..7595b3213 100644 --- a/400_Relationships/25_Concurrency.asciidoc +++ b/400_Relationships/25_Concurrency.asciidoc @@ -182,7 +182,7 @@ PUT /fs/file/1?version=2 <1> We can even rename a directory, but this means updating all of the files that exist anywhere in the path hierarchy beneath that directory. This may be quick or slow, depending on how many files need to be updated. All we would -need to do is to use <> to retrieve all the +need to do is to use <> to retrieve all the files, and the <> to update them. The process isn't atomic, but all files will quickly move to their new home. diff --git a/400_Relationships/26_Concurrency_solutions.asciidoc b/400_Relationships/26_Concurrency_solutions.asciidoc index 9e06cc6dd..10bd1d5e3 100644 --- a/400_Relationships/26_Concurrency_solutions.asciidoc +++ b/400_Relationships/26_Concurrency_solutions.asciidoc @@ -81,10 +81,9 @@ parallelism by making our locking more fine-grained. ==== Document Locking Instead of locking the whole filesystem, we could lock individual documents -by using the same technique as previously described.((("locking", "document locking")))((("document locking"))) A process could use a -<> request to retrieve the IDs of all documents -that would be affected by the change, and would need to create a lock file for -each of them: +by using the same technique as previously described.((("locking", "document locking")))((("document locking"))) +We can use a <> to retrieve all documents that would be affected by the change and +create a lock file for each one: [source,json] -------------------------- @@ -93,7 +92,6 @@ PUT /fs/lock/_bulk { "process_id": 123 } <2> { "create": { "_id": 2}} { "process_id": 123 } -... -------------------------- <1> The ID of the `lock` document would be the same as the ID of the file that should be locked. @@ -135,41 +133,51 @@ POST /fs/lock/1/_update } -------------------------- -If the document doesn't already exist, the `upsert` document will be inserted--much the same as the `create` request we used previously. However, if the -document _does_ exist, the script will look at the `process_id` stored in the -document. If it is the same as ours, it aborts the update (`noop`) and -returns success. If it is different, the `assert false` throws an exception -and we know that the lock has failed. +If the document doesn't already exist, the `upsert` document is inserted--much +the same as the previous `create` request. However, if the +document _does_ exist, the script looks at the `process_id` stored in the +document. If the `process_id` matches, no update is performed (`noop`) but the +script returns successfully. If it is different, `assert false` throws an exception +and you know that the lock has failed. + +Once all locks have been successfully created, you can proceed with your changes. 
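For example, if the change being protected is the rename operation discussed earlier, a minimal sketch (assuming the same `/fs/file` documents, with purely illustrative IDs and `path` values) could update the `path` of every locked file in one bulk request:

[source,json]
--------------------------
POST /fs/file/_bulk
{ "update": { "_id": 1}} <1>
{ "doc": { "path": "/clinton/projects/elasticsearch-book" }} <2>
{ "update": { "_id": 2}}
{ "doc": { "path": "/clinton/projects/elasticsearch-book" }}
--------------------------
<1> The IDs match the file documents that were locked in the previous step.
<2> The new `path` value shown here is just an example.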
+ +Afterward, you must release all of the locks, which you can do by +retrieving all of the locked documents and performing a bulk delete: -Once all locks have been successfully created, the rename operation can begin. -Afterward, we must release((("delete-by-query request"))) all of the locks, which we can do with a -`delete-by-query` request: [source,json] -------------------------- POST /fs/_refresh <1> -DELETE /fs/lock/_query +GET /fs/lock/_search?scroll=1m <2> { - "query": { - "term": { - "process_id": 123 + "sort" : ["_doc"], + "query": { + "match" : { + "process_id" : 123 + } } - } } + +PUT /fs/lock/_bulk +{ "delete": { "_id": 1}} +{ "delete": { "_id": 2}} -------------------------- <1> The `refresh` call ensures that all `lock` documents are visible to - the `delete-by-query` request. + the search request. +<2> You can use a <> query when you need to retrieve large +numbers of results with a single search request. Document-level locking enables fine-grained access control, but creating lock -files for millions of documents can be expensive. In certain scenarios, such -as this example with directory trees, it is possible to achieve fine-grained -locking with much less work. +files for millions of documents can be expensive. In some cases, +you can achieve fine-grained locking with much less work, as shown in the +following directory tree scenario. [[tree-locking]] ==== Tree Locking -Rather than locking every involved document, as in the previous option, we +Rather than locking every involved document as in the previous example, we could lock just part of the directory tree.((("locking", "tree locking"))) We will need exclusive access to the file or directory that we want to rename, which can be achieved with an _exclusive lock_ document: diff --git a/410_Scaling/45_Index_per_timeframe.asciidoc b/410_Scaling/45_Index_per_timeframe.asciidoc index b247a8bfc..75f1c297a 100644 --- a/410_Scaling/45_Index_per_timeframe.asciidoc +++ b/410_Scaling/45_Index_per_timeframe.asciidoc @@ -29,25 +29,8 @@ data. If we were to have one big index for documents of this type, we would soon run out of space. Logging events just keep on coming, without pause or -interruption. We could delete the old events, with a `delete-by-query`: - -[source,json] -------------------------- -DELETE /logs/event/_query -{ - "query": { - "range": { - "@timestamp": { <1> - "lt": "now-90d" - } - } - } -} -------------------------- -<1> Deletes all documents where Logstash's `@timestamp` field is - older than 90 days. - -But this approach is _very inefficient_. Remember that when you delete a +interruption. We could delete the old events with a <> +query and bulk delete, but this approach is _very inefficient_. When you delete a document, it is only _marked_ as deleted (see <>). It won't be physically deleted until the segment containing it is merged away. diff --git a/410_Scaling/75_One_big_user.asciidoc b/410_Scaling/75_One_big_user.asciidoc index a31f02712..754400376 100644 --- a/410_Scaling/75_One_big_user.asciidoc +++ b/410_Scaling/75_One_big_user.asciidoc @@ -23,7 +23,7 @@ PUT /baking_v1 ------------------------------ The next step is to migrate the data from the shared index into the dedicated -index, which can be done using <> and the +index, which can be done using a <> query and the <>. As soon as the migration is finished, the index alias can be updated to point to the new index: @@ -47,20 +47,8 @@ just rely on the default sharding that Elasticsearch does using each document's `_id` field. 
The last step is to remove the old documents from the shared index, which can -be done with a `delete-by-query` request, using the original routing value and -forum ID: - -[source,json] ------------------------------- -DELETE /forums/post/_query?routing=baking -{ - "query": { - "term": { - "forum_id": "baking" - } - } -} ------------------------------- +be done by searching with the original routing value and forum ID and performing +a bulk delete. The beauty of this index-per-user model is that it allows you to reduce resources, keeping costs low, while still giving you the flexibility to scale diff --git a/510_Deployment/40_config.asciidoc b/510_Deployment/40_config.asciidoc index 6c6169b80..f142db6da 100644 --- a/510_Deployment/40_config.asciidoc +++ b/510_Deployment/40_config.asciidoc @@ -206,41 +206,39 @@ NOTE: These settings can only be set in the `config/elasticsearch.yml` file or o the command line (they are not dynamically updatable) and they are only relevant during a full cluster restart. +[[unicast]] ==== Prefer Unicast over Multicast -Elasticsearch is configured to use multicast discovery out of the box. Multicast((("configuration changes, important", "preferring unicast over multicast")))((("unicast, preferring over multicast")))((("multicast versus unicast"))) -works by sending UDP pings across your local network to discover nodes. Other -Elasticsearch nodes will receive these pings and respond. A cluster is formed -shortly after. +Elasticsearch is configured to use unicast discovery out of the box to prevent +nodes from accidentally joining a cluster. Only nodes running on the same +machine will automatically form a cluster. -Multicast is excellent for development, since you don't need to do anything. Turn -a few nodes on, and they automatically find each other and form a cluster. - -This ease of use is the exact reason you should disable it in production. The +While multicast is still https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-multicast.html[provided +as a plugin], it should never be used in production. The last thing you want is for nodes to accidentally join your production network, simply because they received an errant multicast ping. There is nothing wrong with multicast _per se_. Multicast simply leads to silly problems, and can be a bit more fragile (for example, a network engineer fiddles with the network without telling you--and all of a sudden nodes can't find each other anymore). -In production, it is recommended to use unicast instead of multicast. This works -by providing Elasticsearch a list of nodes that it should try to contact. Once -the node contacts a member of the unicast list, it will receive a full cluster -state that lists all nodes in the cluster. It will then proceed to contact -the master and join. +To use unicast, you provide Elasticsearch with a list of nodes that it should try to contact. +When a node contacts a member of the unicast list, it receives a full cluster +state that lists all of the nodes in the cluster. It then contacts +the master and joins the cluster. -This means your unicast list does not need to hold all the nodes in your cluster. +This means your unicast list does not need to include all of the nodes in your cluster. It just needs enough nodes that a new node can find someone to talk to. If you use dedicated masters, just list your three dedicated masters and call it a day. 
-This setting is configured in your `elasticsearch.yml`: +This setting is configured in `elasticsearch.yml`: [source,yaml] ---- -discovery.zen.ping.multicast.enabled: false <1> discovery.zen.ping.unicast.hosts: ["host1", "host2:port"] ---- -<1> Make sure you disable multicast, since it can operate in parallel with unicast. +For more information about how Elasticsearch nodes find each other, see +https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery] +in the Elasticsearch Reference. diff --git a/520_Post_Deployment/40_rolling_restart.asciidoc b/520_Post_Deployment/40_rolling_restart.asciidoc index 60caf629f..1aa93dc4f 100644 --- a/520_Post_Deployment/40_rolling_restart.asciidoc +++ b/520_Post_Deployment/40_rolling_restart.asciidoc @@ -37,14 +37,7 @@ PUT /_cluster/settings } ---- -3. Shut down a single node, preferably using the `shutdown` API on that particular -machine: -+ -[source,js] ----- -POST /_cluster/nodes/_local/_shutdown ----- - +3. Shut down a single node. 4. Perform a maintenance/upgrade. 5. Restart the node, and confirm that it joins the cluster. 6. Reenable shard allocation as follows: diff --git a/stash/Terminology.asciidoc b/stash/Terminology.asciidoc index 9a98b7f60..a10a9ba0e 100644 --- a/stash/Terminology.asciidoc +++ b/stash/Terminology.asciidoc @@ -12,10 +12,13 @@ other. Node:: -A node is a running instance of Elasticsearch. By default, a new node uses -multicast to automagically discover an existing cluster with the same cluster -name, but it can also configured to use unicast or Amazon's EC2 discovery -instead. +A node is a running instance of Elasticsearch. By default, a new node is +configured to use unicast to discover and join a cluster. If other +nodes are running on the same host, it can join the cluster automatically. +Otherwise, you need to configure a list of unicast hosts it can contact +to get the information it needs to join the cluster. You can also +install one of the https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery.html[ +discovery plugins] to use a different discovery mechanism. + Every node in the cluster knows about every other node in the cluster and what data it is responsible for. You can send your requests to any node