First round of phase 2 changes to sync up with version 2.x.
debadair committed Jan 7, 2016
1 parent e1688d9 commit baecf3d
Showing 16 changed files with 148 additions and 179 deletions.
5 changes: 4 additions & 1 deletion 010_Intro/10_Installing_ES.asciidoc
@@ -73,7 +73,10 @@ start experimenting with it. A _node_ is a running instance of Elasticsearch.
((("nodes", "defined"))) A _cluster_ is ((("clusters", "defined")))a group of
nodes with the same `cluster.name` that are working together to share data
and to provide failover and scale. (A single node, however, can form a cluster
all by itself.)
all by itself.) You can change the `cluster.name` in the `elasticsearch.yml` configuration
file that's loaded when you start a node. More information about this and other
<<important-configuration-changes, Important Configuration Changes>> is provided
in the Production Deployment section at the end of this book.
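
A quick way to verify which cluster a node has joined is to query the root endpoint;
this is just a sanity check, and the exact fields in the response vary slightly by version:

[source,js]
--------------------------------------------------
GET / <1>
--------------------------------------------------
<1> The response includes the node's `name` and the `cluster_name` it is using.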

TIP: See that View in Sense link at the bottom of the example? <<sense, Install the Sense console>>
to run the examples in this book against your own Elasticsearch cluster and view the results.
7 changes: 3 additions & 4 deletions 020_Distributed_Cluster/20_Add_failover.asciidoc
@@ -13,11 +13,10 @@ in exactly the same way as you started the first one (see
share the same directory.
When you run a second node on the same machine, it automatically discovers
and joins the cluster as long as it has the same `cluster.name` as the first node (see
the `./config/elasticsearch.yml` file). However, for nodes running on different machines
and joins the cluster as long as it has the same `cluster.name` as the first node.
However, for nodes running on different machines
to join the same cluster, you need to configure a list of unicast hosts the nodes can contact
to join the cluster. For more information about how Elasticsearch nodes find each other, see https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery]
in the Elasticsearch Reference.
to join the cluster. For more information, see <<unicast, Prefer Unicast over Multicast>>.
***************************************

2 changes: 1 addition & 1 deletion 052_Mapping_Analysis/25_Data_type_differences.asciidoc
@@ -44,7 +44,7 @@ This gives us the following:
"properties": {
"date": {
"type": "date",
"format": "dateOptionalTime"
"format": "strict_date_optional_time||epoch_millis"
},
"name": {
"type": "string"
2 changes: 1 addition & 1 deletion 052_Mapping_Analysis/45_Mapping.asciidoc
@@ -75,7 +75,7 @@ Elasticsearch generated dynamically from the documents that we indexed:
"properties": {
"date": {
"type": "date",
"format": "dateOptionalTime"
"format": "strict_date_optional_time||epoch_millis"
},
"name": {
"type": "string"
2 changes: 1 addition & 1 deletion 060_Distributed_Search.asciidoc
@@ -6,5 +6,5 @@ include::060_Distributed_Search/10_Fetch_phase.asciidoc[]

include::060_Distributed_Search/15_Search_options.asciidoc[]

include::060_Distributed_Search/20_Scan_and_scroll.asciidoc[]
include::060_Distributed_Search/20_Scroll.asciidoc[]

2 changes: 1 addition & 1 deletion 060_Distributed_Search/10_Fetch_phase.asciidoc
@@ -57,7 +57,7 @@ culprits are usually bots or web spiders that tirelessly keep fetching page
after page until your servers crumble at the knees.
If you _do_ need to fetch large numbers of docs from your cluster, you can
do so efficiently by disabling sorting with the `scan` search type,
do so efficiently by disabling sorting with the `scroll` query,
which we discuss <<scan-scroll,later in this chapter>>.
****
81 changes: 0 additions & 81 deletions 060_Distributed_Search/20_Scan_and_scroll.asciidoc

This file was deleted.

74 changes: 74 additions & 0 deletions 060_Distributed_Search/20_Scroll.asciidoc
@@ -0,0 +1,74 @@
[[scroll]]
=== Scroll

A `scroll` query ((("scroll API"))) is used to retrieve
large numbers of documents from Elasticsearch efficiently, without paying the
penalty of deep pagination.

Scrolling allows us to((("scrolled search"))) do an initial search and to keep pulling
batches of results from Elasticsearch until there are no more results left.
It's a bit like a _cursor_ in ((("cursors")))a traditional database.

A scrolled search takes a snapshot in time. It doesn't see any changes that
are made to the index after the initial search request has been made. It does
this by keeping the old data files around, so that it can preserve its ``view''
on what the index looked like at the time it started.

The costly part of deep pagination is the global sorting of results, but if we
disable sorting, then we can return all documents quite cheaply. To do this, we
sort by `_doc`. This instructs Elasticsearch to just return the next batch of
results from every shard that still has results to return.

To scroll through results, we execute a search request and set the `scroll` value to
the length of time we want to keep the scroll window open. The scroll expiry
time is refreshed every time we run a scroll request, so it only needs to be long enough
to process the current batch of results, not all of the documents that match
the query. The timeout is important because keeping the scroll window open
consumes resources and we want to free them as soon as they are no longer needed.
Setting the timeout enables Elasticsearch to automatically free the resources
after a small period of inactivity.

[source,js]
--------------------------------------------------
GET /old_index/_search?scroll=1m <1>
{
"query": { "match_all": {}},
"sort" : ["_doc"], <2>
"size": 1000
}
--------------------------------------------------
<1> Keep the scroll window open for 1 minute.
<2> `_doc` is the most efficient sort order.

The response to this request includes a
`_scroll_id`, which is a long Base-64 encoded((("scroll_id"))) string. Now we can pass the
`_scroll_id` to the `_search/scroll` endpoint to retrieve the next batch of
results:

[source,js]
--------------------------------------------------
GET /_search/scroll
{
"scroll": "1m", <1>
"scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs="
}
--------------------------------------------------
<1> Note that we again set the scroll expiration to 1m.

The response to this scroll request includes the next batch of results.
Although we specified a `size` of 1,000, we get back many more
documents.((("size parameter", "in scanning"))) When scrolling, the `size` is applied to each shard, so you will
get back a maximum of `size * number_of_primary_shards` documents in each
batch.

NOTE: The scroll request also returns a _new_ `_scroll_id`. Every time
we make the next scroll request, we must pass the `_scroll_id` returned by the
_previous_ scroll request.

When no more hits are returned, we have processed all matching documents.
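
If you finish with a scroll before its window expires, it is good practice to release the
associated resources explicitly rather than waiting for the timeout. A minimal sketch using
the clear-scroll API follows; the exact request format varies between Elasticsearch versions,
and the `scroll_id` value shown is just a truncated placeholder:

[source,js]
--------------------------------------------------
DELETE /_search/scroll
{
    "scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5..." <1>
}
--------------------------------------------------
<1> Pass the most recent `_scroll_id` returned by your scroll requests.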

TIP: Some of the official Elasticsearch clients, such as the
http://elasticsearch-py.readthedocs.org/en/master/helpers.html#scan[Python client] and
https://metacpan.org/pod/Search::Elasticsearch::Scroll[Perl client], provide scroll helpers that
are easy-to-use wrappers around this functionality.

5 changes: 3 additions & 2 deletions 070_Index_Mgmt/50_Reindexing.asciidoc
@@ -15,7 +15,7 @@ whole document available to you in Elasticsearch itself. You don't have to
rebuild your index from the database, which is usually much slower.

To reindex all of the documents from the old index efficiently, use
<<scan-scroll,_scan-and-scroll_>> to retrieve batches((("scan-and-scroll", "using in reindexing documents"))) of documents from the old index,
<<scan-scroll,_scroll_>> to retrieve batches((("using in reindexing documents"))) of documents from the old index,
and the <<bulk,`bulk` API>> to push them into the new index.

.Reindexing in Batches
@@ -27,7 +27,7 @@ jobs by filtering on a date or timestamp field:
[source,js]
--------------------------------------------------
GET /old_index/_search?search_type=scan&scroll=1m
GET /old_index/_search?scroll=1m
{
"query": {
"range": {
@@ -37,6 +37,7 @@ GET /old_index/_search?search_type=scan&scroll=1m
}
}
},
"sort": ["_doc"],
"size": 1000
}
--------------------------------------------------
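
The `bulk` half of the reindexing loop is not shown above; as a rough sketch (the new index
name, document IDs, and fields are purely illustrative), each batch of hits returned by the
scroll would be pushed into the new index like this:

[source,js]
--------------------------------------------------
POST /new_index/_bulk
{ "index": { "_type": "event", "_id": "1" }} <1>
{ "@timestamp": "2014-08-01T12:00:00Z", "message": "..." } <2>
{ "index": { "_type": "event", "_id": "2" }}
{ "@timestamp": "2014-08-02T08:30:00Z", "message": "..." }
--------------------------------------------------
<1> Reusing each document's original `_type` and `_id` preserves its identity in the new index.
<2> The document body is simply the `_source` returned by the scrolled search.
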
2 changes: 1 addition & 1 deletion 400_Relationships/25_Concurrency.asciidoc
@@ -182,7 +182,7 @@ PUT /fs/file/1?version=2 <1>
We can even rename a directory, but this means updating all of the files that
exist anywhere in the path hierarchy beneath that directory. This may be
quick or slow, depending on how many files need to be updated. All we would
need to do is to use <<scan-scroll,scan-and-scroll>> to retrieve all the
need to do is to use <<scan-scroll,`scroll`>> to retrieve all the
files, and the <<bulk,`bulk` API>> to update them. The process isn't
atomic, but all files will quickly move to their new home.
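
As a rough sketch of that second step (the document IDs and `path` values follow this
chapter's filesystem example and are purely illustrative), the retrieved files could be
updated in bulk with their new paths:

[source,json]
--------------------------
POST /fs/file/_bulk
{ "update": { "_id": 1 }} <1>
{ "doc": { "path": "/clinton/projects/renamed_dir/index.txt" }} <2>
{ "update": { "_id": 2 }}
{ "doc": { "path": "/clinton/projects/renamed_dir/README.txt" }}
--------------------------
<1> Each action targets a file document returned by the scrolled search.
<2> A partial `doc` update rewrites just the `path` field under the renamed directory.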

54 changes: 31 additions & 23 deletions 400_Relationships/26_Concurrency_solutions.asciidoc
@@ -81,10 +81,9 @@ parallelism by making our locking more fine-grained.
==== Document Locking

Instead of locking the whole filesystem, we could lock individual documents
by using the same technique as previously described.((("locking", "document locking")))((("document locking"))) A process could use a
<<scan-scroll,scan-and-scroll>> request to retrieve the IDs of all documents
that would be affected by the change, and would need to create a lock file for
each of them:
by using the same technique as previously described.((("locking", "document locking")))((("document locking")))
We can use a <<scroll,scrolled search>> to retrieve all documents that would be affected by the change and
create a lock file for each one:

[source,json]
--------------------------
@@ -93,7 +92,6 @@ PUT /fs/lock/_bulk
{ "process_id": 123 } <2>
{ "create": { "_id": 2}}
{ "process_id": 123 }
...
--------------------------
<1> The ID of the `lock` document would be the same as the ID of the file
that should be locked.
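
The scrolled search that feeds this bulk request is not shown here; a minimal sketch, assuming
the affected files can be selected by a `prefix` query on their `path` (both the field and the
value are illustrative), might look like this:

[source,json]
--------------------------
GET /fs/file/_search?scroll=1m
{
    "sort" : ["_doc"],
    "_source" : false, <1>
    "query" : {
        "prefix" : { "path" : "/clinton/projects/" }
    }
}
--------------------------
<1> Only the document IDs are needed in order to create the corresponding `lock` documents.
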
@@ -135,41 +133,51 @@ POST /fs/lock/1/_update
}
--------------------------

If the document doesn't already exist, the `upsert` document will be inserted--much the same as the `create` request we used previously. However, if the
document _does_ exist, the script will look at the `process_id` stored in the
document. If it is the same as ours, it aborts the update (`noop`) and
returns success. If it is different, the `assert false` throws an exception
and we know that the lock has failed.
If the document doesn't already exist, the `upsert` document is inserted--much
the same as the previous `create` request. However, if the
document _does_ exist, the script looks at the `process_id` stored in the
document. If the `process_id` matches, no update is performed (`noop`) but the
script returns successfully. If it is different, `assert false` throws an exception
and you know that the lock has failed.

Once all locks have been successfully created, you can proceed with your changes.

Afterward, you must release all of the locks, which you can do by
retrieving all of the locked documents and performing a bulk delete:

Once all locks have been successfully created, the rename operation can begin.
Afterward, we must release((("delete-by-query request"))) all of the locks, which we can do with a
`delete-by-query` request:

[source,json]
--------------------------
POST /fs/_refresh <1>
DELETE /fs/lock/_query
GET /fs/lock/_search?scroll=1m <2>
{
"query": {
"term": {
"process_id": 123
"sort" : ["_doc"],
"query": {
"match" : {
"process_id" : 123
}
}
}
}
PUT /fs/lock/_bulk
{ "delete": { "_id": 1}}
{ "delete": { "_id": 2}}
--------------------------
<1> The `refresh` call ensures that all `lock` documents are visible to
the `delete-by-query` request.
the search request.
<2> You can use a <<scan-scroll,`scroll`>> query when you need to retrieve large
numbers of results with a single search request.

Document-level locking enables fine-grained access control, but creating lock
files for millions of documents can be expensive. In certain scenarios, such
as this example with directory trees, it is possible to achieve fine-grained
locking with much less work.
files for millions of documents can be expensive. In some cases,
you can achieve fine-grained locking with much less work, as shown in the
following directory tree scenario.

[[tree-locking]]
==== Tree Locking

Rather than locking every involved document, as in the previous option, we
Rather than locking every involved document as in the previous example, we
could lock just part of the directory tree.((("locking", "tree locking"))) We will need exclusive access
to the file or directory that we want to rename, which can be achieved with an
_exclusive lock_ document:
21 changes: 2 additions & 19 deletions 410_Scaling/45_Index_per_timeframe.asciidoc
@@ -29,25 +29,8 @@ data.

If we were to have one big index for documents of this type, we would soon run
out of space. Logging events just keep on coming, without pause or
interruption. We could delete the old events, with a `delete-by-query`:

[source,json]
-------------------------
DELETE /logs/event/_query
{
"query": {
"range": {
"@timestamp": { <1>
"lt": "now-90d"
}
}
}
}
-------------------------
<1> Deletes all documents where Logstash's `@timestamp` field is
older than 90 days.

But this approach is _very inefficient_. Remember that when you delete a
interruption. We could delete the old events with a <<scan-scroll,`scroll`>>
query and bulk delete, but this approach is _very inefficient_. When you delete a
document, it is only _marked_ as deleted (see <<deletes-and-updates>>). It won't
be physically deleted until the segment containing it is merged away.
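
For reference, the scrolled search behind that (inefficient) approach would mirror the
`delete-by-query` example it replaces, with each batch of matching IDs then removed via the
`bulk` API; a sketch:

[source,json]
-------------------------
GET /logs/event/_search?scroll=1m
{
    "sort" : ["_doc"],
    "query" : {
        "range" : {
            "@timestamp" : { <1>
                "lt" : "now-90d"
            }
        }
    }
}
-------------------------
<1> Matches all events whose Logstash `@timestamp` field is older than 90 days.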
