First round of phase 2 changes to sync up with version 2.x.
debadair committed Jan 7, 2016
1 parent e1688d9 commit baecf3d
Showing 16 changed files with 148 additions and 179 deletions.
5 changes: 4 additions & 1 deletion 010_Intro/10_Installing_ES.asciidoc
@@ -73,7 +73,10 @@ start experimenting with it. A _node_ is a running instance of Elasticsearch.
((("nodes", "defined"))) A _cluster_ is ((("clusters", "defined")))a group of
nodes with the same `cluster.name` that are working together to share data
and to provide failover and scale. (A single node, however, can form a cluster
all by itself.)
all by itself.) You can change the `cluster.name` in the `elasticsearch.yml` configuration
file that's loaded when you start a node. More information about this and other
<<important-configuration-changes, Important Configuration Changes>> is provided
in the Production Deployment section at the end of this book.
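
A quick way to verify which cluster a node has joined is to query the root endpoint;
this is just a sanity check, and the exact fields in the response vary slightly by version:

[source,js]
--------------------------------------------------
GET / <1>
--------------------------------------------------
<1> The response includes the node's `name` and the `cluster_name` it is using.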

TIP: See that View in Sense link at the bottom of the example? <<sense, Install the Sense console>>
to run the examples in this book against your own Elasticsearch cluster and view the results.
7 changes: 3 additions & 4 deletions 020_Distributed_Cluster/20_Add_failover.asciidoc
@@ -13,11 +13,10 @@ in exactly the same way as you started the first one (see
share the same directory.
When you run a second node on the same machine, it automatically discovers
and joins the cluster as long as it has the same `cluster.name` as the first node (see
the `./config/elasticsearch.yml` file). However, for nodes running on different machines
and joins the cluster as long as it has the same `cluster.name` as the first node.
However, for nodes running on different machines
to join the same cluster, you need to configure a list of unicast hosts the nodes can contact
to join the cluster. For more information about how Elasticsearch nodes find each other, see https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery]
in the Elasticsearch Reference.
to join the cluster. For more information, see <<unicast, Prefer Unicast over Multicast>>.
***************************************

2 changes: 1 addition & 1 deletion 052_Mapping_Analysis/25_Data_type_differences.asciidoc
@@ -44,7 +44,7 @@ This gives us the following:
"properties": {
"date": {
"type": "date",
"format": "dateOptionalTime"
"format": "strict_date_optional_time||epoch_millis"
},
"name": {
"type": "string"
2 changes: 1 addition & 1 deletion 052_Mapping_Analysis/45_Mapping.asciidoc
@@ -75,7 +75,7 @@ Elasticsearch generated dynamically from the documents that we indexed:
"properties": {
"date": {
"type": "date",
"format": "dateOptionalTime"
"format": "strict_date_optional_time||epoch_millis"
},
"name": {
"type": "string"
2 changes: 1 addition & 1 deletion 060_Distributed_Search.asciidoc
@@ -6,5 +6,5 @@ include::060_Distributed_Search/10_Fetch_phase.asciidoc[]

include::060_Distributed_Search/15_Search_options.asciidoc[]

include::060_Distributed_Search/20_Scan_and_scroll.asciidoc[]
include::060_Distributed_Search/20_Scroll.asciidoc[]

2 changes: 1 addition & 1 deletion 060_Distributed_Search/10_Fetch_phase.asciidoc
@@ -57,7 +57,7 @@ culprits are usually bots or web spiders that tirelessly keep fetching page
after page until your servers crumble at the knees.
If you _do_ need to fetch large numbers of docs from your cluster, you can
do so efficiently by disabling sorting with the `scan` search type,
do so efficiently by disabling sorting with the `scroll` query,
which we discuss <<scan-scroll,later in this chapter>>.
****
81 changes: 0 additions & 81 deletions 060_Distributed_Search/20_Scan_and_scroll.asciidoc

This file was deleted.

74 changes: 74 additions & 0 deletions 060_Distributed_Search/20_Scroll.asciidoc
@@ -0,0 +1,74 @@
[[scroll]]
=== Scroll

A `scroll` query ((("scroll API"))) is used to retrieve
large numbers of documents from Elasticsearch efficiently, without paying the
penalty of deep pagination.

Scrolling allows us to((("scrolled search"))) do an initial search and to keep pulling
batches of results from Elasticsearch until there are no more results left.
It's a bit like a _cursor_ in ((("cursors")))a traditional database.

A scrolled search takes a snapshot in time. It doesn't see any changes that
are made to the index after the initial search request has been made. It does
this by keeping the old data files around, so that it can preserve its ``view''
on what the index looked like at the time it started.

The costly part of deep pagination is the global sorting of results, but if we
disable sorting, then we can return all documents quite cheaply. To do this, we
sort by `_doc`. This instructs Elasticsearch to just return the next batch of
results from every shard that still has results to return.

To scroll through results, we execute a search request and set the `scroll` value to
the length of time we want to keep the scroll window open. The scroll expiry
time is refreshed every time we run a scroll request, so it only needs to be long enough
to process the current batch of results, not all of the documents that match
the query. The timeout is important because keeping the scroll window open
consumes resources and we want to free them as soon as they are no longer needed.
Setting the timeout enables Elasticsearch to automatically free the resources
after a small period of inactivity.

[source,js]
--------------------------------------------------
GET /old_index/_search?scroll=1m <1>
{
"query": { "match_all": {}},
"sort" : ["_doc"], <2>
"size": 1000
}
--------------------------------------------------
<1> Keep the scroll window open for 1 minute.
<2> `_doc` is the most efficient sort order.

The response to this request includes a
`_scroll_id`, which is a long Base-64 encoded((("scroll_id"))) string. Now we can pass the
`_scroll_id` to the `_search/scroll` endpoint to retrieve the next batch of
results:

[source,js]
--------------------------------------------------
GET /_search/scroll
{
"scroll": "1m", <1>
"scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs="
}
--------------------------------------------------
<1> Note that we again set the scroll expiration to 1m.

The response to this scroll request includes the next batch of results.
Although we specified a `size` of 1,000, we get back many more
documents.((("size parameter", "in scanning"))) When scrolling, the `size` is applied to each shard, so you will
get back a maximum of `size * number_of_primary_shards` documents in each
batch.

NOTE: The scroll request also returns a _new_ `_scroll_id`. Every time
we make the next scroll request, we must pass the `_scroll_id` returned by the
_previous_ scroll request.

When no more hits are returned, we have processed all matching documents.
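
If you finish with a scroll before its window expires, it is good practice to release the
associated resources explicitly rather than waiting for the timeout. A minimal sketch using
the clear-scroll API follows; the exact request format varies between Elasticsearch versions,
and the `scroll_id` value shown is just a truncated placeholder:

[source,js]
--------------------------------------------------
DELETE /_search/scroll
{
    "scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5..." <1>
}
--------------------------------------------------
<1> Pass the most recent `_scroll_id` returned by your scroll requests.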

TIP: Some of the official Elasticsearch clients, such as the
http://elasticsearch-py.readthedocs.org/en/master/helpers.html#scan[Python client] and
https://metacpan.org/pod/Search::Elasticsearch::Scroll[Perl client], provide scroll helpers that
are easy-to-use wrappers around this functionality.

5 changes: 3 additions & 2 deletions 070_Index_Mgmt/50_Reindexing.asciidoc
@@ -15,7 +15,7 @@ whole document available to you in Elasticsearch itself. You don't have to
rebuild your index from the database, which is usually much slower.

To reindex all of the documents from the old index efficiently, use
<<scan-scroll,_scan-and-scroll_>> to retrieve batches((("scan-and-scroll", "using in reindexing documents"))) of documents from the old index,
<<scan-scroll,_scroll_>> to retrieve batches((("using in reindexing documents"))) of documents from the old index,
and the <<bulk,`bulk` API>> to push them into the new index.

.Reindexing in Batches
@@ -27,7 +27,7 @@ jobs by filtering on a date or timestamp field:
[source,js]
--------------------------------------------------
GET /old_index/_search?search_type=scan&scroll=1m
GET /old_index/_search?scroll=1m
{
"query": {
"range": {
@@ -37,6 +37,7 @@ GET /old_index/_search?search_type=scan&scroll=1m
}
}
},
"sort": ["_doc"],
"size": 1000
}
--------------------------------------------------
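
The `bulk` half of the reindexing loop is not shown above; as a rough sketch (the new index
name, document IDs, and fields are purely illustrative), each batch of hits returned by the
scroll would be pushed into the new index like this:

[source,js]
--------------------------------------------------
POST /new_index/_bulk
{ "index": { "_type": "event", "_id": "1" }} <1>
{ "@timestamp": "2014-08-01T12:00:00Z", "message": "..." } <2>
{ "index": { "_type": "event", "_id": "2" }}
{ "@timestamp": "2014-08-02T08:30:00Z", "message": "..." }
--------------------------------------------------
<1> Reusing each document's original `_type` and `_id` preserves its identity in the new index.
<2> The document body is simply the `_source` returned by the scrolled search.
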
2 changes: 1 addition & 1 deletion 400_Relationships/25_Concurrency.asciidoc
@@ -182,7 +182,7 @@ PUT /fs/file/1?version=2 <1>
We can even rename a directory, but this means updating all of the files that
exist anywhere in the path hierarchy beneath that directory. This may be
quick or slow, depending on how many files need to be updated. All we would
need to do is to use <<scan-scroll,scan-and-scroll>> to retrieve all the
need to do is to use <<scan-scroll,`scroll`>> to retrieve all the
files, and the <<bulk,`bulk` API>> to update them. The process isn't
atomic, but all files will quickly move to their new home.
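
As a rough sketch of that second step (the document IDs and `path` values follow this
chapter's filesystem example and are purely illustrative), the retrieved files could be
updated in bulk with their new paths:

[source,json]
--------------------------
POST /fs/file/_bulk
{ "update": { "_id": 1 }} <1>
{ "doc": { "path": "/clinton/projects/renamed_dir/index.txt" }} <2>
{ "update": { "_id": 2 }}
{ "doc": { "path": "/clinton/projects/renamed_dir/README.txt" }}
--------------------------
<1> Each action targets a file document returned by the scrolled search.
<2> A partial `doc` update rewrites just the `path` field under the renamed directory.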

54 changes: 31 additions & 23 deletions 400_Relationships/26_Concurrency_solutions.asciidoc
@@ -81,10 +81,9 @@ parallelism by making our locking more fine-grained.
==== Document Locking

Instead of locking the whole filesystem, we could lock individual documents
by using the same technique as previously described.((("locking", "document locking")))((("document locking"))) A process could use a
<<scan-scroll,scan-and-scroll>> request to retrieve the IDs of all documents
that would be affected by the change, and would need to create a lock file for
each of them:
by using the same technique as previously described.((("locking", "document locking")))((("document locking")))
We can use a <<scroll,scrolled search>> to retrieve all documents that would be affected by the change and
create a lock file for each one:

[source,json]
--------------------------
@@ -93,7 +92,6 @@ PUT /fs/lock/_bulk
{ "process_id": 123 } <2>
{ "create": { "_id": 2}}
{ "process_id": 123 }
...
--------------------------
<1> The ID of the `lock` document would be the same as the ID of the file
that should be locked.
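
The scrolled search that feeds this bulk request is not shown here; a minimal sketch, assuming
the affected files can be selected by a `prefix` query on their `path` (both the field and the
value are illustrative), might look like this:

[source,json]
--------------------------
GET /fs/file/_search?scroll=1m
{
    "sort" : ["_doc"],
    "_source" : false, <1>
    "query" : {
        "prefix" : { "path" : "/clinton/projects/" }
    }
}
--------------------------
<1> Only the document IDs are needed in order to create the corresponding `lock` documents.
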
@@ -135,41 +133,51 @@ POST /fs/lock/1/_update
}
--------------------------

If the document doesn't already exist, the `upsert` document will be inserted--much the same as the `create` request we used previously. However, if the
document _does_ exist, the script will look at the `process_id` stored in the
document. If it is the same as ours, it aborts the update (`noop`) and
returns success. If it is different, the `assert false` throws an exception
and we know that the lock has failed.
If the document doesn't already exist, the `upsert` document is inserted--much
the same as the previous `create` request. However, if the
document _does_ exist, the script looks at the `process_id` stored in the
document. If the `process_id` matches, no update is performed (`noop`) but the
script returns successfully. If it is different, `assert false` throws an exception
and you know that the lock has failed.

Once all locks have been successfully created, you can proceed with your changes.

Afterward, you must release all of the locks, which you can do by
retrieving all of the locked documents and performing a bulk delete:

Once all locks have been successfully created, the rename operation can begin.
Afterward, we must release((("delete-by-query request"))) all of the locks, which we can do with a
`delete-by-query` request:

[source,json]
--------------------------
POST /fs/_refresh <1>
DELETE /fs/lock/_query
GET /fs/lock/_search?scroll=1m <2>
{
"query": {
"term": {
"process_id": 123
"sort" : ["_doc"],
"query": {
"match" : {
"process_id" : 123
}
}
}
}
PUT /fs/lock/_bulk
{ "delete": { "_id": 1}}
{ "delete": { "_id": 2}}
--------------------------
<1> The `refresh` call ensures that all `lock` documents are visible to
the `delete-by-query` request.
the search request.
<2> You can use a <<scan-scroll,`scroll`>> query when you need to retrieve large
numbers of results with a single search request.

Document-level locking enables fine-grained access control, but creating lock
files for millions of documents can be expensive. In certain scenarios, such
as this example with directory trees, it is possible to achieve fine-grained
locking with much less work.
files for millions of documents can be expensive. In some cases,
you can achieve fine-grained locking with much less work, as shown in the
following directory tree scenario.

[[tree-locking]]
==== Tree Locking

Rather than locking every involved document, as in the previous option, we
Rather than locking every involved document as in the previous example, we
could lock just part of the directory tree.((("locking", "tree locking"))) We will need exclusive access
to the file or directory that we want to rename, which can be achieved with an
_exclusive lock_ document:
21 changes: 2 additions & 19 deletions 410_Scaling/45_Index_per_timeframe.asciidoc
@@ -29,25 +29,8 @@ data.

If we were to have one big index for documents of this type, we would soon run
out of space. Logging events just keep on coming, without pause or
interruption. We could delete the old events, with a `delete-by-query`:

[source,json]
-------------------------
DELETE /logs/event/_query
{
"query": {
"range": {
"@timestamp": { <1>
"lt": "now-90d"
}
}
}
}
-------------------------
<1> Deletes all documents where Logstash's `@timestamp` field is
older than 90 days.

But this approach is _very inefficient_. Remember that when you delete a
interruption. We could delete the old events with a <<scan-scroll,`scroll`>>
query and bulk delete, but this approach is _very inefficient_. When you delete a
document, it is only _marked_ as deleted (see <<deletes-and-updates>>). It won't
be physically deleted until the segment containing it is merged away.
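
For reference, the scrolled search behind that (inefficient) approach would mirror the
`delete-by-query` example it replaces, with each batch of matching IDs then removed via the
`bulk` API; a sketch:

[source,json]
-------------------------
GET /logs/event/_search?scroll=1m
{
    "sort" : ["_doc"],
    "query" : {
        "range" : {
            "@timestamp" : { <1>
                "lt" : "now-90d"
            }
        }
    }
}
-------------------------
<1> Matches all events whose Logstash `@timestamp` field is older than 90 days.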
