diff --git a/056_Sorting.asciidoc b/056_Sorting.asciidoc index 0b33372e1..bfc24188a 100644 --- a/056_Sorting.asciidoc +++ b/056_Sorting.asciidoc @@ -4,5 +4,4 @@ include::056_Sorting/88_String_sorting.asciidoc[] include::056_Sorting/90_What_is_relevance.asciidoc[] -include::056_Sorting/95_Fielddata.asciidoc[] - +include::056_Sorting/95_Docvalues.asciidoc[] diff --git a/056_Sorting/88_String_sorting.asciidoc b/056_Sorting/88_String_sorting.asciidoc index 8f57c4ad8..f35a59058 100644 --- a/056_Sorting/88_String_sorting.asciidoc +++ b/056_Sorting/88_String_sorting.asciidoc @@ -71,5 +71,4 @@ GET /_search // SENSE: 056_Sorting/88_Multifield.json WARNING: Sorting on a full-text `analyzed` field can use a lot of memory. See -<> for more information. - +<> for more information. diff --git a/056_Sorting/95_Docvalues.asciidoc b/056_Sorting/95_Docvalues.asciidoc new file mode 100644 index 000000000..68a6b2614 --- /dev/null +++ b/056_Sorting/95_Docvalues.asciidoc @@ -0,0 +1,44 @@ +[[docvalues-intro]] +=== Doc Values Intro + +Our final topic in this chapter is about an internal aspect of Elasticsearch. +While we don't demonstrate any new techniques here, doc values are an +important topic that we will refer to repeatedly, and is something that you +should be aware of.((("docvalues"))) + +When you sort on a field, Elasticsearch needs access to the value of that +field for every document that matches the query.((("inverted index", "sorting and"))) The inverted index, which +performs very well when searching, is not the ideal structure for sorting on +field values: + +* When searching, we need to be able to map a term to a list of documents. + +* When sorting, we need to map a document to its terms. In other words, we + need to ``uninvert'' the inverted index. + +This ``uninverted'' structure is often called a ``column-store'' in other systems. +Essentially, it stores all the values for a single field together in a single +column of data, which makes it very efficient for operations like sorting. + +In Elasticsearch, this column-store is known as _doc values_, and is enabled +by default. Doc values are created at index-time: when a field is indexed, Elasticsearch +adds the tokens to the inverted index for search. But it also extracts the terms +and adds them to the columnar doc values. + +Doc values are used in several places in Elasticsearch: + +* Sorting on a field +* Aggregations on a field +* Certain filters (for example, geolocation filters) +* Scripts that refer to fields + +Because doc values are serialized to disk, we can leverage the OS to help keep +access fast. When the "working set" is smaller than the available memory on a node, +the OS will naturally keep all the doc values hot in memory, leading to very fast +access. When the "working set" is much larger than available memory, the OS will +naturally start to page doc-values on/off disk without running into the dreaded +OutOfMemory exception. + +We'll talk about doc values in much greater depth later. For now, all you need +to know is that sorting (and some other operations) happen on a parallel data +structure which is built at index-time. diff --git a/056_Sorting/95_Fielddata.asciidoc b/056_Sorting/95_Fielddata.asciidoc deleted file mode 100644 index 10ba6a947..000000000 --- a/056_Sorting/95_Fielddata.asciidoc +++ /dev/null @@ -1,54 +0,0 @@ -[[fielddata-intro]] -=== Fielddata - -Our final topic in this chapter is about an internal aspect of Elasticsearch. 
-While we don't demonstrate any new techniques here, fielddata is an -important topic that we will refer to repeatedly, and is something that you -should be aware of.((("fielddata"))) - -When you sort on a field, Elasticsearch needs access to the value of that -field for every document that matches the query.((("inverted index", "sorting and"))) The inverted index, which -performs very well when searching, is not the ideal structure for sorting on -field values: - -* When searching, we need to be able to map a term to a list of documents. - -* When sorting, we need to map a document to its terms. In other words, we - need to ``uninvert'' the inverted index. - -To make sorting efficient, Elasticsearch loads all the values for -the field that you want to sort on into memory. This is referred to as -_fielddata_. - -WARNING: Elasticsearch doesn't just load the values for the documents that matched a -particular query. It loads the values from _every document in your index_, -regardless of the document `type`. - -The reason that Elasticsearch loads all values into memory is that uninverting the index -from disk is slow. Even though you may need the values for only a few docs -for the current request, you will probably need access to the values for other -docs on the next request, so it makes sense to load all the values into memory -at once, and to keep them there. - -Fielddata is used in several places in Elasticsearch: - -* Sorting on a field -* Aggregations on a field -* Certain filters (for example, geolocation filters) -* Scripts that refer to fields - -Clearly, this can consume a lot of memory, especially for high-cardinality -string fields--string fields that have many unique values--like the body -of an email. Fortunately, insufficient memory is a problem that can be solved -by horizontal scaling, by adding more nodes to your cluster. - -For now, all you need to know is what fielddata is, and to be aware that it -can be memory hungry. Later, we will show you how to determine the amount of memory that fielddata -is using, how to limit the amount of memory that is available to it, and -how to preload fielddata to improve the user experience. - - - - - - diff --git a/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc b/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc index c987c6bfd..57cfb87da 100644 --- a/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc +++ b/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc @@ -1,20 +1,28 @@ === Limiting Memory Usage -In order for aggregations (or any operation that requires access to field -values) to be fast, ((("aggregations", "limiting memory usage")))access to fielddata must be fast, which is why it is -loaded into memory. ((("fielddata")))((("memory usage", "limiting for aggregations", id="ix_memagg"))) But loading too much data into memory will cause slow -garbage collections as the JVM tries to find extra space in the heap, or -possibly even an OutOfMemory exception. +Once analyzed strings have been loaded into fielddata, they will sit there until +evicted (or your node crashes). For that reason it is important to keep an eye on this +memory usage, understand how and when it loads, and how you can limit the impact on your cluster. -It may surprise you to find that Elasticsearch does not load into fielddata -just the values for the documents that match your query. It loads the values -for _all documents in your index_, even documents with a different `_type`! +Fielddata is loaded _lazily_. 
If you never aggregate on an analyzed string, you'll
+never load fielddata into memory. Furthermore, fielddata is loaded on a per-field basis,
+meaning only actively used fields will incur the "fielddata tax".
 
-The logic is: if you need access to documents X, Y, and Z for this query, you
-will probably need access to other documents in the next query. It is cheaper
-to load all values once, and to _keep them in memory_, than to have to scan
-the inverted index on every request.
+However, there is a subtle surprise lurking here. Suppose your query is highly selective and
+only returns 100 hits. Most people assume fielddata is only loaded for those
+100 documents.
+
+In reality, fielddata will be loaded for *all* documents in that index (for that
+particular field), regardless of the query's specificity. The logic is:
+if you need access to documents X, Y, and Z for this query, you
+will probably need access to other documents in the next query.
+
+Unlike doc values,
+the fielddata structure is not created at index time. Instead, it is populated
+on-the-fly when the query is run. This is a potentially non-trivial operation and
+can take some time. It is cheaper to load all the values once, and keep them in
+memory, than to load only a portion of the total fielddata repeatedly.
 
 The JVM heap ((("JVM (Java Virtual Machine)", "heap usage, fielddata and")))is a limited
 resource that should be used wisely. A number of mechanisms exist to limit the
 impact of fielddata on heap usage. These limits
@@ -29,18 +37,15 @@ the `$ES_HEAP_SIZE` environment variable:
 
 No more than 50% of available RAM::
 
 Lucene makes good use of the filesystem caches, which are managed by the
-kernel. Without enough filesystem cache space, performance will suffer.
+kernel. Without enough filesystem cache space, performance will suffer.
+Furthermore, the more memory dedicated to the heap, the less is left over
+for the filesystem cache and the doc values of all your other fields.
 
-No more than 32 GB:
+No more than 32 GB::
 If the heap is less than 32 GB, the JVM can use compressed pointers, which
 saves a lot of memory: 4 bytes per pointer instead of 8 bytes.
-+
-Increasing the heap from 32 GB to 34 GB would mean that you have much _less_
-memory available, because all pointers are taking double the space. Also,
-with bigger heaps, garbage collection becomes more costly and can result in
-node instability.
-This limit has a direct impact on the amount of memory that can be devoted to fielddata.
+For a longer and more complete discussion of heap sizing, see <>.
 
 ******************************************
 
@@ -48,10 +53,10 @@ This limit has a direct impact on the amount of memory that can be devoted to fi
 ==== Fielddata Size
 
 The `indices.fielddata.cache.size` controls how much heap space is allocated
-to fielddata.((("fielddata", "size")))((("aggregations", "limiting memory usage", "fielddata size"))) When you run a query that requires access to new field values,
-it will load the values into memory and then try to add them to fielddata. If
-the resulting fielddata size would exceed the specified `size`, other
-values would be evicted in order to make space.
+to fielddata.((("fielddata", "size")))((("aggregations", "limiting memory usage", "fielddata size")))
+As you issue queries, aggregations on analyzed strings will load values into fielddata
+if the field wasn't previously loaded. If the resulting fielddata size would
+exceed the specified `size`, other values will be evicted in order to make space.
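+
+Before settling on a limit, it helps to know how much heap your fielddata is already
+consuming, and which fields are responsible. A quick sketch using the stats APIs
+(the exact response format varies by version, but the `fields=*` parameter breaks
+the numbers down per field):
+
+[source,js]
+----
+GET /_stats/fielddata?fields=*  <1>
+GET /_nodes/stats/indices/fielddata?fields=*  <2>
+----
+<1> Per-index fielddata memory usage, broken down by field
+<2> The same information, summarized per node
+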
By default, this setting is _unbounded_—Elasticsearch will never evict data from fielddata. @@ -92,7 +97,7 @@ setting to the `config/elasticsearch.yml` file: [source,yaml] ----------------------------- -indices.fielddata.cache.size: 40% <1> +indices.fielddata.cache.size: 20% <1> ----------------------------- <1> Can be set to a percentage of the heap size, or a concrete value like `5gb` diff --git a/300_Aggregations/110_docvalues.asciidoc b/300_Aggregations/110_docvalues.asciidoc deleted file mode 100644 index 0ce141a6b..000000000 --- a/300_Aggregations/110_docvalues.asciidoc +++ /dev/null @@ -1,67 +0,0 @@ -[[doc-values]] -=== Doc Values - -In-memory fielddata is limited by the size of your heap.((("aggregations", "doc values"))) While this is a -problem that can be solved by scaling horizontally--you can always add more -nodes--you will find that heavy use of aggregations and sorting can exhaust -your heap space while other resources on the node are underutilized. - -While fielddata defaults to loading values into memory on the fly, this is not -the only option. It can also be written to disk at index time in a way that -provides all the functionality of in-memory fielddata, but without the -heap memory usage. This alternative format is ((("fielddata", "doc values")))((("doc values")))called _doc values_. - -Doc values were added to Elasticsearch in version 1.0.0 but, until recently, -they were much slower than in-memory fielddata. By benchmarking and profiling -performance, various bottlenecks have been identified--in both Elasticsearch -and Lucene--and removed. Starting in version 2.0, doc values became the default -format for almost all field types, with the notable exception of analyzed -string fields. - -Doc values are now only about 10–25% slower than in-memory fielddata, and -come with two major advantages: - - * They live on disk instead of in heap memory. This allows you to work with - quantities of fielddata that would normally be too large to fit into - memory. In fact, your heap space (`$ES_HEAP_SIZE`) can now be set to a - smaller size, which improves the speed of garbage collection and, - consequently, node stability. - - * Doc values are built at index time, not at search time. While in-memory - fielddata has to be built on the fly at search time by uninverting the - inverted index, doc values are prebuilt and much faster to initialize. - -The trade-off is a larger index size and slightly slower fielddata access. Doc -values are remarkably efficient, so for many queries you might not even notice -the slightly slower speed. Combine that with faster garbage collections and -improved initialization times and you may notice a net gain. - -The more filesystem cache space that you have available, the better doc values -will perform. If the files holding the doc values are resident in the filesystem cache, then accessing the files is almost equivalent to reading from -RAM. And the filesystem cache is managed by the kernel instead of the JVM. - -==== Enabling Doc Values - -Doc values are enabled by default for numeric, date, Boolean, binary, and geo-point -fields, and for `not_analyzed` string fields.((("doc values", "enabling"))) They do not currently work with -`analyzed` string fields. 
If you are sure that you don’t need to sort or aggregate -on a field, or access the field value from a script, you can disable doc values in -order to save disk space: - -[source,js] ----- -PUT /music/_mapping/song -{ - "properties" : { - "tag": { - "type": "string", - "index" : "not_analyzed", - "doc_values": false - } - } -} ----- - - - - diff --git a/300_Aggregations/115_eager.asciidoc b/300_Aggregations/115_eager.asciidoc index 5adacad68..9927833e9 100644 --- a/300_Aggregations/115_eager.asciidoc +++ b/300_Aggregations/115_eager.asciidoc @@ -39,8 +39,8 @@ are pre-loaded: ---- PUT /music/_mapping/_song { - "price_usd": { - "type": "integer", + "tags": { + "type": "string", "fielddata": { "loading" : "eager" <1> } diff --git a/300_Aggregations/125_Conclusion.asciidoc b/300_Aggregations/125_Conclusion.asciidoc index 3c20f0d2b..0eeb61d8d 100644 --- a/300_Aggregations/125_Conclusion.asciidoc +++ b/300_Aggregations/125_Conclusion.asciidoc @@ -2,36 +2,33 @@ == Closing Thoughts This section covered a lot of ground, and a lot of deeply technical issues. -Aggregations bring a power and flexibility to Elasticsearch that is hard to +Aggregations bring a power and flexibility to Elasticsearch that is hard to overstate. The ability to nest buckets and metrics, to quickly approximate -cardinality and percentiles, to find statistical anomalies in your data, all +cardinality and percentiles, to find statistical anomalies in your data, all while operating on near-real-time data and in parallel to full-text search--these are game-changers to many organizations. It is a feature that, once you start using it, you'll find dozens of other candidate uses. Real-time reporting and analytics is central to many organizations (be it over business intelligence or server logs). -But with great power comes great responsibility, and for Elasticsearch that often -means proper memory stewardship. Memory is often the limiting factor in -Elasticsearch deployments, particularly those that heavily utilize aggregations. -Because aggregation data is loaded to fielddata--and this is an in-memory data -structure--managing ((("aggregations", "managing efficient memory usage")))efficient memory usage is important. +Elasticsearch has made great strides in becoming more memory friendly by defaulting +to doc values for _most_ fields, but the necessity of fielddata for string fields +means you must remain vigilant. The management of this memory can take several forms, depending on your particular use-case: -- At a data level, by making sure you analyze (or `not_analyze`) your data appropriately -so that it is memory-friendly -- During indexing, by configuring heavy fields to use disk-based doc values instead -of in-memory fielddata +- During the planning stage, attempt to organize your data so that aggregations are +run on `not_analyzed` strings instead of analyzed so that doc values may be leveraged. +- While testing, verify that analysis chains are not creating high cardinality +fields which are later aggregated on - At search time, by utilizing approximate aggregations and data filtering - At a node level, by setting hard memory and dynamic circuit-breaker limits -- At an operations level, by monitoring memory usage and controlling slow garbage-collection cycles, potentially by adding more nodes to the cluster +- At an operations level, by monitoring memory usage and controlling slow garbage-collection cycles, +potentially by adding more nodes to the cluster Most deployments will use one or more of the preceding methods. 
The exact combination
-is highly dependent on your particular environment. Some organizations need
-blisteringly fast responses and opt to simply add more nodes. Other organizations
-are limited by budget and choose doc values and approximate aggregations.
+is highly dependent on your particular environment.
 
 Whatever the path you take, it is important to assess the available options and
 create both a short- and long-term plan. Decide how your memory situation exists
diff --git a/300_Aggregations/90_fielddata.asciidoc b/300_Aggregations/90_docvalues.asciidoc
similarity index 55%
rename from 300_Aggregations/90_fielddata.asciidoc
rename to 300_Aggregations/90_docvalues.asciidoc
index 1af3f9afa..b1bcf47e2 100644
--- a/300_Aggregations/90_fielddata.asciidoc
+++ b/300_Aggregations/90_docvalues.asciidoc
@@ -1,23 +1,13 @@
-[[fielddata]]
-=== Fielddata
+[[docvalues]]
+=== Doc Values
 
-Aggregations work via a data structure known as _fielddata_ (briefly introduced
-in <>). ((("fielddata")))((("memory usage", "fielddata")))Fielddata is often the largest consumer of memory
-in an Elasticsearch cluster, so it is important to understand how it works.
+Aggregations work via a data structure known as _doc values_ (briefly introduced
+in <>). ((("docvalues")))Doc values
+are what make aggregations fast, efficient and memory-friendly, so it is useful
+to understand how they work.
 
-[TIP]
-==================================================
-
-Fielddata can be loaded on the fly into memory, or built at index time and
-stored on disk.((("fielddata", "loaded into memory vs. on disk"))) Later, we will talk about on-disk fielddata in
-<>. For now we will focus on in-memory fielddata, as it is
-currently the default mode of operation in Elasticsearch. This may well change
-in a future version.
-
-==================================================
-
-Fielddata exists because inverted indices are efficient only for certain operations.
-The inverted index excels((("inverted index", "fielddata versus"))) at finding documents that contain a term. It does not
+Doc values exist because inverted indices are efficient for only certain operations.
+The inverted index excels((("inverted index", "doc values versus"))) at finding documents that contain a term. It does not
 perform well in the opposite direction: determining which terms exist in a
 single document. Aggregations need this secondary access pattern.
 
@@ -68,14 +58,14 @@ columns to see which documents contain +brown+. We can very quickly see that
 `Doc_1` and `Doc_2` contain the token +brown+.
 
 Then, for the aggregation portion, we need to find all the unique terms in
-`Doc_1` and `Doc_2`.((("aggregations", "fielddata", "using instead of inverted index"))) Trying to do this with the inverted index would be a
+`Doc_1` and `Doc_2`.((("aggregations", "doc values", "using instead of inverted index"))) Trying to do this with the inverted index would be a
 very expensive process: we would have to iterate over every term in the index
 and collect tokens from `Doc_1` and `Doc_2` columns. This would be slow and scale
 poorly: as the number of terms and documents grows, so would the execution time.
 
-Fielddata addresses this problem by inverting the relationship. While the
-inverted index maps terms to the documents containing the term, fielddata
+Doc values address this problem by inverting the relationship.
While the +inverted index maps terms to the documents containing the term, doc values maps documents to the terms contained by the document: Doc Terms @@ -89,30 +79,16 @@ Once the data has been uninverted, it is trivial to collect the unique tokens fr `Doc_1` and `Doc_2`. Go to the rows for each document, collect all the terms, and take the union of the two sets. - -[TIP] -================================================== - -The fielddata cache is per segment.((("fielddata cache")))((("segments", "fielddata cache"))) In other words, when a new segment becomes -visible to search, the fielddata cached from old segments remains valid. Only -the data for the new segment needs to be loaded into memory. - -================================================== - Thus, search and aggregations are closely intertwined. Search finds documents by using the inverted index. Aggregations collect and aggregate values from -fielddata, which is itself generated from the inverted index. - -The rest of this chapter covers various functionality that either -decreases fielddata's memory footprint or increases execution speed. +doc values. [NOTE] ================================================== -Fielddata is not just used for aggregations.((("fielddata", "uses other than aggregations"))) It is required for any -operation that needs to look up the value contained in a specific document. +Doc values are not just used for aggregations.((("doc values", "uses other than aggregations"))) They are required for any +operation that must look up the value contained in a specific document. Besides aggregations, this includes sorting, scripts that access field -values, parent-child relationships (see <>), and certain types -of queries or filters, such as the <> filter. +values and parent-child relationships (see <>). ================================================== diff --git a/300_Aggregations/93_technical_docvalues.asciidoc b/300_Aggregations/93_technical_docvalues.asciidoc new file mode 100644 index 000000000..afb04c9d1 --- /dev/null +++ b/300_Aggregations/93_technical_docvalues.asciidoc @@ -0,0 +1,175 @@ + +=== Deep Dive on Doc Values + +The last section opened by saying doc values are _"fast, efficient and memory-friendly"_. +Those are some nice marketing buzzwords, but how do doc values actually work? + +Doc values are generated at index-time, alongside the creation of the inverted index. +That means doc values are generated on a per-segment basis and are immutable, just like +the inverted index used for search. And, like the inverted index, doc values are serialized +to disk. This is important to performance and scalability. + +By serializing a persistent data structure to disk, we can rely on the OS's file +system cache to manage memory instead of retaining structures on the JVM heap. +In situations where the "working set" of data is smaller than the available +memory, the OS will naturally keep the doc values resident in memory. This gives +the same performance profile as on-heap data structures. + +But when your working set is much larger than available memory, the OS will begin +paging the doc values on/off disk as required. This will obviously be slower +than an entirely memory-resident data structure, but it has the advantage of scaling +well beyond the server's memory capacity. If these data structures were +purely on-heap, the only option is to crash with an OutOfMemory exception (or implement +a paging scheme just like the OS). 
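+
+One way to watch this balance on a live node is the node-stats API: the `jvm`
+section reports heap usage, while the `os` section shows how much memory remains
+for the filesystem cache to keep doc values hot. This is only a sketch; the exact
+fields in the response vary between versions:
+
+[source,js]
+----
+GET /_nodes/stats/jvm,os
+----
+
+If heap usage stays modest while plenty of memory is left for the OS cache, the
+doc values working set is fitting comfortably in memory.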
+ +[NOTE] +==== +Because doc values are not managed by the JVM, Elasticsearch servers can be +configured with a much smaller heap. This gives more memory to the OS for caching. +It also has the benefit of letting the JVM's garbage collector work with a smaller +heap, which will result in faster and more efficient collection cycles. + +Traditionally, the recommendation has been to dedicate 50% of the machine's memory +to the JVM heap. With the introduction of doc values, this recommendation is starting +to slide. Consider giving far less to the heap, perhaps 4-16gb on a 64gb machine, +instead of the full 32gb previously recommended. + +For a more detailed discussion, see <>. +==== + + +==== Column-store compression + +At a high level, doc values are essentially a serialized _column-store_. As we +discussed in the last section, column-stores excel at certain operations because +the data is naturally laid out in a fashion that is amenable to those queries. + +But they also excel at compressing data, particularly numbers. This is important for both saving space +on disk _and_ for faster access. Modern CPU's are many orders of magnitude faster +than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means +it is often advantageous to minimize the amount of data that must be read from disk, +even if it requires extra CPU cycles to decompress. + +To see how it can help compression, take this set of doc values for a numeric field: + + Doc Terms + ----------------------------------------------------------------- + Doc_1 | 100 + Doc_2 | 1000 + Doc_3 | 1500 + Doc_4 | 1200 + Doc_5 | 300 + Doc_6 | 1900 + Doc_7 | 4200 + ----------------------------------------------------------------- + +The column-stride layout means we have a contiguous block of numbers: +`[100,1000,1500,1200,300,1900,4200]`. Because we know they are all numbers +(instead of a heterogeneous collection like you'd see in a document or row) +values can be packed tightly together with uniform offsets. + +Further, there are a variety of compression tricks we can apply to these numbers. +You'll notice that each of the above numbers are a multiple of 100. Doc values +detect when all the values in a segment share a _greatest common divisor_ and use +that to compress the values further. + +If we save `100` as the divisor for this segment, we can divide each number by 100 +to get: `[1,10,15,12,3,19,42]`. Now that the numbers are smaller, they require +fewer bits to store and we've reduced the size on-disk. + +Doc values use several tricks like this. In order, the following compression +schemes are checked: + +1. If all values are identical (or missing), set a flag and record the value +2. If there are fewer than 256 values, a simple table encoding is used +3. If there are > 256 values, check to see if there is a common divisor +4. If there is no common divisor, encode everything as an offset from the smallest +value + +You'll note that these compression schemes are not "traditional" general purpose +compression like DEFLATE or LZ4. Because the structure of column-stores are +rigid and well-defined, we can achieve higher compression by using specialized +schemes rather than the more general compression algorithms like LZ4. + +[NOTE] +==== +You may be thinking _"Well that's great for numbers, but what about strings?"_ +Strings are encoded similarly, with the help of an ordinal table. The +strings are de-duplicated and sorted into a table, assigned an ID, and then those +ID's are used as numeric doc values. 
Which means strings enjoy many of the same
+compression benefits that numerics do.
+
+The ordinal table itself has some compression tricks, such as using fixed, variable,
+or prefix-encoded strings.
+====
+
+==== Disabling Doc Values
+
+Doc values are enabled by default for all fields _except_ analyzed strings. That means
+all numerics, geo_points, dates, IPs, and `not_analyzed` strings.
+
+Analyzed strings are not able to use doc values at this time; the analysis process
+generates many tokens and does not work efficiently with doc values. We'll discuss
+using analyzed strings for aggregations in <>.
+
+Because doc values are on by default, you have the option to aggregate and sort
+on most fields in your dataset. But what if you know you will _never_ aggregate,
+sort, or script on a certain field?
+
+While rare, these circumstances do arise and you may wish to disable doc values
+on that particular field. This will save you some disk space (since the doc values
+are not being serialized to disk anymore) and may increase indexing speed slightly
+(since the doc values don't need to be generated).
+
+To disable doc values, set `doc_values: false` in the field's mapping. For example,
+here we create a new index where doc values are disabled for the `"session_id"` field:
+
+[source,js]
+----
+PUT my_index
+{
+  "mappings": {
+    "my_type": {
+      "properties": {
+        "session_id": {
+          "type": "string",
+          "index": "not_analyzed",
+          "doc_values": false <1>
+        }
+      }
+    }
+  }
+}
+----
+<1> By setting `doc_values: false`, this field will not be usable in aggregations, sorts,
+or scripts.
+
+It is possible to configure the inverse relationship too: make a field available
+for aggregations via doc values, but make it unavailable for normal search by disabling
+the inverted index. For example:
+
+[source,js]
+----
+PUT my_index
+{
+  "mappings": {
+    "my_type": {
+      "properties": {
+        "customer_token": {
+          "type": "string",
+          "doc_values": true, <1>
+          "index": "no" <2>
+        }
+      }
+    }
+  }
+}
+----
+<1> Doc values are enabled to allow aggregations
+<2> Indexing is disabled, which makes the field unavailable to queries/searches
+
+By setting `doc_values: true` and `index: no`, we generate a field which can _only_
+be used in aggregations/sorts/scripts. This is admittedly a very rare requirement,
+but sometimes useful.
diff --git a/300_Aggregations/95_analyzed_vs_not.asciidoc b/300_Aggregations/95_analyzed_vs_not.asciidoc
index ac72d2e68..088a15d5b 100644
--- a/300_Aggregations/95_analyzed_vs_not.asciidoc
+++ b/300_Aggregations/95_analyzed_vs_not.asciidoc
@@ -6,8 +6,11 @@ Some aggregations, such as the `terms` bucket, operate((("analysis", "aggregatio
 string fields may be either `analyzed` or `not_analyzed`, which begs the question:
 how does analysis affect aggregations?((("strings", "analyzed or not_analyzed string fields")))((("not_analyzed fields")))((("analyzed fields")))
 
-The answer is "a lot," but it is best shown through an example. First, index
-some documents representing various states in the US:
+The answer is "a lot," for two reasons: analysis affects the tokens used in the aggregation,
+and doc values _do not work_ with analyzed strings.
+
+Let's tackle the first problem: how the generation of analyzed tokens affects
+aggregations. First, let's index some documents representing various states in the US:
 
 [source,js]
 ----
@@ -79,7 +82,7 @@ are built from the inverted index, and the inverted index is _post-analysis_.
 When we added those documents to Elasticsearch, the string `"New York"` was
 analyzed/tokenized into `["new", "york"]`. These individual tokens were then
-used to populate fielddata, and ultimately we see counts for `new` instead of
+used to populate aggregation counts, and ultimately we see counts for `new` instead of
 `New York`.
 
 This is obviously not the behavior that we wanted, but luckily it is easily
@@ -172,6 +175,33 @@ It is a generalization, but there are not many instances where you want to use
 an analyzed field in an aggregation. When in doubt, add a multifield so you
 have the option for both.((("analyzed fields", "aggregations and")))
 
+==== Analyzed Strings and Fielddata
+
+While the first problem relates to how data is aggregated and displayed to your
+user, the second problem is largely technical and behind the scenes.
+
+Doc values do not support `analyzed` string fields because they are not very efficient
+at representing multi-valued strings. Doc values are most efficient
+when each document has one or several tokens, but not thousands as in the case
+of large, analyzed strings (imagine a PDF body, which may be several megabytes
+and have thousands of unique tokens).
+
+For that reason, doc values are not generated for `analyzed` strings. Yet these fields
+can still be used in aggregations. How is that possible?
+
+The answer is a data structure known as _fielddata_. Unlike doc values, fielddata
+is built and managed 100% in memory, living inside the JVM heap. That means
+it is inherently less scalable and has a lot of edge-cases to watch out for.
+The rest of this chapter addresses the challenges of fielddata in the context
+of `analyzed` strings.
+
+NOTE: Historically, fielddata was the default for _all_ fields, but Elasticsearch
+has been migrating towards doc values to reduce the chance of OutOfMemory exceptions.
+Analyzed strings are the last holdout where fielddata is still used. The goal is to
+eventually build a serialized data structure similar to doc values which can handle
+high-cardinality analyzed strings, obsoleting fielddata once and for all.
+
 
 ==== High-Cardinality Memory Implications
 
 There is another reason to avoid aggregating analyzed fields: high-cardinality
@@ -196,18 +226,9 @@ You can imagine how the n-gramming process creates a huge number of unique token
 especially when analyzing paragraphs of text. When these are loaded into memory,
 you can easily exhaust your heap space.
 
-So, before aggregating across fields, take a second to verify that the fields are
-`not_analyzed`. And if you want to aggregate analyzed fields, ensure that the analysis
-process is not creating an obscene number of tokens.
-
-[TIP]
-==================================================
-
-At the end of the day, it doesn't matter whether a field is `analyzed` or
-`not_analyzed`. The more unique values in a field--the higher the
-cardinality of the field--the more memory that is required. This is
-especially true for string fields, where every unique string must be held in
-memory--longer strings use more memory.
-
-==================================================
+So, before aggregating string fields, assess the situation:
+
+- Is it a `not_analyzed` field? If yes, the field will use doc values and be memory-friendly.
+- Otherwise, this is an `analyzed` field. It will use fielddata and live in-memory.
+Does this field have a very large cardinality caused by ngrams, shingles, etc.? If yes,
+it may be very memory unfriendly; the example below shows a quick way to check.
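+
+A quick way to check that last point is a `cardinality` aggregation on the candidate
+field. This is only a sketch (`my_index` and `body` are placeholder names), and note
+that running it will itself load fielddata for that field, so try it on a test index
+or a small sample of your data first:
+
+[source,js]
+----
+GET /my_index/_search
+{
+  "size": 0,
+  "aggs": {
+    "unique_tokens": {
+      "cardinality": { "field": "body" }
+    }
+  }
+}
+----
+
+If the approximate count comes back in the millions for a field you expected to hold
+a few thousand distinct values, the analysis chain is inflating cardinality, and the
+fielddata for that field will be correspondingly expensive.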
diff --git a/306_Practical_Considerations.asciidoc b/306_Practical_Considerations.asciidoc index e3f2b90ce..905e9e3fc 100644 --- a/306_Practical_Considerations.asciidoc +++ b/306_Practical_Considerations.asciidoc @@ -1,7 +1,9 @@ -[[controlling-memory]] -== Controlling Memory Use and Latency +[[docvalues-and-fielddata]] +== Doc Values and Fielddata -include::300_Aggregations/90_fielddata.asciidoc[] +include::300_Aggregations/90_docvalues.asciidoc[] + +include::300_Aggregations/93_technical_docvalues.asciidoc[] include::300_Aggregations/95_analyzed_vs_not.asciidoc[] @@ -9,8 +11,6 @@ include::300_Aggregations/100_circuit_breaker_fd_settings.asciidoc[] include::300_Aggregations/105_filtering.asciidoc[] -include::300_Aggregations/110_docvalues.asciidoc[] - include::300_Aggregations/115_eager.asciidoc[] include::300_Aggregations/120_breadth_vs_depth.asciidoc[] diff --git a/310_Geopoints.asciidoc b/310_Geopoints.asciidoc index 40a710cea..a68f6ae91 100644 --- a/310_Geopoints.asciidoc +++ b/310_Geopoints.asciidoc @@ -6,6 +6,4 @@ include::310_Geopoints/32_Bounding_box.asciidoc[] include::310_Geopoints/34_Geo_distance.asciidoc[] -include::310_Geopoints/38_Reducing_memory.asciidoc[] - include::310_Geopoints/50_Sorting_by_distance.asciidoc[] diff --git a/310_Geopoints/30_Filter_by_geopoint.asciidoc b/310_Geopoints/30_Filter_by_geopoint.asciidoc index a94725b16..d50fbb74e 100644 --- a/310_Geopoints/30_Filter_by_geopoint.asciidoc +++ b/310_Geopoints/30_Filter_by_geopoint.asciidoc @@ -23,15 +23,15 @@ geolocation: very expensive_. If you find yourself wanting to use it, you should be looking at <> instead. -All of these filters work in a similar way: the `lat/lon` values are loaded -into memory for _all documents in the index_, not just the documents that -match the query (see <>).((("aggregations", "fielddata", "filtering"))) Each filter performs a slightly -different calculation to check whether a point falls into the containing area. +Each filter performs a slightly different calculation to check whether a point +falls into the containing area, but the process is similar. The requested area +is converted into a range of quad/geohash prefix tokens and used to search the +inverted index for documents who share the same tokens. [TIP] ============================ -Geo-filters are expensive -- they should be used on as few documents as +Geo-filters are relatively expensive -- they should be used on as few documents as possible. First remove as many documents as you can with cheaper filters, like `term` or `range` filters, and apply the geo-filters last. diff --git a/310_Geopoints/38_Reducing_memory.asciidoc b/310_Geopoints/38_Reducing_memory.asciidoc deleted file mode 100644 index ddff47040..000000000 --- a/310_Geopoints/38_Reducing_memory.asciidoc +++ /dev/null @@ -1,61 +0,0 @@ -[[geo-memory]] -=== Reducing Memory Usage - -Each `lat/lon` pair requires 16 bytes of memory, memory that is in short -supply.((("latitude/longitude pairs", "reducing memory usage by lat/lon pairs")))((("memory usage", "reducing for geo-points")))((("geo-points", "reducing memory usage"))) It needs this much memory in order to provide very accurate results. -But as we have commented before, such exacting precision is seldom required. - -You can reduce the amount of memory that is used by switching to a -`compressed` fielddata format and by((("fielddata", "compressed, using for geo-points"))) specifying how precise you need your geo-points to be. Even reducing precision to `1mm` reduces memory usage by a -third. 
A more realistic setting of `3m` reduces usage by 62%, and `1km` saves -a massive 75%! - -This setting can be changed on a live index with the `update-mapping` API: - -[source,json] ----------------------------- -POST /attractions/_mapping/restaurant -{ - "location": { - "type": "geo_point", - "fielddata": { - "format": "compressed", - "precision": "1km" <1> - } - } -} ----------------------------- -<1> Each `lat/lon` pair will require only 4 bytes, instead of 16. - -Alternatively, you can avoid using memory for geo-points altogether, either by -using the technique described in <>, or by storing -geo-points ((("doc values", "storing geo-points as")))as <>: - -[source,json] ----------------------------- -PUT /attractions -{ - "mappings": { - "restaurant": { - "properties": { - "name": { - "type": "string" - }, - "location": { - "type": "geo_point", - "doc_values": true <1> - } - } - } - } -} ----------------------------- -<1> Geo-points will not be loaded into memory, but instead stored on disk. - -Mapping a geo-point to use doc values can be done only when the field is first -created. There is a small performance cost in using doc values instead of -fielddata, but with memory in such short supply, it is often worth doing. - - - - diff --git a/404_Parent_Child/40_Parent_child.asciidoc b/404_Parent_Child/40_Parent_child.asciidoc index b772bdb7a..15b64a23e 100644 --- a/404_Parent_Child/40_Parent_child.asciidoc +++ b/404_Parent_Child/40_Parent_child.asciidoc @@ -24,14 +24,9 @@ which children. It is thanks to this map that query-time joins are fast, but it does place a limitation on the parent-child relationship: _the parent document and all of its children must live on the same shard_. -[NOTE] -================================================== - -At the time of going to press, the parent-child ID map is held in memory as -part of <>. There are plans afoot to change the default -setting to use <> by default instead. - -================================================== +The parent-child ID maps are stored in <>, which allows them to execute +quickly when fully hot in memory, but scalable enough to spill to disk when +the map is very large. [[parent-child-mapping]] @@ -70,5 +65,3 @@ PUT /company } ------------------------- <1> Documents of type `employee` are children of type `branch`. - - diff --git a/500_Cluster_Admin/30_node_stats.asciidoc b/500_Cluster_Admin/30_node_stats.asciidoc index b78deb94b..83849c793 100644 --- a/500_Cluster_Admin/30_node_stats.asciidoc +++ b/500_Cluster_Admin/30_node_stats.asciidoc @@ -276,7 +276,7 @@ quickly. But the old-gen is quite a bit larger, and a slow GC here could mean The garbage collectors in the JVM are _very_ sophisticated algorithms and do a great job minimizing pauses. And Elasticsearch tries very hard to be _garbage-collection friendly_, by intelligently reusing objects internally, reusing network -buffers, and offering features like <>. But ultimately, +buffers, and enabling <> by default. But ultimately, GC frequency and duration is a metric that needs to be watched by you, since it is the number one culprit for cluster instability. @@ -545,7 +545,3 @@ The main thing to watch is the `tripped` metric. 
If this number is large or consistently increasing, it's a sign that your queries may need to be optimized or that you may need to obtain more memory (either per box or by adding more nodes).((("Node Stats API", range="endofrange", startref="ix_NodeStats"))) - - - - diff --git a/510_Deployment/50_heap.asciidoc b/510_Deployment/50_heap.asciidoc index 2d45a8830..ca6a9b2ff 100644 --- a/510_Deployment/50_heap.asciidoc +++ b/510_Deployment/50_heap.asciidoc @@ -2,7 +2,7 @@ === Heap: Sizing and Swapping The default installation of Elasticsearch is configured with a 1 GB heap. ((("deployment", "heap, sizing and swapping")))((("heap", "sizing and setting"))) For -just about every deployment, this number is far too small. If you are using the +just about every deployment, this number is usually too small. If you are using the default heap values, your cluster is probably configured incorrectly. There are two ways to change the heap size in Elasticsearch. The easiest is to @@ -28,7 +28,7 @@ the heap from resizing at runtime, a very costly process. Generally, setting the `ES_HEAP_SIZE` environment variable is preferred over setting explicit `-Xmx` and `-Xms` values. -==== Give Half Your Memory to Lucene +==== Give (less than) Half Your Memory to Lucene A common problem is configuring a heap that is _too_ large. ((("heap", "sizing and setting", "giving half your memory to Lucene"))) You have a 64 GB machine--and by golly, you want to give Elasticsearch all 64 GB of memory. More @@ -41,16 +41,22 @@ user of memory that is _off heap_: Lucene. Lucene is designed to leverage the underlying OS for caching in-memory data structures.((("Lucene", "memory for"))) Lucene segments are stored in individual files. Because segments are immutable, these files never change. This makes them very cache friendly, and the underlying -OS will happily keep hot segments resident in memory for faster access. +OS will happily keep hot segments resident in memory for faster access. These segments +include both the inverted index (for fulltext search) and doc values (for aggregations). Lucene's performance relies on this interaction with the OS. But if you give all available memory to Elasticsearch's heap, there won't be any left over for Lucene. -This can seriously impact the performance of full-text search. +This can seriously impact the performance. The standard recommendation is to give 50% of the available memory to Elasticsearch heap, while leaving the other 50% free. It won't go unused; Lucene will happily gobble up whatever is left over. +If you are not aggregating on analyzed string fields (e.g. you won't be needing +<>) you can consider lowering the heap even +more. The smaller you can make the heap, the better performance you can expect +from both Elasticsearch (faster GCs) and Lucene (more memory for caching). + [[compressed_oops]] ==== Don't Cross 32 GB! There is another reason to not allocate enormous heaps to Elasticsearch. As it turns((("heap", "sizing and setting", "32gb heap boundary")))((("32gb Heap boundary"))) @@ -143,12 +149,18 @@ First, we would recommend avoiding such large machines (see <>). But if you already have the machines, you have two practical options: -- Are you doing mostly full-text search? Consider giving just under 32 GB to Elasticsearch +- Are you doing mostly full-text search? Consider giving 4-32 GB to Elasticsearch and letting Lucene use the rest of memory via the OS filesystem cache. All that memory will cache segments and lead to blisteringly fast full-text search. 
-- Are you doing a lot of sorting/aggregations? You'll likely want that memory -in the heap then. Instead of one node with more than 32 GB of RAM, consider running two or +- Are you doing a lot of sorting/aggregations? Are most of your aggregations on numerics, +dates, geo_points and `not_analyzed` strings? You're in luck! Give Elasticsearch +somewhere from 4-32 GB of memory and leave the rest for the OS to cache doc values +in memory. + +- Are you doing a lot of sorting/aggregations on analyzed strings (e.g. for word-tags, +or SigTerms, etc)? Unfortunately that means you'll need fielddata, which means you +need heap space. Instead of one node with more than 512 GB of RAM, consider running two or more nodes on a single machine. Still adhere to the 50% rule, though. So if your machine has 128 GB of RAM, run two nodes, each with just under 32 GB. This means that less than 64 GB will be used for heaps, and more than 64 GB will be left over for Lucene. diff --git a/Preface.asciidoc b/Preface.asciidoc index 8b68d5d0e..bfb9bdd24 100644 --- a/Preface.asciidoc +++ b/Preface.asciidoc @@ -40,9 +40,9 @@ principles, helping novices to gain a sure footing in the complex world of search. The reader with a search background will also benefit from this book. -The more experienced user will gain an understanding of how familiar search -concepts have been implemented and how they interact in the context of -Elasticsearch. Even the early chapters contain nuggets of information that +The more experienced user will gain an understanding of how familiar search +concepts have been implemented and how they interact in the context of +Elasticsearch. Even the early chapters contain nuggets of information that will be useful to the more advanced user. Finally, maybe you are in DevOps. While the other departments are stuffing @@ -77,9 +77,9 @@ this book to explain _why_ and _when_ to use various features. === Elasticsearch Version -The initial print version of this book targeted Elasticsearch version 1.4.0. We +The initial print version of this book targeted Elasticsearch version 1.4.0. We are actively updating the explanations and code examples in the https://www.elastic.co/guide/en/elasticsearch/guide/current/[online version] -to target Elasticsearch 2.x. +to target Elasticsearch 2.x. You can track the updates by visiting the https://github.com/elastic/elasticsearch-definitive-guide/[GitHub repository]. @@ -179,7 +179,7 @@ ifdef::es_build[] of language, alphabets, and sorting. We cover stemming, stopwords, synonyms, and fuzzy matching. -* Chapters <> through <> +* Chapters <> through <> discuss aggregations and analytics--ways to summarize and group your data to show overall trends. * Chapters <> through <> @@ -201,7 +201,7 @@ endif::es_build[] === Online Resources -Because this book focuses on problem solving in Elasticsearch rather than syntax, we sometimes refer to the detailed descriptions in the https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html[Elasticsearch Reference]. You +Because this book focuses on problem solving in Elasticsearch rather than syntax, we sometimes refer to the detailed descriptions in the https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html[Elasticsearch Reference]. You can find the latest Elasticsearch Reference and related documentation at: https://www.elastic.co/guide/