
Start cleaning up fielddata/doc values language
polyfractal committed Apr 15, 2016
1 parent 0d8cdc2 commit 1333bb4
Showing 19 changed files with 358 additions and 325 deletions.
3 changes: 1 addition & 2 deletions 056_Sorting.asciidoc
@@ -4,5 +4,4 @@ include::056_Sorting/88_String_sorting.asciidoc[]

include::056_Sorting/90_What_is_relevance.asciidoc[]

include::056_Sorting/95_Fielddata.asciidoc[]

include::056_Sorting/95_Docvalues.asciidoc[]
3 changes: 1 addition & 2 deletions 056_Sorting/88_String_sorting.asciidoc
@@ -71,5 +71,4 @@ GET /_search
// SENSE: 056_Sorting/88_Multifield.json

WARNING: Sorting on a full-text `analyzed` field can use a lot of memory. See
<<fielddata-intro>> for more information.

<<aggregations-and-analysis>> for more information.
44 changes: 44 additions & 0 deletions 056_Sorting/95_Docvalues.asciidoc
@@ -0,0 +1,44 @@
[[docvalues-intro]]
=== Doc Values Intro

Our final topic in this chapter is about an internal aspect of Elasticsearch.
While we don't demonstrate any new techniques here, doc values are an
important topic that we will refer to repeatedly, and something that you
should be aware of.((("docvalues")))

When you sort on a field, Elasticsearch needs access to the value of that
field for every document that matches the query.((("inverted index", "sorting and"))) The inverted index, which
performs very well when searching, is not the ideal structure for sorting on
field values:

* When searching, we need to be able to map a term to a list of documents.

* When sorting, we need to map a document to its terms. In other words, we
need to ``uninvert'' the inverted index.

This ``uninverted'' structure is often called a ``column-store'' in other systems.
Essentially, it stores all the values for a single field together in a single
column of data, which makes it very efficient for operations like sorting.
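
To make ``uninverting'' concrete, here is a rough, hypothetical sketch (the terms and
documents are invented for illustration) of the two orientations side by side:

----
Inverted index (search)            Doc values / column-store (sorting)

Term      ->  Documents            Document  ->  Terms
"brown"   ->  Doc_1, Doc_2         Doc_1     ->  [ "brown", "fox" ]
"fox"     ->  Doc_1                Doc_2     ->  [ "brown", "quick" ]
"quick"   ->  Doc_2
----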

In Elasticsearch, this column-store is known as _doc values_, and is enabled
by default. Doc values are created at index-time: when a field is indexed, Elasticsearch
adds the tokens to the inverted index for search. But it also extracts the terms
and adds them to the columnar doc values.
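
For instance, in a mapping like the following (a minimal, hypothetical sketch; the
index, type, and field names are not from this book's examples), the `not_analyzed`
string field gets doc values built for it automatically at index time:

[source,js]
----
PUT /my_store
{
  "mappings": {
    "products": {
      "properties": {
        "status": {
          "type": "string",
          "index": "not_analyzed" <1>
        }
      }
    }
  }
}
----
<1> Nothing extra is required: doc values are enabled by default for `not_analyzed`
strings and for numeric and date fields. Analyzed string fields are the exception;
they cannot use doc values and instead rely on in-memory fielddata.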

Doc values are used in several places in Elasticsearch:

* Sorting on a field
* Aggregations on a field
* Certain filters (for example, geolocation filters)
* Scripts that refer to fields
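
For example, a single search request might sort and aggregate on the same field, and
both operations would read the same columnar structure. (This is a hypothetical
request; `my_store`, `description`, and `price` are invented names.)

[source,js]
----
GET /my_store/_search
{
  "query": { "match": { "description": "red shoes" } },
  "sort":  { "price": { "order": "asc" } }, <1>
  "aggs": {
    "avg_price": { "avg": { "field": "price" } } <2>
  }
}
----
<1> Sorting on `price` reads the doc values for that field.
<2> The `avg` metric reads the same per-field column of values.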

Because doc values are serialized to disk, we can leverage the OS to help keep
access fast. When the "working set" is smaller than the available memory on a node,
the OS will naturally keep all the doc values hot in memory, leading to very fast
access. When the "working set" is much larger than available memory, the OS will
naturally start to page doc values on and off disk without running into the dreaded
OutOfMemory exception.

We'll talk about doc values in much greater depth later. For now, all you need
to know is that sorting (and some other operations) happens on a parallel data
structure that is built at index time.
54 changes: 0 additions & 54 deletions 056_Sorting/95_Fielddata.asciidoc

This file was deleted.

55 changes: 30 additions & 25 deletions 300_Aggregations/100_circuit_breaker_fd_settings.asciidoc
@@ -1,20 +1,28 @@

=== Limiting Memory Usage

In order for aggregations (or any operation that requires access to field
values) to be fast, ((("aggregations", "limiting memory usage")))access to fielddata must be fast, which is why it is
loaded into memory. ((("fielddata")))((("memory usage", "limiting for aggregations", id="ix_memagg"))) But loading too much data into memory will cause slow
garbage collections as the JVM tries to find extra space in the heap, or
possibly even an OutOfMemory exception.
Once analyzed strings have been loaded into fielddata, they will sit there until
evicted (or your node crashes). For that reason it is important to keep an eye on this
memory usage, to understand how and when fielddata loads, and to know how you can
limit its impact on your cluster.

It may surprise you to find that Elasticsearch does not load into fielddata
just the values for the documents that match your query. It loads the values
for _all documents in your index_, even documents with a different `_type`!
Fielddata is loaded _lazily_. If you never aggregate on an analyzed string, you'll
never load fielddata into memory. Furthermore, fielddata is loaded on a per-field basis,
meaning only actively used fields will incur the "fielddata tax".

The logic is: if you need access to documents X, Y, and Z for this query, you
will probably need access to other documents in the next query. It is cheaper
to load all values once, and to _keep them in memory_, than to have to scan
the inverted index on every request.
However, there is a subtle surprise lurking here. Suppose your query is highly selective and
only returns 100 hits. Most people assume fielddata is only loaded for those
100 documents.

In reality, fielddata will be loaded for *all* documents in that index (for that
particular field), regardless of the query's specificity. The logic is:
if you need access to documents X, Y, and Z for this query, you
will probably need access to other documents in the next query.

Unlike doc values,
the fielddata structure is not created at index time. Instead, it is populated
on-the-fly when the query is run. This is a potentially non-trivial operation and
can take some time. It is cheaper to load all the values once, and keep them in
memory, than load only a portion of the total fielddata repeatedly.
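
To make that concrete, consider a hypothetical request (the index and field names are
invented) whose `terms` aggregation targets an analyzed string field. The first time it
runs, Elasticsearch must uninvert the field and load the result into fielddata before
the aggregation can be calculated:

[source,js]
----
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "popular_terms": {
      "terms": { "field": "body" } <1>
    }
  }
}
----
<1> Because `body` is an analyzed string, this aggregation is served by in-memory
fielddata rather than doc values, and the first run pays the cost of building it.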

The JVM heap ((("JVM (Java Virtual Machine)", "heap usage, fielddata and")))is a limited resource that should be used wisely. A number of
mechanisms exist to limit the impact of fielddata on heap usage. These limits
@@ -29,29 +37,26 @@ the `$ES_HEAP_SIZE` environment variable:
No more than 50% of available RAM::
Lucene makes good use of the filesystem caches, which are managed by the
kernel. Without enough filesystem cache space, performance will suffer.
Furthermore, the more memory dedicated to the heap, the less that is available
for all your other fields using doc values.
No more than 32 GB:
No more than 32 GB::
If the heap is less than 32 GB, the JVM can use compressed pointers, which
saves a lot of memory: 4 bytes per pointer instead of 8 bytes.
+
Increasing the heap from 32 GB to 34 GB would mean that you have much _less_
memory available, because all pointers are taking double the space. Also,
with bigger heaps, garbage collection becomes more costly and can result in
node instability.
This limit has a direct impact on the amount of memory that can be devoted to fielddata.
For a longer and more complete discussion of heap sizing, see <<heap-sizing>>
******************************************

[[fielddata-size]]
==== Fielddata Size

The `indices.fielddata.cache.size` setting controls how much heap space is allocated
to fielddata.((("fielddata", "size")))((("aggregations", "limiting memory usage", "fielddata size"))) When you run a query that requires access to new field values,
it will load the values into memory and then try to add them to fielddata. If
the resulting fielddata size would exceed the specified `size`, other
values would be evicted in order to make space.
to fielddata.((("fielddata", "size")))((("aggregations", "limiting memory usage", "fielddata size")))
As you issue queries, aggregations on analyzed strings will load those fields into
fielddata if they weren't previously loaded. If the resulting fielddata size would
exceed the specified `size`, other values will be evicted in order to make space.

By default, this setting is _unbounded_&#x2014;Elasticsearch will never evict data
from fielddata.
@@ -92,7 +97,7 @@ setting to the `config/elasticsearch.yml` file:

[source,yaml]
-----------------------------
indices.fielddata.cache.size: 40% <1>
indices.fielddata.cache.size: 20% <1>
-----------------------------
<1> Can be set to a percentage of the heap size, or a concrete
value like `5gb`
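
While tuning this setting, it is useful to watch how much fielddata is actually loaded
and whether evictions are occurring. One way to do that is with the fielddata
statistics APIs, for example:

[source,js]
----
GET /_stats/fielddata?fields=* <1>

GET /_nodes/stats/indices/fielddata?fields=* <2>
----
<1> Fielddata memory usage summarized per index.
<2> Fielddata memory usage summarized per node.
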
67 changes: 0 additions & 67 deletions 300_Aggregations/110_docvalues.asciidoc

This file was deleted.

4 changes: 2 additions & 2 deletions 300_Aggregations/115_eager.asciidoc
@@ -39,8 +39,8 @@ are pre-loaded:
----
PUT /music/_mapping/_song
{
"price_usd": {
"type": "integer",
"tags": {
"type": "string",
"fielddata": {
"loading" : "eager" <1>
}
27 changes: 12 additions & 15 deletions 300_Aggregations/125_Conclusion.asciidoc
@@ -2,36 +2,33 @@
== Closing Thoughts

This section covered a lot of ground, and a lot of deeply technical issues.
Aggregations bring a power and flexibility to Elasticsearch that is hard to
overstate. The ability to nest buckets and metrics, to quickly approximate
cardinality and percentiles, to find statistical anomalies in your data, all
while operating on near-real-time data and in parallel to full-text search--these are game-changers to many organizations.

It is a feature that, once you start using it, you'll find dozens
of other candidate uses. Real-time reporting and analytics is central to many
organizations (be it over business intelligence or server logs).

But with great power comes great responsibility, and for Elasticsearch that often
means proper memory stewardship. Memory is often the limiting factor in
Elasticsearch deployments, particularly those that heavily utilize aggregations.
Because aggregation data is loaded to fielddata--and this is an in-memory data
structure--managing ((("aggregations", "managing efficient memory usage")))efficient memory usage is important.
Elasticsearch has made great strides in becoming more memory friendly by defaulting
to doc values for _most_ fields, but the necessity of fielddata for string fields
means you must remain vigilant.

The management of this memory can take several forms, depending on your
particular use-case:

- At a data level, by making sure you analyze (or `not_analyze`) your data appropriately
so that it is memory-friendly
- During indexing, by configuring heavy fields to use disk-based doc values instead
of in-memory fielddata
- During the planning stage, by organizing your data so that aggregations run on
`not_analyzed` strings rather than analyzed strings, so that doc values can be leveraged
- While testing, by verifying that analysis chains are not creating high-cardinality
fields that are later aggregated on
- At search time, by utilizing approximate aggregations and data filtering
- At a node level, by setting hard memory and dynamic circuit-breaker limits (see the example after this list)
- At an operations level, by monitoring memory usage and controlling slow garbage-collection cycles,
potentially by adding more nodes to the cluster
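
As a sketch of the node-level option above, the fielddata circuit breaker can be
adjusted dynamically on a live cluster (the `60%` shown here is only an illustrative
value, not a recommendation):

[source,js]
----
PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.fielddata.limit": "60%" <1>
  }
}
----
<1> Requests that would push fielddata past this fraction of the heap are aborted
instead of risking an OutOfMemory exception.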

Most deployments will use one or more of the preceding methods. The exact combination
is highly dependent on your particular environment. Some organizations need
blisteringly fast responses and opt to simply add more nodes. Other organizations
are limited by budget and choose doc values and approximate aggregations.
is highly dependent on your particular environment.

Whatever the path you take, it is important to assess the available options and
create both a short- and long-term plan. Decide how your memory situation exists