
Start cleaning up fielddata/doc values language
polyfractal committed Apr 15, 2016
1 parent 0d8cdc2 commit 1333bb4
Showing 19 changed files with 358 additions and 325 deletions.
3 changes: 1 addition & 2 deletions 056_Sorting.asciidoc
@@ -4,5 +4,4 @@ include::056_Sorting/88_String_sorting.asciidoc[]

include::056_Sorting/90_What_is_relevance.asciidoc[]

include::056_Sorting/95_Fielddata.asciidoc[]

include::056_Sorting/95_Docvalues.asciidoc[]
3 changes: 1 addition & 2 deletions 056_Sorting/88_String_sorting.asciidoc
@@ -71,5 +71,4 @@ GET /_search
// SENSE: 056_Sorting/88_Multifield.json

WARNING: Sorting on a full-text `analyzed` field can use a lot of memory. See
<<fielddata-intro>> for more information.

<<aggregations-and-analysis>> for more information.
44 changes: 44 additions & 0 deletions 056_Sorting/95_Docvalues.asciidoc
@@ -0,0 +1,44 @@
[[docvalues-intro]]
=== Doc Values Intro

Our final topic in this chapter is about an internal aspect of Elasticsearch.
While we don't demonstrate any new techniques here, doc values are an
important topic that we will refer to repeatedly, and something that you
should be aware of.((("docvalues")))

When you sort on a field, Elasticsearch needs access to the value of that
field for every document that matches the query.((("inverted index", "sorting and"))) The inverted index, which
performs very well when searching, is not the ideal structure for sorting on
field values:

* When searching, we need to be able to map a term to a list of documents.

* When sorting, we need to map a document to its terms. In other words, we
need to ``uninvert'' the inverted index.

This ``uninverted'' structure is often called a ``column-store'' in other systems.
Essentially, it stores all the values for a single field together in a single
column of data, which makes it very efficient for operations like sorting.
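
To make ``uninverting'' concrete, here is a rough, hypothetical sketch (the terms and
documents are invented for illustration) of the two orientations side by side:

----
Inverted index (search)            Doc values / column-store (sorting)

Term      ->  Documents            Document  ->  Terms
"brown"   ->  Doc_1, Doc_2         Doc_1     ->  [ "brown", "fox" ]
"fox"     ->  Doc_1                Doc_2     ->  [ "brown", "quick" ]
"quick"   ->  Doc_2
----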

In Elasticsearch, this column-store is known as _doc values_, and is enabled
by default. Doc values are created at index-time: when a field is indexed, Elasticsearch
adds the tokens to the inverted index for search. But it also extracts the terms
and adds them to the columnar doc values.
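
For instance, in a mapping like the following (a minimal, hypothetical sketch; the
index, type, and field names are not from this book's examples), the `not_analyzed`
string field gets doc values built for it automatically at index time:

[source,js]
----
PUT /my_store
{
  "mappings": {
    "products": {
      "properties": {
        "status": {
          "type": "string",
          "index": "not_analyzed" <1>
        }
      }
    }
  }
}
----
<1> Nothing extra is required: doc values are enabled by default for `not_analyzed`
strings and for numeric and date fields. Analyzed string fields are the exception;
they cannot use doc values and instead rely on in-memory fielddata.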

Doc values are used in several places in Elasticsearch:

* Sorting on a field
* Aggregations on a field
* Certain filters (for example, geolocation filters)
* Scripts that refer to fields
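
For example, a single search request might sort and aggregate on the same field, and
both operations would read the same columnar structure. (This is a hypothetical
request; `my_store`, `description`, and `price` are invented names.)

[source,js]
----
GET /my_store/_search
{
  "query": { "match": { "description": "red shoes" } },
  "sort":  { "price": { "order": "asc" } }, <1>
  "aggs": {
    "avg_price": { "avg": { "field": "price" } } <2>
  }
}
----
<1> Sorting on `price` reads the doc values for that field.
<2> The `avg` metric reads the same per-field column of values.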

Because doc values are serialized to disk, we can leverage the OS to help keep
access fast. When the "working set" is smaller than the available memory on a node,
the OS will naturally keep all the doc values hot in memory, leading to very fast
access. When the "working set" is much larger than available memory, the OS will
naturally start to page doc values on and off disk without running into the dreaded
OutOfMemory exception.

We'll talk about doc values in much greater depth later. For now, all you need
to know is that sorting (and some other operations) happens on a parallel data
structure that is built at index time.
54 changes: 0 additions & 54 deletions 056_Sorting/95_Fielddata.asciidoc

This file was deleted.

55 changes: 30 additions & 25 deletions 300_Aggregations/100_circuit_breaker_fd_settings.asciidoc
@@ -1,20 +1,28 @@

=== Limiting Memory Usage

In order for aggregations (or any operation that requires access to field
values) to be fast, ((("aggregations", "limiting memory usage")))access to fielddata must be fast, which is why it is
loaded into memory. ((("fielddata")))((("memory usage", "limiting for aggregations", id="ix_memagg"))) But loading too much data into memory will cause slow
garbage collections as the JVM tries to find extra space in the heap, or
possibly even an OutOfMemory exception.
Once analyzed strings have been loaded into fielddata, they will sit there until
evicted (or your node crashes). For that reason it is important to keep an eye on this
memory usage, to understand how and when fielddata loads, and to know how you can
limit its impact on your cluster.

It may surprise you to find that Elasticsearch does not load into fielddata
just the values for the documents that match your query. It loads the values
for _all documents in your index_, even documents with a different `_type`!
Fielddata is loaded _lazily_. If you never aggregate on an analyzed string, you'll
never load fielddata into memory. Furthermore, fielddata is loaded on a per-field basis,
meaning only actively used fields will incur the "fielddata tax".

The logic is: if you need access to documents X, Y, and Z for this query, you
will probably need access to other documents in the next query. It is cheaper
to load all values once, and to _keep them in memory_, than to have to scan
the inverted index on every request.
However, there is a subtle surprise lurking here. Suppose your query is highly selective and
only returns 100 hits. Most people assume fielddata is only loaded for those
100 documents.

In reality, fielddata will be loaded for *all* documents in that index (for that
particular field), regardless of the query's specificity. The logic is:
if you need access to documents X, Y, and Z for this query, you
will probably need access to other documents in the next query.

Unlike doc values,
the fielddata structure is not created at index time. Instead, it is populated
on-the-fly when the query is run. This is a potentially non-trivial operation and
can take some time. It is cheaper to load all the values once, and keep them in
memory, than load only a portion of the total fielddata repeatedly.
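
To make that concrete, consider a hypothetical request (the index and field names are
invented) whose `terms` aggregation targets an analyzed string field. The first time it
runs, Elasticsearch must uninvert the field and load the result into fielddata before
the aggregation can be calculated:

[source,js]
----
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "popular_terms": {
      "terms": { "field": "body" } <1>
    }
  }
}
----
<1> Because `body` is an analyzed string, this aggregation is served by in-memory
fielddata rather than doc values, and the first run pays the cost of building it.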

The JVM heap ((("JVM (Java Virtual Machine)", "heap usage, fielddata and")))is a limited resource that should be used wisely. A number of
mechanisms exist to limit the impact of fielddata on heap usage. These limits
@@ -29,29 +37,26 @@ the `$ES_HEAP_SIZE` environment variable:
No more than 50% of available RAM::
Lucene makes good use of the filesystem caches, which are managed by the
kernel. Without enough filesystem cache space, performance will suffer.
Furthermore, the more memory dedicated to the heap, the less that is available
for all your other fields using doc values.
No more than 32 GB:
No more than 32 GB::
If the heap is less than 32 GB, the JVM can use compressed pointers, which
saves a lot of memory: 4 bytes per pointer instead of 8 bytes.
+
Increasing the heap from 32 GB to 34 GB would mean that you have much _less_
memory available, because all pointers are taking double the space. Also,
with bigger heaps, garbage collection becomes more costly and can result in
node instability.
This limit has a direct impact on the amount of memory that can be devoted to fielddata.
For a longer and more complete discussion of heap sizing, see <<heap-sizing>>
******************************************

[[fielddata-size]]
==== Fielddata Size

The `indices.fielddata.cache.size` setting controls how much heap space is allocated
to fielddata.((("fielddata", "size")))((("aggregations", "limiting memory usage", "fielddata size"))) When you run a query that requires access to new field values,
it will load the values into memory and then try to add them to fielddata. If
the resulting fielddata size would exceed the specified `size`, other
values would be evicted in order to make space.
to fielddata.((("fielddata", "size")))((("aggregations", "limiting memory usage", "fielddata size")))
As you issue queries, aggregations on analyzed strings will load those fields into
fielddata if they weren't previously loaded. If the resulting fielddata size would
exceed the specified `size`, other values will be evicted in order to make space.

By default, this setting is _unbounded_&#x2014;Elasticsearch will never evict data
from fielddata.
@@ -92,7 +97,7 @@ setting to the `config/elasticsearch.yml` file:

[source,yaml]
-----------------------------
indices.fielddata.cache.size: 40% <1>
indices.fielddata.cache.size: 20% <1>
-----------------------------
<1> Can be set to a percentage of the heap size, or a concrete
value like `5gb`
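
While tuning this setting, it is useful to watch how much fielddata is actually loaded
and whether evictions are occurring. One way to do that is with the fielddata
statistics APIs, for example:

[source,js]
----
GET /_stats/fielddata?fields=* <1>

GET /_nodes/stats/indices/fielddata?fields=* <2>
----
<1> Fielddata memory usage summarized per index.
<2> Fielddata memory usage summarized per node.
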
67 changes: 0 additions & 67 deletions 300_Aggregations/110_docvalues.asciidoc

This file was deleted.

4 changes: 2 additions & 2 deletions 300_Aggregations/115_eager.asciidoc
@@ -39,8 +39,8 @@ are pre-loaded:
----
PUT /music/_mapping/_song
{
"price_usd": {
"type": "integer",
"tags": {
"type": "string",
"fielddata": {
"loading" : "eager" <1>
}
27 changes: 12 additions & 15 deletions 300_Aggregations/125_Conclusion.asciidoc
@@ -2,36 +2,33 @@
== Closing Thoughts

This section covered a lot of ground, and a lot of deeply technical issues.
Aggregations bring a power and flexibility to Elasticsearch that is hard to
overstate. The ability to nest buckets and metrics, to quickly approximate
cardinality and percentiles, to find statistical anomalies in your data, all
while operating on near-real-time data and in parallel to full-text search--these are game-changers to many organizations.

It is a feature that, once you start using it, you'll find dozens
of other candidate uses. Real-time reporting and analytics is central to many
organizations (be it over business intelligence or server logs).

But with great power comes great responsibility, and for Elasticsearch that often
means proper memory stewardship. Memory is often the limiting factor in
Elasticsearch deployments, particularly those that heavily utilize aggregations.
Because aggregation data is loaded to fielddata--and this is an in-memory data
structure--managing ((("aggregations", "managing efficient memory usage")))efficient memory usage is important.
Elasticsearch has made great strides in becoming more memory friendly by defaulting
to doc values for _most_ fields, but the necessity of fielddata for string fields
means you must remain vigilant.

The management of this memory can take several forms, depending on your
particular use-case:

- At a data level, by making sure you analyze (or `not_analyze`) your data appropriately
so that it is memory-friendly
- During indexing, by configuring heavy fields to use disk-based doc values instead
of in-memory fielddata
- During the planning stage, by organizing your data so that aggregations run on
`not_analyzed` strings rather than analyzed strings, so that doc values can be leveraged
- While testing, by verifying that analysis chains are not creating high-cardinality
fields that are later aggregated on
- At search time, by utilizing approximate aggregations and data filtering
- At a node level, by setting hard memory and dynamic circuit-breaker limits (see the example after this list)
- At an operations level, by monitoring memory usage and controlling slow garbage-collection cycles,
potentially by adding more nodes to the cluster
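
As a sketch of the node-level option above, the fielddata circuit breaker can be
adjusted dynamically on a live cluster (the `60%` shown here is only an illustrative
value, not a recommendation):

[source,js]
----
PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.fielddata.limit": "60%" <1>
  }
}
----
<1> Requests that would push fielddata past this fraction of the heap are aborted
instead of risking an OutOfMemory exception.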

Most deployments will use one or more of the preceding methods. The exact combination
is highly dependent on your particular environment. Some organizations need
blisteringly fast responses and opt to simply add more nodes. Other organizations
are limited by budget and choose doc values and approximate aggregations.
is highly dependent on your particular environment.

Whatever the path you take, it is important to assess the available options and
create both a short- and long-term plan. Decide how your memory situation exists