commit 570af4e (1 parent: f218f59)
Showing 3 changed files with 271 additions and 2 deletions.

@@ -0,0 +1,57 @@

== Approximate Aggregations

Life is easy if all of your data fits on a single machine. Classic algorithms
taught in CS201 will be sufficient for all your needs. But if all of your data
fits on a single machine, there would be no need for distributed software like
Elasticsearch at all. Once you start distributing data, you need to start
evaluating algorithms and their potential trade-offs.

Many algorithms are amenable to distributed execution. All of the aggregations
discussed so far execute in a single pass and give exact results. These types
of algorithms are often referred to as "embarrassingly parallel", because they
parallelize across multiple machines with little effort. When performing a
`max` metric, the underlying algorithm is very simple (a sample request follows
this list):

1. Broadcast the request to all shards.
2. Look at the `price` field for each document. If `price > current_max`,
replace `current_max` with `price`.
3. Return the maximum price from each shard to the coordinating node.
4. Find the maximum of the prices returned by all shards. This is the true maximum.
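
As a concrete illustration, here is a minimal sketch of such a request. It uses
the `cars/transactions` index and `price` field that the cardinality examples
later in this chapter rely on; the aggregation name `max_price` is arbitrary:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "max_price" : {
            "max" : {
                "field" : "price"
            }
        }
    }
}
--------------------------------------------------

Each shard computes its own local maximum independently; the coordinating node
only has to compare one number per shard, which is why the operation
parallelizes so cheaply.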

The algorithm scales linearly with the number of machines because it requires no
coordination (the machines don't need to discuss intermediate results), and
because the memory footprint is very small (a single integer representing the
current maximum).

There are some operations, however, that we would like to perform which are
_not_ embarrassingly parallel. For these algorithms, you need to start making
trade-offs. There is a triangle of factors at play: big data, exactness, and
real-time latency.

You get to choose two from this triangle:

- Exact + Real-time: Your data fits in the RAM of a single machine. The world
is your oyster; use any algorithm you want.

- Big Data + Exact: A classic Hadoop installation. It can handle petabytes of
data and give you exact answers...but it may take a week to produce that answer.

- Big Data + Real-time: Approximate algorithms that give you accurate, but not
exact, results.

Elasticsearch currently supports two approximate algorithms (`cardinality` and
`percentiles`). These will give you accurate, but not 100% exact, results.
In exchange for a little exactness, they use a small memory footprint and/or
offer faster execution speed.

For _most_ domains, highly accurate results that return _in real time_ across
_all your data_ are more important than 100% exactness. When trying to determine
the 99th percentile of latency for your website, you often don't care if
that value is off by 0.1%.

But you do care if it takes five minutes to load the dashboard. This is an example
where a little estimation gives you huge advantages with minimal downside.
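
To make the latency example concrete, a rough sketch of such a query with the
`percentiles` metric might look like the following. The index name `website`,
type `logs`, field `latency`, and aggregation name `load_times` are illustrative
assumptions rather than anything defined in this chapter:

[source,js]
--------------------------------------------------
GET /website/logs/_search?search_type=count
{
    "aggs" : {
        "load_times" : {
            "percentiles" : {
                "field" : "latency",
                "percents" : [99]
            }
        }
    }
}
--------------------------------------------------

The response contains a single approximate value per requested percentile,
which is what makes it practical to compute in real time over a large dataset.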

@@ -0,0 +1,209 @@

=== Cardinality Metric

The first approximate aggregation provided by Elasticsearch is the `cardinality`
metric. This provides the cardinality of a field, also called a "distinct" or
"unique" count. You may be familiar with the SQL version:

[source,sql]
--------
SELECT DISTINCT(color)
FROM cars
--------

Distinct counts are a common operation and answer many fundamental business questions:

- How many unique visitors have come to my website?
- How many unique cars have we sold?
- How many distinct users purchased a product each month?

We can use the `cardinality` metric to determine the number of car colors being
sold at our dealership:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
                "field" : "color"
            }
        }
    }
}
--------------------------------------------------
// SENSE: 300_Aggregations/60_cardinality.json

This returns a minimal response showing that we have sold three different-colored
cars:

[source,js]
--------------------------------------------------
...
"aggregations": {
    "distinct_colors": {
        "value": 3
    }
}
...
--------------------------------------------------

We can make our example more useful: how many colors were sold each month? For
that metric, we just nest the `cardinality` metric under a `date_histogram`:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "months" : {
            "date_histogram": {
                "field": "sold",
                "interval": "month"
            },
            "aggs": {
                "distinct_colors" : {
                    "cardinality" : {
                        "field" : "color"
                    }
                }
            }
        }
    }
}
--------------------------------------------------
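
The response now contains one bucket per month, each with its own distinct-color
count. As a rough illustration only (the exact keys and counts depend on the
documents you have indexed), an abbreviated response would look something like
this:

[source,js]
--------------------------------------------------
...
"aggregations": {
    "months": {
        "buckets": [
            {
                "key_as_string": "2014-01-01T00:00:00.000Z",
                "key": 1388534400000,
                "doc_count": 1,
                "distinct_colors": {
                    "value": 1
                }
            },
            ...
        ]
    }
}
...
--------------------------------------------------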

==== Understanding the Trade-offs

As mentioned at the top of this chapter, the `cardinality` metric is an approximate
algorithm. It is based on the http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++] (HLL) algorithm. HLL works by
hashing your input and using the bits from the hash to make probabilistic estimations
of the cardinality.

You don't need to understand the technical details (although if you're interested,
the paper is a great read!), but you should be aware of the *properties* of the
algorithm:

- Configurable precision, which decides how to trade memory for accuracy
- Excellent accuracy on low-cardinality sets
- Fixed memory usage: no matter whether there are tens or billions of unique
values, memory usage depends only on the configured precision

To configure the precision, you must specify the `precision_threshold` parameter.
This threshold defines the point under which cardinalities are expected to be very
close to exact. For example:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
                "field" : "color",
                "precision_threshold" : 100
            }
        }
    }
}
--------------------------------------------------
// SENSE: 300_Aggregations/60_cardinality.json

This ensures that fields with a cardinality of 100 or fewer will return extremely
accurate counts. Although not guaranteed by the algorithm, if a cardinality is
under the threshold it is almost always 100% accurate. Cardinalities above this
will begin to trade accuracy for memory savings, and a little error will creep
into the metric.

For a given threshold, the HLL data structure will use about
`precision_threshold * 8` bytes of memory. So you must balance how much memory
you are willing to sacrifice for additional accuracy.
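
As a rough back-of-the-envelope illustration of that formula, a
`precision_threshold` of `100` works out to roughly `100 * 8 = 800` bytes per
aggregation, while a threshold of `10000` would still only cost about 80KB, so
even generous thresholds are cheap in absolute terms.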

Practically speaking, a threshold of `100` maintains an error under 5% even when
counting millions of unique values.

==== Optimizing for Speed

The nature of counting distinct values usually means you are querying your entire
dataset (or nearly all of it). HyperLogLog is very fast already -- it simply
hashes your data and does some bit-twiddling.

If speed is very important to you, we can optimize it further. Since HLL simply
needs the hash of the field, we can precompute that hash at index time. Then,
at query time, Elasticsearch can skip the hash computation and load the value
directly out of field data.

[NOTE]
=========================
Precomputing hashes is usually only useful on very large and/or high-cardinality
fields, since that is where it saves CPU and memory. However, on numeric fields,
hashing is very fast and storing the original values requires as much or less
memory than storing the hashes. This is also true of low-cardinality string
fields, especially since those have an optimization to ensure that hashes are
computed at most once per unique value per segment.

Basically, precomputing hashes is not guaranteed to make all fields faster --
only those that have high cardinality and/or large strings. And remember,
precomputing simply shifts the cost to index time, meaning indexing will require
more CPU.
=========================

To do this, we need to add a new multi-field to our data. We'll delete our index,
add a new mapping that includes the hashed field, and then reindex:

[source,js]
----
DELETE /cars/
PUT /cars/
{
    "mappings": {
        "transactions": {
            "properties": {
                "color": {
                    "type": "string",
                    "fields": {
                        "hash": {
                            "type": "murmur3" <1>
                        }
                    }
                }
            }
        }
    }
}
POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }
----
// SENSE: 300_Aggregations/60_cardinality.json
<1> This multi-field is of type `murmur3`, which is a hashing function.

Now when we run an aggregation, we use the `color.hash` field instead of the
`color` field:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
                "field" : "color.hash" <1>
            }
        }
    }
}
--------------------------------------------------
// SENSE: 300_Aggregations/60_cardinality.json
<1> Notice that we specify the hashed multi-field, rather than the original.

@@ -1,3 +1,6 @@

== Approximate Aggregations (todo)
TODO

include::300_Aggregations/55_approx_intro.asciidoc[]

include::300_Aggregations/60_cardinality.asciidoc[]