work on approx and cardinality
polyfractal committed Aug 1, 2014
1 parent f218f59 commit 570af4e
Showing 3 changed files with 271 additions and 2 deletions.
57 changes: 57 additions & 0 deletions 300_Aggregations/55_approx_intro.asciidoc
@@ -0,0 +1,57 @@

== Approximate Aggregations

Life is easy if all of your data fits on a single machine. The classic algorithms
taught in CS201 will be sufficient for all your needs. But if all your data fit
on a single machine, there would be no need for distributed software
like Elasticsearch at all. Once you start distributing data, you need to start
evaluating algorithms and their potential trade-offs.

Many algorithms are amenable to distributed execution. All of the aggregations
discussed thus far execute in a single pass and give exact results. These types
of algorithms are often referred to as "embarrassingly parallel",
because they parallelize across multiple machines with little effort. When
calculating a `max` metric, for example, the underlying algorithm is very simple:

1. Broadcast the request to all shards.
2. Look at the `price` field for each document. If `price > current_max`, replace
`current_max` with `price`.
3. Return the maximum price from each shard to the coordinating node.
4. Find the maximum of the prices returned by the shards. This is the true maximum.

The algorithm scales linearly with the number of machines because it requires no
coordination (the machines don't need to discuss intermediate results), and the
memory footprint is very small (a single integer representing the current maximum).
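
As a rough illustration of why this parallelizes so easily, here is a minimal
sketch of the max reduction in Python (the shard layout and variable names are
invented for the example, not anything Elasticsearch exposes):

[source,python]
----
# Toy model of the distributed max: each shard reduces its own documents,
# then the coordinating node reduces the per-shard results.
shards = [
    [{"price": 10000}, {"price": 30000}],   # documents on shard 0
    [{"price": 15000}, {"price": 12000}],   # documents on shard 1
    [{"price": 80000}, {"price": 25000}],   # documents on shard 2
]

# Steps 2-3: each shard scans its documents and returns a local maximum
shard_maxes = [max(doc["price"] for doc in docs) for docs in shards]

# Step 4: the coordinating node takes the maximum of the partial results
true_max = max(shard_maxes)
print(true_max)  # 80000
----

No shard needs to know anything about any other shard; each one contributes a
single number.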

There are some operations, however, that are _not_ embarrassingly parallel. For
these algorithms, you need to start making trade-offs. There is a triangle of
factors at play: big data, exactness, and real-time latency.

You get to choose two from this triangle.

- Exact + Real-time: Your data fits in the RAM of a single machine. The world
is your oyster; use any algorithm you want.

- Big Data + Exact: A classic Hadoop installation. It can handle petabytes of data
and give you exact answers...but it may take a week to deliver that answer.

- Big Data + Real-time: Approximate algorithms that give you accurate, but not
exact, results.

Elasticsearch currently supports two approximate algorithms (`cardinality` and
`percentiles`). These give you accurate, but not 100% exact, results. The
algorithms trade a little exactness for a small memory footprint and faster
execution.

For _most_ domains, highly accurate results that return _in real time_ across
_all your data_ are more important than 100% exactness. When trying to determine
the 99th percentile of latency for your website, you often don't care if
that value is off by 0.1%.

But you do care if it takes five minutes to load the dashboard. This is an example
where a little estimation gives you huge advantages with minimal downside.




209 changes: 209 additions & 0 deletions 300_Aggregations/60_cardinality.asciidoc
@@ -0,0 +1,209 @@

=== Cardinality Metric

The first approximate aggregation provided by Elasticsearch is the `cardinality`
metric. This provides the cardinality of a field, also called a "distinct" or
"unique" count. You may be familiar with the SQL version:

[source, sql]
--------
SELECT DISTINCT(color)
FROM cars
--------

Distinct counts are a common operation, and answer many fundamental business questions:

- How many unique visitors have come to my website?
- How many unique cars have we sold?
- How many distinct users purchased a product each month?

We can use the `cardinality` metric to determine the number of car colors being
sold at our dealership:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
                "field" : "color"
            }
        }
    }
}
--------------------------------------------------
// SENSE: 300_Aggregations/60_cardinality.json

This returns a minimal response showing that we have sold three different colors
of cars:

[source,js]
--------------------------------------------------
...
"aggregations": {
"distinct_colors": {
"value": 3
}
}
...
--------------------------------------------------

We can make our example more useful: how many colors were sold each month? For
that metric, we just nest the `cardinality` metric under a `date_histogram`:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "months" : {
            "date_histogram": {
                "field": "sold",
                "interval": "month"
            },
            "aggs": {
                "distinct_colors" : {
                    "cardinality" : {
                        "field" : "color"
                    }
                }
            }
        }
    }
}
--------------------------------------------------


==== Understanding the Tradeoffs
As mentioned at the top of this chapter, the `cardinality` metric is an approximate
algorithm. It is based on the http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++] (HLL) algorithm. HLL works by
hashing your input and using the bits from the hash to make probabilistic estimations
of the cardinality.
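
To get a feel for the approach, here is a deliberately simplified HyperLogLog
sketch in Python. It is not the HLL++ variant Elasticsearch uses (which adds bias
correction, a sparse encoding, and other refinements), but it shows how the hash
bits alone drive the estimate:

[source,python]
----
import hashlib
import math

P = 14                                  # precision: 2^14 = 16,384 registers
M = 1 << P
REGISTERS = [0] * M
ALPHA = 0.7213 / (1 + 1.079 / M)        # standard bias-correction constant

def add(value):
    # Derive a 64-bit hash of the value
    h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
    idx = h >> (64 - P)                 # first P bits choose a register
    w = h & ((1 << (64 - P)) - 1)       # remaining bits
    rank = (64 - P) - w.bit_length() + 1    # leading zeros in w, plus one
    REGISTERS[idx] = max(REGISTERS[idx], rank)

def estimate():
    raw = ALPHA * M * M / sum(2.0 ** -r for r in REGISTERS)
    zeros = REGISTERS.count(0)
    if raw <= 2.5 * M and zeros:        # small-range ("linear counting") correction
        return M * math.log(M / zeros)
    return raw

for color in ["red", "red", "green", "blue", "green", "red"]:
    add(color)
print(round(estimate()))                # 3
----

Notice that memory is just the fixed array of registers: it does not grow no
matter how many values pass through `add()`, which is where the properties
below come from.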

You don't need to understand the technical details (although if you're interested,
the paper is a great read!), but you should be aware of the *properties* of the
algorithm:

- Configurable precision, which controls how memory is traded for accuracy
- Excellent accuracy on low-cardinality sets
- Fixed memory usage: whether there are tens or billions of unique values, memory usage depends only on the configured precision

To configure the precision, you must specify the `precision_threshold` parameter.
This threshold defines the point under which cardinalities are expected to be very
close to accurate. So for example:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
                "field" : "color",
                "precision_threshold" : 100
            }
        }
    }
}
--------------------------------------------------
// SENSE: 300_Aggregations/60_cardinality.json

This ensures that counts of 100 or fewer distinct values will be extremely accurate.
Although not guaranteed by the algorithm, if a cardinality is under the threshold,
it is almost always 100% accurate. Cardinalities above this threshold begin to trade
accuracy for memory savings, and a little error creeps into the metric.

For a given threshold, the HLL data-structure will use about
`precision_threshold * 8` bytes of memory. So you must balance how much memory
you are willing to sacrifice for additional accuracy.

Practically speaking, a threshold of `100` maintains an error under 5% even when
counting millions of unique values.
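
As a quick back-of-the-envelope check of that rule of thumb (the threshold values
below are simply examples, not recommendations):

[source,python]
----
# Rough memory cost per cardinality aggregation, using the
# `precision_threshold * 8` bytes rule of thumb described above.
for threshold in (100, 1000, 10000, 40000):
    print(f"precision_threshold={threshold:>6}: ~{threshold * 8 / 1024:.1f} KB")
----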

==== Optimizing for Speed
The nature of counting distinct objects usually means you are querying your entire
data-set (or nearly all of it). HyperLogLog is very fast already -- it simply
hashes your data and does some bit-twiddling.

If speed is very important to you, we can optimize it further. Since HLL simply
needs the hash of the field, we can precompute that hash at index time. Then, at
query time, we can skip the hashing computation and load the precomputed hash
values directly out of fielddata.

[NOTE]
=========================
Pre-computing hashes is usually only useful on very large and/or high-cardinality
fields, as it saves CPU and memory. On numeric fields, however, hashing is very
fast and storing the original values requires as much or less memory than storing
the hashes. This is also true of low-cardinality string fields, especially since
those benefit from an optimization that ensures hashes are computed at most once
per unique value per segment.

Basically, pre-computing hashes is not guaranteed to make all fields faster --
only those that have high cardinality and/or large strings. And remember,
pre-computing simply shifts the cost to index time: indexing will require
more CPU as a result.
=========================

To do this, we need to add a new multi-field to our data. We'll delete our index,
add a new mapping which includes the hashed field, then reindex:

[source,js]
----
DELETE /cars/
PUT /cars/
{
    "mappings": {
        "transactions": {
            "properties": {
                "color": {
                    "type": "string",
                    "fields": {
                        "hash": {
                            "type": "murmur3" <1>
                        }
                    }
                }
            }
        }
    }
}
POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }
----
// SENSE: 300_Aggregations/60_cardinality.json
<1> This multi-field is of type `murmur3`, which is a hashing function.

Now when we run an aggregation, we use the `"color.hash"` field instead of the
`"color"` field:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
                "field" : "color.hash" <1>
            }
        }
    }
}
--------------------------------------------------
// SENSE: 300_Aggregations/60_cardinality.json
<1> Notice that we specify the hashed multi-field, rather than the original field.
7 changes: 5 additions & 2 deletions 304_Approximate_Aggregations.asciidoc
@@ -1,3 +1,6 @@

== Approximate Aggregations (todo)
TODO


include::300_Aggregations/55_approx_intro.asciidoc[]

include::300_Aggregations/60_cardinality.asciidoc[]
