Added missing part 03_Aggregations
clintongormley committed Nov 30, 2014
1 parent b61e63f commit de2478d
Showing 8 changed files with 149 additions and 146 deletions.
13 changes: 0 additions & 13 deletions 02_Dealing_with_language.asciidoc
@@ -64,16 +64,3 @@ include::240_Stopwords.asciidoc[]
include::260_Synonyms.asciidoc[]

include::270_Fuzzy_matching.asciidoc[]

include::301_Aggregation_Overview.asciidoc[]

include::302_Example_Walkthrough.asciidoc[]

include::303_Making_Graphs.asciidoc[]

include::304_Approximate_Aggregations.asciidoc[]

include::305_Significant_Terms.asciidoc[]

include::306_Practical_Considerations.asciidoc[]

105 changes: 60 additions & 45 deletions 300_Aggregations/05_overview.asciidoc → 03_Aggregations.asciidoc
@@ -1,45 +1,60 @@
[[aggregations]]
= Aggregations

[partintro]
--
Until this point, this book has been dedicated to search.((("searching", "search versus aggregations")))((("aggregations"))) With search,
we have a query and we want to find a subset of documents that
match the query. We are looking for the proverbial needle(s) in the
haystack.

With aggregations, we zoom out to get an overview of our data. Instead of
looking for individual documents, we want to analyze and summarize our complete
set of data:

// Popular manufacturers? Unusual clumps of needles in the haystack?
- How many needles are in the haystack?
- What is the average length of the needles?
- What is the median length of the needles, broken down by manufacturer?
- How many needles were added to the haystack each month?

Aggregations can answer more subtle questions too:

- What are your most popular needle manufacturers?
- Are there any unusual or anomalous clumps of needles?

Aggregations allow us to ask sophisticated questions of our data. And yet, while
the functionality is completely different from search, it leverages the
same data structures. This means aggregations execute quickly and are
_near real-time_, just like search.

This is extremely powerful for reporting and dashboards. Instead of performing
_rollups_ of your data (_that crusty Hadoop job that takes a week to run_),
you can visualize your data in real time, allowing you to respond immediately.

// Perhaps mention "not precalculated, out of date, and irrelevant"?
// Perhaps "aggs are calculated in the context of the user's search, so you're not showing them that you have 10 4 star hotels on your site, but that you have 10 4 star hotels that *match their criteria*".

Finally, aggregations operate alongside search requests.((("aggregations", "operating alongside search requests"))) This means you can
both search/filter documents _and_ perform analytics at the same time, on the
same data, in a single request. And because aggregations are calculated in the
context of a user's search, you're not just displaying a count of four-star hotels--you're displaying a count of four-star hotels that _match their search criteria_.
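
For example, a single request can run the user's search _and_ count the
matching hotels by star rating at the same time. The following is only a
sketch: the `/hotels/hotel` index and the `description` and `stars` fields
are hypothetical, not part of this book's dataset:

[source,js]
--------------------------------------------------
GET /hotels/hotel/_search
{
    "query" : {
        "match" : { "description" : "ocean view" } <1>
    },
    "aggs" : {
        "stars" : {
            "terms" : { "field" : "stars" } <2>
        }
    }
}
--------------------------------------------------
<1> The query restricts the results to hotels that match the user's criteria (the `description` field is illustrative only).
<2> The `terms` bucket counts only those matching hotels, per star rating.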

Aggregations are so powerful that many companies have built large Elasticsearch
clusters solely for analytics.
--
ifndef::es_build[= placeholder3]


include::301_Aggregation_Overview.asciidoc[]

include::302_Example_Walkthrough.asciidoc[]

include::303_Making_Graphs.asciidoc[]

include::304_Approximate_Aggregations.asciidoc[]

include::305_Significant_Terms.asciidoc[]

include::306_Practical_Considerations.asciidoc[]

86 changes: 0 additions & 86 deletions 300_Aggregations/15_concepts_buckets.asciidoc

This file was deleted.

2 changes: 2 additions & 0 deletions 300_Aggregations/30_histogram.asciidoc
@@ -103,6 +103,7 @@ means `0-20,000`, the key `20000` means `20,000-40,000`, and so forth.
Graphically, you could represent the preceding data in the histogram shown in <<barcharts-histo1>>:

[[barcharts-histo1]]
.Histogram of top makes per price range
image::images/elas_28in01.png["Histogram of top makes per price range"]

Of course, you can build bar charts with any aggregation that emits categories
@@ -144,6 +145,7 @@ std_err = std_deviation / count
This will allow us to build a chart like <<barcharts-bar1>>:

[[barcharts-bar1]]
.Barchart of average price per make, with error bars
image::images/elas_28in02.png["Barchart of average price per make, with error bars"]


86 changes: 84 additions & 2 deletions 301_Aggregation_Overview.asciidoc
@@ -1,4 +1,86 @@
[[aggs-high-level]]
== High-Level Concepts

Like the query DSL, ((("aggregations", "high-level concepts")))aggregations have a _composable_ syntax: independent units
of functionality can be mixed and matched to provide the custom behavior that
you need. This means that there are only a few basic concepts to learn, but
nearly limitless combinations of those basic components.

To master aggregations, you need to understand only two main concepts:

_Buckets_:: Collections of documents that meet a criterion
_Metrics_:: Statistics calculated on the documents in a bucket

That's it! Every aggregation is simply a combination of one or more buckets
and zero or more metrics. To translate into rough SQL terms:

[source,sql]
--------------------------------------------------
SELECT COUNT(color) <1>
FROM table
GROUP BY color <2>
--------------------------------------------------
<1> `COUNT(color)` is equivalent to a metric.
<2> `GROUP BY color` is equivalent to a bucket.

Buckets are conceptually similar to grouping in SQL, while metrics are similar
to `COUNT()`, `SUM()`, `MAX()`, and so forth.
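
A rough Elasticsearch sketch of that SQL would use a single `terms` bucket.
The `/cars/transactions` index and the `color` field here are purely
illustrative:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search
{
    "aggs" : {
        "colors" : {
            "terms" : { "field" : "color" } <1>
        }
    }
}
--------------------------------------------------
<1> One bucket per color; each bucket's `doc_count` plays the role of `COUNT(color)`.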


Let's dig into both of these concepts((("aggregations", "high-level concepts", "buckets")))((("buckets"))) and see what they entail.

=== Buckets

A _bucket_ is simply a collection of documents that meet a certain criterion:

- An employee would land in either the _male_ or _female_ bucket.
- The city of Albany would land in the _New York_ state bucket.
- The date 2014-10-28 would land within the _October_ bucket.

As aggregations are executed, the values inside each document are evaluated to
determine whether they match a bucket's criteria. If they match, the document is placed
inside the bucket and the aggregation continues.

Buckets can also be nested inside other buckets, giving you a hierarchy or
conditional partitioning scheme. For example, Cincinnati would be placed inside
the Ohio state bucket, and the _entire_ Ohio bucket would be placed inside the
USA country bucket.

Elasticsearch has a variety of buckets, which allow you to
partition documents in many ways (by hour, by most-popular terms, by
age ranges, by geographical location, and more). But fundamentally they all operate
on the same principle: partitioning documents based on a criterion.
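
For instance, answering "How many needles were added each month?" takes
nothing more than a `date_histogram` bucket. A minimal sketch, assuming a
hypothetical `needles` index with a `created` date field:

[source,js]
--------------------------------------------------
GET /needles/_search
{
    "aggs" : {
        "per_month" : {
            "date_histogram" : {
                "field"    : "created", <1>
                "interval" : "month"
            }
        }
    }
}
--------------------------------------------------
<1> Each document lands in the monthly bucket that matches its `created` date (field name illustrative only).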

=== Metrics

Buckets allow us to partition documents into useful subsets,((("aggregations", "high-level concepts", "metrics")))((("metrics"))) but ultimately what
we want is some kind of metric calculated on those documents in each bucket.
Bucketing is the means to an end: it provides a way to group documents so that
you can calculate interesting metrics on them.

Most _metrics_ are simple mathematical operations (for example, min, mean, max, and sum)
that are calculated using the document values. In practical terms, metrics allow
you to calculate quantities such as the average salary, or the maximum sale price,
or the 95th percentile for query latency.
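
For example, the average and maximum salary are each a single metric. A
sketch, assuming a hypothetical `employees` index with a numeric `salary`
field:

[source,js]
--------------------------------------------------
GET /employees/_search
{
    "aggs" : {
        "avg_salary" : { "avg" : { "field" : "salary" } }, <1>
        "max_salary" : { "max" : { "field" : "salary" } }  <2>
    }
}
--------------------------------------------------
<1> The `avg` metric calculates the mean of the `salary` values.
<2> The `max` metric returns the single highest `salary` (index and field names are illustrative only).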

=== Combining the Two

An _aggregation_ is a combination of buckets and metrics.((("aggregations", "high-level concepts", "combining buckets and metrics")))((("buckets", "combining with metrics")))((("metrics", "combining with buckets"))) An aggregation may have
a single bucket, or a single metric, or one of each. It may even have multiple
buckets nested inside other buckets. For example, we can partition documents by which country they belong to (a bucket), and
then calculate the average salary per country (a metric).

Because buckets can be nested, we can derive a much more complex aggregation:

1. Partition documents by country (bucket).
2. Then partition each country bucket by gender (bucket).
3. Then partition each gender bucket by age ranges (bucket).
4. Finally, calculate the average salary for each age range (metric).

This will give you the average salary per `<country, gender, age>` combination. All in
one request and with one pass over the data!
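
Expressed as a request, that nesting looks roughly like the following. Again,
only a sketch: the `employees` index and the `country`, `gender`, `age`, and
`salary` fields are hypothetical:

[source,js]
--------------------------------------------------
GET /employees/_search
{
    "aggs" : {
        "by_country" : {
            "terms" : { "field" : "country" }, <1>
            "aggs" : {
                "by_gender" : {
                    "terms" : { "field" : "gender" }, <2>
                    "aggs" : {
                        "by_age" : {
                            "range" : {
                                "field"  : "age", <3>
                                "ranges" : [
                                    { "to" : 30 },
                                    { "from" : 30, "to" : 50 },
                                    { "from" : 50 }
                                ]
                            },
                            "aggs" : {
                                "avg_salary" : {
                                    "avg" : { "field" : "salary" } <4>
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
--------------------------------------------------
<1> Bucket per country.
<2> Gender buckets, nested inside each country bucket.
<3> Age-range buckets, nested inside each gender bucket.
<4> The `avg` metric, calculated once per `<country, gender, age>` bucket.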




include::300_Aggregations/05_overview.asciidoc[]

include::300_Aggregations/15_concepts_buckets.asciidoc[]
1 change: 1 addition & 0 deletions 306_Practical_Considerations.asciidoc
@@ -1,3 +1,4 @@
[[controlling-memory]]
== Controlling Memory Use and Latency

include::300_Aggregations/90_fielddata.asciidoc[]
1 change: 1 addition & 0 deletions atlas.json
@@ -8,6 +8,7 @@
"00_Getting_started.asciidoc",
"01_Search_in_depth.asciidoc",
"02_Dealing_with_language.asciidoc",
"03_Aggregations.asciidoc",
"04_Geolocation.asciidoc",
"06_Modeling_your_data.asciidoc",
"07_Admin.asciidoc",
1 change: 1 addition & 0 deletions book.asciidoc
@@ -11,6 +11,7 @@ include::01_Search_in_depth.asciidoc[]

include::02_Dealing_with_language.asciidoc[]

include::03_Aggregations.asciidoc[]

include::04_Geolocation.asciidoc[]

