diff --git a/010_Intro/30_Tutorial_Search.asciidoc b/010_Intro/30_Tutorial_Search.asciidoc index 2d5d4d80d..f9717422d 100644 --- a/010_Intro/30_Tutorial_Search.asciidoc +++ b/010_Intro/30_Tutorial_Search.asciidoc @@ -209,15 +209,15 @@ which allows us to execute structured searches efficiently: GET /megacorp/employee/_search { "query" : { - "filtered" : { - "filter" : { - "range" : { - "age" : { "gt" : 30 } <1> - } - }, - "query" : { + "bool": { + "must": { "match" : { - "last_name" : "smith" <2> + "last_name" : "smith" <1> + } + }, + "filter": { + "range" : { + "age" : { "gt" : 30 } <2> } } } @@ -226,13 +226,15 @@ GET /megacorp/employee/_search -------------------------------------------------- // SENSE: 010_Intro/30_Query_DSL.json -<1> This portion of the query is a `range` _filter_, which((("range filters"))) will find all ages +<1> This portion of the query is the((("match queries"))) same `match` _query_ that we used before. +<2> This portion of the query is a `range` _filter_, which((("range filters"))) will find all ages older than 30—`gt` stands for _greater than_. -<2> This portion of the query is the((("match queries"))) same `match` _query_ that we used before. + Don't worry about the syntax too much for now; we will cover it in great detail later. Just recognize that we've added a _filter_ that performs a -range search, and reused the same `match` query as before. Now our results show only one employee who happens to be 32 and is named Jane Smith: +range search, and reused the same `match` query as before. Now our results show +only one employee who happens to be 32 and is named Jane Smith: [source,js] -------------------------------------------------- @@ -446,4 +448,3 @@ HTML tags: You can read more about the highlighting of search snippets in the {ref}/search-request-highlighting.html[highlighting reference documentation].
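The rewritten tutorial query above pairs a scoring `match` clause on `last_name` with a non-scoring `range` filter on `age`. As an illustrative aside, its semantics can be sketched in plain Python: a toy model with invented employee documents and a 1.0-per-hit score, not Elasticsearch internals.

```python
# Toy sketch of the tutorial's bool query: a scoring "must" match on
# last_name plus a non-scoring "filter" on age > 30. The sample docs and
# the 1.0-per-match scoring are assumptions for illustration only.
employees = [
    {"first_name": "John", "last_name": "Smith", "age": 25},
    {"first_name": "Jane", "last_name": "Smith", "age": 32},
    {"first_name": "Douglas", "last_name": "Fir", "age": 35},
]

def match_score(doc, field, text):
    # Scoring clause: toy relevance of 1.0 on a hit, 0.0 otherwise.
    return 1.0 if text.lower() == doc[field].lower() else 0.0

def search(docs):
    hits = []
    for doc in docs:
        if not doc["age"] > 30:            # filter: excludes, never scores
            continue
        score = match_score(doc, "last_name", "smith")  # must: scores
        if score > 0.0:
            hits.append((score, doc))
    hits.sort(key=lambda hit: -hit[0])     # best score first
    return [doc for _, doc in hits]

print(search(employees))  # only Jane Smith (age 32) survives both clauses
```

Only Jane Smith comes back: the filter silently drops everyone aged 30 or under, and only then does the scoring clause run over what remains.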
- diff --git a/054_Query_DSL.asciidoc b/054_Query_DSL.asciidoc index 4d6a04119..7933d3fad 100644 --- a/054_Query_DSL.asciidoc +++ b/054_Query_DSL.asciidoc @@ -6,7 +6,6 @@ include::054_Query_DSL/65_Queries_vs_filters.asciidoc[] include::054_Query_DSL/70_Important_clauses.asciidoc[] -include::054_Query_DSL/75_Queries_with_filters.asciidoc[] +include::054_Query_DSL/75_Combining_queries_together.asciidoc[] include::054_Query_DSL/80_Validating_queries.asciidoc[] - diff --git a/054_Query_DSL/60_Query_DSL.asciidoc b/054_Query_DSL/60_Query_DSL.asciidoc index 492f9aa5f..14168a33f 100644 --- a/054_Query_DSL/60_Query_DSL.asciidoc +++ b/054_Query_DSL/60_Query_DSL.asciidoc @@ -99,7 +99,8 @@ other to create complex queries. Clauses can be as follows: * _Compound_ clauses that are used ((("compound query clauses")))to combine other query clauses. For instance, a `bool` clause((("bool clause"))) allows you to combine other clauses that - either `must` match, `must_not` match, or `should` match if possible: + either `must` match, `must_not` match, or `should` match if possible. They can also include non-scoring + filters for structured search: [source,js] -------------------------------------------------- @@ -107,7 +108,8 @@ other to create complex queries.
Clauses can be as follows: "bool": { "must": { "match": { "tweet": "elasticsearch" }}, "must_not": { "match": { "name": "mary" }}, - "should": { "match": { "tweet": "full text" }} + "should": { "match": { "tweet": "full text" }}, + "filter": { "range": { "age" : { "gt" : 30 }} } } } -------------------------------------------------- diff --git a/054_Query_DSL/65_Queries_vs_filters.asciidoc b/054_Query_DSL/65_Queries_vs_filters.asciidoc index 616b6157e..981465b03 100644 --- a/054_Query_DSL/65_Queries_vs_filters.asciidoc +++ b/054_Query_DSL/65_Queries_vs_filters.asciidoc @@ -1,11 +1,13 @@ === Queries and Filters -Although we refer to the query DSL, in reality there are two DSLs: the -query DSL and the filter DSL.((("DSL (Domain Specific Language)", "Query and Filter DSL")))((("Filter DSL"))) Query clauses and filter clauses are similar -in nature, but have slightly different purposes. +The DSL((("DSL (Domain Specific Language)", "Query and Filter DSL"))) used by +Elasticsearch has a single set of components called queries, which can be mixed +and matched in endless combinations. This single set of components can be used +in two contexts: filtering context and query context. -A _filter_ asks a yes|no question of((("filters", "queries versus")))((("exact values", "filters with yes|no questions for fields containing"))) every document and is used -for fields that contain exact values: +When used in _filtering context_, the query is said to be a "non-scoring" or "filtering" +query. That is, the query simply asks the question: "Does this document match?". +The answer is always a simple, binary yes|no. * Is the `created` date in the range `2013` - `2014`? @@ -13,10 +15,11 @@ for fields that contain exact values: * Is the `lat_lon` field within `10km` of a specified point? -A _query_ is similar to a filter, but also asks((("queries", "filters versus"))) the question: -How _well_ does this document match? 
+When used in a _querying context_, the query becomes a "scoring" query. Similar to +its non-scoring sibling, this determines if a document matches. But it also determines +how _well_ the document matches. -A typical use for a query is to find documents +A typical use for a query is to find documents: * Best matching the words `full text search` @@ -29,34 +32,47 @@ A typical use for a query is to find documents * Tagged with `lucene`, `search`, or `java`—the more tags, the more relevant the document -A query calculates how _relevant_ each document((("relevance", "calculation by queries"))) is to the +A scoring query calculates how _relevant_ each document((("relevance", "calculation by queries"))) is to the query, and assigns it a relevance `_score`, which is later used to sort matching documents by relevance. This concept of relevance is well suited to full-text search, where there is seldom a completely ``correct'' answer. +[NOTE] +==== +Historically, queries and filters were separate components in Elasticsearch. Starting +in Elasticsearch 2.0, filters were technically eliminated, and all queries gained +the ability to become non-scoring. + +However, for clarity and simplicity, we will use the term "filter" to mean a query which +is used in a non-scoring, filtering context. You can think of the terms "filter", +"filtering query" and "non-scoring query" as being identical. + +Similarly, if the term "query" is used in isolation without a qualifier, we are +referring to a "scoring query". +==== + ==== Performance Differences -The output from most filter clauses--a simple((("filters", "performance, queries versus"))) list of the documents that match -the filter--is quick to calculate and easy to cache in memory, using -only 1 bit per document. These cached filters can be reused -efficiently for subsequent requests. +Filtering queries are simple checks for set inclusion/exclusion, which make them +very fast to compute.
There are various optimizations that can be leveraged +when at least one of your filtering queries is "sparse" (few matching documents), +and frequently used non-scoring queries can be cached in memory for faster access. -Queries have to not only find((("queries", "performance, filters versus"))) matching documents, but also calculate how -relevant each document is, which typically makes queries heavier than filters. -Also, query results are not cachable. +In contrast, scoring queries have to not only find((("queries", "performance, filters versus"))) +matching documents, but also calculate how relevant each document is, which typically makes +them heavier than their non-scoring counterparts. Also, query results are not cacheable. -Thanks to the inverted index, a simple query that matches just a few documents -may perform as well or better than a cached filter that spans millions -of documents. In general, however, a cached filter will outperform a -query, and will do so consistently. +Thanks to the inverted index, a simple scoring query that matches just a few documents +may perform as well or better than a filter that spans millions +of documents. In general, however, a filter will outperform a +scoring query. And it will do so consistently. -The goal of filters is to _reduce the number of documents that have to -be examined by the query_. +The goal of filtering is to _reduce the number of documents that have to +be examined by the scoring queries_. ==== When to Use Which -As a general rule, use((("filters", "when to use")))((("queries", "when to use"))) query clauses for _full-text_ search or -for any condition that should affect the _relevance score_, and -use filter clauses for everything else. - +As a general rule, use((("filters", "when to use")))((("queries", "when to use"))) +query clauses for _full-text_ search or for any condition that should affect the +_relevance score_, and use filters for everything else.
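The caching claim above can be made concrete with a small Python sketch: a non-scoring query reduced to a reusable set of matching document IDs. The cache shape here is an assumption, a stand-in for Lucene's per-segment bitsets rather than Elasticsearch's real implementation.

```python
# Toy model: a filter computes the set of matching doc IDs once, caches it,
# and reuses it for later requests. Doc data and cache key are invented.
docs = {1: {"age": 28}, 2: {"age": 35}, 3: {"age": 40}}
filter_cache = {}

def age_over_30():
    key = "range:age>30"
    if key not in filter_cache:
        # Pure set inclusion/exclusion: no relevance calculation at all.
        filter_cache[key] = {doc_id for doc_id, doc in docs.items()
                             if doc["age"] > 30}
    return filter_cache[key]

first = age_over_30()    # computed and cached
second = age_over_30()   # served straight from the cache
print(sorted(second))    # [2, 3]
```

Because the answer is just a set, repeated requests intersect or reuse it cheaply; a scoring query has no equivalent shortcut, since its per-document scores depend on the query text.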
diff --git a/054_Query_DSL/70_Important_clauses.asciidoc b/054_Query_DSL/70_Important_clauses.asciidoc index ff6050867..0cf6ea486 100644 --- a/054_Query_DSL/70_Important_clauses.asciidoc +++ b/054_Query_DSL/70_Important_clauses.asciidoc @@ -1,121 +1,9 @@ -=== Most Important Queries and Filters +=== Most Important Queries -While Elasticsearch comes with many queries and filters, you will use +While Elasticsearch comes with many queries, you will use just a few frequently. We discuss them in much greater detail in <> but next we give you a quick introduction to -the most important queries and filters. - -==== term Filter - -The `term` filter is used to filter by((("filters", "important")))((("term filter"))) exact values, be they numbers, dates, -Booleans, or `not_analyzed` exact-value string fields: - -[source,js] --------------------------------------------------- -{ "term": { "age": 26 }} -{ "term": { "date": "2014-09-01" }} -{ "term": { "public": true }} -{ "term": { "tag": "full_text" }} --------------------------------------------------- -// SENSE: 054_Query_DSL/70_Term_filter.json - -==== terms Filter - -The `terms` filter is((("terms filter"))) the same as the `term` filter, but allows you -to specify multiple values to match. 
If the field contains any of -the specified values, the document matches: - -[source,js] --------------------------------------------------- -{ "terms": { "tag": [ "search", "full_text", "nosql" ] }} --------------------------------------------------- -// SENSE: 054_Query_DSL/70_Terms_filter.json - -==== range Filter - -The `range` filter allows you to find((("range filters"))) numbers or dates that fall into -a specified range: - -[source,js] --------------------------------------------------- -{ - "range": { - "age": { - "gte": 20, - "lt": 30 - } - } -} --------------------------------------------------- -// SENSE: 054_Query_DSL/70_Range_filter.json - -The operators that it accepts are as follows: - - `gt`:: - Greater than - - `gte`:: - Greater than or equal to - - `lt`:: - Less than - - `lte`:: - Less than or equal to - - -==== exists and missing Filters - -The `exists` and `missing` filters are ((("exists filter")))((("missing filter")))used to find documents in which the -specified field either has one or more values (`exists`) or doesn't have any -values (`missing`). It is similar in nature to `IS_NULL` (`missing`) and `NOT -IS_NULL` (`exists`)in SQL: - -[source,js] --------------------------------------------------- -{ - "exists": { - "field": "title" - } -} --------------------------------------------------- -// SENSE: 054_Query_DSL/70_Exists_filter.json - -These filters are frequently used to apply a condition only if a field is -present, and to apply a different condition if it is missing. - -==== bool Filter - -The `bool` filter is used ((("bool filter")))((("must clause", "in bool filters")))((("must_not clause", "in bool filters")))((("should clause", "in bool filters")))to combine multiple filter clauses using -Boolean logic. ((("bool filter", "must, must_not, and should clauses"))) It accepts three parameters: - - `must`:: - These clauses _must_ match, like `and`. - - `must_not`:: - These clauses _must not_ match, like `not`. 
- - `should`:: - At least one of these clauses must match, like `or`. - -Each of these parameters can accept a single filter clause or an array -of filter clauses: - -[source,js] --------------------------------------------------- -{ - "bool": { - "must": { "term": { "folder": "inbox" }}, - "must_not": { "term": { "tag": "spam" }}, - "should": [ - { "term": { "starred": true }}, - { "term": { "unread": true }} - ] - } -} --------------------------------------------------- -// SENSE: 054_Query_DSL/70_Bool_filter.json - +the most important queries. ==== match_all Query @@ -161,8 +49,8 @@ exact value: -------------------------------------------------- // SENSE: 054_Query_DSL/70_Match_query.json -TIP: For exact-value searches, you probably want to use a filter instead of a -query, as a filter will be cached. +TIP: For exact-value searches, you probably want to use a filter clause instead of a +query, as a filter will be cached. We'll see some filtering examples soon. Unlike the query-string search that we showed in <>, the `match` query does not use a query syntax like `+user_id:2 +tweet:search`. It just @@ -186,46 +74,88 @@ fields: -------------------------------------------------- // SENSE: 054_Query_DSL/70_Multi_match_query.json -==== bool Query -The `bool` query, like the `bool` filter,((("bool query"))) is used to combine multiple -query clauses. However, there are some differences. Remember that while -filters give binary yes/no answers, queries calculate a relevance score -instead. The `bool` query combines the `_score` from each `must` or -`should` clause that matches.((("bool query", "must, must_not, and should clauses")))((("should clause", "in bool queries")))((("must_not clause", "in bool queries")))((("must clause", "in bool queries"))) This query accepts the following parameters: +==== range Query -`must`:: - Clauses that _must_ match for the document to be included. 
+The `range` query allows you to find((("range query"))) numbers or dates that fall into +a specified range: -`must_not`:: - Clauses that _must not_ match for the document to be included. +[source,js] +-------------------------------------------------- +{ + "range": { + "age": { + "gte": 20, + "lt": 30 + } + } +} +-------------------------------------------------- +// SENSE: 054_Query_DSL/70_Range_filter.json + +The operators that it accepts are as follows: -`should`:: - If these clauses match, they increase the `_score`; - otherwise, they have no effect. They are simply used to refine - the relevance score for each document. + `gt`:: + Greater than + + `gte`:: + Greater than or equal to + + `lt`:: + Less than + + `lte`:: + Less than or equal to -The following query finds documents whose `title` field matches -the query string `how to make millions` and that are not marked -as `spam`. If any documents are `starred` or are from 2014 onward, -they will rank higher than they would have otherwise. Documents that -match _both_ conditions will rank even higher: +==== term Query + +The `term` query is used to search by((("query", "important")))((("term query"))) exact values, be they numbers, dates, +Booleans, or `not_analyzed` exact-value string fields: + +[source,js] +-------------------------------------------------- +{ "term": { "age": 26 }} +{ "term": { "date": "2014-09-01" }} +{ "term": { "public": true }} +{ "term": { "tag": "full_text" }} +-------------------------------------------------- +// SENSE: 054_Query_DSL/70_Term_filter.json + +The `term` query performs no _analysis_ on the input text, so it will look for exactly +the value that is supplied. + +==== terms Query + +The `terms` query is((("terms query"))) the same as the `term` query, but allows you +to specify multiple values to match. 
If the field contains any of +the specified values, the document matches: + +[source,js] +-------------------------------------------------- +{ "terms": { "tag": [ "search", "full_text", "nosql" ] }} +-------------------------------------------------- +// SENSE: 054_Query_DSL/70_Terms_filter.json + +Like the `term` query, no analysis is performed on the input text. It is looking +for exact matches (including differences in case, accents, spaces, etc.). + + +==== exists and missing Queries + +The `exists` and `missing` queries are ((("exists query")))((("missing query")))used to find documents in which the +specified field either has one or more values (`exists`) or doesn't have any +values (`missing`). It is similar in nature to `IS_NULL` (`missing`) and `NOT +IS_NULL` (`exists`) in SQL: [source,js] -------------------------------------------------- { - "bool": { - "must": { "match": { "title": "how to make millions" }}, - "must_not": { "match": { "tag": "spam" }}, - "should": [ - { "match": { "tag": "starred" }}, - { "range": { "date": { "gte": "2014-01-01" }}} - ] + "exists": { + "field": "title" } } -------------------------------------------------- -// SENSE: 054_Query_DSL/70_Bool_query.json +// SENSE: 054_Query_DSL/70_Exists_filter.json -TIP: If there are no `must` clauses, at least one `should` clause has to -match. However, if there is at least one `must` clause, no `should` clauses -are required to match. +These queries are frequently used to apply a condition only if a field is +present, and to apply a different condition if it is missing.
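The exact-match semantics of the `term`, `terms`, `range`, and `exists` queries described in these hunks can be sketched as plain Python predicates. This is an illustrative model only (every field treated as possibly multi-valued, no analysis), not how Lucene actually evaluates these clauses.

```python
# Toy predicates mirroring the clauses above. Fields may be single- or
# multi-valued; matching is exact and unanalyzed, as the text describes.
def field_values(doc, field):
    value = doc.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def term(doc, field, value):                 # exact, unanalyzed equality
    return value in field_values(doc, field)

def terms(doc, field, values):               # true if ANY value matches
    return any(term(doc, field, v) for v in values)

def range_query(doc, field, gte=None, lt=None):
    return any((gte is None or v >= gte) and (lt is None or v < lt)
               for v in field_values(doc, field))

def exists(doc, field):                      # field has one or more values
    return bool(field_values(doc, field))

doc = {"age": 26, "tag": ["search", "nosql"]}
print(term(doc, "age", 26))                       # True
print(terms(doc, "tag", ["full_text", "nosql"]))  # True
print(range_query(doc, "age", gte=20, lt=30))     # True
print(exists(doc, "title"))                       # False, like IS NULL
```

Note how `term` is case-sensitive and exact by construction: there is no analyzer in the path, which matches the book's warning about `not_analyzed` fields.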
diff --git a/054_Query_DSL/75_Combining_queries_together.asciidoc b/054_Query_DSL/75_Combining_queries_together.asciidoc new file mode 100644 index 000000000..bbba76b25 --- /dev/null +++ b/054_Query_DSL/75_Combining_queries_together.asciidoc @@ -0,0 +1,152 @@ +[[combining-queries-together]] +=== Combining queries together + +Real-world search requests are never simple; they search multiple fields with +various input text, and filter based on an array of criteria. To build +sophisticated search, you will need a way to combine multiple queries together +into a single search request. + +To do that, you can use the `bool` query. This query combines multiple queries +together in user-defined boolean combinations. It accepts the following parameters: + +`must`:: + Clauses that _must_ match for the document to be included. + +`must_not`:: + Clauses that _must not_ match for the document to be included. + +`should`:: + If these clauses match, they increase the `_score`; + otherwise, they have no effect. They are simply used to refine + the relevance score for each document. + +`filter`:: + Clauses that _must_ match, but are run in non-scoring, filtering mode. These + clauses do not contribute to the score; instead, they simply include/exclude + documents based on their criteria. + +Because this is the first query we've seen that contains other queries, we need +to talk about how scores are combined. Each sub-query clause will individually +calculate a relevance score for the document. Once these scores are calculated, +the `bool` query will merge the scores together and return a single score representing +the total score of the boolean operation. + +The following query finds documents whose `title` field matches +the query string `how to make millions` and that are not marked +as `spam`. If any documents are `starred` or are from 2014 onward, +they will rank higher than they would have otherwise.
Documents that +match _both_ conditions will rank even higher: + +[source,js] +-------------------------------------------------- +{ + "bool": { + "must": { "match": { "title": "how to make millions" }}, + "must_not": { "match": { "tag": "spam" }}, + "should": [ + { "match": { "tag": "starred" }}, + { "range": { "date": { "gte": "2014-01-01" }}} + ] + } +} +-------------------------------------------------- +// SENSE: 054_Query_DSL/70_Bool_query.json + +TIP: If there are no `must` clauses, at least one `should` clause has to +match. However, if there is at least one `must` clause, no `should` clauses +are required to match. + +==== Adding a filtering query + +If we don't want the date of the document to affect scoring at all, we can re-arrange +the previous example to use a `filter` clause: + +[source,js] +-------------------------------------------------- +{ + "bool": { + "must": { "match": { "title": "how to make millions" }}, + "must_not": { "match": { "tag": "spam" }}, + "should": [ + { "match": { "tag": "starred" }} + ], + "filter": { + "range": { "date": { "gte": "2014-01-01" }} <1> + } + } +} +-------------------------------------------------- +// SENSE: 054_Query_DSL/70_Bool_query.json + +<1> The range query was moved out of the `should` clause and into a `filter` clause. + +By moving the range query into the `filter` clause, we have converted it into a +non-scoring query. It will no longer contribute a score to the document's relevance +ranking. And because it is now a non-scoring query, it can use the variety of optimizations +available to filters, which should increase performance. + +Any query can be used in this manner. Simply move a query into the +`filter` clause of a `bool` query and it automatically converts to a non-scoring +filter. + +If you need to filter on many different criteria, the `bool` query itself can be +used as a non-scoring query.
Simply place it inside the `filter` clause and +continue building your boolean logic: + +[source,js] +-------------------------------------------------- +{ + "bool": { + "must": { "match": { "title": "how to make millions" }}, + "must_not": { "match": { "tag": "spam" }}, + "should": [ + { "match": { "tag": "starred" }} + ], + "filter": { + "bool": { <1> + "must": [ + { "range": { "date": { "gte": "2014-01-01" }}}, + { "range": { "price": { "lte": 29.99 }}} + ], + "must_not": [ + { "term": { "category": "ebooks" }} + ] + } + } + } +} +-------------------------------------------------- +// SENSE: 054_Query_DSL/70_Bool_query.json + +<1> By embedding a `bool` query in the `filter` clause, we can add boolean logic +to our filtering criteria. + +By mixing and matching where Boolean queries are placed, we can flexibly encode +both scoring and filtering logic in our search request. + + +==== constant_score Query + +Although not used nearly as often as the `bool` query, the `constant_score` query is +still useful to have in your toolbox. The query applies a static, constant score to +all matching documents. It is predominantly used when you want to execute a filter +and nothing else (e.g. no scoring queries). + +You can use this instead of a `bool` that only has filter clauses. Performance +will be identical, but it may aid in query simplicity/clarity. + +[source,js] +-------------------------------------------------- +{ + "constant_score": { + "filter": { + "term": { "category": "ebooks" } <1> + } + } +} +-------------------------------------------------- +// SENSE: 054_Query_DSL/70_bool_query.json + +<1> A `term` query is placed inside the `constant_score`, converting it to a +non-scoring filter.
This method can be used in place of a `bool` query that only +has a single filter. diff --git a/054_Query_DSL/75_Queries_with_filters.asciidoc b/054_Query_DSL/75_Queries_with_filters.asciidoc deleted file mode 100644 index ed81a4007..000000000 --- a/054_Query_DSL/75_Queries_with_filters.asciidoc +++ /dev/null @@ -1,142 +0,0 @@ -=== Combining Queries with Filters - -Queries can be used in _query context_, and filters can be used -in _filter context_. ((("filters", "combining with queries")))((("queries", "combining with filters"))) Throughout the Elasticsearch API, you will see parameters -with `query` or `filter` in the name. These -expect a single argument containing either a single query or filter clause -respectively. In other words, they establish the -outer _context_ as query context or filter context. - -Compound query clauses can wrap other query clauses, and compound -filter clauses can wrap other filter clauses. However, it is often -useful to apply a filter to a query or, less frequently, to use a full-text query as a filter. - -To do this, there are dedicated query clauses that wrap filter clauses, and -vice versa, thus allowing us to switch from one context to another. It is -important to choose the correct combination of query and filter clauses -to achieve your goal in the most efficient way.
- -[[filtered-query]] -==== Filtering a Query - -Let's say we have((("queries", "combining with filters", "filtering a query")))((("filters", "combining with queries", "filtering a query"))) this query: - -[source,js] --------------------------------------------------- -{ "match": { "email": "business opportunity" }} --------------------------------------------------- - -We want to combine it with the following `term` filter, which will -match only documents that are in our inbox: - -[source,js] --------------------------------------------------- -{ "term": { "folder": "inbox" }} --------------------------------------------------- - - -The `search` API accepts only a single `query` parameter, so we need -to wrap the query and the filter in another query, called the `filtered` -query: - -[source,js] --------------------------------------------------- -{ - "filtered": { - "query": { "match": { "email": "business opportunity" }}, - "filter": { "term": { "folder": "inbox" }} - } -} --------------------------------------------------- - - -We can now pass this query to the `query` parameter of the `search` API: - -[source,js] --------------------------------------------------- -GET /_search -{ - "query": { - "filtered": { - "query": { "match": { "email": "business opportunity" }}, - "filter": { "term": { "folder": "inbox" }} - } - } -} --------------------------------------------------- -// SENSE: 054_Query_DSL/75_Filtered_query.json - -[role="pagebreak-before"] -==== Just a Filter - -While in query context, if ((("filters", "combining with queries", "using just a filter in query context")))((("queries", "combining with filters", "using just a filter in query context")))you need to use a filter without a query (for -instance, to match all emails in the inbox), you can just omit the -query: - -[source,js] --------------------------------------------------- -GET /_search -{ - "query": { - "filtered": { - "filter": { "term": { "folder": "inbox" }} - } - } -} 
--------------------------------------------------- -// SENSE: 054_Query_DSL/75_Filtered_query.json - - -If a query is not specified it defaults to using the `match_all` query, so -the preceding query is equivalent to the following: - -[source,js] --------------------------------------------------- -GET /_search -{ - "query": { - "filtered": { - "query": { "match_all": {}}, - "filter": { "term": { "folder": "inbox" }} - } - } -} --------------------------------------------------- - - -==== A Query as a Filter - -Occasionally, you will want to use a query while you are in filter context. -This can be achieved with the `query` filter, which ((("filters", "combining with queries", "query as a filter")))((("queries", "combining with filters", "query filter")))just wraps a query. The following -example shows one way we could exclude emails that look like spam: - - -[source,js] --------------------------------------------------- -GET /_search -{ - "query": { - "filtered": { - "filter": { - "bool": { - "must": { "term": { "folder": "inbox" }}, - "must_not": { - "query": { <1> - "match": { "email": "urgent business proposal" } - } - } - } - } - } - } -} --------------------------------------------------- -// SENSE: 054_Query_DSL/75_Filtered_query.json -<1> Note the `query` filter, which is allowing us to use the `match` _query_ - inside a `bool` _filter_. - - -NOTE: You seldom need to use a query as a filter, but we have included it for -completeness' sake. The only time you may need it is when you need to use -full-text matching while in filter context. 
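To summarize the `bool` semantics that replace the deleted `filtered`/`query`-filter machinery above: `must` and `should` contribute to the score, while `must_not` and `filter` only include or exclude. Here is a hedged Python sketch; the clause functions and the naive score summation are assumptions for illustration, and real Lucene scoring is far more involved.

```python
# Toy bool query: must/should contribute to _score, must_not and filter
# only include or exclude. A deliberate simplification of Elasticsearch.
def bool_query(doc, must=(), must_not=(), should=(), filter=()):
    if any(clause(doc) == 0.0 for clause in must):
        return None                      # a must clause failed: excluded
    if any(clause(doc) > 0.0 for clause in must_not):
        return None                      # a must_not clause hit: excluded
    if not all(pred(doc) for pred in filter):
        return None                      # filters exclude but never score
    should_scores = [clause(doc) for clause in should]
    if not must and should and not any(should_scores):
        return None                      # with no must, need >= 1 should
    return sum(clause(doc) for clause in must) + sum(should_scores)

doc = {"title": "how to make millions", "tag": ["starred"],
       "date": "2014-03-01"}
score = bool_query(
    doc,
    must=[lambda d: 1.0 if "millions" in d["title"] else 0.0],
    must_not=[lambda d: 1.0 if "spam" in d["tag"] else 0.0],
    should=[lambda d: 0.5 if "starred" in d["tag"] else 0.0],
    filter=[lambda d: d["date"] >= "2014-01-01"],  # ISO dates sort lexically
)
print(score)  # 1.5: the must hit (1.0) plus a matching should (0.5)
```

The `filter` parameter name mirrors the DSL and shadows Python's built-in `filter`, which this sketch never calls; note that moving the date check from `should` to `filter` changes the score but never the set of excluded documents.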
- diff --git a/054_Query_DSL/80_Validating_queries.asciidoc b/054_Query_DSL/80_Validating_queries.asciidoc index fae28d722..866be20ec 100644 --- a/054_Query_DSL/80_Validating_queries.asciidoc +++ b/054_Query_DSL/80_Validating_queries.asciidoc @@ -36,7 +36,7 @@ invalid: ==== Understanding Errors -To find out ((("validate query API", "understqnding errors")))why it is invalid, add the `explain` parameter((("explain parameter"))) to the query +To find out ((("validate query API", "understanding errors")))why it is invalid, add the `explain` parameter((("explain parameter"))) to the query string: [source,js] diff --git a/056_Sorting/85_Sorting.asciidoc b/056_Sorting/85_Sorting.asciidoc index 28d1e7977..6d12ee98e 100644 --- a/056_Sorting/85_Sorting.asciidoc +++ b/056_Sorting/85_Sorting.asciidoc @@ -22,7 +22,7 @@ value `1`: GET /_search { "query" : { - "filtered" : { + "bool" : { "filter" : { "term" : { "user_id" : 1 @@ -33,9 +33,36 @@ GET /_search } -------------------------------------------------- -Filters have no bearing on `_score`, and the((("score", seealso="relevance; relevance scores")))((("match_all query", "score as neutral 1")))((("filters", "score and"))) missing-but-implied `match_all` -query just sets the `_score` to a neutral value of `1` for all documents. In -other words, all documents are considered to be equally relevant. +There isn't a meaningful score here: because we are using a filter, we are indicating +that we just want the documents that match `user_id: 1` with no attempt to determine +relevance. Documents will be returned in effectively random order, and each document +will have a score of zero. 
+ +[NOTE] +==== +If a score of zero makes your life difficult for logistical reasons, you can use +a `constant_score` query instead: + +[source,js] +-------------------------------------------------- +GET /_search +{ + "query" : { + "constant_score" : { + "filter" : { + "term" : { + "user_id" : 1 + } + } + } + } +} +-------------------------------------------------- + +This will apply a constant score (default of `1`) to all documents. It will +perform the same as the above query, and all documents will be returned randomly +like before; they'll just have a score of one instead of zero. +==== ==== Sorting by Field Values @@ -47,7 +74,7 @@ recent tweets first.((("sorting", "by field values")))((("fields", "sorting sear GET /_search { "query" : { - "filtered" : { + "bool" : { "filter" : { "term" : { "user_id" : 1 }} } }, @@ -116,8 +143,8 @@ show all matching results sorted first by date, then by relevance: GET /_search { "query" : { - "filtered" : { - "query": { "match": { "tweet": "manage text search" }}, + "bool" : { + "must": { "match": { "tweet": "manage text search" }}, "filter" : { "term" : { "user_id" : 2 }} } }, @@ -168,7 +195,3 @@ could sort on the earliest date in each `dates` field by using the following: } } -------------------------------------------------- - - - - diff --git a/056_Sorting/90_What_is_relevance.asciidoc b/056_Sorting/90_What_is_relevance.asciidoc index d7f0ebd04..993b29edc 100644 --- a/056_Sorting/90_What_is_relevance.asciidoc +++ b/056_Sorting/90_What_is_relevance.asciidoc @@ -196,9 +196,9 @@ The path for the request is `/index/type/id/_explain`, as in the following: GET /us/tweet/12/_explain { "query" : { - "filtered" : { + "bool" : { "filter" : { "term" : { "user_id" : 2 }}, - "query" : { "match" : { "tweet" : "honeymoon" }} + "must" : { "match" : { "tweet" : "honeymoon" }} } } } diff --git a/080_Structured_Search.asciidoc b/080_Structured_Search.asciidoc index 88a392bd9..0ba1079b1 100644 --- a/080_Structured_Search.asciidoc +++
b/080_Structured_Search.asciidoc @@ -13,9 +13,3 @@ include::080_Structured_Search/25_ranges.asciidoc[] include::080_Structured_Search/30_existsmissing.asciidoc[] include::080_Structured_Search/40_bitsets.asciidoc[] - -include::080_Structured_Search/45_filter_order.asciidoc[] - - - - diff --git a/080_Structured_Search/00_structuredsearch.asciidoc b/080_Structured_Search/00_structuredsearch.asciidoc index 8af690482..f919a0699 100644 --- a/080_Structured_Search/00_structuredsearch.asciidoc +++ b/080_Structured_Search/00_structuredsearch.asciidoc @@ -2,7 +2,7 @@ == Structured Search _Structured search_ is about interrogating ((("structured search")))data that has inherent structure. -Dates, times, and numbers are all structured: they have a precise format +Dates, times, and numbers are all structured: they have a precise format that you can perform logical operations on. Common operations include comparing ranges of numbers or dates, or determining which of two values is larger. @@ -21,4 +21,3 @@ excludes documents. This should make sense logically. A number can't be _more_ in a range than any other number that falls in the same range. It is either in the range--or it isn't. Similarly, for structured text, a value is either equal or it isn't. There is no concept of _more similar_. - diff --git a/080_Structured_Search/05_term.asciidoc b/080_Structured_Search/05_term.asciidoc index b298ed8b1..170ed3181 100644 --- a/080_Structured_Search/05_term.asciidoc +++ b/080_Structured_Search/05_term.asciidoc @@ -1,19 +1,22 @@ === Finding Exact Values -When working with exact values,((("structured search", "finding exact values")))((("exact values", "finding"))) you will be working with filters. Filters are -important because they are very, very fast. Filters do not calculate +When working with exact values,((("structured search", "finding exact values")))((("exact values", "finding"))) +you will be working with non-scoring, filtering queries. 
Filters are +important because they are very fast. They do not calculate relevance (avoiding the entire scoring phase) and are easily cached. We'll talk about the performance benefits of filters later in <>, -but for now, just keep in mind that you should use filters as often as you +but for now, just keep in mind that you should use filtering queries as often as you can. -==== term Filter with Numbers +==== term Query with Numbers -We are going to explore the `term` filter ((("term filter", "with numbers")))((("structured search", "finding exact values", "using term filter with numbers")))first because you will use it often. -This filter is capable of handling numbers, Booleans, dates, and text. +We are going to explore the `term` query ((("term query", "with numbers"))) +((("structured search", "finding exact values", "using term filter with numbers"))) +first because you will use it often. This query is capable of handling numbers, +Booleans, dates, and text. -Let's look at an example using numbers first by indexing some products. These -documents have a `price` and a `productID`: +We'll start by indexing some documents representing products, each having a +`price` and a `productID`: [source,js] -------------------------------------------------- @@ -40,9 +43,9 @@ FROM products WHERE price = 20 -------------------------------------------------- -In the Elasticsearch query DSL, we use a `term` filter to accomplish the same -thing. The `term` filter will look for the exact value that we specify. By -itself, a `term` filter is simple. +In the Elasticsearch query DSL, we use a `term` query to accomplish the same +thing. The `term` query will look for the exact value that we specify. By +itself, a `term` query is simple.
It accepts a field name and the value that we wish to find: [source,js] -------------------------------------------------- @@ -54,22 +57,20 @@ that we wish to find: } -------------------------------------------------- -The `term` filter isn't very useful on its own, though. As discussed in -<>, the `search` API expects a `query`, not a `filter`. To -use our `term` filter, ((("filtered query")))we need to wrap it with a -<>: +Usually, when looking for an exact value, we don't want to score the query. We just +want to include/exclude documents, so we will use a `constant_score` query to execute +the `term` query in non-scoring mode and apply a uniform score of one. + +The final combination will be a `constant_score` query that contains a `term` query: [source,js] -------------------------------------------------- GET /my_store/products/_search { "query" : { - "filtered" : { <1> - "query" : { - "match_all" : {} <2> - }, + "constant_score" : { <1> "filter" : { - "term" : { <3> + "term" : { <2> "price" : 20 } } @@ -79,11 +80,8 @@ GET /my_store/products/_search -------------------------------------------------- // SENSE: 080_Structured_Search/05_Term_number.json -<1> The `filtered` query accepts both a `query` and a `filter`. -<2> A `match_all` is used to return all matching documents.((("match_all query clause"))) This is the default -behavior, so in future examples we will simply omit the `query` section. -<3> The `term` filter that we saw previously. Notice how it is placed inside -the `filter` clause. +<1> We use a `constant_score` query to convert the `term` query into a filter. +<2> The `term` query that we saw previously. Once executed, the search results from this query are exactly what you would expect: only document 2 is returned as a hit (because only `2` had a price @@ -104,13 +102,13 @@ of `20`): } ] -------------------------------------------------- -<1> Filters do not perform scoring or relevance.
The score comes from the - `match_all` query, which treats all docs as equal, so all results receive - a neutral score of `1`. +<1> Queries placed inside the `filter` clause do not perform scoring or relevance, +so all results receive a neutral score of `1`. -==== term Filter with Text +==== term Query with Text -As mentioned at the top of ((("structured search", "finding exact values", "using term filter with text")))((("term filter", "with text")))this section, the `term` filter can match strings +As mentioned at the top of ((("structured search", "finding exact values", "using term filter with text"))) +((("term filter", "with text")))this section, the `term` query can match strings just as easily as numbers. Instead of price, let's try to find products that have a certain UPC identification code. To do this with SQL, we might use a query like this: @@ -130,7 +128,7 @@ filter, like so: GET /my_store/products/_search { "query" : { - "filtered" : { + "constant_score" : { "filter" : { "term" : { "productID" : "XHDK-A-1293-#fJ3" @@ -190,7 +188,7 @@ There are a few important points here: * All letters have been lowercased. * We lost the hyphen and the hash (`#`) sign. -So when our `term` filter looks for the exact value `XHDK-A-1293-#fJ3`, it +So when our `term` query looks for the exact value `XHDK-A-1293-#fJ3`, it doesn't find anything, because that token does not exist in our inverted index. Instead, there are the four tokens listed previously. @@ -244,7 +242,7 @@ POST /my_store/products/_bulk -------------------------------------------------- // SENSE: 080_Structured_Search/05_Term_text.json -Only now will our `term` filter work as expected. Let's try it again on the +Only now will our `term` query work as expected. 
Let's try it again on the +newly indexed data (notice, the query and filter have not changed at all, just how the data is mapped): @@ -253,7 +251,7 @@ how the data is mapped): GET /my_store/products/_search { "query" : { - "filtered" : { + "constant_score" : { "filter" : { "term" : { "productID" : "XHDK-A-1293-#fJ3" @@ -265,19 +263,20 @@ GET /my_store/products/_search -------------------------------------------------- // SENSE: 080_Structured_Search/05_Term_text.json -Since the `productID` field is not analyzed, and the `term` filter performs no +Since the `productID` field is not analyzed, and the `term` query performs no analysis, the query finds the exact match and returns document 1 as a hit. Success! [[_internal_filter_operation]] ==== Internal Filter Operation -Internally, Elasticsearch is((("structured search", "finding exact values", "intrnal filter operations")))((("filters", "internal filter operation"))) performing several operations when executing a -filter: +Internally, Elasticsearch is((("structured search", "finding exact values", "internal filter operations"))) +((("filters", "internal filter operation"))) performing several operations when executing a +non-scoring query: 1. _Find matching docs_. + -The `term` filter looks up the term `XHDK-A-1293-#fJ3` in the inverted index +The `term` query looks up the term `XHDK-A-1293-#fJ3` in the inverted index and retrieves the list of documents that contain that term. In this case, only document 1 has the term we are looking for. @@ -285,18 +284,37 @@ only document 1 has the term we are looking for. + The filter then builds a _bitset_--an array of 1s and 0s--that describes which documents contain the term. Matching documents receive a `1` -bit. In our example, the bitset would be `[1,0,0,0]`.
Internally, this is represented +as a https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps["roaring bitmap"], +which can efficiently encode both sparse and dense sets. -3. _Cache the bitset_. +3. _Iterate over the bitset(s)_. + -Last, the bitset is stored in memory, since we can use this in the future -and skip steps 1 and 2. This adds a lot of performance and makes filters very -fast. - -When executing a `filtered` query, the `filter` is executed before the -`query`. The resulting bitset is given to the `query`, which uses it to simply -skip over any documents that have already been excluded by the filter. This is -one of the ways that filters can improve performance. Fewer documents -evaluated by the query means faster response times. - +Once the bitsets are generated for each query, Elasticsearch iterates over the +bitsets to find the set of matching documents that satisfy all filtering criteria. +The order of execution is decided heuristically, but generally the most sparse +bitset is iterated on first (since it excludes the largest number of documents). +4. _Increment the usage counter_. ++ +Elasticsearch can cache non-scoring queries for faster access, but it's silly to +cache something that is used only rarely. Non-scoring queries are already quite fast +due to the inverted index, so we only want to cache queries we _know_ will be used +again in the future to prevent resource wastage. ++ +To do this, Elasticsearch tracks the history of query usage on a per-index basis. +If a query is used more than a few times in the last 256 queries, it is cached +in memory. Even when a query qualifies for caching, the bitset is not cached on segments that have +fewer than 10,000 documents (or less than 3% of the total index size). These +small segments tend to disappear quickly anyway, and it is a waste to associate a +cache with them.
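+ +As a concrete illustration (the `logs` index and `status` field here are hypothetical, +not part of the earlier product data), a non-scoring query such as the following is a +natural candidate for the bitset cache if it appears in many search requests against +the same index: + +[source,js] +-------------------------------------------------- +GET /logs/events/_search +{ "query" : { "constant_score" : { "filter" : { "term" : { "status" : "published" } } } } } +-------------------------------------------------- + +The first few executions build the bitset from the inverted index; once the query has +been used a few times, subsequent executions on large segments reuse the cached bitset.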
+ + +Although not quite true in reality (execution is a bit more complicated, depending on +how the query planner re-arranges things and on heuristics derived from query cost), +you can conceptually think of non-scoring queries as executing _before_ the scoring +queries. The job of non-scoring queries is to reduce the number of documents that +the more costly scoring queries need to evaluate, resulting in a faster search request. + +By conceptually thinking of non-scoring queries as executing first, you'll be +equipped to write efficient, fast search requests. diff --git a/080_Structured_Search/10_compoundfilters.asciidoc b/080_Structured_Search/10_compoundfilters.asciidoc index 28fd27b88..dfbd3989d 100644 --- a/080_Structured_Search/10_compoundfilters.asciidoc +++ b/080_Structured_Search/10_compoundfilters.asciidoc @@ -1,9 +1,9 @@ [[combining-filters]] === Combining Filters -The previous two examples showed a single filter in use.((("structured search", "combining filters")))((("filters", "combining"))) In practice, you -will probably need to filter on multiple values or fields. For example, how -would you express this SQL in Elasticsearch? +The previous two examples showed a single filter in use.((("structured search", "combining filters")))((("filters", "combining"))) +In practice, you will probably need to filter on multiple values or fields. +For example, how would you express this SQL in Elasticsearch? [source,sql] -------------------------------------------------- SELECT document FROM products WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3") AND (price != 30) --------------------------------------------------
+In these situations, you will need to use a `bool` query((("filters", "combining", "in bool query")))((("bool query"))) +inside the `constant_score` query. This allows us to build +filters that can have multiple components in boolean combinations. [[bool-filter]] ==== Bool Filter -The `bool` filter is composed of three sections: +Recall that the `bool` query is composed of four sections: [source,js] -------------------------------------------------- @@ -29,39 +29,45 @@ The `bool` filter is composed of three sections: "must" : [], "should" : [], "must_not" : [], + "filter": [] } } -------------------------------------------------- - `must`:: + `must`:: All of these clauses _must_ match. The equivalent of `AND`. - - `must_not`:: + + `must_not`:: All of these clauses _must not_ match. The equivalent of `NOT`. - - `should`:: + + `should`:: At least one of these clauses must match. The equivalent of `OR`. -And that's it!((("should clause", "in bool filters")))((("must_not clause", "in bool filters")))((("must clause", "in bool filters"))) When you need multiple filters, simply place them into the -different sections of the `bool` filter. + `filter`:: + Clauses that _must_ match, but are run in non-scoring, filtering mode. + +In this secondary boolean query, we can ignore the `filter` clause: the queries +are already running in non-scoring mode, so the extra `filter` clause is useless. [NOTE] ==== Each section of the `bool` filter is optional (for example, you can have a `must` -clause and nothing else), and each section can contain a single filter or an -array of filters. +clause and nothing else), and each section can contain a single query or an +array of queries. 
==== -To replicate the preceding SQL example, we will take the two `term` filters that -we used((("term filter", "placing inside bool filter")))((("bool filter", "with two term filters in should clause and must_not clause"))) previously and place them inside the `should` clause of a `bool` -filter, and add another clause to deal with the `NOT` condition: +To replicate the preceding SQL example, we will take the two `term` queries that +we used((("term query", "placing inside bool query"))) +((("bool query", "with two term queries in should clause and must_not clause"))) previously and +place them inside the `should` clause of a `bool` query, and add another clause +to deal with the `NOT` condition: [source,js] -------------------------------------------------- GET /my_store/products/_search { "query" : { - "filtered" : { <1> + "constant_score" : { <1> "filter" : { "bool" : { "should" : [ @@ -79,14 +85,19 @@ GET /my_store/products/_search -------------------------------------------------- // SENSE: 080_Structured_Search/10_Bool_filter.json -<1> Note that we still need to use a `filtered` query to wrap everything. -<2> These two `term` filters are _children_ of the `bool` filter, and since they +<1> Note that we still need to use a `constant_score` query to wrap everything with its +`filter` clause. This is what enables non-scoring mode. +<2> These two `term` queries are _children_ of the `bool` query, and since they are placed inside the `should` clause, at least one of them needs to match. <3> If a product has a price of `30`, it is automatically excluded because it matches a `must_not` clause. +Notice how the `bool` query is placed inside the `constant_score` query, while the individual `term` +queries are simply placed in the `should` and `must_not` clauses. Because everything is wrapped +in the `constant_score`, the remaining queries execute in filtering mode.
+ Our search results return two hits, each document satisfying a different clause -in the `bool` filter: +in the `bool` query: [source,json] -------------------------------------------------- @@ -109,17 +120,17 @@ in the `bool` filter: } ] -------------------------------------------------- -<1> Matches the `term` filter for `productID = "XHDK-A-1293-#fJ3"` -<2> Matches the `term` filter for `price = 20` +<1> Matches the `term` query for `productID = "XHDK-A-1293-#fJ3"` +<2> Matches the `term` query for `price = 20` -==== Nesting Boolean Filters +==== Nesting Boolean Queries -Even though `bool` is a compound filter and accepts children filters, it is -important to understand that `bool` is just a filter itself.((("filters", "combining", "nesting bool filters")))((("bool filter", "nesting in another bool filter"))) This means you -can nest `bool` filters inside other `bool` filters, giving you the -ability to make arbitrarily complex Boolean logic. +You can already see how nesting boolean queries together gives rise to more +sophisticated boolean logic. If you need more complex operations, you +can continue nesting boolean queries in any combination, producing +arbitrarily complex boolean logic.
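+ +Structurally, a nested `bool` is just a `bool` clause used anywhere another query +clause could appear. As a sketch (the `color` and `size` fields are placeholders, not +part of our product data), this expresses `color = red OR (color = blue AND size = large)`: + +[source,js] +-------------------------------------------------- +{ "bool" : { "should" : [ { "term" : { "color" : "red" }}, { "bool" : { "must" : [ { "term" : { "color" : "blue" }}, { "term" : { "size" : "large" }} ] }} ] } } +--------------------------------------------------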
-Given this SQL statement: +For example, if we have this SQL statement: [source,sql] -------------------------------------------------- @@ -137,7 +148,7 @@ We can translate it into a pair of nested `bool` filters: GET /my_store/products/_search { "query" : { - "filtered" : { + "constant_score" : { "filter" : { "bool" : { "should" : [ @@ -157,8 +168,8 @@ GET /my_store/products/_search -------------------------------------------------- // SENSE: 080_Structured_Search/10_Bool_filter.json -<1> Because the `term` and the `bool` are sibling clauses inside the first - Boolean `should`, at least one of these filters must match for a document +<1> Because the `term` and the `bool` are sibling clauses inside the + Boolean `should`, at least one of these queries must match for a document to be a hit. <2> These two `term` clauses are siblings in a `must` clause, so they both @@ -190,5 +201,5 @@ The results show us two documents, one matching each of the `should` clauses: <1> This `productID` matches the `term` in the first `bool`. <2> These two fields match the `term` filters in the nested `bool`. -This was a simple example, but it demonstrates how Boolean filters can be +This was a simple example, but it demonstrates how Boolean queries can be used as building blocks to construct complex logical conditions. diff --git a/080_Structured_Search/15_terms.asciidoc b/080_Structured_Search/15_terms.asciidoc index 914be6b8c..635a2a1be 100644 --- a/080_Structured_Search/15_terms.asciidoc +++ b/080_Structured_Search/15_terms.asciidoc @@ -1,12 +1,13 @@ === Finding Multiple Exact Values -The `term` filter is useful for finding a single value, but often you'll want -to search for multiple values.((("exact values", "finding multiple")))((("structured search", "finding multiple exact values"))) What if you want to find documents that have a -price of $20 or $30? 
+The `term` query is useful for finding a single value, but often you'll want +to search for multiple values.((("exact values", "finding multiple"))) +((("structured search", "finding multiple exact values"))) What if you want to +find documents that have a price of $20 or $30? -Rather than using multiple `term` filters, you can instead use a single `terms` -filter (note the _s_ at the end). The `terms` filter((("terms filter"))) is simply the plural -version of the singular `term` filter. +Rather than using multiple `term` queries, you can instead use a single `terms` +query (note the _s_ at the end). The `terms` query((("terms query"))) is simply the plural +version of the singular `term` query. It looks nearly identical to a vanilla `term` too. Instead of specifying a single price, we are now specifying an array of values: @@ -20,15 +21,15 @@ specifying a single price, we are now specifying an array of values: } -------------------------------------------------- -And like the `term` filter, we will place it inside a `filtered` query to -((("filtered query", "terms filter in"))) use it: +And like the `term` query, we will place it inside the `filter` clause of a +`constant_score` query to use it: [source,js] -------------------------------------------------- GET /my_store/products/_search { "query" : { - "filtered" : { + "constant_score" : { "filter" : { "terms" : { <1> "price" : [20, 30] @@ -40,7 +41,7 @@ GET /my_store/products/_search -------------------------------------------------- // SENSE: 080_Structured_Search/15_Terms_filter.json -<1> The `terms` filter as seen previously, but placed inside the `filtered` query +<1> The `terms` query as seen previously, but placed inside the `constant_score` query. The query will return the second, third, and fourth documents: @@ -73,7 +74,3 @@ The query will return the second, third, and fourth documents: } ] -------------------------------------------------- - - - - diff --git
a/080_Structured_Search/20_contains.asciidoc b/080_Structured_Search/20_contains.asciidoc index c3fc68586..e27e922d7 100644 --- a/080_Structured_Search/20_contains.asciidoc +++ b/080_Structured_Search/20_contains.asciidoc @@ -1,9 +1,11 @@ ==== Contains, but Does Not Equal It is important to understand that `term` and `terms` are _contains_ operations, -not _equals_.((("structured search", "contains, but does not equal")))((("terms filter", "contains, but does not equal")))((("term filter", "contains, but does not equal"))) What does that mean? +not _equals_.((("structured search", "contains, but does not equal"))) +((("terms query", "contains, but does not equal")))((("term query", "contains, but does not equal"))) +What does that mean? -If you have a term filter for `{ "term" : { "tags" : "search" } }`, it will match +If you have a term query for `{ "term" : { "tags" : "search" } }`, it will match _both_ of the following documents: [source,js] @@ -14,7 +16,7 @@ _both_ of the following documents: <1> This document is returned, even though it has terms other than `search`. -Recall how the `term` filter works: it checks the inverted index for all +Recall how the `term` query works: it checks the inverted index for all documents that contain a term, and then constructs a bitset. In our simple example, we have the following inverted index: @@ -25,7 +27,7 @@ example, we have the following inverted index: |`search` | `1`,`2` |========================== -When a `term` filter is executed for the token `search`, it goes straight to the +When a `term` query is executed for the token `search`, it goes straight to the corresponding entry in the inverted index and extracts the associated doc IDs. As you can see, both document 1 and document 2 contain the token in the inverted index. Therefore, they are both returned as a result. @@ -45,9 +47,9 @@ _must equal exactly_. 
==== Equals Exactly If you do want that behavior--entire field equality--the best way to -accomplish it involves indexing a secondary field. ((("structured search", "equals exactly"))) In this field, you index the -number of values that your field contains. Using our two previous documents, -we now include a field that maintains the number of tags: +accomplish it involves indexing a secondary field. ((("structured search", "equals exactly"))) +In this field, you index the number of values that your field contains. Using +our two previous documents, we now include a field that maintains the number of tags: [source,js] -------------------------------------------------- @@ -56,7 +58,7 @@ we now include a field that maintains the number of tags: -------------------------------------------------- // SENSE: 080_Structured_Search/20_Exact.json -Once you have the count information indexed, you can construct a `bool` filter +Once you have the count information indexed, you can construct a `constant_score` query that enforces the appropriate number of terms: [source,js] -------------------------------------------------- GET /my_index/my_type/_search { "query": { - "filtered" : { + "constant_score" : { "filter" : { "bool" : { "must" : [ @@ -84,4 +86,3 @@ GET /my_index/my_type/_search This query will now match only the document that has a single tag that is `search`, rather than any document that contains `search`.
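+ +Putting the pieces together, the complete filter pairs the value check with the count +check. This is a sketch: the `tag_count` field name is illustrative and must match +whatever secondary field you actually indexed above: + +[source,js] +-------------------------------------------------- +GET /my_index/my_type/_search +{ "query": { "constant_score" : { "filter" : { "bool" : { "must" : [ { "term" : { "tags" : "search" }}, { "term" : { "tag_count" : 1 }} ] } } } } } +--------------------------------------------------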
- diff --git a/080_Structured_Search/25_ranges.asciidoc b/080_Structured_Search/25_ranges.asciidoc index 13506aaad..9bdc8b64c 100644 --- a/080_Structured_Search/25_ranges.asciidoc +++ b/080_Structured_Search/25_ranges.asciidoc @@ -13,8 +13,8 @@ FROM products WHERE price BETWEEN 20 AND 40 -------------------------------------------------- -Elasticsearch has a `range` filter, ((("range filters", "using on numbers")))which, unsurprisingly, allows you to -filter ranges: +Elasticsearch has a `range` query, ((("range query", "using on numbers")))which, unsurprisingly, +can be used to find documents falling inside a range: [source,js] -------------------------------------------------- @@ -26,7 +26,7 @@ filter ranges: } -------------------------------------------------- -The `range` filter supports both inclusive and exclusive ranges, through +The `range` query supports both inclusive and exclusive ranges, through combinations of the following options: * `gt`: `>` greater than @@ -35,13 +35,13 @@ combinations of the following options: * `lte`: `<=` less than or equal to -.Here is an example range filter: +.Here is an example range query: [source,js] -------------------------------------------------- GET /my_store/products/_search { "query" : { - "filtered" : { + "constant_score" : { "filter" : { "range" : { "price" : { @@ -71,7 +71,7 @@ boundaries: ==== Ranges on Dates -The `range` filter can be used on date ((("date ranges")))((("range filters", "using on dates")))fields too: +The `range` query can be used on date ((("date ranges")))((("range query", "using on dates")))fields too: [source,js] -------------------------------------------------- @@ -83,7 +83,7 @@ The `range` filter can be used on date ((("date ranges")))((("range filters", "u } -------------------------------------------------- -When used on date fields, the `range` filter ((("date math operations")))supports _date math_ operations. 
+When used on date fields, the `range` query ((("date math operations")))supports _date math_ operations. For example, if we want to find all documents that have a timestamp sometime in the last hour: @@ -121,7 +121,8 @@ the {ref}/mapping-date-format.html[date format reference documentation]. ==== Ranges on Strings -The `range` filter can also operate on string fields.((("range filters", "using on strings")))((("strings", "using range filter on")))((("lexicographical order, string ranges"))) String ranges are +The `range` query can also operate on string fields.((("range query", "using on strings"))) +((("strings", "using range query on")))((("lexicographical order, string ranges"))) String ranges are calculated _lexicographically_ or alphabetically. For example, these values are sorted in lexicographic order: @@ -134,7 +135,7 @@ string ranges use this order. ==== If we want a range from `a` up to but not including `b`, we can use the same -`range` filter syntax: +`range` query syntax: [source,js] -------------------------------------------------- @@ -159,4 +160,3 @@ unique terms. But the more unique terms you have, the slower the string range will be. **** - diff --git a/080_Structured_Search/30_existsmissing.asciidoc b/080_Structured_Search/30_existsmissing.asciidoc index 6f8324b1b..9a9a1a8da 100644 --- a/080_Structured_Search/30_existsmissing.asciidoc +++ b/080_Structured_Search/30_existsmissing.asciidoc @@ -1,11 +1,11 @@ === Dealing with Null Values Think back to our earlier example, where documents have a field named `tags`. -This is a multivalue field.((("structured search", "dealing with null values")))((("null values"))) A document may have one tag, many tags, or -potentially no tags at all. If a field has no values, how is it stored in an -inverted index? +This is a multivalue field.((("structured search", "dealing with null values")))((("null values"))) +A document may have one tag, many tags, or potentially no tags at all. 
If a field has +no values, how is it stored in an inverted index? -That's a trick question, because the answer is, it isn't stored at all. Let's +That's a trick question, because the answer is: it isn't stored at all. Let's look at that inverted index from the previous section: [width="50%",frame="topbot"] @@ -28,11 +28,11 @@ Obviously, the world is not simple, and data is often missing fields, or contain explicit nulls or empty arrays. To deal with these situations, Elasticsearch has a few tools to work with null or missing values. -==== exists Filter +==== exists Query -The first tool in your arsenal is the `exists` filter.((("null values", "working with, using exists filter")))((("exists filter"))) This filter will return -documents that have any value in the specified field. Let's use the tagging example -and index some example documents: +The first tool in your arsenal is the `exists` query.((("null values", "working with, using exists filter"))) +((("exists query"))) This query will return documents that have any value in +the specified field. Let's use the tagging example and index some example documents: [source,js] -------------------------------------------------- @@ -77,14 +77,14 @@ FROM posts WHERE tags IS NOT NULL -------------------------------------------------- -In Elasticsearch, we use the `exists` filter: +In Elasticsearch, we use the `exists` query: [source,js] -------------------------------------------------- GET /my_index/posts/_search { "query" : { - "filtered" : { + "constant_score" : { "filter" : { "exists" : { "field" : "tags" } } @@ -125,9 +125,10 @@ The results are easy to understand. Any document that has terms in the `tags` field was returned as a hit. The only two documents that were excluded were documents 3 and 4. 
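+ +Because `exists` is just another non-scoring query, it composes with other filtering +clauses inside a `bool`. As a sketch (the `status` field here is hypothetical and not +part of the example documents), we could require that `tags` has a value _and_ that +another field matches an exact term: + +[source,js] +-------------------------------------------------- +GET /my_index/posts/_search +{ "query" : { "constant_score" : { "filter" : { "bool" : { "must" : [ { "exists" : { "field" : "tags" }}, { "term" : { "status" : "published" }} ] } } } } } +--------------------------------------------------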
-==== missing Filter +==== missing Query -The `missing` filter is essentially((("null values", "working with, using missing filter")))((("missing filter"))) the inverse of `exists`: it returns +The `missing` query is essentially((("null values", "working with, using missing filter"))) +((("missing filter"))) the inverse of `exists`: it returns documents where there is _no_ value for a particular field, much like this SQL: @@ -135,17 +136,17 @@ SQL: -------------------------------------------------- SELECT tags FROM posts -WHERE tags IS NULL +WHERE tags IS NULL -------------------------------------------------- -Let's swap the `exists` filter for a `missing` filter from our previous example: +Let's swap the `exists` query for a `missing` query from our previous example: [source,js] -------------------------------------------------- GET /my_index/posts/_search { "query" : { - "filtered" : { + "constant_score" : { "filter": { "missing" : { "field" : "tags" } } @@ -201,8 +202,9 @@ When choosing a suitable `null_value`, ensure the following: ==== exists/missing on Objects -The `exists` and `missing` filters ((("objects", "using exists/missing filters on")))((("exists filter", "using on objects")))((("missing filter", "using on objects")))also work on inner objects, not just core -types. With the following document +The `exists` and `missing` queries ((("objects", "using exists/missing queries on"))) +((("exists query", "using on objects")))((("missing query", "using on objects")))also +work on inner objects, not just core types. With the following document [source,js] -------------------------------------------------- @@ -226,7 +228,7 @@ flattened internally into a simple field-value structure, much like this: } -------------------------------------------------- -So how can we use an `exists` or `missing` filter on the `name` field, which +So how can we use an `exists` or `missing` query on the `name` field, which doesn't really exist in the inverted index? 
The reason that it works is that a filter like @@ -254,5 +256,3 @@ is really executed as That also means that if `first` and `last` were both empty, the `name` namespace would not exist. - - diff --git a/080_Structured_Search/40_bitsets.asciidoc b/080_Structured_Search/40_bitsets.asciidoc index d9b4aa144..1522a9efd 100644 --- a/080_Structured_Search/40_bitsets.asciidoc +++ b/080_Structured_Search/40_bitsets.asciidoc @@ -2,10 +2,13 @@ === All About Caching Earlier in this chapter (<<_internal_filter_operation>>), we briefly discussed -how filters are calculated.((("structured search", "caching of filter results")))((("caching", "bitsets representing documents matching filters")))((("bitsets, caching of")))((("filters", "bitsets representing documents matching, caching of"))) At their heart is a bitset representing which -documents match the filter. Elasticsearch aggressively caches these bitsets for later use. Once cached, -these bitsets can be reused _wherever_ the same filter is used, without having -to reevaluate the entire filter again. +how non-scoring queries are calculated.((("structured search", "caching of query results"))) +((("caching", "bitsets representing documents matching queries")))((("bitsets, caching of"))) +((("queries", "bitsets representing documents matching, caching of"))) At their +heart is a bitset representing which documents match the filter. When Elasticsearch +determines a bitset is likely to be reused in the future, it will be cached directly +in memory for later use. Once cached, these bitsets can be reused _wherever_ +the same query is used, without having to reevaluate the entire query again. These cached bitsets are ``smart'': they are updated incrementally. As you index new documents, only those new documents need to be added to the existing @@ -13,12 +16,16 @@ bitsets, rather than having to recompute the entire cached filter over and over. 
Filters are real-time like the rest of the system; you don't need to worry
about cache expiry.

-==== Independent Filter Caching
+==== Independent Query Caching

-Each filter is calculated and cached independently, regardless of where it is
-used.((("filters", "independent caching of"))) If two different queries use the same filter, the same filter bitset
-will be reused. Likewise, if a single query uses the same filter in multiple
-places, only one bitset is calculated and then reused.
+The bitsets belonging to a query component are independent of the rest of the
+search request. This means that, once cached, a query can be reused in multiple
+search requests. It is not dependent on the "context" of the surrounding query.
+This allows caching to accelerate the most frequently used portions of your queries,
+without wasting overhead on the less frequent / more volatile portions.
+
+Similarly, if a single search request reuses the same non-scoring query, its
+cached bitset can be reused for all instances inside that search request.
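This cache reuse can be observed directly. As a sketch for reviewers (not part of the patch), the indices stats API can report the query cache for the `inbox` index used in the example, assuming the target release exposes the `query_cache` stats metric:

```json
GET /inbox/_stats/query_cache?human
```

The response includes fields such as `memory_size`, `hit_count`, `miss_count`, and `evictions`, which show whether bitsets are actually being cached and reused as described.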
Let's look at this example query, which looks for emails that are either of the following: @@ -27,93 +34,73 @@ Let's look at this example query, which looks for emails that are either of the [source,js] -------------------------------------------------- -"bool": { - "should": [ - { "bool": { - "must": [ - { "term": { "folder": "inbox" }}, <1> - { "term": { "read": false }} - ] - }}, - { "bool": { - "must_not": { - "term": { "folder": "inbox" } <1> - }, - "must": { - "term": { "important": true } +GET /inbox/emails/_search +{ + "query": { + "constant_score": { + "filter": { + "bool": { + "should": [ + { "bool": { + "must": [ + { "term": { "folder": "inbox" }}, <1> + { "term": { "read": false }} + ] + }}, + { "bool": { + "must_not": { + "term": { "folder": "inbox" } <1> + }, + "must": { + "term": { "important": true } + } + }} + ] + } } - }} - ] + } + } } -------------------------------------------------- -<1> These two filters are identical and will use the same bitset. +<1> These two queries are identical and will use the same bitset. Even though one of the inbox clauses is a `must` clause and the other is a -`must_not` clause, the two clauses themselves are identical. This means that -the bitset is calculated once for the first clause that is executed, and then -the cached bitset is used for the other clause. By the time this query is run -a second time, the inbox filter is already cached and so both clauses will use -the cached bitset. +`must_not` clause, the two clauses themselves are identical. If this particular +`term` query was previously cached, both instances would benefit from the cached +representation despite being used in different styles of boolean logic. This ties in nicely with the composability of the query DSL. It is easy to -move filters around, or reuse the same filter in multiple places within the -same query. 
This isn't just convenient to the developer--it has direct +move filtering queries around, or reuse the same query in multiple places within the +search request. This isn't just convenient to the developer--it has direct performance benefits. -==== Controlling Caching - -Most _leaf filters_—those dealing directly with fields like the `term` -filter--are cached, while((("leaf filters, caching of")))((("caching", "of leaf filters, controlling")))((("filters", "controlling caching of"))) compound filters, like the `bool` filter, are not. - -[NOTE] -==== -Leaf filters have to consult the inverted index on disk, so it makes sense to -cache them. Compound filters, on the other hand, use fast bit logic to combine -the bitsets resulting from their inner clauses, so it is efficient to -recalculate them every time. -==== - -Certain leaf filters, however, are not cached by default, because it -doesn't make sense to do so: - -Script filters:: - -The results((("script filters, no caching of results"))) from {ref}/query-dsl-script-query.html cannot -be cached because the meaning of the script is opaque to Elasticsearch. - -Geo-filters:: - -The geolocation filters, which((("geolocation filters, no caching of results"))) we cover in more detail in <>, are -usually used to filter results based on the geolocation of a specific user. -Since each user has a unique geolocation, it is unlikely that geo-filters will be reused, so it makes no sense to cache them. - -Date ranges:: - -Date ranges that ((("date ranges", "using now function, no caching of")))((("now function", "date ranges using")))use the `now` function (for example `"now-1h"`), result in values -accurate to the millisecond. Every time the filter is run, `now` returns a new -time. Older filters will never be reused, so caching is disabled by default. -However, when using `now` with rounding (for example, `now/d` rounds to the nearest day), -caching is enabled by default. 
-
-Sometimes the default caching strategy is not correct. Perhaps you have a
-complicated `bool` expression that is reused several times in the same query.
-Or you have a filter on a `date` field that will never be reused. The default
-caching strategy ((("_cache flag", sortas="cache flag")))((("filters", "overriding default caching strategy on")))can be overridden on almost any filter by setting the
-`_cache` flag:
-
-[source,js]
---------------------------------------------------
-{
-    "range" : {
-        "timestamp" : {
-            "gt" : "2014-01-02 16:15:14" <1>
-        },
-        "_cache": false <2>
-    }
-}
---------------------------------------------------
-<1> It is unlikely that we will reuse this exact timestamp.
-<2> Disable caching of this filter.
-
-Later chapters provide examples of when it can make sense to
-override the default caching strategy.
+==== Autocaching Behavior
+
+In older versions of Elasticsearch, the default behavior was to cache everything
+that was cacheable. This often meant the system cached bitsets too aggressively
+and performance suffered due to thrashing the cache. In addition, many filters
+are very fast to evaluate, but substantially slower to cache (and reuse from cache).
+These filters don't make sense to cache, since you'd be better off just re-executing
+the filter again.
+
+Inspecting the inverted index is very fast, and most query components are rare.
+Consider a `term` filter on a `"user_id"` field: if you have millions of users,
+any particular user ID will only occur rarely. It isn't profitable to cache
+the bitsets for this filter, as the cached result will likely be evicted
+from the cache before it is used again.
+
+This type of cache churn can have serious effects on performance. What's worse,
+it is difficult for developers to identify which components exhibit good cache
+behavior and which are useless.
+
+To address this, Elasticsearch caches queries automatically based on usage frequency.
+If a non-scoring query has been used a few times (dependent on the query type) in the last 256 queries, +the query is a candidate for caching. However, not all segments are guaranteed +to cache the bitset. Only segments that hold more than 10,000 documents (or 3% +of the total documents, whichever is larger) will cache the bitset. Because +small segments are fast to search and merged out quickly, it doesn't make sense +to cache bitsets here. + +Once cached, a non-scoring bitset will remain in the cache until it is evicted. +Eviction is done on an LRU basis: the least-recently used filter will be evicted +once the cache is full. diff --git a/080_Structured_Search/45_filter_order.asciidoc b/080_Structured_Search/45_filter_order.asciidoc deleted file mode 100644 index 40adfb6f9..000000000 --- a/080_Structured_Search/45_filter_order.asciidoc +++ /dev/null @@ -1,73 +0,0 @@ -=== Filter Order - -The order of filters in a `bool` clause is important for performance.((("structured search", "filter order")))((("filters", "order of"))) More-specific filters should be placed before less-specific filters in order to -exclude as many documents as possible, as early as possible. - -If Clause A could match 10 million documents, and Clause B could match -only 100 documents, then Clause B should be placed before Clause A. - -Cached filters are very fast, so they should be placed before filters that -are not cacheable.((("caching", "cached filters, order of"))) Imagine that we have an index that contains one month's -worth of log events. 
However, we're mostly interested only in log events from -the previous hour: - -[source,js] --------------------------------------------------- -GET /logs/2014-01/_search -{ - "query" : { - "filtered" : { - "filter" : { - "range" : { - "timestamp" : { - "gt" : "now-1h" - } - } - } - } - } -} --------------------------------------------------- - -This filter is not cached because it uses the `now` function,((("now function", "filters using, caching and"))) the value of -which changes every millisecond. That means that we have to examine one -month's worth of log events every time we run this query! - -We could make this much more efficient by combining it with a cached filter: -we can exclude most of the month's data by adding a filter that uses a fixed -point in time, such as midnight last night: - -[source,js] --------------------------------------------------- -"bool": { - "must": [ - { "range" : { - "timestamp" : { - "gt" : "now-1h/d" <1> - } - }}, - { "range" : { - "timestamp" : { - "gt" : "now-1h" <2> - } - }} - ] -} --------------------------------------------------- -<1> This filter is cached because it uses `now` rounded to midnight. - -<2> This filter is not cached because it uses `now` _without_ rounding. - -The `now-1h/d` clause rounds to the previous midnight and so excludes all documents -created before today. The resulting bitset is cached because `now` is used -with rounding, which means that it is executed only once a day, when the value -for _midnight-last-night_ changes. The `now-1h` clause isn't cached because -`now` produces a time accurate to the nearest millisecond. However, thanks to -the first filter, this second filter need only check documents that have been -created since midnight. - -The order of these clauses is important. This approach works only because the -_since-midnight_ clause comes before the _last-hour_ clause. 
If they were the -other way around, then the _last-hour_ clause would need to examine all -documents in the index, instead of just documents created since midnight. - diff --git a/100_Full_Text_Search/00_Intro.asciidoc b/100_Full_Text_Search/00_Intro.asciidoc index 3afee39b7..339011d12 100644 --- a/100_Full_Text_Search/00_Intro.asciidoc +++ b/100_Full_Text_Search/00_Intro.asciidoc @@ -81,18 +81,19 @@ internally). [NOTE] ==== If you do find yourself wanting to use a query on an exact value -`not_analyzed` field, ((("exact values", "not_analyzed fields, querying")))think about whether you really want a query or a filter. +`not_analyzed` field, ((("exact values", "not_analyzed fields, querying")))think +about whether you really want a scoring query, or if a non-scoring query might be better. Single-term queries usually represent binary yes/no questions and are -almost always better expressed as a ((("filters", "single-term queries better expressed as")))filter, so that they can benefit from -<>: +almost always better expressed as a ((("non-scoring query", "filter", "single-term queries better expressed as"))) +filter, so that they can benefit from <>: [source,js] -------------------------------------------------- GET /_search { "query": { - "filtered": { + "constant_score": { "filter": { "term": { "gender": "female" } } diff --git a/300_Aggregations/115_eager.asciidoc b/300_Aggregations/115_eager.asciidoc index 7d55bee3f..5adacad68 100644 --- a/300_Aggregations/115_eager.asciidoc +++ b/300_Aggregations/115_eager.asciidoc @@ -17,7 +17,7 @@ There are three methods to combat this latency spike: - Eagerly load global ordinals - Prepopulate caches with warmers -All are variations on the same concept: preload the fielddata so that there is +All are variations on the same concept: preload the fielddata so that there is no latency spike when the user needs to execute a search. 
[[eager-fielddata]]
@@ -157,7 +157,7 @@ PUT /music/_mapping/_song
<1> Setting `eager_global_ordinals` also implies loading fielddata eagerly.

Just like the eager preloading of fielddata, eager global ordinals are built
-before a new segment becomes visible to search. 
+before a new segment becomes visible to search.

[NOTE]
=========================
@@ -226,7 +226,7 @@ Let's register a warmer and then talk about what's happening:

PUT /music/_warmer/warmer_1 <1>
{
    "query" : {
-        "filtered" : {
+        "bool" : {
            "filter" : {
                "bool": {
                    "should": [ <2>
@@ -250,7 +250,7 @@ PUT /music/_warmer/warmer_1 <1>
----
<1> Warmers are associated with an index (`music`) and are registered using
the `_warmer` endpoint and a unique ID (`warmer_1`).
-<2> The three most popular music genres have their filter caches prebuilt.
+<2> The three most popular music genres are pre-warmed to help encourage caching.
<3> The fielddata and global ordinals for the `price` field will be preloaded.

Warmers are registered against a specific index.((("warmers", see="index warmers"))) Each warmer is given a
@@ -279,7 +279,3 @@ user's queries and register those.
Some administrative details (such as getting existing warmers and deleting
warmers) that have been omitted from this explanation. Refer to the
{ref}/indices-warmers.html[warmers documentation] for the rest of the details.
-
-
-
-
diff --git a/300_Aggregations/45_filtering.asciidoc b/300_Aggregations/45_filtering.asciidoc
index a4027ec10..eddc3d356 100644
--- a/300_Aggregations/45_filtering.asciidoc
+++ b/300_Aggregations/45_filtering.asciidoc
@@ -1,14 +1,15 @@
== Filtering Queries and Aggregations

-A natural extension to aggregation scoping is filtering. Because the aggregation
+A natural extension to aggregation scoping is filtering queries. Because the aggregation
operates in the context of the query scope, any filter applied to the query
will also apply to the aggregation.
[float="true"]
-=== Filtered Query
+=== Filtering Queries

If we want to find all cars over $10,000 and also calculate the average price
-for those cars,((("filtering", "serch query results")))((("filtered query")))((("queries", "filtered"))) we can simply use a `filtered` query:
+for those cars,((("filtering", "search query results")))((("filtering query")))
+((("queries"))) we can use a `constant_score` query and its `filter` clause:

[source,js]
--------------------------------------------------
@@ -16,7 +17,7 @@ GET /cars/transactions/_search
{
    "size" : 0,
    "query" : {
-        "filtered": {
+        "constant_score": {
            "filter": {
                "range": {
                    "price": {
@@ -35,21 +36,21 @@
--------------------------------------------------
// SENSE: 300_Aggregations/45_filtering.json

-Fundamentally, using a `filtered` query is no different from using a `match`
-query, as we discussed in the previous chapter. The query (which happens to include
-a filter) returns a certain subset of documents, and the aggregation operates
-on those documents.
+Fundamentally, using a non-scoring query is no different from using a `match`
+query, as we discussed in the previous chapter. The query returns a certain
+subset of documents, and the aggregation operates on those documents. It just
+happens to omit scoring and may proactively cache bitsets.

[float="true"]
=== Filter Bucket

-But what if you would like to filter just the aggregation results?((("filtering", "aggregation results, not the query")))((("aggregations", "filtering just aggregations"))) Imagine we
+But what if you would like to filter just the aggregation results?((("filtering", "aggregation results, not the query")))((("aggregations", "filtering just aggregations"))) Imagine we
are building the search page for our car dealership.  We want to display search
results according to what the user searches for.
But we also want to enrich the page by including the average price of cars (matching the search) that were sold in the last month. -We can't use simple scoping here, since there are two different criteria. The +We can't use simple scoping here, since there are two different criteria. The search results must match +ford+, but the aggregation results must match +ford+ AND +sold > now - 1M+. @@ -101,7 +102,7 @@ This allows you to filter selective portions of the aggregation as required. === Post Filter So far, we have a way to filter both the search results and aggregations (a -`filtered` query), as well as filtering individual portions of the aggregation +non-scoring `filter` query), as well as filtering individual portions of the aggregation (`filter` bucket). You may be thinking to yourself, "hmm...is there a way to filter _just_ the search @@ -114,7 +115,7 @@ it does not affect the query scope--and thus does not affect the aggregations either. We can use this behavior to apply additional filters to our search -criteria that don't affect things like categorical facets in your UI. Let's +criteria that don't affect things like categorical facets in your UI. Let's design another search page for our car dealer. This page will allow the user to search for a car and filter by color. Color choices are populated via an aggregation: @@ -153,15 +154,15 @@ Finally, the `post_filter` will filter the search results to show only green +ford+ cars. This happens _after_ the query is executed, so the aggregations are unaffected. -This is often important for coherent UIs. Imagine that a user clicks a category in +This is often important for coherent UIs. Imagine that a user clicks a category in your UI (for example, green). The expectation is that the search results are filtered, -but _not_ the UI options. If you applied a `filtered` query, the UI would +but _not_ the UI options. 
If you applied a Boolean `filter` query, the UI would instantly transform to show _only_ +green+ as an option--not what the user wants! [WARNING] .Performance consideration ==== -Use a `post_filter` _only_ if you need to differentially filter search results +Use a `post_filter` _only_ if you need to differentially filter search results and aggregations. ((("post filter", "performance and")))Sometimes people will use `post_filter` for regular searches. Don't do this! The nature of the `post_filter` means it runs _after_ the query, @@ -179,12 +180,6 @@ both--often boils down to how you want your user interface to behave. Choose the appropriate filter (or combinations) depending on how you want to display results to your user. - - A `filtered` query affects both search results and aggregations. + - A non-scoring query inside a `filter` clause affects both search results and aggregations. - A `filter` bucket affects just aggregations. - A `post_filter` affects just search results. - - - - - - diff --git a/310_Geopoints.asciidoc b/310_Geopoints.asciidoc index 8333dbfcf..40a710cea 100644 --- a/310_Geopoints.asciidoc +++ b/310_Geopoints.asciidoc @@ -6,9 +6,6 @@ include::310_Geopoints/32_Bounding_box.asciidoc[] include::310_Geopoints/34_Geo_distance.asciidoc[] -include::310_Geopoints/36_Caching_geofilters.asciidoc[] - include::310_Geopoints/38_Reducing_memory.asciidoc[] include::310_Geopoints/50_Sorting_by_distance.asciidoc[] - diff --git a/310_Geopoints/36_Caching_geofilters.asciidoc b/310_Geopoints/36_Caching_geofilters.asciidoc deleted file mode 100644 index 43bdbec6a..000000000 --- a/310_Geopoints/36_Caching_geofilters.asciidoc +++ /dev/null @@ -1,74 +0,0 @@ -[[geo-caching]] -=== Caching Geo Filters - -The results of geo-filters are not cached by default,((("caching", "of geo-filters")))((("filters", "caching geo-filters")))((("geo-filters, caching"))) for two reasons: - -* Geo-filters are usually used to find entities that are near to a user's - current location. 
The problem is that users move, and no two users - are in exactly the same location. A cached filter would have little - chance of being reused. - -* Filters are cached as bitsets that represent all documents in a - <>. Imagine that our query excludes all - documents but one in a particular segment. An uncached geo-filter just - needs to check the one remaining document, but a cached geo-filter would - need to check all of the documents in the segment. - -That said, caching can be used to good effect with geo-filters. Imagine that -your index contains restaurants from all over the United States. A user in New -York is not interested in restaurants in San Francisco. We can treat New York -as a _hot spot_ and draw a big bounding box around the city and neighboring -areas. - -This `geo_bounding_box` filter can be cached and((("geo_bounding_box filter", "caching and reusing"))) reused whenever we have a -user within the city limits of New York. It will exclude all restaurants -from the rest of the country. We can then use an uncached, more specific -`geo_bounding_box` or `geo_distance` filter((("geo_distance filter"))) to narrow the remaining results to those that are close to the user: - -[source,json] ---------------------- -GET /attractions/restaurant/_search -{ - "query": { - "filtered": { - "filter": { - "bool": { - "must": [ - { - "geo_bounding_box": { - "type": "indexed", - "_cache": true, <1> - "location": { - "top_left": { - "lat": 40,8, - "lon": -74.1 - }, - "bottom_right": { - "lat": 40.4, - "lon": -73.7 - } - } - } - }, - { - "geo_distance": { <2> - "distance": "1km", - "location": { - "lat": 40.715, - "lon": -73.988 - } - } - } - ] - } - } - } - } -} ---------------------- -<1> The cached bounding box filter reduces all results down to those in the - greater New York area. -<2> The more costly `geo_distance` filter narrows the results to those - within 1km of the user. 
- - diff --git a/320_Geohashes/60_Geohash_cell_filter.asciidoc b/320_Geohashes/60_Geohash_cell_filter.asciidoc index 05b5952d4..ee6e73eb2 100644 --- a/320_Geohashes/60_Geohash_cell_filter.asciidoc +++ b/320_Geohashes/60_Geohash_cell_filter.asciidoc @@ -1,16 +1,16 @@ -[[geohash-cell-filter]] -=== Geohash Cell Filter +[[geohash-cell-query]] +=== Geohash Cell Query -The `geohash_cell` filter simply translates a `lat/lon` location((("geohash_cell filter")))((("filters", "geohash_cell"))) into a +The `geohash_cell` query simply translates a `lat/lon` location((("geohash_cell query")))((("geohash_cell"))) into a geohash with the specified precision and finds all locations that contain -that geohash--a very efficient filter indeed. +that geohash--a very efficient query indeed. [source,json] ---------------------------- GET /attractions/restaurant/_search { "query": { - "filtered": { + "constant_score": { "filter": { "geohash_cell": { "location": { @@ -27,17 +27,17 @@ GET /attractions/restaurant/_search <1> The `precision` cannot be more precise than that specified in the `geohash_precision` mapping. -This filter translates the `lat/lon` point into a geohash of the appropriate +This query translates the `lat/lon` point into a geohash of the appropriate length--in this example `dr5rsk`—and looks for all locations that contain that exact term. -However, the filter as written in the preceding example may not return all restaurants within 5km +However, the query as written in the preceding example may not return all restaurants within 5km of the specified point. Remember that a geohash is just a rectangle, and the point may fall anywhere within that rectangle. If the point happens to fall near the edge of a geohash cell, the filter may well exclude any restaurants in the adjacent cell. 
-To fix that, we can tell the filter to include the neigboring cells, by
+To fix that, we can tell the query to include the neighboring cells, by
setting `neighbors` to((("neighbors setting (geohash_cell)"))) `true`:

[source,json]
@@ -45,7 +45,7 @@ setting `neighbors` to((("neighbors setting (geohash_cell)"))) `true`:
GET /attractions/restaurant/_search
{
  "query": {
-    "filtered": {
+    "constant_score": {
      "filter": {
        "geohash_cell": {
          "location": {
@@ -61,11 +61,11 @@ GET /attractions/restaurant/_search
  }
}
----------------------------

-<1> This filter will look for the resolved geohash and all surrounding
+<1> This query will look for the resolved geohash and all surrounding
geohashes.

Clearly, looking for a geohash with precision `2km` plus all the neighboring
-cells results in quite a large search area. This filter is not built for
+cells results in quite a large search area. This query is not built for
accuracy, but it is very efficient and can be used as a prefiltering step
before applying a more accurate geo-filter.

@@ -74,9 +74,7 @@
of `2km` is converted to a geohash of length 6, which actually has dimensions
of about 1.2km x 0.6km.  You may find it more understandable to specify an
actual length such as `5` or `6`.

-The other advantage that this filter has over a `geo_bounding_box` filter is
-that it supports multiple locations per field.((("latitude/longitude pairs", "multiple lat/lon points per field, geohash_cell"))) The `lat_lon` option that we
-discussed in <> is efficient, but only when there
-is a single `lat/lon` point per field.
-
-
+The other advantage that this query has over a `geo_bounding_box` query is
+that it supports multiple locations per field.((("latitude/longitude pairs", "multiple lat/lon points per field, geohash_cell")))
+The `lat_lon` option that we discussed in <> is efficient,
+but only when there is a single `lat/lon` point per field.
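The prefiltering pattern mentioned above can be made concrete. This is a sketch for reviewers (not part of the patch) combining a cheap `geohash_cell` query with a more accurate `geo_distance` query in the new non-scoring style; the coordinates are illustrative assumptions:

```json
GET /attractions/restaurant/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "geohash_cell": {
                "location": {
                  "lat": 40.718,
                  "lon": -73.983
                },
                "precision": "2km",
                "neighbors": true
              }
            },
            {
              "geo_distance": {
                "distance": "1km",
                "location": {
                  "lat": 40.718,
                  "lon": -73.983
                }
              }
            }
          ]
        }
      }
    }
  }
}
```

The fast but imprecise `geohash_cell` query narrows the candidates, and the costlier `geo_distance` query only has to examine what remains.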
diff --git a/330_Geo_aggs/62_Geo_distance_agg.asciidoc b/330_Geo_aggs/62_Geo_distance_agg.asciidoc index b56d07a28..c9e838ef9 100644 --- a/330_Geo_aggs/62_Geo_distance_agg.asciidoc +++ b/330_Geo_aggs/62_Geo_distance_agg.asciidoc @@ -11,8 +11,8 @@ add ``another result found within 2km'': GET /attractions/restaurant/_search { "query": { - "filtered": { - "query": { + "bool": { + "must": { "match": { <1> "name": "pizza" } @@ -116,6 +116,6 @@ The response from ((("post filter", "geo_distance aggregation")))the preceding r restaurant within 2km of the user. In this example, we have counted the number of restaurants that fall -into each concentric ring. Of course, we could nest subaggregations under +into each concentric ring. Of course, we could nest sub-aggregations under the `per_rings` aggregation to calculate the average price per ring, the -maximium popularity, and more. +maximum popularity, and more. diff --git a/330_Geo_aggs/64_Geohash_grid_agg.asciidoc b/330_Geo_aggs/64_Geohash_grid_agg.asciidoc index 0152a266a..e2dee7416 100644 --- a/330_Geo_aggs/64_Geohash_grid_agg.asciidoc +++ b/330_Geo_aggs/64_Geohash_grid_agg.asciidoc @@ -17,7 +17,7 @@ most documents.((("buckets", "generated by geohash_grid aggregation, controlling order to figure out which are the most populous 10,000. You need to control the number of buckets generated by doing the following: -1. Limit the result with a `geo_bounding_box` filter. +1. Limit the result with a `geo_bounding_box` query. 2. Choose an appropriate `precision` for the size of your bounding box. [source,json] @@ -26,7 +26,7 @@ GET /attractions/restaurant/_search { "size" : 0, "query": { - "filtered": { + "constant_score": { "filter": { "geo_bounding_box": { "location": { <1> @@ -84,7 +84,7 @@ The response from the preceding request looks like this: ---------------------------- <1> Each bucket contains the geohash as the `key`. 
-Again, we didn't specify any subaggregations, so all we got back was the +Again, we didn't specify any sub-aggregations, so all we got back was the document count. We could have asked for popular restaurant types, average price, or other details. @@ -96,4 +96,3 @@ central point. Libraries exist in JavaScript and other languages that will perform this conversion for you, but you can also use information from <> to perform a similar job. ==== - diff --git a/330_Geo_aggs/66_Geo_bounds_agg.asciidoc b/330_Geo_aggs/66_Geo_bounds_agg.asciidoc index f0684d331..471d78553 100644 --- a/330_Geo_aggs/66_Geo_bounds_agg.asciidoc +++ b/330_Geo_aggs/66_Geo_bounds_agg.asciidoc @@ -16,7 +16,7 @@ GET /attractions/restaurant/_search { "size" : 0, "query": { - "filtered": { + "constant_score": { "filter": { "geo_bounding_box": { "location": { @@ -48,7 +48,7 @@ GET /attractions/restaurant/_search } } ---------------------------- -<1> The `geo_bounds` aggregation will calculate the smallest bounding box required to encapsulate all of the documents matching our query. +<1> The `geo_bounds` aggregation will calculate the smallest bounding box required to encapsulate all of the documents matching our query. The response now includes a bounding box that we can use to zoom our map: @@ -81,7 +81,7 @@ GET /attractions/restaurant/_search { "size" : 0, "query": { - "filtered": { + "constant_score": { "filter": { "geo_bounding_box": { "location": { @@ -115,7 +115,7 @@ GET /attractions/restaurant/_search } } ---------------------------- -<1> The `cell_bounds` subaggregation is calculated for every geohash cell. +<1> The `cell_bounds` sub-aggregation is calculated for every geohash cell. Now the ((("cell_bounds aggregation")))points in each cell have a bounding box: @@ -143,5 +143,3 @@ Now the ((("cell_bounds aggregation")))points in each cell have a bounding box: }, ... 
---------------------------- - - diff --git a/340_Geoshapes.asciidoc b/340_Geoshapes.asciidoc index 1c632a9aa..75c13199c 100644 --- a/340_Geoshapes.asciidoc +++ b/340_Geoshapes.asciidoc @@ -7,7 +7,3 @@ include::340_Geoshapes/74_Indexing_geo_shapes.asciidoc[] include::340_Geoshapes/76_Querying_geo_shapes.asciidoc[] include::340_Geoshapes/78_Indexed_geo_shapes.asciidoc[] - -include::340_Geoshapes/80_Caching_geo_shapes.asciidoc[] - - diff --git a/340_Geoshapes/80_Caching_geo_shapes.asciidoc b/340_Geoshapes/80_Caching_geo_shapes.asciidoc deleted file mode 100644 index c9f2a6a73..000000000 --- a/340_Geoshapes/80_Caching_geo_shapes.asciidoc +++ /dev/null @@ -1,38 +0,0 @@ -[[geo-shape-caching]] -=== Geo Shape Filters and Caching - -The `geo_shape` query and filter perform the same function.((("caching", "geo-shape filters and")))((("filters", "geo_shape")))((("geo-shapes", "geo_shape filters, caching and"))) The query simply -acts as a filter: any matching documents receive a relevance `_score` of -`1`. Query results cannot be cached, but filter results can be. - -The results are not cached by default. Just as with geo-points, any -change in the coordinates in a shape are likely to produce a different set of -geohashes, so there is little point in caching filter results. That said, if -you filter using the same shapes repeatedly, it can be worth caching the -results, by setting `_cache` to `true`: - -[source,json] ------------------------ -GET /attractions/neighborhood/_search -{ - "query": { - "filtered": { - "filter": { - "geo_shape": { - "_cache": true, <1> - "location": { - "indexed_shape": { - "index": "attractions", - "type": "landmark", - "id": "dam_square", - "path": "location" - } - } - } - } - } - } -} ------------------------ -<1> The results of this `geo_shape` filter will be cached. 
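Since the deleted page's example used the now-removed `_cache` flag, a translation into the new non-scoring style may help reviewers confirm nothing is lost. This sketch reuses the names from the deleted snippet; caching now happens automatically, so no `_cache` flag is needed:

```json
GET /attractions/neighborhood/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "geo_shape": {
          "location": {
            "indexed_shape": {
              "index": "attractions",
              "type": "landmark",
              "id": "dam_square",
              "path": "location"
            }
          }
        }
      }
    }
  }
}
```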
-
diff --git a/402_Nested/32_Nested_query.asciidoc b/402_Nested/32_Nested_query.asciidoc
index 16b6b0cda..54c7d7702 100644
--- a/402_Nested/32_Nested_query.asciidoc
+++ b/402_Nested/32_Nested_query.asciidoc
@@ -78,11 +78,8 @@ GET /my_index/blogpost/_search

[NOTE]
====
-A `nested` filter behaves much like a `nested` query, except that it doesn't
-accept the `score_mode` parameter. It can be used only in _filter context_—such as inside a `filtered` query--and it behaves like any other filter:
-it includes or excludes, but it doesn't score.
-
-While the results of the `nested` filter itself are not cached, the usual
-caching rules apply to the filter _inside_ the `nested` filter.
+If placed inside the `filter` clause of a Boolean query, a `nested` query behaves
+much like its scoring counterpart, except that it doesn't accept the `score_mode`
+parameter. Because it is used as a non-scoring query -- it includes or excludes,
+but doesn't score -- there is nothing for `score_mode` to operate on.
====
-
diff --git a/404_Parent_Child/55_Has_parent.asciidoc b/404_Parent_Child/55_Has_parent.asciidoc
index c0b83dcf3..fcc37e404 100644
--- a/404_Parent_Child/55_Has_parent.asciidoc
+++ b/404_Parent_Child/55_Has_parent.asciidoc
@@ -34,16 +34,11 @@
parent, so there is no need to reduce multiple scores into a single score for
the child. The choice is simply between using the score (`score`) or not
(`none`).

-.has_parent Filter
+.Non-scoring has_parent Query
**************************

-The `has_parent` filter works in the same way((("has_parent query and filter", "filter"))) as the `has_parent` query, except
-that it doesn't support the `score_mode` parameter. It can be used only in
-_filter context_—such as inside a `filtered` query--and behaves
-like any other filter: it includes or excludes, but doesn't score.
-
-While the results of a `has_parent` filter are not cached, the usual caching
-rules apply to the filter _inside_ the `has_parent` filter.
- +When used in non-scoring mode (e.g. inside a `filter` clause), the `has_parent` +query does not support the `score_mode` parameter. Because it merely +includes or excludes documents without scoring them, there is nothing +for `score_mode` to apply to. ************************** - diff --git a/410_Scaling/65_Shared_index.asciidoc b/410_Scaling/65_Shared_index.asciidoc index e127ec3dd..da58147de 100644 --- a/410_Scaling/65_Shared_index.asciidoc +++ b/410_Scaling/65_Shared_index.asciidoc @@ -36,21 +36,21 @@ PUT /forums/post/1 We can use the `forum_id` as a filter to search within a single forum. The filter will exclude most of the documents in the index (those from other -forums), and filter caching will ensure that responses are fast: +forums), and caching will ensure that responses are fast: [source,json] ------------------------------ GET /forums/post/_search { "query": { - "filtered": { - "query": { + "bool": { + "must": { "match": { "title": "ginger nuts" } }, "filter": { - "term": { <1> + "term": { "forum_id": { "baking" } @@ -60,7 +60,6 @@ GET /forums/post/_search } } ------------------------------ -<1> The `term` filter is cached by default. This approach works, but we can do better. ((("shards", "routing a document to"))) The posts from a single forum would fit easily onto one shard, but currently they are scattered across all ten @@ -98,8 +97,8 @@ holds our documents: GET /forums/post/_search?routing=baking <1> { "query": { - "filtered": { - "query": { + "bool": { + "must": { "match": { "title": "ginger nuts" } @@ -116,18 +115,18 @@ GET /forums/post/_search?routing=baking <1> } ------------------------------ <1> The query is run on only the shard that corresponds to this `routing` value. -<2> We still need the filter, as a single shard can hold posts from many forums. +<2> We still need the filtering query, as a single shard can hold posts from many forums. 
Multiple forums can be queried by passing a comma-separated list of `routing` -values, and including each `forum_id` in a `terms` filter: +values, and including each `forum_id` in a `terms` query: [source,json] ------------------------------ GET /forums/post/_search?routing=baking,cooking,recipes { "query": { - "filtered": { - "query": { + "bool": { + "must": { "match": { "title": "ginger nuts" } @@ -145,7 +144,5 @@ GET /forums/post/_search?routing=baking,cooking,recipes ------------------------------ While this approach is technically efficient, it looks a bit clumsy because of -the need to specify `routing` values and `terms` filters on every query or +the need to specify `routing` values and `terms` queries on every query or indexing request. Index aliases to the rescue! - - diff --git a/snippets/054_Query_DSL/60_Bool_query.json b/snippets/054_Query_DSL/60_Bool_query.json index 4472beef2..a87039081 100644 --- a/snippets/054_Query_DSL/60_Bool_query.json +++ b/snippets/054_Query_DSL/60_Bool_query.json @@ -20,6 +20,13 @@ GET /_search "match": { "tweet": "full text" } + }, + "filter": { + "range": { + "age" : { + "gt" : 30 + } + } } } } diff --git a/snippets/054_Query_DSL/70_Bool_filter.json b/snippets/054_Query_DSL/70_Bool_filter.json index 7dcb7ab25..0a4b543ab 100644 --- a/snippets/054_Query_DSL/70_Bool_filter.json +++ b/snippets/054_Query_DSL/70_Bool_filter.json @@ -40,7 +40,7 @@ PUT /test/test/3 GET /test/test/_search { "query": { - "filtered": { + "bool": { "filter": { "bool": { "must": { diff --git a/snippets/054_Query_DSL/70_Bool_query.json b/snippets/054_Query_DSL/70_Bool_query.json index e9cae2e9c..95601e5fb 100644 --- a/snippets/054_Query_DSL/70_Bool_query.json +++ b/snippets/054_Query_DSL/70_Bool_query.json @@ -65,4 +65,91 @@ GET /test/test/_search ] } } -} \ No newline at end of file +} + + +# Move the date range query into the filter clause +# to turn it into a non-scoring query +GET /test/test/_search +{ + "query": { + "bool": { + "must": { + "match": 
{ + "title": "how to make millions" + } + }, + "must_not": { + "match": { + "tag": "spam" + } + }, + "should": [ + { + "match": { + "tag": "starred" + } + } + ], + "filter": { + "range": { + "date": { + "gte": "2014-01-01" + } + } + } + } + } +} + + +# Embed another bool inside the first bool's filter clause, to +# enable boolean logic among non-scoring queries +GET /test/test/_search +{ + "query": { + "bool": { + "must": { + "match": { + "title": "how to make millions" + } + }, + "must_not": { + "match": { + "tag": "spam" + } + }, + "should": [ + { + "match": { + "tag": "starred" + } + } + ], + "filter": { + "bool": { + "must": [ + { "range": { "date": { "gte": "2014-01-01" }}}, + { "range": { "price": { "lte": 29.99 }}} + ], + "must_not": [ + { "term": { "category": "ebooks" }} + ] + } + } + } + } +} + +# If you just want to apply a filter and nothing else, you can use +# constant_score query +GET /test/test/_search +{ + "query": { + "constant_score": { + "filter": { + "term": { "category": "ebooks" } + } + } + } +} diff --git a/snippets/054_Query_DSL/70_Exists_filter.json b/snippets/054_Query_DSL/70_Exists_filter.json index 49b7b1120..3963d1e79 100644 --- a/snippets/054_Query_DSL/70_Exists_filter.json +++ b/snippets/054_Query_DSL/70_Exists_filter.json @@ -30,7 +30,7 @@ PUT /test/test/2 GET /test/test/_search { "query": { - "filtered": { + "bool": { "filter": { "exists": { "field": "title" @@ -44,7 +44,7 @@ GET /test/test/_search GET /test/test/_search { "query": { - "filtered": { + "bool": { "filter": { "missing": { "field": "title" diff --git a/snippets/054_Query_DSL/70_Terms_filter.json b/snippets/054_Query_DSL/70_Terms_filter.json index 3f629fcce..2f9d86cd9 100644 --- a/snippets/054_Query_DSL/70_Terms_filter.json +++ b/snippets/054_Query_DSL/70_Terms_filter.json @@ -31,7 +31,7 @@ PUT /test/test/2 GET /test/test/_search { "query": { - "filtered": { + "bool": { "filter": { "terms": { "tag": [ @@ -44,4 +44,3 @@ GET /test/test/_search } } } - diff --git 
a/snippets/056_Sorting/85_Multilevel_sort.json b/snippets/056_Sorting/85_Multilevel_sort.json index 1f88cffc2..24c39c299 100644 --- a/snippets/056_Sorting/85_Multilevel_sort.json +++ b/snippets/056_Sorting/85_Multilevel_sort.json @@ -38,8 +38,8 @@ PUT /test/tweet/4 GET /test/_search { "query": { - "filtered": { - "query": { + "bool": { + "must": { "match": { "tweet": "full text search" } diff --git a/snippets/056_Sorting/85_Sort_by_date.json b/snippets/056_Sorting/85_Sort_by_date.json index c3710c27e..4494c4fef 100644 --- a/snippets/056_Sorting/85_Sort_by_date.json +++ b/snippets/056_Sorting/85_Sort_by_date.json @@ -38,7 +38,7 @@ PUT /test/tweet/4 GET /test/_search { "query": { - "filtered": { + "bool": { "filter": { "term": { "user_id": 1 @@ -52,4 +52,3 @@ GET /test/_search } } } - diff --git a/snippets/080_Structured_Search/05_Term_number.json b/snippets/080_Structured_Search/05_Term_number.json index 2143fdef7..25a3b7f99 100644 --- a/snippets/080_Structured_Search/05_Term_number.json +++ b/snippets/080_Structured_Search/05_Term_number.json @@ -16,10 +16,7 @@ POST /my_store/products/_bulk GET /my_store/products/_search { "query": { - "filtered": { - "query": { - "match_all": {} - }, + "constant_score": { "filter": { "term": { "price": 20 @@ -33,7 +30,7 @@ GET /my_store/products/_search GET /my_store/products/_search { "query": { - "filtered": { + "constant_score": { "filter": { "term": { "price": 20 diff --git a/snippets/080_Structured_Search/05_Term_text.json b/snippets/080_Structured_Search/05_Term_text.json index 7a448f35d..671bcbc40 100644 --- a/snippets/080_Structured_Search/05_Term_text.json +++ b/snippets/080_Structured_Search/05_Term_text.json @@ -16,7 +16,7 @@ POST /my_store/products/_bulk GET /my_store/products/_search { "query": { - "filtered": { + "bool": { "filter": { "term": { "productID": "XHDK-A-1293-#fJ3" @@ -65,7 +65,7 @@ POST /my_store/products/_bulk GET /my_store/products/_search { "query": { - "filtered": { + "bool": { "filter": { "term": { 
"productID": "XHDK-A-1293-#fJ3" @@ -74,4 +74,3 @@ GET /my_store/products/_search } } } - diff --git a/snippets/080_Structured_Search/10_Bool_filter.json b/snippets/080_Structured_Search/10_Bool_filter.json index 7bc248d63..6341b131b 100644 --- a/snippets/080_Structured_Search/10_Bool_filter.json +++ b/snippets/080_Structured_Search/10_Bool_filter.json @@ -34,7 +34,7 @@ POST /my_store/products/_bulk GET /my_store/products/_search { "query": { - "filtered": { + "constant_score": { "filter": { "bool": { "should": [ @@ -68,7 +68,7 @@ GET /my_store/products/_search GET /my_store/products/_search { "query": { - "filtered": { + "constant_score": { "filter": { "bool": { "should": [ diff --git a/snippets/080_Structured_Search/15_Terms_filter.json b/snippets/080_Structured_Search/15_Terms_filter.json index 1887df14d..eedd714f8 100644 --- a/snippets/080_Structured_Search/15_Terms_filter.json +++ b/snippets/080_Structured_Search/15_Terms_filter.json @@ -16,7 +16,7 @@ POST /my_store/products/_bulk GET /my_store/products/_search { "query": { - "filtered": { + "constant_score": { "filter": { "terms": { "price": [ 20, 30 ] diff --git a/snippets/080_Structured_Search/20_Exact.json b/snippets/080_Structured_Search/20_Exact.json index 8904346fa..833aa8538 100644 --- a/snippets/080_Structured_Search/20_Exact.json +++ b/snippets/080_Structured_Search/20_Exact.json @@ -23,7 +23,7 @@ PUT /my_index/my_type/2 GET /my_index/my_type/_search { "query": { - "filtered": { + "constant_score": { "filter": { "bool": { "must": [ @@ -42,4 +42,4 @@ GET /my_index/my_type/_search } } } -} \ No newline at end of file +} diff --git a/snippets/080_Structured_Search/25_Range_filter.json b/snippets/080_Structured_Search/25_Range_filter.json index e005b1a0c..723a160a6 100644 --- a/snippets/080_Structured_Search/25_Range_filter.json +++ b/snippets/080_Structured_Search/25_Range_filter.json @@ -17,7 +17,7 @@ POST /my_store/products/_bulk GET /my_store/products/_search { "query": { - "filtered": { + 
"constant_score": { "filter": { "range": { "price": { @@ -34,7 +34,7 @@ GET /my_store/products/_search GET /my_store/products/_search { "query": { - "filtered": { + "constant_score": { "filter": { "range": { "price": { diff --git a/snippets/080_Structured_Search/30_Exists_missing.json b/snippets/080_Structured_Search/30_Exists_missing.json index 34efef177..942966ce4 100644 --- a/snippets/080_Structured_Search/30_Exists_missing.json +++ b/snippets/080_Structured_Search/30_Exists_missing.json @@ -18,7 +18,7 @@ POST /my_index/posts/_bulk GET /my_index/posts/_search { "query": { - "filtered": { + "constant_score": { "filter": { "exists": { "field": "tags" @@ -32,7 +32,7 @@ GET /my_index/posts/_search GET /my_index/posts/_search { "query": { - "filtered": { + "constant_score": { "filter": { "missing": { "field": "tags" @@ -40,4 +40,4 @@ GET /my_index/posts/_search } } } -} \ No newline at end of file +} diff --git a/snippets/300_Aggregations/45_filtering.json b/snippets/300_Aggregations/45_filtering.json index 5170608c1..67672f9a1 100644 --- a/snippets/300_Aggregations/45_filtering.json +++ b/snippets/300_Aggregations/45_filtering.json @@ -4,7 +4,7 @@ GET /cars/transactions/_search { "size" : 0, "query" : { - "filtered": { + "constant_score": { "filter": { "range": { "price": { @@ -22,7 +22,7 @@ GET /cars/transactions/_search } # Filter bucket lets you filter just portions of the aggregation -# in this example, we filter the agg to calculate average price of +# in this example, we filter the agg to calculate average price of # "ford" sold in the last month, while search results show all Fords GET /cars/transactions/_search { diff --git a/snippets/300_Aggregations/75_sigterms.json b/snippets/300_Aggregations/75_sigterms.json index e141dda9d..5a42abf95 100644 --- a/snippets/300_Aggregations/75_sigterms.json +++ b/snippets/300_Aggregations/75_sigterms.json @@ -38,7 +38,7 @@ GET mlratings/_search { "size" : 0, "query": { - "filtered": { + "bool": { "filter": { "term": { 
"movie": 46970 @@ -60,7 +60,7 @@ GET mlratings/_search GET mlmovies/_search { "query": { - "filtered": { + "bool": { "filter": { "ids": { "values": [2571,318,296,2959,260] @@ -92,7 +92,7 @@ GET mlratings/_search { "size" : 0, "query": { - "filtered": { + "bool": { "filter": { "term": { "movie": 46970 @@ -108,4 +108,4 @@ GET mlratings/_search } } } -} \ No newline at end of file +} diff --git a/stash/50_cachekey.asciidoc b/stash/50_cachekey.asciidoc deleted file mode 100644 index ddd0236a0..000000000 --- a/stash/50_cachekey.asciidoc +++ /dev/null @@ -1,114 +0,0 @@ -=== Cache key - -The lookup example works great from the developer's point of view...but it is -going to make your filter cache very unhappy. If you were to run this type -of query in production and watch the filter cache stats, you'd see a large -amount of evictions -- the dreaded cache churn we discussed in -<<_controlling_caching>>. - -So what's going on? To understand the problem, we need to understand how -Elasticsearch names the filters in memory. Cached filters are stored in a map, -and each entry has a key name which is used to retrieve the filter. - -By default, Elasticsearch uses the filter itself as the key. When -parsing a query, it can simply take the entire filter contents and uses that as -the key. - -As a simple example, a filter map may look like this: - - - |----------------------------------------------------------------| - | Terms | Key | Bitset | - |----------------------------------------------------------------| - | marketing | "marketing" | 01101011001010 | - | sales, pr | "sales_pr" | 11010011111011 | - | management, sales, pr | "management_sales_pr" | 11110111111011 | - | engineering | "engineering" | 00000011001010 | - |----------------------------------------------------------------| - -As you can see, we have a list of `terms` filters. The key is simply -the list of terms concatenated together. 
Filters need to be uniquely identified, -so the filter itself works as a perfect unique key. - -.Does term order matter? -**** -You may wonder if the order of terms matters. For example, is -`{"terms" : ["a", "b"]}` cached separately from `{"terms" : ["b", "a"]}`? - -The answer is no, they are not cached separately. In reality, Elasticsearch -sorts the terms and caches the filter *object* itself, not a string concatenation. -But this complicates the subject, so you can safely think of the map keys as -strings where term order does not matter. -**** - -This approach can, however, lead to certain edge-cases where memory usage balloons -out of control. Our previous twitter example -- aka `terms` filters with -10,000 terms -- is one such example - -Elasticsearch will simply concatenate all 10,000 terms together...which means -the key itself may potentially use more memory than the cached bitset! -Obviously, this is not ideal. The way to fix this is to manually over-ride -the cache key using an option called `_cache_key`. Instead of concatenating -all the terms together, Elasticsearch will use the key name that you provide. - -If we tweak our lookup example from before, we can manually assign a key name: - -[source,js] --------------------------------------------------- -GET /my_index/users/_search -{ - "query" : { - "filtered" : { - "filter" : { - "terms" : { - "user" : { - "index" : "my_index", - "type" : "user_following", - "id" : "1", - "path" : "following" - }, - "_cache_key" : "user_following_1" <1> - } - } - } - } -} --------------------------------------------------- -<1> We manually add a cache key, arbitrarily set to some unique value - -For this example, we are setting the cache key to the type + field name + user -ID. The actual key value doesn't really matter, so long as it is unique to the -filter and short. Depending on your business requirements, it may make sense -to simply hash the list of IDs (with SHA1, etc) and use that as a key. 
- -._cache_key's cannot be automatically evicted -**** -If you use custom _cache_key's, you need to manually evict them from cache -when appropriate. In our twitter example, if you were to add a new "following" -ID to the document, the saved cache is no longer valid and needs to be manually -removed. - -You can do this with the Clear Cache API, which will clear the cache for our -"following" field: - -[source,js] --------------------------------------------------- -POST /my_index/_cache/clear?filter=true&fields=following --------------------------------------------------- - -Alternatively, you can set a timeout for the filter cache which will expire -filters after a certain age. This is set using the Update Cluster Settings API: - -[source,js] --------------------------------------------------- -POST /_cluster/settings -{ - "persistent" : { - "indices.cache.filter.expire" : "10m" - } -} --------------------------------------------------- - -This is a less useful method, however, since it affects your entire cluster -and filters which are not using a _custom_cache key. -**** diff --git a/stash/55_revisittermslookup.asciidoc b/stash/55_revisittermslookup.asciidoc index 297464d14..d0ec82d31 100644 --- a/stash/55_revisittermslookup.asciidoc +++ b/stash/55_revisittermslookup.asciidoc @@ -52,7 +52,7 @@ use just a single `term` filter looking for a single term: GET /my_index/users/_search { "query" : { - "filtered" : { + "bool" : { "filter" : { "term" : { "followed_by" : 1 @@ -86,4 +86,3 @@ Moving that data to the user documents itself may seem unnatural, but in many cases can work substantially better as seen here. When thinking about data organization and query structure, think about how you would like to search for your data rather than how you would like to store it. -
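Assembled from the hunks in the last diff, the migrated terms-lookup example reads as a complete request along these lines (a sketch; only the `filtered` → `bool` wrapper changes, the inner `term` clause is untouched):

[source,json]
-----------------------
GET /my_index/users/_search
{
  "query" : {
    "bool" : {
      "filter" : {
        "term" : {
          "followed_by" : 1
        }
      }
    }
  }
}
-----------------------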