From 7ca64a183126194a3ceb30b81e649c65fcf41ae1 Mon Sep 17 00:00:00 2001
From: Shiny Kalapurakkel
Date: Mon, 12 Jan 2015 14:27:23 +0000
Subject: [PATCH] Changes in ellen_troutman-at29536905925 branch

---
 010_Intro/10_Installing_ES.asciidoc | 4 +--
 010_Intro/15_API.asciidoc | 4 +--
 010_Intro/25_Tutorial_Indexing.asciidoc | 6 ++--
 010_Intro/30_Tutorial_Search.asciidoc | 8 +++---
 020_Distributed_Cluster/00_Intro.asciidoc | 2 +-
 .../15_Add_an_index.asciidoc | 6 ++--
 .../20_Add_failover.asciidoc | 2 +-
 .../25_Scale_horizontally.asciidoc | 2 +-
 .../30_Scale_more.asciidoc | 4 +--
 .../35_Coping_with_failure.asciidoc | 2 +-
 030_Data/00_Intro.asciidoc | 2 +-
 030_Data/05_Document.asciidoc | 8 +++---
 030_Data/10_Index.asciidoc | 4 +--
 030_Data/15_Get.asciidoc | 4 +--
 030_Data/20_Exists.asciidoc | 4 +--
 030_Data/25_Update.asciidoc | 2 +-
 030_Data/30_Create.asciidoc | 4 +--
 030_Data/35_Delete.asciidoc | 2 +-
 030_Data/55_Bulk.asciidoc | 4 +--
 040_Distributed_CRUD/05_Routing.asciidoc | 2 +-
 .../15_Create_index_delete.asciidoc | 4 +--
 04_Geolocation.asciidoc | 2 +-
 050_Search/00_Intro.asciidoc | 4 +--
 050_Search/05_Empty_search.asciidoc | 4 +--
 050_Search/20_Query_string.asciidoc | 4 +--
 .../25_Data_type_differences.asciidoc | 2 +-
 .../35_Inverted_index.asciidoc | 6 ++--
 052_Mapping_Analysis/40_Analysis.asciidoc | 2 +-
 052_Mapping_Analysis/45_Mapping.asciidoc | 12 ++++----
 .../50_Complex_datatypes.asciidoc | 2 +-
 054_Query_DSL/65_Queries_vs_filters.asciidoc | 2 +-
 054_Query_DSL/70_Important_clauses.asciidoc | 6 ++--
 .../75_Queries_with_filters.asciidoc | 4 +--
 054_Query_DSL/80_Validating_queries.asciidoc | 2 +-
 056_Sorting/85_Sorting.asciidoc | 10 +++----
 056_Sorting/88_String_sorting.asciidoc | 4 +--
 056_Sorting/90_What_is_relevance.asciidoc | 18 ++++++------
 056_Sorting/95_Fielddata.asciidoc | 2 +-
 .../05_Query_phase.asciidoc | 2 +-
 070_Index_Mgmt/05_Create_Delete.asciidoc | 6 ++--
 070_Index_Mgmt/15_Configure_Analyzer.asciidoc | 2 +-
 070_Index_Mgmt/20_Custom_Analyzers.asciidoc | 4 +--
 070_Index_Mgmt/45_Default_Mapping.asciidoc | 2 +-
 .../30_Dynamic_indices.asciidoc | 8 +++---
 075_Inside_a_shard/40_Near_real_time.asciidoc | 2 +-
 .../50_Persistent_changes.asciidoc | 12 ++++----
 .../60_Segment_merging.asciidoc | 2 +-
 080_Structured_Search/25_ranges.asciidoc | 4 +--
 080_Structured_Search/40_bitsets.asciidoc | 8 +++---
 110_Multi_Field_Search/00_Intro.asciidoc | 2 +-
 .../05_Multiple_query_strings.asciidoc | 6 ++--
 .../10_Single_query_string.asciidoc | 2 +-
 110_Multi_Field_Search/15_Best_field.asciidoc | 2 +-
 .../20_Tuning_best_field_queries.asciidoc | 8 +++---
 .../25_Multi_match_query.asciidoc | 4 +--
 .../30_Most_fields.asciidoc | 4 +--
 .../35_Entity_search.asciidoc | 2 +-
 .../40_Field_centric.asciidoc | 4 +--
 110_Multi_Field_Search/45_Custom_all.asciidoc | 2 +-
 .../50_Cross_field.asciidoc | 2 +-
 .../55_Not_analyzed.asciidoc | 2 +-
 .../05_Phrase_matching.asciidoc | 2 +-
 .../15_Multi_value_fields.asciidoc | 2 +-
 120_Proximity_Matching/35_Shingles.asciidoc | 2 +-
 130_Partial_Matching/05_Postcodes.asciidoc | 2 +-
 .../20_Match_phrase_prefix.asciidoc | 2 +-
 .../35_Search_as_you_type.asciidoc | 2 +-
 170_Relevance/10_Scoring_theory.asciidoc | 22 +++++++--------
 170_Relevance/15_Practical_scoring.asciidoc | 14 +++++-----
 170_Relevance/20_Query_time_boosting.asciidoc | 6 ++--
 170_Relevance/45_Popularity.asciidoc | 4 +--
 170_Relevance/55_Random_scoring.asciidoc | 2 +-
 170_Relevance/60_Decay_functions.asciidoc | 6 ++--
 170_Relevance/65_Script_score.asciidoc | 2 +-
 200_Language_intro/10_Using.asciidoc | 2 +-
.../30_Language_pitfalls.asciidoc | 2 +- .../40_One_language_per_doc.asciidoc | 2 +- .../60_Mixed_language_fields.asciidoc | 4 +-- .../10_Standard_analyzer.asciidoc | 2 +- 210_Identifying_words/30_ICU_plugin.asciidoc | 2 +- .../50_Tidying_text.asciidoc | 2 +- .../20_Removing_diacritics.asciidoc | 2 +- 230_Stemming/10_Algorithmic_stemmers.asciidoc | 2 +- 230_Stemming/40_Choosing_a_stemmer.asciidoc | 2 +- 230_Stemming/50_Controlling_stemming.asciidoc | 4 +-- 240_Stopwords/20_Using_stopwords.asciidoc | 4 +-- 240_Stopwords/40_Divide_and_conquer.asciidoc | 4 +-- 240_Stopwords/50_Phrase_queries.asciidoc | 2 +- 260_Synonyms/60_Multi_word_synonyms.asciidoc | 8 +++--- .../60_Phonetic_matching.asciidoc | 2 +- .../100_circuit_breaker_fd_settings.asciidoc | 4 +-- .../120_breadth_vs_depth.asciidoc | 2 +- 300_Aggregations/30_histogram.asciidoc | 4 +-- 300_Aggregations/35_date_histogram.asciidoc | 4 +-- 300_Aggregations/40_scope.asciidoc | 6 ++-- 300_Aggregations/60_cardinality.asciidoc | 6 ++-- 300_Aggregations/90_fielddata.asciidoc | 2 +- 310_Geopoints/20_Geopoints.asciidoc | 2 +- 330_Geo_aggs/62_Geo_distance_agg.asciidoc | 2 +- 330_Geo_aggs/66_Geo_bounds_agg.asciidoc | 2 +- 340_Geoshapes/72_Mapping_geo_shapes.asciidoc | 2 +- 400_Relationships/25_Concurrency.asciidoc | 2 +- 402_Nested/35_Nested_aggs.asciidoc | 2 +- 404_Parent_Child/50_Has_child.asciidoc | 8 +++--- 404_Parent_Child/55_Has_parent.asciidoc | 6 ++-- 410_Scaling/40_Multiple_indices.asciidoc | 2 +- 410_Scaling/45_Index_per_timeframe.asciidoc | 4 +-- 410_Scaling/50_Index_templates.asciidoc | 4 +-- 410_Scaling/55_Retiring_data.asciidoc | 10 +++---- 410_Scaling/60_Index_per_user.asciidoc | 2 +- 410_Scaling/65_Shared_index.asciidoc | 2 +- 410_Scaling/75_One_big_user.asciidoc | 2 +- 500_Cluster_Admin/20_health.asciidoc | 2 +- 500_Cluster_Admin/30_node_stats.asciidoc | 28 +++++++++---------- 500_Cluster_Admin/40_other_stats.asciidoc | 2 +- 510_Deployment/20_hardware.asciidoc | 2 +- 510_Deployment/30_other.asciidoc | 6 ++-- 510_Deployment/50_heap.asciidoc | 2 +- 520_Post_Deployment/20_logging.asciidoc | 2 +- 520_Post_Deployment/50_backup.asciidoc | 4 +-- 520_Post_Deployment/60_restore.asciidoc | 2 +- 121 files changed, 251 insertions(+), 251 deletions(-) diff --git a/010_Intro/10_Installing_ES.asciidoc b/010_Intro/10_Installing_ES.asciidoc index beab67ffe..fc037200e 100644 --- a/010_Intro/10_Installing_ES.asciidoc +++ b/010_Intro/10_Installing_ES.asciidoc @@ -34,7 +34,7 @@ https://github.com/elasticsearch/cookbook-elasticsearch[Chef cookbook]. http://www.elasticsearch.com/products/marvel[Marvel] is a management((("Marvel", "defined"))) and monitoring tool for Elasticsearch, which is free for development use. It comes with an -interactive console called Sense, which makes it easy to talk to +interactive console called Sense,((("Sense console (Marvel plugin)"))) which makes it easy to talk to Elasticsearch directly from your browser. Many of the code examples in the online version of this book include a View in Sense link. When @@ -124,6 +124,6 @@ If you installed the <> management ((("Marvel", "viewing")))and m view it in a web browser by visiting http://localhost:9200/_plugin/marvel/. -You can reach the _Sense_ developer((("Sense", "viewing developer console"))) console either by clicking the ``Marvel +You can reach the _Sense_ developer((("Sense console (Marvel plugin)", "viewing"))) console either by clicking the ``Marvel dashboards'' drop-down in Marvel, or by visiting http://localhost:9200/_plugin/marvel/sense/. 
diff --git a/010_Intro/15_API.asciidoc b/010_Intro/15_API.asciidoc index c84d8006f..92c8b6c84 100644 --- a/010_Intro/15_API.asciidoc +++ b/010_Intro/15_API.asciidoc @@ -35,7 +35,7 @@ of the http://www.elasticsearch.org/guide/[Guide]. ==== RESTful API with JSON over HTTP All other languages can communicate with Elasticsearch((("port 9200 for non-Java clients"))) over port _9200_ using -a ((("RESTful API", "communicating with Elasticseach")))RESTful API, accessible with your favorite web client. In fact, as you have +a ((("RESTful API, communicating with Elasticseach")))RESTful API, accessible with your favorite web client. In fact, as you have seen, you can even talk to Elasticsearch from the command line by using the `curl` command.((("curl command", "talking to Elasticsearch with"))) @@ -128,6 +128,6 @@ GET /_count -------------------------------------------------- // SENSE: 010_Intro/15_Count.json -In fact, this is the same format that is used by the ((("Sense", "curl requests in")))Sense console that we +In fact, this is the same format that is used by the ((("Marvel", "Sense console")))((("Sense console (Marvel plugin)", "curl requests in")))Sense console that we installed with <>. If in the online version of this book, you can open and run this code example in Sense by clicking the View in Sense link above. diff --git a/010_Intro/25_Tutorial_Indexing.asciidoc b/010_Intro/25_Tutorial_Indexing.asciidoc index a18d7c169..42685a673 100644 --- a/010_Intro/25_Tutorial_Indexing.asciidoc +++ b/010_Intro/25_Tutorial_Indexing.asciidoc @@ -35,7 +35,7 @@ employee. The act of storing data in Elasticsearch is called _indexing_, but before we can index a document, we need to decide _where_ to store it. In Elasticsearch, a document belongs to a _type_, and those((("types"))) types live inside -an _index_. You can draw some (rough) parallels to a traditional relational database: +an _index_. ((("indices")))You can draw some (rough) parallels to a traditional relational database: ---- Relational DB ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns @@ -68,9 +68,9 @@ replace the old. Inverted index:: -Relational databases add an _index_, such as a B-tree index,((("relational databases", "indexes"))) to specific +Relational databases add an _index_, such as a B-tree index,((("relational databases", "indices"))) to specific columns in order to improve the speed of data retrieval. Elasticsearch and -Lucene use a structure called((("inverted indexes"))) an _inverted index_ for exactly the same +Lucene use a structure called((("inverted index"))) an _inverted index_ for exactly the same purpose. + By default, every field in a document is _indexed_ (has an inverted index) diff --git a/010_Intro/30_Tutorial_Search.asciidoc b/010_Intro/30_Tutorial_Search.asciidoc index 9e0bfe2e3..6993f9302 100644 --- a/010_Intro/30_Tutorial_Search.asciidoc +++ b/010_Intro/30_Tutorial_Search.asciidoc @@ -5,7 +5,7 @@ business requirements for this application. The first requirement is the ability to retrieve individual employee data. This is easy in Elasticsearch. 
We simply execute((("HTTP requests", "retrieving a document with GET"))) an HTTP +GET+ request and -specify the _address_ of the document--the index, type, and ID.((("id", "speccifying in a request")))((("indexes", "specifying index in a request")))((("types", "specifying type in a request"))) Using +specify the _address_ of the document--the index, type, and ID.((("id", "specifying in a request")))((("indices", "specifying index in a request")))((("types", "specifying type in a request"))) Using those three pieces of information, we can return the original JSON document: [source,js] @@ -15,7 +15,7 @@ GET /megacorp/employee/1 // SENSE: 010_Intro/30_Get.json And the response contains some metadata about the document, and John Smith's -original JSON document ((("source field")))as the `_source` field: +original JSON document ((("_source field", sortas="source field")))as the `_source` field: [source,js] -------------------------------------------------- @@ -260,7 +260,7 @@ range search, and reused the same `match` query as before. Now our results show === Full-Text Search The searches so far have been simple: single names, filtered by age. Let's -try a more advanced, full-text search--a ((("full-text search")))task that traditional databases +try a more advanced, full-text search--a ((("full text search")))task that traditional databases would really struggle with. We are going to search for all employees who enjoy rock climbing: @@ -334,7 +334,7 @@ traditional relational databases, in which a record either matches or it doesn't === Phrase Search Finding individual words in a field is all well and good, but sometimes you -want to match exact sequences of words or _phrases_.((("prhase search"))) For instance, we could +want to match exact sequences of words or _phrases_.((("phrase matching"))) For instance, we could perform a query that will match only employee records that contain both ``rock'' _and_ ``climbing'' _and_ that display the words are next to each other in the phrase ``rock climbing.'' diff --git a/020_Distributed_Cluster/00_Intro.asciidoc b/020_Distributed_Cluster/00_Intro.asciidoc index 446bf68f1..cc998ab38 100644 --- a/020_Distributed_Cluster/00_Intro.asciidoc +++ b/020_Distributed_Cluster/00_Intro.asciidoc @@ -17,7 +17,7 @@ to skim through the chapter and to refer to it again later. **** -Elasticsearch is built to be ((("scalability", "Elasticsearch and")))always available, and to scale with your needs. +Elasticsearch is built to be ((("scalability, Elasticsearch and")))always available, and to scale with your needs. Scale can come from buying bigger ((("vertical scaling, Elasticsearch and")))servers (_vertical scale_, or _scaling up_) or from buying more ((("horizontal scaling, Elasticsearch and")))servers (_horizontal scale_, or _scaling out_). diff --git a/020_Distributed_Cluster/15_Add_an_index.asciidoc b/020_Distributed_Cluster/15_Add_an_index.asciidoc index 5dc7f75ef..15742fb41 100644 --- a/020_Distributed_Cluster/15_Add_an_index.asciidoc +++ b/020_Distributed_Cluster/15_Add_an_index.asciidoc @@ -1,7 +1,7 @@ === Add an Index To add data to Elasticsearch, we need an _index_—a place to store related -data.((("indexes")))((("clusters", "adding an index"))) In reality, an index is just a _logical namespace_ that points to +data.((("indices")))((("clusters", "adding an index"))) In reality, an index is just a _logical namespace_ that points to one or more physical _shards_. 
A _shard_ is a low-level _worker unit_ that holds((("shards", "defined"))) just a slice of all the @@ -37,8 +37,8 @@ serve read requests like searching or retrieving a document. The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time. -Let's create an index called `blogs` in our empty one-node cluster.((("indexes", "creating"))) By -default, indices are assigned five primary shards,((("primary shards", "assigned to indexes")))((("replica shards", "assigned to indexes"))) but for the purpose of this +Let's create an index called `blogs` in our empty one-node cluster.((("indices", "creating"))) By +default, indices are assigned five primary shards,((("primary shards", "assigned to indices")))((("replica shards", "assigned to indices"))) but for the purpose of this demonstration, we'll assign just three primary shards and one replica (one replica of every primary shard): diff --git a/020_Distributed_Cluster/20_Add_failover.asciidoc b/020_Distributed_Cluster/20_Add_failover.asciidoc index 9a4d8ef5c..328be9e60 100644 --- a/020_Distributed_Cluster/20_Add_failover.asciidoc +++ b/020_Distributed_Cluster/20_Add_failover.asciidoc @@ -1,7 +1,7 @@ === Add Failover Running a single node means that you have a single point of failure--there -is no redundancy.((("failover", "adding"))) Fortunately, all we need to do to protect ourselves from data +is no redundancy.((("failover, adding"))) Fortunately, all we need to do to protect ourselves from data loss is to start another node. .Starting a Second Node diff --git a/020_Distributed_Cluster/25_Scale_horizontally.asciidoc b/020_Distributed_Cluster/25_Scale_horizontally.asciidoc index dfddb414d..6e6158b21 100644 --- a/020_Distributed_Cluster/25_Scale_horizontally.asciidoc +++ b/020_Distributed_Cluster/25_Scale_horizontally.asciidoc @@ -1,6 +1,6 @@ === Scale Horizontally -What about scaling as the demand for our application grows?((("scaling horizontally")))((("clusters", "three-node cluster")))((("primary shards", "in three-node cluster"))) If we start a +What about scaling as the demand for our application grows?((("scaling", "horizontally")))((("clusters", "three-node cluster")))((("primary shards", "in three-node cluster"))) If we start a third node, our cluster reorganizes itself to look like <>. diff --git a/020_Distributed_Cluster/30_Scale_more.asciidoc b/020_Distributed_Cluster/30_Scale_more.asciidoc index a4c744095..c298939aa 100644 --- a/020_Distributed_Cluster/30_Scale_more.asciidoc +++ b/020_Distributed_Cluster/30_Scale_more.asciidoc @@ -2,13 +2,13 @@ But what if we want to scale our search to more than six nodes? -The number of primary shards is fixed at the moment an((("indexes", "fixed number of primary shards")))((("primary shards", "fixed number in an index"))) index is created. +The number of primary shards is fixed at the moment an((("indices", "fixed number of primary shards")))((("primary shards", "fixed number in an index"))) index is created. Effectively, that number defines the maximum amount of data that can be _stored_ in the index. (The actual number depends on your data, your hardware and your use case.) However, read requests--searches or document retrieval--can be handled by a primary _or_ a replica shard, so the more copies of data that you have, the more search throughput you can handle. 
-The number of replica shards can be changed dynamically on a live cluster, +The number of ((("scaling", "increasing number of replica shards")))replica shards can be changed dynamically on a live cluster, allowing us to scale up or down as demand requires. Let's increase the number of replicas from the default of `1` to `2`: diff --git a/020_Distributed_Cluster/35_Coping_with_failure.asciidoc b/020_Distributed_Cluster/35_Coping_with_failure.asciidoc index 3d91dc256..ae2cb6918 100644 --- a/020_Distributed_Cluster/35_Coping_with_failure.asciidoc +++ b/020_Distributed_Cluster/35_Coping_with_failure.asciidoc @@ -1,7 +1,7 @@ === Coping with Failure We've said that Elasticsearch can cope when nodes fail, so let's go -ahead and try it out. ((("shards", "horizontal scaling and safety of data")))((("failure", "coping with")))((("master node", "killing and replacing")))((("nodes", "failure of")))((("clusters", "coping with failure of nodes")))If we kill the first node, our cluster looks like +ahead and try it out. ((("shards", "horizontal scaling and safety of data")))((("failure of nodes, coping with")))((("master node", "killing and replacing")))((("nodes", "failure of")))((("clusters", "coping with failure of nodes")))If we kill the first node, our cluster looks like <>. [[cluster-post-kill]] diff --git a/030_Data/00_Intro.asciidoc b/030_Data/00_Intro.asciidoc index 6aa9575c9..a64374fe1 100644 --- a/030_Data/00_Intro.asciidoc +++ b/030_Data/00_Intro.asciidoc @@ -43,7 +43,7 @@ objects as documents, they still require us to think about how we want to query our data, and which fields require an index in order to make data retrieval fast. -In Elasticsearch, _all data in every field_ is _indexed by default_. That is, +In Elasticsearch, _all data in every field_ is _indexed by default_.((("indexing", "in Elasticsearch"))) That is, every field has a dedicated inverted index for fast retrieval. And, unlike most other databases, it can use all of those inverted indices _in the same query_, to return results at breathtaking speed. diff --git a/030_Data/05_Document.asciidoc b/030_Data/05_Document.asciidoc index 2724574b9..6aa5e7f85 100644 --- a/030_Data/05_Document.asciidoc +++ b/030_Data/05_Document.asciidoc @@ -58,13 +58,13 @@ elements are as follows: ==== _index An _index_ is like a database in a relational database; it's the place -we store and index related data.((("indexes"))) +we store and index related data.((("indices", "_index, in document metadata"))) [TIP] ==== Actually, in Elasticsearch, our data is stored and indexed in _shards_, while an index is just a logical namespace that groups together one or more -shards.((("shards", "grouped in indexes"))) However, this is an internal detail; our application shouldn't care +shards.((("shards", "grouped in indices"))) However, this is an internal detail; our application shouldn't care about shards at all. As far as our application is concerned, our documents live in an _index_. Elasticsearch takes care of the details. ==== @@ -83,7 +83,7 @@ may have a name, a gender, an age, and an email address. In a relational database, we usually store objects of the same class in the same table, because they share the same data structure. For the same reason, in -Elasticsearch we use the same _type_ for ((("types")))documents that represent the same +Elasticsearch we use the same _type_ for ((("types", "_type, in document metadata)))documents that represent the same class of _thing_, because they share the same data structure. 
Every _type_ has its own <> or schema ((("mapping (types)")))((("schema definition, types")))definition, which @@ -101,7 +101,7 @@ underscore or contain commas.((("types", "names of"))) We will use `blog` for o ==== _id -The _ID_ is a string that,((("id", "in document metadata"))) when combined with the `_index` and `_type`, +The _ID_ is a string that,((("id", "_id, in document metadata"))) when combined with the `_index` and `_type`, uniquely identifies a document in Elasticsearch. When creating a new document, you can either provide your own `_id` or let Elasticsearch generate one for you. diff --git a/030_Data/10_Index.asciidoc b/030_Data/10_Index.asciidoc index e758bc513..190970365 100644 --- a/030_Data/10_Index.asciidoc +++ b/030_Data/10_Index.asciidoc @@ -1,7 +1,7 @@ [[index-doc]] === Indexing a Document -Documents are _indexed_—stored and made ((("documents", "indexing")))((("indexes", "indexing a document")))searchable--by using the `index` +Documents are _indexed_—stored and made ((("documents", "indexing")))((("indexing", "a document")))searchable--by using the `index` API. But first, we need to decide where the document lives. As we just discussed, a document's `_index`, `_type`, and `_id` uniquely identify the document. We can either provide our own `_id` value or let the `index` API @@ -65,7 +65,7 @@ made by another part. ==== Autogenerating IDs If our data doesn't have a natural ID, we can let Elasticsearch autogenerate -one for us. ((("id", "autogenerating")))The structure of the request changes: instead of using ((("HTP methods", "POST")))the `PUT` +one for us. ((("id", "autogenerating")))The structure of the request changes: instead of using ((("HTTP methods", "POST")))((("POST method")))the `PUT` verb (``store this document at this URL''), we use the `POST` verb (``store this document _under_ this URL''). The URL now contains just the `_index` and the `_type`: diff --git a/030_Data/15_Get.asciidoc b/030_Data/15_Get.asciidoc index 0c065e5a7..ff8edbbc7 100644 --- a/030_Data/15_Get.asciidoc +++ b/030_Data/15_Get.asciidoc @@ -11,7 +11,7 @@ GET /website/blog/123?pretty // SENSE: 030_Data/15_Get_document.json -The response includes the by-now-familiar metadata elements, plus ((("source field")))the `_source` +The response includes the by-now-familiar metadata elements, plus ((("_source field", sortas="source field")))the `_source` field, which contains the original JSON document that we sent to Elasticsearch when we indexed it: @@ -40,7 +40,7 @@ Instead we get back exactly the same JSON string that we passed in. ==== The response to the +GET+ request includes `{"found": true}`. This confirms that -the document was found. ((("documents", "requesting non-existant document")))If we were to request a document that doesn't exist, +the document was found. ((("documents", "requesting non-existent document")))If we were to request a document that doesn't exist, we would still get a JSON response, but `found` would be set to `false`. Also, the HTTP response code would be `404 Not Found` instead of `200 OK`. 
diff --git a/030_Data/20_Exists.asciidoc b/030_Data/20_Exists.asciidoc index a01ad2bca..da3826b45 100644 --- a/030_Data/20_Exists.asciidoc +++ b/030_Data/20_Exists.asciidoc @@ -1,8 +1,8 @@ [[doc-exists]] === Checking Whether a Document Exists -If all you want to do is to check whether a document exists--you're not -interested in the content at all--then use the `HEAD` method instead +If all you want to do is to check whether a ((("documents", "checking whether a document exists")))document exists--you're not +interested in the content at all--then use((("HEAD method")))((("HTTP methods", "HEAD"))) the `HEAD` method instead of the `GET` method. `HEAD` requests don't return a body, just HTTP headers: [source,js] diff --git a/030_Data/25_Update.asciidoc b/030_Data/25_Update.asciidoc index 5df46e76c..ae13d16a7 100644 --- a/030_Data/25_Update.asciidoc +++ b/030_Data/25_Update.asciidoc @@ -2,7 +2,7 @@ === Updating a Whole Document Documents in Elasticsearch are _immutable_; we cannot change them.((("documents", "updating whole document")))((("updating documents", "whole document"))) Instead, if -we need to update an existing document, we _reindex_ or replace it, which we +we need to update an existing document, we _reindex_ or replace it,((("reindexing")))((("indexing", seealso="reindexing"))) which we can do using the same `index` API that we have already discussed in <>. diff --git a/030_Data/30_Create.asciidoc b/030_Data/30_Create.asciidoc index 7dc0eb886..c7052787a 100644 --- a/030_Data/30_Create.asciidoc +++ b/030_Data/30_Create.asciidoc @@ -7,7 +7,7 @@ new document and not overwriting an existing one? Remember that the combination of `_index`, `_type`, and `_id` uniquely identifies a document. So the easiest way to ensure that our document is new is by letting Elasticsearch autogenerate a new unique `_id`, using the `POST` -version of ((("HTTP methods", "POST")))the index request: +version of ((("POST method")))((("HTTP methods", "POST")))the index request: [source,js] -------------------------------------------------- @@ -21,7 +21,7 @@ the same `_index`, `_type`, and `_id` doesn't exist already. There are two ways of doing this, both of which amount to the same thing. Use whichever method is more convenient for you. 
-The first method uses the `op_type` query((("query strings", "op_type parameter")))((("op_type query string parameter")))-string parameter: +The first method uses the `op_type` query((("PUT method")))((("HTTP methods", "PUT")))((("query strings", "op_type parameter")))((("op_type query string parameter")))-string parameter: [source,js] -------------------------------------------------- diff --git a/030_Data/35_Delete.asciidoc b/030_Data/35_Delete.asciidoc index 36cae1759..6216df426 100644 --- a/030_Data/35_Delete.asciidoc +++ b/030_Data/35_Delete.asciidoc @@ -2,7 +2,7 @@ === Deleting a Document The syntax for deleting a document((("documents", "deleting"))) follows the same pattern that we have seen -already, but ((("deleting documents")))((("HTTP methods", "DELETE")))uses the `DELETE` method : +already, but ((("DELETE method", "deleting documents")))((("HTTP methods", "DELETE")))uses the `DELETE` method : [source,js] -------------------------------------------------- diff --git a/030_Data/55_Bulk.asciidoc b/030_Data/55_Bulk.asciidoc index fa6b998d2..20845328f 100644 --- a/030_Data/55_Bulk.asciidoc +++ b/030_Data/55_Bulk.asciidoc @@ -19,7 +19,7 @@ The `bulk` request body has the following, slightly unusual, format: -------------------------------------------------- This format is like a _stream_ of valid one-line JSON documents joined -together by newline (`\n`) characters.((("\n (newline) characters in buk requests", sortas="n (newline)"))) Two important points to note: +together by newline (`\n`) characters.((("\n (newline) characters in bulk requests", sortas="n (newline)"))) Two important points to note: * Every line must end with a newline character (`\n`), _including the last line_. These are used as markers to allow for efficient line separation. @@ -194,7 +194,7 @@ succeeded: <3> The error message explaining why the request failed. <4> The second request succeeded with an HTTP status code of `200 OK`. -That also means ((("bulk API", "bulk requests, not atomic")))that `bulk` requests are not atomic: they cannot be used to +That also means ((("bulk API", "bulk requests, not transactions")))that `bulk` requests are not atomic: they cannot be used to implement transactions. Each request is processed separately, so the success or failure of one request will not interfere with the others. diff --git a/040_Distributed_CRUD/05_Routing.asciidoc b/040_Distributed_CRUD/05_Routing.asciidoc index d79807931..9625a459f 100644 --- a/040_Distributed_CRUD/05_Routing.asciidoc +++ b/040_Distributed_CRUD/05_Routing.asciidoc @@ -1,7 +1,7 @@ [[routing-value]] === Routing a Document to a Shard -When you index a document, it is stored on a single primary shard.((("shards", "routing a document to")))((("routing a document to a shard"))) How does +When you index a document, it is stored on a single primary shard.((("shards", "routing a document to")))((("documents", "routing a document to a shard")))((("routing a document to a shard"))) How does Elasticsearch know which shard a document belongs to? When we create a new document, how does it know whether it should store that document on shard 1 or shard 2? 
diff --git a/040_Distributed_CRUD/15_Create_index_delete.asciidoc b/040_Distributed_CRUD/15_Create_index_delete.asciidoc index b70083237..79f2b0983 100644 --- a/040_Distributed_CRUD/15_Create_index_delete.asciidoc +++ b/040_Distributed_CRUD/15_Create_index_delete.asciidoc @@ -35,7 +35,7 @@ are explained here for the sake of completeness: `replication`:: + -- -The default value for ((("replication request parmeter", "sync and async values")))replication is `sync`. This causes the primary shard to +The default value for ((("replication request parmeter", "sync and async values")))replication is `sync`. ((("sync value, replication parameter")))This causes the primary shard to wait for successful responses from the replica shards before returning. If you set `replication` to `async`,((("async value, replication parameter"))) it will return success to the client @@ -54,7 +54,7 @@ completion. `consistency`:: + -- -By default, the primary shard((("consistency request parameter")))((("quorum of shard copies"))) requires a _quorum_, or majority, of shard copies +By default, the primary shard((("consistency request parameter")))((("quorum"))) requires a _quorum_, or majority, of shard copies (where a shard copy can be a primary or a replica shard) to be available before even attempting a write operation. This is to prevent writing data to the ``wrong side'' of a network partition. A quorum is defined as follows: diff --git a/04_Geolocation.asciidoc b/04_Geolocation.asciidoc index bacd4bc32..acbb46d4c 100644 --- a/04_Geolocation.asciidoc +++ b/04_Geolocation.asciidoc @@ -20,7 +20,7 @@ rating, distance, and price. Another example: show me a map of vacation rental properties available in August throughout the city, and calculate the average price per zone. -Elasticsearch offers two ways of ((("Elastisearch", "representing geolocations")))representing geolocations: latitude-longitude +Elasticsearch offers two ways of ((("Elasticsearch", "representing geolocations")))representing geolocations: latitude-longitude points using the `geo_point` field type,((("geo_point field type"))) and complex shapes defined in http://en.wikipedia.org/wiki/GeoJSON[GeoJSON], using the `geo_shape` field type.((("geo_shape field type"))) diff --git a/050_Search/00_Intro.asciidoc b/050_Search/00_Intro.asciidoc index cf9735b5a..95297f413 100644 --- a/050_Search/00_Intro.asciidoc +++ b/050_Search/00_Intro.asciidoc @@ -10,7 +10,7 @@ This is the reason that we use structured JSON documents, rather than amorphous blobs of data. Elasticsearch not only _stores_ the document, but also _indexes_ the content of the document in order to make it searchable. -_Every field in a document is indexed and can be queried_. And it's not just +_Every field in a document is indexed and can be queried_. ((("indexing"))) And it's not just that. During a single query, Elasticsearch can use _all_ of these indices, to return results at breath-taking speed. That's something that you could never consider doing with a traditional database. 
@@ -26,7 +26,7 @@ A _search_ can be any of the following: * A combination of the two -While many searches will just work out of((("full-text search"))) the box, to use Elasticsearch to +While many searches will just work out of((("full text search"))) the box, to use Elasticsearch to its full potential, you need to understand three subjects: _Mapping_:: diff --git a/050_Search/05_Empty_search.asciidoc b/050_Search/05_Empty_search.asciidoc index ec878f0ab..25cb69a86 100644 --- a/050_Search/05_Empty_search.asciidoc +++ b/050_Search/05_Empty_search.asciidoc @@ -69,13 +69,13 @@ query.((("max_score value"))) ==== took -The `took` value((("took value", "time taken for empty search"))) tells us how many milliseconds the entire search request took +The `took` value((("took value (empty search)"))) tells us how many milliseconds the entire search request took to execute. ==== shards The `_shards` element((("shards", "number involved in an empty search"))) tells us the `total` number of shards that were involved -in the query and,((("failed shards (in a search)")))((("successful shards", "in a search"))) of them, how many were `successful` and how many `failed`. +in the query and,((("failed shards (in a search)")))((("successful shards (in a search)"))) of them, how many were `successful` and how many `failed`. We wouldn't normally expect shards to fail, but it can happen. If we were to suffer a major disaster in which we lost both the primary and the replica copy of the same shard, there would be no copies of that shard available to respond diff --git a/050_Search/20_Query_string.asciidoc b/050_Search/20_Query_string.asciidoc index 63e807154..4b15e969b 100644 --- a/050_Search/20_Query_string.asciidoc +++ b/050_Search/20_Query_string.asciidoc @@ -2,7 +2,7 @@ === Search _Lite_ There are two forms of the `search` API: a ``lite'' _query-string_ version -that expects all its((("searching", "query string searches")))((("query string", "searching with"))) parameters to be passed in the query string, and the full +that expects all its((("searching", "query string searches")))((("query strings", "searching with"))) parameters to be passed in the query string, and the full _request body_ version that expects a JSON request body and uses a rich search language called the query DSL. @@ -115,7 +115,7 @@ readable result: -------------------------------------------------- As you can see from the preceding examples, this _lite_ query-string search is -surprisingly powerful. Its query syntax, which is explained in detail in the +surprisingly powerful.((("query strings", "syntax, reference for"))) Its query syntax, which is explained in detail in the http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax[Query String Syntax] reference docs, allows us to express quite complex queries succinctly. This makes it great for throwaway queries from the command line or during diff --git a/052_Mapping_Analysis/25_Data_type_differences.asciidoc b/052_Mapping_Analysis/25_Data_type_differences.asciidoc index 5460c0524..0a06f4276 100644 --- a/052_Mapping_Analysis/25_Data_type_differences.asciidoc +++ b/052_Mapping_Analysis/25_Data_type_differences.asciidoc @@ -65,7 +65,7 @@ This gives us the following: Elasticsearch has dynamically generated a mapping for us, based on what it could guess about our field types. The response shows us that the `date` field -has been recognized as a field of type `date`. 
((("&#x5f;all field", sortas="all field")))The `_all` field isn't +has been recognized as a field of type `date`. ((("_all field", sortas="all field")))The `_all` field isn't mentioned because it is a default field, but we know that the `_all` field is of type `string`.((("string fields"))) diff --git a/052_Mapping_Analysis/35_Inverted_index.asciidoc b/052_Mapping_Analysis/35_Inverted_index.asciidoc index 50de1577f..e73bbbfba 100644 --- a/052_Mapping_Analysis/35_Inverted_index.asciidoc +++ b/052_Mapping_Analysis/35_Inverted_index.asciidoc @@ -1,7 +1,7 @@ [[inverted-index]] === Inverted Index -Elasticsearch uses a structure called((("inverted index"))) an _inverted index_, which is designed +Elasticsearch uses a structure called((("inverted index", id="ix_invertidx", range="startofrange"))) an _inverted index_, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears. @@ -48,7 +48,7 @@ documents in which each term appears: Total | 2 | 1 Both documents match, but the first document has more matches than the second. -If we apply a naive _similarity algorithm_ that((("similarity algorithm"))) just counts the number of +If we apply a naive _similarity algorithm_ that((("similarity algorithms"))) just counts the number of matching terms, then we can say that the first document is a better match--is _more relevant_ to our query--than the second document. But there are a few problems with our current inverted index: @@ -110,4 +110,4 @@ index, so _both the indexed text and the query string must be normalized into the same form_. This process of tokenization and normalization is called _analysis_, which we -discuss in the next section. +discuss in the next section.((("inverted index", range="endofrange", startref="ix_invertidx"))) diff --git a/052_Mapping_Analysis/40_Analysis.asciidoc b/052_Mapping_Analysis/40_Analysis.asciidoc index 5328cf12c..0f93e8c4f 100644 --- a/052_Mapping_Analysis/40_Analysis.asciidoc +++ b/052_Mapping_Analysis/40_Analysis.asciidoc @@ -89,7 +89,7 @@ their root form. ==== When Analyzers Are Used When we _index_ a document, its full-text fields are analyzed into terms that -are used to create the inverted index. However, when we _search_ on a full-text field, we need to pass the query string through the _same analysis +are used to create the inverted index.((("indexing", "analyzers, use on full text fields"))) However, when we _search_ on a full-text field, we need to pass the query string through the _same analysis process_, to ensure that we are searching for terms in the same form as those that exist in the index. diff --git a/052_Mapping_Analysis/45_Mapping.asciidoc b/052_Mapping_Analysis/45_Mapping.asciidoc index 0d7b70c89..d1d9a0ea9 100644 --- a/052_Mapping_Analysis/45_Mapping.asciidoc +++ b/052_Mapping_Analysis/45_Mapping.asciidoc @@ -18,7 +18,7 @@ to look at just enough to get you started. 
[[core-fields]] ==== Core Simple Field Types -Elasticsearch supports the ((("types", "core simple field types")))following simple field types: +Elasticsearch supports the ((("fields", "core simple types")))((("types", "core simple field types")))following simple field types: [horizontal] * String: `string` @@ -28,7 +28,7 @@ Elasticsearch supports the ((("types", "core simple field types")))following sim * Date: `date` When you index a document that contains a new field--one previously not -seen--Elasticsearch ((("types", "mapping for", "dynamic mapping of new types")))((("JSON", "datatypes", "simple core types")))((("dynamic mapping")))((("boolean type")))((("long type")))((("double type")))((("date type")))((("strings")))will use <> to try +seen--Elasticsearch ((("types", "mapping for", "dynamic mapping of new types")))((("JSON", "datatypes", "simple core types")))((("dynamic mapping")))((("boolean type")))((("long type")))((("double type")))((("date type")))((("strings", "sring type")))will use <> to try to guess the field type from the basic datatypes available in JSON, using the following rules: @@ -104,8 +104,8 @@ Instead of assuming that your mapping is correct, check it! [[custom-field-mappings]] ==== Customizing Field Mappings -While the basic field datatypes are ((("mapping (types)", "customizing field mappings")))sufficient for many cases, you will often -need to customize the mapping for individual fields, especially string fields. +While the basic field datatypes are ((("mapping (types)", "customizing field mappings")))((("fields", "customizing field mappings")))sufficient for many cases, you will often +need to customize the mapping ((("string fields", "customized mappings")))for individual fields, especially string fields. Custom mappings allow you to do the following: * Distinguish between full-text string fields and exact value string fields @@ -133,7 +133,7 @@ That is, their value will be passed through((("analyzers", "string values passed and a full-text query on the field will pass the query string through an analyzer before searching. -The two most important mapping((("strings", "mapping attributes, index and analyzer"))) attributes for `string` fields are +The two most important mapping((("string fields", "mapping attributes, index and analyzer"))) attributes for `string` fields are `index` and `analyzer`. ===== index @@ -173,7 +173,7 @@ as their values are never analyzed. ===== analyzer -For `analyzed` string fields, use ((("analyzer attribute, strings, using to specify analyzer")))the `analyzer` attribute to +For `analyzed` string fields, use ((("analyzer attribute, string fields")))the `analyzer` attribute to specify which analyzer to apply both at search time and at index time. By default, Elasticsearch uses the `standard` analyzer,((("standard analyzer", "specifying another analyzer for strings"))) but you can change this by specifying one of the built-in analyzers, such((("english analyzer"))) as diff --git a/052_Mapping_Analysis/50_Complex_datatypes.asciidoc b/052_Mapping_Analysis/50_Complex_datatypes.asciidoc index 709f05ebe..bf39512ce 100644 --- a/052_Mapping_Analysis/50_Complex_datatypes.asciidoc +++ b/052_Mapping_Analysis/50_Complex_datatypes.asciidoc @@ -62,7 +62,7 @@ The last native JSON datatype that we need to ((("objects")))discuss is the _obj -- known in other languages as a hash, hashmap, dictionary or associative array. 
-_Inner objects_ are often used((("inner objects"))) to embed one entity or object inside +_Inner objects_ are often used((("objects", "inner objects")))((("inner objects"))) to embed one entity or object inside another. For instance, instead of having fields called `user_name` and `user_id` inside our `tweet` document, we could write it as follows: diff --git a/054_Query_DSL/65_Queries_vs_filters.asciidoc b/054_Query_DSL/65_Queries_vs_filters.asciidoc index 8d79d8b73..616b6157e 100644 --- a/054_Query_DSL/65_Queries_vs_filters.asciidoc +++ b/054_Query_DSL/65_Queries_vs_filters.asciidoc @@ -1,7 +1,7 @@ === Queries and Filters Although we refer to the query DSL, in reality there are two DSLs: the -query DSL and the filter DSL.((("DSL (Domain Specific Language)", "Filter DSL")))((("Filter DSL"))) Query clauses and filter clauses are similar +query DSL and the filter DSL.((("DSL (Domain Specific Language)", "Query and Filter DSL")))((("Filter DSL"))) Query clauses and filter clauses are similar in nature, but have slightly different purposes. A _filter_ asks a yes|no question of((("filters", "queries versus")))((("exact values", "filters with yes|no questions for fields containing"))) every document and is used diff --git a/054_Query_DSL/70_Important_clauses.asciidoc b/054_Query_DSL/70_Important_clauses.asciidoc index 8ed00fded..ff6050867 100644 --- a/054_Query_DSL/70_Important_clauses.asciidoc +++ b/054_Query_DSL/70_Important_clauses.asciidoc @@ -87,7 +87,7 @@ present, and to apply a different condition if it is missing. ==== bool Filter The `bool` filter is used ((("bool filter")))((("must clause", "in bool filters")))((("must_not clause", "in bool filters")))((("should clause", "in bool filters")))to combine multiple filter clauses using -Boolean logic. It accepts three parameters: +Boolean logic. ((("bool filter", "must, must_not, and should clauses"))) It accepts three parameters: `must`:: These clauses _must_ match, like `and`. @@ -172,7 +172,7 @@ it is not prone to throwing syntax errors. ==== multi_match Query -The `multi_match` query allows((("multi_match query"))) to run the same `match` query on multiple +The `multi_match` query allows((("multi_match queries"))) to run the same `match` query on multiple fields: [source,js] @@ -192,7 +192,7 @@ The `bool` query, like the `bool` filter,((("bool query"))) is used to combine m query clauses. However, there are some differences. Remember that while filters give binary yes/no answers, queries calculate a relevance score instead. The `bool` query combines the `_score` from each `must` or -`should` clause that matches.((("should clause", "in bool queries")))((("must_not clause", "in bool queries")))((("must clause", "in bool queries"))) This query accepts the following parameters: +`should` clause that matches.((("bool query", "must, must_not, and should clauses")))((("should clause", "in bool queries")))((("must_not clause", "in bool queries")))((("must clause", "in bool queries"))) This query accepts the following parameters: `must`:: Clauses that _must_ match for the document to be included. diff --git a/054_Query_DSL/75_Queries_with_filters.asciidoc b/054_Query_DSL/75_Queries_with_filters.asciidoc index b12de6c4e..ed81a4007 100644 --- a/054_Query_DSL/75_Queries_with_filters.asciidoc +++ b/054_Query_DSL/75_Queries_with_filters.asciidoc @@ -19,7 +19,7 @@ to achieve your goal in the most efficient way. 
[[filtered-query]] ==== Filtering a Query -Let's say we have((("filters", "combining with queries", "filtering a query"))) this query: +Let's say we have((("queries", "combining with filters", "filtering a query")))((("filters", "combining with queries", "filtering a query"))) this query: [source,js] -------------------------------------------------- @@ -69,7 +69,7 @@ GET /_search [role="pagebreak-before"] ==== Just a Filter -While in query context, if ((("filters", "combining with queries", "using just a filter in query context")))you need to use a filter without a query (for +While in query context, if ((("filters", "combining with queries", "using just a filter in query context")))((("queries", "combining with filters", "using just a filter in query context")))you need to use a filter without a query (for instance, to match all emails in the inbox), you can just omit the query: diff --git a/054_Query_DSL/80_Validating_queries.asciidoc b/054_Query_DSL/80_Validating_queries.asciidoc index 33c70c66c..fae28d722 100644 --- a/054_Query_DSL/80_Validating_queries.asciidoc +++ b/054_Query_DSL/80_Validating_queries.asciidoc @@ -91,7 +91,7 @@ GET /_validate/query?explain -------------------------------------------------- // SENSE: 054_Query_DSL/80_Understanding_queries.json -An `explanation` is returned for each index that we query, because each +An `explanation` is returned for each index ((("indices", "explanation for each index queried")))that we query, because each index can have different mappings and analyzers: [source,js] diff --git a/056_Sorting/85_Sorting.asciidoc b/056_Sorting/85_Sorting.asciidoc index 682f4ad13..28d1e7977 100644 --- a/056_Sorting/85_Sorting.asciidoc +++ b/056_Sorting/85_Sorting.asciidoc @@ -33,7 +33,7 @@ GET /_search } -------------------------------------------------- -Filters have no bearing on `_score`, and the((("match_all query", "score as neutral 1")))((("filters", "score and"))) missing-but-implied `match_all` +Filters have no bearing on `_score`, and the((("score", seealso="relevance; relevance scores")))((("match_all query", "score as neutral 1")))((("filters", "score and"))) missing-but-implied `match_all` query just sets the `_score` to a neutral value of `1` for all documents. In other words, all documents are considered to be equally relevant. @@ -81,7 +81,7 @@ You will notice two differences in the results: <2> The value of the `date` field, expressed as milliseconds since the epoch, is returned in the `sort` values. -The first is that we have ((("date field", "sorting search results by")))a new element in each result called `sort`, which +The first is that we have ((("date field, sorting search results by")))a new element in each result called `sort`, which contains the value(s) that was used for sorting. In this case, we sorted on `date`, which internally is((("milliseconds-since-the-epoch (date)"))) indexed as _milliseconds since the epoch_. The long number `1411516800000` is equivalent to the date string `2014-09-24 00:00:00 @@ -108,7 +108,7 @@ the `_score` value in descending order. 
==== Multilevel Sorting -Perhaps we want to combine the `_score` from a((("sorting", "multi-level")))((("multi-level sorting"))) query with the `date`, and +Perhaps we want to combine the `_score` from a((("sorting", "multilevel")))((("multilevel sorting"))) query with the `date`, and show all matching results sorted first by date, then by relevance: [source,js] @@ -150,12 +150,12 @@ GET /_search?sort=date:desc&sort=_score&q=search ==== Sorting on Multivalue Fields -When sorting on fields with more than one value,((("sorting", "on multi-value fields")))((("fields", "multi-value", "sorting on"))) remember that the values do +When sorting on fields with more than one value,((("sorting", "on multivalue fields")))((("fields", "multivalue", "sorting on"))) remember that the values do not have any intrinsic order; a multivalue field is just a bag of values. Which one do you choose to sort on? For numbers and dates, you can reduce a multivalue field to a single value -by using the `min`, `max`, `avg`, or `sum` _sort modes_. ((("sum sort mode")))((("avg sort mode")))((("max sort mode")))((("min sort mode")))((("sort modes")))((("dates field", "sorting on earliest value")))For instance, you +by using the `min`, `max`, `avg`, or `sum` _sort modes_. ((("sum sort mode")))((("avg sort mode")))((("max sort mode")))((("min sort mode")))((("sort modes")))((("dates field, sorting on earliest value")))For instance, you could sort on the earliest date in each `dates` field by using the following: [role="pagebreak-before"] diff --git a/056_Sorting/88_String_sorting.asciidoc b/056_Sorting/88_String_sorting.asciidoc index 01700551e..8f57c4ad8 100644 --- a/056_Sorting/88_String_sorting.asciidoc +++ b/056_Sorting/88_String_sorting.asciidoc @@ -1,7 +1,7 @@ [[multi-fields]] === String Sorting and Multifields -Analyzed string fields are also multivalue fields,((("strings", "sorting on string fields")))((("analyzed fields", "string fields"))) but sorting on them seldom +Analyzed string fields are also multivalue fields,((("strings", "sorting on string fields")))((("analyzed fields", "string fields")))((("sorting", "string sorting and multifields"))) but sorting on them seldom gives you the results you want. If you analyze a string like `fine old art`, it results in three terms. We probably want to sort alphabetically on the first term, then the second term, and so forth, but Elasticsearch doesn't have this @@ -21,7 +21,7 @@ and one that is `not_analyzed` for sorting. But storing the same string twice in the `_source` field is waste of space. What we really want to do is to pass in a _single field_ but to _index it in two different ways_. 
All of the _core_ field types (strings, numbers, -Booleans, dates) accept a `fields` parameter ((("mapping (types)", "transforming simple mapping to multi-field mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multi-field mapping")))that allows you to transform a +Booleans, dates) accept a `fields` parameter ((("mapping (types)", "transforming simple mapping to multifield mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multifield mapping")))that allows you to transform a simple mapping like [source,js] diff --git a/056_Sorting/90_What_is_relevance.asciidoc b/056_Sorting/90_What_is_relevance.asciidoc index d514ab4e6..d7f0ebd04 100644 --- a/056_Sorting/90_What_is_relevance.asciidoc +++ b/056_Sorting/90_What_is_relevance.asciidoc @@ -4,20 +4,20 @@ We've mentioned that, by default, results are returned in descending order of relevance.((("relevance", "defined"))) But what is relevance? How is it calculated? -The relevance score of each document is represented by a positive floating-point number called the `_score`. The higher the `_score`, the more relevant +The relevance score of each document is represented by a positive floating-point number called the `_score`.((("score", "calculation of"))) The higher the `_score`, the more relevant the document. A query clause generates a `_score` for each document. How that score is -calculated depends on the type of query clause. Different query clauses are +calculated depends on the type of query clause.((("fuzzy queries", "calculation of relevence score"))) Different query clauses are used for different purposes: a `fuzzy` query might determine the `_score` by calculating how similar the spelling of the found word is to the original search term; a `terms` query would incorporate the percentage of terms that were found. However, what we usually mean by _relevance_ is the algorithm that we use to calculate how similar the contents of a full-text field are to a full-text query string. -The standard _similarity algorithm_ used in Elasticsearch is((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm")))((("similarity algorithm", "Term Frequency/Inverse Document Frequency (TF/IDF)"))) known as _term +The standard _similarity algorithm_ used in Elasticsearch is((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm")))((("similarity algorithms", "Term Frequency/Inverse Document Frequency (TF/IDF)"))) known as _term frequency/inverse document frequency_, or _TF/IDF_, which takes the following -factors into account: +factors into((("inverse document frequency"))) account: Term frequency:: @@ -37,7 +37,7 @@ Field-length norm:: the field will be relevant. A term appearing in a short `title` field carries more weight than the same term appearing in a long `content` field. -Individual queries may combine the TF/IDF score with other factors +Individual ((("field-length norm")))queries may combine the TF/IDF score with other factors such as the term proximity in phrase queries, or term similarity in fuzzy queries. @@ -74,7 +74,7 @@ GET /_search?explain <1> [NOTE] ==== -Adding `explain` produces a lot of output for every hit, which can look +Adding `explain` produces a lot((("explain parameter", "for relevance score calculation"))) of output for every hit, which can look overwhelming, but it is worth taking the time to understand what it all means. 
Don't worry if it doesn't all make sense now; you can refer to this section when you need it. We'll work through the output for one `hit` bit by bit. @@ -153,7 +153,7 @@ The first part is the summary of the calculation. It tells us that it has calculated the _weight_—the ((("weight", "calculation of")))((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "weight calculation for a term")))TF/IDF--of the term `honeymoon` in the field `tweet`, for document `0`. (This is an internal document ID and, for our purposes, can be ignored.) -It then provides details of how the weight was calculated: +It then provides details((("field-length norm")))((("inverse document frequency"))) of how the weight was calculated: Term frequency:: @@ -178,7 +178,7 @@ results appear in the order that they do. [TIP] ================================================================== The output from `explain` can be difficult to read in JSON, but it is easier -when it is formatted as YAML.((("explain parameter", "formatting output in YAML")))((("YAML", "formatting explain output in"))) Just add `format=yaml` to the query string. +when it is formatted as YAML.((("explain parameter", "formatting output in YAML")))((("YAML, formatting explain output in"))) Just add `format=yaml` to the query string. ================================================================== @@ -187,7 +187,7 @@ when it is formatted as YAML.((("explain parameter", "formatting output in YAML" While the `explain` option adds an explanation for every result, you can use the `explain` API to understand why one particular document matched or, more -important, why it _didn't_ match.((("relevance", "understanding why a document matched")))((("explain API", "understanding why a document matched"))) +important, why it _didn't_ match.((("relevance", "understanding why a document matched")))((("explain API, understanding why a document matched"))) The path for the request is `/index/type/id/_explain`, as in the following: diff --git a/056_Sorting/95_Fielddata.asciidoc b/056_Sorting/95_Fielddata.asciidoc index 99f657d1b..10ba6a947 100644 --- a/056_Sorting/95_Fielddata.asciidoc +++ b/056_Sorting/95_Fielddata.asciidoc @@ -7,7 +7,7 @@ important topic that we will refer to repeatedly, and is something that you should be aware of.((("fielddata"))) When you sort on a field, Elasticsearch needs access to the value of that -field for every document that matches the query. The inverted index, which +field for every document that matches the query.((("inverted index", "sorting and"))) The inverted index, which performs very well when searching, is not the ideal structure for sorting on field values: diff --git a/060_Distributed_Search/05_Query_phase.asciidoc b/060_Distributed_Search/05_Query_phase.asciidoc index ec9b88c4b..5690abd6c 100644 --- a/060_Distributed_Search/05_Query_phase.asciidoc +++ b/060_Distributed_Search/05_Query_phase.asciidoc @@ -49,7 +49,7 @@ set that it can return to the client. The first step is to broadcast the request to a shard copy of every node in the index. Just like <>, search requests -can be handled by a primary shard or by any of its replicas.((("shards", "handling serch requests"))) This is how more +can be handled by a primary shard or by any of its replicas.((("shards", "handling search requests"))) This is how more replicas (when combined with more hardware) can increase search throughput. A coordinating node will round-robin through all shard copies on subsequent requests in order to spread the load. 
diff --git a/070_Index_Mgmt/05_Create_Delete.asciidoc b/070_Index_Mgmt/05_Create_Delete.asciidoc index 84515648d..d640a61e9 100644 --- a/070_Index_Mgmt/05_Create_Delete.asciidoc +++ b/070_Index_Mgmt/05_Create_Delete.asciidoc @@ -1,6 +1,6 @@ === Creating an Index -Until now, we have created a new index((("indexes", "creating"))) by simply indexing a document into it. The index is created with the default settings, and new fields are added to the type mapping by using dynamic mapping. Now we need more control over the process: we want to ensure that the index has been created with the appropriate number of primary shards, and that analyzers and mappings are set up _before_ we index any data. +Until now, we have created a new index((("indices", "creating"))) by simply indexing a document into it. The index is created with the default settings, and new fields are added to the type mapping by using dynamic mapping. Now we need more control over the process: we want to ensure that the index has been created with the appropriate number of primary shards, and that analyzers and mappings are set up _before_ we index any data. To do this, we have to create the index manually, passing in any settings or type mappings in the request body, as follows: @@ -19,7 +19,7 @@ PUT /my_index -------------------------------------------------- -In fact, if you want to, you ((("indexes", "preventing automatic creation of")))can prevent the automatic creation of indices by +In fact, if you want to, you ((("indices", "preventing automatic creation of")))can prevent the automatic creation of indices by adding the following setting to the `config/elasticsearch.yml` file on each node: @@ -39,7 +39,7 @@ existence. === Deleting an Index -To delete an index, use ((("HTTP methods", "DELETE")))((("DELETE method", "deleting indexes")))((("indexes", "deleting")))the following request: +To delete an index, use ((("HTTP methods", "DELETE")))((("DELETE method", "deleting indices")))((("indices", "deleting")))the following request: [source,js] -------------------------------------------------- diff --git a/070_Index_Mgmt/15_Configure_Analyzer.asciidoc b/070_Index_Mgmt/15_Configure_Analyzer.asciidoc index 0b725bb2b..eb6617435 100644 --- a/070_Index_Mgmt/15_Configure_Analyzer.asciidoc +++ b/070_Index_Mgmt/15_Configure_Analyzer.asciidoc @@ -27,7 +27,7 @@ parameter.((("stopwords parameter"))) Either provide a list of stopwords or tell stopwords list from a particular language. In the following example, we create a new analyzer called the `es_std` -analyzer, which uses the predefined list of ((("Spanish stopwords, analyzer using")))Spanish stopwords: +analyzer, which uses the predefined list of ((("Spanish", "analyzer using Spanish stopwords")))Spanish stopwords: [source,js] -------------------------------------------------- diff --git a/070_Index_Mgmt/20_Custom_Analyzers.asciidoc b/070_Index_Mgmt/20_Custom_Analyzers.asciidoc index bcf947c88..c8af4ddd8 100644 --- a/070_Index_Mgmt/20_Custom_Analyzers.asciidoc +++ b/070_Index_Mgmt/20_Custom_Analyzers.asciidoc @@ -7,7 +7,7 @@ by combining character filters, tokenizers, and token filters in a configuration that suits your particular data. 
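As a minimal sketch of such a custom analyzer (the name `my_custom_analyzer` and the particular built-in components chosen here are illustrative assumptions):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase", "stop" ]
                }
            }
        }
    }
}
--------------------------------------------------

The three stages run in the order listed: character filters first, then the tokenizer, then the token filters.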
In <>, we said that an _analyzer_ is a wrapper that combines -three functions into a single package,((("analyzers", "functions executed in sequence"))) which are executed in sequence: +three functions into a single package,((("analyzers", "character filters, tokenizers, and token filters in"))) which are executed in sequence: Character filters:: + @@ -38,7 +38,7 @@ outputs exactly((("keyword tokenizer"))) the same string as it received, without http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-whitespace-tokenizer.html[`whitespace` tokenizer] splits text((("whitespace tokenizer"))) on whitespace only. The http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html[`pattern` tokenizer] can -be used to split text on a matching regular expression. +be used to split text on a ((("pattern tokenizer")))matching regular expression. -- Token filters:: diff --git a/070_Index_Mgmt/45_Default_Mapping.asciidoc b/070_Index_Mgmt/45_Default_Mapping.asciidoc index 0c1c7d109..9367e1f0c 100644 --- a/070_Index_Mgmt/45_Default_Mapping.asciidoc +++ b/070_Index_Mgmt/45_Default_Mapping.asciidoc @@ -8,7 +8,7 @@ instead of having to repeat yourself every time you create a new type. The _after_ the `_default_` mapping will include all of these default settings, unless explicitly overridden in the type mapping itself. -For instance, we can disable the `_all` field for all types,((("all field", "disabling for all types using defaul mapping"))) using the +For instance, we can disable the `_all` field for all types,((("_all field", "disabling for all types using default mapping", sortas="all field"))) using the `_default_` mapping, but enable it just for the `blog` type, as follows: [source,js] diff --git a/075_Inside_a_shard/30_Dynamic_indices.asciidoc b/075_Inside_a_shard/30_Dynamic_indices.asciidoc index d687cad43..a6fc06562 100644 --- a/075_Inside_a_shard/30_Dynamic_indices.asciidoc +++ b/075_Inside_a_shard/30_Dynamic_indices.asciidoc @@ -1,7 +1,7 @@ [[dynamic-indices]] === Dynamically Updatable Indices -The next problem that needed to be ((("indexes", "dynamically updatable")))solved was how to make an inverted index +The next problem that needed to be ((("indices", "dynamically updatable")))solved was how to make an inverted index updatable without losing the benefits of immutability? The answer turned out to be: use more than one index. @@ -9,7 +9,7 @@ Instead of rewriting the whole inverted index, add new supplementary indices to reflect more-recent changes. Each inverted index can be queried in turn--starting with the oldest--and the results combined. Lucene, the Java libraries on which Elasticsearch is based, introduced the -concept of _per-segment search_. ((("per-segment search")))((("segments")))((("indexes", "in Lucene"))) A _segment_ is an inverted index in its own +concept of _per-segment search_. ((("per-segment search")))((("segments")))((("indices", "in Lucene"))) A _segment_ is an inverted index in its own right, but now the word _index_ in Lucene came to mean a _collection of segments_ plus a _commit point_—a file((("commit point"))) that lists all known segments, as depicted in <>. 
New documents are first added to an in-memory indexing buffer, as shown in <>, before being written to an on-disk segment, as in <> @@ -21,7 +21,7 @@ image::images/elas_1101.png["A Lucene index with a commit point and three segmen *************************************** To add to the confusion, a _Lucene index_ is what we call a _shard_ in -Elasticsearch, while an _index_ in Elasticsearch((("indexes", "in Elasticsearch")))((("shards", "indexes versus"))) is a collection of shards. +Elasticsearch, while an _index_ in Elasticsearch((("indices", "in Elasticsearch")))((("shards", "indices versus"))) is a collection of shards. When Elasticsearch searches an index, it sends the query out to a copy of every shard (Lucene index) that belongs to the index, and then reduces the per-shards results to a global result set, as described in @@ -62,7 +62,7 @@ documents can be added to the index relatively cheaply. Segments are immutable, so documents cannot be removed from older segments, nor can older segments be updated to reflect a newer version of a document. -Instead, every commit point includes a `.del` file that lists which documents +Instead, every ((("deleted documents")))commit point includes a `.del` file that lists which documents in which segments have been deleted. When a document is ``deleted,'' it is actually just _marked_ as deleted in the diff --git a/075_Inside_a_shard/40_Near_real_time.asciidoc b/075_Inside_a_shard/40_Near_real_time.asciidoc index 189edc5df..2f1d5b3e2 100644 --- a/075_Inside_a_shard/40_Near_real_time.asciidoc +++ b/075_Inside_a_shard/40_Near_real_time.asciidoc @@ -57,7 +57,7 @@ POST /blogs/_refresh <2> [TIP] ==== While a refresh is much lighter than a commit, it still has a performance -cost. A manual refresh can be useful when writing tests, but don't do a +cost.((("indices", "refresh_interval"))) A manual refresh can be useful when writing tests, but don't do a manual refresh every time you index a document in production; it will hurt your performance. Instead, your application needs to be aware of the near real-time nature of Elasticsearch and make allowances for it. diff --git a/075_Inside_a_shard/50_Persistent_changes.asciidoc b/075_Inside_a_shard/50_Persistent_changes.asciidoc index 87fe2da74..b097bb907 100644 --- a/075_Inside_a_shard/50_Persistent_changes.asciidoc +++ b/075_Inside_a_shard/50_Persistent_changes.asciidoc @@ -7,7 +7,7 @@ exiting the application normally. For Elasticsearch to be reliable, it needs to ensure that changes are persisted to disk. In <>, we said that a full commit flushes segments to disk and -writes a commit point, which lists all known segments. Elasticsearch uses +writes a commit point, which lists all known segments.((("commit point"))) Elasticsearch uses this commit point during startup or when reopening an index to decide which segments belong to the current shard. @@ -16,7 +16,7 @@ need to do full commits regularly to make sure that we can recover from failure. But what about the document changes that happen between commits? We don't want to lose those either. -Elasticsearch added a _translog_, or transaction log, which records every +Elasticsearch added a _translog_, or transaction log,((("translog (transaction log)"))) which records every operation in Elasticsearch as it happens. With the translog, the process now looks like this: @@ -81,12 +81,12 @@ the document, in real-time. ==== flush API The action of performing a commit and truncating the translog is known in -Elasticsearch as a _flush_. 
Shards are flushed automatically every 30 +Elasticsearch as a _flush_. ((("flushes"))) Shards are flushed automatically every 30 minutes, or when the translog becomes too big. See the http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-translog.html[`translog` documentation] for settings -that can be used to control these thresholds: +that can be used((("translog (transaction log)", "flushes and"))) to control these thresholds: -The http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-flush.html[`flush` API] can be used to perform a manual flush: +The http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-flush.html[`flush` API] can ((("indices", "flushing")))((("flush API")))be used to perform a manual flush: [source,json] ----------------------------- @@ -108,7 +108,7 @@ That said, it is beneficial to <> your indices before restartin **************************************** The purpose of the translog is to ensure that operations are not lost. This -begs the question: how safe is the translog? +begs the question: how safe((("translog (transaction log)", "safety of"))) is the translog? Writes to a file will not survive a reboot until the file has been +fsync+'ed to disk. By default, the translog is +fsync+'ed every 5 diff --git a/075_Inside_a_shard/60_Segment_merging.asciidoc b/075_Inside_a_shard/60_Segment_merging.asciidoc index 9cc3de121..f9d89e1c0 100644 --- a/075_Inside_a_shard/60_Segment_merging.asciidoc +++ b/075_Inside_a_shard/60_Segment_merging.asciidoc @@ -53,7 +53,7 @@ case. [[optimize-api]] ==== optimize API -The `optimize` API is best ((("merging segments", "optimize API and")))((("optimize API")))described as the _forced merge_ API. It forces a +The `optimize` API is best ((("merging segments", "optimize API and")))((("optimize API")))((("segments", "merging", "optimize API")))described as the _forced merge_ API. It forces a shard to be merged down to the number of segments specified in the `max_num_segments` parameter. The intention is to reduce the number of segments (usually to one) in order to speed up search performance. diff --git a/080_Structured_Search/25_ranges.asciidoc b/080_Structured_Search/25_ranges.asciidoc index a42172023..429a96e46 100644 --- a/080_Structured_Search/25_ranges.asciidoc +++ b/080_Structured_Search/25_ranges.asciidoc @@ -71,7 +71,7 @@ boundaries: ==== Ranges on Dates -The `range` filter can be used on date ((("dates", "range filter used on")))((("range filters", "using on dates")))fields too: +The `range` filter can be used on date ((("date ranges")))((("range filters", "using on dates")))fields too: [source,js] -------------------------------------------------- @@ -149,7 +149,7 @@ If we want a range from `a` up to but not including `b`, we can use the same .Be Careful of Cardinality **** Numeric and date fields are indexed in such a way that ranges are efficient -to calculate.((("cardinality, string ranges and"))) This is not the case for string fields, however. To perform +to calculate.((("cardinality", "string ranges and"))) This is not the case for string fields, however. To perform a range on a string field, Elasticsearch is effectively performing a `term` filter for every term that falls in the range. This is much slower than a date or numeric range. 
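To make the cardinality warning concrete, a string range such as the following sketch (the `title` field is an illustrative assumption) is effectively expanded into a `term` filter for every term that falls between the two boundaries, which is why it can be slow on fields with many unique terms:

[source,js]
--------------------------------------------------
"range" : {
    "title" : {
        "gte" : "a",
        "lt"  : "b"
    }
}
--------------------------------------------------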
diff --git a/080_Structured_Search/40_bitsets.asciidoc b/080_Structured_Search/40_bitsets.asciidoc index bd0dcfc6a..9c465ca0a 100644 --- a/080_Structured_Search/40_bitsets.asciidoc +++ b/080_Structured_Search/40_bitsets.asciidoc @@ -63,7 +63,7 @@ performance benefits. ==== Controlling Caching Most _leaf filters_—those dealing directly with fields like the `term` -filter--are cached, while((("leaf filters", "caching of")))((("caching", "of leaf filters, controlling")))((("filters", "controlling caching of"))) compound filters, like the `bool` filter, are not. +filter--are cached, while((("leaf filters, caching of")))((("caching", "of leaf filters, controlling")))((("filters", "controlling caching of"))) compound filters, like the `bool` filter, are not. [NOTE] ==== @@ -78,12 +78,12 @@ doesn't make sense to do so: Script filters:: -The results((("script filters", "no caching of results"))) from http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html#_controlling_caching[`script` filters] cannot +The results((("script filters, no caching of results"))) from http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html#_controlling_caching[`script` filters] cannot be cached because the meaning of the script is opaque to Elasticsearch. Geo-filters:: -The geolocation filters, which((("geolocation filters", "no caching of results"))) we cover in more detail in <>, are +The geolocation filters, which((("geolocation filters, no caching of results"))) we cover in more detail in <>, are usually used to filter results based on the geolocation of a specific user. Since each user has a unique geolocation, it is unlikely that geo-filters will be reused, so it makes no sense to cache them. @@ -98,7 +98,7 @@ caching is enabled by default. Sometimes the default caching strategy is not correct. Perhaps you have a complicated `bool` expression that is reused several times in the same query. Or you have a filter on a `date` field that will never be reused. The default -caching strategy ((("cache flag")))((("filters", "overriding default caching strategy on")))can be overridden on almost any filter by setting the +caching strategy ((("_cache flag", sortas="cache flag")))((("filters", "overriding default caching strategy on")))can be overridden on almost any filter by setting the `_cache` flag: [source,js] diff --git a/110_Multi_Field_Search/00_Intro.asciidoc b/110_Multi_Field_Search/00_Intro.asciidoc index 69a9445c0..d6090dd5e 100644 --- a/110_Multi_Field_Search/00_Intro.asciidoc +++ b/110_Multi_Field_Search/00_Intro.asciidoc @@ -1,7 +1,7 @@ [[multi-field-search]] == Multifield Search -Queries are seldom simple one-clause `match` queries. ((("multi-field search"))) We frequently need to +Queries are seldom simple one-clause `match` queries. ((("multifield search"))) We frequently need to search for the same or different query strings in one or more fields, which means that we need to be able to combine multiple query clauses and their relevance scores in a way that makes sense. 
diff --git a/110_Multi_Field_Search/05_Multiple_query_strings.asciidoc b/110_Multi_Field_Search/05_Multiple_query_strings.asciidoc index b01533148..ef2f0c54e 100644 --- a/110_Multi_Field_Search/05_Multiple_query_strings.asciidoc +++ b/110_Multi_Field_Search/05_Multiple_query_strings.asciidoc @@ -1,10 +1,10 @@ [[multi-query-strings]] === Multiple Query Strings -The simplest multifield query to deal with is the ((("multi-field search", "multiple query strings")))one where we can _map +The simplest multifield query to deal with is the ((("multifield search", "multiple query strings")))one where we can _map search terms to specific fields_. If we know that _War and Peace_ is the title, and Leo Tolstoy is the author, it is easy to write each of these -conditions as a `match` clause ((("match clause", "mapping search terms to specific fields")))((("bool query", "mapping search terms to specific fields in match clause")))and to combine them with a <>: [source,js] @@ -72,7 +72,7 @@ would have reduced the contribution of the title and author clauses to one-quart ==== Prioritizing Clauses It is likely that an even one-third split between clauses is not what we need for -the preceding query. ((("multi-field search", "multiple query strings", "prioritizing query clauses")))((("bool query", "prioritizing clauses"))) Probably we're more interested in the title and author +the preceding query. ((("multifield search", "multiple query strings", "prioritizing query clauses")))((("bool query", "prioritizing clauses"))) Probably we're more interested in the title and author clauses then we are in the translator clauses. We need to tune the query to make the title and author clauses relatively more important. diff --git a/110_Multi_Field_Search/10_Single_query_string.asciidoc b/110_Multi_Field_Search/10_Single_query_string.asciidoc index dd02a1314..dabe3b569 100644 --- a/110_Multi_Field_Search/10_Single_query_string.asciidoc +++ b/110_Multi_Field_Search/10_Single_query_string.asciidoc @@ -1,6 +1,6 @@ === Single Query String -The `bool` query is the mainstay of multiclause queries.((("multi-field search", "single query string"))) It works well +The `bool` query is the mainstay of multiclause queries.((("multifield search", "single query string"))) It works well for many cases, especially when you are able to map different query strings to individual fields. diff --git a/110_Multi_Field_Search/15_Best_field.asciidoc b/110_Multi_Field_Search/15_Best_field.asciidoc index fa8047d64..f9c3a1439 100644 --- a/110_Multi_Field_Search/15_Best_field.asciidoc +++ b/110_Multi_Field_Search/15_Best_field.asciidoc @@ -1,6 +1,6 @@ === Best Fields -Imagine that we have a website that allows ((("multi-field search", "best fields queries")))((("best fields queries")))users to search blog posts, such +Imagine that we have a website that allows ((("multifield search", "best fields queries")))((("best fields queries")))users to search blog posts, such as these two documents: [source,js] diff --git a/110_Multi_Field_Search/20_Tuning_best_field_queries.asciidoc b/110_Multi_Field_Search/20_Tuning_best_field_queries.asciidoc index 227cd39b3..454e8907a 100644 --- a/110_Multi_Field_Search/20_Tuning_best_field_queries.asciidoc +++ b/110_Multi_Field_Search/20_Tuning_best_field_queries.asciidoc @@ -1,10 +1,10 @@ === Tuning Best Fields Queries -What would happen if the user((("multi-field search", "best fields queries", "tuning")))((("best fields queries", "tuning"))) had searched instead for ``quick pets''? 
Both +What would happen if the user((("multifield search", "best fields queries", "tuning")))((("best fields queries", "tuning"))) had searched instead for ``quick pets''? Both documents contain the word `quick`, but only document 2 contains the word `pets`. Neither document contains _both words_ in the _same field_. -A simple `dis_max` query like the following would ((("dis_max (disjunction max) query")))choose the single best +A simple `dis_max` query like the following would ((("dis_max (disjunction max) query")))((("relevance scores", "calculation in dis_max queries")))choose the single best matching field, and ignore the other: [source,js] @@ -54,7 +54,7 @@ but this isn't the case. Remember: the `dis_max` query simply uses the ==== tie_breaker -It is possible, however, to((("dis_max (disjunction max) query", "using tie_breaker parameter")))((("relevance scores", "calculation in dis_match queries"))) also take the `_score` from the other matching +It is possible, however, to((("dis_max (disjunction max) query", "using tie_breaker parameter")))((("relevance scores", "calculation in dis_max queries", "using tie_breaker parameter"))) also take the `_score` from the other matching clauses into account, by specifying ((("tie_breaker parameter")))the `tie_breaker` parameter: [source,js] @@ -100,7 +100,7 @@ This gives us the following results: -------------------------------------------------- <1> Document 2 now has a small lead over document 1. -The `tie_breaker` parameter ((("relevance scores", "calculation in dis_max query using tie_breaker parameter")))makes the `dis_max` query behave more like a +The `tie_breaker` parameter makes the `dis_max` query behave more like a halfway house between `dis_max` and `bool`. It changes the score calculation as follows: diff --git a/110_Multi_Field_Search/25_Multi_match_query.asciidoc b/110_Multi_Field_Search/25_Multi_match_query.asciidoc index 1356025cf..32ae1f264 100644 --- a/110_Multi_Field_Search/25_Multi_match_query.asciidoc +++ b/110_Multi_Field_Search/25_Multi_match_query.asciidoc @@ -1,7 +1,7 @@ [[multi-match-query]] === multi_match Query -The `multi_match` query provides ((("multi-field search", "multi-match query")))((("multi_match queries")))((("match query", "multi_match queries"))) a convenient shorthand way of running +The `multi_match` query provides ((("multifield search", "multi_match query")))((("multi_match queries")))((("match query", "multi_match queries"))) a convenient shorthand way of running the same query against multiple fields. [NOTE] @@ -65,7 +65,7 @@ could be rewritten more concisely with `multi_match` as follows: ==== Using Wildcards in Field Names Field names can be specified with wildcards: any field that matches the -wildcard pattern((("multi_match queries", "wildcards in field names")))((("wildcards", "in field names")))((("fields", "wildcards in field names"))) will be included in the search. You could match on the +wildcard pattern((("multi_match queries", "wildcards in field names")))((("wildcards in field names")))((("fields", "wildcards in field names"))) will be included in the search. 
You could match on the `book_title`, `chapter_title`, and `section_title` fields, with the following: [source,js] diff --git a/110_Multi_Field_Search/30_Most_fields.asciidoc b/110_Multi_Field_Search/30_Most_fields.asciidoc index ccfa60a90..7a4cb944c 100644 --- a/110_Multi_Field_Search/30_Most_fields.asciidoc +++ b/110_Multi_Field_Search/30_Most_fields.asciidoc @@ -2,7 +2,7 @@ === Most Fields Full-text search is a battle between _recall_—returning all the -documents that are ((("most fields queries")))((("multi-field search", "most fields queries")))relevant--and _precision_—not returning irrelevant +documents that are ((("most fields queries")))((("multifield search", "most fields queries")))relevant--and _precision_—not returning irrelevant documents. The goal is to present the user with the most relevant documents on the first page of results. @@ -52,7 +52,7 @@ unstemmed fields to illustrate this technique. ==== Multifield Mapping -The first thing to do is to set up our ((("most fields queries", "multi-field mapping")))((("mapping (types)", "multi-field mapping")))field to be indexed twice: once in a +The first thing to do is to set up our ((("most fields queries", "multifield mapping")))((("mapping (types)", "multifield mapping")))field to be indexed twice: once in a stemmed form and once in an unstemmed form. To do this, we will use _multifields_, which we introduced in <>: diff --git a/110_Multi_Field_Search/35_Entity_search.asciidoc b/110_Multi_Field_Search/35_Entity_search.asciidoc index dbc22ce10..16d9f2c6d 100644 --- a/110_Multi_Field_Search/35_Entity_search.asciidoc +++ b/110_Multi_Field_Search/35_Entity_search.asciidoc @@ -1,6 +1,6 @@ === Cross-fields Entity Search -Now we come to a common pattern: cross-fields entity search. ((("cross-fields entity search")))((("multi-field search", "cross-fields entity search"))) With entities +Now we come to a common pattern: cross-fields entity search. ((("cross-fields entity search")))((("multifield search", "cross-fields entity search"))) With entities like `person`, `product`, or `address`, the identifying information is spread across several fields. We may have a `person` indexed as follows: diff --git a/110_Multi_Field_Search/40_Field_centric.asciidoc b/110_Multi_Field_Search/40_Field_centric.asciidoc index 306920ad3..f1fd1466b 100644 --- a/110_Multi_Field_Search/40_Field_centric.asciidoc +++ b/110_Multi_Field_Search/40_Field_centric.asciidoc @@ -1,7 +1,7 @@ [[field-centric]] === Field-Centric Queries -All three of the preceding problems stem from ((("field-centric queries")))((("multi-field search", "field-centric queries, problems with")))((("most fields queries", "problems with field-centric queries")))`most_fields` being +All three of the preceding problems stem from ((("field-centric queries")))((("multifield search", "field-centric queries, problems with")))((("most fields queries", "problems with field-centric queries")))`most_fields` being _field-centric_ rather than _term-centric_: it looks for the most matching _fields_, when really what we're interested is the most matching _terms_. @@ -100,7 +100,7 @@ When searching against multiple fields, TF/IDF can((("Term Frequency/Inverse Doc results. Consider our example of searching for ``Peter Smith'' using the `first_name` -and `last_name` fields. Peter is a common first name and Smith is a common +and `last_name` fields.((("inverse document frequency", "field-centric queries and"))) Peter is a common first name and Smith is a common last name--both will have low IDFs. 
But what if we have another person in the index whose name is Smith Williams? Smith as a first name is very uncommon and so will have a high IDF! diff --git a/110_Multi_Field_Search/45_Custom_all.asciidoc b/110_Multi_Field_Search/45_Custom_all.asciidoc index 7f1746bc9..5d4a4da6e 100644 --- a/110_Multi_Field_Search/45_Custom_all.asciidoc +++ b/110_Multi_Field_Search/45_Custom_all.asciidoc @@ -2,7 +2,7 @@ === Custom _all Fields In <>, we explained that the special `_all` field indexes the values -from all other fields as one big string.((("_all field", "custom", sortas="all field")))((("multi-field search", "custom _all fields"))) Having all fields indexed into one +from all other fields as one big string.((("_all field", "custom", sortas="all field")))((("multifield search", "custom _all fields"))) Having all fields indexed into one field is not terribly flexible, though. It would be nice to have one custom `_all` field for the person's name, and another custom `_all` field for the address. diff --git a/110_Multi_Field_Search/50_Cross_field.asciidoc b/110_Multi_Field_Search/50_Cross_field.asciidoc index 167c28938..593939bce 100644 --- a/110_Multi_Field_Search/50_Cross_field.asciidoc +++ b/110_Multi_Field_Search/50_Cross_field.asciidoc @@ -1,7 +1,7 @@ === cross-fields Queries The custom `_all` approach is a good solution, as long as you thought -about setting it up before you indexed your((("multi-field search", "cross-fields queries")))((("cross-fields queries"))) documents. However, Elasticsearch +about setting it up before you indexed your((("multifield search", "cross-fields queries")))((("cross-fields queries"))) documents. However, Elasticsearch also provides a search-time solution to the problem: the `multi_match` query with type `cross_fields`.((("multi_match queries", "cross_fields type"))) The `cross_fields` type takes a term-centric approach, quite different from the diff --git a/110_Multi_Field_Search/55_Not_analyzed.asciidoc b/110_Multi_Field_Search/55_Not_analyzed.asciidoc index 6dbf8e40b..2af19eef1 100644 --- a/110_Multi_Field_Search/55_Not_analyzed.asciidoc +++ b/110_Multi_Field_Search/55_Not_analyzed.asciidoc @@ -1,7 +1,7 @@ === Exact-Value Fields The final topic that we should touch on before leaving multifield queries is -that of exact-value `not_analyzed` fields. ((("not_analyzed fields", "exact value, in multi-field queries")))((("multithreaded programming", "avoiding use of not_analyzed fields in")))((("multi-field search", "exact value fields")))((("exact values", "exact value not_analyzed fields in multi-field search")))((("analyzed fields", "avoiding mixing with not analyzed fields in multi_match queries"))) It is not useful to mix +that of exact-value `not_analyzed` fields. ((("not_analyzed fields", "exact value, in multi-field queries")))((("multifield search", "exact value fields")))((("exact values", "exact value not_analyzed fields in multifield search")))((("analyzed fields", "avoiding mixing with not analyzed fields in multi_match queries"))) It is not useful to mix `not_analyzed` fields with `analyzed` fields in `multi_match` queries. The reason for this can be demonstrated easily by looking at a query diff --git a/120_Proximity_Matching/05_Phrase_matching.asciidoc b/120_Proximity_Matching/05_Phrase_matching.asciidoc index 7ffee8709..645f8aedb 100644 --- a/120_Proximity_Matching/05_Phrase_matching.asciidoc +++ b/120_Proximity_Matching/05_Phrase_matching.asciidoc @@ -95,7 +95,7 @@ all the words in exactly the order specified, with no words in-between. 
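For reference, a phrase query of the kind described here might look like the following sketch, where the index, type, and `title` field are illustrative assumptions:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}
--------------------------------------------------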
==== What Is a Phrase -For a document to be considered a((("match_phrase query", "documents matching a phrase"))) match for the phrase ``quick brown fox,'' the following must be true: +For a document to be considered a((("match_phrase query", "documents matching a phrase")))((("phrase matching", "criteria for matching documents"))) match for the phrase ``quick brown fox,'' the following must be true: * `quick`, `brown`, and `fox` must all appear in the field. diff --git a/120_Proximity_Matching/15_Multi_value_fields.asciidoc b/120_Proximity_Matching/15_Multi_value_fields.asciidoc index 0cb06ca50..a9c293773 100644 --- a/120_Proximity_Matching/15_Multi_value_fields.asciidoc +++ b/120_Proximity_Matching/15_Multi_value_fields.asciidoc @@ -1,7 +1,7 @@ === Multivalue Fields A curious thing can happen when you try to use phrase matching on multivalue -fields. ((("proximity matching", "on multi-value fields")))((("match_phrase query", "on multi-value fields"))) Imagine that you index this document: +fields. ((("proximity matching", "on multivalue fields")))((("match_phrase query", "on multivalue fields"))) Imagine that you index this document: [source,js] -------------------------------------------------- diff --git a/120_Proximity_Matching/35_Shingles.asciidoc b/120_Proximity_Matching/35_Shingles.asciidoc index 79a7d40ce..f533b604a 100644 --- a/120_Proximity_Matching/35_Shingles.asciidoc +++ b/120_Proximity_Matching/35_Shingles.asciidoc @@ -123,7 +123,7 @@ Now we can proceed to setting up a field to use the new analyzer. ==== Multifields We said that it is cleaner to index unigrams and bigrams separately, so we -will create the `title` field ((("multi-fields")))as a multifield (see <>): +will create the `title` field ((("multifields")))as a multifield (see <>): [source,js] -------------------------------------------------- diff --git a/130_Partial_Matching/05_Postcodes.asciidoc b/130_Partial_Matching/05_Postcodes.asciidoc index 4db696907..6b2ccf8bd 100644 --- a/130_Partial_Matching/05_Postcodes.asciidoc +++ b/130_Partial_Matching/05_Postcodes.asciidoc @@ -2,7 +2,7 @@ We will use United Kingdom postcodes (postal codes in the United States) to illustrate how((("partial matching", "postcodes and structured data"))) to use partial matching with structured data.((("structured data", "partial matching with postcodes"))) UK postcodes have a well-defined structure. For instance, the -postcode `W1V 3DG` can((("postcodes (UK), partial mapping with"))) be broken down as follows: +postcode `W1V 3DG` can((("postcodes (UK), partial matching with"))) be broken down as follows: * `W1V`: This outer part identifies the postal area and district: diff --git a/130_Partial_Matching/20_Match_phrase_prefix.asciidoc b/130_Partial_Matching/20_Match_phrase_prefix.asciidoc index f5863d236..6f03eec55 100644 --- a/130_Partial_Matching/20_Match_phrase_prefix.asciidoc +++ b/130_Partial_Matching/20_Match_phrase_prefix.asciidoc @@ -16,7 +16,7 @@ full-text field. In <>, we introduced the `match_phrase` query, which matches all the specified words in the same positions relative to each other. 
For-query time search-as-you-type, we can use a specialization of this query, -called ((("prefixes", "match_phrase_prefix query")))((("match_phrase_prefix query")))the `match_phrase_prefix` query: +called ((("prefix query", "match_phrase_prefix query")))((("match_phrase_prefix query")))the `match_phrase_prefix` query: [source,js] -------------------------------------------------- diff --git a/130_Partial_Matching/35_Search_as_you_type.asciidoc b/130_Partial_Matching/35_Search_as_you_type.asciidoc index fe8ef9ac7..157ca18cd 100644 --- a/130_Partial_Matching/35_Search_as_you_type.asciidoc +++ b/130_Partial_Matching/35_Search_as_you_type.asciidoc @@ -112,7 +112,7 @@ terms: * `brown` To use the analyzer, we need to apply it to a field, which we can do -with((("update-mapping API", "applying custom autocomplete analyzer to a field"))) the `update-mapping` API: +with((("update-mapping API, applying custom autocomplete analyzer to a field"))) the `update-mapping` API: [source,js] -------------------------------------------------- diff --git a/170_Relevance/10_Scoring_theory.asciidoc b/170_Relevance/10_Scoring_theory.asciidoc index 4c2024347..f52570859 100644 --- a/170_Relevance/10_Scoring_theory.asciidoc +++ b/170_Relevance/10_Scoring_theory.asciidoc @@ -3,7 +3,7 @@ Lucene (and thus Elasticsearch) uses the http://en.wikipedia.org/wiki/Standard_Boolean_model[_Boolean model_] -to find matching documents, and a formula called the +to find matching documents,((("relevance scores", "theory behind", id="ix_relscore", range="startofrange")))((("Boolean Model"))) and a formula called the <> to calculate relevance. This formula borrows concepts from http://en.wikipedia.org/wiki/Tfidf[_term frequency/inverse document frequency_] and the @@ -24,7 +24,7 @@ influence the outcome. ==== Boolean Model The _Boolean model_ simply applies the `AND`, `OR`, and `NOT` conditions -expressed in the query to find all the documents that match. A query for +expressed in the query to find all the documents that match.((("and operator")))((("not operator")))((("or operator"))) A query for full AND text AND search AND (elasticsearch OR lucene) @@ -38,7 +38,7 @@ cannot possibly match the query. ==== Term Frequency/Inverse Document Frequency (TF/IDF) Once we have a list of matching documents, they need to be ranked by -relevance. Not all documents will contain all the terms, and some terms are +relevance.((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm"))) Not all documents will contain all the terms, and some terms are more important than others. The relevance score of the whole document depends (in part) on the _weight_ of each query term that appears in that document. @@ -50,7 +50,7 @@ sake, but you are not required to remember them. [[tf]] ===== Term frequency -How often does the term appear in this document? The more often, the +How often does the term appear in this document?((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "term frequency"))) The more often, the _higher_ the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows: @@ -90,7 +90,7 @@ PUT /my_index ===== Inverse document frequency How often does the term appear in all documents in the collection? The more -often, the _lower_ the weight. 
Common terms like `and` or `the` contribute +often, the _lower_ the weight.((("inverse document frequency")))((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "inverse document frequency"))) Common terms like `and` or `the` contribute little to relevance, as they appear in most documents, while uncommon terms like `elastic` or `hippopotamus` help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows: @@ -106,7 +106,7 @@ idf(t) = 1 + log ( numDocs / (docFreq + 1)) <1> [[field-norm]] ===== Field-length norm -How long is the field? The shorter the field, the _higher_ the weight. If a +How long is the field? ((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "field-length norm")))((("field-length norm")))The shorter the field, the _higher_ the weight. If a term appears in a short field, such as a `title` field, it is more likely that the content of that field is _about_ the term than if the same term appears in a much bigger `body` field. The field length norm is calculated as follows: @@ -117,7 +117,7 @@ norm(d) = 1 / √numTerms <1> <1> The field-length norm (`norm`) is the inverse square root of the number of terms in the field. -While the field-length norm is important for full-text search, many other +While the field-length ((("string fields", "field-length norm")))norm is important for full-text search, many other fields don't need norms. Norms consume approximately 1 byte per `string` field per document in the index, whether or not a document contains the field. Exact-value `not_analyzed` string fields have norms disabled by default, but you can use the field mapping to disable norms on `analyzed` fields as @@ -149,7 +149,7 @@ norms can save a significant amount of memory. ===== Putting it together -These three factors--term frequency, inverse document frequency, and field-length norm--are calculated and stored at index time. Together, they are +These three factors--term frequency, inverse document frequency, and field-length norm--are calculated and stored at index time.((("weight", "calculation of"))) Together, they are used to calculate the _weight_ of a single term in a particular document. [TIP] @@ -207,7 +207,7 @@ vector space model. [[vector-space-model]] ==== Vector Space Model -The _vector space model_ provides a way of comparing a multiterm query +The _vector space model_ provides a way of ((("Vector Space Model")))comparing a multiterm query against a document. The output is a single score that represents how well the document matches the query. In order to do this, the model represents both the document and the query as _vectors_. @@ -216,7 +216,7 @@ A vector is really just a one-dimensional array containing numbers, for example: [1,2,5,22,3,8] -In the vector space model, each number in the vector is the _weight_ of a term, +In the vector space((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "in Vector Space Model"))) model, each number in the vector is((("weight", "calculation of", "in Vector Space Model"))) the _weight_ of a term, as calculated with <>. [TIP] @@ -281,4 +281,4 @@ at http://en.wikipedia.org/wiki/Cosine_similarity. ================================================== Now that we have talked about the theoretical basis of scoring, we can move on -to see how scoring is implemented in Lucene. 
+to see how scoring is implemented in Lucene.((("relevance scores", "theory behind", range="endofrange", startref="ix_relscore"))) diff --git a/170_Relevance/15_Practical_scoring.asciidoc b/170_Relevance/15_Practical_scoring.asciidoc index 616950299..02caa0d1d 100644 --- a/170_Relevance/15_Practical_scoring.asciidoc +++ b/170_Relevance/15_Practical_scoring.asciidoc @@ -1,7 +1,7 @@ [[practical-scoring-function]] === Lucene's Practical Scoring Function -For multiterm queries, Lucene takes((("Boolean Model"))) the <>, +For multiterm queries, Lucene takes((("relevance", "controlling", "Lucene's practical scoring function", id="ix_relcontPCF", range="startofrange")))((("Boolean Model"))) the <>, <>, and the <> and combines ((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm")))((("Vector Space Model"))) them in a single efficient package that collects matching documents and scores them as it goes. @@ -43,7 +43,7 @@ both. As soon as a document matches a query, Lucene calculates its score for that query, combining the scores of each matching term. The formula used for -scoring is called the _practical scoring function_.((("Practical Scoring Function"))) It looks intimidating, but +scoring is called the _practical scoring function_.((("practical scoring function"))) It looks intimidating, but don't be put off--most of the components you already know. It introduces a few new elements that we discuss next. @@ -80,7 +80,7 @@ index-time field-level boosting out of the way. [[query-norm]] ==== Query Normalization Factor -The _query normalization factor_ (`queryNorm`) is ((("query normalization factor")))((("normalization", "query normalization factor")))an attempt to _normalize_ a +The _query normalization factor_ (`queryNorm`) is ((("practical scoring function", "query normalization factor")))((("query normalization factor")))((("normalization", "query normalization factor")))an attempt to _normalize_ a query so that the results from one query may be compared with the results of another. @@ -111,7 +111,7 @@ have no way of changing it. For all intents and purposes, it can be ignored. [[coord]] ==== Query Coordination -The _coordination factor_ (`coord`) is used to((("coordination factor (coord)")))((("query coordination"))) reward documents that contain a +The _coordination factor_ (`coord`) is used to((("coordination factor (coord)")))((("query coordination")))((("practical scoring function", "coordination factor"))) reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query. @@ -194,13 +194,13 @@ automatically; you don't need to worry about it. [[index-boost]] ==== Index-Time Field-Level Boosting -We will talk about _boosting_ a field--making it ((("boosting", "index time field-level boosting")))more important than other +We will talk about _boosting_ a field--making it ((("indexing", "field-level index time boosts")))((("boosting", "index time field-level boosting")))((("practical scoring function", "index time field-level boosting")))more important than other fields--at query time in <>. It is also possible to apply a boost to a field at index time. Actually, this boost is applied to every term in the field, rather than to the field itself. 
To store this boost value in the index without using more space -than necessary, this field-level index-time boost is combined with the ((("field length norm")))field-length norm (see <>) and stored in the index as a single byte. +than necessary, this field-level index-time boost is combined with the ((("field-length norm")))field-length norm (see <>) and stored in the index as a single byte. This is the value returned by `norm(t,d)` in the preceding formula. [WARNING] @@ -228,6 +228,6 @@ flexible option. With query normalization, coordination, and index-time boosting out of the way, we can now move on to the most useful tool for influencing the relevance -calculation: query-time boosting. +calculation: query-time boosting.((("relevance", "controlling", "Lucene's practical scoring function", range="endofrange", startref="ix_relcontPCF"))) diff --git a/170_Relevance/20_Query_time_boosting.asciidoc b/170_Relevance/20_Query_time_boosting.asciidoc index fd60e910d..55588a63d 100644 --- a/170_Relevance/20_Query_time_boosting.asciidoc +++ b/170_Relevance/20_Query_time_boosting.asciidoc @@ -46,14 +46,14 @@ value for a particular query clause. It's a matter of try-it-and-see. Remember that `boost` is just one of the factors involved in the relevance score; it has to compete with the other factors. For instance, in the preceding example, the `title` field will probably already have a ``natural'' boost over -the `content` field thanks ((("field length norm")))to the <> (titles +the `content` field thanks ((("field-length norm")))to the <> (titles are usually shorter than the related content), so don't blindly boost fields just because you think they should be boosted. Apply a boost and check the results. Change the boost and check again. ==== Boosting an Index -When searching across multiple indices, you((("boosting", "query-time", "boosting an index")))((("indexes", "boosting an index"))) can boost an entire index over +When searching across multiple indices, you((("boosting", "query-time", "boosting an index")))((("indices", "boosting an index"))) can boost an entire index over the others with the `indices_boost` parameter.((("indices_boost parameter"))) This could be used, as in the next example, to give more weight to documents from a more recent index: @@ -81,7 +81,7 @@ GET /docs_2014_*/_search <1> ==== t.getBoost() These boost values are represented in the <> by -the `t.getBoost()` element.((("boosting", "query-time", "t.getBoost()")))((("t.getBoost() method"))) Boosts are not applied at the level that they +the `t.getBoost()` element.((("practical scoring function", "t.getBoost() method")))((("boosting", "query-time", "t.getBoost()")))((("t.getBoost() method"))) Boosts are not applied at the level that they appear in the query DSL. Instead, any boost values are combined and passsed down to the individual terms. The `t.getBoost()` method returns any `boost` value applied to the term itself or to any of the queries higher up the chain. 
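A minimal sketch of query-time boosting as discussed in this hunk, where the field names and the boost value are illustrative assumptions: the `boost` of `2` gives matches on `title` more weight than matches on `content` when the clause scores are combined.

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title":   { "query": "full text search", "boost": 2 }}},
                { "match": { "content": "full text search" }}
            ]
        }
    }
}
--------------------------------------------------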
diff --git a/170_Relevance/45_Popularity.asciidoc b/170_Relevance/45_Popularity.asciidoc index 3395e65bc..8f383a062 100644 --- a/170_Relevance/45_Popularity.asciidoc +++ b/170_Relevance/45_Popularity.asciidoc @@ -2,7 +2,7 @@ === Boosting by Popularity Imagine that we have a website that hosts blog posts and enables users to vote for the -blog posts that they like.((("relevance", "controlling", "boosting by popularity")))((("popularity, boosting by")))((("boosting", "by popularity"))) We would like more-popular posts to appear higher in the +blog posts that they like.((("relevance", "controlling", "boosting by popularity")))((("popularity", "boosting by")))((("boosting", "by popularity"))) We would like more-popular posts to appear higher in the results list, but still have the full-text score as the main relevance driver. We can do this easily by storing the number of votes with each blog post: @@ -115,7 +115,7 @@ http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl- ==== factor The strength of the popularity effect can be increased or decreased by -multiplying the value((("factor")))((("field_value_factor function", "factor parameter"))) in the `votes` field by some number, called the +multiplying the value((("factor (function_score)")))((("field_value_factor function", "factor parameter"))) in the `votes` field by some number, called the `factor`: [source,json] diff --git a/170_Relevance/55_Random_scoring.asciidoc b/170_Relevance/55_Random_scoring.asciidoc index 30c72e54d..b2f530e05 100644 --- a/170_Relevance/55_Random_scoring.asciidoc +++ b/170_Relevance/55_Random_scoring.asciidoc @@ -2,7 +2,7 @@ === Random Scoring You may have been wondering what _consistently random scoring_ is, or why -you would ever want to use it.((("consistently random scoring")))((("relevance", "controlling", "random scoring")))((("random scoring"))) The previous example provides a good use case. +you would ever want to use it.((("consistently random scoring")))((("relevance", "controlling", "random scoring"))) The previous example provides a good use case. All results from the previous example would receive a final `_score` of 1, 2, 3, 4, or 5. Maybe there are only a few homes that score 5, but presumably there would be a lot of homes scoring 2 or 3. diff --git a/170_Relevance/60_Decay_functions.asciidoc b/170_Relevance/60_Decay_functions.asciidoc index cdaaa2566..eb760cccd 100644 --- a/170_Relevance/60_Decay_functions.asciidoc +++ b/170_Relevance/60_Decay_functions.asciidoc @@ -107,7 +107,7 @@ GET /_search <3> See <> for the reason that `origin` is `50` instead of `100`. <4> The `price` clause has twice the weight of the `location` clause. -The `location` clause is((("location clause", "gauss (Gaussian) function"))) easy to understand: +The `location` clause is((("location clause, Gaussian function example"))) easy to understand: * We have specified an `origin` that corresponds to the center of London. * Any location within `2km` of the `origin` receives the full score of `1.0`. @@ -117,7 +117,7 @@ of `0.5`. [[Understanding-the-price-Clause]] === Understanding the price Clause -The `price` clause is a little trickier.((("price clause (gauss function)"))) The user's preferred price is +The `price` clause is a little trickier.((("price clause (Gaussian function example)"))) The user's preferred price is anything up to £100, but this example sets the origin to £50. Prices can't be negative, but the lower they are, the better. 
Really, any price between £0 and £100 should be considered optimal. @@ -130,7 +130,7 @@ way, the score decays only for any prices above £100 (`origin + offset`). ================================================== The `weight` parameter can be used to increase or decrease the contribution of -individual clauses. ((("weight parameter", "in function_score query"))) The `weight`, which defaults to `1.0`, is multiplied by +individual clauses. ((("weight parameter (in function_score query)"))) The `weight`, which defaults to `1.0`, is multiplied by the score from each clause before the scores are combined with the specified `score_mode`. diff --git a/170_Relevance/65_Script_score.asciidoc b/170_Relevance/65_Script_score.asciidoc index 6b11cab4b..00eed1bed 100644 --- a/170_Relevance/65_Script_score.asciidoc +++ b/170_Relevance/65_Script_score.asciidoc @@ -101,7 +101,7 @@ a profit. [TIP] ======================================== -The `script_score` function provides enormous flexibility.((("scripting", "performance and"))) Within a script, +The `script_score` function provides enormous flexibility.((("scripts", "performance and"))) Within a script, you have access to the fields of the document, to the current `_score`, and even to the term frequencies, inverse document frequencies, and field length norms (see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html[Text scoring in scripts]). diff --git a/200_Language_intro/10_Using.asciidoc b/200_Language_intro/10_Using.asciidoc index e6e08ed7d..005c0b195 100644 --- a/200_Language_intro/10_Using.asciidoc +++ b/200_Language_intro/10_Using.asciidoc @@ -41,7 +41,7 @@ recall as we can match more loosely, but we have reduced our ability to rank documents accurately. To get the best of both worlds, we can use <> to -index the `title` field twice: once((("multi-fields", "using to index a field with two different analyzers"))) with the `english` analyzer and once with +index the `title` field twice: once((("multifields", "using to index a field with two different analyzers"))) with the `english` analyzer and once with the `standard` analyzer: [source,js] diff --git a/200_Language_intro/30_Language_pitfalls.asciidoc b/200_Language_intro/30_Language_pitfalls.asciidoc index 99fa6e36a..e3a91d82c 100644 --- a/200_Language_intro/30_Language_pitfalls.asciidoc +++ b/200_Language_intro/30_Language_pitfalls.asciidoc @@ -102,7 +102,7 @@ library from http://blog.mikemccandless.com/2013/08/a-new-version-of-compact-language.html[Mike McCandless], which uses the open source (http://www.apache.org/licenses/LICENSE-2.0[Apache License 2.0]) https://code.google.com/p/cld2/[Compact Language Detector] (CLD) from Google. It is -small, fast, and accurate, and can detect 160+ languages from as little as two +small, fast, ((("Compact Language Detector (CLD)")))and accurate, and can detect 160+ languages from as little as two sentences. It can even detect multiple languages within a single block of text. Bindings exist for several languages including Python, Perl, JavaScript, PHP, C#/.NET, and R. 
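Returning to the 10_Using.asciidoc hunk above, which indexes the `title` field once with the `english` analyzer and once with the `standard` analyzer, such a multifield mapping might look like the following sketch; the index and type names are illustrative assumptions:

[source,js]
--------------------------------------------------
PUT /my_index
{
    "mappings": {
        "blog": {
            "properties": {
                "title": {
                    "type":     "string",
                    "analyzer": "english",
                    "fields": {
                        "std": {
                            "type":     "string",
                            "analyzer": "standard"
                        }
                    }
                }
            }
        }
    }
}
--------------------------------------------------

Queries can then favor the stemmed `title` field for recall and the unstemmed `title.std` field for precision.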
diff --git a/200_Language_intro/40_One_language_per_doc.asciidoc b/200_Language_intro/40_One_language_per_doc.asciidoc index bff74c376..e62021a4e 100644 --- a/200_Language_intro/40_One_language_per_doc.asciidoc +++ b/200_Language_intro/40_One_language_per_doc.asciidoc @@ -1,7 +1,7 @@ [[one-lang-docs]] === One Language per Document -A single predominant language per document ((("languages", "one language per document")))((("indexes", "documents in different languages")))requires a relatively simple setup. +A single predominant language per document ((("languages", "one language per document")))((("indices", "documents in different languages")))requires a relatively simple setup. Documents from different languages can be stored in separate indices—`blogs-en`, `blogs-fr`, and so forth—that use the same type and the same fields for each index, just with different analyzers: diff --git a/200_Language_intro/60_Mixed_language_fields.asciidoc b/200_Language_intro/60_Mixed_language_fields.asciidoc index a882b437b..2db1f4141 100644 --- a/200_Language_intro/60_Mixed_language_fields.asciidoc +++ b/200_Language_intro/60_Mixed_language_fields.asciidoc @@ -25,14 +25,14 @@ Assuming that your mix of languages uses the same script such as Latin, you have ==== Split into Separate Fields -The Compact Language Detector ((("languages", "mixed language fields", "splitting into separate fields")))((("Compact Language Detector")))mentioned in <> can tell +The Compact Language Detector ((("languages", "mixed language fields", "splitting into separate fields")))((("Compact Language Detector (CLD)")))mentioned in <> can tell you which parts of the document are in which language. You can split up the text based on language and use the same approach as was used in <>. ==== Analyze Multiple Times -If you primarily deal with a limited number of languages, ((("languages", "mixed language fields", "analyzing multiple times")))((("analyzers", "for mixed language fields")))((("multi-fields", "analying mixed language fields")))you could use +If you primarily deal with a limited number of languages, ((("languages", "mixed language fields", "analyzing multiple times")))((("analyzers", "for mixed language fields")))((("multifields", "analyzing mixed language fields")))you could use multi-fields to analyze the text once per language: [source,js] diff --git a/210_Identifying_words/10_Standard_analyzer.asciidoc b/210_Identifying_words/10_Standard_analyzer.asciidoc index a155e614c..8947d6ef5 100644 --- a/210_Identifying_words/10_Standard_analyzer.asciidoc +++ b/210_Identifying_words/10_Standard_analyzer.asciidoc @@ -2,7 +2,7 @@ === standard Analyzer The `standard` analyzer is used by default for any full-text `analyzed` string -field.
((("standard analyzer"))) If we were to reimplement the `standard` analyzer as a <>, it would be defined as follows: [role="pagebreak-before"] diff --git a/210_Identifying_words/30_ICU_plugin.asciidoc b/210_Identifying_words/30_ICU_plugin.asciidoc index 5d7b1ae35..9253de2b4 100644 --- a/210_Identifying_words/30_ICU_plugin.asciidoc +++ b/210_Identifying_words/30_ICU_plugin.asciidoc @@ -4,7 +4,7 @@ The https://github.com/elasticsearch/elasticsearch-analysis-icu[ICU analysis plug-in] for Elasticsearch uses the _International Components for Unicode_ (ICU) libraries (see http://site.icu-project.org[site.project.org]) to -provide a rich set of tools for dealing with Unicode.((("International Components for Unicode libraries", see="ICU plugin")))((("words", "identifying words", "ICU plugin, installing")))((("ICU plugin", "installing"))) These include the +provide a rich set of tools for dealing with Unicode.((("International Components for Unicode libraries", see="ICU plugin, installing")))((("words", "identifying words", "ICU plugin, installing")))((("ICU plugin, installing"))) These include the `icu_tokenizer`, which is particularly useful for Asian languages,((("Asian languages", "icu_tokenizer for"))) and a number of token filters that are essential for correct matching and sorting in all languages other than English. diff --git a/210_Identifying_words/50_Tidying_text.asciidoc b/210_Identifying_words/50_Tidying_text.asciidoc index 685f6c4dd..30d3c0365 100644 --- a/210_Identifying_words/50_Tidying_text.asciidoc +++ b/210_Identifying_words/50_Tidying_text.asciidoc @@ -10,7 +10,7 @@ of the output. ==== Tokenizing HTML Passing HTML through the `standard` tokenizer or the `icu_tokenizer` produces -poor results.((("HTML", "tokenizing"))) These tokenizers just don't know what to do with the HTML tags. +poor results.((("HTML, tokenizing"))) These tokenizers just don't know what to do with the HTML tags. For example: [source,js] diff --git a/220_Token_normalization/20_Removing_diacritics.asciidoc b/220_Token_normalization/20_Removing_diacritics.asciidoc index 183a73e03..cf7b1b759 100644 --- a/220_Token_normalization/20_Removing_diacritics.asciidoc +++ b/220_Token_normalization/20_Removing_diacritics.asciidoc @@ -47,7 +47,7 @@ My œsophagus caused a débâcle <1> ==== Retaining Meaning Of course, when you strip diacritical marks from a word, you lose meaning. -For instance, consider((("diacritics", "stripping, meaning loss from"))) these three ((("Spanis", "stripping diacritics, meaning loss from")))Spanish words: +For instance, consider((("diacritics", "stripping, meaning loss from"))) these three ((("Spanish", "stripping diacritics, meaning loss from")))Spanish words: `esta`:: Feminine form of the adjective _this_, as in _esta silla_ (this chair) or _esta_ (this one). diff --git a/230_Stemming/10_Algorithmic_stemmers.asciidoc b/230_Stemming/10_Algorithmic_stemmers.asciidoc index e5486e804..a3cf8412a 100644 --- a/230_Stemming/10_Algorithmic_stemmers.asciidoc +++ b/230_Stemming/10_Algorithmic_stemmers.asciidoc @@ -94,7 +94,7 @@ documentation, which shows the following: `english_keywords`, and `english_stemmer`. 
Having reviewed the current configuration, we can use it as the basis for -a new analyzer, with((("english analyzer", "customiing the stemmer"))) the following changes: +a new analyzer, with((("english analyzer", "customizing the stemmer"))) the following changes: * Change the `english_stemmer` from `english` (which maps to the http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-porterstem-tokenfilter.html[`porter_stem`] token filter) diff --git a/230_Stemming/40_Choosing_a_stemmer.asciidoc b/230_Stemming/40_Choosing_a_stemmer.asciidoc index 85906f83b..4140f276f 100644 --- a/230_Stemming/40_Choosing_a_stemmer.asciidoc +++ b/230_Stemming/40_Choosing_a_stemmer.asciidoc @@ -96,7 +96,7 @@ any stemmer.) [[stemmer-degree]] ==== Stemmer Degree -Different stemmers overstem and understem((("stemming words", "choosing as stemmer", "stemmer degree"))) to a different degree. The `light_` +Different stemmers overstem and understem((("stemming words", "choosing a stemmer", "stemmer degree"))) to a different degree. The `light_` stemmers stem less aggressively than the standard stemmers, and the `minimal_` stemmers less aggressively still. Hunspell stems aggressively. diff --git a/230_Stemming/50_Controlling_stemming.asciidoc b/230_Stemming/50_Controlling_stemming.asciidoc index 831cf819e..2ea34bd37 100644 --- a/230_Stemming/50_Controlling_stemming.asciidoc +++ b/230_Stemming/50_Controlling_stemming.asciidoc @@ -24,7 +24,7 @@ token filters from touching those words.((("keyword_marker token filter", "preve For instance, we can create a simple custom analyzer that uses the http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-porterstem-tokenfilter.html[`porter_stem`] token filter, -but prevents the word `skies` from being stemmed: +but prevents the word `skies` from((("porter_stem token filter"))) being stemmed: [source,json] ------------------------------------------ @@ -72,7 +72,7 @@ sky skies skiing skis <1> While the language analyzers allow ((("language analyzers", "stem_exclusion parameter")))us only to specify an array of words in the `stem_exclusion` parameter, the `keyword_marker` token filter also accepts a `keywords_path` parameter that allows us to store all of our keywords in a -file. ((("keyword_marker token filter", "keyword_path parameter")))The file should contain one word per line, and must be present on every +file. ((("keyword_marker token filter", "keywords_path parameter")))The file should contain one word per line, and must be present on every node in the cluster. See <> for tips on how to update this file. diff --git a/240_Stopwords/20_Using_stopwords.asciidoc b/240_Stopwords/20_Using_stopwords.asciidoc index c524b6672..a26e5b5e7 100644 --- a/240_Stopwords/20_Using_stopwords.asciidoc +++ b/240_Stopwords/20_Using_stopwords.asciidoc @@ -85,7 +85,7 @@ The quick and the dead <1> Note the `position` of each token. The stopwords have been filtered out, as expected, but the interesting part is -that the `position` of the((("phrase matching", "position of terms", "stopwords and"))) two remaining terms is unchanged: `quick` is the +that the `position` of the((("phrase matching", "stopwords and", "positions data"))) two remaining terms is unchanged: `quick` is the second word in the original sentence, and `dead` is the fifth. 
This is important for phrase queries--if the positions of each term had been adjusted, a phrase query for `quick dead` would have matched the preceding @@ -165,7 +165,7 @@ PUT /my_index The http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/using-stopwords.html#stop-token-filter[`stop` token filter] can be combined with a tokenizer((("stopwords", "using stop token filter")))((("stop token filter", "using in custom analyzer"))) and other token filters when you need to create a `custom` -analyzer. For instance, let's say that we wanted to ((("Spanish analyzer", "custom, creating")))((("light_spanish stemmer")))create a Spanish analyzer +analyzer. For instance, let's say that we wanted to ((("Spanish", "custom analyzer for")))((("light_spanish stemmer")))create a Spanish analyzer with the following: * A custom stopwords list diff --git a/240_Stopwords/40_Divide_and_conquer.asciidoc b/240_Stopwords/40_Divide_and_conquer.asciidoc index 151dd1da6..3dfc5f48a 100644 --- a/240_Stopwords/40_Divide_and_conquer.asciidoc +++ b/240_Stopwords/40_Divide_and_conquer.asciidoc @@ -2,7 +2,7 @@ === Divide and Conquer The terms in a query string can be divided into more-important (low-frequency) -and less-important (high-frequency) terms.((("stopwods", "low and high frequency terms"))) Documents that match only the less +and less-important (high-frequency) terms.((("stopwords", "low and high frequency terms"))) Documents that match only the less important terms are probably of very little interest. Really, we want documents that match as many of the more important terms as possible. @@ -17,7 +17,7 @@ gain a real boost of speed on previously slow queries. ********************************************* One of the benefits of `cutoff_frequency` is that you get _domain-specific_ -stopwords for free.((("domain specific stopwords"))) For instance, a website about movies may use the words +stopwords for free.((("domain specific stopwords")))((("stopwords", "domain specific"))) For instance, a website about movies may use the words _movie_, _color_, _black_, and _white_ so often that they could be considered almost meaningless. With the `stop` token filter, these domain-specific terms would have to be added to the stopwords list manually. However, because the `cutoff_frequency` looks at the actual frequency of terms in the diff --git a/240_Stopwords/50_Phrase_queries.asciidoc b/240_Stopwords/50_Phrase_queries.asciidoc index 698273bea..723e5ec13 100644 --- a/240_Stopwords/50_Phrase_queries.asciidoc +++ b/240_Stopwords/50_Phrase_queries.asciidoc @@ -9,7 +9,7 @@ has to do with the amount of data that is necessary to support proximity matching. In <>, we said that removing stopwords saves only a small -amount of space in the inverted index.((("indexes", "typical, data contained in"))) That was only partially true. A +amount of space in the inverted index.((("indices", "typical, data contained in"))) That was only partially true. A typical index may contain, among other data, some or all of the following: Terms dictionary:: diff --git a/260_Synonyms/60_Multi_word_synonyms.asciidoc b/260_Synonyms/60_Multi_word_synonyms.asciidoc index aa9a40818..a1e7d6128 100644 --- a/260_Synonyms/60_Multi_word_synonyms.asciidoc +++ b/260_Synonyms/60_Multi_word_synonyms.asciidoc @@ -2,7 +2,7 @@ === Multiword Synonyms and Phrase Queries So far, synonyms appear to be quite straightforward. 
Unfortunately, this is -where things start to go wrong.((("synonyms", "multi-word, and phrase queries")))((("phrase matching", "multi-word synonyms and"))) For <> to +where things start to go wrong.((("synonyms", "multiword, and phrase queries")))((("phrase matching", "multiword synonyms and"))) For <> to function correctly, Elasticsearch needs to know the position that each term occupies in the original text. Multiword synonyms can play havoc with term positions, especially when the injected synonyms are of differing lengths. @@ -95,7 +95,7 @@ document that didn't contain the term `america`. [TIP] ================================================== -Multiword synonyms ((("highlighting searches", "multi-word synonyms and")))affect highlighting in a similar way. A query for `USA` +Multiword synonyms ((("highlighting searches", "multiword synonyms and")))affect highlighting in a similar way. A query for `USA` could end up returning a highlighted snippet such as: ``The _United States is wealthy_''. @@ -104,7 +104,7 @@ is wealthy_''. ==== Use Simple Contraction for Phrase Queries The way to avoid this mess is to use <> -to inject a single((("synonyms", "multi-word, and phrase queries", "using simple contraction")))((("phrase matching", "multi-word synonyms and", "using simple contraction")))((("simple contraction (synonyms)", "using for phrase queries"))) term that represents all synonyms, and to use the same +to inject a single((("synonyms", "multiword, and phrase queries", "using simple contraction")))((("phrase matching", "multiword synonyms and", "using simple contraction")))((("simple contraction (synonyms)", "using for phrase queries"))) term that represents all synonyms, and to use the same synonym token filter at query time: [source,json] @@ -160,7 +160,7 @@ different analysis chain for that purpose. ==== Synonyms and the query_string Query -We have tried to avoid discussing the `query_string` query ((("query strings", "synonyms and")))((("synonyms", "multi-word, and query string queries")))because we don't +We have tried to avoid discussing the `query_string` query ((("query strings", "synonyms and")))((("synonyms", "multiword, and query string queries")))because we don't recommend using it. In <>, we said that, because the `query_string` query supports a terse mini _search-syntax_, it could frequently lead to surprising results or even syntax errors. diff --git a/270_Fuzzy_matching/60_Phonetic_matching.asciidoc b/270_Fuzzy_matching/60_Phonetic_matching.asciidoc index c2ea19cb3..a5c82bf00 100644 --- a/270_Fuzzy_matching/60_Phonetic_matching.asciidoc +++ b/270_Fuzzy_matching/60_Phonetic_matching.asciidoc @@ -5,7 +5,7 @@ In a last, desperate, attempt to match something, anything, we could resort to searching for words that sound similar, ((("typoes and misspellings", "phonetic matching")))((("phonetic matching")))even if their spelling differs. 
Several algorithms exist for converting words into a phonetic -representation.((("phonet algorithms"))) The http://en.wikipedia.org/wiki/Soundex[Soundex] algorithm is +representation.((("phonetic algorithms"))) The http://en.wikipedia.org/wiki/Soundex[Soundex] algorithm is the granddaddy of them all, and most other phonetic algorithms are improvements or specializations of Soundex, such as http://en.wikipedia.org/wiki/Metaphone[Metaphone] and diff --git a/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc b/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc index 94e9afde0..0ba604687 100644 --- a/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc +++ b/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc @@ -16,7 +16,7 @@ will probably need access to other documents in the next query. It is cheaper to load all values once, and to _keep them in memory_, than to have to scan the inverted index on every request. -The JVM heap ((("JVM heap")))is a limited resource that should be used wisely. A number of +The JVM heap ((("JVM (Java Virtual Machine)", "heap usage, fielddata and")))is a limited resource that should be used wisely. A number of mechanisms exist to limit the impact of fielddata on heap usage. These limits are important because abuse of the heap will cause node instability (thanks to slow garbage collections) or even node death (with an OutOfMemory exception). @@ -24,7 +24,7 @@ slow garbage collections) or even node death (with an OutOfMemory exception). .Choosing a Heap Size ****************************************** -There are two rules to apply when setting ((("heap size, setting")))the Elasticsearch heap size, with +There are two rules to apply when setting ((("heap", "rules for setting size of")))the Elasticsearch heap size, with the `$ES_HEAP_SIZE` environment variable: No more than 50% of available RAM:: diff --git a/300_Aggregations/120_breadth_vs_depth.asciidoc b/300_Aggregations/120_breadth_vs_depth.asciidoc index d12b71038..89e8a1489 100644 --- a/300_Aggregations/120_breadth_vs_depth.asciidoc +++ b/300_Aggregations/120_breadth_vs_depth.asciidoc @@ -125,7 +125,7 @@ for classes of queries that are amenable to breadth-first. .Populate full depth for remaining nodes image::images/300_120_breadth_first_4.svg["Step 4: populate full depth for remaining nodes"] -To use breadth-first, simply ((("collect parameter", "enabling breadth-first")))enable it via the `collect` parameter: +To use breadth-first, simply ((("collect parameter, enabling breadth-first")))enable it via the `collect` parameter: [source,js] ---- diff --git a/300_Aggregations/30_histogram.asciidoc b/300_Aggregations/30_histogram.asciidoc index 0c8692da6..6ef664b0c 100644 --- a/300_Aggregations/30_histogram.asciidoc +++ b/300_Aggregations/30_histogram.asciidoc @@ -2,7 +2,7 @@ == Building Bar Charts One of the exciting aspects of aggregations are how easily they are converted -into charts and graphs.((("bar charts", "building from aggregations")))((("aggregations", "building bar charts from"))) In this chapter, we are focusing +into charts and graphs.((("bar charts, building from aggregations", id="ix_barcharts", range="startofrange")))((("aggregations", "building bar charts from"))) In this chapter, we are focusing on various analytics that we can wring out of our example dataset. We will also demonstrate the types of charts aggregations can power. 
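As a concrete illustration of the kind of aggregation that backs a bar chart, the sketch below buckets sale prices with a `histogram`; it assumes the example `cars/transactions` dataset used throughout these chapters, its numeric `price` field, and an arbitrary interval of 20,000:

[source,js]
----
GET /cars/transactions/_search?search_type=count
{
    "aggs": {
        "price_bands": {
            "histogram": {
                "field":    "price",
                "interval": 20000
            }
        }
    }
}
----

Each bucket's `doc_count` becomes the height of one bar, and nesting a metric such as `avg` or `sum` under `price_bands` would supply the y-axis value instead.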
@@ -154,4 +154,4 @@ This will allow us to build a chart like <>: image::images/elas_28in02.png["Average price of all makes, with error bars"] - +((("bar charts, building from aggregations", range="endofrange", startref="ix_barcharts"))) diff --git a/300_Aggregations/35_date_histogram.asciidoc b/300_Aggregations/35_date_histogram.asciidoc index b95d477a3..db1ccb604 100644 --- a/300_Aggregations/35_date_histogram.asciidoc +++ b/300_Aggregations/35_date_histogram.asciidoc @@ -265,7 +265,7 @@ This returns a (heavily truncated) response: } -------------------------------------------------- -We can take this response and put it into a graph, ((("line charts, building from aggregations")))((("bar charts", "building from aggregations")))showing a line chart for +We can take this response and put it into a graph, ((("line charts, building from aggregations")))((("bar charts, building from aggregations")))showing a line chart for total sale price, and a bar chart for each individual make (per quarter), as shown in <>. [[date-histo-ts2]] @@ -275,7 +275,7 @@ image::images/elas_29in02.png["Sales per quarter, with distribution per make"] === The Sky's the Limit These were obviously simple examples, but the sky really is the limit -when it comes to charting aggregations. ((("dashboards", "building from aggregations")))((("Kibana, dashboard in"))) For example, here is a dashboard in +when it comes to charting aggregations. ((("dashboards", "building from aggregations")))((("Kibana", "dashboard in"))) For example, here is a dashboard in Kibana built with a variety of aggregations: [[kibana-img]] diff --git a/300_Aggregations/40_scope.asciidoc b/300_Aggregations/40_scope.asciidoc index 9e61a0c76..462c2ea65 100644 --- a/300_Aggregations/40_scope.asciidoc +++ b/300_Aggregations/40_scope.asciidoc @@ -7,7 +7,7 @@ omitted a `query` from the search request. ((("queries", "in aggregations")))((( simply an aggregation. Aggregations can be run at the same time as search requests, but you need to -understand a new concept: _scope_. ((("scope", "scoping aggregations"))) By default, aggregations operate in the same +understand a new concept: _scope_. ((("scoping aggregations", id="ix_scopeaggs", range="startofrange"))) By default, aggregations operate in the same scope as the query. Put another way, aggregations are calculated on the set of documents that match your query. @@ -142,7 +142,7 @@ update in real time. Try that with Hadoop! You'll often want your aggregation to be scoped to your query. But sometimes you'll want to search for a subset of data, but aggregate across _all_ of -your data.((("aggregations", "scoping", "global bucket")))((("scope", "scoping aggregations", "using a global bucket"))) +your data.((("aggregations", "scoping", "global bucket")))((("scoping aggregations", "using a global bucket"))) For example, say you want to know the average price of Ford cars compared to the average price of _all_ cars. We can use a regular aggregation (scoped to the query) @@ -193,5 +193,5 @@ the average price of all cars. If you've made it this far in the book, you'll recognize the mantra: use a filter wherever you can. The same applies to aggregations, and in the next chapter we show you how to filter an aggregation instead of just limiting the query -scope. 
+scope.((("scoping aggregations", range="endofrange", startref="ix_scopeaggs"))) diff --git a/300_Aggregations/60_cardinality.asciidoc b/300_Aggregations/60_cardinality.asciidoc index f0c981b81..9459a97e7 100644 --- a/300_Aggregations/60_cardinality.asciidoc +++ b/300_Aggregations/60_cardinality.asciidoc @@ -50,7 +50,7 @@ cars: -------------------------------------------------- We can make our example more useful: how many colors were sold each month? For -that metric, we just nest the `cardinality` metric under ((("date histograms, building", "cardinality metric nested under")))a `date_histogram`: +that metric, we just nest the `cardinality` metric under ((("date histograms, building")))a `date_histogram`: [source,js] -------------------------------------------------- @@ -77,7 +77,7 @@ GET /cars/transactions/_search?search_type=count ==== Understanding the Trade-offs As mentioned at the top of this chapter, the `cardinality` metric is an approximate -algorithm. ((("cardinality", "understanding the tradeoffs"))) It is based on the http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++] (HLL) algorithm.((("HLL algorithm"))) HLL works by +algorithm. ((("cardinality", "understanding the tradeoffs"))) It is based on the http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++] (HLL) algorithm.((("HLL (HyperLogLog) algorithm")))((("HyperLogLog (HLL) algorithm"))) HLL works by hashing your input and using the bits from the hash to make probabilistic estimations on the cardinality. @@ -129,7 +129,7 @@ counting millions of unique values. If you want a distinct count, you _usually_ want to query your entire dataset (or nearly all of it). ((("cardinality", "optimizing for speed")))((("distinct counts", "optimizing for speed"))) Any operation on all your data needs to execute quickly, for obvious reasons. HyperLogLog is very fast already--it simply -hashes your data and does some bit-twiddling.((("HLL algorithm"))) +hashes your data and does some bit-twiddling.((("HyperLogLog (HLL) algorithm")))((("HLL (HyperLogLog) algorithm"))) But if speed is important to you, we can optimize it a little bit further. Since HLL simply needs the hash of the field, we can precompute that hash at diff --git a/300_Aggregations/90_fielddata.asciidoc b/300_Aggregations/90_fielddata.asciidoc index 1955d34ba..1af3f9afa 100644 --- a/300_Aggregations/90_fielddata.asciidoc +++ b/300_Aggregations/90_fielddata.asciidoc @@ -93,7 +93,7 @@ take the union of the two sets. [TIP] ================================================== -The fielddata cache is per segment.((("segments", "fielddata cache")))((("caching", "fielddata"))) In other words, when a new segment becomes +The fielddata cache is per segment.((("fielddata cache")))((("segments", "fielddata cache"))) In other words, when a new segment becomes visible to search, the fielddata cached from old segments remains valid. Only the data for the new segment needs to be loaded into memory. 
diff --git a/310_Geopoints/20_Geopoints.asciidoc b/310_Geopoints/20_Geopoints.asciidoc index 2332bdcf6..159873c08 100644 --- a/310_Geopoints/20_Geopoints.asciidoc +++ b/310_Geopoints/20_Geopoints.asciidoc @@ -33,7 +33,7 @@ PUT /attractions === Lat/Lon Formats With the `location` field defined as a `geo_point`, we can proceed to index -documents containing latitude/longitude pairs,((("geo-points", "location fields defined as, lat/lon formats")))((("location field", "defined as geo-point")))((("latitude/longitude pairs", "lat/lon formats for geo-points")))((("arrays", "geo-point, lon/lat format")))((("strings", "geo-point, lat/lon format")))((("objects", "geo-point, lat/lon format"))) which can be formatted as +documents containing latitude/longitude pairs,((("geo-points", "location fields defined as, lat/lon formats")))((("location field, defined as geo-point")))((("latitude/longitude pairs", "lat/lon formats for geo-points")))((("arrays", "geo-point, lon/lat format")))((("strings", "geo-point, lat/lon format")))((("objects", "geo-point, lat/lon format"))) which can be formatted as strings, arrays, or objects: [role="pagebreak-before"] diff --git a/330_Geo_aggs/62_Geo_distance_agg.asciidoc b/330_Geo_aggs/62_Geo_distance_agg.asciidoc index 3ad2a37d1..9f2761e2a 100644 --- a/330_Geo_aggs/62_Geo_distance_agg.asciidoc +++ b/330_Geo_aggs/62_Geo_distance_agg.asciidoc @@ -118,4 +118,4 @@ The response from ((("post filter", "geo_distance aggregation")))the preceding r In this example, we have counted the number of restaurants that fall into each concentric ring. Of course, we could nest subaggregations under the `per_rings` aggregation to calculate the average price per ring, the -maximium popularity, and more.((("per_rings aggregation"))) +maximum popularity, and more. diff --git a/330_Geo_aggs/66_Geo_bounds_agg.asciidoc b/330_Geo_aggs/66_Geo_bounds_agg.asciidoc index b31a1bf01..32de81e87 100644 --- a/330_Geo_aggs/66_Geo_bounds_agg.asciidoc +++ b/330_Geo_aggs/66_Geo_bounds_agg.asciidoc @@ -71,7 +71,7 @@ The response now includes a bounding box that we can use to zoom our map: ---------------------------- In fact, we could even use the `geo_bounds` aggregation inside each geohash -cell,((("geohash cells", "geo_bounds aggregation in"))) in case the geo-points inside a cell are clustered in just a part of the +cell,((("geohash cells, geo_bounds aggregation in"))) in case the geo-points inside a cell are clustered in just a part of the cell: [source,json] diff --git a/340_Geoshapes/72_Mapping_geo_shapes.asciidoc b/340_Geoshapes/72_Mapping_geo_shapes.asciidoc index 8a6b1c135..5b67fe52c 100644 --- a/340_Geoshapes/72_Mapping_geo_shapes.asciidoc +++ b/340_Geoshapes/72_Mapping_geo_shapes.asciidoc @@ -27,7 +27,7 @@ There are two important settings that you should consider changing `precision` a ==== precision -The `precision` parameter ((("geo-shapes", "precision")))((("precision parameter", "geo-shapes")))controls the maximum length of the geohashes that +The `precision` parameter ((("geo-shapes", "precision")))((("precision parameter, geo-shapes")))controls the maximum length of the geohashes that are generated. It defaults to a precision of `9`, which equates to a <> with dimensions of about 5m x 5m. That is probably far more precise than you need. 
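A coarser precision is declared directly in the mapping. As a rough sketch, reusing the `attractions` index from the geo-point examples and an assumed `landmark` type, a 50m precision might look like this:

[source,json]
----
PUT /attractions
{
  "mappings": {
    "landmark": {
      "properties": {
        "location": {
          "type":      "geo_shape",
          "precision": "50m"
        }
      }
    }
  }
}
----

A larger cell size means far fewer terms are indexed per shape, at the cost of some accuracy around the edges of the shape.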
diff --git a/400_Relationships/25_Concurrency.asciidoc b/400_Relationships/25_Concurrency.asciidoc index dafdfc353..8fed83a8a 100644 --- a/400_Relationships/25_Concurrency.asciidoc +++ b/400_Relationships/25_Concurrency.asciidoc @@ -1,7 +1,7 @@ [[denormalization-concurrency]] === Denormalization and Concurrency -Of course, data denormalization has downsides too.((("relationships", "denormalization and concurrency")))((("concurreny", "denormalization and")))((("denormaliation", "and concurrency"))) The first disadvantage is +Of course, data denormalization has downsides too.((("relationships", "denormalization and concurrency")))((("concurrency", "denormalization and")))((("denormalization", "and concurrency"))) The first disadvantage is that the index will be bigger because the `_source` document for every blog post is bigger, and there are more indexed fields. This usually isn't a huge problem. The data written to disk is highly compressed, and disk space diff --git a/402_Nested/35_Nested_aggs.asciidoc b/402_Nested/35_Nested_aggs.asciidoc index 32ae70a5a..44b0be54d 100644 --- a/402_Nested/35_Nested_aggs.asciidoc +++ b/402_Nested/35_Nested_aggs.asciidoc @@ -157,7 +157,7 @@ The abbreviated results show us the following: ==== When to Use Nested Objects -Nested objects are useful when there is one main entity, like our `blogpost`, +Nested objects((("nested objects", "when to use"))) are useful when there is one main entity, like our `blogpost`, with a limited number of closely related but less important entities, such as comments. It is useful to be able to find blog posts based on the content of the comments, and the `nested` query and filter provide for fast query-time diff --git a/404_Parent_Child/50_Has_child.asciidoc b/404_Parent_Child/50_Has_child.asciidoc index 571e9bc22..131df72d4 100644 --- a/404_Parent_Child/50_Has_child.asciidoc +++ b/404_Parent_Child/50_Has_child.asciidoc @@ -2,7 +2,7 @@ === Finding Parents by Their Children The `has_child` query and filter can be used to find parent documents based on -the contents of their children.((("parent-child relationship", "finding parents by their children"))) For instance, we could find all branches that +the contents of their children.((("has_child query and filter")))((("parent-child relationship", "finding parents by their children"))) For instance, we could find all branches that have employees born after 1980 with a query like this: [source,json] @@ -25,7 +25,7 @@ GET /company/branch/_search ------------------------- Like the <>, the `has_child` query could -match several child documents,((("has_child query"))) each with a different relevance +match several child documents,((("has_child query and filter", "query"))) each with a different relevance score. How these scores are reduced to a single score for the parent document depends on the `score_mode` parameter. 
The default setting is `none`, which ignores the child scores and assigns a score of `1.0` to the parents, but it @@ -62,7 +62,7 @@ score.((("parent-child relationship", "finding parents by their children", "min_ ==== min_children and max_children The `has_child` query and filter both accept the `min_children` and -`max_children` parameters,((("min_children parameter")))((("max_children parameter")))((("has_child query"))) which will return the parent document only if the +`max_children` parameters,((("min_children parameter")))((("max_children parameter")))((("has_child query and filter", "min_children or max_children parameters"))) which will return the parent document only if the number of matching children is within the specified range. This query will match only branches that have at least two employees: @@ -91,7 +91,7 @@ enabled. .has_child Filter ************************** -The `has_child` filter works((("has_child filter"))) in the same way as the `has_child` query, except +The `has_child` filter works((("has_child query and filter", "filter"))) in the same way as the `has_child` query, except that it doesn't support the `score_mode` parameter. It can be used only in _filter context_—such as inside a `filtered` query--and behaves like any other filter: it includes or excludes, but doesn't score. diff --git a/404_Parent_Child/55_Has_parent.asciidoc b/404_Parent_Child/55_Has_parent.asciidoc index 205f5bb1d..c0b83dcf3 100644 --- a/404_Parent_Child/55_Has_parent.asciidoc +++ b/404_Parent_Child/55_Has_parent.asciidoc @@ -5,7 +5,7 @@ While a `nested` query can always ((("parent-child relationship", "finding child parent and child documents are independent and each can be queried independently. The `has_child` query allows us to return parents based on data in their children, and the `has_parent` query returns children based on -data in their parents.((("has_parent query"))) +data in their parents.((("has_parent query and filter", "query"))) It looks very similar to the `has_child` query. This example returns employees who work in the UK: @@ -28,7 +28,7 @@ GET /company/employee/_search ------------------------- <1> Returns children who have parents of type `branch` -The `has_parent` query also supports the `score_mode`, but it accepts only two +The `has_parent` query also supports the `score_mode`,((("score_mode parameter"))) but it accepts only two settings: `none` (the default) and `score`. Each child can have only one parent, so there is no need to reduce multiple scores into a single score for the child. The choice is simply between using the score (`score`) or not @@ -37,7 +37,7 @@ the child. The choice is simply between using the score (`score`) or not .has_parent Filter ************************** -The `has_parent` filter works in the same way((("has_parent filter"))) as the `has_parent` query, except +The `has_parent` filter works in the same way((("has_parent query and filter", "filter"))) as the `has_parent` query, except that it doesn't support the `score_mode` parameter. It can be used only in _filter context_—such as inside a `filtered` query--and behaves like any other filter: it includes or excludes, but doesn't score. 
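For example, a `filtered` query that keeps only employees whose parent branch is in the UK might be sketched as follows, reusing the `company` index from this chapter and assuming the branch documents carry a `country` field:

[source,json]
----
GET /company/employee/_search
{
  "query": {
    "filtered": {
      "filter": {
        "has_parent": {
          "parent_type": "branch",
          "query": {
            "match": { "country": "UK" }
          }
        }
      }
    }
  }
}
----

Because it runs in filter context, the parent match contributes nothing to the `_score`; the children are simply included or excluded.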
diff --git a/410_Scaling/40_Multiple_indices.asciidoc b/410_Scaling/40_Multiple_indices.asciidoc index a2770c8a5..03b3f1f7f 100644 --- a/410_Scaling/40_Multiple_indices.asciidoc +++ b/410_Scaling/40_Multiple_indices.asciidoc @@ -2,7 +2,7 @@ === Multiple Indices Finally, remember that there is no rule that limits your application to using -only a single index.((("scaling", "using multiple indexes")))((("indexes", "multiple"))) When we issue a search request, it is forwarded to a +only a single index.((("scaling", "using multiple indices")))((("indices", "multiple"))) When we issue a search request, it is forwarded to a copy (a primary or a replica) of all the shards in an index. If we issue the same search request on multiple indices, the exact same thing happens--there are just more shards involved. diff --git a/410_Scaling/45_Index_per_timeframe.asciidoc b/410_Scaling/45_Index_per_timeframe.asciidoc index d8a84cd49..12d739732 100644 --- a/410_Scaling/45_Index_per_timeframe.asciidoc +++ b/410_Scaling/45_Index_per_timeframe.asciidoc @@ -1,7 +1,7 @@ [[time-based]] === Time-Based Data -One of the most common use cases for Elasticsearch is for logging,((("logging, use of Elasticsearch for")))((("time-based data")))((("scaling", "time-based data and"))) so common +One of the most common use cases for Elasticsearch is for logging,((("logging", "using Elasticsearch for")))((("time-based data")))((("scaling", "time-based data and"))) so common in fact that Elasticsearch provides an integrated((("ELK stack"))) logging platform called the _ELK stack_—Elasticsearch, Logstash, and Kibana--to make the process easy. @@ -51,7 +51,7 @@ But this approach is _very inefficient_. Remember that when you delete a document, it is only _marked_ as deleted (see <>). It won't be physically deleted until the segment containing it is merged away. -Instead, use an _index per time frame_. ((("index per-timeframe")))You could start out with an index per +Instead, use an _index per time frame_. ((("indices", "index per-timeframe")))You could start out with an index per year (`logs_2014`) or per month (`logs_2014-10`). Perhaps, when your website gets really busy, you need to switch to an index per day (`logs_2014-10-24`). Purging old data is easy: just delete old indices. diff --git a/410_Scaling/50_Index_templates.asciidoc b/410_Scaling/50_Index_templates.asciidoc index 4c9090065..a855894fb 100644 --- a/410_Scaling/50_Index_templates.asciidoc +++ b/410_Scaling/50_Index_templates.asciidoc @@ -1,11 +1,11 @@ [[index-templates]] === Index Templates -Elasticsearch doesn't require you to create an index before using it.((("index templates")))((("scaling", "index templates and"))) With +Elasticsearch doesn't require you to create an index before using it.((("indices", "templates")))((("scaling", "index templates and")))((("templates", "index"))) With logging, it is often more convenient to rely on index autocreation than to have to create indices manually. -Logstash uses the timestamp((("Logstash")))((("timestamps", "use by Logtech to create index names"))) from an event to derive the index name. By +Logstash uses the timestamp((("Logstash")))((("timestamps, use by Logstash to create index names"))) from an event to derive the index name. By default, it indexes into a different index every day, so an event with a `@timestamp` of `2014-10-01 00:00:01` will be sent to the index `logstash-2014.10.01`. 
If that index doesn't already exist, it will be diff --git a/410_Scaling/55_Retiring_data.asciidoc b/410_Scaling/55_Retiring_data.asciidoc index 09651b8f9..3f5f6936e 100644 --- a/410_Scaling/55_Retiring_data.asciidoc +++ b/410_Scaling/55_Retiring_data.asciidoc @@ -5,7 +5,7 @@ As time-based data ages, it becomes less relevant.((("scaling", "retiring data") will want to see what happened last week, last month, or even last year, but for the most part, we're interested in only the here and now. -The nice thing about an index per time frame ((("index per-timeframe", "deleting old data and")))((("indexes", "deleting")))is that it enables us to easily +The nice thing about an index per time frame ((("indices", "index per-timeframe", "deleting old data and")))((("indices", "deleting")))is that it enables us to easily delete old data: just delete the indices that are no longer relevant: [source,json] @@ -23,7 +23,7 @@ do to help data age gracefully, before we decide to delete it completely. ==== Migrate Old Indices With logging data, there is likely to be one _hot_ index--the index for -today.((("indexes", "migrating old indexes"))) All new documents will be added to that index, and almost all queries +today.((("indices", "migrating old indices"))) All new documents will be added to that index, and almost all queries will target that index. It should use your best hardware. How does Elasticsearch know which servers are your best servers? You tell it, @@ -63,7 +63,7 @@ POST /logs_2014-09-30/_settings [[optimize-indices]] ==== Optimize Indices -Yesterday's index is unlikely to change.((("indexes", "optimizing"))) Log events are static: what +Yesterday's index is unlikely to change.((("indices", "optimizing"))) Log events are static: what happened in the past stays in the past. If we merge each shard down to just a single segment, it'll use fewer resources and will be quicker to query. We can do this with the <>. @@ -97,7 +97,7 @@ http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-sn ==== Closing Old Indices As indices get even older, they reach a point where they are almost never -accessed.((("indexes", "closing old indexes"))) We could delete them at this stage, but perhaps you want to keep +accessed.((("indices", "closing old indices"))) We could delete them at this stage, but perhaps you want to keep them around just in case somebody asks for them in six months. These indices can be closed. They will still exist in the cluster, but they @@ -121,7 +121,7 @@ POST /logs_2014-01-*/_open <3> [[archive-indices]] ==== Archiving Old Indices -Finally, very old indices ((("indexes", "archiving old indexes")))can be archived off to some long-term storage like a +Finally, very old indices ((("indices", "archiving old indices")))can be archived off to some long-term storage like a shared disk or Amazon's S3 using the http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html[`snapshot-restore` API], just in case you may need to access them in the future. Once a backup exists, the index can be deleted diff --git a/410_Scaling/60_Index_per_user.asciidoc b/410_Scaling/60_Index_per_user.asciidoc index 5d0b11c58..1dcbf4d51 100644 --- a/410_Scaling/60_Index_per_user.asciidoc +++ b/410_Scaling/60_Index_per_user.asciidoc @@ -17,7 +17,7 @@ own documents. 
Some users have more documents than others, and some users will have heavier search loads than others, so the ability to specify the number of primary shards and replica shards that each index should have fits well with the index-per-user -model.((("index-per-user model")))((("primary shards", "number per-index"))) Similarly, busier indices can be allocated to stronger boxes with shard +model.((("indices", "index-per-user model")))((("primary shards", "number per-index"))) Similarly, busier indices can be allocated to stronger boxes with shard allocation filtering. (See <>.) TIP: Don't just use the default number of primary shards for every index. diff --git a/410_Scaling/65_Shared_index.asciidoc b/410_Scaling/65_Shared_index.asciidoc index 0fea4ca96..e127ec3dd 100644 --- a/410_Scaling/65_Shared_index.asciidoc +++ b/410_Scaling/65_Shared_index.asciidoc @@ -1,7 +1,7 @@ [[shared-index]] === Shared Index -We can use a large shared index for the many smaller ((("scaling", "shared index")))((("indexes", "shared")))forums by indexing +We can use a large shared index for the many smaller ((("scaling", "shared index")))((("indices", "shared")))forums by indexing the forum identifier in a field and using it as a filter: [source,json] diff --git a/410_Scaling/75_One_big_user.asciidoc b/410_Scaling/75_One_big_user.asciidoc index fab9571bd..a31f02712 100644 --- a/410_Scaling/75_One_big_user.asciidoc +++ b/410_Scaling/75_One_big_user.asciidoc @@ -7,7 +7,7 @@ because it holds the documents for a forum that has become very popular. That forum now needs its own index. The index aliases that we're using to fake an index per user give us a clean -migration path for the big forum.((("indexes", "shared", "migrating data to dedicated index"))) +migration path for the big forum.((("indices", "shared", "migrating data to dedicated index"))) The first step is to create a new index dedicated to the forum, and with the appropriate number of shards to allow for expected growth: diff --git a/500_Cluster_Admin/20_health.asciidoc b/500_Cluster_Admin/20_health.asciidoc index a252724b0..1adf814f0 100644 --- a/500_Cluster_Admin/20_health.asciidoc +++ b/500_Cluster_Admin/20_health.asciidoc @@ -79,7 +79,7 @@ cluster is `red` (since primaries are missing). ==== Drilling Deeper: Finding Problematic Indices -Imagine something goes wrong one day, and you notice that your cluster health +Imagine something goes wrong one day,((("indices", "problematic, finding"))) and you notice that your cluster health looks like this: [source,js] diff --git a/500_Cluster_Admin/30_node_stats.asciidoc b/500_Cluster_Admin/30_node_stats.asciidoc index 6d3cc8477..f30f0cb1b 100644 --- a/500_Cluster_Admin/30_node_stats.asciidoc +++ b/500_Cluster_Admin/30_node_stats.asciidoc @@ -2,7 +2,7 @@ === Monitoring Individual Nodes `Cluster-health` is at one end of the spectrum--a very high-level overview of -everything in your cluster. ((("clusters", "administration", "monitoring individual nodes")))((("nodes", "monitoring individual nodes"))) The `node-stats` API is at the other end. ((("Node Stats API"))) It provides +everything in your cluster. ((("clusters", "administration", "monitoring individual nodes")))((("nodes", "monitoring individual nodes"))) The `node-stats` API is at the other end. ((("Node Stats API", id="ix_NodeStats", range="startofrange"))) It provides a bewildering array of statistics about each node in your cluster. 
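Everything comes back from a single call; to pull the statistics for every node at once, the request is simply:

[source,js]
----
GET /_nodes/stats
----

You can also limit the output to particular nodes by naming them in the URL (for example, `GET /_nodes/my_node_name/stats`).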
`Node-stats` provides so many stats that, until you are accustomed to the output, @@ -45,7 +45,7 @@ or the node is binding to the wrong IP address/interface. ==== indices Section -The `indices` section lists aggregate statistics((("indexes", "indices section in Node Stats API"))) for all the indices that reside +The `indices` section lists aggregate statistics((("indices", "indices section in Node Stats API"))) for all the indices that reside on this particular node: [source,js] @@ -186,14 +186,14 @@ There is little you can do to affect this memory usage, since it has a fairly li relationship with the number of parent/child docs. It is heap-resident, however, so it's a good idea to keep an eye on it. -- `field_data` displays the memory used by fielddata, which is used for aggregations, +- `field_data` displays the memory used by fielddata,((("fielddata", "statistics on"))) which is used for aggregations, sorting, and more. There is also an eviction count. Unlike `filter_cache`, the eviction count here is useful: it should be zero or very close. Since field data is not a cache, any eviction is costly and should be avoided. If you see evictions here, you need to reevaluate your memory situation, fielddata limits, queries, or all three. -- `segments` will tell you the number of Lucene segments this node currently serves. +- `segments` will tell you the number of Lucene segments this node currently serves.((("segments", "number served by a node"))) This can be an important number. Most indices should have around 50–150 segments, even if they are terabytes in size with billions of documents. Large numbers of segments can indicate a problem with merging (for example, merging is not keeping up @@ -201,7 +201,7 @@ with segment creation). Note that this statistic is the aggregate total of all indices on the node, so keep that in mind. + The `memory` statistic gives you an idea of the amount of memory being used by the -Lucene segments themselves. This includes low-level data structures such as +Lucene segments themselves.((("memory", "statistics on"))) This includes low-level data structures such as posting lists, dictionaries, and bloom filters. A very large number of segments will increase the amount of overhead lost to these data structures, and the memory usage can be a handy metric to gauge that overhead. @@ -209,7 +209,7 @@ usage can be a handy metric to gauge that overhead. ==== OS and Process Sections The `OS` and `Process` sections are fairly self-explanatory and won't be covered -in great detail. They list basic resource statistics such as CPU and load. The +in great detail.((("operating system (OS), statistics on"))) They list basic resource statistics such as CPU and load.((("process (Elasticsearch JVM), statistics on"))) The `OS` section describes it for the entire `OS`, while the `Process` section shows just what the Elasticsearch JVM process is using. @@ -225,14 +225,14 @@ monitoring stack. Some stats include the following: ==== JVM Section The `jvm` section contains some critical information about the JVM process that -is running Elasticsearch. Most important, it contains garbage collection details, +is running Elasticsearch.((("JVM (Java Virtual Machine)", "statistics on"))) Most important, it contains garbage collection details, which have a large impact on the stability of your Elasticsearch cluster. 
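When garbage collection is what you are investigating, it is often handier to pull back just this one section instead of the whole report; the node-stats endpoint lets you name the sections you want, along the lines of:

[source,js]
----
GET /_nodes/stats/jvm
----

The heap usage figures and the collection counts and times for the young and old generations are the numbers to watch; the garbage collection primer that follows gives the background for reading them.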
[[garbage_collector_primer]] .Garbage Collection Primer ********************************** Before we describe the stats, it is useful to give a crash course in garbage -collection and its impact on Elasticsearch. If you are familar with garbage +collection and its impact on Elasticsearch.((("garbage collection"))) If you are familiar with garbage collection in the JVM, feel free to skip down. Java is a _garbage-collected_ language, which means that the programmer does @@ -393,7 +393,7 @@ discussed in <>. ==== Threadpool Section -Elasticsearch maintains threadpools internally. These threadpools +Elasticsearch maintains threadpools internally. ((("threadpools", "statistics on"))) These threadpools cooperate to get work done, passing work between each other as necessary. In general, you don't need to configure or tune the threadpools, but it is sometimes useful to see their stats so you can gain insight into how your cluster is behaving. @@ -425,7 +425,7 @@ up with the influx of work. .Bulk Rejections **** If you are going to encounter queue rejections, it will most likely be caused -by bulk indexing requests. It is easy to send many bulk requests to Elasticsearch +by bulk indexing requests.((("bulk API", "rejections of bulk requests"))) It is easy to send many bulk requests to Elasticsearch by using concurrent import processes. More is better, right? In reality, each cluster has a certain limit at which it can not keep up with @@ -480,7 +480,7 @@ are good to keep an eye on: ==== FS and Network Sections -Continuing down the `node-stats` API, you'll see a bunch of statistics about your +Continuing down the `node-stats` API, you'll see a((("filesystem, statistics on"))) bunch of statistics about your filesystem: free space, data directory paths, disk I/O stats, and more. If you are not monitoring free disk space, you can get those stats here. The disk I/O stats are also handy, but often more specialized command-line tools (`iostat`, for example) are more useful. Obviously, Elasticsearch has a difficult time functioning if you run out of disk space--so make sure you don't. -There are also two sections on network statistics: +There are also two sections on ((("network", "statistics on")))network statistics: [source,js] ---- @@ -520,7 +520,7 @@ are configured appropriately. ==== Circuit Breaker -Finally, we come to the last section: stats about the fielddata circuit breaker +Finally, we come to the last section: stats about the((("fielddata circuit breaker"))) fielddata circuit breaker (introduced in <>): [role="pagebreak-before"] [source,js] ---- @@ -544,7 +544,7 @@ the currently configured overhead. The overhead is used to pad estimates, becau The main thing to watch is the `tripped` metric. If this number is large or consistently increasing, it's a sign that your queries may need to be optimized or that you may need to obtain more memory (either per box or by adding more -nodes). 
+nodes).((("Node Stats API", range="endofrange", startref="ix_NodeStats"))) diff --git a/500_Cluster_Admin/40_other_stats.asciidoc b/500_Cluster_Admin/40_other_stats.asciidoc index 001bfd2b1..6b6facaf5 100644 --- a/500_Cluster_Admin/40_other_stats.asciidoc +++ b/500_Cluster_Admin/40_other_stats.asciidoc @@ -21,7 +21,7 @@ GET _cluster/stats === Index Stats -So far, we have been looking at _node-centric_ statistics:((("indexes", "index statistics")))((("clusters", "administration", "index stats"))) How much memory does +So far, we have been looking at _node-centric_ statistics:((("indices", "index statistics")))((("clusters", "administration", "index stats"))) How much memory does this node have? How much CPU is being used? How many searches is this node servicing? diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index 3fef6da63..acc588466 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -13,7 +13,7 @@ production clusters. If there is one resource that you will run out of first, it will likely be memory.((("hardware", "memory")))((("memory"))) Sorting and aggregations can both be memory hungry, so enough heap space to -accommodate these is important. Even when the heap is comparatively small, +accommodate these is important.((("heap"))) Even when the heap is comparatively small, extra memory can be given to the OS filesystem cache. Because many data structures used by Lucene are disk-based formats, Elasticsearch leverages the OS cache to great effect. diff --git a/510_Deployment/30_other.asciidoc b/510_Deployment/30_other.asciidoc index 6e2e7fc7a..63c00050a 100644 --- a/510_Deployment/30_other.asciidoc +++ b/510_Deployment/30_other.asciidoc @@ -2,7 +2,7 @@ === Java Virtual Machine You should always run the most recent version of the Java Virtual Machine (JVM), -unless otherwise stated on the Elasticsearch website.((("deployment", "Java Virtual Machine (JVM)")))((("JVM (Java Virtual Machine)")))((("Java Virtual Machine (JVM)"))) Elasticsearch, and in +unless otherwise stated on the Elasticsearch website.((("deployment", "Java Virtual Machine (JVM)")))((("JVM (Java Virtual Machine)")))((("Java Virtual Machine", see="JVM"))) Elasticsearch, and in particular Lucene, is a demanding piece of software. The unit and integration tests from Lucene often expose bugs in the JVM itself. These bugs range from mild annoyances to serious segfaults, so it is best to use the latest version @@ -20,7 +20,7 @@ between client and server. .Please Do Not Tweak JVM Settings **** -The JVM exposes dozens (hundreds even!) of settings, parameters, and configurations. +The JVM exposes dozens (hundreds even!) of settings, parameters, and configurations.((("JVMs (Java Virtual Machines)", "avoiding custom configuration"))) They allow you to tweak and tune almost every aspect of the JVM. When a knob is encountered, it is human nature to want to turn it. We implore @@ -37,7 +37,7 @@ half the time, this alone restores stability and performance. 
=== Transport Client Versus Node Client If you are using Java, you may wonder when to use the transport client versus the -node client.((("clients")))((("NodeClient versus TransportClient")))((("TransportClient versus NodeClient"))) As discussed at the beginning of the book, the transport client +node client.((("Java", "clients for Elasticsearch")))((("clients")))((("node client", "versus transport client")))((("transport client", "versus node client"))) As discussed at the beginning of the book, the transport client acts as a communication layer between the cluster and your application. It knows the API and can automatically round-robin between nodes, sniff the cluster for you, and more. But it is _external_ to the cluster, similar to the REST clients. diff --git a/510_Deployment/50_heap.asciidoc b/510_Deployment/50_heap.asciidoc index 6a7a7c6ca..a886bb0c5 100644 --- a/510_Deployment/50_heap.asciidoc +++ b/510_Deployment/50_heap.asciidoc @@ -68,7 +68,7 @@ is more wasted space simply because the pointer is larger. And worse than waste space, the larger pointers eat up more bandwidth when moving values between main memory and various caches (LLC, L1, and so forth). -Java uses a trick called https://wikis.oracle.com/display/HotSpotInternals/CompressedOops[compressed oops]((("compressed oops"))) +Java uses a trick called https://wikis.oracle.com/display/HotSpotInternals/CompressedOops[compressed oops]((("compressed object pointers"))) to get around this problem. Instead of pointing at exact byte locations in memory, the pointers reference _object offsets_.((("object offsets"))) This means a 32-bit pointer can reference four billion _objects_, rather than four billion bytes. Ultimately, this diff --git a/520_Post_Deployment/20_logging.asciidoc b/520_Post_Deployment/20_logging.asciidoc index 24220b9d8..125b2e101 100644 --- a/520_Post_Deployment/20_logging.asciidoc +++ b/520_Post_Deployment/20_logging.asciidoc @@ -11,7 +11,7 @@ up the logging level to `DEBUG`. You _could_ modify the `logging.yml` file and restart your nodes--but that is both tedious and leads to unnecessary downtime. Instead, you can update logging -levels through the `cluster-settings` API((("Cluster Settings API", "updating logging levels"))) that we just learned about. +levels through the `cluster-settings` API((("Cluster Settings API, updating logging levels"))) that we just learned about. To do so, take the logger you are interested in and prepend `logger.` to it. Let's turn up the discovery logging: diff --git a/520_Post_Deployment/50_backup.asciidoc b/520_Post_Deployment/50_backup.asciidoc index f54ac6153..7f7d63f69 100644 --- a/520_Post_Deployment/50_backup.asciidoc +++ b/520_Post_Deployment/50_backup.asciidoc @@ -80,7 +80,7 @@ of the existing repository. ==== Snapshotting All Open Indices -A repository can contain multiple snapshots.((("indexes", "open, snapshots on")))((("backing up your cluster", "snapshots on all open indexes"))) Each snapshot is associated with a +A repository can contain multiple snapshots.((("indices", "open, snapshots on")))((("backing up your cluster", "snapshots on all open indexes"))) Each snapshot is associated with a certain set of indices (for example, all indices, some subset, or a single index). When creating a snapshot, you specify which indices you are interested in and give the snapshot a unique name. @@ -115,7 +115,7 @@ may take a long time to return! 
==== Snapshotting Particular Indices -The default behavior is to back up all open indices.((("indexes", "snapshotting particular")))((("backing up your cluster", "snapshotting particular indexes"))) But say you are using Marvel, +The default behavior is to back up all open indices.((("indices", "snapshotting particular")))((("backing up your cluster", "snapshotting particular indices"))) But say you are using Marvel, and don't really want to back up all the diagnostic `.marvel` indices. You just don't have enough space to back up everything. diff --git a/520_Post_Deployment/60_restore.asciidoc b/520_Post_Deployment/60_restore.asciidoc index e521301f5..d4f383eb6 100644 --- a/520_Post_Deployment/60_restore.asciidoc +++ b/520_Post_Deployment/60_restore.asciidoc @@ -11,7 +11,7 @@ POST _snapshot/my_backup/snapshot_1/_restore The default behavior is to restore all indices that exist in that snapshot. If `snapshot_1` contains five indices, all five will be restored into -our cluster. ((("indexes", "restoring from a snapshot"))) As with the `snapshot` API, it is possible to select which indices +our cluster. ((("indices", "restoring from a snapshot"))) As with the `snapshot` API, it is possible to select which indices we want to restore. There are also additional options for renaming indices. This allows you to