Skip to content

Latest commit

 

History

History
411 lines (360 loc) · 16 KB

optimization.rst

File metadata and controls

411 lines (360 loc) · 16 KB

Optimizations

Table of contents

In this doc, we will cover the optimization solution in the query engine and the limitations.

The Logical Plan optimization transfer the logical operator by predefined rules. The following section will cover the detail rules.

The where clause will reduce the size of rows markedly, thus the where clause optimization is the most important optimization.

There are two rules involved in the core engine.

The consecutive Filter operator will be merged as one Filter operator:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_ppl/_explain \
... -d '{"query" : "source=accounts | where age > 10 | where age < 20 | fields age"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[age]"
    },
    "children": [
      {
        "name": "ElasticsearchIndexScan",
        "description": {
          "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":0,\"size\":200,\"timeout\":\"1m\",\"query\":{\"bool\":{\"filter\":[{\"range\":{\"age\":{\"from\":null,\"to\":20,\"include_lower\":true,\"include_upper\":false,\"boost\":1.0}}},{\"range\":{\"age\":{\"from\":10,\"to\":null,\"include_lower\":false,\"include_upper\":true,\"boost\":1.0}}}],\"adjust_pure_negative\":true,\"boost\":1.0}},\"_source\":{\"includes\":[\"age\"],\"excludes\":[]},\"sort\":[{\"_doc\":{\"order\":\"asc\"}}]}, searchDone=false)"
        },
        "children": []
      }
    ]
  }
}

The Filter operator should be push down under Sort operator:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_ppl/_explain \
... -d '{"query" : "source=accounts | sort age | where age < 20 | fields age"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[age]"
    },
    "children": [
      {
        "name": "ElasticsearchIndexScan",
        "description": {
          "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":0,\"size\":200,\"timeout\":\"1m\",\"query\":{\"range\":{\"age\":{\"from\":null,\"to\":20,\"include_lower\":true,\"include_upper\":false,\"boost\":1.0}}},\"_source\":{\"includes\":[\"age\"],\"excludes\":[]},\"sort\":[{\"age\":{\"order\":\"asc\",\"missing\":\"_first\"}}]}, searchDone=false)"
        },
        "children": []
      }
    ]
  }
}

The Elasticsearch Query DSL and Aggregation also enabling the storage engine specific optimization.

The Project list will push down to Query DSL to filter the source:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_sql/_explain \
... -d '{"query" : "SELECT age FROM accounts"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[age]"
    },
    "children": [
      {
        "name": "ElasticsearchIndexScan",
        "description": {
          "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":0,\"size\":200,\"timeout\":\"1m\",\"_source\":{\"includes\":[\"age\"],\"excludes\":[]}}, searchDone=false)"
        },
        "children": []
      }
    ]
  }
}

The Filter operator will merge into Elasticsearch Query DSL:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_sql/_explain \
... -d '{"query" : "SELECT age FROM accounts WHERE age > 30"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[age]"
    },
    "children": [
      {
        "name": "ElasticsearchIndexScan",
        "description": {
          "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":0,\"size\":200,\"timeout\":\"1m\",\"query\":{\"range\":{\"age\":{\"from\":30,\"to\":null,\"include_lower\":false,\"include_upper\":true,\"boost\":1.0}}},\"_source\":{\"includes\":[\"age\"],\"excludes\":[]},\"sort\":[{\"_doc\":{\"order\":\"asc\"}}]}, searchDone=false)"
        },
        "children": []
      }
    ]
  }
}

The Sort operator will merge into Elasticsearch Query DSL:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_sql/_explain \
... -d '{"query" : "SELECT age FROM accounts ORDER BY age"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[age]"
    },
    "children": [
      {
        "name": "ElasticsearchIndexScan",
        "description": {
          "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":0,\"size\":200,\"timeout\":\"1m\",\"_source\":{\"includes\":[\"age\"],\"excludes\":[]},\"sort\":[{\"age\":{\"order\":\"asc\",\"missing\":\"_first\"}}]}, searchDone=false)"
        },
        "children": []
      }
    ]
  }
}

Because the Elasticsearch Script Based Sorting can't handle NULL/MISSING value, there is one exception is that if the sort list include expression other than field reference, it will not be merged into Query DSL:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_sql/_explain \
... -d '{"query" : "SELECT age FROM accounts ORDER BY abs(age)"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[age]"
    },
    "children": [
      {
        "name": "SortOperator",
        "description": {
          "sortList": {
            "abs(age)": {
              "sortOrder": "ASC",
              "nullOrder": "NULL_FIRST"
            }
          }
        },
        "children": [
          {
            "name": "ElasticsearchIndexScan",
            "description": {
              "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":0,\"size\":200,\"timeout\":\"1m\"}, searchDone=false)"
            },
            "children": []
          }
        ]
      }
    ]
  }
}

The Limit operator will merge in Elasticsearch Query DSL:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_sql/_explain \
... -d '{"query" : "SELECT age FROM accounts LIMIT 10 OFFSET 5"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[age]"
    },
    "children": [
      {
        "name": "ElasticsearchIndexScan",
        "description": {
          "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":5,\"size\":10,\"timeout\":\"1m\",\"_source\":{\"includes\":[\"age\"],\"excludes\":[]}}, searchDone=false)"
        },
        "children": []
      }
    ]
  }
}

If sort that includes expression, which cannot be merged into query DSL, also exists in the query, the Limit operator will not be merged into query DSL as well:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_sql/_explain \
... -d '{"query" : "SELECT age FROM accounts ORDER BY abs(age) LIMIT 10"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[age]"
    },
    "children": [
      {
        "name": "LimitOperator",
        "description": {
          "limit": 10,
          "offset": 0
        },
        "children": [
          {
            "name": "SortOperator",
            "description": {
              "sortList": {
                "abs(age)": {
                  "sortOrder": "ASC",
                  "nullOrder": "NULL_FIRST"
                }
              }
            },
            "children": [
              {
                "name": "ElasticsearchIndexScan",
                "description": {
                  "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":0,\"size\":200,\"timeout\":\"1m\"}, searchDone=false)"
                },
                "children": []
              }
            ]
          }
        ]
      }
    ]
  }
}

The Aggregation operator will merge into Elasticsearch Aggregation:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_sql/_explain \
... -d '{"query" : "SELECT gender, avg(age) FROM accounts GROUP BY gender"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[gender, avg(age)]"
    },
    "children": [
      {
        "name": "ElasticsearchIndexScan",
        "description": {
          "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":0,\"size\":0,\"timeout\":\"1m\",\"aggregations\":{\"composite_buckets\":{\"composite\":{\"size\":1000,\"sources\":[{\"gender\":{\"terms\":{\"field\":\"gender.keyword\",\"missing_bucket\":true,\"order\":\"asc\"}}}]},\"aggregations\":{\"avg(age)\":{\"avg\":{\"field\":\"age\"}}}}}}, searchDone=false)"
        },
        "children": []
      }
    ]
  }
}

The Sort operator will merge into Elasticsearch Aggregation.:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_sql/_explain \
... -d '{"query" : "SELECT gender, avg(age) FROM accounts GROUP BY gender ORDER BY gender DESC NULLS LAST"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[gender, avg(age)]"
    },
    "children": [
      {
        "name": "ElasticsearchIndexScan",
        "description": {
          "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":0,\"size\":0,\"timeout\":\"1m\",\"aggregations\":{\"composite_buckets\":{\"composite\":{\"size\":1000,\"sources\":[{\"gender\":{\"terms\":{\"field\":\"gender.keyword\",\"missing_bucket\":true,\"order\":\"desc\"}}}]},\"aggregations\":{\"avg(age)\":{\"avg\":{\"field\":\"age\"}}}}}}, searchDone=false)"
        },
        "children": []
      }
    ]
  }
}

Because the Elasticsearch Composite Aggregation order doesn't support separate NULL_FIRST/NULL_LAST option. only the default sort option (ASC NULL_FIRST/DESC NULL_LAST) will be supported for push down to Elasticsearch Aggregation, otherwise it will fall back to the default memory based operator:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_sql/_explain \
... -d '{"query" : "SELECT gender, avg(age) FROM accounts GROUP BY gender ORDER BY gender ASC NULLS LAST"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[gender, avg(age)]"
    },
    "children": [
      {
        "name": "SortOperator",
        "description": {
          "sortList": {
            "gender": {
              "sortOrder": "ASC",
              "nullOrder": "NULL_LAST"
            }
          }
        },
        "children": [
          {
            "name": "ElasticsearchIndexScan",
            "description": {
              "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":0,\"size\":0,\"timeout\":\"1m\",\"aggregations\":{\"composite_buckets\":{\"composite\":{\"size\":1000,\"sources\":[{\"gender\":{\"terms\":{\"field\":\"gender.keyword\",\"missing_bucket\":true,\"order\":\"asc\"}}}]},\"aggregations\":{\"avg(age)\":{\"avg\":{\"field\":\"age\"}}}}}}, searchDone=false)"
            },
            "children": []
          }
        ]
      }
    ]
  }
}

Because the Elasticsearch Composite Aggregation doesn't support order by metrics field, then if the sort list include fields which refer to metrics aggregation, then the sort operator can't be push down to Elasticsearch Aggregation:

sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST localhost:9200/_opendistro/_sql/_explain \
... -d '{"query" : "SELECT gender, avg(age) FROM accounts GROUP BY gender ORDER BY avg(age)"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[gender, avg(age)]"
    },
    "children": [
      {
        "name": "SortOperator",
        "description": {
          "sortList": {
            "avg(age)": {
              "sortOrder": "ASC",
              "nullOrder": "NULL_FIRST"
            }
          }
        },
        "children": [
          {
            "name": "ElasticsearchIndexScan",
            "description": {
              "request": "ElasticsearchQueryRequest(indexName=accounts, sourceBuilder={\"from\":0,\"size\":0,\"timeout\":\"1m\",\"aggregations\":{\"composite_buckets\":{\"composite\":{\"size\":1000,\"sources\":[{\"gender\":{\"terms\":{\"field\":\"gender.keyword\",\"missing_bucket\":true,\"order\":\"asc\"}}}]},\"aggregations\":{\"avg(age)\":{\"avg\":{\"field\":\"age\"}}}}}}, searchDone=false)"
            },
            "children": []
          }
        ]
      }
    ]
  }
}

The filter expressions in WHERE clause may be pushed down to Elasticsearch DSL queries to avoid large amounts of data retrieved. In this case, for Elasticsearch multi-field (a text field with another keyword field inside), assumption is made that the keyword field name is always "keyword" which is true by default.

At the moment there is no optimization to merge similar sort operators to avoid unnecessary sort. In this case, only one sort operator associated with window function will be pushed down to Elasticsearch DSL queries. Others will sort the intermediate results in memory and return to its window operator in the upstream. This cost can be avoided by optimization aforementioned though in-memory sorting operation can still happen. Therefore a custom circuit breaker is in use to monitor sort operator and protect memory usage.

Without sort push down optimization, the sort operator will sort the result from child operator. By default, only 200 docs will extracted from the source index, you can change this value by using size_limit setting.