RFC: Pluggable Execution Engine for Stream Processing #17501

Open
mch2 opened this issue Mar 4, 2025 · 1 comment
Labels: discuss, enhancement, RFC, Search:Performance

Comments

@mch2
Member

mch2 commented Mar 4, 2025

Is your feature request related to a problem? Please describe

Overview and Motivation

With the introduction of Arrow streams into OpenSearch, we have been experimenting with the idea of an embeddable execution engine. The idea is to use Lucene for initial doc retrieval and filtering, and to stream doc values out to a pluggable execution engine for analytics. The primary motivation is to bring modern analytics capabilities to OpenSearch along with performance improvements, particularly through vectorized processing of columnar data.

Describe the solution you'd like

I have been using the Rust-based Apache DataFusion as the engine of choice for experiments, but the idea is to make the engine pluggable. I chose DataFusion in particular because it was easy to get off the ground, it uses Arrow as its in-memory format, and it is performant, with efficient vectorized processing. It was also relatively simple to create vectors in the JVM and send them zero-copy to/from DataFusion. Further, its extensible architecture allows us to customize pretty much anything (language front end, planning, execution, etc.) and to use it as more than simply an execution engine.
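To make the "pluggable" part concrete, a minimal engine-facing interface might look something like the sketch below. This is purely illustrative: StreamExecutionEngine, registerStream, executeSql, and executeSubstrait are hypothetical names, not an existing OpenSearch or DataFusion API; only the general shape (hand the engine Arrow batches, get Arrow batches back) comes from this proposal.

```java
// Hypothetical sketch of a pluggable engine SPI -- names are illustrative only.
// Lucene handles retrieval/filtering; doc values are then streamed to whichever
// engine is plugged in (DataFusion in the POC) as Arrow record batches.
import java.util.Iterator;
import org.apache.arrow.vector.VectorSchemaRoot;

public interface StreamExecutionEngine extends AutoCloseable {

    /** Register a stream of Arrow record batches (e.g. one partition per shard). */
    void registerStream(String tableName, Iterator<VectorSchemaRoot> batches);

    /** Execute raw SQL over the registered streams, returning Arrow batches. */
    Iterator<VectorSchemaRoot> executeSql(String sql);

    /** Execute a serialized Substrait plan (e.g. built in Java with Calcite). */
    Iterator<VectorSchemaRoot> executeSubstrait(byte[] plan);
}
```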

Some additional benefits:

  • Reduced JVM Heap Pressure: With more work pushed off heap, pressure on the JVM heap is reduced. DataFusion can also cap its memory pool size and spill to disk when necessary.
  • External Data Integration: Built-in support for various file formats and remote locations enables unions between hot node data and external stores (e.g., Parquet files) with compatible schemas.
  • Flexible Query Support: While not the main motivation, DataFusion is a full query engine; we can send SQL or pre-built Substrait plans constructed in Java (e.g., with Calcite) to DF for execution. A small illustration of these last two points follows this list.
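As a small illustration of the last two bullets, the kind of query the engine could run might union per-shard streams from hot nodes with an external Parquet-backed table of the same schema. The table names and the query itself are made up for illustration; they are not part of the POC.

```java
// Illustrative only: hypothetical table names over the StreamExecutionEngine sketched
// above. "hot_logs" would be backed by per-shard Arrow streams, "warm_logs_parquet" by
// external Parquet files with a compatible schema.
String unionAgg =
      "SELECT status, count(*) AS cnt "
    + "FROM (SELECT status FROM hot_logs "
    + "      UNION ALL "
    + "      SELECT status FROM warm_logs_parquet) "
    + "GROUP BY status";
```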

Use Cases:
The three use cases I have been targeting for the POC are (branches at the bottom of this issue):

  1. Shard fan-out - collate results from individual data node streams at the coordinator.
  2. Execute joins across multiple per-index streams at the coordinator.
  3. Perform aggregations with improved efficiency - this is in line with this RFC for a memory-efficient agg approach.

From an implementation perspective, this is what the execution flow looks like for a terms aggregation in the POC:
[Image: execution flow for a terms aggregation in the POC]

At the coordinator:
The coordinator executes the query phase as normal (not pictured), where each data node returns a stream ticket.

  1. The coordinator then allocates a DataFusion SessionContext & Runtime over JNI.
  2. The coordinator executes the desired agg (steps 1 and 2 are pictured separately but are a single JNI call in reality). This registers a TableProvider with a partition for each shard and creates a logical plan (DataFrame).
  3. On registration, getFlightInfo is invoked, fetching stream metadata (schema) and endpoints.
  4. executeStream is then invoked on the DataFrame, which eventually invokes getStream on the data node’s Flight server to return record batches (see the sketch after this list).
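For orientation, step 4 ultimately reduces to an Arrow Flight getStream call against each data node's Flight server. In the POC that call is driven from DataFusion's Rust side over JNI; the stand-alone Java sketch below only illustrates the ticket-to-record-batch exchange, with the host, port, and ticket bytes as placeholders.

```java
// Minimal Arrow Flight consumer sketch (placeholder endpoint and ticket). In the POC,
// DataFusion (Rust) issues the equivalent getStream call against the data node's
// Flight server; this is shown in Java only to illustrate the exchange.
import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.FlightStream;
import org.apache.arrow.flight.Location;
import org.apache.arrow.flight.Ticket;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

public class StreamTicketConsumer {
    public static void main(String[] args) throws Exception {
        byte[] ticketBytes = args[0].getBytes();  // stream ticket returned during the query phase
        try (BufferAllocator allocator = new RootAllocator();
             FlightClient client = FlightClient.builder(
                     allocator, Location.forGrpcInsecure("data-node-1", 9400)).build();  // placeholders
             FlightStream stream = client.getStream(new Ticket(ticketBytes))) {
            while (stream.next()) {
                // next() loads one record batch into the stream's VectorSchemaRoot.
                System.out.println("rows in batch: " + stream.getRoot().getRowCount());
            }
        }
    }
}
```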

On the data node: This resembles today's approach, where aggregators return per-shard results, but with streaming to reduce memory requirements.

  1. The necessary DF context & runtime are similarly allocated.
  2. The collector writes doc values into vectors up to a certain batch size.
  3. The VectorSchemaRoot for the batch (VSR pictured) is exported to DF via the C Data interface; a sketch of this export follows the list. The reason for two VSRs here is that the input vectors have a different schema than the output in this case. DF then aggregates the batch.
  4. Once a batch has been processed, the aggregated results are written out to the result vector and streamed back to the coordinator.
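A rough sketch of the export in step 3, using Arrow Java's C Data interface: the org.apache.arrow.c classes are real Arrow APIs, while the aggregateBatch native method is a hypothetical stand-in for the POC's JNI call into DataFusion.

```java
// Zero-copy handoff of a batch to native code via the Arrow C Data interface.
// Only two pointers cross the JNI boundary; the vectors themselves are not copied.
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;

public final class BatchExporter {

    /** Hypothetical JNI entry point into the DataFusion side of the POC. */
    private static native void aggregateBatch(long arrayAddress, long schemaAddress);

    /** Exports one filled VectorSchemaRoot and hands it to the native engine. */
    public static void export(BufferAllocator allocator, VectorSchemaRoot batch) {
        try (ArrowArray array = ArrowArray.allocateNew(allocator);
             ArrowSchema schema = ArrowSchema.allocateNew(allocator)) {
            Data.exportVectorSchemaRoot(allocator, batch, /* dictionaryProvider = */ null, array, schema);
            aggregateBatch(array.memoryAddress(), schema.memoryAddress());
        }
    }
}
```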

I’ve benchmarked a couple of experiments for aggregations and have seen favorable results in both latency and memory used. Cluster using DF in green; 3 data nodes + 1 dedicated coordinator.

[Image: benchmark comparison, Big5 keyword-terms operation]

Benchmark - Big5 keyword-terms operation.

I will follow up here with a few more benchmarks shortly using the new red-line feature in OSB, but wanted to get this out there to see what people think.

Related:
Join Support: #15185
Streaming aggs: #16774
Pluggable Storage Engine Support: #17341 (comment)
POC branches:
  aggregations (term): https://github.com/mch2/OpenSearch/commits/df-streaming-aggs/
  joins: https://github.com/mch2/OpenSearch/commits/mch2-rishma-join

Related component

Search:Performance

Describe alternatives you've considered

Not do this and rely on pure Java implementations.

Additional context

No response

@mch2 mch2 added discuss, enhancement, RFC, and untriaged labels Mar 4, 2025
@mch2 mch2 changed the title RFC: Embeddable Query Engine for Stream Processing RFC: Pluggable Query Engine for Stream Processing Mar 4, 2025
@mch2 mch2 changed the title RFC: Pluggable Query Engine for Stream Processing RFC: Pluggable Execution Engine for Stream Processing Mar 4, 2025
@ViggoC
Contributor

ViggoC commented Mar 6, 2025

Hi @mch2, I'm so excited to see this proposal; it will elevate the analytics capabilities of OpenSearch to a new level.
But I'm a bit confused about some parts of this RFC. Could you explain more?

The Coordinator executes query phase as normal (not pictured), where each data node returns a stream ticket.

  1. Why do we need to do it in a query-then-fetch way?

executeStream is then invoked on the DataFrame, which eventually invokes getStream from the data node’s flight server to return record batches.

  1. IIUC, the distributed plan is executed in "pull mode": the data node will not start working until the coordinator requests data from it, right? And at what phase is the DF context initialized, the query phase or the fetch phase?

Benchmark - Big5 keyword-terms operation.

  1. How should we read the benchmark result? What do the X and Y axes mean?
