Migrate the code from the datafusion-federation repository.
`filter-repo` could be used instead of bringing all the commits from the
federation repository, but since the flight-sql-server directory has
been renamed a few times, it's hard to distinguish the commits, and we
don't want to lose the original commits.
hozan23 committed Feb 3, 2025
1 parent 55db1a6 commit 75c0286
Showing 26 changed files with 58 additions and 4,115 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/docs.yml
@@ -23,10 +23,10 @@ jobs:
       - uses: arduino/setup-protoc@v3
         with:
           repo-token: ${{ secrets.GITHUB_TOKEN }}
-      - run: cargo rustdoc -p datafusion-federation -- --cfg docsrs
+      - run: cargo rustdoc -p datafusion-flight-sql-server -- --cfg docsrs
       - run: chmod -c -R +rX "target/doc"
       - run: touch target/doc/index.html
-      - run: echo "<meta http-equiv=refresh content=0;url=datafusion_federation>" > target/doc/index.html
+      - run: echo "<meta http-equiv=refresh content=0;url=datafusion_flight_sql_server>" > target/doc/index.html
       - if: github.event_name == 'push' && github.ref == 'refs/heads/main'
         uses: actions/upload-pages-artifact@v3
         with:
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -70,4 +70,4 @@ jobs:
         with:
           repo-token: ${{ secrets.GITHUB_TOKEN }}
       - run: cargo build --all
-      - run: cargo package -p datafusion-federation --allow-dirty
+      - run: cargo package -p datafusion-flight-sql-server --allow-dirty
6 changes: 2 additions & 4 deletions Cargo.toml
@@ -2,7 +2,6 @@
 resolver = "2"

 members = [
-    "datafusion-federation",
     "datafusion-flight-sql-server",
     "datafusion-flight-sql-table-provider",
 ]
@@ -12,16 +11,15 @@ version = "0.3.5"
edition = "2021"
license = "Apache-2.0"
readme = "README.md"
-repository = "https://github.com/datafusion-contrib/datafusion-federation"
+repository = "https://github.com/datafusion-contrib/datafusion-flight-sql-server"

[workspace.dependencies]
arrow = "53.3"
arrow-flight = { version = "53.3", features = ["flight-sql-experimental"] }
arrow-json = "53.3"
async-stream = "0.3.5"
async-trait = "0.1.83"
datafusion = "44.0.0"
-datafusion-federation = { path = "./datafusion-federation", version = "0.3.5" }
+datafusion-federation = { version = "0.3.5" }
datafusion-substrait = "44.0.0"
futures = "0.3.31"
tokio = { version = "1.41", features = ["full"] }
190 changes: 52 additions & 138 deletions README.md
@@ -1,138 +1,52 @@
# DataFusion Federation

[![crates.io](https://img.shields.io/crates/v/datafusion-federation.svg)](https://crates.io/crates/datafusion-federation)
[![docs.rs](https://docs.rs/datafusion-federation/badge.svg)](https://docs.rs/datafusion-federation)

DataFusion Federation allows
[DataFusion](https://github.com/apache/arrow-datafusion) to execute (part of) a
query plan by a remote execution engine.

                                       ┌────────────────┐
                   ┌────────────┐      │ Remote DBMS(s) │
    SQL Query ───> │ DataFusion │ ───> │  ( execution   │
                   └────────────┘      │ happens here ) │
                                       └────────────────┘

The goal is to allow resolving queries across remote query engines while
pushing down as much compute as possible to the remote database(s). This allows
execution to happen as close to the storage as possible. This concept is
referred to as 'query federation'.

> [!TIP]
> This repository implements the federation framework itself. If you want to
> connect to a specific database, check out the compatible providers available
> in
> [datafusion-contrib/datafusion-table-providers](https://github.com/datafusion-contrib/datafusion-table-providers/).
## Usage

Check out the [examples](./datafusion-federation/examples/) to get a feel for
how it works.

For a complete step-by-step example of how federation works, you can check the
example [here](./datafusion-federation/examples/df-csv-advanced.rs).

## Potential use-cases:

- Querying across SQLite, MySQL, PostgreSQL, ...
- Pushing down SQL or [Substrait](https://substrait.io/) plans.
- DataFusion -> Flight SQL -> DataFusion
- ..

## Design concept

Say you have a query plan as follows:

            ┌────────────┐
            │    Join    │
            └────────────┘
          ┌───────┴────────┐
    ┌────────────┐  ┌────────────┐
    │   Scan A   │  │    Join    │
    └────────────┘  └────────────┘
                  ┌───────┴────────┐
            ┌────────────┐  ┌────────────┐
            │   Scan B   │  │   Scan C   │
            └────────────┘  └────────────┘

DataFusion Federation will identify the largest possible sub-plans that
can be executed by an external database:

                     ┌────────────┐     Optimizer recognizes
                     │    Join    │     that B and C are
                     └────────────┘     available in an
                           ▲            external database
            ┌──────────────┴────────┐
            │       ┌ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┐
    ┌────────────┐            ┌────────────┐        │
    │   Scan A   │  │         │    Join    │
    └────────────┘            └────────────┘        │
                    │               ▲
                            ┌───────┴────────┐      │
                     ┌────────────┐  ┌────────────┐ │
                    ││   Scan B   │  │   Scan C   │
                     └────────────┘  └────────────┘ │
                      ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘

The sub-plans are cut out and replaced by an opaque federation node in the plan:

             ┌────────────┐
             │    Join    │
             └────────────┘         Rewritten Plan
          ┌────────┴───────────┐
          │                    │
    ┌────────────┐  ┏━━━━━━━━━━━━━━━━━━┓
    │   Scan A   │  ┃     Scan B+C     ┃
    └────────────┘  ┃  (TableProvider  ┃
                    ┃ that can execute ┃
                    ┃  sub-plan in an  ┃
                    ┃external database)┃
                    ┗━━━━━━━━━━━━━━━━━━┛

Different databases may have different query languages and execution
capabilities. To accommodate this, each 'federation provider' can
self-determine what part of a sub-plan it will actually federate. This is done
by letting each federation provider define its own optimizer rule. When a
sub-plan is 'cut out' of the overall plan, it is first passed to the federation
provider's optimizer rule. This optimizer rule determines the part of the plan
that is cut out, based on the execution capabilities of the database it
represents.

## Implementation

A remote database is represented by the `FederationProvider` trait. To identify
table scans that are available in the same database, they implement the
`FederatedTableSource` trait. This trait allows lookup of the corresponding
`FederationProvider`.

Identifying sub-plans to federate is done by the `FederationOptimizerRule`.
This rule needs to be registered in your DataFusion SessionState. One easy way
to do this is using `default_session_state`. To do its job, the
`FederationOptimizerRule` currently requires that all TableProviders that need
to be federated are `FederatedTableProviderAdaptor`s. The
`FederatedTableProviderAdaptor` also has a fallback mechanism that allows
implementations to fall back to a 'vanilla' TableProvider in case the
`FederationOptimizerRule` isn't registered.

The `FederationProvider` can provide a `compute_context`. This allows it to
differentiate between multiple remote execution contexts of the same type, for
example two different MySQL instances, database schemas, access levels, etc. The
`FederationProvider` also returns the `Optimizer` that allows it to
self-determine what part of a sub-plan it can federate.

The `sql` module implements a generic `FederationProvider` for SQL execution
engines. A specific SQL engine implements the `SQLExecutor` trait for its
engine specific execution. There are a number of compatible providers available
in
[datafusion-contrib/datafusion-table-providers](https://github.com/datafusion-contrib/datafusion-table-providers/).

## Status

The project is in alpha status. Contributions welcome; land a PR = commit
access.

- [Docs (release)](https://docs.rs/datafusion-federation)
- [Docs (main)](https://datafusion-contrib.github.io/datafusion-federation/)
# DataFusion Flight SQL Server

The `datafusion-flight-sql-server` is a Flight SQL server that implements the
necessary endpoints to use DataFusion as the query engine.

## Getting Started

To use `datafusion-flight-sql-server` in your Rust project, run:

```sh
$ cargo add datafusion-flight-sql-server
```

## Example

Here's a basic example of setting up a Flight SQL server:

```rust
use datafusion::execution::{context::SessionContext, options::CsvReadOptions};
use datafusion_flight_sql_server::service::FlightSqlService;

#[tokio::main]
async fn main() {
    // Address the Flight SQL server binds to.
    let dsn: String = "0.0.0.0:50051".to_string();

    // Register a CSV file as the table `test` in the DataFusion context.
    let remote_ctx = SessionContext::new();
    remote_ctx
        .register_csv("test", "./examples/test.csv", CsvReadOptions::new())
        .await
        .expect("Register csv");

    // Serve the context's state over Flight SQL at `dsn`.
    FlightSqlService::new(remote_ctx.state())
        .serve(dsn.clone())
        .await
        .expect("Run flight sql service");
}
```

This example sets up a Flight SQL server listening on `0.0.0.0:50051`, that is,
on port 50051 of every network interface.
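
The bind address passed to `serve` decides which network interfaces accept connections. The following is a minimal, std-only sketch of that distinction; `describe_bind_addr` is a hypothetical helper written for illustration and is not part of `datafusion-flight-sql-server`.

```rust
use std::net::{AddrParseError, SocketAddr};

/// Hypothetical helper (not part of datafusion-flight-sql-server): reports
/// which interfaces a bind address such as the example's `dsn` would cover.
fn describe_bind_addr(dsn: &str) -> Result<&'static str, AddrParseError> {
    // Parse "ip:port" into a socket address; malformed input is an error.
    let addr: SocketAddr = dsn.parse()?;
    Ok(if addr.ip().is_unspecified() {
        // 0.0.0.0 (or ::) means "listen on every interface".
        "all interfaces"
    } else if addr.ip().is_loopback() {
        // 127.0.0.1 (or ::1) is reachable only from the local machine.
        "loopback only"
    } else {
        "specific interface"
    })
}

fn main() {
    for dsn in ["0.0.0.0:50051", "127.0.0.1:50051"] {
        match describe_bind_addr(dsn) {
            Ok(scope) => println!("{dsn} -> {scope}"),
            Err(e) => println!("{dsn} -> invalid address: {e}"),
        }
    }
}
```

Binding to a loopback address is usually the safer choice for local experiments, since the server is then not reachable from other machines.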


## Acknowledgments

This crate was originally developed as part of the
[datafusion-federation](https://github.com/datafusion-contrib/datafusion-federation/)
repository.

For more details about the original repository, please visit
[datafusion-federation](https://github.com/datafusion-contrib/datafusion-federation/).

32 changes: 0 additions & 32 deletions datafusion-federation/CHANGELOG.md

This file was deleted.

43 changes: 0 additions & 43 deletions datafusion-federation/Cargo.toml

This file was deleted.

4 changes: 0 additions & 4 deletions datafusion-federation/examples/data/test.csv

This file was deleted.

7 changes: 0 additions & 7 deletions datafusion-federation/examples/data/test2.csv

This file was deleted.
