Move configuration information out of example usage page #11300
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,8 +19,25 @@ | |
|
||
# Introduction | ||
|
||
The library user guide explains how to use the DataFusion library as a dependency in your Rust project. Please check out the user-guide for more details on how to use DataFusion's SQL and DataFrame APIs, or the contributor guide for details on how to contribute to DataFusion. | ||
The library user guide explains how to use the DataFusion library as a | ||
dependency in your Rust project and customize its behavior using its extension APIs. | ||
|
||
If you haven't reviewed the [architecture section in the docs][docs], it's a useful place to get the lay of the land before starting down a specific path. | ||
Please check out the [user guide] for getting started using | ||
DataFusion's SQL and DataFrame APIs, or the [contributor guide] | ||
for details on how to contribute to DataFusion. | ||
|
||
If you haven't reviewed the [architecture section in the docs][docs], it's a | ||
useful place to get the lay of the land before starting down a specific path. | ||
|
||
DataFusion is designed to be extensible at all points, including: | ||
This was random content that was on the "getting started" page that I moved into the library user guide as it seemed a better home |
||
|
||
- [x] User Defined Functions (UDFs) | ||
- [x] User Defined Aggregate Functions (UDAFs) | ||
- [x] User Defined Table Source (`TableProvider`) for tables | ||
- [x] User Defined `Optimizer` passes (plan rewrites) | ||
- [x] User Defined `LogicalPlan` nodes | ||
- [x] User Defined `ExecutionPlan` nodes | ||
|
||
[user guide]: ../user-guide/example-usage.md | ||
[contributor guide]: ../contributor-guide/index.md | ||
[docs]: https://docs.rs/datafusion/latest/datafusion/#architecture |
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,146 @@ | ||||||
<!--- | ||||||
Licensed to the Apache Software Foundation (ASF) under one | ||||||
or more contributor license agreements. See the NOTICE file | ||||||
distributed with this work for additional information | ||||||
regarding copyright ownership. The ASF licenses this file | ||||||
to you under the Apache License, Version 2.0 (the | ||||||
"License"); you may not use this file except in compliance | ||||||
with the License. You may obtain a copy of the License at | ||||||
|
||||||
http://www.apache.org/licenses/LICENSE-2.0 | ||||||
|
||||||
Unless required by applicable law or agreed to in writing, | ||||||
software distributed under the License is distributed on an | ||||||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||||||
KIND, either express or implied. See the License for the | ||||||
specific language governing permissions and limitations | ||||||
under the License. | ||||||
--> | ||||||
|
||||||
# Crate Configuration | ||||||
|
||||||
This section contains information on how to configure DataFusion in your Rust | ||||||
project. See the [Configuration Settings] section for a list of options that | ||||||
control DataFusion's behavior. | ||||||
|
||||||
[configuration settings]: configs.md | ||||||
|
||||||
## Add latest non published DataFusion dependency | ||||||
|
||||||
DataFusion changes are published to `crates.io` according to the [release schedule](https://github.com/apache/datafusion/blob/main/dev/release/README.md#release-process). | ||||||
|
||||||
If you would like to test DataFusion changes that are merged but not yet | ||||||
published, Cargo supports declaring a dependency directly on a GitHub branch: | ||||||
|
||||||
```toml | ||||||
datafusion = { git = "https://github.com/apache/datafusion", branch = "main"} | ||||||
``` | ||||||
|
||||||
This also works at the package level: | ||||||
|
||||||
```toml | ||||||
datafusion-common = { git = "https://github.com/apache/datafusion", branch = "main", package = "datafusion-common"} | ||||||
``` | ||||||
|
||||||
And with features: | ||||||
|
||||||
```toml | ||||||
datafusion = { git = "https://github.com/apache/datafusion", branch = "main", default-features = false, features = ["unicode_expressions"] } | ||||||
``` | ||||||
|
||||||
See the Cargo reference for more on [Cargo dependencies](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies). | ||||||
|
||||||
## Optimized Configuration | ||||||
|
||||||
An optimized build requires several steps. First, add the following to your `Cargo.toml`. Note that | ||||||
the settings in the `[profile.release]` section will significantly increase the build time. | ||||||
|
||||||
```toml | ||||||
[dependencies] | ||||||
datafusion = { version = "22.0" } | ||||||
tokio = { version = "^1.0", features = ["rt-multi-thread"] } | ||||||
snmalloc-rs = "0.3" | ||||||
|
||||||
[profile.release] | ||||||
lto = true | ||||||
codegen-units = 1 | ||||||
``` | ||||||
|
||||||
Then, in `main.rs`, set the global memory allocator by adding the following after your imports: | ||||||
|
||||||
```rust ,ignore | ||||||
use datafusion::prelude::*; | ||||||
|
||||||
#[global_allocator] | ||||||
static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc; | ||||||
|
||||||
#[tokio::main] | ||||||
async fn main() -> datafusion::error::Result<()> { | ||||||
    Ok(()) | ||||||
} | ||||||
``` | ||||||
|
||||||
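To see what `#[global_allocator]` does without pulling in `snmalloc-rs`, the following is a minimal std-only sketch. It is not part of DataFusion: `CountingAlloc`, `ALLOCATED`, and `bytes_allocated_by` are hypothetical names, and the wrapper forwards to the standard `System` allocator instead of snmalloc. It only illustrates that every heap allocation is routed through the static marked with the attribute.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Total bytes requested so far (hypothetical counter, for illustration only).
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper that counts bytes and forwards to the system allocator.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Same attribute as in the snmalloc example: heap allocations in this
// program now go through `CountingAlloc`.
#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

/// Returns the number of bytes requested from the allocator while `f` runs.
fn bytes_allocated_by(f: impl FnOnce()) -> usize {
    let before = ALLOCATED.load(Ordering::Relaxed);
    f();
    ALLOCATED.load(Ordering::Relaxed) - before
}

fn main() {
    let bytes = bytes_allocated_by(|| {
        let buffer: Vec<u8> = Vec::with_capacity(1024);
        // Keep the allocation from being optimized away.
        std::hint::black_box(&buffer);
    });
    println!("requested {bytes} bytes through the global allocator");
}
```

Swapping the `static ALLOC` line for the `snmalloc_rs::SnMalloc` one shown earlier changes which allocator serves these requests; the rest of the program is unaffected.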
Depending on the instruction set architecture you are building for, you will also want to configure the `target-cpu`, ideally | ||||||
with `native` or at least `avx2`. | ||||||
|
||||||
```shell | ||||||
RUSTFLAGS='-C target-cpu=native' cargo run --release | ||||||
``` | ||||||
|
||||||
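One way to check whether the `RUSTFLAGS` setting took effect is to compare what the binary was compiled with against what the CPU supports at runtime. The sketch below uses only the standard library; the function names `compiled_with_avx2` and `cpu_supports_avx2` are hypothetical helpers, not DataFusion APIs.

```rust
/// True when this binary was *compiled* with AVX2 enabled, for example via
/// `-C target-cpu=native` on a machine that has AVX2.
fn compiled_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}

/// True when the CPU running this binary reports AVX2 support.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_supports_avx2() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// AVX2 is an x86 feature, so report `false` on other architectures.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_supports_avx2() -> bool {
    false
}

fn main() {
    println!("compiled with avx2: {}", compiled_with_avx2());
    println!("cpu supports avx2:  {}", cpu_supports_avx2());
}
```

If the CPU supports AVX2 but the binary was not compiled with it, the `target-cpu` flag was most likely not applied to the build.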
## Enable backtraces | ||||||
|
||||||
By default, DataFusion returns errors as a plain message. There is an option to enable more verbose details about the error, | ||||||
|
||||||
such as an error backtrace. To enable a backtrace, add the DataFusion `backtrace` feature to your `Cargo.toml` file: | ||||||
Done in #11439 |
||||||
|
||||||
```toml | ||||||
datafusion = { version = "31.0.0", features = ["backtrace"]} | ||||||
``` | ||||||
|
||||||
Then set the backtrace [environment variables](https://doc.rust-lang.org/std/backtrace/index.html#environment-variables): | ||||||
|
||||||
```bash | ||||||
RUST_BACKTRACE=1 ./target/debug/datafusion-cli | ||||||
DataFusion CLI v31.0.0 | ||||||
> select row_numer() over (partition by a order by a) from (select 1 a); | ||||||
Error during planning: Invalid function 'row_numer'. | ||||||
Did you mean 'ROW_NUMBER'? | ||||||
|
||||||
backtrace: 0: std::backtrace_rs::backtrace::libunwind::trace | ||||||
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5 | ||||||
1: std::backtrace_rs::backtrace::trace_unsynchronized | ||||||
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5 | ||||||
2: std::backtrace::Backtrace::create | ||||||
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:332:13 | ||||||
3: std::backtrace::Backtrace::capture | ||||||
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:298:9 | ||||||
4: datafusion_common::error::DataFusionError::get_back_trace | ||||||
at /datafusion/datafusion/common/src/error.rs:436:30 | ||||||
5: datafusion_sql::expr::function::<impl datafusion_sql::planner::SqlToRel<S>>::sql_function_to_expr | ||||||
............ | ||||||
``` | ||||||
|
||||||
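The frames in the trace above pass through `std::backtrace::Backtrace::capture`, which only records a backtrace when `RUST_BACKTRACE` (or `RUST_LIB_BACKTRACE`) is set. The following std-only sketch shows that mechanism in isolation. `PlanError` and `render` are hypothetical names used for illustration; this is not DataFusion's actual error type or implementation.

```rust
use std::backtrace::{Backtrace, BacktraceStatus};

/// Hypothetical error type that records where it was created.
#[derive(Debug)]
struct PlanError {
    message: String,
    backtrace: Backtrace,
}

impl PlanError {
    fn new(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
            // Cheap when backtraces are disabled; only captures frames when
            // the environment variables ask for it.
            backtrace: Backtrace::capture(),
        }
    }

    /// Plain message by default, message plus backtrace when one was captured.
    fn render(&self) -> String {
        match self.backtrace.status() {
            BacktraceStatus::Captured => {
                format!("{}\n\nbacktrace: {}", self.message, self.backtrace)
            }
            _ => self.message.clone(),
        }
    }
}

fn main() {
    let err = PlanError::new("Invalid function 'row_numer'.");
    println!("{}", err.render());
}
```

Running this sketch with and without `RUST_BACKTRACE=1` shows the same switch between a plain message and a verbose one that the `backtrace` feature provides.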
Backtraces are useful when debugging code. For example, suppose there is a test in `datafusion/core/src/physical_planner.rs`: | ||||||
|
||||||
```rust | ||||||
#[tokio::test] | ||||||
async fn test_get_backtrace_for_failed_code() -> Result<()> { | ||||||
    let ctx = SessionContext::new(); | ||||||
|
||||||
    let sql = " | ||||||
    select row_numer() over (partition by a order by a) from (select 1 a); | ||||||
    "; | ||||||
|
||||||
    let _ = ctx.sql(sql).await?.collect().await?; | ||||||
|
||||||
    Ok(()) | ||||||
} | ||||||
``` | ||||||
|
||||||
To obtain a backtrace: | ||||||
|
||||||
```bash | ||||||
cargo build --features=backtrace | ||||||
RUST_BACKTRACE=1 cargo test --features=backtrace --package datafusion --lib -- physical_planner::tests::test_get_backtrace_for_failed_code --exact --nocapture | ||||||
``` | ||||||
|
||||||
Note: The backtrace is wrapped in system calls, so the first few frames at the top of the backtrace can be ignored. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,29 +33,6 @@ datafusion = "latest_version" | |
tokio = { version = "1.0", features = ["rt-multi-thread"] } | ||
``` | ||
|
||
## Add latest non published DataFusion dependency | ||
I moved all this content to other more appropriate locations |
||
|
||
DataFusion changes are published to `crates.io` according to [release schedule](https://github.com/apache/datafusion/blob/main/dev/release/README.md#release-process) | ||
In case if it is required to test out DataFusion changes which are merged but yet to be published, Cargo supports adding dependency directly to GitHub branch | ||
|
||
```toml | ||
datafusion = { git = "https://github.com/apache/datafusion", branch = "main"} | ||
``` | ||
|
||
Also it works on the package level | ||
|
||
```toml | ||
datafusion-common = { git = "https://github.com/apache/datafusion", branch = "main", package = "datafusion-common"} | ||
``` | ||
|
||
And with features | ||
|
||
```toml | ||
datafusion = { git = "https://github.com/apache/datafusion", branch = "main", default-features = false, features = ["unicode_expressions"] } | ||
``` | ||
|
||
More on [Cargo dependencies](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies) | ||
|
||
## Run a SQL query against data stored in a CSV | ||
|
||
```rust | ||
|
@@ -201,109 +178,3 @@ async fn main() -> datafusion::error::Result<()> { | |
| 1 | 2 | | ||
+---+--------+ | ||
``` | ||
|
||
## Extensibility | ||
|
||
DataFusion is designed to be extensible at all points. To that end, you can provide your own custom: | ||
|
||
- [x] User Defined Functions (UDFs) | ||
- [x] User Defined Aggregate Functions (UDAFs) | ||
- [x] User Defined Table Source (`TableProvider`) for tables | ||
- [x] User Defined `Optimizer` passes (plan rewrites) | ||
- [x] User Defined `LogicalPlan` nodes | ||
- [x] User Defined `ExecutionPlan` nodes | ||
|
||
## Optimized Configuration | ||
|
||
For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is | ||
worth noting that using the settings in the `[profile.release]` section will significantly increase the build time. | ||
|
||
```toml | ||
[dependencies] | ||
datafusion = { version = "22.0" } | ||
tokio = { version = "^1.0", features = ["rt-multi-thread"] } | ||
snmalloc-rs = "0.3" | ||
|
||
[profile.release] | ||
lto = true | ||
codegen-units = 1 | ||
``` | ||
|
||
Then, in `main.rs.` update the memory allocator with the below after your imports: | ||
|
||
```rust ,ignore | ||
use datafusion::prelude::*; | ||
|
||
#[global_allocator] | ||
static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc; | ||
|
||
#[tokio::main] | ||
async fn main() -> datafusion::error::Result<()> { | ||
Ok(()) | ||
} | ||
``` | ||
|
||
Based on the instruction set architecture you are building on you will want to configure the `target-cpu` as well, ideally | ||
with `native` or at least `avx2`. | ||
|
||
```shell | ||
RUSTFLAGS='-C target-cpu=native' cargo run --release | ||
``` | ||
|
||
## Enable backtraces | ||
|
||
By default Datafusion returns errors as a plain message. There is option to enable more verbose details about the error, | ||
like error backtrace. To enable a backtrace you need to add Datafusion `backtrace` feature to your `Cargo.toml` file: | ||
|
||
```toml | ||
datafusion = { version = "31.0.0", features = ["backtrace"]} | ||
``` | ||
|
||
Set environment [variables](https://doc.rust-lang.org/std/backtrace/index.html#environment-variables) | ||
|
||
```bash | ||
RUST_BACKTRACE=1 ./target/debug/datafusion-cli | ||
DataFusion CLI v31.0.0 | ||
> select row_numer() over (partition by a order by a) from (select 1 a); | ||
Error during planning: Invalid function 'row_numer'. | ||
Did you mean 'ROW_NUMBER'? | ||
|
||
backtrace: 0: std::backtrace_rs::backtrace::libunwind::trace | ||
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5 | ||
1: std::backtrace_rs::backtrace::trace_unsynchronized | ||
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5 | ||
2: std::backtrace::Backtrace::create | ||
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:332:13 | ||
3: std::backtrace::Backtrace::capture | ||
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:298:9 | ||
4: datafusion_common::error::DataFusionError::get_back_trace | ||
at /datafusion/datafusion/common/src/error.rs:436:30 | ||
5: datafusion_sql::expr::function::<impl datafusion_sql::planner::SqlToRel<S>>::sql_function_to_expr | ||
............ | ||
``` | ||
|
||
The backtraces are useful when debugging code. If there is a test in `datafusion/core/src/physical_planner.rs` | ||
|
||
``` | ||
#[tokio::test] | ||
async fn test_get_backtrace_for_failed_code() -> Result<()> { | ||
let ctx = SessionContext::new(); | ||
|
||
let sql = " | ||
select row_numer() over (partition by a order by a) from (select 1 a); | ||
"; | ||
|
||
let _ = ctx.sql(sql).await?.collect().await?; | ||
|
||
Ok(()) | ||
} | ||
``` | ||
|
||
To obtain a backtrace: | ||
|
||
```bash | ||
cargo build --features=backtrace | ||
RUST_BACKTRACE=1 cargo test --features=backtrace --package datafusion --lib -- physical_planner::tests::test_get_backtrace_for_failed_code --exact --nocapture | ||
``` | ||
|
||
Note: The backtrace wrapped into systems calls, so some steps on top of the backtrace can be ignored |
I also added a link to the library user guide as well as making this into a bullet list so it is easier to see what options are available