-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce unified DataSourceExec
for provided datasources, remove ParquetExec
, CsvExec
, etc
#14224
Conversation
…ne DataSourceExec plan
fix csv_json example
# Conflicts: # datafusion/physical-plan/src/aggregates/mod.rs
# Conflicts: # datafusion/core/src/physical_optimizer/replace_with_order_preserving_variants.rs # datafusion/physical-plan/src/memory.rs
# Conflicts: # datafusion/core/src/datasource/file_format/arrow.rs # datafusion/core/src/datasource/file_format/csv.rs # datafusion/core/src/datasource/file_format/json.rs # datafusion/core/src/datasource/file_format/parquet.rs # datafusion/core/src/datasource/memory.rs # datafusion/core/src/test/mod.rs # datafusion/physical-plan/src/memory.rs
Great - custom source creators will just implement some I think we addressed all the feedback and made the PR even better. Migration path is simple, and two major downstream projects took a pass on it. Let's wait for a few hours in case anything else comes in, and then proceed with the merge. @alamb, can you take a final look to verify all is good? Thanks. |
Doing now -- I am figuring out the pattern for migration where we have code that looks for |
I found with this small change I could rewrite our optimizer passes fairly easily (it also seems to not change this particular PR). We can fix as a follow on PR diff --git a/datafusion/physical-plan/src/source.rs b/datafusion/physical-plan/src/source.rs
index 0b7e6dd03..6f57870ce 100644
--- a/datafusion/physical-plan/src/source.rs
+++ b/datafusion/physical-plan/src/source.rs
@@ -176,8 +176,8 @@ impl DataSourceExec {
/// Return the source object
#[allow(unused)]
- pub fn source(&self) -> Arc<dyn DataSource> {
- Arc::clone(&self.source)
+ pub fn source(&self) -> &Arc<dyn DataSource> {
+ &self.source
} RationaleWe have a bunch of code like this in influxdb_iox for rewriting ParquetExec in various ways: To make this easier I am trying to replicate the pattern of (However, the rust lifetime rules prevented this from compiling without the change above /// A view of a [`DataSourceExec`] that scans parquet files
pub struct ParquetExecWrapper<'a> {
base_config: &'a FileScanConfig,
parquet_source: &'a ParquetSource,
}
impl<'a> ParquetExecWrapper<'a> {
/// Create a new wrapper from a [`ExecutionPlan`], returning `None` if the
/// plan is not a [`DataSourceExec`] that scans Parquet files
pub fn try_new(plan: &'a dyn ExecutionPlan) -> Option<Self> {
let plan = plan.as_ref();
let data_source_exec = plan.as_any().downcast_ref::<DataSourceExec>()?;
let base_config = data_source_exec
.source()
.as_any()
.downcast_ref::<FileScanConfig>()?;
let parquet_source = base_config
.file_source()
.as_any()
.downcast_ref::<ParquetSource>()?;
Some(Self {
base_config,
parquet_source,
})
}
/// Return the underlying [`FileScanConfig`]
pub fn base_config(&self) -> &'a FileScanConfig {
self.base_config
}
/// Return the underlying [`ParquetSource`]
pub fn parquet_source(&self) -> &'a ParquetSource {
self.parquet_source
}
} Otherwise things are going well -- I am making good progress |
Great! The change you suggested makes sense. Since it is a very small change, we'll shortly add it to this PR and so migration smoother to others as well. |
Something else I noticed is the import paths for ParquetExec are gone (so I got a lot of compile errors about missing ParquetExec):
Suggested action is reexport the file here --- a/datafusion/core/src/datasource/physical_plan/mod.rs
+++ b/datafusion/core/src/datasource/physical_plan/mod.rs
@@ -33,7 +33,7 @@ pub(crate) use self::json::plan_to_json;
#[cfg(feature = "parquet")]
pub use self::parquet::source::ParquetSource;
#[cfg(feature = "parquet")]
-pub use self::parquet::{ParquetFileMetrics, ParquetFileReaderFactory};
+pub use self::parquet::{ParquetFileMetrics, ParquetFileReaderFactory, ParquetExec}; |
To be clear, I think we can do these kinds of changes in a follow on PR too -- I haven't found any blockers to this PR being merged yet but I don't have our tests running yet to be sure. But all in all things are going well from my perspective |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got enough of our code to compile and tests running that i think this PR is ok to merge. Thank you @mertak-synnada @ozankabak and @berkaysynnada -- this is pretty epic
I believe this will be a non trivial change for any system that directly creates and manipulates ParquetExec
but I think the new structures seem to support everything we need.
It would be really great to write up a document (maybe a blog post) that explains this change and gives help for people upgrading. Specifically some examples of creating ParquetExec before and after (and something similar to CsvExec
and AvroExec
.
As I work through the rest of our upgrade, I may have additional ideas to make it easier.
I can't realistically review this PR in detail, bug I trust that other committers have done so 👍
Great, let's go ahead 🚀
This is a terrific idea -- we will be happy to write up a blog post that will serve a dual purpose: (1) help people with upgrading, and (2) brag about this epic PR 🙂
Great, we will be happy to collaborate on any follow-on PRs. |
I just merged the PR -- I will check back in half an hour to see if there are any problems |
Yes 100% that would be perfect. I think such a post would be find to put on the datafusion blog but content marketing wise if you write it putting it on a synnada blog would make total sense too |
Here is one small follow up from this PR: |
As I have been working through actually upgrading delta.rs with these changes it turns out it was more effort than expected (like everything in software) I have created a few PRs that I think would be nice to get in before we release this change downstream |
Also I started writing down an upgrade guide (not yet ready for review) |
Which issue does this PR close?
ExecutionPlan
Across AllTableProviders
#13838.MemoryExec
#14502fetch
limit for MemoryExec #14337Rationale for this change
This PR merges all Data sources into one Execution Plan, named DataSourceExec, and a single trait named DataSource which is inspired by DataSink. File-related shared behaviors are merged into FileSourceConfig, and a MemorySourceConfig is created. MemoryExec, ArrowExec, AvroExec, CsvExec, NdJsonExec, and ParquetExec are deprecated and merged into DataSourceExec.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?