[Epic] Split datasources out from datafusion crate (datafusion/core) #14444
Comments
FYI @logan-keede this is the idea I was mentioning to you regarding more refactoring fun
take
Whoa, this is awesome! I'm still ramping up on learning DataFusion internals as I add my own extensions, and one thing that's been nagging at me is that it almost feels like the built-in data source providers are "cheating" because they get to live in core. Moving them all to separate crates that have to use the same interfaces as external datasources is something I've been contemplating suggesting for a while, so it's great to see this already happening. I'm definitely going to be keeping up and helping with this work.
@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.
@logan-keede Thanks for the tip!
(And to be clear, it was @waynexia who made that chart for #14256; I just copy-pasted it here.)
Awesome -- indeed this is the case (and a similar thing used to happen with functions before we pulled them out). BTW, there is a major refactor for datasources done by @mertak-synnada, @ozankabak, and others that (just) merged. Among other things, it makes it easier to reuse common datasource features like pushdown, limit, etc. Now that that is in, we will be able to push this project along like a 🚀
The update here is that @logan-keede is cranking right along. After some discussion with @jayzhan211, I think we have a good plan going forward. I feel like we are on the cusp of finally getting these things split from the core 🙏
There has been an update to the plan, specifically the addition of
Originally posted by @alamb and suggested by @jayzhan211 in #14616 (comment). Please refer to the above-mentioned issue for more context.
Is your feature request related to a problem or challenge?
Historically DataFusion was one (very) large crate, datafusion, and as it grew bigger we extracted various functionality into separate crates. This leads to both faster compile times (as the crates can be compiled in parallel) and easier-to-navigate code (as the crates force a cleaner dependency separation).
As described by @waynexia, the build time of DataFusion has been growing.
Some of this is due to the fact that there is more code / more features to test. However, a non-trivial part of the long compile time is the time taken to compile the datafusion / core crate in https://github.com/apache/datafusion/tree/main/datafusion/core
While we are pursuing additional ways to reduce compile time, I think we should also move more code out of datafusion/core into its own crates.
We have successfully done this in the past with other projects such as
Describe the solution you'd like
I would like to split out the https://github.com/apache/datafusion/tree/main/datafusion/core/src/datasource from DataFusion core
Describe alternatives you've considered
I think we will end up with several new crates:
- datafusion-catalog-listing: ListingTable and associated types like PartitionedFile
- datafusion-datasource-parquet: ParquetExec and file format
- datafusion-datasource-avro: AvroExec and file formats
- datafusion-datasource-arrow
- datafusion-datasource-json
- datafusion-datasource-csv
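As a rough illustration of the layering this split aims for, here is a minimal sketch in which each `mod` stands in for a separate crate: the format modules depend only on a small shared interface, and the "core" stand-in sees formats only through that interface. All names here (`ToyFileFormat`, `datasource_api`, `core_facade`, `format_for`, `registered_formats`) are hypothetical and are not the actual DataFusion API.

```rust
// Hypothetical sketch: each `mod` below stands in for a separate crate.
// None of these names are real DataFusion APIs.

/// Stand-in for a small shared-interface crate that every other crate depends on.
mod datasource_api {
    /// Toy interface that every format, built-in or external, would implement.
    pub trait ToyFileFormat {
        fn name(&self) -> &'static str;
        fn extension(&self) -> &'static str;
    }
}

/// Stand-in for a crate such as datafusion-datasource-csv.
mod datasource_csv {
    use super::datasource_api::ToyFileFormat;

    pub struct CsvFormat;

    impl ToyFileFormat for CsvFormat {
        fn name(&self) -> &'static str { "csv" }
        fn extension(&self) -> &'static str { "csv" }
    }
}

/// Stand-in for a crate such as datafusion-datasource-arrow.
mod datasource_arrow {
    use super::datasource_api::ToyFileFormat;

    pub struct ArrowFormat;

    impl ToyFileFormat for ArrowFormat {
        fn name(&self) -> &'static str { "arrow" }
        fn extension(&self) -> &'static str { "arrow" }
    }
}

/// Stand-in for a slimmed-down core: it only knows the trait,
/// never the concrete format types.
mod core_facade {
    use super::datasource_api::ToyFileFormat;

    /// Pick the first registered format whose extension matches the path.
    pub fn format_for(formats: &[Box<dyn ToyFileFormat>], path: &str) -> Option<&'static str> {
        formats
            .iter()
            .find(|f| path.ends_with(&format!(".{}", f.extension())))
            .map(|f| f.name())
    }
}

/// The top-level (user-facing) crate is the only place that wires formats together.
fn registered_formats() -> Vec<Box<dyn datasource_api::ToyFileFormat>> {
    vec![
        Box::new(datasource_csv::CsvFormat),
        Box::new(datasource_arrow::ArrowFormat),
    ]
}

fn main() {
    let formats = registered_formats();
    // prints Some("csv")
    println!("{:?}", core_facade::format_for(&formats, "data/part-0.csv"));
}
```

With this shape, adding or removing a format touches only its own module and the registration list, which is the property that lets the real crates compile in parallel.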
I think we could start by creating datafusion-catalog-listing and trying to pull some of the listing table implementation into it, and then trying to move one of the simpler datasources out (datafusion-datasource-arrow perhaps).
Additional context
No response