Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] Split datasources out from datafusion crate (datafusion/core) #14444

Open
alamb opened this issue Feb 3, 2025 · 9 comments
Open

[Epic] Split datasources out from datafusion crate (datafusion/core) #14444

alamb opened this issue Feb 3, 2025 · 9 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Feb 3, 2025

Is your feature request related to a problem or challenge?

Historically DataFusion was one (very) large crate datafusion, and as it grew bigger we extracted various functionality into separate crates. This leads to both faster compile times (as the crates can be compiled in parallel) as well easier to navigate code (as the crates force a cleaner dependency separation)

As described by @waynexia the build time of DataFusion has been growing,

Some of this is due to the fact there is more code / more features to test. However a non trivial part of the long compile time is the time taken to compile the datafusion / core crate in https://github.com/apache/datafusion/tree/main/datafusion/core

Image

While we are pursuing additional ways to reduce compile time, I think we should also move more code out of datafusion/core into their own crates.

We have successfully done this in the past with other projects such as

Describe the solution you'd like

I would like to split out the https://github.com/apache/datafusion/tree/main/datafusion/core/src/datasource from DataFusion core

Describe alternatives you've considered

I think we will end up with several new crates

  • datafusion-catalog-listing: ListingTable and associated types like PartitionedFile
  • datafusion-datasource-parquet: ParquetExec and file firmat
  • datafusion-datasource-avro AvroExec and file formats
  • datafusion-datasource-arrow
  • datafusion-datasource-json
  • datafusion-datasource-csv

I think we could start by creating datafusion-catalog-listing and trying to pull some of the listing table implementation into there and then trying to move one of the simpler datasources out (datafusion-datasource-arrow perhaps)

Additional context

No response

@alamb alamb added the enhancement New feature or request label Feb 3, 2025
@alamb
Copy link
Contributor Author

alamb commented Feb 3, 2025

FYI @logan-keede this is the idea I was mentioning to you regarding more refactoring fun

@logan-keede
Copy link
Contributor

take

@davisp
Copy link
Member

davisp commented Feb 5, 2025

Whoa, this is awesome!

I'm still ramping up on learning DataFusion internals as I add my own extensions and one thing that's been nagging at me is that it almost feels like the built-in data source providers are "cheating" because they get to live in core. Moving them all to separate crates that have to use the same interfaces as external datasources is something I've been contemplating suggesting for awhile so its great to see this already happening.

I'm definitely going to be keeping up and helping with this work.

@davisp
Copy link
Member

davisp commented Feb 5, 2025

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

@logan-keede
Copy link
Contributor

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

@davisp try this

rm -rf target 
cargo build --release --timings --lib --quiet

@davisp
Copy link
Member

davisp commented Feb 6, 2025

@logan-keede Thanks for the tip!

@alamb
Copy link
Contributor Author

alamb commented Feb 6, 2025

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

(and to be clear, that was @waynexia who made that chart for #14256 I just copy pasted it here)

I'm still ramping up on learning DataFusion internals as I add my own extensions and one thing that's been nagging at me is that it almost feels like the built-in data source providers are "cheating" because they get to live in core. Moving them all to separate crates that have to use the same interfaces as external datasources is something I've been contemplating suggesting for awhile so its great to see this already happening.

Awesome -- indeed this is the case (and a similar thing used to happen with functions before we pulled them out)

BTW there is a major refactor for datasources done by @mertak-synnada @ozankabak and others that (just) merged:

Among other things it makes it easier to re-use common datasource features like pushdown, limit, etc

Now that that is in, we will be able to push this project along like a 🚀

@alamb
Copy link
Contributor Author

alamb commented Feb 12, 2025

Update here is that @logan-keede is cranking right along:

After some discussion with @jayzhan211 I think we have a good plan going forward

I feel like we are on the cusp of finally getting these things split from the core 🙏

@logan-keede
Copy link
Contributor

logan-keede commented Feb 12, 2025

This has been an update in plan, specifically addition of datafusion-datasource crate

  • datafusion-catalog-listing: ListingTable and original datasource::listing
  • datafusion-datasource: NEW that holds PartitonedFile,and DataSource and other File related components
  • datafusion-datasource-parquet: ParquetExec and file formats
  • datafusion-datasource-avro AvroExec and file formats
  • datafusion-datasource-arrow
  • datafusion-datasource-json
  • datafusion-datasource-csv

Originally posted by @alamb and suggested by @jayzhan211 in #14616 (comment)

Please refer to the above mentioned issue for more context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants