Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Move various parts of datasource out of core #14616

Merged
merged 10 commits into from
Feb 12, 2025

Conversation

logan-keede
Copy link
Contributor

@logan-keede logan-keede commented Feb 11, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Move the parts of datasource that do not have much coupling out of core.

Are these changes tested?

yes, by Github CI.

Are there any user-facing changes?

No, there should not be.

@github-actions github-actions bot added the core Core DataFusion crate label Feb 11, 2025
@logan-keede
Copy link
Contributor Author

cc @alamb

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @logan-keede -- I think this looks great to me. I have a few small questions but nothing major that can't be done as a follow on PR

I love this incremental approach @logan-keede -- very very nice

/// as other operators.
///
/// [`FileStream`]: <https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/physical_plan/file_stream.rs>
pub struct FileStreamMetrics {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about if there was some better name for this module than file_stream_part.rs but I couldn't come up with one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, when all of file_stream is moved, we can change the name. or perhaps we should just keep it file_stream as I did not make a file_scan_config_part but file_scan_config (it was an arbitrary decision).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer file_stream

use crate::datasource::physical_plan::FileMeta;
use crate::datasource::physical_plan::{FileOpenFuture, FileOpener};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think it is ok to re-export FileOpenFuture and FileOpener in physical_plan ? Or should they still be in a file_stream submodule?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are already in physical_plan. They could be imported from both file_stream and physical_plan earlier.

pub use file_stream::{FileOpenFuture, FileOpener, FileStream, OnError};

or did I misunderstand something?

@jayzhan211
Copy link
Contributor

Why not move to the datasource crate but catalog-listing

@logan-keede
Copy link
Contributor Author

Why not move to the datasource crate but catalog-listing

original plan was to move just datasource::listing to catalog-listing, but listing/table.rs has some dependencies on 'these' files (almost all file with pattern "physical_plan/file_*.rs" or "file_format/file_*.rs") but 'these' files also have some dependency on catalog-listing (like PartitionedFile and more). So, this was all I could think of to avoid circular dependency and keeping original structure somewhat intact. do you still think we can make a datasource crate distinct from catalog-listing?

@jayzhan211
Copy link
Contributor

FileCompressionType, PartitionedFile, FileRange can be move to datasource.

If A and B is tightly couple, you need to pull partial structure out to C and import C for A and B. Not moving A and B together.

@alamb
Copy link
Contributor

alamb commented Feb 12, 2025

FileCompressionType, PartitionedFile, FileRange can be move to datasource.

If A and B is tightly couple, you need to pull partial structure out to C and import C for A and B. Not moving A and B together.

The possible plan I proposed on #14444 proposes a structure like this

  • datafusion-catalog-listing: ListingTable and associated types like PartitionedFile
  • datafusion-datasource-parquet: ParquetExec and file firmat
  • datafusion-datasource-avro AvroExec and file formats
  • datafusion-datasource-arrow
  • datafusion-datasource-json
  • datafusion-datasource-csv

This was before the refactor with DataSource landed:

@jayzhan211 are you proposing we add another crate in there, something like?

  • datafusion-catalog-listing: ListingTable and associated types like PartitionedFile
  • datafusion-datasource: NEW that holds FileCompressionType, PartitonedFile,and DataSource
  • datafusion-datasource-parquet: ParquetExec and file firmat
  • datafusion-datasource-avro AvroExec and file formats
  • datafusion-datasource-arrow
  • datafusion-datasource-json
  • datafusion-datasource-csv

@alamb
Copy link
Contributor

alamb commented Feb 12, 2025

FWIW I think we can merge this PR and then keep moving code around as follow on PRs)

@jayzhan211
Copy link
Contributor

Yes and not only the 3 I mentioned, FileFormat, FileFormatFactory, etc. I think File related structure should belong into it own crate and then the implementation like parquet, arrow or listing table import it.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 12, 2025

Catalog -> Schema -> Table -> FileFormat -> QueryPlanner is the overall dependency w
e follow by.

use crate::physical_plan::RecordBatchStream;

use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use datafusion_common::instant::Instant;
pub use datafusion_catalog_listing::file_stream_part::*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to avoid importing everything at once so that the imported modules are explicitly controlled and managed.

@logan-keede
Copy link
Contributor Author

image
mostly for my clarity

also, I am not suggesting that we keep the file_format and physical_plan folders in new solution,
though I was going to suggest(only one folder for all file related) if we move forward by old approach

PS: @jayzhan211 can/should I tag you in related PRs?

@alamb
Copy link
Contributor

alamb commented Feb 12, 2025

also, I am not suggesting that we keep the file_format and physical_plan folders in new solution,
though I was going to suggest(only one folder for all file related) if we move forward by old approach

That is my understanding as well

Screenshot 2025-02-12 at 9 35 28 AM

@alamb
Copy link
Contributor

alamb commented Feb 12, 2025

Ok, I am going to merge this one in and we can keep working on this in follow on PRs.

@logan-keede can you update the plan on #14444 and keep on hacking ?

Thanks again!

@alamb alamb merged commit 980931c into apache:main Feb 12, 2025
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants