Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some dependency removals and setup for refactor of FileScanConfig #14543

Merged
merged 7 commits into from
Feb 7, 2025

Conversation

logan-keede
Copy link
Contributor

@logan-keede logan-keede commented Feb 7, 2025

Which issue does this PR close?

Rationale for this change

  1. SessionState needs to be replaced with Session in datasource to split it out of core.
  2. get_file_format_factory needs to be implemented for Session
    pub trait FileFormatFactory: Sync + Send + GetExt + Debug {
  3. for 2, we need to move FileScanConfig to datafusion-catalog-listing(or some other place, but out of core)

What changes are included in this PR?

Refactoring,

  • minor: SessionState -> dyn Session + some downcast to be removed later.
  • minor: avoid use of AvroSource in file_format/mod.rs (thanks to @mertak-synnada for helping out)
  • major: major move FileScanConfig to datafusion-catalog-listing (upcoming, we can probably do it in a separate PR) coupling due to new merges seems to have made this a bit difficult than before, I am hoping we can go with this while I sort other parts out.

Are these changes tested?

Are there any user-facing changes?

There should not be

@logan-keede logan-keede marked this pull request as ready for review February 7, 2025 19:21
@logan-keede
Copy link
Contributor Author

cc @alamb

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @logan-keede -- this looks great. Thank you so much for making this small PRs.

I had a small suggestion but I think we could do it as a follow on PR too

Comment on lines 65 to 68
/// Check if repartition is supported
fn supports_repartition(&self, config: &FileScanConfig) -> bool {
!(config.file_compression_type.is_compressed() || config.new_lines_in_values)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better we did not provide a default implementation for supports_repartition

Assuming that a file source can repartition based on its compression and newlines seems very specific to CSV.

Suggested change
/// Check if repartition is supported
fn supports_repartition(&self, config: &FileScanConfig) -> bool {
!(config.file_compression_type.is_compressed() || config.new_lines_in_values)
}
/// Check if repartition is supported
fn supports_repartition(&self, config: &FileScanConfig) -> bool;

Copy link
Contributor Author

@logan-keede logan-keede Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to include it in my next PR, which should be tomorrow hopefully.

just for clarity do you want to implement something like this for other supported file format

fn supports_repartition(&self, config: &FileScanConfig) -> bool {
        !(config.file_compression_type.is_compressed())
    }

and original for csv. I do not have much context on this function, So i can not comment on whether newlines is useful only in csv or not. I will try to look into it tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries! Thanks -- I pushed a change to this branch -- hopefully that was ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(let's just hope I haven't broken the CI checks. 😆 😭 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤞

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it moving!

@alamb alamb merged commit 7ccc6d7 into apache:main Feb 7, 2025
25 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 8, 2025

Thanks again @logan-keede

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants