Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize values list execution plan by moving the evaluation part to execution phase #1169

Open
jimexist opened this issue Oct 23, 2021 · 7 comments
Labels
datafusion Changes in the datafusion crate good first issue Good for newcomers

Comments

@jimexist
Copy link
Member

I wonder what you think about moving the creation of the actual RecordBatch from ValuesExec::try_new to execute -- the rationale would be to make PhysicalPlan creation faster and push the actual work into execute where if can potentially be run concurrently with other parts

Given the size of data in a VALUES statement, this is not likely to be any real difference so I am fine with leaving the creation in the same place too -- I just wanted to mention it.

Originally posted by @alamb in #1165 (comment)

@alamb alamb added datafusion Changes in the datafusion crate good first issue Good for newcomers labels Oct 23, 2021
@ygf11
Copy link
Contributor

ygf11 commented Oct 25, 2021

@alamb @jimexist It seems not hard, I want have a try!

@alamb
Copy link
Contributor

alamb commented Oct 25, 2021

Thanks @ygf11 !

@ygf11
Copy link
Contributor

ygf11 commented Oct 31, 2021

It seems statistics computation in ValuesExec also needs the evaluation result.
We need a RwLock?
https://github.com/apache/arrow-datafusion/blob/6b964cbfedc3ee5b08dea8a6c598de7844b95e91/datafusion/src/physical_plan/values.rs#L164-L168

@alamb
Copy link
Contributor

alamb commented Nov 1, 2021

I think don't think adding an RwLock sounds like a good idea. Perhaps we could compute the statistics when the ValuesExec is created (rather than computing on the call to statistics(&self)?

@jimexist
Copy link
Member Author

jimexist commented Nov 1, 2021

I think don't think adding an RwLock sounds like a good idea. Perhaps we could compute the statistics when the ValuesExec is created (rather than computing on the call to statistics(&self)?

agreed, no need for a lock. we can just put evaluation in the creation part of ValuesExec. i mean it probably wouldn't make any difference

@ygf11
Copy link
Contributor

ygf11 commented Nov 1, 2021

@alamb IMOP, earlier computing statistics conflicts with the purpose of this task, because this will iterate all row and column, which also slow the creation of ValueExec. Maybe stay the same is ok?

I think don't think adding an RwLock sounds like a good idea. Perhaps we could compute the statistics when the ValuesExec is created (rather than computing on the call to statistics(&self)?

@alamb
Copy link
Contributor

alamb commented Nov 1, 2021

@ygf11 that is an excellent point -- perhaps then we should just close this task as "won't do" ?

andygrove added a commit to andygrove/datafusion that referenced this issue Jan 31, 2025
…ache#1169)

* Add Spark-compatible SchemaAdapterFactory implementation

* remove prototype code

* fix

* refactor

* implement more cast logic

* implement more cast logic

* add basic test

* improve test

* cleanup

* fmt

* add support for casting unsigned int to signed int

* clippy

* address feedback

* fix test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

3 participants