optimize values list execution plan by moving the evaluation part to execution phase #1169

jimexist · 2021-10-23T11:46:55Z

I wonder what you think about moving the creation of the actual RecordBatch from ValuesExec::try_new to execute -- the rationale would be to make PhysicalPlan creation faster and push the actual work into execute, where it can potentially be run concurrently with other parts.

Given the size of data in a VALUES statement, this is not likely to make any real difference so I am fine with leaving the creation in the same place too -- I just wanted to mention it.

Originally posted by @alamb in #1165 (comment)
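
For readers following along, here is a minimal sketch of the trade-off being proposed. It does not use DataFusion's real API: `Row`, `Batch`, `EagerValues`, `DeferredValues`, `validate`, and `evaluate` are hypothetical stand-ins, with `Batch` loosely playing the role of a RecordBatch. The eager variant does the row-to-column conversion in `try_new`; the deferred variant only validates there and converts in `execute`.

```rust
// Hypothetical, simplified stand-ins; these are NOT DataFusion's real types.
pub type Row = Vec<i64>;

/// Column-major data, loosely playing the role of a RecordBatch.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub columns: Vec<Vec<i64>>,
}

/// Cheap check done at plan-creation time in both variants.
fn validate(rows: &[Row]) -> Result<(), String> {
    let width = rows.first().map(|r| r.len()).ok_or("VALUES list is empty")?;
    if rows.iter().any(|r| r.len() != width) {
        return Err("rows have different lengths".to_string());
    }
    Ok(())
}

/// The "evaluation" step: transpose rows into columns.
/// Assumes the rows were already validated as rectangular.
fn evaluate(rows: &[Row]) -> Batch {
    let width = rows.first().map_or(0, |r| r.len());
    let columns = (0..width)
        .map(|c| rows.iter().map(|r| r[c]).collect::<Vec<i64>>())
        .collect();
    Batch { columns }
}

/// Eager variant: the batch is built when the plan node is created.
pub struct EagerValues {
    batch: Batch,
}

impl EagerValues {
    pub fn try_new(rows: Vec<Row>) -> Result<Self, String> {
        validate(&rows)?;
        Ok(Self { batch: evaluate(&rows) })
    }

    pub fn execute(&self) -> Batch {
        self.batch.clone()
    }
}

/// Deferred variant: creation only validates; the batch is built in execute.
pub struct DeferredValues {
    rows: Vec<Row>,
}

impl DeferredValues {
    pub fn try_new(rows: Vec<Row>) -> Result<Self, String> {
        validate(&rows)?;
        Ok(Self { rows })
    }

    pub fn execute(&self) -> Batch {
        evaluate(&self.rows)
    }
}
```

Both variants return the same batch from `execute`; the only difference is when the conversion cost is paid. Note that the deferred variant repeats the conversion on every `execute` call unless the result is cached somewhere.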


ygf11 · 2021-10-25T08:06:24Z

@alamb @jimexist It doesn't seem hard, I'd like to give it a try!

alamb · 2021-10-25T10:39:30Z

Thanks @ygf11 !

ygf11 · 2021-10-31T14:34:46Z

It seems the statistics computation in ValuesExec also needs the evaluation result.
Do we need an RwLock?
https://github.com/apache/arrow-datafusion/blob/6b964cbfedc3ee5b08dea8a6c598de7844b95e91/datafusion/src/physical_plan/values.rs#L164-L168
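
To make the question concrete, here is a minimal sketch of what a lock-based lazy cache could look like. Everything in it is an assumption for illustration: `LazyValues`, `Stats`, `transpose`, and the `evaluations` counter are hypothetical and not part of DataFusion. Because `statistics` takes `&self`, a deferred evaluation result can only be stored through some form of interior mutability, which is where the RwLock would come in.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::RwLock;

/// Hypothetical statistics summary (not DataFusion's Statistics struct).
#[derive(Debug, PartialEq)]
pub struct Stats {
    pub num_rows: usize,
    pub num_columns: usize,
}

/// Hypothetical plan node that evaluates its rows lazily and caches the result.
pub struct LazyValues {
    rows: Vec<Vec<i64>>,
    /// Column-major result, filled on first use.
    cache: RwLock<Option<Vec<Vec<i64>>>>,
    /// Counts how often the evaluation actually ran (for illustration only).
    evaluations: AtomicUsize,
}

/// Transpose rows into columns; assumes all rows have the same length.
fn transpose(rows: &[Vec<i64>]) -> Vec<Vec<i64>> {
    let width = rows.first().map_or(0, |r| r.len());
    (0..width)
        .map(|c| rows.iter().map(|r| r[c]).collect::<Vec<i64>>())
        .collect()
}

impl LazyValues {
    pub fn new(rows: Vec<Vec<i64>>) -> Self {
        Self {
            rows,
            cache: RwLock::new(None),
            evaluations: AtomicUsize::new(0),
        }
    }

    /// Return the evaluated columns, computing them at most once.
    fn columns(&self) -> Vec<Vec<i64>> {
        {
            // Fast path: take the read lock and return the cached value.
            let guard = self.cache.read().unwrap();
            if let Some(cols) = &*guard {
                return cols.clone();
            }
        } // read guard is dropped here, before the write lock is taken

        // Slow path: take the write lock; another thread may have filled
        // the cache in the meantime, so only evaluate if it is still empty.
        let mut guard = self.cache.write().unwrap();
        guard
            .get_or_insert_with(|| {
                self.evaluations.fetch_add(1, Ordering::SeqCst);
                transpose(&self.rows)
            })
            .clone()
    }

    /// Statistics need the evaluated result even though we only have &self.
    pub fn statistics(&self) -> Stats {
        let cols = self.columns();
        Stats {
            num_rows: cols.first().map_or(0, |c| c.len()),
            num_columns: cols.len(),
        }
    }

    pub fn evaluations(&self) -> usize {
        self.evaluations.load(Ordering::SeqCst)
    }
}
```

In this sketch, calling `statistics` twice triggers only one evaluation, but every call now goes through a lock, which adds complexity to a node that otherwise holds immutable data.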

alamb · 2021-11-01T13:44:04Z

I don't think adding an RwLock sounds like a good idea. Perhaps we could compute the statistics when the ValuesExec is created (rather than computing them on the call to statistics(&self))?

jimexist · 2021-11-01T14:12:15Z

> I don't think adding an RwLock sounds like a good idea. Perhaps we could compute the statistics when the ValuesExec is created (rather than computing them on the call to statistics(&self))?

Agreed, no need for a lock. We can just keep the evaluation in the creation part of ValuesExec. I mean, it probably wouldn't make any difference.

ygf11 · 2021-11-01T14:34:56Z

@alamb IMO, computing the statistics earlier conflicts with the purpose of this task, because it would iterate over every row and column, which also slows down the creation of ValuesExec. Maybe leaving it as it is would be OK?

> I don't think adding an RwLock sounds like a good idea. Perhaps we could compute the statistics when the ValuesExec is created (rather than computing them on the call to statistics(&self))?

alamb · 2021-11-01T14:52:26Z

@ygf11 that is an excellent point -- perhaps then we should just close this task as "won't do" ?

jimexist mentioned this issue Oct 23, 2021

add values list expression #1165

Merged

alamb added the "datafusion" (Changes in the datafusion crate) and "good first issue" (Good for newcomers) labels Oct 23, 2021
