Add ability to receive an iterator over the inputs of a LogicalPlan instead of a Vec. #10808
Comments
I believe @peter-toth, @ozankabak, and I discussed something similar in #10543 (comment). The new reference walking APIs added in #10543 (which I don't think are released yet) may also be related -- specifically |
@LorrensP-2158466, as far as I understand, you just want to change the return type of But I feel you will run into problems with the |
Oh yeah, you are right, I don't know why I didn't think about that. I will try it anyway, maybe I will come up with a different solution. Thanks Peter, for letting me know! |
In order to make it work, you might be able to implement a custom iterator like
enum InputIterator<'a> {
    SingleInput(&'a LogicalPlan),
    VecInput(&'a [LogicalPlan]),
    // ..
}
And then implement the |
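A minimal sketch of the custom iterator suggested above, assuming a stand-in `LogicalPlan` struct instead of DataFusion's real type. The variant payloads here (`Option<&LogicalPlan>` and `std::slice::Iter`) are an illustrative choice so that each variant can track its own position; they are not taken from the comment.

```rust
// Stand-in for DataFusion's LogicalPlan (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
pub struct LogicalPlan(pub &'static str);

/// Iterator over borrowed inputs that never allocates a Vec.
pub enum InputIterator<'a> {
    /// Exactly one input; becomes `None` once it has been yielded.
    SingleInput(Option<&'a LogicalPlan>),
    /// Any number of inputs stored contiguously.
    VecInput(std::slice::Iter<'a, LogicalPlan>),
}

impl<'a> Iterator for InputIterator<'a> {
    type Item = &'a LogicalPlan;

    fn next(&mut self) -> Option<Self::Item> {
        match self {
            // `take` hands out the reference once and leaves `None` behind.
            InputIterator::SingleInput(slot) => slot.take(),
            // The slice iterator already knows how to advance itself.
            InputIterator::VecInput(iter) => iter.next(),
        }
    }
}

/// Helper constructors (hypothetical names).
pub fn single(plan: &LogicalPlan) -> InputIterator<'_> {
    InputIterator::SingleInput(Some(plan))
}

pub fn many(plans: &[LogicalPlan]) -> InputIterator<'_> {
    InputIterator::VecInput(plans.iter())
}
```

A caller would write something like `for input in single(&plan) { .. }` and get the one input without a heap allocation.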
So I have found a solution that I can compile, but at this stage I'm not very happy with it.
Explanation
I first tried the solution Andrew mentioned, but it runs into the following problems:
To support all these cases I came up with:
pub enum InputIterator<'parent> {
    Empty(Empty<&'parent LogicalPlan>),
    Single(Once<&'parent LogicalPlan>),
    Double(DoubleIter<&'parent LogicalPlan>), // DoubleIter<T> = std::array::IntoIter<T, 2>
    Multiple(std::vec::IntoIter<&'parent LogicalPlan>),
}
Built-In LogicalPlans
Most cases can be handled by the
Extension Plans
To handle the
fn inputs_iter(&self) -> InputIterator<'_>;
The user of this trait can then choose their own iterator, and we only have to call
More Problems
But there are still some problems with this (I think):
Slice(std::slice::Iter<'parent, LogicalPlan>)
which is a slice iterator, so the above would be
Solutions
To fix this I can split up the
pub enum InputIterator<'parent> {
    // ...
    Slice(std::slice::Iter<'parent, LogicalPlan>),
    // maps &Arc<LogicalPlan> into &LogicalPlan
    FromArcs(Map<std::slice::Iter<'parent, Arc<LogicalPlan>>, fn(&'parent Arc<LogicalPlan>) -> &'parent LogicalPlan>),
}
This does sum up to a total of 5 different iterator types, but I don't know how I can cover every possible way of holding onto multiple inputs. For example, if a node implementation holds its inputs in some other collection (like
Currently
So currently the
pub enum InputIterator<'parent> {
    Empty(Empty<&'parent LogicalPlan>),
    Single(Once<&'parent LogicalPlan>),
    Double(DoubleIter<&'parent LogicalPlan>),
    Slice(SliceIter<'parent, LogicalPlan>),
    FromArcs(FromArcsIter<'parent, LogicalPlan>),
}
I have made a macro which lets me "delegate" a call to this iterator to any of the inner iterators, e.g.
fn next(&mut self) -> Option<Self::Item> {
    delegate!(self, iter => iter.next())
}
Things to do:
If you guys really think this can be helpful, I can open up a PR so you can look at some details, but it may be worthwhile to just wait until Rust allows us to return multiple types in a |
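The five-variant enum and the delegating macro described in this comment could look roughly like the sketch below. The macro body, the `ArcToRef` alias, the `from_arcs` helper, and the stand-in `LogicalPlan` struct are all assumptions made for illustration; the actual implementation was not posted in the thread.

```rust
use std::iter::{Empty, Map, Once};
use std::sync::Arc;

// Stand-in for DataFusion's LogicalPlan (hypothetical).
#[derive(Debug, PartialEq)]
pub struct LogicalPlan(pub &'static str);

// Function-pointer type used to turn `&Arc<LogicalPlan>` into `&LogicalPlan`.
type ArcToRef<'p> = fn(&'p Arc<LogicalPlan>) -> &'p LogicalPlan;

pub enum InputIterator<'p> {
    Empty(Empty<&'p LogicalPlan>),
    Single(Once<&'p LogicalPlan>),
    Double(std::array::IntoIter<&'p LogicalPlan, 2>),
    Slice(std::slice::Iter<'p, LogicalPlan>),
    FromArcs(Map<std::slice::Iter<'p, Arc<LogicalPlan>>, ArcToRef<'p>>),
}

// Expands to a `match` that runs the same expression on whichever
// inner iterator is currently held.
macro_rules! delegate {
    ($self:expr, $iter:ident => $body:expr) => {
        match $self {
            InputIterator::Empty($iter) => $body,
            InputIterator::Single($iter) => $body,
            InputIterator::Double($iter) => $body,
            InputIterator::Slice($iter) => $body,
            InputIterator::FromArcs($iter) => $body,
        }
    };
}

impl<'p> Iterator for InputIterator<'p> {
    type Item = &'p LogicalPlan;

    fn next(&mut self) -> Option<Self::Item> {
        delegate!(self, iter => iter.next())
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        delegate!(self, iter => iter.size_hint())
    }
}

fn arc_to_ref(plan: &Arc<LogicalPlan>) -> &LogicalPlan {
    plan.as_ref()
}

/// Builds the `FromArcs` variant from a slice of `Arc`s without cloning them.
pub fn from_arcs(inputs: &[Arc<LogicalPlan>]) -> InputIterator<'_> {
    let f: ArcToRef<'_> = arc_to_ref;
    InputIterator::FromArcs(inputs.iter().map(f))
}
```

Using a plain function pointer for the `Map` closure keeps the variant's type nameable, at the cost of an indirect call per element.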
I agree your (clearly very impressive) skills might be better spent on other projects. Is there any particular issue or area you are interested in working on? |
Thanks, that means a lot. I'm interested in anything data science or database related. I came in contact with DataFusion because of my Bachelor's project this year, and I liked it so much that I wanted to contribute in any way I could. But because of school, I can't find the time to do this actively. The project was about detecting α-acyclic joins to help out a PhD project at my University, so I had to use LogicalPlans and the logical optimizer, where I came up with this "issue." Now that it was done, I wanted to try it out. As I said, I like to help wherever I can, but I'm not familiar enough with DataFusion to know where I can help. Maybe you guys know some places where I can look/help? Thanks for the interest and help in this issue! |
This is very cool -- is the code somewhere public? Hopefully doing this kind of analysis would be easier now with the nicer tree node API from @peter-toth. One thing that might be interesting / relevant perhaps then would be to add an example of that kind of analysis. For example, this ticket describes an example for an analyzer rule #10855, but writing an example of |
I can make my implementation public: AcyclicJoin Node. In short, it introduces a new logical node (acyclic join) that is coupled with a physical node (join impl of that acyclic join), but the physical node is part of that PhD project, which I can't share. To create those acyclic join nodes, I had to detect whether a particular join tree (i.e. subsequent joins in a LogicalPlan) is acyclic or not. I also think adding an example of SQL analysis would be nice. I'll move to #10871 so this can be closed. |
That is very cool -- thank you for sharing @LorrensP-2158466 . Since DataFusion is totally open it is always cool to see what people are doing with it. (BTW if you need a paper about DataFusion to cite -- we now have one https://dl.acm.org/doi/10.1145/3626246.3653368 :) ) |
Is your feature request related to a problem or challenge?
Currently, the only way to get the inputs of a LogicalPlan is to call inputs(), which returns a Vec<&LogicalPlan>. But I have noticed that there can be unnecessary calls to into_iter() or iter() on this vector.
Furthermore, the function returns a lot of Vectors of size 1, which can create unnecessary allocations.
This also applies to UserDefinedLogicalNode(Core), since I don't think the compiler can see through the use of inputs() and convert the Vec to an iterator.
Describe the solution you'd like
To change the API to return an Iterator instead of a vector requires a lot of rewriting, so I think it's maybe nicer to create a new function that returns an iterator over the inputs like this:
fn inputs_iter(&self) -> impl Iterator<Item = &LogicalPlan> {}
We also have to extend the API of UserDefinedLogicalNode(Core) to have the same function so extension nodes have this ability as well.
So instead of all the vec![ input ] calls, we can use std::iter::once, empty vecs can be empty iterators, and in the case of an extension node we can just call node.inputs_iter().
Describe alternatives you've considered
No response
Additional context
This issue is purely for changing the API, so I'm willing to do it if this is accepted. To use this function in the source code is a bit more work, so I think it will be better to open another issue for changing the calls to inputs() into inputs_iter().
I may be entirely wrong since I'm fairly new to DataFusion, so any feedback is greatly appreciated.
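To make the request concrete, here is a small sketch of the difference between the allocating `inputs()` and the proposed `inputs_iter()`. The `Filter` and `TableScan` structs and the stand-in `LogicalPlan` are hypothetical simplifications, not DataFusion's real definitions. Each method returns one concrete iterator type, which is why `impl Iterator` works here; a single method that picks between several iterator types would need something more.

```rust
use std::iter;

// Stand-in for DataFusion's LogicalPlan (hypothetical).
#[derive(Debug, PartialEq)]
pub struct LogicalPlan(pub &'static str);

// Hypothetical node with exactly one input.
pub struct Filter {
    pub input: LogicalPlan,
}

// Hypothetical leaf node with no inputs.
pub struct TableScan;

impl Filter {
    /// Current style: allocates a one-element Vec on every call.
    pub fn inputs(&self) -> Vec<&LogicalPlan> {
        vec![&self.input]
    }

    /// Proposed style: yields the same reference without allocating.
    pub fn inputs_iter(&self) -> impl Iterator<Item = &LogicalPlan> {
        iter::once(&self.input)
    }
}

impl TableScan {
    /// A leaf node has no inputs, so an empty iterator replaces `vec![]`.
    pub fn inputs_iter(&self) -> impl Iterator<Item = &LogicalPlan> {
        iter::empty()
    }
}
```

Both forms visit the same plans in the same order; only the allocation differs.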