You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This was written by @sadboy in #7775 (comment) and I think it deserves wider discussion, so I made a new ticket:
As a quick and simple solution, that's what I would recommend, yes.
More fundamentally, I think the contention arises from the de-facto "dual
use" nature of Datafusion's LogicalPlan/Expr API, which serves two very
different use cases:
The DataFrame users, who use DataFusion API to programmtically compose
their queries. In this kind of scenario, you do want to perform eager
input validation and fail early.
The "analysis" users, who use LogicalPlan as an IR. Datafusion's own
internal use of LogicalPlan would mostly fall into this category as
well, including the SQL compiler, the analyzer/optimizers, logical plan
serializers, etc. In this kind of scenario, input sanitation on the plan
node constructors is wasteful as arguments are already known to be
well-formed.
As things currently stand, the constructor methods in LogicalPlanBuilder
are largely geared to serve the former use case, but widely shared in the
code paths of the latter. Not an issue when the input queries are small, but
would definitely cause scalability issues when processing large queries
(like we've been experiencing). Having unchecked "dumb" constructors would
be greatly beneficial for perf here, but only if used consistently through
the whole codebase (not exactly a trivial concern, as in general simply
having "unchecked" in the method name would discourage people from using
them).
Ideally, however, I believe these two use cases are different enough that it
would be beneficial to actually separate them statically at the type level.
i.e. have a separate type hierarchy for "DataFramePlan"/"DataFrameExpr",
parallel to LogicalPlan/Expr. The former would be exclusively used to
capture "end user" input through the DataFrame API, while the latter would
be used exclusively as an internal IR, never directly exposed to the end
user. There would need to be an explicit conversion between the two, but
that would mostly be mechanical and trivial (and where the input sanitation
could take place). The benefit is then you entirely eliminate concerns such
as "should I call the checked or unchecked version of the constructor" when
dealing with logical plan/exprs. In addition, you could "fine tune" each
type hierarchy to better fit its purpose. For example, just off the top of
my head:
Plan types such as Expr::Wildcard can be removed, as they don't serve
any purpose in an IR. The effect is that you can greatly cut down the
"invalid states" in the analyzer/optimizer pipeline, making it much easier
to perform tree transformations
Expr::Column can be changed to index-based rather than name-based (i.e. Column { index: usize }), so you get guaranteed O(1) column resolution
The text was updated successfully, but these errors were encountered:
This was written by @sadboy in #7775 (comment) and I think it deserves wider discussion, so I made a new ticket:
As a quick and simple solution, that's what I would recommend, yes.
More fundamentally, I think the contention arises from the de-facto "dual
use" nature of Datafusion's
LogicalPlan
/Expr
API, which serves two verydifferent use cases:
their queries. In this kind of scenario, you do want to perform eager
input validation and fail early.
LogicalPlan
as an IR. Datafusion's owninternal use of
LogicalPlan
would mostly fall into this category aswell, including the SQL compiler, the analyzer/optimizers, logical plan
serializers, etc. In this kind of scenario, input sanitation on the plan
node constructors is wasteful as arguments are already known to be
well-formed.
As things currently stand, the constructor methods in
LogicalPlanBuilder
are largely geared to serve the former use case, but widely shared in the
code paths of the latter. Not an issue when the input queries are small, but
would definitely cause scalability issues when processing large queries
(like we've been experiencing). Having unchecked "dumb" constructors would
be greatly beneficial for perf here, but only if used consistently through
the whole codebase (not exactly a trivial concern, as in general simply
having "unchecked" in the method name would discourage people from using
them).
Ideally, however, I believe these two use cases are different enough that it
would be beneficial to actually separate them statically at the type level.
i.e. have a separate type hierarchy for "DataFramePlan"/"DataFrameExpr",
parallel to
LogicalPlan
/Expr
. The former would be exclusively used tocapture "end user" input through the DataFrame API, while the latter would
be used exclusively as an internal IR, never directly exposed to the end
user. There would need to be an explicit conversion between the two, but
that would mostly be mechanical and trivial (and where the input sanitation
could take place). The benefit is then you entirely eliminate concerns such
as "should I call the checked or unchecked version of the constructor" when
dealing with logical plan/exprs. In addition, you could "fine tune" each
type hierarchy to better fit its purpose. For example, just off the top of
my head:
Expr::Wildcard
can be removed, as they don't serveany purpose in an IR. The effect is that you can greatly cut down the
"invalid states" in the analyzer/optimizer pipeline, making it much easier
to perform tree transformations
Expr::Column
can be changed to index-based rather than name-based (i.e.Column { index: usize }
), so you get guaranteed O(1) column resolutionThe text was updated successfully, but these errors were encountered: