Discuss: Speeding up LogicalPlan manipulation by skipping validation #8556

alamb · 2023-12-15T12:09:46Z

This was written by @sadboy in #7775 (comment) and I think it deserves wider discussion, so I made a new ticket:

As a quick and simple solution, that's what I would recommend, yes.

More fundamentally, I think the contention arises from the de-facto "dual
use" nature of Datafusion's LogicalPlan/Expr API, which serves two very
different use cases:

The DataFrame users, who use DataFusion API to programmtically compose
their queries. In this kind of scenario, you do want to perform eager
input validation and fail early.
The "analysis" users, who use LogicalPlan as an IR. Datafusion's own
internal use of LogicalPlan would mostly fall into this category as
well, including the SQL compiler, the analyzer/optimizers, logical plan
serializers, etc. In this kind of scenario, input sanitation on the plan
node constructors is wasteful as arguments are already known to be
well-formed.

As things currently stand, the constructor methods in LogicalPlanBuilder
are largely geared to serve the former use case, but widely shared in the
code paths of the latter. Not an issue when the input queries are small, but
would definitely cause scalability issues when processing large queries
(like we've been experiencing). Having unchecked "dumb" constructors would
be greatly beneficial for perf here, but only if used consistently through
the whole codebase (not exactly a trivial concern, as in general simply
having "unchecked" in the method name would discourage people from using
them).

Ideally, however, I believe these two use cases are different enough that it
would be beneficial to actually separate them statically at the type level.
i.e. have a separate type hierarchy for "DataFramePlan"/"DataFrameExpr",
parallel to LogicalPlan/Expr. The former would be exclusively used to
capture "end user" input through the DataFrame API, while the latter would
be used exclusively as an internal IR, never directly exposed to the end
user. There would need to be an explicit conversion between the two, but
that would mostly be mechanical and trivial (and where the input sanitation
could take place). The benefit is then you entirely eliminate concerns such
as "should I call the checked or unchecked version of the constructor" when
dealing with logical plan/exprs. In addition, you could "fine tune" each
type hierarchy to better fit its purpose. For example, just off the top of
my head:

Plan types such as Expr::Wildcard can be removed, as they don't serve
any purpose in an IR. The effect is that you can greatly cut down the
"invalid states" in the analyzer/optimizer pipeline, making it much easier
to perform tree transformations
Expr::Column can be changed to index-based rather than name-based (i.e.
Column { index: usize }), so you get guaranteed O(1) column resolution

The text was updated successfully, but these errors were encountered:

alamb mentioned this issue Dec 15, 2023

reduce clones of LogicalPlan in planner #7775

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discuss: Speeding up LogicalPlan manipulation by skipping validation #8556

Discuss: Speeding up LogicalPlan manipulation by skipping validation #8556

alamb commented Dec 15, 2023

Discuss: Speeding up LogicalPlan manipulation by skipping validation #8556

Discuss: Speeding up LogicalPlan manipulation by skipping validation #8556

Comments

alamb commented Dec 15, 2023