Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss: Speeding up LogicalPlan manipulation by skipping validation #8556

Open
alamb opened this issue Dec 15, 2023 · 0 comments
Open

Discuss: Speeding up LogicalPlan manipulation by skipping validation #8556

alamb opened this issue Dec 15, 2023 · 0 comments

Comments

@alamb
Copy link
Contributor

alamb commented Dec 15, 2023

This was written by @sadboy in #7775 (comment) and I think it deserves wider discussion, so I made a new ticket:

As a quick and simple solution, that's what I would recommend, yes.

More fundamentally, I think the contention arises from the de-facto "dual
use" nature of Datafusion's LogicalPlan/Expr API, which serves two very
different use cases:

  1. The DataFrame users, who use DataFusion API to programmtically compose
    their queries. In this kind of scenario, you do want to perform eager
    input validation and fail early.
  2. The "analysis" users, who use LogicalPlan as an IR. Datafusion's own
    internal use of LogicalPlan would mostly fall into this category as
    well, including the SQL compiler, the analyzer/optimizers, logical plan
    serializers, etc. In this kind of scenario, input sanitation on the plan
    node constructors is wasteful as arguments are already known to be
    well-formed.

As things currently stand, the constructor methods in LogicalPlanBuilder
are largely geared to serve the former use case, but widely shared in the
code paths of the latter. Not an issue when the input queries are small, but
would definitely cause scalability issues when processing large queries
(like we've been experiencing). Having unchecked "dumb" constructors would
be greatly beneficial for perf here, but only if used consistently through
the whole codebase (not exactly a trivial concern, as in general simply
having "unchecked" in the method name would discourage people from using
them).

Ideally, however, I believe these two use cases are different enough that it
would be beneficial to actually separate them statically at the type level.
i.e. have a separate type hierarchy for "DataFramePlan"/"DataFrameExpr",
parallel to LogicalPlan/Expr. The former would be exclusively used to
capture "end user" input through the DataFrame API, while the latter would
be used exclusively as an internal IR, never directly exposed to the end
user. There would need to be an explicit conversion between the two, but
that would mostly be mechanical and trivial (and where the input sanitation
could take place). The benefit is then you entirely eliminate concerns such
as "should I call the checked or unchecked version of the constructor" when
dealing with logical plan/exprs. In addition, you could "fine tune" each
type hierarchy to better fit its purpose. For example, just off the top of
my head:

  • Plan types such as Expr::Wildcard can be removed, as they don't serve
    any purpose in an IR. The effect is that you can greatly cut down the
    "invalid states" in the analyzer/optimizer pipeline, making it much easier
    to perform tree transformations
  • Expr::Column can be changed to index-based rather than name-based (i.e.
    Column { index: usize }), so you get guaranteed O(1) column resolution
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant