Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join Ordering using UES #939

Open
DerSchmidt opened this issue Jan 9, 2025 · 3 comments
Open

Join Ordering using UES #939

DerSchmidt opened this issue Jan 9, 2025 · 3 comments
Assignees

Comments

@DerSchmidt
Copy link
Collaborator

The SQL Parser currently outputs an unoptimized query plan using the DAPHNE-IR. One of the flaws is the join order, which is based on the order the joins are present in the given query.
We suggest using UES (Published in "Simplicity Done Right for Join Ordering"[CIDR 2021]) which is a simple join reordering that delivers good results.

For this, we would need to

  • add statistical information (maximum value frequencies and distinct element counts)
  • add a new lowering pass, in which the reordering occurs based on those statistics
@DerSchmidt DerSchmidt self-assigned this Jan 9, 2025
@pdamme
Copy link
Collaborator

pdamme commented Jan 9, 2025

Thanks for starting the work on join ordering, @DerSchmidt! That's very improtant for efficient relational query processing in DAPHNE.

Regarding adding statistical information: DAPHNE is already able to represent a few interesting data characteristics for matrices and frames. The two new characteristics you mention can be added in a similar way. Here are some hints on what needs to be done (no claim of completeness):

  1. We need to be able to represent the two new data characteristics in DaphneIR.
    • DaphneIR encodes data characteristics as parameters of MLIR types.
    • DAPHNE's custom MLIR types are defined in src/ir/daphneir/DaphneTypes.td.
    • You need to add new parameters to the Frame type (ideally some kind of list/array to be able to store an individual value per column).
    • Inspiration can be taken from the labels parameter of the Frame type (a std::vector to represent the data property for each column, pointer to such a vector to make this information completely optional (could be a nullptr))
    • It should always be possible to represent that the value of a data property in unknown (e.g., by -1 or by nullptr). This is important because determining the data properties can be quite expensive or not possible at DaphneDSL compile-time, so in many cases one might want/need to omit the information on a certain property.
    • Feel free to adapt the custom methods of the Frame type in its extraClassDeclaration; isSpecializationOf() should definitely be adapted.
  2. We need to be able to represent the two new data characteristic for input data sets.
    • Information on the data characteristics of input data sets (e.g., CSV files) is stored in meta data files in DAPHNE.
    • You need to add fields for the two new data properties to the FileMetaData struct (src/runtime/local/io/FileMetaData.h).
    • You need to adapt the meta data parser (src/parser/metadata/) to read entries for the two new data properties from the meta data files; these new entries should be optional for backward compatibility (if not present just set to unknown internally).
  3. We need to insert the information on the two new data characteristics into the IR.
    • Information on the data characteristics of input data sets from files is inserted into the IR through the type/property inference of ReadOp.
    • Property inference is implemented by one unified inference pass (src/compiler/inference/InferencePass.cpp) that handles all data characteristics to infer and takes special care of control flow operations etc.
    • Each data characteristic has a dedicated MLIR interface that is invoked by the inference pass; for some data characteristics, there are also MLIR traits for common cases.
    • Add an interface for each of the two new data properties (take inspiration from src/ir/daphneir/DaphneInferFrameLabelsOpInterface.td/h)
    • Implement that inferface at least for DaphneIR's ReadOp (take inspiration from src/ir/daphneir/DaphneInferFrameLabelsOpInterface.cpp, daphne::ReadOp::inferFrameLabels(), ReadOp in src/ir/daphneir/DaphneOps.td).
    • Take the two new characteristics into account in the inference pass (src/compiler/inference/InferencePass.cpp).
    • Potentially implement property inference for more DaphneIR operations to propagate the info from the base data to the points where they are needed (could be helpful before your join ordering pass or during it).
    • Add the new interfaces and their implementations in certain places (e.g., src/ir/daphneir/CMakeLists.txt, src/ir/daphneir/Daphne.h, src/ir/daphneir/DaphneOps.td).

@DerSchmidt
Copy link
Collaborator Author

DerSchmidt commented Jan 15, 2025

Feel free to adapt the custom methods of the Frame type in its extraClassDeclaration; isSpecializationOf() should definitely be adapted.

What does this really do? And what is the intended behavior?

This is my understanding of isSpecializationOf(): It makes a basic equality check for dimensions, types and labels of two Frames.
And as soon as we find a mismatch OR get an unknown in any one of the Frames, we say they are not specializations of one another.

So this means:

  • Matching dimensions, but both Frames have UnknownTypes, they are not specializations because the UnknownType can be an Int for one and a double for the other. I would agree that UnknownTypes are critical for this decision.
  • Matching dimensions, Matching Frames, but both don't have labels set (nullptr), meaning that they are again not specializations of one another. (Is this what we really want?)

And now a step forward on this issue, if we continue like this:

  • Matching dimensions, matching types, matching labels, BUT we have for both of them no statistics (nullptr). Is this really no specialization if both don't have statistics? Is this what we want and need?

So if you can make sense of this Specialization feature for me, it would be really appreciated!

@pdamme
Copy link
Collaborator

pdamme commented Jan 15, 2025

Hi @DerSchmidt! First, note that the parameters (columnTypes, numRows, numCols, labels) are part of the FrameType. Thus, a frame with 2 rows and a frame with 3 rows are technically different MLIR types, so there are many frame types. By DAPHNE's definition, a frame type a is a specialization of a frame type b if a has more information (less unknowns) than b, more specifically, if a provides known values for some of the unknown parameters in b. At the same time, a and b must not disagree (have different values) in any parameter that has a known value in both. Note that isSpecializationOf() is not commutative.

The documentation of isSpecializationOf() suggests that a must provide a known value for at least one unknown parameter of b ("strictly"). However, the implementation doesn't enforce this at least one-semantics, as I realize now. (We should think about potential implications of that soon, but I don't think it's critical for you now.)

In particular, when a and b match in all parameters but x and they both have an unknown x, then it holds that a.isSpecializationOf(b) == true and b.isSpecializationOf(a) == true. (In response to your examples.)

Internally, isSpecializationOf() considers all parameters of the FrameType. If the value for the other frame type is not unknown, it may contradict with the value of this frame type. So for a parameter x, we check if other.getX() is not unknown && getX() != other.getX(). If so, this frame type is not a specialization of the other frame type, so we return false. (If the value for the other frame type was unknown, then the value of this frame type would not matter, so no check is required.)

All you need to do is to treat the two new data properties (max value frequency per column and number of distict values per column; both essentially arrays) in a similar way as the other parameters. The most similar one is probably columnTypes: Iterate over the corresponding elements of the two arrays (from this and other) and for each pair: if the value for other is not unknown and differs from the value for this, then return false.

Hope that helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants