Implementation of EXISTS and NOT EXISTS #1703

Merged
merged 40 commits into from
Feb 15, 2025
Commits
b455626
Add the required workflow files...
joka921 Oct 8, 2024
3205fa2
A dummy file for the workflow run thingy.
joka921 Oct 8, 2024
215a927
Merge pull request #1 from joka921/qlever-conformance-tests
joka921 Oct 8, 2024
b7bedba
Another test...
joka921 Oct 8, 2024
dfca875
Merge pull request #2 from joka921/qlever-conformance-tests
joka921 Oct 8, 2024
8bd6299
More sparql conformance stuff...
joka921 Oct 8, 2024
31cb0d5
Merge pull request #3 from joka921/qlever-conformance-tests
joka921 Oct 8, 2024
5cb6a0e
Backup in the middle.
joka921 Jan 7, 2025
e356ee1
Add some parsing and add some thoughts.
joka921 Jan 7, 2025
fc20174
Also implement NOT EXISTS
joka921 Jan 7, 2025
dde296b
Fix a small warning, to feed this to the tool.
joka921 Jan 7, 2025
0d1c788
Some cleanups and fixes.
joka921 Jan 8, 2025
7ff49c9
Fix compilation.
joka921 Jan 8, 2025
7ec8947
Fix the many many segfaults.
joka921 Jan 8, 2025
c03f3e5
Fix another bug.
joka921 Jan 8, 2025
2da52ab
Fix another bug.
joka921 Jan 8, 2025
cbbc771
Fix another bug.
joka921 Jan 8, 2025
91e5802
blub.
joka921 Jan 8, 2025
c3a9a7d
Added some more tests.
joka921 Jan 8, 2025
0adbfa6
Add some tests at least for the parser and query planner.
joka921 Jan 8, 2025
babd294
Some more tests.
joka921 Jan 9, 2025
6766af3
Added some comments.
joka921 Jan 9, 2025
f2524a8
Merge branch 'master' into exists
joka921 Jan 9, 2025
3a574ea
This is commented and very clean.
joka921 Jan 9, 2025
5809be2
better tests.
joka921 Jan 9, 2025
5294357
Made a pass over `ExistsJoin.h` and `ExistsJoin.cpp`
Jan 10, 2025
0917636
Merge remote-tracking branch 'origin/master' into exists
Feb 4, 2025
2bc5bdf
Changes by Hannah improving documentation and comments
Feb 5, 2025
c2abadd
Fix typo
Feb 5, 2025
c3f0e88
Merge remote-tracking branch 'origin/master' into exists
joka921 Feb 14, 2025
a6842f2
Merged and everything.
joka921 Feb 14, 2025
893f64f
Merge remote-tracking branch 'origin/exists' into exists
joka921 Feb 14, 2025
ee495f4
The test is currently not compiling, as we still have to apply severa…
joka921 Feb 14, 2025
ca30b5a
Also test different datasets.
joka921 Feb 14, 2025
d48d76b
Fix the name of the conformance test-suite
joka921 Feb 14, 2025
87ba7ca
Merge remote-tracking branch 'origin/master' into exists
Feb 14, 2025
cfe3c17
Minor improvements from Hannah's review
Feb 14, 2025
0cd71ac
Merge branch 'master' into exists
hannahbast Feb 14, 2025
608d0ea
Re-insert the `baseIri_` declaration in `SparqlQleverVisitor.h`
Feb 14, 2025
092e0d9
Revert changes in .github/workflows
Feb 15, 2025
10 changes: 10 additions & 0 deletions src/engine/Bind.cpp
@@ -5,12 +5,22 @@
#include "Bind.h"

#include "engine/CallFixedSize.h"
#include "engine/ExistsJoin.h"
#include "engine/QueryExecutionTree.h"
#include "engine/sparqlExpressions/SparqlExpression.h"
#include "engine/sparqlExpressions/SparqlExpressionGenerators.h"
#include "util/ChunkedForLoop.h"
#include "util/Exception.h"

// _____________________________________________________________________________
Bind::Bind(QueryExecutionContext* qec,
std::shared_ptr<QueryExecutionTree> subtree, parsedQuery::Bind b)
: Operation(qec), _subtree(std::move(subtree)), _bind(std::move(b)) {
_subtree = ExistsJoin::addExistsJoinsToSubtree(
_bind._expression, std::move(_subtree), getExecutionContext(),
cancellationHandle_);
}

// BIND adds exactly one new column
size_t Bind::getResultWidth() const { return _subtree->getResultWidth() + 1; }

6 changes: 3 additions & 3 deletions src/engine/Bind.h
@@ -8,14 +8,14 @@
#include "engine/sparqlExpressions/SparqlExpressionPimpl.h"
#include "parser/ParsedQuery.h"

/// BIND operation, currently only supports a very limited subset of expressions
// BIND operation.
class Bind : public Operation {
public:
static constexpr size_t CHUNK_SIZE = 10'000;

// ____________________________________________________________________________
Bind(QueryExecutionContext* qec, std::shared_ptr<QueryExecutionTree> subtree,
parsedQuery::Bind b)
: Operation(qec), _subtree(std::move(subtree)), _bind(std::move(b)) {}
parsedQuery::Bind b);

private:
std::shared_ptr<QueryExecutionTree> _subtree;
2 changes: 1 addition & 1 deletion src/engine/CMakeLists.txt
@@ -15,5 +15,5 @@ add_library(engine
TextLimit.cpp LazyGroupBy.cpp GroupByHashMapOptimization.cpp SpatialJoin.cpp
CountConnectedSubgraphs.cpp SpatialJoinAlgorithms.cpp PathSearch.cpp ExecuteUpdate.cpp
Describe.cpp GraphStoreProtocol.cpp
QueryExecutionContext.cpp)
QueryExecutionContext.cpp ExistsJoin.cpp)
qlever_target_link_libraries(engine util index parser sparqlExpressions http SortPerformanceEstimator Boost::iostreams s2)
207 changes: 207 additions & 0 deletions src/engine/ExistsJoin.cpp
@@ -0,0 +1,207 @@
// Copyright 2025, University of Freiburg
// Chair of Algorithms and Data Structures
// Author: Johannes Kalmbach <[email protected]>

#include "engine/ExistsJoin.h"

#include "CallFixedSize.h"
#include "engine/QueryPlanner.h"
#include "engine/sparqlExpressions/ExistsExpression.h"
#include "engine/sparqlExpressions/SparqlExpression.h"
#include "util/JoinAlgorithms/JoinAlgorithms.h"

// _____________________________________________________________________________
ExistsJoin::ExistsJoin(QueryExecutionContext* qec,
std::shared_ptr<QueryExecutionTree> left,
std::shared_ptr<QueryExecutionTree> right,
Variable existsVariable)
: Operation{qec},
left_{std::move(left)},
right_{std::move(right)},
joinColumns_{QueryExecutionTree::getJoinColumns(*left_, *right_)},
existsVariable_{std::move(existsVariable)} {
// Make sure that the left and right input are sorted on the join columns.
std::tie(left_, right_) = QueryExecutionTree::createSortedTrees(
std::move(left_), std::move(right_), joinColumns_);
}

// _____________________________________________________________________________
string ExistsJoin::getCacheKeyImpl() const {
return absl::StrCat("EXISTS JOIN left: ", left_->getCacheKey(),
" right: ", right_->getCacheKey());
}

// _____________________________________________________________________________
string ExistsJoin::getDescriptor() const { return "Exists Join"; }

// ____________________________________________________________________________
VariableToColumnMap ExistsJoin::computeVariableToColumnMap() const {
auto res = left_->getVariableColumns();
AD_CONTRACT_CHECK(
!res.contains(existsVariable_),
"The target variable of an EXISTS join must be a new variable");
res[existsVariable_] = makeAlwaysDefinedColumn(getResultWidth() - 1);
return res;
}

// ____________________________________________________________________________
size_t ExistsJoin::getResultWidth() const {
// We add one column to the input.
return left_->getResultWidth() + 1;
}

// ____________________________________________________________________________
vector<ColumnIndex> ExistsJoin::resultSortedOn() const {
// We add one column to `left_`, but do not change the order of the rows.
return left_->resultSortedOn();
}

// ____________________________________________________________________________
float ExistsJoin::getMultiplicity(size_t col) {
// The multiplicities of all columns except the last one are the same as in
// `left_`.
if (col < getResultWidth() - 1) {
return left_->getMultiplicity(col);
}
// For the added (Boolean) column we take a dummy value, assuming that it
// will not be used for subsequent joins or other operations that make use of
// the multiplicities.
return 1;
}

// ____________________________________________________________________________
uint64_t ExistsJoin::getSizeEstimateBeforeLimit() {
return left_->getSizeEstimate();
}

// ____________________________________________________________________________
size_t ExistsJoin::getCostEstimate() {
// The implementation is a linear zipper join.
return left_->getCostEstimate() + right_->getCostEstimate() +
left_->getSizeEstimate() + right_->getSizeEstimate();
}

// ____________________________________________________________________________
ProtoResult ExistsJoin::computeResult([[maybe_unused]] bool requestLaziness) {
auto leftRes = left_->getResult();
auto rightRes = right_->getResult();
const auto& left = leftRes->idTable();
const auto& right = rightRes->idTable();

// We reuse the generic `zipperJoinWithUndef` function, which has two
// callbacks: one for each matching pair of rows from `left` and `right`, and
// one for rows in the left input that have no matching counterpart in the
// right input. The first callback can be a noop, and the second callback
// gives us exactly those rows, where the value in the to-be-added result
// column should be `false`.

// Extract the join columns from both inputs to make the following code
// easier.
ad_utility::JoinColumnMapping joinColumnData{joinColumns_, left.numColumns(),
right.numColumns()};
IdTableView<0> joinColumnsLeft =
left.asColumnSubsetView(joinColumnData.jcsLeft());
IdTableView<0> joinColumnsRight =
right.asColumnSubsetView(joinColumnData.jcsRight());
checkCancellation();

// Compute `isCheap`, which is true iff there are no UNDEF values in the join
// columns (in which case we can use a simpler and cheaper join algorithm).
//
// TODO<joka921> This is the most common case. There are many other cases
// where the generic `zipperJoinWithUndef` can be optimized. This is work for
// a future PR.
size_t numJoinColumns = joinColumnsLeft.numColumns();
AD_CORRECTNESS_CHECK(numJoinColumns == joinColumnsRight.numColumns());
bool isCheap = ql::ranges::none_of(
ad_utility::integerRange(numJoinColumns), [&](const auto& col) {
return (ql::ranges::any_of(joinColumnsRight.getColumn(col),
&Id::isUndefined)) ||
(ql::ranges::any_of(joinColumnsLeft.getColumn(col),
&Id::isUndefined));
});

// Nothing to do for the actual matches.
auto noopRowAdder = ad_utility::noop;

// Store the indices of rows for which the value of the `EXISTS` (in the added
// Boolean column) should be `false`.
std::vector<size_t, ad_utility::AllocatorWithLimit<size_t>> notExistsIndices{
allocator()};
// Helper lambda for computing the exists join with `callFixedSize`, which
// makes the number of join columns a template parameter.
auto runForNumJoinCols = [&notExistsIndices, isCheap, &noopRowAdder,
&colsLeftDynamic = joinColumnsLeft,
&colsRightDynamic = joinColumnsRight,
this]<int NumJoinCols>() {
// The `actionForNotExisting` callback gets iterators as input, but should
// output indices, hence the pointer arithmetic.
auto joinColumnsLeft = colsLeftDynamic.asStaticView<NumJoinCols>();
auto joinColumnsRight = colsRightDynamic.asStaticView<NumJoinCols>();
auto actionForNotExisting =
[&notExistsIndices, begin = joinColumnsLeft.begin()](
const auto& itLeft) { notExistsIndices.push_back(itLeft - begin); };

// Run `zipperJoinWithUndef` with the described callbacks and the mentioned
// optimization in case we know that there are no UNDEF values in the join
// columns.
auto checkCancellationLambda = [this] { checkCancellation(); };
auto runZipperJoin = [&](auto findUndef) {
[[maybe_unused]] auto numOutOfOrder = ad_utility::zipperJoinWithUndef(
joinColumnsLeft, joinColumnsRight,
ql::ranges::lexicographical_compare, noopRowAdder, findUndef,
findUndef, actionForNotExisting, checkCancellationLambda);
};
if (isCheap) {
runZipperJoin(ad_utility::noop);
} else {
runZipperJoin(ad_utility::findSmallerUndefRanges);
}
};
ad_utility::callFixedSize(numJoinColumns, runForNumJoinCols);

// Add the result column from the computed `notExistsIndices` (which tell us
// where the value should be `false`).
IdTable result = left.clone();
result.addEmptyColumn();
decltype(auto) existsCol = result.getColumn(getResultWidth() - 1);
ql::ranges::fill(existsCol, Id::makeFromBool(true));
for (size_t notExistsIndex : notExistsIndices) {
existsCol[notExistsIndex] = Id::makeFromBool(false);
}

// The added column only contains Boolean values, and adds no new words to the
// local vocabulary, so we can simply copy the local vocab from `leftRes`.
return {std::move(result), resultSortedOn(), leftRes->getCopyOfLocalVocab()};
}

// _____________________________________________________________________________
std::shared_ptr<QueryExecutionTree> ExistsJoin::addExistsJoinsToSubtree(
const sparqlExpression::SparqlExpressionPimpl& expression,
std::shared_ptr<QueryExecutionTree> subtree, QueryExecutionContext* qec,
const ad_utility::SharedCancellationHandle& cancellationHandle) {
// Extract all `EXISTS` functions from the given `expression`.
std::vector<const sparqlExpression::SparqlExpression*> existsExpressions;
expression.getPimpl()->getExistsExpressions(existsExpressions);

// For each `EXISTS` function, add the corresponding `ExistsJoin`.
for (auto* expr : existsExpressions) {
const auto& exists =
dynamic_cast<const sparqlExpression::ExistsExpression&>(*expr);
// If we have already considered this `EXISTS` (which we can detect by its
// variable), skip it. This can happen because some `FILTER`s (which may
// contain `EXISTS` functions) are applied multiple times (for example,
// when there are OPTIONAL joins in the query).
if (subtree->isVariableCovered(exists.variable())) {
continue;
}

QueryPlanner qp{qec, cancellationHandle};
auto pq = exists.argument();
auto tree =
std::make_shared<QueryExecutionTree>(qp.createExecutionTree(pq));
subtree = ad_utility::makeExecutionTree<ExistsJoin>(
qec, std::move(subtree), std::move(tree), exists.variable());
}
return subtree;
}
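
To make the core of `ExistsJoin::computeResult` easier to follow, here is a minimal standalone sketch of the same idea under strong simplifying assumptions: each row is a single `int` join key, both inputs are already sorted, and there are no UNDEF values (the `isCheap` case). The name `existsColumn` and the use of plain `std::vector` are illustrative only and not part of QLever; the real code delegates the merge to `zipperJoinWithUndef`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplification of an EXISTS join: `left` and `right` hold one
// join key per row, are sorted ascending, and contain no UNDEF values.
// Returns one Boolean per left row: `true` iff the key also occurs in `right`.
std::vector<bool> existsColumn(const std::vector<int>& left,
                               const std::vector<int>& right) {
  assert(std::is_sorted(left.begin(), left.end()));
  assert(std::is_sorted(right.begin(), right.end()));

  // Collect the indices of left rows without a partner, analogous to
  // `notExistsIndices` in the PR.
  std::vector<std::size_t> notExistsIndices;
  std::size_t j = 0;
  for (std::size_t i = 0; i < left.size(); ++i) {
    // Advance the right cursor past all keys smaller than the current left
    // key. The cursor never moves backwards, so the merge is linear.
    while (j < right.size() && right[j] < left[i]) {
      ++j;
    }
    if (j == right.size() || right[j] != left[i]) {
      notExistsIndices.push_back(i);
    }
  }

  // Start with all `true`, then flip the rows that had no match.
  std::vector<bool> result(left.size(), true);
  for (std::size_t index : notExistsIndices) {
    result[index] = false;
  }
  return result;
}
```

For example, with left keys `{1, 2, 2, 5}` and right keys `{2, 3, 5}`, only the first left row has no partner, so the resulting column is `false, true, true, true`. Matching rows need no action at all, which is why the PR can pass a noop as the callback for matches.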
82 changes: 82 additions & 0 deletions src/engine/ExistsJoin.h
@@ -0,0 +1,82 @@
// Copyright 2025, University of Freiburg
// Chair of Algorithms and Data Structures
// Author: Johannes Kalmbach <[email protected]>

#pragma once

#include "engine/Operation.h"
#include "engine/QueryExecutionTree.h"

// The implementation of an "EXISTS join", which we use to realize the semantics
// of the SPARQL `EXISTS` function. The join takes two subtrees as input, and
// returns the left subtree with an additional boolean column that is `true` iff
// at least one matching row is contained in the right subtree.
class ExistsJoin : public Operation {
private:
// The left and right child.
std::shared_ptr<QueryExecutionTree> left_;
std::shared_ptr<QueryExecutionTree> right_;
std::vector<std::array<ColumnIndex, 2>> joinColumns_;

// The variable of the added (Boolean) result column.
Variable existsVariable_;

public:
// Constructor. The `existsVariable` (the variable for the added column) must
// not yet be bound in `left`.
ExistsJoin(QueryExecutionContext* qec,
std::shared_ptr<QueryExecutionTree> left,
std::shared_ptr<QueryExecutionTree> right,
Variable existsVariable);

// Extract all `ExistsExpression`s from the given `expression`. For each
// `ExistsExpression`, add an `ExistsJoin`. The left side of the first
// `ExistsJoin` is the input `subtree`. The left side of subsequent
// `ExistsJoin`s is the previous `ExistsJoin`. The right side of each
// `ExistsJoin` is the argument of the respective `ExistsExpression`. When
// there are no `ExistsExpression`s, return the input `subtree` unchanged.
//
// The returned subtree will contain one additional column for each
// `ExistsExpression`, which contains the result of the respective
// `ExistsJoin`. The `ExistsExpression` just reads the values of this column.
// The main work is done by the `ExistsJoin`.
//
// This function should be called in the constructor of each `Operation`,
// where an `EXISTS` expression can occur. For example, in the constructor of
// `BIND` and `FILTER`.
static std::shared_ptr<QueryExecutionTree> addExistsJoinsToSubtree(
const sparqlExpression::SparqlExpressionPimpl& expression,
std::shared_ptr<QueryExecutionTree> subtree, QueryExecutionContext* qec,
const ad_utility::SharedCancellationHandle& cancellationHandle);

// All following functions are inherited from `Operation`, see there for
// comments.
protected:
string getCacheKeyImpl() const override;

public:
string getDescriptor() const override;

size_t getResultWidth() const override;

vector<ColumnIndex> resultSortedOn() const override;

bool knownEmptyResult() override { return left_->knownEmptyResult(); }

float getMultiplicity(size_t col) override;

private:
uint64_t getSizeEstimateBeforeLimit() override;

public:
size_t getCostEstimate() override;

vector<QueryExecutionTree*> getChildren() override {
return {left_.get(), right_.get()};
}

private:
ProtoResult computeResult([[maybe_unused]] bool requestLaziness) override;

VariableToColumnMap computeVariableToColumnMap() const override;
};
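
The chaining described in the comment on `addExistsJoinsToSubtree` might be pictured with the following toy model. `Node`, `addExistsJoins`, and the string-based variables are hypothetical stand-ins, not QLever types; the sketch only shows how each new join wraps the previous tree and how a variable that is already covered is skipped.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a query execution tree node.
struct Node {
  std::string description;
  std::set<std::string> variables;  // Variables covered by this subtree.
  std::shared_ptr<Node> left;       // The wrapped input, if any.
};

// For each EXISTS variable, wrap the current tree in a new "ExistsJoin" node
// that covers one additional variable. Variables that are already covered are
// skipped, so applying the function twice adds no duplicate joins.
std::shared_ptr<Node> addExistsJoins(
    std::shared_ptr<Node> subtree,
    const std::vector<std::string>& existsVariables) {
  assert(subtree != nullptr);
  for (const auto& variable : existsVariables) {
    if (subtree->variables.count(variable) > 0) {
      continue;
    }
    auto join = std::make_shared<Node>();
    join->description = "ExistsJoin(" + variable + ")";
    join->variables = subtree->variables;
    join->variables.insert(variable);
    // The previous tree becomes the left input of the new join.
    join->left = std::move(subtree);
    subtree = std::move(join);
  }
  return subtree;
}
```

With an empty list of EXISTS variables the input tree is returned unchanged, which corresponds to the "no `ExistsExpression`s" case in the comment. The skip on covered variables is what keeps the result stable when the same `FILTER` is applied more than once.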
4 changes: 4 additions & 0 deletions src/engine/Filter.cpp
@@ -10,6 +10,7 @@

#include "backports/algorithm.h"
#include "engine/CallFixedSize.h"
#include "engine/ExistsJoin.h"
#include "engine/QueryExecutionTree.h"
#include "engine/sparqlExpressions/SparqlExpression.h"
#include "engine/sparqlExpressions/SparqlExpressionGenerators.h"
@@ -28,6 +29,9 @@ Filter::Filter(QueryExecutionContext* qec,
: Operation(qec),
_subtree(std::move(subtree)),
_expression{std::move(expression)} {
_subtree = ExistsJoin::addExistsJoinsToSubtree(
_expression, std::move(_subtree), getExecutionContext(),
cancellationHandle_);
setPrefilterExpressionForChildren();
}

19 changes: 13 additions & 6 deletions src/engine/GroupBy.cpp
@@ -1,14 +1,14 @@
// Copyright 2018, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author:
// 2018 Florian Kramer ([email protected])
// 2020- Johannes Kalmbach ([email protected])
// Copyright 2018 - 2025, University of Freiburg
// Chair of Algorithms and Data Structures
// Authors: Florian Kramer [2018 - 2020]
// Johannes Kalmbach <[email protected]>

#include "engine/GroupBy.h"

#include <absl/strings/str_join.h>

#include "engine/CallFixedSize.h"
#include "engine/ExistsJoin.h"
#include "engine/IndexScan.h"
#include "engine/Join.h"
#include "engine/LazyGroupBy.h"
@@ -52,6 +52,14 @@ GroupBy::GroupBy(QueryExecutionContext* qec, vector<Variable> groupByVariables,
ql::ranges::sort(_groupByVariables, std::less<>{}, &Variable::name);

auto sortColumns = computeSortColumns(subtree.get());

// Aliases are like `BIND`s, which may contain `EXISTS` expressions.
for (const auto& alias : _aliases) {
subtree = ExistsJoin::addExistsJoinsToSubtree(
alias._expression, std::move(subtree), getExecutionContext(),
cancellationHandle_);
}

_subtree =
QueryExecutionTree::createSortedTree(std::move(subtree), sortColumns);
}
@@ -1526,7 +1534,6 @@ Result GroupBy::computeGroupByForHashMapOptimization(
// NOTE: If the input blocks have very similar or even identical non-empty
// local vocabs, no deduplication is performed.
localVocab.mergeWith(std::span{&inputLocalVocab, 1});

// Setup the `EvaluationContext` for this input block.
sparqlExpression::EvaluationContext evaluationContext(
*getExecutionContext(), _subtree->getVariableColumns(), inputTable,