-
Notifications
You must be signed in to change notification settings - Fork 656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-2924: Lateral - fixed injection for tables and enhanced QueryExec API for easier testing. #2925
base: main
Are you sure you want to change the base?
Conversation
cbf563c
to
1b9aaec
Compare
92b525b
to
96f9e58
Compare
@@ -98,7 +98,7 @@ public class Query extends Prologue implements Cloneable, Printable | |||
public static final int ORDER_UNKNOW = -3 ; | |||
|
|||
// VALUES trailing clause | |||
protected TableData valuesDataBlock = null ; | |||
protected TableN valuesDataBlock = null ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the valuesDataBlock be mutable or not? One test case written by me some years ago assumed it to be mutable. The test case clones a query and checks that modification of the clone's valuesDataBlock does not affect the original one.
But the implementation of TableData suggests that it is intended to be immutable anyway as the addBinding method is overridden to reject updates.
As I see it: TableN -> mutable, TableData -> immutable - much like java's List and guava's ImmutableList.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be "immutable once built". That is, ideally, the Table interface should not have addBinding
. But that's ideal.
Here, the query VALUES block is syntax and fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so I'll change back to TableData and update the test case to expect failure when trying to modify the rows of a queries valuesDataBlock directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted my changes to Query and updated the test case that attempted to modify the query's values block.
96f9e58
to
645968d
Compare
public class TableData extends TableN { | ||
public TableData(List<Var> variables, List<Binding> rows) { | ||
super(variables, rows) ; | ||
super(Collections.unmodifiableList(variables), Collections.unmodifiableList(rows)) ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TableData appears to be intended as immutable but this was so far not enforced - one could call getRows() and modify them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Collections.unmodifiableList
only puts a wrapper around the base list. The base variables
and rows
can be modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructor already existed. The wrappers prevent mutation via TableData.getRows().add()
which wasn't the case before. Its not perfect, but from your comment
It should be "immutable once built"
its a step closer towards preventing incorrect use of TableData
.
645968d
to
7262641
Compare
builder.reset(); | ||
builder.addAll(row); | ||
builder.addAll(binding); | ||
table2.addBinding(builder.build()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addBinding checks whether the table's variables need to be updated.
This check is not needed here, so in the revised code I just collect the bindings in a list instead and create the table afterwards.
4867ac0
to
b336897
Compare
First part of a review - trying to understand the machinery changes. There seem to be a lot of them and I'm trying to understand what are necessary. |
@@ -98,7 +98,7 @@ public class Query extends Prologue implements Cloneable, Printable | |||
public static final int ORDER_UNKNOW = -3 ; | |||
|
|||
// VALUES trailing clause | |||
protected TableData valuesDataBlock = null ; | |||
protected TableN valuesDataBlock = null ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be "immutable once built". That is, ideally, the Table interface should not have addBinding
. But that's ideal.
Here, the query VALUES block is syntax and fixed.
public class TableData extends TableN { | ||
public TableData(List<Var> variables, List<Binding> rows) { | ||
super(variables, rows) ; | ||
super(Collections.unmodifiableList(variables), Collections.unmodifiableList(rows)) ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Collections.unmodifiableList
only puts a wrapper around the base list. The base variables
and rows
can be modified.
@@ -103,4 +103,8 @@ public default ResultSet materialise() { | |||
} | |||
|
|||
public void close(); | |||
|
|||
default RowSet toRowSet() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer asRowSet
-- as*
when it is the same thing, different presentation, and to*
when it is a created conversion.
c.f. Dataset::asDatasetGraph
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to asRowSet
.
return new TableN(queryIterator) ; | ||
} | ||
|
||
public static Table create(Var var, Node value) | ||
{ return new Table1(var, value) ; } | ||
|
||
/** Creates a mutable table from the detached bindings of the row set. */ | ||
public static Table create(RowSet rs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what going on. Could you explain this please?
This seems to duplicate the intent of RowSet.materialize
and RowSet.rewindable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference (to my understanding) is that a Table acts as a Collection and a RowSet as an Iterator.
A Table can thus be seen as a factory for RowSets via Table.toRowSet()
- each obtained RowSet is an independent iterator over the Table.
My qualms with RowSet.materialize/rewindable is that these methods operate on the iterator level; it seemed clean to me to add Table as a collection of bindings.
jena-arq/src/main/java/org/apache/jena/sparql/engine/binding/BindingBase.java
Outdated
Show resolved
Hide resolved
return this; | ||
} | ||
|
||
protected abstract Binding detachWithNewParent(Binding newParent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be nicer in BindingFactory
... if we were using Java21 and could use switch patterns. Oh well, another time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also thinking that BindingFactory could take care of making bindings independent. But then again, why should BindingFactory have special logic for BindingTDB instances? In principle, BindingTDB.detach
could be implemented as just filling the already present (but unused) cache - so no copy would be needed then.
@@ -54,6 +54,12 @@ public static QueryExecBuilder graph(Graph graph) { | |||
return QueryExecDatasetBuilder.create().graph(graph); | |||
} | |||
|
|||
/** Create a {@link QueryExecBuilder} initialized with an empty dataset. */ | |||
public static QueryExecBuilder emptyDataset() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only used from TestQueryExecution
. Please can the code go there so it is not in the runtime API.?
It can useDatasetGraphFactory.empty()
.
return QueryExec.dataset(DatasetGraphFactory.empty()`);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was just a suggestion for a shortcut - I'll remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shorthand emptyDataset()
removed.
* Creates and returns an independent in-memory table by materializing the underlying row set. | ||
* Subsequently, {@link Table#toRowSet()} can be used to obtain a fresh row set view over the table. | ||
*/ | ||
public default Table table() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the question about "RowSet.matrialize/rewindable" above.
} finally { | ||
ds.close() ; | ||
} | ||
assertEquals(expected, actual) ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See ResultSetCompare.exactEquals
(which should probably be in RowSetOps
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned, I view Table as a Collection and RowSet as an Iterator.
If I am not mistaken, then table's equals()
method naturally and succinctly performs an exact comparison (using term equality). For working with iterators (RowSet/ResultSet) one should of course use ResultSetCompare / RowSetOps methods.
In this case, one could also write a bit more verbose: assertTrue(ResultSetCompare.exactEquals(expected.toRowSet(), actual.toRowSet());
89406a0
to
f288cb7
Compare
f288cb7
to
0aadb7e
Compare
…ec API for easier testing.
0aadb7e
to
50b3e24
Compare
// Index variables in a set if there are more than a few of them. | ||
Collection<Var> tableVarsIndex = tableVars.size() > 4 ? new HashSet<>(tableVars) : tableVars; | ||
binding.vars().forEachRemaining(v -> { | ||
if (tableVarsIndex.contains(v)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table vars can't contain a binding var (or am I missing something?). That case is caught by the text
https://github.com/w3c/sparql-dev/blob/main/SEP/SEP-0007/sep-0007.md
"Disallow syntactic forms that set variables that may already be present in the current row."
It's tested in SyntaxVarScope.checkLATEAL
.
(I removed the if
, left the else
and the two tests pass.)
If so, it simplifies to vars.add(v);
And hence "vars = tableVars" (copy not needed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(See comment on the line containing Algebra.compatible
)
builder.addAll(row); | ||
builder.addAll(binding); | ||
table2.addBinding(builder.build()); | ||
if (Algebra.compatible(row, binding, commonVars.iterator())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a variable in common from substitution it must be the same term.
So this is a contains relationship. (Given the comments above there may be a simpler way.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is potentially another can of worms I was about to open w.r.t. sep7:
Perhaps its worthy of opening an issue and discussion on sparql-dev. The reason I added the compatibility check was because I had the following query in mind when writing the tests, which for reasons I do not yet understand is currently prohibited.
Removing the restrictions in SyntaxVarScope.checkLATERAL
on Element_Data
(and Element_Bind
) and adding the check for compatible bindings (effectively filtering bindings of tables down to those compatible with the current row; as done in the PR) naturally makes the query work.
SELECT * {
VALUES ?department {
<urn:dept1>
<urn:dept2>
}
LATERAL {
SELECT * {
VALUES (?department ?employee) {
( <urn:dept1> <urn:person1> )
( <urn:dept1> <urn:person2> )
( <urn:dept2> <urn:person1> )
}
} ORDER BY ?employee LIMIT 1
}
}
My expectation:
department | employee |
---|---|
urn:dept1 | urn:person1 |
urn:dept2 | urn:person1 |
Actual result:
Scoping error due to check in SyntaxVarScope.checkLATERAL
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There must be a problem with SyntaxVarScope.checkLATERAL
because the SELECT *
is effectively SELECT ?employee
which works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT *
is effectivelySELECT ?employee
which works.
Hm, depending on the viewpoint it is the effective query after substitution, but before that, it's SELECT ?department ?employee
as to make ?department
in-scope such that the substitution could affect the table and discard its incompatible rows.
I think the scope error is in-line with the SEP7 statement:
"Disallow syntactic forms that set variables that may already be present in the current row."
But I think that this is too restrictive, because it prevents several useful cases for inlining data using VALUES blocks. IMO if substitution leads to incompatible bindings then they should just be discarded as usual - no? Perhaps you have corner cases in mind where under this logic the substitution could become ambiguous and that this might be the reason for the restrictions.
Removing the restrictions on Element_Bind
and Element_Data
in checkLATERAL
in order to make the above "department-employee" query work causes the following test cases to fail because they no longer raise an varscope/syntax error:
syntax-lateral-bad-{04, 05, 06, 08}.arq
For example, syntax-lateral-bad-04
:
SELECT * {
?s ?p ?o
LATERAL { BIND( 123 AS ?o) }
}
This is certainly not an efficient way to write a query, but I don't understand why the query should be syntactically illegal in the first place. To me it looks like it would only retain bindings where ?o = 123. So it would be the task of an optimizer to rewrite the Lateral/Bind
to a plain FILTER(?o = 123)
.
When disabling the SyntaxVarScope check for ElementBind, each ?o gets injected into the unit table beneath (extend ((?o 123)) (table unit))
such as (extend ((?o 123)) (table (row (?o 456))))
and the evaluation of OpExtend discards the incompatible binding - which seems like the natural way it should be to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, depending on the viewpoint it is the effective query after substitution,
My bad - I was thinking about GROUP BY
.
For ORDER BY
, the query is illegal by the assignment restriction.
To me it looks like it would only retain bindings where ?o = 123.
That would turn BIND
into a filter. (ARQ's extension LET
does that.)
We have to have a restriction of some kind. The SEP-0007 design makes it a compile-time condition (test once per query).
In general, some system have optimizations that on the static analysis of the query. Making the RHS of a LATERAL have runtime variance would interfere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I have to get Jena 5.3.0 out without this.
This issue is important and something that is, I hope, going to get more importance in the SPARQL community.
WG sparql-query PR 177 at least gets the essential machinery into the resp with the WG contribution licnesing. A small step, but a step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, some system have optimizations that on the static analysis of the query.
Well, the tradeoff to what extend existing implementations need to drive restrictions of new features in a standard.
For the sake of closing the issue, should I modify/simplify the logic in QueryIterLateral or should I leave the slightly more general approach as it is for now? Essentially, the var scope check should prevent non-empty sets of commonVars. No, commonVars is non-empty in the case of nested laterals - because the outer lateral already injects its binding into the unit tables - so subsequent inner laterals attempt to do the same injection. Perhaps the way it is is safe - the alternative is to remove the compatibility check on commonVars (and the commonVars set itself) - because the injections should be compatible in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I have to get Jena 5.3.0 out without this.
No worries, whenever it's ready.
GitHub issue resolved #2924
Pull request Description:
QueryIterLateral
QueryExec.emptyDataset().query("SELECT ...").table().toRowSet().toResultSet()
to generally safely obtain a detached materialized in-memory table / row set / result set without having to worry about resource leaks or try-with-resources blocks. In the revised tests, the tables are used directly when asserting the equality of expected and actual results. In an attempt to allow for cleanly detaching BindingTDB instances, I addedBinding.detach()
with appropriate implementations.Binding.detach
avoids copying of bindings that are already in-memory, and is thus more subtle thanBindingFactory.copy
.By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.
See the Apache Jena "Contributing" guide.