Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-2924: Lateral - fixed injection for tables and enhanced QueryExec API for easier testing. #2925

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Aklakan
Copy link
Contributor

@Aklakan Aklakan commented Jan 7, 2025

GitHub issue resolved #2924

Pull request Description:

  • Added a new test case with nested laterals.
  • Fixed nested laterals by merging compatible bindings in QueryIterLateral
  • While writing the tests, I added a few shorthands to the QueryExecBuilder API: Its now possible to write QueryExec.emptyDataset().query("SELECT ...").table().toRowSet().toResultSet() to generally safely obtain a detached materialized in-memory table / row set / result set without having to worry about resource leaks or try-with-resources blocks. In the revised tests, the tables are used directly when asserting the equality of expected and actual results. In an attempt to allow for cleanly detaching BindingTDB instances, I added Binding.detach() with appropriate implementations. Binding.detach avoids copying of bindings that are already in-memory, and is thus more subtle than BindingFactory.copy.

  • Tests are included.
  • Commits have been squashed to remove intermediate development commit messages.
  • Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

@Aklakan Aklakan force-pushed the 2025-01-07_lateral-table-fix branch 6 times, most recently from cbf563c to 1b9aaec Compare January 8, 2025 05:04
@Aklakan Aklakan changed the title GH-2924: Lateral - fixed injection for tables, disabled scope checks … GH-2924: Lateral - fixed injection for tables and enhanced QueryExec API for easier testing. Jan 8, 2025
@Aklakan Aklakan force-pushed the 2025-01-07_lateral-table-fix branch 3 times, most recently from 92b525b to 96f9e58 Compare January 9, 2025 11:21
@@ -98,7 +98,7 @@ public class Query extends Prologue implements Cloneable, Printable
public static final int ORDER_UNKNOW = -3 ;

// VALUES trailing clause
protected TableData valuesDataBlock = null ;
protected TableN valuesDataBlock = null ;
Copy link
Contributor Author

@Aklakan Aklakan Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the valuesDataBlock be mutable or not? One test case written by me some years ago assumed it to be mutable. The test case clones a query and checks that modification of the clone's valuesDataBlock does not affect the original one.

But the implementation of TableData suggests that it is intended to be immutable anyway as the addBinding method is overridden to reject updates.

As I see it: TableN -> mutable, TableData -> immutable - much like java's List and guava's ImmutableList.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be "immutable once built". That is, ideally, the Table interface should not have addBinding. But that's ideal.

Here, the query VALUES block is syntax and fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I'll change back to TableData and update the test case to expect failure when trying to modify the rows of a queries valuesDataBlock directly.

Copy link
Contributor Author

@Aklakan Aklakan Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted my changes to Query and updated the test case that attempted to modify the query's values block.

@Aklakan Aklakan force-pushed the 2025-01-07_lateral-table-fix branch from 96f9e58 to 645968d Compare January 9, 2025 11:24
public class TableData extends TableN {
public TableData(List<Var> variables, List<Binding> rows) {
super(variables, rows) ;
super(Collections.unmodifiableList(variables), Collections.unmodifiableList(rows)) ;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TableData appears to be intended as immutable but this was so far not enforced - one could call getRows() and modify them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collections.unmodifiableList only puts a wrapper around the base list. The base variables and rows can be modified.

Copy link
Contributor Author

@Aklakan Aklakan Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructor already existed. The wrappers prevent mutation via TableData.getRows().add() which wasn't the case before. Its not perfect, but from your comment

It should be "immutable once built"

its a step closer towards preventing incorrect use of TableData.

@Aklakan Aklakan force-pushed the 2025-01-07_lateral-table-fix branch from 645968d to 7262641 Compare January 9, 2025 11:44
builder.reset();
builder.addAll(row);
builder.addAll(binding);
table2.addBinding(builder.build());
Copy link
Contributor Author

@Aklakan Aklakan Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addBinding checks whether the table's variables need to be updated.
This check is not needed here, so in the revised code I just collect the bindings in a list instead and create the table afterwards.

@Aklakan Aklakan force-pushed the 2025-01-07_lateral-table-fix branch 5 times, most recently from 4867ac0 to b336897 Compare January 10, 2025 15:19
@afs afs self-assigned this Jan 11, 2025
@afs
Copy link
Member

afs commented Jan 14, 2025

First part of a review - trying to understand the machinery changes. There seem to be a lot of them and I'm trying to understand what are necessary.

@@ -98,7 +98,7 @@ public class Query extends Prologue implements Cloneable, Printable
public static final int ORDER_UNKNOW = -3 ;

// VALUES trailing clause
protected TableData valuesDataBlock = null ;
protected TableN valuesDataBlock = null ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be "immutable once built". That is, ideally, the Table interface should not have addBinding. But that's ideal.

Here, the query VALUES block is syntax and fixed.

public class TableData extends TableN {
public TableData(List<Var> variables, List<Binding> rows) {
super(variables, rows) ;
super(Collections.unmodifiableList(variables), Collections.unmodifiableList(rows)) ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collections.unmodifiableList only puts a wrapper around the base list. The base variables and rows can be modified.

@@ -103,4 +103,8 @@ public default ResultSet materialise() {
}

public void close();

default RowSet toRowSet() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer asRowSet -- as* when it is the same thing, different presentation, and to* when it is a created conversion.

c.f. Dataset::asDatasetGraph

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to asRowSet.

return new TableN(queryIterator) ;
}

public static Table create(Var var, Node value)
{ return new Table1(var, value) ; }

/** Creates a mutable table from the detached bindings of the row set. */
public static Table create(RowSet rs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what going on. Could you explain this please?

This seems to duplicate the intent of RowSet.materialize and RowSet.rewindable.

Copy link
Contributor Author

@Aklakan Aklakan Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference (to my understanding) is that a Table acts as a Collection and a RowSet as an Iterator.
A Table can thus be seen as a factory for RowSets via Table.toRowSet() - each obtained RowSet is an independent iterator over the Table.
My qualms with RowSet.materialize/rewindable is that these methods operate on the iterator level; it seemed clean to me to add Table as a collection of bindings.

return this;
}

protected abstract Binding detachWithNewParent(Binding newParent);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be nicer in BindingFactory ... if we were using Java21 and could use switch patterns. Oh well, another time.

Copy link
Contributor Author

@Aklakan Aklakan Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also thinking that BindingFactory could take care of making bindings independent. But then again, why should BindingFactory have special logic for BindingTDB instances? In principle, BindingTDB.detach could be implemented as just filling the already present (but unused) cache - so no copy would be needed then.

@@ -54,6 +54,12 @@ public static QueryExecBuilder graph(Graph graph) {
return QueryExecDatasetBuilder.create().graph(graph);
}

/** Create a {@link QueryExecBuilder} initialized with an empty dataset. */
public static QueryExecBuilder emptyDataset() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used from TestQueryExecution. Please can the code go there so it is not in the runtime API.?

It can useDatasetGraphFactory.empty().

return QueryExec.dataset(DatasetGraphFactory.empty()`);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just a suggestion for a shortcut - I'll remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shorthand emptyDataset() removed.

* Creates and returns an independent in-memory table by materializing the underlying row set.
* Subsequently, {@link Table#toRowSet()} can be used to obtain a fresh row set view over the table.
*/
public default Table table() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the question about "RowSet.matrialize/rewindable" above.

} finally {
ds.close() ;
}
assertEquals(expected, actual) ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See ResultSetCompare.exactEquals (which should probably be in RowSetOps).

Copy link
Contributor Author

@Aklakan Aklakan Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, I view Table as a Collection and RowSet as an Iterator.
If I am not mistaken, then table's equals() method naturally and succinctly performs an exact comparison (using term equality). For working with iterators (RowSet/ResultSet) one should of course use ResultSetCompare / RowSetOps methods.
In this case, one could also write a bit more verbose: assertTrue(ResultSetCompare.exactEquals(expected.toRowSet(), actual.toRowSet());

@Aklakan Aklakan force-pushed the 2025-01-07_lateral-table-fix branch 7 times, most recently from 89406a0 to f288cb7 Compare January 15, 2025 14:14
@Aklakan Aklakan force-pushed the 2025-01-07_lateral-table-fix branch from f288cb7 to 0aadb7e Compare January 15, 2025 14:15
@Aklakan Aklakan force-pushed the 2025-01-07_lateral-table-fix branch from 0aadb7e to 50b3e24 Compare January 15, 2025 14:17
// Index variables in a set if there are more than a few of them.
Collection<Var> tableVarsIndex = tableVars.size() > 4 ? new HashSet<>(tableVars) : tableVars;
binding.vars().forEachRemaining(v -> {
if (tableVarsIndex.contains(v)) {
Copy link
Member

@afs afs Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table vars can't contain a binding var (or am I missing something?). That case is caught by the text

https://github.com/w3c/sparql-dev/blob/main/SEP/SEP-0007/sep-0007.md
"Disallow syntactic forms that set variables that may already be present in the current row."

It's tested in SyntaxVarScope.checkLATEAL.

(I removed the if, left the else and the two tests pass.)

If so, it simplifies to vars.add(v);
And hence "vars = tableVars" (copy not needed).

Copy link
Contributor Author

@Aklakan Aklakan Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(See comment on the line containing Algebra.compatible)

builder.addAll(row);
builder.addAll(binding);
table2.addBinding(builder.build());
if (Algebra.compatible(row, binding, commonVars.iterator())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a variable in common from substitution it must be the same term.

So this is a contains relationship. (Given the comments above there may be a simpler way.)

Copy link
Contributor Author

@Aklakan Aklakan Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is potentially another can of worms I was about to open w.r.t. sep7:

Perhaps its worthy of opening an issue and discussion on sparql-dev. The reason I added the compatibility check was because I had the following query in mind when writing the tests, which for reasons I do not yet understand is currently prohibited.
Removing the restrictions in SyntaxVarScope.checkLATERAL on Element_Data (and Element_Bind) and adding the check for compatible bindings (effectively filtering bindings of tables down to those compatible with the current row; as done in the PR) naturally makes the query work.

SELECT * {
  VALUES ?department {
    <urn:dept1>
    <urn:dept2>
  }
  LATERAL {
    SELECT * {
      VALUES (?department ?employee) {
        ( <urn:dept1> <urn:person1> )
        ( <urn:dept1> <urn:person2> )
        ( <urn:dept2> <urn:person1> )
      }
    } ORDER BY ?employee LIMIT 1
  }
}

My expectation:

department employee
urn:dept1 urn:person1
urn:dept2 urn:person1

Actual result:
Scoping error due to check in SyntaxVarScope.checkLATERAL .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There must be a problem with SyntaxVarScope.checkLATERAL because the SELECT * is effectively SELECT ?employee which works.

Copy link
Contributor Author

@Aklakan Aklakan Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SELECT * is effectively SELECT ?employee which works.

Hm, depending on the viewpoint it is the effective query after substitution, but before that, it's SELECT ?department ?employee as to make ?department in-scope such that the substitution could affect the table and discard its incompatible rows.

I think the scope error is in-line with the SEP7 statement:

"Disallow syntactic forms that set variables that may already be present in the current row."

But I think that this is too restrictive, because it prevents several useful cases for inlining data using VALUES blocks. IMO if substitution leads to incompatible bindings then they should just be discarded as usual - no? Perhaps you have corner cases in mind where under this logic the substitution could become ambiguous and that this might be the reason for the restrictions.

Removing the restrictions on Element_Bind and Element_Data in checkLATERAL in order to make the above "department-employee" query work causes the following test cases to fail because they no longer raise an varscope/syntax error:

syntax-lateral-bad-{04, 05, 06, 08}.arq

For example, syntax-lateral-bad-04:

SELECT * {
   ?s ?p ?o
   LATERAL { BIND( 123 AS ?o) }
}

This is certainly not an efficient way to write a query, but I don't understand why the query should be syntactically illegal in the first place. To me it looks like it would only retain bindings where ?o = 123. So it would be the task of an optimizer to rewrite the Lateral/Bind to a plain FILTER(?o = 123).

When disabling the SyntaxVarScope check for ElementBind, each ?o gets injected into the unit table beneath (extend ((?o 123)) (table unit)) such as (extend ((?o 123)) (table (row (?o 456)))) and the evaluation of OpExtend discards the incompatible binding - which seems like the natural way it should be to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, depending on the viewpoint it is the effective query after substitution,

My bad - I was thinking about GROUP BY.

For ORDER BY, the query is illegal by the assignment restriction.

To me it looks like it would only retain bindings where ?o = 123.

That would turn BIND into a filter. (ARQ's extension LET does that.)

We have to have a restriction of some kind. The SEP-0007 design makes it a compile-time condition (test once per query).

In general, some system have optimizations that on the static analysis of the query. Making the RHS of a LATERAL have runtime variance would interfere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I have to get Jena 5.3.0 out without this.

This issue is important and something that is, I hope, going to get more importance in the SPARQL community.

WG sparql-query PR 177 at least gets the essential machinery into the resp with the WG contribution licnesing. A small step, but a step.

Copy link
Contributor Author

@Aklakan Aklakan Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, some system have optimizations that on the static analysis of the query.

Well, the tradeoff to what extend existing implementations need to drive restrictions of new features in a standard.

For the sake of closing the issue, should I modify/simplify the logic in QueryIterLateral or should I leave the slightly more general approach as it is for now? Essentially, the var scope check should prevent non-empty sets of commonVars. No, commonVars is non-empty in the case of nested laterals - because the outer lateral already injects its binding into the unit tables - so subsequent inner laterals attempt to do the same injection. Perhaps the way it is is safe - the alternative is to remove the compatibility check on commonVars (and the commonVars set itself) - because the injections should be compatible in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I have to get Jena 5.3.0 out without this.

No worries, whenever it's ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue with Nested Laterals
2 participants