From ddf420686a437c905b3867ec9cb3c2214db7405b Mon Sep 17 00:00:00 2001
From: Olaf Hartig
Date: Wed, 14 Jun 2023 23:15:12 +0200
Subject: [PATCH 1/7] initial changes to turn the input to Group(..) and to Aggregation(..) into solution sequences rather than sets/multisets of solutions; see https://github.com/w3c/sparql-query/issues/95#issuecomment-1584615608

---
 spec/index.html | 48 ++++++++++++++++++++++++++----------------------
 1 file changed, 26 insertions(+), 22 deletions(-)

diff --git a/spec/index.html b/spec/index.html
index c2ded19..1693f44 100644
--- a/spec/index.html
+++ b/spec/index.html
@@ -8745,7 +8745,11 @@
Grouping and Aggregation

Step: GROUP BY

 If the GROUP BY keyword is used, or there is implicit grouping due to the use of aggregates in the projection, then grouping is performed by the
-Group function. It divides the solution set into groups of one or
+Group function.
+In this case, before grouping, the solution set is converted into a solution
+sequence by applying the ToList function.
+Next, the Group function
+divides this solution sequence into groups of one or
 more solutions, with the same overall cardinality. In case of implicit grouping, a fixed constant (1) is used to group all solutions into a single group.

Step: Aggregates

@@ -8765,9 +8769,9 @@
Grouping and Aggregation
 Let E := [], a list of pairs of the form (variable, expression)
 If Q contains GROUP BY exprlist
-   Let G := Group(exprlist, P)
+   Let G := Group(exprlist, ToList(P))
 Else If Q contains an aggregate in SELECT, HAVING, ORDER BY
-   Let G := Group((1), P)
+   Let G := Group((1), ToList(P))
 Else
    skip the rest of the aggregate step
 End
@@ -9415,10 +9419,10 @@

Aggregate Algebra

Definition: Group
-

Group evaluates a list of expressions against a solution sequence, producing a set +

Group evaluates a list of expressions against a solution sequence Ψ, producing a set of partial functions from keys to solution sequences.

-

Group(exprlist, Ω) = { ListEval(exprlist, μ) → { μ' | μ' in Ω, ListEval(exprlist, μ) - = ListEval(exprlist, μ') } | μ in Ω }

+

Group(exprlist, Ψ) = { ListEval(exprlist, μ) → [ μ' | μ' in Ψ, ListEval(exprlist, μ) + = ListEval(exprlist, μ') ] | μ in Ψ }
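The revised Group definition can be read as an order-preserving partition of the input sequence. The Python sketch below is illustrative only and is not part of the patch. It assumes solution mappings are plain dicts, and `list_eval` is a hypothetical stand-in for ListEval that handles variable names only (expression evaluation and error handling are omitted). The sample values for μ1, μ2 and μ3 are inferred from the Aggregation result quoted in the Example hunk.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Solution = Dict[str, Hashable]          # a solution mapping: variable -> value
Key = Tuple[Optional[Hashable], ...]    # result of evaluating exprlist


def list_eval(exprlist: List[str], mu: Solution) -> Key:
    """Hypothetical stand-in for ListEval: each expression is a variable name.

    Unbound variables evaluate to None here (an assumption of this sketch).
    """
    return tuple(mu.get(var) for var in exprlist)


def group(exprlist: List[str], psi: List[Solution]) -> Dict[Key, List[Solution]]:
    """Sketch of Group(exprlist, Psi): map each key to the sub-sequence of Psi
    whose solutions evaluate to that key.

    Each group is a list, so the relative order of solutions in Psi is kept.
    """
    groups: Dict[Key, List[Solution]] = {}
    for mu in psi:
        groups.setdefault(list_eval(exprlist, mu), []).append(mu)
    return groups


# Three solutions grouped by ?x (values inferred from the Example hunk).
mu1 = {"x": 1, "y": 2, "z": 3}
mu2 = {"x": 1, "y": 3, "z": 4}
mu3 = {"x": 2, "y": 5, "z": 6}
G = group(["x"], [mu1, mu2, mu3])
print(G)
```

Using a list per key, rather than a set, is what lets order-sensitive set functions such as GroupConcat observe the order of the input sequence.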

Definition: ListEval

@@ -9441,22 +9445,22 @@

Aggregate Algebra

Let exprlist be a list of expressions or *, func a set function, scalarvals a set of partial functions (possibly empty) passed from the aggregate - in the query, and let { key1→Ω1, ..., - keym→Ωm } be a multiset of partial functions from keys to + in the query, and let { key1→Ψ1, ..., + keym→Ψm } be a set of partial functions from keys to solution sequences as produced by the grouping step.

-

Aggregation applies the set function func to the given multiset and produces a - single value for each key and partition of solutions for that key.

-

Aggregation(exprlist, func, scalarvals, { key1→Ω1, ..., - keym→Ωm } )
-    = { (key, F(Ω)) | key → Ω in { key1→Ω1, ..., - keym→Ωm } }

+

Aggregation applies the set function func to the given set and produces a + single value for each key and group of solutions for that key.

+

Aggregation(exprlist, func, scalarvals, { key1→Ψ1, ..., + keym→Ψm } )
+    = { (key, F(Ψ)) | key → Ψ in { key1→Ψ1, ..., + keym→Ψm } }

where
-   M(Ω) = { ListEval(exprlist, μ) | μ in Ω }
-   F(Ω) = func(M(Ω), scalarvals), for non-DISTINCT
-   F(Ω) = func(Distinct(M(Ω)), scalarvals), for DISTINCT

+   M(Ψ) = [ ListEval(exprlist, μ) | μ in Ψ ]
+   F(Ψ) = func(M(Ψ), scalarvals), for non-DISTINCT
+   F(Ψ) = func(Distinct(M(Ψ)), scalarvals), for DISTINCT

Special Case: when COUNT is used with the expression * the value of F will be the cardinality of the group solution sequence, - card[Ω], or card[Distinct(Ω)] if the DISTINCT + card[Ψ], or card[Distinct(Ψ)] if the DISTINCT keyword is present.

scalarvals are used to pass values to the underlying set function, bypassing @@ -9466,7 +9470,7 @@

Aggregate Algebra

All aggregates may have the DISTINCT keyword as the first token in their argument list. If this keyword is present then first argument to func is Distinct(M).

Example

-

Given a solution multiset (Ω) with the following values:

+

Given a solution sequence Ψ with the following values:

@@ -9497,10 +9501,10 @@

Aggregate Algebra

And the query expression SELECT (ex:agg(?y, ?z) AS ?agg) WHERE { ?x ?y ?z } GROUP BY ?x.

-

We produce G = Group((?x), Ω) = { ( (1), { μ1, μ2 } ), ( (2), { - μ3 } ) }

+

We produce G = Group((?x), Ψ) = { (1) → [μ1, μ2], (2) → + [μ3] }

And so Aggregation((?y, ?z), ex:agg, {}, G) =
- { ((1), eg:agg({(2, 3), (3, 4)}, {})), ((2), eg:agg({(5, 6)}, {})) }.

+ { ((1), ex:agg([(2, 3), (3, 4)], {})), ((2), ex:agg([(5, 6)], {})) }.

Definition: AggregateJoin

Let S1, ..., Sn be a list of sets, where each set

From c2fe1ccc91b2687ff79f367db45b3de648ae6431 Mon Sep 17 00:00:00 2001
From: Olaf Hartig
Date: Thu, 15 Jun 2023 15:03:08 +0200
Subject: [PATCH 2/7] extends the definition of Distinct(Ψ) to be applicable to sequences of tuples, as is used in the definition of F(Ψ) for the Aggregation operator; see https://github.com/w3c/sparql-query/pull/98#discussion_r1230231743
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 spec/index.html | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/spec/index.html b/spec/index.html
index 1693f44..5789c57 100644
--- a/spec/index.html
+++ b/spec/index.html
@@ -9368,10 +9368,10 @@

SPARQL Algebra

Definition: Distinct

-

Let Ψ be a sequence of solution mappings. We define:

-

Distinct(Ψ) = [ μ | μ in Ψ ]

-

card[Distinct(Ψ)](μ) = 1

-

The order of Distinct(Ψ) must preserve any ordering given by OrderBy.

+

Let Ψ be a sequence of elements which may be either solution mappings or lists of RDF terms. We define:

+

Distinct(Ψ) = [ e | e in Ψ ]

+

card[Distinct(Ψ)](e) = 1

+

The order of Distinct(Ψ) must preserve any ordering given by OrderBy (if any).

Definition: Reduced


From e95109b185c5c977620877df0b85a4d44e6bd880 Mon Sep 17 00:00:00 2001
From: Olaf Hartig
Date: Thu, 15 Jun 2023 16:06:05 +0200
Subject: [PATCH 3/7] changes the definitions of the set functions such that they are defined over sequences of lists now, rather than multisets of lists

---
 spec/index.html | 101 ++++++++++++++++++++++++------------------------
 1 file changed, 51 insertions(+), 50 deletions(-)

diff --git a/spec/index.html b/spec/index.html
index 5789c57..bdfe627 100644
--- a/spec/index.html
+++ b/spec/index.html
@@ -9515,24 +9515,24 @@

Aggregate Algebra

..., aggn→valn | key in K and key→vali in Si for each 1 <= i <= n }

-

Flatten is a function which is used to collapse multisets of lists into a multiset, so - for example { (1, 2), (3, 4) } becomes { 1, 2, 3, 4 }.

+

Flatten is a function which is used to collapse a sequence of lists into a single list. + For example, [(1, 2), (3, 4)] becomes (1, 2, 3, 4).

Definition: Flatten

-

The Flatten(M) function takes a multiset of lists, M {(L1, L2, - ...), ...}, and returns the multiset { x | L in M and x in L }.

+

The Flatten(S) function takes a sequence of lists, S = [(L1, L2, + ...), ...], and returns the list ( x | L in S and x in L ).

Set Functions

The set functions which underlie SPARQL aggregates all have a common signature: - SetFunc(M), or SetFunc(M, scalarvals) where M is a multiset of lists, and scalarvals is + SetFunc(S), or SetFunc(S, scalarvals) where S is a sequence of lists, and scalarvals is one or more scalar values that are passed to the set function indirectly via the ( ... ; key=value ) syntax for aggregates in the SPARQL grammar. The only use of this that is supported by the built-in aggregates in SPARQL Query 1.1 is GROUP_CONCAT, as in GROUP_CONCAT(?x ; separator=", ").

Note that the name "Set Function" is somewhat historical — the arguments to set - functions are in fact multisets. The name is retained due to the commonality with SQL - Set Functions, which also operate over multisets.

+ functions are in fact sequences. The name is retained due to the commonality with SQL + Set Functions, which operate over multisets.

The set functions defined in this document are Count, Sum, Min, Max, Avg, GroupConcat, and Sample — corresponding to the aggregates COUNT, SUM, MIN, MAX, AVG, @@ -9550,10 +9550,10 @@

Count
has a bound, non-error value within the aggregate group.

Definition: Count

-
xsd:integer Count(multiset M)
-

N = Flatten(M)

-

remove error elements from N

-

Count(M) = card[N]

+
xsd:integer Count(sequence S)
+

L = Flatten(S)

+

remove error elements from L

+

Count(S) = card[L]

@@ -9565,13 +9565,14 @@
Sum
be 6.0 (float).

Definition: Sum

-
numeric Sum(multiset M)
-

Sum(M) = Sum(ToList(Flatten(M))).

-

Sum(S) = op:numeric-add(S1, Sum(S2..n)) when card[S] > +

numeric Sum(sequence S)
+

L = Flatten(S)

+

Sum(S) = Sum(L)

+

Sum(L) = op:numeric-add(L1, Sum(L2..n)) when card[L] > 1
- Sum(S) = op:numeric-add(S1, 0) when card[S] = 1
- Sum(S) = "0"^^xsd:integer when card[S] = 0

-

In this way, Sum({1, 2, 3}) = op:numeric-add(1, op:numeric-add(2, + Sum(L) = op:numeric-add(L1, 0) when card[L] = 1
+ Sum(L) = "0"^^xsd:integer when card[L] = 0

+

In this way, Sum( (1, 2, 3) ) = op:numeric-add(1, op:numeric-add(2, op:numeric-add(3, 0))).
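The sequence-based definitions of Flatten, Count and Sum can be sketched in Python as follows. This is an illustration under stated assumptions, not the specification's algorithm: lists are modelled as tuples, `ERROR` is a hypothetical marker for an error element, and Python's `+` stands in for op:numeric-add (so XSD type promotion is not modelled).

```python
from typing import Any, List, Sequence, Tuple

ERROR = object()  # hypothetical marker for an expression that evaluated to an error


def flatten(s: Sequence[Tuple[Any, ...]]) -> List[Any]:
    """Flatten(S): concatenate the lists in S into one list, keeping their order."""
    return [x for lst in s for x in lst]


def count(s: Sequence[Tuple[Any, ...]]) -> int:
    """Count(S): flatten S, remove error elements, return the cardinality."""
    return len([x for x in flatten(s) if x is not ERROR])


def sum_list(l: List[Any]) -> Any:
    """Sum(L), following the three cases of the definition literally."""
    if len(l) == 0:
        return 0                      # "0"^^xsd:integer
    if len(l) == 1:
        return l[0] + 0               # op:numeric-add(L1, 0)
    return l[0] + sum_list(l[1:])     # op:numeric-add(L1, Sum(L2..n))


def sum_seq(s: Sequence[Tuple[Any, ...]]) -> Any:
    """Sum(S) = Sum(Flatten(S))."""
    return sum_list(flatten(s))


print(flatten([(1, 2), (3, 4)]))
print(sum_seq([(1,), (2,), (3,)]))
```

The recursion mirrors the definition for readability; an implementation would more likely fold over the list iteratively.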

@@ -9581,11 +9582,11 @@
Avg
average value for an expression over a group. It is defined in terms of Sum and Count.

Definition: Avg

-
numeric Avg(multiset M)
-

Avg(M) = "0"^^xsd:integer, where Count(M) = 0

-

Avg(M) = Sum(M) / Count(M), where Count(M) > 0

+
numeric Avg(sequence S)
+

Avg(S) = "0"^^xsd:integer, where Count(S) = 0

+

Avg(S) = Sum(S) / Count(S), where Count(S) > 0

-

For example, Avg({1, 2, 3}) = Sum({1, 2, 3})/Count({1, 2, 3}) = 6/3 = 2.

+

For example, Avg([(1), (2), (3)]) = Sum([(1), (2), (3)])/Count([(1), (2), (3)]) = 6/3 = 2.

Min
@@ -9595,12 +9596,12 @@
Min
arbitrarily typed expressions.

Definition: Min

-
term Min(multiset M)
-

Min(M) = Min(ToList(Flatten(M)))

-

Min({}) = error.

-

The flattened multiset of values passed as an argument is converted to a sequence - S, this sequence is ordered as per the ORDER BY ASC clause.

-

Min(S) = S0

+
term Min(sequence S)
+

L = Flatten(S)

+

Min(S) = Min(L)

+

The flattened list L of values is ordered as per the ORDER BY ASC clause.

+

Min(L) = L0 if card[L] > 0
+ Min(L) = error if card[L] = 0

@@ -9611,12 +9612,12 @@
Max
arbitrarily typed expressions.

Definition: Max

-
term Max(multiset M)
-

Max(M) = Max(ToList(Flatten(M)))

-

Max({}) = error.

-

The multiset of values passed as an argument is converted to a sequence S, this - sequence is ordered as per the ORDER BY DESC clause.

-

Max(S) = S0

+
term Max(sequence S)
+

L = Flatten(S)

+

Max(S) = Max(L)

+

The flattened list L of values is ordered as per the ORDER BY DESC clause.

+

Max(L) = L0 if card[L] > 0
+ Max(L) = error if card[L] = 0

@@ -9627,33 +9628,33 @@
GroupConcat
SEPARATOR.

Definition: GroupConcat

-
literal GroupConcat(multiset M)
+
literal GroupConcat(sequence S)

If the "separator" scalar argument is absent from GROUP_CONCAT then it is taken to be the "space" character, unicode codepoint U+0020.

-

The multiset of values, M passed as an argument is converted to a sequence S.

-

GroupConcat(M, scalarvals) = GroupConcat(Flatten(M), scalarvals("separator"))

-

GroupConcat(S, sep) = "", where |S| = 0

-

GroupConcat(S, sep) = CONCAT("", S0), where - |S| = 1

-

GroupConcat(S, sep) = CONCAT(S0, sep, GroupConcat(S1..n-1, - sep)), where |S| > 1

-
-

For example, GroupConcat({"a", "b", "c"}, {"separator" → "."}) = "a.b.c".

+

L = Flatten(S)

+

GroupConcat(S, scalarvals) = GroupConcat(L, scalarvals("separator"))

+

GroupConcat(L, sep) = "", where |L| = 0

+

GroupConcat(L, sep) = CONCAT("", L0), where + |L| = 1

+

GroupConcat(L, sep) = CONCAT(L0, sep, GroupConcat(L1..n-1, + sep)), where |L| > 1

+ +

For example, GroupConcat([("a"), ("b"), ("c")], {"separator" → "."}) = "a.b.c".
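A minimal Python sketch of the revised GroupConcat definition follows. It is illustrative only: `scalarvals` is modelled as a dict, string concatenation stands in for CONCAT, and the function names are assumptions of the sketch rather than names used by the specification.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def concat_list(l: List[str], sep: str) -> str:
    """GroupConcat(L, sep), following the three cases of the definition."""
    if len(l) == 0:
        return ""
    if len(l) == 1:
        return "" + l[0]                              # CONCAT("", L0)
    return l[0] + sep + concat_list(l[1:], sep)       # CONCAT(L0, sep, GroupConcat(L1..n-1, sep))


def group_concat(s: Sequence[Tuple[str, ...]],
                 scalarvals: Optional[Dict[str, str]] = None) -> str:
    """GroupConcat(S, scalarvals): flatten S, then join using the separator.

    The separator defaults to a single space (U+0020) when absent.
    """
    sep = (scalarvals or {}).get("separator", " ")
    flat = [x for lst in s for x in lst]              # L = Flatten(S)
    return concat_list(flat, sep)


print(group_concat([("a",), ("b",), ("c",)], {"separator": "."}))
```

Because the input is now a sequence, the order of the concatenated values follows the order of the group's solutions.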

Sample
-

Sample is a set function which returns an arbitrary value from the multiset passed +

Sample is a set function which returns an arbitrary value from the sequence passed to it.

Definition: Sample

-
RDFTerm Sample(multiset M)
-

Sample(M) = v, where v in Flatten(M)

-

Sample({}) = error

+
RDFTerm Sample(sequence S)
+

Sample(S) = v, where v in Flatten(S)

+

Sample([]) = error

-

For example, given Sample({"a", "b", "c"}), "a", "b", and "c" are all valid return +

For example, given Sample([("a"), ("b"), ("c")]), "a", "b", and "c" are all valid return values. Note that Sample() is not required to be deterministic for a given input, the - only restriction is that the output value must be present in the input multiset.

+ only restriction is that the output value must be present in the input sequence.


From 67ac56ee236ac4bfb281c02b3852c5409e3024fc Mon Sep 17 00:00:00 2001
From: Olaf Hartig
Date: Thu, 15 Jun 2023 22:19:01 +0200
Subject: [PATCH 4/7] Apply suggestions from code review

Co-authored-by: Ted Thibodeau Jr
---
 spec/index.html | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/spec/index.html b/spec/index.html
index bdfe627..0b67bcd 100644
--- a/spec/index.html
+++ b/spec/index.html
@@ -8770,7 +8770,7 @@
Grouping and Aggregation
 If Q contains GROUP BY exprlist
    Let G := Group(exprlist, ToList(P))
-Else If Q contains an aggregate in SELECT, HAVING, ORDER BY
+Else If Q contains an aggregate in SELECT, HAVING, ORDER BY
    Let G := Group((1), ToList(P))
 Else
    skip the rest of the aggregate step
@@ -9456,8 +9456,8 @@

Aggregate Algebra

keym→Ψm } }

where
  M(Ψ) = [ ListEval(exprlist, μ) | μ in Ψ ]
-   F(Ψ) = func(M(Ψ), scalarvals), for non-DISTINCT
-   F(Ψ) = func(Distinct(M(Ψ)), scalarvals), for DISTINCT

+   F(Ψ) = func(M(Ψ), scalarvals), for non-DISTINCT
+   F(Ψ) = func(Distinct(M(Ψ)), scalarvals), for DISTINCT

Special Case: when COUNT is used with the expression * the value of F will be the cardinality of the group solution sequence, card[Ψ], or card[Distinct(Ψ)] if the DISTINCT

From ff80086d0b37ff19eff77b72b664db42d95f0326 Mon Sep 17 00:00:00 2001
From: Olaf Hartig
Date: Fri, 16 Jun 2023 15:49:54 +0200
Subject: [PATCH 5/7] improved definition of Distinct; as per https://github.com/w3c/sparql-query/pull/98#discussion_r1231405785

---
 spec/index.html | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/spec/index.html b/spec/index.html
index 0b67bcd..8cae4f8 100644
--- a/spec/index.html
+++ b/spec/index.html
@@ -9368,10 +9368,22 @@

SPARQL Algebra

Definition: Distinct

-

Let Ψ be a sequence of elements which may be either solution mappings or lists of RDF terms. We define:

-

Distinct(Ψ) = [ e | e in Ψ ]

-

card[Distinct(Ψ)](e) = 1

-

The order of Distinct(Ψ) must preserve any ordering given by OrderBy (if any).

+

Let Ψ be a sequence of elements which may be either solution mappings or lists of RDF terms.

+

Distinct(Ψ) is a sequence of elements that has the following properties.

+ 1. Every element in Ψ is contained in Distinct(Ψ).
+ 2. Every element in Distinct(Ψ) is contained in Ψ.
+ 3. Distinct(Ψ) is free of duplicates. That is, the element at the |i|-th position in Distinct(Ψ) is different from the element at the |j|-th position in Distinct(Ψ) for every two natural numbers |i| and |j| such that |i| ≠ |j|.
+ 4. For every two elements e1 and e2 in Distinct(Ψ), the relative order of their first occurrences in Ψ is preserved in Distinct(Ψ). That is, if i1 < i2, then j1 < j2, where
+    • i1 is the smallest natural number such that e1 is at the i1-th position in Ψ,
+    • i2 is the smallest natural number such that e2 is at the i2-th position in Ψ,
+    • j1 is the position of e1 in Distinct(Ψ), and
+    • j2 is the position of e2 in Distinct(Ψ).

Definition: Reduced


From d808fb81cb44ad57931f326ab0af0fcca3fd474a Mon Sep 17 00:00:00 2001
From: Olaf Hartig
Date: Sun, 18 Jun 2023 11:50:58 +0200
Subject: [PATCH 6/7] reverts the change to Distinct and introduces Dedup instead, as suggested in https://github.com/w3c/sparql-query/commit/ff80086d0b37ff19eff77b72b664db42d95f0326#commitcomment-118517879 ; additionally addresses the following comments:
 https://github.com/w3c/sparql-query/commit/ff80086d0b37ff19eff77b72b664db42d95f0326#r118402147
 https://github.com/w3c/sparql-query/commit/ff80086d0b37ff19eff77b72b664db42d95f0326#r118507227
 https://github.com/w3c/sparql-query/commit/ff80086d0b37ff19eff77b72b664db42d95f0326#r118508717

---
 spec/index.html | 39 +++++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/spec/index.html b/spec/index.html
index 8cae4f8..ed0e4f7 100644
--- a/spec/index.html
+++ b/spec/index.html
@@ -9368,22 +9368,10 @@

SPARQL Algebra

Definition: Distinct

-

Let Ψ be a sequence of elements which may be either solution mappings or lists of RDF terms.

-

Distinct(Ψ) is a sequence of elements that has the following properties.

- 1. Every element in Ψ is contained in Distinct(Ψ).
- 2. Every element in Distinct(Ψ) is contained in Ψ.
- 3. Distinct(Ψ) is free of duplicates. That is, the element at the |i|-th position in Distinct(Ψ) is different from the element at the |j|-th position in Distinct(Ψ) for every two natural numbers |i| and |j| such that |i| ≠ |j|.
- 4. For every two elements e1 and e2 in Distinct(Ψ), the relative order of their first occurrences in Ψ is preserved in Distinct(Ψ). That is, if i1 < i2, then j1 < j2, where
-    • i1 is the smallest natural number such that e1 is at the i1-th position in Ψ,
-    • i2 is the smallest natural number such that e2 is at the i2-th position in Ψ,
-    • j1 is the position of e1 in Distinct(Ψ), and
-    • j2 is the position of e2 in Distinct(Ψ).
+

Let Ψ be a sequence of solution mappings. We define:

+

Distinct(Ψ) = [ μ | μ in Ψ ]

+

card[Distinct(Ψ)](μ) = 1

+

The order of Distinct(Ψ) must preserve any ordering given by OrderBy.

Definition: Reduced

@@ -9469,10 +9457,25 @@

Aggregate Algebra

where
  M(Ψ) = [ ListEval(exprlist, μ) | μ in Ψ ]
  F(Ψ) = func(M(Ψ), scalarvals), for non-DISTINCT
-   F(Ψ) = func(Distinct(M(Ψ)), scalarvals), for DISTINCT

+   F(Ψ) = func(Dedup(M(Ψ)), scalarvals), for DISTINCT

+

with Dedup(M(Ψ)) being an order-preserving, duplicate-free version of the sequence M(Ψ); that is, Dedup(M(Ψ)) is a sequence of RDF terms that has the following four properties.

+ 1. Every unique element in M(Ψ) is contained in Dedup(M(Ψ)).
+ 2. Every element in Dedup(M(Ψ)) is contained in M(Ψ).
+ 3. Dedup(M(Ψ)) is free of duplicates. That is, the element at the |i|-th position in Dedup(M(Ψ)) is not the same term as the element at the |j|-th position in Dedup(M(Ψ)) for every two natural numbers |i| and |j| such that |i| ≠ |j|.
+ 4. For any two elements e1 and e2 in Dedup(M(Ψ)), the relative order of their first occurrences in M(Ψ) is preserved in Dedup(M(Ψ)). That is, if i1 < i2, then j1 < j2, where
+    • i1 is the smallest natural number such that e1 is at the i1-th position in M(Ψ),
+    • i2 is the smallest natural number such that e2 is at the i2-th position in M(Ψ),
+    • j1 is the position of e1 in Dedup(M(Ψ)), and
+    • j2 is the position of e2 in Dedup(M(Ψ)).
+
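The four properties required of Dedup are satisfied by a single pass that keeps the first occurrence of each element. The Python sketch below is one possible reading, not text from the patch. It assumes elements are hashable and that Python equality is an acceptable stand-in for "the same term". That stand-in is only approximate: Python treats 1 and 1.0 as equal, whereas "1"^^xsd:integer and "1.0"^^xsd:decimal are different RDF terms.

```python
from typing import Hashable, List, Sequence, Set, TypeVar

T = TypeVar("T", bound=Hashable)


def dedup(m: Sequence[T]) -> List[T]:
    """Order-preserving, duplicate-free version of the sequence m.

    An element is kept at its first occurrence and dropped afterwards, so
    the relative order of first occurrences in m is preserved in the result.
    """
    seen: Set[T] = set()
    result: List[T] = []
    for e in m:
        if e not in seen:      # assumption: Python equality models term identity
            seen.add(e)
            result.append(e)
    return result


# Elements are lists of terms, modelled here as tuples.
print(dedup([(2,), (3,), (2,), (1,), (3,)]))
```

Every input element appears in the result (property 1), nothing new is introduced (property 2), the `seen` set rules out duplicates (property 3), and appending in iteration order gives the first-occurrence ordering (property 4).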

Special Case: when COUNT is used with the expression * the value of F will be the cardinality of the group solution sequence, - card[Ψ], or card[Distinct(Ψ)] if the DISTINCT + card[Ψ], or card[Dedup(Ψ)] if the DISTINCT keyword is present.

scalarvals are used to pass values to the underlying set function, bypassing

From 8e30456504de274cd1f396615ad3060443a00135 Mon Sep 17 00:00:00 2001
From: Olaf Hartig
Date: Thu, 22 Jun 2023 19:39:51 +0200
Subject: [PATCH 7/7] Removing the elements again from the algorithm.

---
 spec/index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/spec/index.html b/spec/index.html
index ed0e4f7..59c4643 100644
--- a/spec/index.html
+++ b/spec/index.html
@@ -8770,7 +8770,7 @@

Grouping and Aggregation
 If Q contains GROUP BY exprlist
    Let G := Group(exprlist, ToList(P))
-Else If Q contains an aggregate in SELECT, HAVING, ORDER BY
+Else If Q contains an aggregate in SELECT, HAVING, ORDER BY
    Let G := Group((1), ToList(P))
 Else
    skip the rest of the aggregate step