-
Notifications
You must be signed in to change notification settings - Fork 339
Dimension alignment (JOIN Simplification and Acceleration Series 4)
In the previous essay, we talked about the alignment of two sub tables to the primary table. Below is the SQL query for doing it:
SELECT Orders.id, Orders.customer,A.x,B.y
FROM Orders
LEFT JOIN (SELECT id,SUM(price) x FROM OrderDetail GROUP BY id) A
ON Orders.id=A.id
LEFT JOIN (SELECT id,SUM(amount) y FROM OrderPayment GROUP BY id ) B
ON Orders.id=B.id
WHERE A.x>B.y
Obviously, this is a useful JOIN in real-world queries, but under which type it should be classified?
It involves the Orders table and two subqueries – A and B. Each subquery has a GROUP BY id clause which means the result set’s primary key will be the id field. Now the three tables (a subquery can be considered a temporary table) involved in the JOIN use the same primary key. They are homo-dimension tables associated by a one-to-one mapping. That fits into our types.
But the syntax used to simplify JOIN queries in the previous relative essay cannot be applied to this homo-dimension table JOIN query because both subqueries cannot be omitted.
The prerequisite for a simplifiable JOIN query should be that the relationship between the to-be-joined tables is already defined in the whole data structure. In technical terms, we need to know the database metadata definition. It is unlikely that an ad hoc subquery can be predefined in the metadata, so the table to be joined (a subquery) should be specified.
While the tables (sub queries) to be JOINed cannot be omitted, but since the associative fields, which are primary keys, are already specified in GROUP BY and the grouping fields, which are also primary keys defined by GROUP BY in the subqueries, must be selected for performing the JOINs, and moreover, the sub tables over which the subqueries are performed are independent of one another without any more associations, we can put both the GROUP and aggregate operations in the main query to get rid of a layer of subquery:
SELECT Orders.id, Orders.customer, OrderDetail.SUM(price) x, OrderParyment.SUM(amount) y
FROM Orders
LEFT JOIN OrderDetail GROUP BY id
LEFT JOIN OrderPayment GROUP BY id
WHERE A.x > B.y
The join we are discussing is far from the JOINs defined in SQL. There is no trace of Cartesian product in it. And unlike a SQL JOIN defined between any two tables, this join aligns tables – OrderDetail, OrderPayment and Orders – to a common primary key field, the id field. All tables are aligned against a certain base dimension. As they have different dimensions (primary keys), the GROUP BY action may be needed during the process, leading to an aggregate operation when referring to a field of the table being grouped. There are no direct associations between the OrderDetail table, the OrderPayment table, and even the Orders table, so there is no need to care about their relationships, or whether there is another table to be joined with the current table. In SQL, the Cartesian product-based JOIN requires at least two tables to define the associative operation. Any change to an involved table or its deletion requires the handling of the matching table, making the query hard to understand.
Ours is the dimension alignment join. Though still within the three types previously defined, it has a different syntax. Unlike the case in SQL, here the JOIN key word is more a conjunction than a verb. The FULL JOIN isn’t a rare thing for the dimension alignment, whereas it is impossible or scarcely seen with the three basic join types.
Though we use the association between primary and sub tables to abstract the dimension alignment join, it isn’t the exclusive scenario to which this kind of join is applied (According to the simplified syntax for a join between the primary and sub tables, the query should not be so complicated). The dimension alignment join syntax can be applied to any tables to be associated, without the requirement that the associative fields be primary keys or a part of them.
Here are the Contract table, the Payment table, and the Invoice table:
Contract table
id
date
customer
price
…
Payment table
seq
date
source
amount
…
Invoice table
code
date
customer
amount
…
To find each day’s contract amount, payment amount, and invoice amount, we can write the query like this:
SELECT Contract.SUM(price), Payment.SUM(amount), Invoice.SUM(amount) ON date
FROM Contract GROUP BY date
FULL JOIN Payment GROUP BY date
FULL JOIN Invoice GROUP BY date
Here we specifically write a GROUP BY date clause to specify that the result set will be aligned by date, rather than put the clause within SELECT statement.
The arrangement focuses on each table separately without the need to take care of the association relationships between the three tables. It gives the impression that there are no associations among the tables except for the dimension (date) to which they are pulled.
We can also use a mix of the join types:
Along with the above Contract table, here are the Customer table and the Sellers table:
Customer table
id
name
area
…
Sales table
id
name
area
…
The customer field in the Contract table is the foreign key pointing to the Customer table.
To find the number of salespeople and the contract amount in each area:
SELECT Sales.COUNT(1), Contract.SUM(price) ON area
FROM Sales GROUP BY area
FULL JOIN Contract GROUP BY customer.area
The query uses both the dimension alignment join syntax and the foreign key attributization syntax.
In these examples, all the joins are finally become the homo-dimension type. A dimension alignment join can also be of primary-sub-table type. But it is quite uncommon so we will skip it here.
Finally, the above dimension alignment join syntax, for the time being, is just schematic. It needs a clear and strict dimension definition to become fully formalized and generate interpretive execution statements.
We call the simplistic syntax DQL (Dimensional Query Language). This is a query language revolving around dimensions. The language has been successfully implemented in engineering and DQL statements can be translated to SQL ones, which means it can run on any relational databases.
Read related essays and articles in https://c.scudata.com if you want to know more about DQL theory and applications.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code