-
Notifications
You must be signed in to change notification settings - Fork 339
Comparison of SQL and SPL:Join Operations (Ⅲ)
【Abstract】
The join operation is used to combine records of two or more tables. This essay explains the handling of join operations using SQL and SPL. By exploring solutions and basic principles of the two commonly used programming languages and providing sample programs written in them, we try to find the faster and more efficient way for you.
In this essay, we focus on the handling of table joins with many-to-many relationship.
The many-to-many relationship exists between tables where one record in table A corresponds to any number of records in table B and vice versa.
We use cross join to handle this type of relationship. One record of table A will join up with each record of table B to generate new records, so the number of rows in the returned result set is the product of numbers of rows in the two tables. SQL CROSS JOIN is used to do the cross join. SPL has xjoin() function to do the job. In real-world situations, the many-to-many relationship is rare. Now let’s look at how to handle such a scenario through matrix multiplication.
【Example 1】Calculate the product of two matrices. The two matrix tables are stored respectively in MATRIXA and MATRIXB:
MATRIXA:
ROW | COL | VALUE |
---|---|---|
1 | 1 | 1 |
1 | 2 | 2 |
1 | 3 | 3 |
2 | 1 | 4 |
2 | 2 | 5 |
2 | 3 | 6 |
MATRIXB:
ROW | COL | VALUE |
---|---|---|
1 | 1 | 1 |
1 | 2 | 4 |
2 | 1 | 2 |
2 | 2 | 5 |
3 | 1 | 3 |
3 | 2 | 6 |
The mathematical formula for solving this task is as follows:
SQL solution:
The matrix multiplication involves many-to-many relationship, which is handled in SQL through the cross join. First, we cross product the two matrix tables to select records where COL value in table A is equivalent to ROW value in table B. Second, sum values with same COL and ROW values. Below are SQL statements:
SELECT
"ROW",COL,SUM(VALUE1*VALUE2) VALUE
FROM (
SELECT
A."ROW",A.VALUE VALUE1,B.COL,B.VALUE VALUE2
FROM MATRIXA A
CROSS JOIN
MATRIXB B
WHERE A.COL=B."ROW"
)
GROUP BY "ROW",COL
ORDER BY "ROW",COL
SPL solution:
SPL offers xjoin() function to perform cross join.
A | |
---|---|
1 | =T("MatrixA.csv") |
2 | =T("MatrixB.csv") |
3 | =xjoin(A1:A; A2:B, A.COL==A2.ROW) |
4 | =A3.groups(A.ROW, B.COL; sum(A.VALUE * B.VALUE):VALUE) |
A1: Import MatrixA table.
A2: Import MatrixB table.
A3: The xjoin() function performs cross join, as well as the conditional filtering at the same time.
A4: Group rows by the condition that COL value is equivalent to ROW value and sum products of values in each group.
A non-equi-join is one where the joining condition is not equivalence comparison. Examples of such joins include finding whether an age is included in a specific age group and whether the income belongs to a certain income level.
【Example 2】Find which age group a person falls in based on COMMUNITY table and AGE_GROUP. Below is part of the source data:
COMMUNITY:
ID | NAME | AGE |
---|---|---|
1 | David | 28 |
2 | Daniel | 15 |
3 | Andrew | 65 |
4 | Rudy | |
… | … | … |
AGE_GROUP:
GROUP_NAME | START | END |
---|---|---|
Children | 0 | 15 |
Youth | 16 | 40 |
Middle | 41 | 60 |
Old | 61 | 100 |
SQL solution:
There are no age record for some community residents, so we use LEFT JOIN to handle this task:
SELECT
ID,NAME,AGE,GROUP_NAME
FROM COMMUNITY C
LEFT JOIN
AGE_GROUP A
ON A."START"<=C.AGE AND A."END">=C.AGE
ORDER BY ID
SPL solution:
SPL offers xjoin() function to perform cross join. The function works with @1 option to enable a left join.
A | |
---|---|
1 | =T("Community.txt") |
2 | =T("AgeGroup.txt") |
3 | =xjoin@1(A1:C; A2:A, A2.START<=C.AGE && A2.END>=C.AGE) |
4 | =A3.new(C.ID, C.NAME, C.AGE,A.GROUP_NAME) |
A1: Import Community table.
A2: Import AgeGroup table.
A3: The xjoin@1() function left joins the two tables according to the first table and get records corresponding to each group. When a record does not match any age group, just keep it and set its corresponding age group as null.
A4: Return ages groups for all community residents.
SQL defines the join operation simply as performing Cartesian product on two sets (tables) and then filtering by a specific condition. The definition covers not only the commonly seen equi-joins, but the equi-joins with many-to-many relationship and even non-equi-joins. An experienced programmer knows that most joins in real-world situations are equi-joins. The over-simplified SQL definition does not reflect the features of equi-joins, and thus the language cannot make the most use of those features to facilitate code writing and task implementation. When a query is too complicated (such as when the number of involved tables are many and when there are nested queries), it is hard to write and optimize code with SQL.
SPL makes use of those features to distinguish equi-joins and non-equi-joins and to offer separate solutions for equi-joins with three different types of associative relationships. The language has join() function to handle one-to-one relationship (homo-dimension tables) and one-to-many relationship (primary and sub tables) and provides A.switch() function to deal with many-to-one relationship (foreign key table).
Besides, SQL’s use of temporary table and nested query makes it hard to write and maintain a SQL query. SPL, however, can compose concise code step by step according to the natural way of thinking.
The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of grouping functions, which combines advantages of both Java and SQL. A join operation will thus become simple and easy with SPL.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code