-
Notifications
You must be signed in to change notification settings - Fork 339
Comparison of SQL and SPL:Recursion Operation
An recursion operation is defined as the process of directly or indirectly calling a procedure repeatedly. The Tower of Hanoi puzzle is one typical example of this. This essay introduces the methods and basic principles of SQL and SPL, the two common programming languages, in handling recursion scenarios, and finds the efficient and fast solutions for you through sample programs. Looking ${article} for details.
【Example 1】According to a company’s organizational structure below, get the hierarchical level of each department (headquarters are level 1, branches are level 2, and so on).
ID | ORG_NAME | PARENT_ID |
---|---|---|
1 | Head Office | 0 |
2 | Beijing Branch Office | 1 |
3 | Shanghai Branch Office | 1 |
4 | Chengdu Branch Office | 1 |
5 | Beijing R&D Center | 2 |
… | … | … |
SQL solution:
To get this done, we need to loop through each record to find all superiors for each organization recursively. According to the natural way of thinking, the expected recursive process is that, for each record of each organization, we can locate the parent record through its superior’s ID and then the grandparent record through the corresponding ID, and so on. SQL does not have the concepts of explicit record and reference, so we can only get records having the superior-subordinate relationship recursively and sort them in a specific order, as shown below:
SQL offers WITH statement to do the recursion query, as shown below:
WITH CTE1 (ID,ORG_NAME,PARENT_ID,O_ID,O_ORG_NAME) AS(
SELECT
ID,ORG_NAME,PARENT_ID,ID O_ID,ORG_NAME O_ORG_NAME
FROM ORGANIZATION
UNION ALL
SELECT
ORG.ID,ORG.ORG_NAME,ORG.PARENT_ID,CTE1.O_ID,CTE1.O_ORG_NAME
FROM ORGANIZATION ORG
INNER JOIN CTE1
ON ORG.ID=CTE1.PARENT_ID
)
SELECT
O_ID ID,MIN(O_ORG_NAME) ORG_NAME,COUNT(*) "LEVEL"
FROM (
SELECT
ID,ORG_NAME,PARENT_ID,O_ID,O_ORG_NAME
FROM(
SELECT *
FROM CTE1
ORDER BY O_ID,ID
)
)
GROUP BY O_ID
ORDER BY ID
With Oracle, you can also use START WITH … CONNECT BY … PRIOR … to do the recursive query, as shown below:
SELECT
MAX(ID) ID, MAX(ORG_NAME) ORG_NAME, COUNT(*) "LEVEL"
FROM (
SELECT
ID,ORG_NAME,SUM(ID) OVER (ORDER BY RN) GROUP_ID
FROM (
SELECT
CASE WHEN LAG(PARENT_ID,1) OVER(ORDER BY RN ASC)=0
OR LAG(PARENT_ID,1) OVER(ORDER BY RN ASC) IS NULL THEN ID
ELSE NULL END ID,
CASE WHEN LAG(PARENT_ID,1) OVER(ORDER BY RN ASC)=0
OR LAG(PARENT_ID,1) OVER(ORDER BY RN ASC) IS NULL THEN ORG_NAME
ELSE NULL END ORG_NAME,
RN
FROM (
SELECT
ID,ORG_NAME,PARENT_ID,ROWNUM RN
FROM ORGANIZATION ORG
START WITH ORG.PARENT_ID>=0
CONNECT BY ORG.ID=PRIOR ORG.PARENT_ID
)
)
)
GROUP BY GROUP_ID
ORDER BY GROUP_ID
The START WITH … statement does not generate simpler code. Though the program returns records of all superiors in turn for every organization to the result set, we need to number records in each group and compare each record with its neighboring one to get the initial position of a group because a SQL set is unordered. As SQL GROUP BY only supports the equi-grouping, we need to add group marks for a further grouping. It would be much simpler if the language supported conditional grouping (according to PARENT_ID=0, for instance) as SPL does.
With databases that neither support WITH nor START WITH …, SQL is unable to achieve recursive queries directly but it has to use the stored procedure to write an recursion function (The details will be skipped here).
SPL solution:
SPL provides A.prior(F) function to perform a recursive query to find all references by default.
A | |
---|---|
1 | =T("Organization.txt") |
2 | >A1.switch(PARENT_ID,A1:ID) |
3 | =A1.new(ID,ORG_NAME,~.prior(PARENT_ID).len():LEVEL) |
A1: Import Organization table from the source file.
A2: Objectify the ID foreign key of the superior organization to convert it into the corresponding record and achieve self-join.
A3: Create a new table made up of ID, organization name and level. A.prior() function calculates the department level by getting the number of levels of records that a record references.
The SPL script of handling an recursion problem is concise because SPL’s recursive logic is natural. A.prior() function returns a set of all records of the parent organization for the current organization. Then it’s easy to perform count or any other operation, even a complicated one.
SPL supports retrieving a data table from the database, too. We can change A1 in the above script as follows:
A | |
---|---|
1 | =connect("db").query("SELECT * FROM ORGANIZATION") |
【Example 2】According to a company’s organizational structure below, get Beijing Branch Office’s subordinate organizations and its superiors’ names (separate multilevel organizations by comma).
ID | ORG_NAME | PARENT_ID |
---|---|---|
1 | Head Office | 0 |
2 | Beijing Branch Office | 1 |
3 | Shanghai Branch Office | 1 |
4 | Chengdu Branch Office | 1 |
5 | Beijing R&D Center | 2 |
… | … | … |
SQL solution:
After the analysis of example 1, the first idea for doing this is that, during the recursive process for searching for the superiors for each organization, stop the recursion if the specified value (the Beijing Branch Office) is found, keep its records and filter away records of the organizations that cannot be found. It is difficult for SQL to hit the target with one round of recursion. So we divide the task into two steps. First, find all subordinate organizations of Beijing Branch Office; second, as the solution to the above problem does, find all superior organizations for each of those organizations until Beijing Branch Office appears. Below is the SQL queries:
WITH CTE1 (ID,ORG_NAME,PARENT_ID) AS(
SELECT
ID,ORG_NAME,PARENT_ID
FROM ORGANIZATION
WHERE ORG_NAME='Beijing Branch Office'
UNION ALL
SELECT
ORG.ID,ORG.ORG_NAME,ORG.PARENT_ID
FROM ORGANIZATION ORG
INNER JOIN CTE1
ON ORG.PARENT_ID=CTE1.ID
)
,CTE2 (ID,ORG_NAME,PARENT_ID,O_ID,GROUP_NUM) AS(
SELECT
ID,ORG_NAME,PARENT_ID,ID O_ID,1 GROUP_NUM
FROM CTE1
UNION ALL
SELECT ORG.ID,ORG.ORG_NAME,ORG.PARENT_ID,
CTE2.O_ID,CTE2.GROUP_NUM+1 GROUP_NUM
FROM ORGANIZATION ORG
INNER JOIN CTE2
ON ORG.ID=CTE2.PARENT_ID AND
CTE2.ORG_NAME>>'Beijing Branch Office'
)
SELECT
MAX(O_ID) ID, MAX(O_ORG_NAME) ORG_NAME,
MAX(PARENT_NAME) PARENT_NAME
FROM(
SELECT
O_ID, O_ORG_NAME,
WM_CONCAT(ORG_NAME) OVER
(PARTITION BY O_ID ORDER BY O_ID,GROUP_NUM) PARENT_NAME
FROM(
SELECT
ID,PARENT_ID,O_ID,GROUP_NUM,
CASE WHEN GROUP_NUM=1 THEN NULL ELSE ORG_NAME END ORG_NAME,
CASE WHEN GROUP_NUM=1 THEN ORG_NAME ELSE NULL END O_ORG_NAME
FROM (
SELECT
ID,ORG_NAME,PARENT_ID,O_ID,GROUP_NUM,ROWNUM RN
FROM CTE2
ORDER BY O_ID,GROUP_NUM
)
)
)
GROUP BY O_ID
ORDER BY O_ID
With Oracle, you can also use START WITH … CONNECT BY … PRIOR … to do the recursive query, as shown below:
WITH CTE1 AS (
SELECT ID,ORG_NAME,PARENT_ID,ROWNUM RN
FROM ORGANIZATION ORG
START WITH ORG.ORG_NAME='Beijing Branch Office'
CONNECT BY ORG.PARENT_ID=PRIOR ORG.ID
)
,CTE2 AS (
SELECT
ID,ORG_NAME,PARENT_ID,RN,
CASE WHEN LAG(ORG_NAME,1) OVER(ORDER BY RN ASC)= 'Beijing Branch Office' OR
LAG(ORG_NAME,1) OVER(ORDER BY RN ASC) IS NULL THEN 1 ELSE 0 END FLAG
FROM(
SELECT ID,ORG_NAME,PARENT_ID,ROWNUM RN
FROM CTE1
START WITH 1=1
CONNECT BY CTE1.ID=PRIOR CTE1.PARENT_ID
)
)
SELECT
MAX(ID) ID, MAX(O_ORG_NAME) ORG_NAME,
MAX(PARENT_NAME) PARENT_NAME
FROM(
SELECT
ID,O_ORG_NAME,GROUP_ID,
WM_CONCAT(ORG_NAME) OVER
(PARTITION BY GROUP_ID ORDER BY RN) PARENT_NAME
FROM(
SELECT
ID, ORG_NAME, O_ORG_NAME,RN,
SUM(ID) OVER (ORDER BY RN) GROUP_ID
FROM(
SELECT
PARENT_ID,RN,
CASE WHEN FLAG=1 THEN NULL ELSE ORG_NAME END ORG_NAME,
CASE WHEN FLAG=1 THEN ID ELSE NULL END ID,
CASE WHEN FLAG=1 THEN ORG_NAME ELSE NULL END O_ORG_NAME
FROM CTE2
)
)
)
GROUP BY GROUP_ID
ORDER BY GROUP_ID
SPL solution:
SPL offers A.prior(F,r) function to do the recursive query to find the refences until a specific value appears:
A | |
---|---|
1 | =T("Organization.txt") |
2 | >A1.switch(PARENT_ID,A1:ID) |
3 | =A1.select@1(ORG_NAME=="Beijing Branch Office") |
4 | =A1.new(ID,ORG_NAME,~.prior(PARENT_ID,A3) :PARENT_NAME) |
5 | =A4.select(PARENT_NAME!=null) |
6 | =A5.run(PARENT_NAME=PARENT_NAME.(PARENT_ID.ORG_NAME).concat@c()) |
A1: Import the Organization table.
A2: Objectify the ID foreign key of the superior organization to convert it into the corresponding record and achieve foreign key objectification.
A3: Get records of Beijing Branch Office.
A4: Create a new table made up of ID, organization name and the set of records of all superior organizations.
A5: Get records that has at least one superior organization, that is, those of Beijing Branch Office.
A6: Join up names of superior organizations into a comma-separated string in a circular way.
The SPL solution is logically clear and concise. It uses records of Beijing Branch Office as parameters to do the recursive search of references.
【Example 3】According to the following Chinese administrative divisions table, find regions and counties under Hebei province. Below is part of the source table:
ID | NAME | PARENT_ID |
---|---|---|
1 | China | 0 |
11 | Beijing | 1 |
12 | Tianjin | 1 |
13 | Hebei | 1 |
… | … | … |
1301 | Shijiazhuang | 13 |
1302 | Tangshan | 13 |
… | … | … |
SQL solution:
First, we find all records of Hebei province and then find records that are not superior regions of other regions. Below are the SQL queries:
WITH CTE1 (ID,NAME,PARENT_ID,PARENT_NAME) AS(
SELECT
ID,NAME,PARENT_ID,NULL PARENT_NAME
FROM CHINA_REGION
WHERE NAME='Hebei'
UNION ALL
SELECT
T.ID,T.NAME,T.PARENT_ID,CTE1.NAME PARENT_NAME
FROM CHINA_REGION T
INNER JOIN CTE1
ON T.PARENT_ID=CTE1.ID
)
SELECT ID,NAME,PARENT_NAME
FROM CTE1 T1
WHERE NOT EXISTS(
SELECT 1
FROM CTE1 T2
WHERE T1.ID=T2.PARENT_ID
)
ORDER BY ID
SPL solution:
SPL has P.nodes@d(F,r) function to find all leaves recursively:
A | |
---|---|
1 | =T("ChinaRegion.csv") |
2 | >A1.switch(PARENT_ID,A1:ID) |
3 | =A1.select@1(NAME=="Hebei") |
4 | =A1.nodes@d(PARENT_ID,A3) |
5 | =A4.new(ID,NAME,PARENT_ID.NAME:PARENT_NAME) |
A1: Import ChinaRegion table.
A2: Objectify the ID foreign key of the superior region to convert it into the corresponding record and achieve foreign key objectification.
A3: Get records of Hebei province.
A4: Use P.nodes() function to perform the recursive search; @d option is used to recursively find all leaves.
A5: Create a new table made up of ID, region names and names of superior regions.
【Example 4】According to the following terminal device use for online teaching in a primary school, find the proportion of each type of terminal. Below is the directory structure:
Files under the directory is of Excel format. Below is part of the source data:
ID | STUDENT_NAME | TERMINAL |
---|---|---|
1 | Rebecca Moore | Phone |
2 | Ashley Wilson | Phone,PC,Pad |
3 | Rachel Johnson | Phone,PC,Pad |
4 | Emily Smith | PC,Pad |
5 | Ashley Smith | PC |
6 | Matthew Johnson | Phone |
7 | Alexis Smith | Phone,PC |
8 | Megan Wilson | Phone,PC,Pad |
… | … | … |
Since SQL cannot retrieve a directory and the files under it, a SQL solution is thus impossible. Let’s look at how SPL handles the recursive traversal of directories and files.
SPL solution:
SPL offers directory@s() function to search for names of all files under all subdirectories in an recursive way:
A | |
---|---|
1 | =directory@ps("C:/Primary School") |
2 | >A1.run(t=T(~),@+=t.len(),~=t.conj(TERMINAL.split@c())) |
3 | =A1.conj().groups(~:TERMINAL;count(~)/A2:PERCENTAGE) |
A1:Use directory() function to import the file directory. @s option is used to get file names recursively under all subdirectories.
A2:Read in files circularly, split each string of terminals by comma, and concatenate all unique terminals.
A3:Group records by terminal and calculate the proportion for each.
【Example 5】Below is the JSON data of worldwide new COVID-19 confirmed cases in a certain time point. The task is to get the total confirmed cases across the world. Below is part of the source data:
[
{Region:"USA",Confirmed:[
{Region:"California",Confirmed:3671902},
{Region:"New York",Confirmed:1139740},
{Region:"Virginia",Confirmed:620801},
{Region:"Florida",Confirmed:2064525},
…
]},
{Region:"Brazil",Confirmed:[…]},
{Region:"India",Confirmed: […]},
{Region:"France",Confirmed: […]}
…
]
As SQL cannot retrieve a JSON file, we will skip the SQL solution. Below is the SPL way of handling the recursive traversal of a JSON file.
SPL solution:
SPL offers A.field@r() function to get field values recursively, and A.conj@r() function to concatenate members of a sequence recursively. Below is the SPL script:
A | |
---|---|
1 | =json(file("COVID-19.json").read()) |
2 | =A1.field@r("Confirmed") |
3 | =A2.conj@r().sum() |
A1: Read in the JSON file.
A2: GET ALL Confirmed field values recursively.
A3: Concatenate all numbers of confirmed cases and sum them.
SQL has its method of achieving recursion problems, but the method not convenient to use. This is because SQL does not have the concepts of explicit records and reference, and row numbers should be used in the result of an recursive operation to mark the superior-subordinate relationship. Besides, SQL is based on unordered sets, which makes it troublesome to perform subsequent operations after recursions.
SPL supports the data structure of different data types and offers the concept of explicit records. More importantly, SPL’s logic in handle recursion problems is clear. It uses an recursion function after the self-join and then a set operation to finish the job. SPL supplies flexible and adaptable recursion features to handle common recursion scenarios, such as directory recursion and file content recursion.
With a complex query, the complexity of SQL solution increases sharply by resorting to temporary table and nested queries. This makes it hard to write and maintain the SQL statements. SPL, however, organizes an algorithm in a natural way of thinking and generates concise code.
The SPL-driven and ordered-set-based esProc is the professional data computation engine that provides a complete set of set-oriented operations. The tool combines merits of both Java and SQL and makes it easy to handle recursion scenarios.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code