Skip to content

Comparison of SQL and SPL:Recursion Operation

esProcSPL edited this page May 7, 2024 · 1 revision

An recursion operation is defined as the process of directly or indirectly calling a procedure repeatedly. The Tower of Hanoi puzzle is one typical example of this. This essay introduces the methods and basic principles of SQL and SPL, the two common programming languages, in handling recursion scenarios, and finds the efficient and fast solutions for you through sample programs. Looking ${article} for details.


Ⅰ. Search for all references recursively

【Example 1】According to a company’s organizational structure below, get the hierarchical level of each department (headquarters are level 1, branches are level 2, and so on).

ID ORG_NAME PARENT_ID
1 Head Office 0
2 Beijing Branch Office 1
3 Shanghai Branch Office 1
4 Chengdu Branch Office 1
5 Beijing R&D Center 2

SQL solution:

To get this done, we need to loop through each record to find all superiors for each organization recursively. According to the natural way of thinking, the expected recursive process is that, for each record of each organization, we can locate the parent record through its superior’s ID and then the grandparent record through the corresponding ID, and so on. SQL does not have the concepts of explicit record and reference, so we can only get records having the superior-subordinate relationship recursively and sort them in a specific order, as shown below:

SQL offers WITH statement to do the recursion query, as shown below:

WITH CTE1 (ID,ORG_NAME,PARENT_ID,O_ID,O_ORG_NAME) AS(
   SELECT
      ID,ORG_NAME,PARENT_ID,ID O_ID,ORG_NAME O_ORG_NAME
   FROM ORGANIZATION
   UNION ALL
   SELECT
      ORG.ID,ORG.ORG_NAME,ORG.PARENT_ID,CTE1.O_ID,CTE1.O_ORG_NAME
   FROM ORGANIZATION ORG
   INNER JOIN CTE1
   ON ORG.ID=CTE1.PARENT_ID
)
SELECT
   O_ID ID,MIN(O_ORG_NAME) ORG_NAME,COUNT(*) "LEVEL"
FROM (
   SELECT
      ID,ORG_NAME,PARENT_ID,O_ID,O_ORG_NAME
   FROM(
      SELECT * 
      FROM CTE1
      ORDER BY O_ID,ID
   )
)
GROUP BY O_ID
ORDER BY ID

With Oracle, you can also use START WITH … CONNECT BY … PRIOR … to do the recursive query, as shown below:

SELECT 
   MAX(ID) ID, MAX(ORG_NAME) ORG_NAME, COUNT(*) "LEVEL"
FROM (
   SELECT
      ID,ORG_NAME,SUM(ID) OVER (ORDER BY RN) GROUP_ID
   FROM (
      SELECT
         CASE WHEN LAG(PARENT_ID,1) OVER(ORDER BY RN ASC)=0
         OR LAG(PARENT_ID,1) OVER(ORDER BY RN ASC) IS NULL THEN ID
         ELSE NULL END ID,
         CASE WHEN LAG(PARENT_ID,1) OVER(ORDER BY RN ASC)=0
         OR LAG(PARENT_ID,1) OVER(ORDER BY RN ASC) IS NULL THEN ORG_NAME
         ELSE NULL END ORG_NAME,
         RN
      FROM (
         SELECT
            ID,ORG_NAME,PARENT_ID,ROWNUM RN
         FROM ORGANIZATION ORG
         START WITH ORG.PARENT_ID>=0
         CONNECT BY ORG.ID=PRIOR ORG.PARENT_ID
      )
   )
)
GROUP BY GROUP_ID
ORDER BY GROUP_ID

The START WITH … statement does not generate simpler code. Though the program returns records of all superiors in turn for every organization to the result set, we need to number records in each group and compare each record with its neighboring one to get the initial position of a group because a SQL set is unordered. As SQL GROUP BY only supports the equi-grouping, we need to add group marks for a further grouping. It would be much simpler if the language supported conditional grouping (according to PARENT_ID=0, for instance) as SPL does.

With databases that neither support WITH nor START WITH …, SQL is unable to achieve recursive queries directly but it has to use the stored procedure to write an recursion function (The details will be skipped here).

SPL solution:

SPL provides A.prior(F) function to perform a recursive query to find all references by default.

A
1 =T("Organization.txt")
2 >A1.switch(PARENT_ID,A1:ID)
3 =A1.new(ID,ORG_NAME,~.prior(PARENT_ID).len():LEVEL)

A1: Import Organization table from the source file.

A2: Objectify the ID foreign key of the superior organization to convert it into the corresponding record and achieve self-join.

A3: Create a new table made up of ID, organization name and level. A.prior() function calculates the department level by getting the number of levels of records that a record references.

The SPL script of handling an recursion problem is concise because SPL’s recursive logic is natural. A.prior() function returns a set of all records of the parent organization for the current organization. Then it’s easy to perform count or any other operation, even a complicated one.

SPL supports retrieving a data table from the database, too. We can change A1 in the above script as follows:

A
1 =connect("db").query("SELECT * FROM ORGANIZATION")

Ⅱ. Search for all references recursively until a specific value appears

【Example 2】According to a company’s organizational structure below, get Beijing Branch Office’s subordinate organizations and its superiors’ names (separate multilevel organizations by comma).

ID ORG_NAME PARENT_ID
1 Head Office 0
2 Beijing Branch Office 1
3 Shanghai Branch Office 1
4 Chengdu Branch Office 1
5 Beijing R&D Center 2

SQL solution:

After the analysis of example 1, the first idea for doing this is that, during the recursive process for searching for the superiors for each organization, stop the recursion if the specified value (the Beijing Branch Office) is found, keep its records and filter away records of the organizations that cannot be found. It is difficult for SQL to hit the target with one round of recursion. So we divide the task into two steps. First, find all subordinate organizations of Beijing Branch Office; second, as the solution to the above problem does, find all superior organizations for each of those organizations until Beijing Branch Office appears. Below is the SQL queries:

WITH CTE1 (ID,ORG_NAME,PARENT_ID) AS(
   SELECT
      ID,ORG_NAME,PARENT_ID
   FROM ORGANIZATION
   WHERE ORG_NAME='Beijing Branch Office'
   UNION ALL
   SELECT
      ORG.ID,ORG.ORG_NAME,ORG.PARENT_ID
   FROM ORGANIZATION ORG
   INNER JOIN CTE1
   ON ORG.PARENT_ID=CTE1.ID
)
,CTE2 (ID,ORG_NAME,PARENT_ID,O_ID,GROUP_NUM) AS(
   SELECT
      ID,ORG_NAME,PARENT_ID,ID O_ID,1 GROUP_NUM
   FROM CTE1
   UNION ALL
   SELECT ORG.ID,ORG.ORG_NAME,ORG.PARENT_ID,
      CTE2.O_ID,CTE2.GROUP_NUM+1 GROUP_NUM
   FROM ORGANIZATION ORG
   INNER JOIN CTE2
   ON ORG.ID=CTE2.PARENT_ID AND
   CTE2.ORG_NAME>>'Beijing Branch Office'
)
SELECT
   MAX(O_ID) ID, MAX(O_ORG_NAME) ORG_NAME,
   MAX(PARENT_NAME) PARENT_NAME
FROM(
   SELECT
      O_ID, O_ORG_NAME,
      WM_CONCAT(ORG_NAME) OVER 
      (PARTITION BY O_ID ORDER BY O_ID,GROUP_NUM) PARENT_NAME
   FROM(
      SELECT 
         ID,PARENT_ID,O_ID,GROUP_NUM,
         CASE WHEN GROUP_NUM=1 THEN NULL ELSE ORG_NAME END ORG_NAME,
         CASE WHEN GROUP_NUM=1 THEN ORG_NAME ELSE NULL END O_ORG_NAME
      FROM (
         SELECT
            ID,ORG_NAME,PARENT_ID,O_ID,GROUP_NUM,ROWNUM RN
         FROM CTE2
         ORDER BY O_ID,GROUP_NUM
      )
   )
)
GROUP BY O_ID
ORDER BY O_ID

With Oracle, you can also use START WITH … CONNECT BY … PRIOR … to do the recursive query, as shown below:

WITH CTE1 AS (
   SELECT ID,ORG_NAME,PARENT_ID,ROWNUM RN
   FROM ORGANIZATION ORG
   START WITH ORG.ORG_NAME='Beijing Branch Office'
   CONNECT BY ORG.PARENT_ID=PRIOR ORG.ID
)
,CTE2 AS (
   SELECT
      ID,ORG_NAME,PARENT_ID,RN,
      CASE WHEN LAG(ORG_NAME,1) OVER(ORDER BY RN ASC)= 'Beijing Branch Office' OR
      LAG(ORG_NAME,1) OVER(ORDER BY RN ASC) IS NULL THEN 1 ELSE 0 END FLAG
   FROM(
      SELECT ID,ORG_NAME,PARENT_ID,ROWNUM RN
      FROM CTE1
      START WITH 1=1
      CONNECT BY CTE1.ID=PRIOR CTE1.PARENT_ID
   )
)
SELECT
   MAX(ID) ID, MAX(O_ORG_NAME) ORG_NAME,
   MAX(PARENT_NAME) PARENT_NAME
FROM(
   SELECT
      ID,O_ORG_NAME,GROUP_ID,
      WM_CONCAT(ORG_NAME) OVER 
      (PARTITION BY GROUP_ID ORDER BY RN) PARENT_NAME
   FROM(
      SELECT
         ID, ORG_NAME, O_ORG_NAME,RN,
         SUM(ID) OVER (ORDER BY RN) GROUP_ID
      FROM(
         SELECT
            PARENT_ID,RN,
            CASE WHEN FLAG=1 THEN NULL ELSE ORG_NAME END ORG_NAME,
            CASE WHEN FLAG=1 THEN ID ELSE NULL END ID,
            CASE WHEN FLAG=1 THEN ORG_NAME ELSE NULL END O_ORG_NAME
         FROM CTE2
      )
   )
)
GROUP BY GROUP_ID
ORDER BY GROUP_ID

SPL solution:

SPL offers A.prior(F,r) function to do the recursive query to find the refences until a specific value appears:

A
1 =T("Organization.txt")
2 >A1.switch(PARENT_ID,A1:ID)
3 =A1.select@1(ORG_NAME=="Beijing Branch Office")
4 =A1.new(ID,ORG_NAME,~.prior(PARENT_ID,A3) :PARENT_NAME)
5 =A4.select(PARENT_NAME!=null)
6 =A5.run(PARENT_NAME=PARENT_NAME.(PARENT_ID.ORG_NAME).concat@c())

A1: Import the Organization table.

A2: Objectify the ID foreign key of the superior organization to convert it into the corresponding record and achieve foreign key objectification.

A3: Get records of Beijing Branch Office.

A4: Create a new table made up of ID, organization name and the set of records of all superior organizations.

A5: Get records that has at least one superior organization, that is, those of Beijing Branch Office.

A6: Join up names of superior organizations into a comma-separated string in a circular way.

The SPL solution is logically clear and concise. It uses records of Beijing Branch Office as parameters to do the recursive search of references.

Ⅲ. Search for leaf records recursively

【Example 3】According to the following Chinese administrative divisions table, find regions and counties under Hebei province. Below is part of the source table:

ID NAME PARENT_ID
1 China 0
11 Beijing 1
12 Tianjin 1
13 Hebei 1
1301 Shijiazhuang 13
1302 Tangshan 13

SQL solution:

First, we find all records of Hebei province and then find records that are not superior regions of other regions. Below are the SQL queries:

WITH CTE1 (ID,NAME,PARENT_ID,PARENT_NAME) AS(
   SELECT
      ID,NAME,PARENT_ID,NULL PARENT_NAME
   FROM CHINA_REGION
   WHERE NAME='Hebei'
   UNION ALL
   SELECT
      T.ID,T.NAME,T.PARENT_ID,CTE1.NAME PARENT_NAME
   FROM CHINA_REGION T
   INNER JOIN CTE1
   ON T.PARENT_ID=CTE1.ID
)
SELECT ID,NAME,PARENT_NAME
FROM CTE1 T1
WHERE NOT EXISTS(
   SELECT 1 
   FROM CTE1 T2
   WHERE T1.ID=T2.PARENT_ID
)
ORDER BY ID

SPL solution:

SPL has P.nodes@d(F,r) function to find all leaves recursively:

A
1 =T("ChinaRegion.csv")
2 >A1.switch(PARENT_ID,A1:ID)
3 =A1.select@1(NAME=="Hebei")
4 =A1.nodes@d(PARENT_ID,A3)
5 =A4.new(ID,NAME,PARENT_ID.NAME:PARENT_NAME)

A1: Import ChinaRegion table.

A2: Objectify the ID foreign key of the superior region to convert it into the corresponding record and achieve foreign key objectification.

A3: Get records of Hebei province.

A4: Use P.nodes() function to perform the recursive search; @d option is used to recursively find all leaves.

A5: Create a new table made up of ID, region names and names of superior regions.

IV. Traverse all files under a directory

【Example 4】According to the following terminal device use for online teaching in a primary school, find the proportion of each type of terminal. Below is the directory structure:

Files under the directory is of Excel format. Below is part of the source data:

ID STUDENT_NAME TERMINAL
1 Rebecca Moore Phone
2 Ashley Wilson Phone,PC,Pad
3 Rachel Johnson Phone,PC,Pad
4 Emily Smith PC,Pad
5 Ashley Smith PC
6 Matthew Johnson Phone
7 Alexis Smith Phone,PC
8 Megan Wilson Phone,PC,Pad

Since SQL cannot retrieve a directory and the files under it, a SQL solution is thus impossible. Let’s look at how SPL handles the recursive traversal of directories and files.

SPL solution:

SPL offers directory@s() function to search for names of all files under all subdirectories in an recursive way:

A
1 =directory@ps("C:/Primary School")
2 >A1.run(t=T(~),@+=t.len(),~=t.conj(TERMINAL.split@c()))
3 =A1.conj().groups(~:TERMINAL;count(~)/A2:PERCENTAGE)

A1:Use directory() function to import the file directory. @s option is used to get file names recursively under all subdirectories.

A2:Read in files circularly, split each string of terminals by comma, and concatenate all unique terminals.

A3:Group records by terminal and calculate the proportion for each.

V. Concatenate field values in a JSON file recursively

【Example 5】Below is the JSON data of worldwide new COVID-19 confirmed cases in a certain time point. The task is to get the total confirmed cases across the world. Below is part of the source data:

[
   {Region:"USA",Confirmed:[
      {Region:"California",Confirmed:3671902},
      {Region:"New York",Confirmed:1139740},
      {Region:"Virginia",Confirmed:620801},
      {Region:"Florida",Confirmed:2064525},
      …
   ]},
   {Region:"Brazil",Confirmed:[…]},
   {Region:"India",Confirmed: […]},
   {Region:"France",Confirmed: […]}
   …
]

As SQL cannot retrieve a JSON file, we will skip the SQL solution. Below is the SPL way of handling the recursive traversal of a JSON file.

SPL solution:

SPL offers A.field@r() function to get field values recursively, and A.conj@r() function to concatenate members of a sequence recursively. Below is the SPL script:

A
1 =json(file("COVID-19.json").read())
2 =A1.field@r("Confirmed")
3 =A2.conj@r().sum()

A1: Read in the JSON file.

A2: GET ALL Confirmed field values recursively.

A3: Concatenate all numbers of confirmed cases and sum them.

Summary

SQL has its method of achieving recursion problems, but the method not convenient to use. This is because SQL does not have the concepts of explicit records and reference, and row numbers should be used in the result of an recursive operation to mark the superior-subordinate relationship. Besides, SQL is based on unordered sets, which makes it troublesome to perform subsequent operations after recursions.

SPL supports the data structure of different data types and offers the concept of explicit records. More importantly, SPL’s logic in handle recursion problems is clear. It uses an recursion function after the self-join and then a set operation to finish the job. SPL supplies flexible and adaptable recursion features to handle common recursion scenarios, such as directory recursion and file content recursion.

With a complex query, the complexity of SQL solution increases sharply by resorting to temporary table and nested queries. This makes it hard to write and maintain the SQL statements. SPL, however, organizes an algorithm in a natural way of thinking and generates concise code.

The SPL-driven and ordered-set-based esProc is the professional data computation engine that provides a complete set of set-oriented operations. The tool combines merits of both Java and SQL and makes it easy to handle recursion scenarios.

Clone this wiki locally