Skip to content

Comparison of SQL and SPL:Set oriented Operations

esProcSPL edited this page Apr 25, 2024 · 2 revisions

【Abstract】

Set-oriented operations include a set of basic operations, including concatenation, intersection, union, difference, XOR, and inclusion. As structured data often exists in the form of sets, both SQL and SPL, the two commonly used programming languages, provide a wealth of set-oriented methods and functions to handle the data. This essay explores solutions and basic principles of the two languages for handling set-oriented operations through examples and corresponding sample programs and tries to introduce the more convenient and more efficient way for you.

1. Concatenation

Concatenation of set A and set B is a set including all members of set A and all members of set B. It contains duplicate members.

【Example 1】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to calculate the total sales amount of each product in the year 2014. Below are parts of the two source tables:

SALES_ONLINE:

ID CUSTOMERID ORDERDATE PRODUCTID AMOUNT
1 HANAR 2014/01/01 17 3868.6
2 ERNSH 2014/01/03 42 1005.9
3 LINOD 2014/01/06 64 400.0
4 OTTIK 2014/01/07 10 1194.0
5 OCEAN 2014/01/09 41 319.2

SALES_STORE:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
1 EASTC 2014/01/01 1 27 3063.0
2 ERNSH 2014/01/02 8 70 2713.5
3 MAGAA 2014/01/03 2 74 1675.0
4 SAVEA 2014/01/07 7 45 2018.2
5 FOLIG 2014/01/08 8 30 1622.4

The two tables have different structures. SALES_STORE has SELLERID field for storing IDs of salespeople. But both have ID, CUSTOMERID, ORDERDATE, PRODUCTID and AMOUNT fields.

SQL solution:

select
    PRODUCTID, sum(AMOUNT) AMOUNT
from 
    (select PRODUCTID, AMOUNT 
    from SALES_ONLINE
    where extract (year from ORDERDATE)=2014
    union all
    select PRODUCTID, AMOUNT
    from SALES_STORE
    where extract (year from ORDERDATE)=2014)
group by PRODUCTID
order by PRODUCTID

SQL UNION operation is used to combine two or more sub result sets of same structure. They need to have same number of columns and each column should have the same data type. UNION ALL combines all records, including the duplicates.

Here we use ORCALE to do the SQL calculations. Since ORACLE does not have YEAR function, we use extract (year from date) to get the year.

SPL solution:

SPL uses the vertical line “|” to calculate concatenation of sets. A|B represents the concatenation of set A and set B.

A
1 =T("SalesOnline.csv").select(year(ORDERDATE)==2014)
2 =T("SalesStore.csv").select(year(ORDERDATE)==2014)
3 =A1
4 =A3.groups(PRODUCTID; sum(AMOUNT):AMOUNT)

A1: Import SalesOnline table from the source file and select sales records of 2014.

A2: Import SalesStore table from the source file and select sales records of 2014.

A3: Use the sign “|” to calculate concatenation of the two sets.

A4: Group A3’s result set by product and calculate the total sales amount.

The SPL sequence supports members of different structures, so it does not require that the involved sets have same structure when concatenate them. It can directly access their common fields, like PRODUCTID and AMOUNT in this case, as it accesses an ordinary data table. It’s superbly convenient.

SPL supports retrieving a data table from the database, we can change A1 in the above script as follows:

A
1 =connect("db").query("select * from SALES_STORE where extract (year from ORDERDATE)=2014")

【Example 2】Based on the following scores table, find the top 4 math scores, top 3 English scores and top 2 PE scores. Below is part of the source table:

CLASS STUDENTID SUBJECT SCORE
1 1 English 84
1 1 Math 77
1 1 PE 69
1 2 English 81
1 2 Math 80

SQL solution:

SQL lacks a special method of calculating concatenation of two or more sets. It uses UNION ALL operator to do this. Oracle, for instance, defines an open table, uses ROW_NUMBER() OVER to calculate the rankings of all subjects ordered by scores in descending order, and then get top N for each subject:

with cte1 as
    (select
        CLASS, STUDENTID, SUBJECT, SCORE,
        ROW_NUMBER() OVER(PARTITION BY SUBJECT
        ORDER BY SCORE DESC) grouprank
    from SCORES)
select CLASS, STUDENTID, SUBJECT, SCORE 
from cte1
where grouprank <= 4 and SUBJECT='Math'
UNION ALL
select CLASS, STUDENTID, SUBJECT, SCORE 
from cte1
where grouprank <= 3 and SUBJECT='English'
UNION ALL
select CLASS, STUDENTID, SUBJECT, SCORE 
from cte1
where grouprank <= 2 and SUBJECT='PE'

SPL solution:

SPL uses A.conj() function to calculate the concatenation of member sets when A is a set of sets.

A
1 =T("Scores.csv")
2 [Math,English,PE]
3 [4,3,2]
4 =A3.conj(A1.select(SUBJECT==A2(A3.#)).top(-~;SCORE))

A1: Import student scores table.

A2: Define a set of subjects.

A3: Define a set of ranks corresponding to the set of subjects.

A4: Get top N scores for each specified subject, and use A.conj() function to concatenate multiple sets.

As there isn’t a special method of calculating concatenation of sets, we use UNION ALL to combine two sets each time. The code becomes more and more complicated when the number of sets involved increases. SPL has the special function A.conj() to do the job. It can calculate concatenation of any number of sets.

2. Intersection

The intersection of set A and set B is a set including all members belonging to both A and B.

【Example 3】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find the customers who purchase through both sales channels in the year 2014. Below is part of the source table:

SALES_ONLINE:

ID CUSTOMERID ORDERDATE PRODUCTID AMOUNT
1 HANAR 2014/01/01 17 3868.6
2 ERNSH 2014/01/03 42 1005.9
3 LINOD 2014/01/06 64 400.0
4 OTTIK 2014/01/07 10 1194.0
5 OCEAN 2014/01/09 41 319.2

SALES_STORE:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
1 EASTC 2014/01/01 1 27 3063.0
2 ERNSH 2014/01/02 8 70 2713.5
3 MAGAA 2014/01/03 2 74 1675.0
4 SAVEA 2014/01/07 7 45 2018.2
5 FOLIG 2014/01/08 8 30 1622.4

SQL solution:

SQL INTERSECT operator is used to calculate the intersection of two or more sub result sets. The sub result sets must have same data structure and same number of columns, and their columns should contain data of same types. Below is SQL query:

select 
    DISTINCT CUSTOMERID
from
    SALES_ONLINE
where
    extract (year from ORDERDATE)=2014
INTERSECT
select
    DISTINCT CUSTOMERID
from
    SALES_STORE
where
    extract (year from ORDERDATE)=2014

Early databases probably do not support INTERSECT operator. So we can first calculate concatenation (UNION ALL) and then perform a filtering to get the target through grouping operation and count operation:

select
    CUSTOMERID, COUNT(*)
from
    (select DISTINCT CUSTOMERID
    from SALES_ONLINE
    where extract (year from ORDERDATE)=2014
UNION ALL
select DISTINCT CUSTOMERID
from SALES_STORE
where extract (year from ORDERDATE)=2014)
group by
    CUSTOMERID
HAVING
    COUNT(*)>1
order by
    CUSTOMERID

The additional layer of grouping and aggregation makes the SQL query extremely complicated.

SPL solution:

SPL uses the sign “^” to calculate intersection of sets. A^B represents the intersection of set A and set B.

A
1 =T("SalesOnline.csv").select(year(ORDERDATE)==2014)
2 =T("SalesStore.csv").select(year(ORDERDATE)==2014)
3 =A1.id(CUSTOMERID)^A2.id(CUSTOMERID)

A1: Import SalesOnline table from the source file and select sales records of 2014.

A2: Import SalesStore table from the source file and select sales records of 2014.

A3: A.id() function gets the set of unique customer IDs. The sign “^” is used to calculate intersection of the two sets, which are customers who purchase products in both in online and at stores.

【Example 4】Based on the following sales data table, find top 10 customers in terms of monthly sales amount in the year 2014. Below is part of the source table:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
10400 EASTC 2014/01/01 1 27 3063.0
10401 HANAR 2014/01/01 1 17 3868.6
10402 ERNSH 2014/01/02 8 70 2713.5
10403 ERNSH 2014/01/03 4 42 1005.9
10404 MAGAA 2014/01/03 2 74 1675.0

SQL solution:

with cte1 as
    (select
        extract (month from ORDERDATE) ORDERMONTH,
        CUSTOMERID, SUM(AMOUNT) AMOUNT
    from SALES
    where extract (year from ORDERDATE)=2014
    group by extract (month from ORDERDATE),CUSTOMERID
    order by ORDERMONTH ASC, AMOUNT DESC),
cte2 as
    (select
        ORDERMONTH,CUSTOMERID,AMOUNT,
        ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank
    from cte1)
select
    CUSTOMERID,count(*)
from cte2
where grouprank<=10
group by CUSTOMERID
having count(*)=12

SQL has not a method of calculating intersection of sets. The language uses INTERSECT operator to do the job. It is unsuitable to get top 10 customers in each month and then use INTERSECT operator to get the combination. Here we group records and perform a count operation. If a customer ranking in monthly top 10 for 12 times, it enters in the top 10 in each month.

SPL solution:

SPL uses A.isect() function to calculate intersection of all member sets if A is a set of sets.

A
1 =T("Sales.csv").select(year(ORDERDATE)==2014)
2 =A1.group(month(ORDERDATE))
3 =A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT))
4 =A3.(~.top(-10;AMOUNT).(CUSTOMERID))
5 =A4.isect()

A1: Import Sales table from the source file and select records of the year 2014.

A2: Group A1’s table by month.

A3: Group records in each month by customer.

A4: Get customers whose monthly sales amounts rank in top 10 in each month.

A5: A.isect() function calculates intersection of all selected customers in all months.

Though we employ a trick to calculate the intersection, the SQL query is still complex. One reason is that SQL does not have a special method to calculate intersection of sets, the other is that SQL cannot retain the post-grouping subsets for a further grouping operation or other computations. The grouping operation will be explained separately later.

3. Union

The union of set A and set B is a set including all members of A and members of B, except for the duplicates.

【Example 5】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find the products whose total online sales amounts in the year 2014 is above 10,000 or that are purchased over 5 times at stores. Below are parts of the source tables:

SALES_ONLINE:

ID CUSTOMERID ORDERDATE PRODUCTID AMOUNT
1 HANAR 2014/01/01 17 3868.6
2 ERNSH 2014/01/03 42 1005.9
3 LINOD 2014/01/06 64 400.0
4 OTTIK 2014/01/07 10 1194.0
5 OCEAN 2014/01/09 41 319.2

SALES_STORE:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
1 EASTC 2014/01/01 1 27 3063.0
2 ERNSH 2014/01/02 8 70 2713.5
3 MAGAA 2014/01/03 2 74 1675.0
4 SAVEA 2014/01/07 7 45 2018.2
5 FOLIG 2014/01/08 8 30 1622.4

SQL solution:

select
    PRODUCTID
from
    (select PRODUCTID, sum(AMOUNT)
    from SALES_ONLINE
    where extract (year from ORDERDATE)=2014
    group by PRODUCTID
    having sum(AMOUNT)>10000)
union
select
    PRODUCTID
from
    (select PRODUCTID, count(*)
    from SALES_STORE
    where  extract (year from ORDERDATE)=2014
    group by PRODUCTID
    having count(*)>5)

As we have mentioned, SQL UNION operator is used to combine results sets of two or more SELECT statements. When UNION is used without ALL, the operator removes duplicate records while combining records.

SPL solution:

SPL offers the and sign “&” to calculate the union of sets. A&B represents the union of set A and set B.

A
1 =T("SalesOnline.csv").select(year(ORDERDATE)==2014)
2 =T("SalesStore.csv").select(year(ORDERDATE)==2014)
3 =A1.groups(PRODUCTID; sum(AMOUNT):AMOUNT).select(AMOUNT>10000)
4 =A2.groups(PRODUCTID; count(~):COUNT).select(COUNT>5)
5 =A3.(PRODUCTID)&A4.(PRODUCTID)

A1: Import SalesOnline table from the source file and select records of the year 2014.

A2: Import SalesStore table from the source file and select records of the year 2014.

A3: Group A1’s records by product ID, calculate the total sales amount for each product, and select records where the totals are above 10,000.

A4: Group A2’s records by product ID, count purchase frequencies for each product, and select records where the frequency count is greater than 5.

A5: Use “&” to calculate the union of records purchased both through online and at stores.

【Example 6】Based on the following sales table, find the products whose yearly purchase frequency ranks in top 10. Below is part of the source table:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
10400 EASTC 2014/01/01 1 27 3063.0
10401 HANAR 2014/01/01 1 17 3868.6
10402 ERNSH 2014/01/02 8 70 2713.5
10403 ERNSH 2014/01/03 4 42 1005.9
10404 MAGAA 2014/01/03 2 74 1675.0

SQL solution:

with cte1 as 
    (select 
        extract (year from ORDERDATE) ORDERYEAR, 
        PRODUCTID, COUNT(*) ORDERCOUNT
    from SALES 
    group by extract (year from ORDERDATE),PRODUCTID
    order by ORDERYEAR ASC, ORDERCOUNT DESC),
cte2 as 
    (select
        ORDERYEAR,PRODUCTID,ORDERCOUNT,
        ROW_NUMBER()OVER(PARTITION BY ORDERYEAR ORDER BY ORDERCOUNT DESC) grouprank 
    from cte1)
select 
    DISTINCT PRODUCTID 
from cte2 
where grouprank<=10

SQL does not offer a special method of calculating union of sets, so we group data by year, calculate the total purchase frequency of each product per year, sort the result records, get rankings through row numbers after sorting, then select IDs of products that rank in top 10 in each year in terms of purchase frequency, and finally, use DISTINCT operator to remove the duplicates to get the union.

SPL solution:

SPL uses A.union() function to calculate union of all member sets when A is a set of sets.

A
1 =T("Sales.csv")
2 =A1.group(year(ORDERDATE))
3 =A2.(~.groups(PRODUCTID;count(~):COUNT))
4 =A3.(~.top(-5;COUNT).(PRODUCTID))
5 =A4.union()

A1: Import Sales table from the source file.

A2: Group A1’s table by year.

A3: Group records of each year by product and calculate their purchase frequency.

A4: Get IDs of products whose yearly purchase frequencies rank in top 10.

A5: A.union() function calculates union of desired records of products in all years.

4. Difference

The difference of set A and set B is a set including all members of A that do not belong to set B.

【Example 7】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find customers whose total purchase amount at stores is above 1000 but who have not any online purchase records. Below is part of the source table:

SALES_ONLINE:

ID CUSTOMERID ORDERDATE PRODUCTID AMOUNT
1 HANAR 2014/01/01 17 3868.6
2 ERNSH 2014/01/03 42 1005.9
3 LINOD 2014/01/06 64 400.0
4 OTTIK 2014/01/07 10 1194.0
5 OCEAN 2014/01/09 41 319.2

SALES_STORE:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
1 EASTC 2014/01/01 1 27 3063.0
2 ERNSH 2014/01/02 8 70 2713.5
3 MAGAA 2014/01/03 2 74 1675.0
4 SAVEA 2014/01/07 7 45 2018.2
5 FOLIG 2014/01/08 8 30 1622.4

SQL solution:

select
    CUSTOMERID
from
    (select
        CUSTOMERID,count(*)
    from
        SALES_STORE
    where
        extract (year from ORDERDATE)=2014
    group by
        CUSTOMERID
    having
        count(*)>3)
MINUS
select
    DISTINCT CUSTOMERID
from
    SALES_ONLINE
where
    extract (year from ORDERDATE)=2014

Take Oracle SQL as an example. The MINUS operator (some databases use EXCEPT) is used to calculate difference. MINUS requires that the involved sub result sets must have same data structures and same number of columns, and that corresponding columns should have same or similar data types.

SPL solution:

SPL uses the backslash “\” to calculate difference of sets. A\B represents the difference of set A and set B.

A
1 =T("SalesOnline.csv").select(year(ORDERDATE)==2014)
2 =T("SalesStore.csv").select(year(ORDERDATE)==2014)
3 =A2.groups(CUSTOMERID; count(~):COUNT).select(COUNT>3)
4 =A3.id(CUSTOMERID)\A1.id(CUSTOMERID)

A1: Import SalesOnline table from the source file and select records of the year 2014.

A2: Import SalesStore table from the source file and select records of the year 2014.

A3: Group A2’s records by customer IDs, calculate the total purchase frequency for each customer, and select records where the total frequency is greater than 3.

A4: Use “\” to calculate difference of customers who purchase products online and those who buy things at stores.

【Example 8】Based on the following sales table, find customers whose total sales amounts rank in top 10 only in January in the year 2014. Below is part of the source table:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
10400 EASTC 2014/01/01 1 27 3063.0
10401 HANAR 2014/01/01 1 17 3868.6
10402 ERNSH 2014/01/02 8 70 2713.5
10403 ERNSH 2014/01/03 4 42 1005.9
10404 MAGAA 2014/01/03 2 74 1675.0

SQL solution:

with cte1 as
    (select 
        extract (month from ORDERDATE) ORDERMONTH, 
        CUSTOMERID, SUM(AMOUNT) AMOUNT
    from SALES 
    where extract (year from ORDERDATE)=2014 
    group by extract (month from ORDERDATE),CUSTOMERID
    order by ORDERMONTH ASC, AMOUNT DESC),
cte2 as 
    (select
        ORDERMONTH,CUSTOMERID,AMOUNT,
        ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank 
    from cte1)
select CUSTOMERID
from cte2 
where grouprank<=10 and ORDERMONTH=1
MINUS
select CUSTOMERID
from cte2 
where grouprank<=10 and ORDERMONTH<>1

SQL does not have a particular method for calculating difference of sets, so we group data by month, calculate the total sales amount of each customer per month, sort the result records, get rankings through row numbers after sorting, select customers of January, and use MINUS operator to calculate difference January’s customers and the set of customers in other months.

SPL solution:

SPL offers A.diff() function to calculate difference of the first member set and all the other member sets when A is a set of sets.

A
1 =T("Sales.csv").select(year(ORDERDATE)==2014)
2 =A1.group(month(ORDERDATE))
3 =A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT))
4 =A3.(~.top(-10;AMOUNT).(CUSTOMERID))
5 =A4.diff()

A1: Import Sales table from the source file and select records of the year 2014.

A2: Group A1’s records by month.

A3: Group records of each month by customer IDs and calculate total sales of each customer.

A4: Get customers in each month whose total sales amounts rank in top 10.

A5: Use A.diff() function to calculate difference between customers of January and those in the other months.

5. XOR

XOR of set A and set B is a set made up members that belong to A or B but not belong to both.

【Example 9】Students’ scores in two semesters are stored in two tables respectively. The task is to find students in class 1 whose total scores rank in top 10 only once in both semesters. Below is part of the source table:

SCORES1_SEMESTER1:

CLASS STUDENTID SUBJECT SCORE
1 1 English 84
1 1 Math 77
1 1 PE 69
1 2 English 81
1 2 Math 80

SCORES2_SEMESTER2:

CLASS STUDENTID SUBJECT SCORE
1 1 English 97
1 1 Math 64
1 1 PE 97
1 2 English 56
1 2 Math 82

SQL solution:

SQL does not define an operator for performing XOR. There are two methods if we choose to use set operators to do this:

  1. (A UNION B) EXCEPT (A INTERSECT B);

  2. (A EXCEPT B) UNION (B EXCEPT A);

Both methods are not convenient because they use multiple views and thus drag performance down. In the following query, we use FULL OUTER JOIN to counter the disadvantage:

    with cte1 as
      (select STUDENTID,TOTALSCORE 
   from 
      (select STUDENTID, sum(SCORE) TOTALSCORE
      from SCORES1
      group by STUDENTID
      order by TOTALSCORE DESC)
      where rownum <= 10),
   cte2 as 
      (select STUDENTID,TOTALSCORE
      from 
         (select STUDENTID, sum(SCORE) TOTALSCORE
         from SCORES2
         group by STUDENTID
         order by TOTALSCORE DESC)
      where rownum <= 10)
   select 
      COALESCE(cte1.STUDENTID, cte2.STUDENTID) STUDENTID,
      cte1.TOTALSCORE, cte2.TOTALSCORE
   from cte1
   FULL OUTER JOIN cte2
   ON cte1.STUDENTID=cte2.STUDENTID
   where cte1.TOTALSCORE IS NULL 
      OR cte2.TOTALSCORE IS NULL

With Oracle SQL, we can use LEFT JOIN UNION RIGHT JOIND to implement FULL JOIN if the involved database is one that does not support FULL JOIN, such as MySQL (Detail query is skipped here).

SPL solution:

SPL uses the percent sign “%” to calculate XOR. A%B represents XOR of set A and set B.

A
1 =T("Scores1.csv")
2 =T("Scores2.csv")
3 =A1.groups(STUDENTID; sum(SCORE):SCORE)
4 =A2.groups(STUDENTID; sum(SCORE):SCORE)
5 =A3.top(-10;SCORE).(STUDENTID)
6 =A4.top(-10;SCORE).(STUDENTID)
7 =A5%A6

A1: Import Scores1 table of from source file.

A2: Import Scores2 table of from source file.

A3: Group A1’s table by student ID and sum scores of each student.

A4: Group A2’s table by student ID and sum scores of each student.

A5: Get IDs of students whose total scores rank in top 10 in semester 1.

A6: Get IDs of students whose total scores rank in top 10 in semester 2.

A7: Use “%” to calculate XOR of student total scores in semester 1 and those in semester 2.

The SQL query is enormously complicated because there isn’t a particular operator in SQL to calculate XOR. SPL, however, is convenient by offering the percent sign “%” to do it.

【Example 10】Based on the following table, find whether customer RATTC ranked in top 3 in the year 2014 in terms of sales amount in a single month. Below is part of the source data:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
10400 EASTC 2014/01/01 1 27 3063.0
10401 HANAR 2014/01/01 1 17 3868.6
10402 ERNSH 2014/01/02 8 70 2713.5
10403 ERNSH 2014/01/03 4 42 1005.9
10404 MAGAA 2014/01/03 2 74 1675.0

SQL solution:

    with cte1 as 
      (select 
         extract (month from ORDERDATE) ORDERMONTH, 
         CUSTOMERID, SUM(AMOUNT) AMOUNT
      from SALES 
      where extract (year from ORDERDATE)=2014 
      group by extract (month from ORDERDATE),CUSTOMERID
      order by ORDERMONTH ASC, AMOUNT DESC),
   cte2 as 
      (select
         ORDERMONTH,CUSTOMERID,AMOUNT,
         ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank 
      from cte1)
   select count(*) CUSTOMER_EXISTS 
   from cte2 
   where grouprank<=3 and CUSTOMERID='RATTC'

SQL does not have a specific method for getting XOR, so we group data by month, calculate the total sales amount of each customer per month, sort the result records, get rankings through row numbers after sorting, select customers entering top 3 in each month, and count records of customer "RATTC". If the count result is 1, it means true; and if it is 0, it means false.

SPL solution:

SPL offers A.cor() function to calculate XOR of member sets when A is a set of sets.

A
1 =T("Sales.csv").select(year(ORDERDATE)==2014)
2 =A1.group(month(ORDERDATE))
3 =A2.(~.groups(CUSTOMERID; sum(AMOUNT):AMOUNT))
4 =A3.new(~.top(-3; AMOUNT):TOP3)
5 =A4.(TOP3.(CUSTOMERID).pos("RATTC")>0)
6 =A5.cor()

A1: Import Sales table from the source file and select records of the year 2014.

A2: Group Sales table by month.

A3: Group records of each month by customer and calculate total sales amount of each customer in each group.

A4: Get customers whose sales amounts rank in top 3 in each month.

A5: Find if there is customer "RATTC" in each group.

A6: Use A.cor() function to find XOR, that is, whether customer "RATTC" is included in each month. The customer exists if the result is true, and it does not exist if the result is false.

6. Belong to & include

Suppose there are two sets A and B. If all members of set A are members of set B, we call that B includes A. The “belong to” relationship is between a member and a set. When there is a member x in set A, we call that x belongs to A.

【Example 11】Based on the following employee table, calculate the average salary in each department in the states of California, New York, Texas and Washington. Below is part of the source table:

ID NAME SURNAME STATE DEPT SALARY
1 Rebecca Moore California R&D 7000
2 Ashley Wilson New York Finance 11000
3 Rachel Johnson New Mexico Sales 9000
4 Emily Smith Texas HR 7000
5 Ashley Smith Texas R&D 16000

SQL solution:

SQL IN operator is used to define a “belong to” relationship in WHERE sub statement. The SQL query is as follows:

select
    DEPT, avg(SALARY) AVGSALARY 
from 
    EMPLOYEE 
where 
    STATE in ('California','New York','Texas','Washington') 
group by 
    DEPT

SPL solution:

SPL supplies A.contain(x) function to check whether member x belongs to set A.

A
1 =T("Employee.csv")
2 [California,New York,Texas,Washington]
3 =A1.select(A2.contain(STATE))
4 =A3.groups(DEPT; avg(SALARY):SALARY)

A1: Import Employee table from the source file.

A2: Define a constant set of states.

A3: Select records from A1’s table where the states belong to A2’s set.

A4: Group the selected records in A3 by department and calculate the average salary in each department.

【Example 12】Based on COURSE table and SELECT_COURSE table, find students who select both Matlab and Modern wireless communication system. Below is part of the s source table:

COURSE:

ID NAME TEACHERID
1 Environmental protection and sustainable development 5
2 Mental health of College Students 1
3 Matlab 8
4 Electromechanical basic practice 7
5 Introduction to modern life science 3

SELECT_COURSE:

ID STUDENT_NAME COURSE
1 Rebecca Moore 2,7
2 Ashley Wilson 1,8
3 Rachel Johnson 2,7,10
4 Emily Smith 1,10
5 Ashley Smith 5,6

The task can be described in another way. It checks whether the COURSE field value in SELECT_COURSE table contains the set [3,6], which are IDs of Matlab and Modern wireless communication system.

SQL solution:

The SQL field does not support set data type, so we cannot use the set include relationship to get this task done. With Oracle database here, we use REGEXP_SUBSTR function to split each COURSE value string according to a specific regular expression, left join SELECT_COURSE table and COURSE table to get records selecting both courses, group these records by IDs, get groups containing at least two records, that is, those selecting both courses, and then locate corresponding records from SELECT_COURSE table according to the selected IDs. Below is SQL query:

with cte1 as 
    (SELECT ID,REGEXP_SUBSTR(t1.COURSE ,'[^,]+',1,l) COURSE
     FROM SELECT_COURSE t1,
        (SELECT LEVEL l 
        FROM DUAL 
        CONNECT BY LEVEL<=10) t2
        WHERE l<=LENGTH(t1.COURSE) - LENGTH(REPLACE(COURSE,','))+1)
select * 
from SELECT_COURSE t5
inner join (
    select ID, count(*) 
    from (
        select t3.ID, COURSE 
        from cte1 t3
        inner join (
        select ID 
        from COURSE 
        where NAME='Matlab' or 
        NAME='Modern wireless communication system'
        ) t4
        on t3.COURSE=t4.ID
    ) 
    group by ID 
    having count(*)>=2
) t6
on t5.ID=t6.ID

SPL solution:

SPL uses A.pos(B) function locate the position of a member of set B in set A and returns null if the member does not exist in set A.

A
1 =T("Course.txt")
2 =T("SelectCourse.txt")
3 =A1.select(NAME=="Matlab"
4 =A2.run(COURSE=COURSE.split@cp())
5 =A4.select(COURSE.pos(A3)!=null)

A1: Import Course table from the source file.

A2: Import SelectCourse table from the source file.

A3: Get the set of IDs of the target two courses.

A4: Split each Course value by comma and parse the numbers into a set.

A5: Use A.pos() function to locate IDs of the target courses in COURSE value of SELECT_COURSE table, and a record that does not make it return null is a desired one.

As SQL lacks support of set type field values, it is a little complicated to handle this case. The language does snot provide a method of checking a set include relationship, so it handles the job using the filtering join. SPL, however, supports set type field values and offers a rich library of functions to locate members of a set, which facilitates the handling of set include relationship judgment cases.

【Example 13】Based on the following weather data in a certain area, find the dates when west wind occurs and when north wind visits in the previous dates. Below is part of the source table:

WEATHER_DATE RAIN_FALL WIND_GUST_DIR WIND_GUST_SPEED RAIN_TODAY RAIN_TOMORROW
2008/12/01 0.6 W 44 No No
2008/12/02 0.0 WNW 44 No No
2008/12/03 0.0 WSW 46 No No
2008/12/04 0.0 NE 24 No No
2008/12/05 1.0 W 41 No No

SQL solution:

The task is simple. It aims to find an ordered subset [N,N,W] in WIND_GUST_DIR set. SQL has a weak support for order-based calculations due to its unordered-set-based theoretic foundation (which is explained in Comparison of SQL & SPL: Order-based calculation). When the SQL you are using does not support window functions, you can only do this through table joins. Below is SQL query:

select 
    curr.WEATHER_DATE, RAIN_FALL,curr.WIND_GUST_DIR,
    WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW
from 
    weather curr
inner join
    (select 
        pre1.WEATHER_DATE,pre1.WIND_GUST_DIR
    from weather pre1
    left join 
    (select 
        WEATHER_DATE,WIND_GUST_DIR
    from weather) pre2
    on 
        pre1.WEATHER_DATE=pre2.WEATHER_DATE+1
    where 
        pre1.WIND_GUST_DIR='N' and pre2.WIND_GUST_DIR='N') yest
on 
    curr.WEATHER_DATE=yest.WEATHER_DATE+1
where 
    curr.WIND_GUST_DIR='W'
order by WEATHER_DATE

The SQL query is roundabout. Each inter-row access requires a self-join. It is inefficient. SQL introduced window functions in the year 2003 and brought in the concept of order. That has made the order-based calculations slightly easier:

select 
    WEATHER_DATE,RAIN_FALL,WIND_GUST_DIR,
    WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW
from
    (select 
        WEATHER_DATE,RAIN_FALL,WIND_GUST_DIR,
        WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW,
        LAG(WIND_GUST_DIR,1)
        OVER (
        ORDER BY WEATHER_DATE ASC
        ) YESTERDAY_DIR, 
        LAG(WIND_GUST_DIR,2)
        OVER (
        ORDER BY WEATHER_DATE ASC
        ) BEFORE_YESTERDAY_DIR
    from WEATHER)
where 
    WIND_GUST_DIR='W' and YESTERDAY_DIR='N' 
    and BEFORE_YESTERDAY_DIR='N'
order by WEATHER_DATE

SPL solution:

SPL supports accessing a record previous to the current one or after it for order-based calculations.

A
1 =T("weather.csv")
2 =A1.select(WIND_GUST_DIR[-2]=="N" && WIND_GUST_DIR[-1]=="N" && WIND_GUST_DIR=="W")

A1: Import weather data from the source file.

A2: Select records where WIND_GUST_DIR values are west and the previous two values are north.

Summary

SQL gives a good support for concatenation, intersection, union, and difference of two sets but a not good one for XOR. The language does not support theses operations on more sets, and it resorts to roundabout ways to get do jobs. SPL provides special functions for each type of set-oriented operations. This makes SPL code concise, efficient, and easy to understand.

SQL does not support set type field values. When a field value is separated by a certain identifier, SQL cannot perform set include operation on it. The language is awkward and produces complicated code in handling the “belong to” relationship on an ordered subset even with the window function. SPL designs a lot of overall location functions to deal with the set “belong to” relationship. It also supplies a complete set of supports for order-based calculations, which makes it easy to handle inter-row operations.

When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.

The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of set-oriented functions, which combines the advantages of both Java and SQL. With SPL, a set-oriented operation becomes simple and easy.

Clone this wiki locally