Skip to content

SPL:order related grouping

esProcSPL edited this page Jul 23, 2024 · 1 revision

Sometimes the order of the data makes sense when grouping. We at times group the adjacent records that have the same field values or that meet certain conditions. For example, find out the nation that ranks in the first of consecutive Olympic gold medals, find out how many days at most that the closing price of a stock has been increased, and so on. This is where order-related grouping comes in.

1. Grouping by consecutive the same values

When grouping an ordered set, a new group will be created when the values of the fields for grouping change.

[e.g. 1] According to the table of the Olympic medal tally, find out the nation with the most consecutive first places and its medal information. Some of the data are as follows:

Game Nation Gold Silver Copper
30 USA 46 29 29
30 China 38 27 23
30 UK 29 17 19
30 Russia 24 26 32
30 Korea 13 8 7

The option @o of A.group() function in SPL enables to create a new group when field values change.

The SPL script looks like this:

A
1 =T("Olympic.txt")
2 =A1.sort@z(GAME,GOLD,SILVER,COPPER)
3 =A2.group@o1(GAME)
4 =A3.group@o(NATION)
5 =A4.maxp(~.len())

A1: import the Olympic medal table.

A2: sort the Olympic Games and the number of medals (gold, silver, bronze) in descending order.

A3: select one of every Olympic Game, because order is the first one of each game.

A4: create new groups when nations change.

A5: select the group with the largest number of members, which is the group with the most consecutive gold medals.

2. Grouping by adjacent conditions

When an ordered set is grouped, a new group will be created when evaluation result of the grouping condition is true.

[e.g. 2] How many days at most are the closing prices of the Shanghai Composite Index in 2020 consecutively rise? (the rising of the first trading day index). Some of the data are as follows:

DATE CLOSE OPEN VOLUME AMOUNT
2020/01/02 3085.1976 3066.3357 292470208 3.27197122606E11
2020/01/03 3083.7858 3089.022 261496667 2.89991708382E11
2020/01/06 3083.4083 3070.9088 312575842 3.31182549906E11
2020/01/07 3104.8015 3085.4882 276583111 2.88159227657E11
2020/01/08 3066.8925 3094.2389 297872553 3.06517394459E11

The option @i of the A.group() function in SPL enables to create a new group when conditions change.

The SPL script looks like this:

A
1 =T("SSEC.csv")
2 =A1.select(year(DATE)==2020).sort(DATE)
3 =A2.group@i(CLOSE<CLOSE[-1])
4 =A3.max(~.len())

A1: import the Shanghai Composite Index table.

A2: select the records of 2020 and sort them in ascending order of date.

A3: create a new group when the closing price is less than the closing price of the previous day.

A4: calculate the maximum number of days with consecutive rising.

3. Grouping by sequence numbers

Sometimes, we can directly or indirectly get the group number (members should be assigned to which group). In this case, we can directly group by the group number.

[e.g. 3] Divide the employee into three groups based on their working years (numbers with a remainder are assigned to certain group), and calculate the average salary of each group. Some of the data are as follows:

ID NAME BIRTHDAY ENTRYDATE DEPT SALARY
1 Rebecca 1974/11/20 2005/03/11 R&D 7000
2 Ashley 1980/07/19 2008/03/16 Finance 11000
3 Rachel 1970/12/17 2010/12/01 Sales 9000
4 Emily 1985/03/07 2006/08/15 HR 7000
5 Ashley 1975/05/13 2004/07/30 R&D 16000

The option @n of the A.group() function in SPL is used to group by sequence number, and records with the same number are assigned to the same group (number N is assigned to Group N, N starts at 1) .

The SPL script looks like this:

A
1 =T("Employee.csv").sort(ENTRYDATE)
2 =A1.group@n((#-1)*3\A1.len()+ 1)
3 =A2.new(#:GROUP_NO, ~.avg(SALARY):AVG_SALARY)

A1: import the employee table, and sort them by date of entry.

A2: calculate the number of the group to which they belong by the sorted row number, and group them by the number.

A3: calculate the average salary of each group.

Employee.csv

Olympic.txt

SSEC.csv

Clone this wiki locally