-
Notifications
You must be signed in to change notification settings - Fork 339
SPL:order related grouping
Sometimes the order of the data makes sense when grouping. We at times group the adjacent records that have the same field values or that meet certain conditions. For example, find out the nation that ranks in the first of consecutive Olympic gold medals, find out how many days at most that the closing price of a stock has been increased, and so on. This is where order-related grouping comes in.
When grouping an ordered set, a new group will be created when the values of the fields for grouping change.
[e.g. 1] According to the table of the Olympic medal tally, find out the nation with the most consecutive first places and its medal information. Some of the data are as follows:
Game | Nation | Gold | Silver | Copper |
---|---|---|---|---|
30 | USA | 46 | 29 | 29 |
30 | China | 38 | 27 | 23 |
30 | UK | 29 | 17 | 19 |
30 | Russia | 24 | 26 | 32 |
30 | Korea | 13 | 8 | 7 |
… | … | … | … | … |
The option @o of A.group() function in SPL enables to create a new group when field values change.
The SPL script looks like this:
A | |
---|---|
1 | =T("Olympic.txt") |
2 | =A1.sort@z(GAME,GOLD,SILVER,COPPER) |
3 | =A2.group@o1(GAME) |
4 | =A3.group@o(NATION) |
5 | =A4.maxp(~.len()) |
A1: import the Olympic medal table.
A2: sort the Olympic Games and the number of medals (gold, silver, bronze) in descending order.
A3: select one of every Olympic Game, because order is the first one of each game.
A4: create new groups when nations change.
A5: select the group with the largest number of members, which is the group with the most consecutive gold medals.
When an ordered set is grouped, a new group will be created when evaluation result of the grouping condition is true.
[e.g. 2] How many days at most are the closing prices of the Shanghai Composite Index in 2020 consecutively rise? (the rising of the first trading day index). Some of the data are as follows:
DATE | CLOSE | OPEN | VOLUME | AMOUNT |
---|---|---|---|---|
2020/01/02 | 3085.1976 | 3066.3357 | 292470208 | 3.27197122606E11 |
2020/01/03 | 3083.7858 | 3089.022 | 261496667 | 2.89991708382E11 |
2020/01/06 | 3083.4083 | 3070.9088 | 312575842 | 3.31182549906E11 |
2020/01/07 | 3104.8015 | 3085.4882 | 276583111 | 2.88159227657E11 |
2020/01/08 | 3066.8925 | 3094.2389 | 297872553 | 3.06517394459E11 |
… | … | … | … | … |
The option @i of the A.group() function in SPL enables to create a new group when conditions change.
The SPL script looks like this:
A | |
---|---|
1 | =T("SSEC.csv") |
2 | =A1.select(year(DATE)==2020).sort(DATE) |
3 | =A2.group@i(CLOSE<CLOSE[-1]) |
4 | =A3.max(~.len()) |
A1: import the Shanghai Composite Index table.
A2: select the records of 2020 and sort them in ascending order of date.
A3: create a new group when the closing price is less than the closing price of the previous day.
A4: calculate the maximum number of days with consecutive rising.
Sometimes, we can directly or indirectly get the group number (members should be assigned to which group). In this case, we can directly group by the group number.
[e.g. 3] Divide the employee into three groups based on their working years (numbers with a remainder are assigned to certain group), and calculate the average salary of each group. Some of the data are as follows:
ID | NAME | BIRTHDAY | ENTRYDATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | 1974/11/20 | 2005/03/11 | R&D | 7000 |
2 | Ashley | 1980/07/19 | 2008/03/16 | Finance | 11000 |
3 | Rachel | 1970/12/17 | 2010/12/01 | Sales | 9000 |
4 | Emily | 1985/03/07 | 2006/08/15 | HR | 7000 |
5 | Ashley | 1975/05/13 | 2004/07/30 | R&D | 16000 |
… | … | … | … | … | … |
The option @n of the A.group() function in SPL is used to group by sequence number, and records with the same number are assigned to the same group (number N is assigned to Group N, N starts at 1) .
The SPL script looks like this:
A | |
---|---|
1 | =T("Employee.csv").sort(ENTRYDATE) |
2 | =A1.group@n((#-1)*3\A1.len()+ 1) |
3 | =A2.new(#:GROUP_NO, ~.avg(SALARY):AVG_SALARY) |
A1: import the employee table, and sort them by date of entry.
A2: calculate the number of the group to which they belong by the sorted row number, and group them by the number.
A3: calculate the average salary of each group.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code