Skip to content

SPL:grouping by enumeration criteria

esProcSPL edited this page Jul 30, 2024 · 4 revisions

When grouping, we often need to assign records that meet the same criterion to the same group. For example, divide cities into large, medium and small ones according to the urban population, sort families into rich, middle class, well-off, and poor ones according to annual income, and so forth. According to the pre-defined enumeration criteria, the qualified members are assigned to the corresponding group, which is referred to as the enumeration grouping.

1. Saving each member in the first matching group

The members of the set to be grouped are taken as parameters to calculate the criteria according to the order of the enumeration criterion set defined in advance.

[e.g. 1] According to the city population of Chinese cities, classify and calculate the cities. Cities with more than 2 million people are megacity behemoths, cities with 1-2 million people are megacities, cities with 0.5-1 million people are big cities, and others are medium-sized and small cities. Some of the data are as follows:

ID CITY POPULATION PROVINCE
1 Shanghai 12286274 Shanghai
2 Beijing 9931140 Beijing
3 Chongqing 7421420 Chongqing
4 Guangzhou 7240465 Guangdong
5 Hong Kong 7010000 Hong Kong Special Administrative Region

The A.enum() function in SPL is used for enumeration grouping, and by default each member is saved only in the first matching group.

The SPL script looks like this:

A
1 =T("UrbanPopulation.txt")
2 [?>2000000,?>1000000,?>500000,?<=500000]
3 =A1.enum(A2,POPULATION)
4 [megacity behemoths, megacities, big cities, medium-sized and small cities]
5 =A3.new(A4(#):CITY_TYPE,~.count():COUNT)

A1: import the city population table.

A2: define grouping criteria.

A3: use the A.enum() function to enumerate and group city population tables by criteria, and by default save each member in the first matching group.

A4: define the name of each group.

A5: calculate the number of cities in each group based on the results of enumerated grouping.

2. Saving mismatching members in a new group

When grouping by enumeration criteria, there may be members that do not match any of the criteria, and we can save them in a new group. This applies to scenarios where we care not only about the information of matching members, but also about other mismatching members.

[e.g. 2] To calculate the average salaries by group, the employees are divided into three groups by age: those under 35, those under 45, and others. The data of the employee table are as follows:

ID NAME BIRTHDAY STATE DEPT SALARY
1 Rebecca 1974/11/20 California R&D 7000
2 Ashley 1980/07/19 New York Finance 11000
3 Rachel 1970/12/17 New Mexico Sales 9000
4 Emily 1985/03/07 Texas HR 7000
5 Ashley 1975/05/13 Texas R&D 16000

option @n of the A.enum() function in SPL is used to save mismatching members to a new group when grouping by enumeration.

The SPL script is as follows:

A
1 =T("Employee.csv")
2 [?<35,?<45]
3 =A1.enum@n(A2, age(BIRTHDAY))
4 [under 35, under 45, other]
5 =A3.new(A4(#):AGE_AREA,~.avg(SALARY):AVG_SALARY)

A1: import the employee table.

A2: define grouping criteria.

A3: use the A.enum() function to enumerate groups by age, use the option @n to save the mismatching members to a new group.

A4: define the name of each group.

A5: calculate the average salary per employee based on the results of enumerated grouping.

3. Overlapped grouping according to enumeration criteria

Sometimes we need to redistribute members that meet multiple grouping criteria to all corresponding groups.

[e.g.3] According to the city GDP table, calculate the per capita GDP of the municipalities, the first-tier cities and the second-tier cities respectively. It should be noted that the subgroups may have overlapped members, for example, Beijing is both a first-tier city and a municipality. Some of the data are as follows:

ID CITY GDP POPULATION
1 Shanghai 32679 2418
2 Beijing 30320 2171
3 Shenzhen 24691 1253
4 Guangzhou 23000 1450
5 Chongqing 20363 3372

The option @r of the A.enum() function in SPL is used to query whether all the members match or not for each group when enumerating the groups.

The SPL script looks like this:

A
1 =T("CityGDP.txt")
2 "Beijing","Shanghai","Tianjin","Chongqing"],["Beijing","Shanghai","Guangzhou","Shenzhen"],["Chengdu","Hangzhou","Chongqing","Wuhan","Xian","Suzhou","Tianjin","Nanjing","Changsha","Zhengzhou","Dongguan","Qingdao","Shenyang","Ningbo","Kunming"
3 [A2(1).pos(?)>0,A2(2).pos(?)>0,A2(3).pos(?)>0]
4 =A1.enum@r(A3,CITY)
5 [municipalities, first-tier cities, second-tier cities]
6 =A4.new(A5(#):AREA,~.sum(GDP)/~.sum(POPULATION)*10000:CAPITA_GDP)

A1: import the city GDP table.

A2: define municipalities, first- and second-tier city constant variables.

A3: enumerate the criteria of municipalities, first-tier cities and second-tier cities.

A4: use the A.enum() function to group by enumeration criteria, and use the option @r to query whether all members of each group match or not.

A5: define the name of each group.

A6: calculate the per capita GDP for each group based on the results of enumerated grouping.

CityGDP.txt

Employee.csv

UrbanPopulation.txt

Clone this wiki locally