Skip to content

SPL:grouping and sorting aligned by sequence number

esProcSPL edited this page Jul 24, 2024 · 1 revision

Sometimes we group and sort the data in order of sequence number, grouping members with the same sequence number into the same group. For example, calculate the total sales of each month last year in order from January to December, calculate the numbers of accessors to a website according to the order from Monday to Sunday, and so on.

This kind of grouping operation, which is aligned to a specified criterion, is collectively referred to as alignment grouping. Grouping by sequence alignment is a special case of alignment grouping where the base set is a sequence of integers starting from 1. Alignment groups may have empty groups or members that are not assigned to any of the groups.

1. Sorting by sequence number

Sort the data in the specified sequence order, and retain up to one matching member per group. This is suitable for situations where we want to query or use data in a specified order.

[e.g. 1] Query the sales records in order from Monday to Sunday according to the daily sales table. Some of the data are as follows:

Week Day Amount
5 Sunday 1101.2
5 Saturday 538.6
5 Friday 2142.4
5 Thursday 1456.0
5 Wednesday 48.0
5 Tuesday 1376.0
5 Monday 676.0
4 Sunday 448.0
4 Saturday 4031.0
4 Friday 364.8

The A.align(n,y) function in SPL is used to align the groups which are directly divided into n groups (from 1 to n), and calculate straight the group number of each member by grouping expression y. By default, each group retains up to one matching member.

The SPL script looks like this:

A
1 =T("DailySales.csv")
2 ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
3 =A1.group(Week; ~.align(7,A2.pos(Day)):WeekSales)
4 =A3.conj(WeekSales)
5 =A4.select(~)

A1: query the daily sales table.

A2: define the sequence from Monday to Sunday.

A3: when grouping the sales data by week, use the A.align(n,y) function to align sales records of each week in the order defined by A2. It is important to note here that aligned groups may have empty groups. For example, when no sales were recorded on Friday and Saturday of the second week, there are still seven groups per week:

Week
[2,Monday,1194.0]
[2,Tuesday,1622.4]
[2,Wednesday,319.2]
[2,Thursday,802.0]
(null)
(null)
[2,Sunday,2123.2]

A4: concatenate the sorted sales records of each week.

A5: select non-empty records from the result set.

The execution results of A5 are as follows:

Week Day Amount
1 Monday 3063.0
1 Tuesday 3868.6
1 Wednesday 2713.5
1 Thursday 1005.9
1 Friday 1675.0
1 Saturday 400.0
1 Sunday 2018.2
2 Monday 1194.0
2 Tuesday 1622.4
2 Wednesday 319.2

2. Retaining all matching members of each group

Group the data in a specified sequence order, each retain all matching members. This applies to situations where we care about the information of members in each group, or where we need to continue calculating with these member records.

[e.g. 2] According to the sales table, list the total sales of each month in 2014 in order. Some data of the sales table are as follows:

ID CustomerID OrderDate Amount
10248 VINET 2013/07/04 2440
10249 TOMSP 2013/07/05 1863.4
10250 HANAR 2013/07/08 1813.0
10251 VICTE 2013/07/08 670.8
10252 SUPRD 2013/07/09 3730.0

option @a of the A.align(n,y) function in SPL is used to keep all matching members of each group while aligning the groups.

The SPL script looks like this:

A
1 =T("Sales.csv")
2 =A1.select(year(ORDERDATE)==2014)
3 =A2.align@a(12,month(ORDERDATE))
4 =A3.new(#:Month,~.sum(AMOUNT):AMOUNT)

A1: query the sales table.

A2: select the records of 2014 from the sales table.

A3: use the A.align@a(n,y) function to sort the months of the order table into 12 groups in order from 1 to 12, with the option @a retaining all matching members of each group.

A4: calculate the total sales of each month.

The execution results of A4 are as follows:

Month Amount
1 66692.8
2 52207.2
3 39979.9
4 60699.39

3. Overlapped grouping by sequence number

Sometimes the sequence number of groups calculated by each record is more than one, and we want the record to be repeatedly assigned to multiple groups according to the sequence numbers.

[e.g. 3] According to the posting record table, group the posts by label, and count the frequency of each label. Some of the data in the posting record table are as follows:

ID Title Author Label
1 Easy analysis of Excel 2 Excel,ETL,Import,Export
2 Early commute: Easy to pivot excel 3 Excel,Pivot,Python
3 Initial experience of SPL 1 Basics,Introduction
4 Talking about set and reference 4 Set,Reference,Dispersed,SQL
5 Early commute: Better weapon than Python 4 Python,Contrast,Install

option @r of the A.align(n,y) function in SPL is used to create overlapped groups by the sequence numbers when aligning groups.

The SPL script looks like this:

A
1 =T("PostRecord.txt")
2 =A1.conj(Label.split(",")).id()
3 =A1.align@ar(A2.len(),A2.pos(Label.split(",")))
4 =A3.new(A2(#):Label,~.count():Count).sort@z(Count)

A1: query the posting record table.

A2: separate the labels by comma and concatenate them into a sequence to get all the labels with no overlap values.

A3: use the A.align@r(n,y) function to group the posts by their labels’ position in all labels with the option @r.

A4: calculate the number of posts per label,and sort them in descending order.

The execution results of A4 are as follows:

Label Count
SPL 7
Excel 6
Basics 5
Clone this wiki locally