-
Notifications
You must be signed in to change notification settings - Fork 339
Unconventional aggregation
The standard SQL provides five most commonly used aggregation operations: SUM/COUNT/AVG/MIN/MAX. Observing these operations, we find that they can all be seen as a function that returns a single value with a set as a parameter. Let's first understand this commonality as the definition of aggregation operation, which involves turning a set to a single value, that is to turn multiple values to one, resulting in aggregation. Therefore, it is called aggregation operation.
Then obviously, aggregation operations can be applied when there is a set, so operations like SUM/COUNT can be implemented on a data table (record set).
The result of grouping operation is a batch of grouped subsets, and aggregation operation can also be applied on each subset, which is the grouping operation of SQL. In fact, the aggregation operation on the entire set can also be understood as on a special group that is only divided into one group (which is also a complete partition). After this understanding, we can consider that the aggregation operation always occurs after the grouping operation (but there may not always be an aggregation operation after the grouping operation, as mentioned earlier). Moreover, it can also be said that as long as it is recognized as an aggregation operation (an operation that meets the above definition), it can definitely be used after grouping. We will see below that this understanding will greatly expand the application scope of grouping plus aggregation operations.
In addition to these five aggregation operations, some databases also provide aggregation functions such as variance and standard deviation, which have similar properties to these five operations and can be called conventional aggregation operations. Let's now study other forms of aggregation operations that are meaningful in business.
Return record
The conventional aggregation mentioned above is for numerical operations, especially for structured data, it is for a certain field (or expression), and the return value is also the result of these numerical operations. But sometimes what we care about is not the result value itself, but the information related to the result value.
For example, we want to find the IP address used by a user during his/her first login from the log table, rather than the login time. Writing this operation in standard SQL would be like this:
SELECT ip_address FROM LogTable WHERE user=? AND logintime=
(SELECT MIN(logintime) FROM LogTable WHERE user=?)
Using a subquery, first calculate the first login time of the user, and then search for the IP address used at that time. This requires traversing the dataset twice.
ORACLE provides a KEEP function that can perform this operation without using subquery:
SELECT MIN(ip_address) KEEP(DENSE_RANK FIRST ORDER BY logintime) FROM LogTable WHERE user=?
However, what we are concerned about may not only be the IP address, but also other fields in the log table, such as the browser used and whether it is a mobile device, actually we are interested in the complete record corresponding to the minimum value. Due to the lack of discreteness in SQL, even with the KEEP function, it is not easy to write such operations. Either use KEEP for each field separately, or traverse twice with subqueries, both of which are cumbersome.
If there is an aggregation function that can return the record corresponding to the maximum/minimum value rather than the value itself, this operation would be simple to write and only needs to traverse once. SPL provides a minp function, and the above task can be written as follows:
LogTable.select(user=?).minp(logintime)
As mentioned earlier, such aggregation operations can also be used in grouping, such as finding the log records of each user's first login:
LogTable.group(user).(~.minp(logintime))
Similarly, SPL also has a maxp function to return the record corresponding to the maximum value.
Logging is often organized according to the time of event occurrence, which eliminates the need for comparison to calculate the minimum value. Instead, the first entry can be directly retrieved.
LogTable.select(user=?).m(1)
It can also be done in grouping, taking advantage of the order of the ordered grouping result set we discussed earlier:
LogTable.group(user).(~.m(1))
This type of operation is fairly common, and SPL has made it a function option, which can be written more simply:
LogTable.select@1(user=?)
LogTable.group@1(user)
SQL is based on the concept of unordered sets and cannot guarantee the order of the returned records. To perform this operation, it is necessary to manually create a sequence number and then use conditions to filter.
Return set
Let's modify the above task to: find the names of the youngest individuals in a group of people.
Unlike the previous issue, the same user can not have multiple identical login times, but there may be people of the same age in a group, and there may be more than one youngest person. The return value of the minp function should be a set to be reasonable.
When we reconsider our definition of aggregation operations, we will find that the requirement to return a single value is not necessary. As long as the parameter is a set, anything returned can be considered as an aggregation operation. Under this definition, allowing the minp/maxp function to return a set and still treat it as an aggregation operation is sufficient.
The more common aggregation operation that requires returning a set is TopN.
SQL does not understand TopN as an aggregation operation, but rather as a modifier when returning the result set. In principle, SQL first calculates the complete result set and then returns only the first N rows. TopN always comes after the sorting action. Sorting a large set is a time-consuming action, but in reality, only doing TopN does not require sorting the entire set. At this point, we can only rely on the database optimizer. In addition, as a modification of the result set, this operation cannot be applied to grouped subsets. The writing of TopN in groups is completely different from the writing of TopN of the entire set, and in this case, the operation becomes more complex, and the optimizer often fails, resulting in slow results.
SPL understands TopN as an aggregation operation, and many things become very easy. For example, to calculate the time interval between the last two logins of a user:
a=LogTable.select(user=?).top(logingtime,-2), a(2)-a(1) // The last two login interval for a certain user
LogTable.groups(user;(a=~.top(logintime,-2),a(2)-a(1))) // The last two login intervals for each user
Moreover, implementing calculations does not require deliberate engineering optimization, and high performance can also be achieved when used after grouping.
TopN also has the situation of returning records, that is, retrieving the corresponding records of a field (expression) ranking in the topN. Similar to minp/maxp, SPL's top function also supports this feature.
Similarly, ordered situations can also occur, such as in the previous log calculation. If it is assumed that the log table is already ordered by the event time, then calculating TopN does not require further comparison operations.
a=LogTable.select(user=?).top(0,-2),a(2)-a(1) // Use 0 to indicate no comparison
LogTable.groups(user;(a=~.top(0,-2),a(2)-a(1)))
Here we discussed two common cases of unconventional aggregation: returning record and returning set, both of which are not directly supported in SQL. Of course, according to the definition, there will be more forms of aggregation operations, and even in these two cases, there will be many variants, such as getting the member with the sorting position centered, and getting the set of DISTINCT values for a certain field. A deep understanding of aggregation operations and their relationship with grouping operations will expand the application scope of these operations and have significant implications for the description and implementation of calculations.
You can also understand the differences between SPL and SQL.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code