
User Behavior Analysis in Practice 10: Ordered Storage by Account


Target task

We have a user events table T. Below is its structure and part of its data:

Time UserID EventType
2022/6/1 10:20 1072755 Search
2022/6/1 12:12 1078030 Browse
2022/6/1 12:36 1005093 Submit
2022/6/1 13:21 1048655 Login
2022/6/1 14:46 1037824 Logout
2022/6/1 15:19 1049626 AddtoCart
2022/6/1 16:00 1009296 Submit
2022/6/1 16:39 1070713 Browse
2022/6/1 17:40 1090884 Search

Fields in table T:

Field name Data type Description
Time Datetime Time stamp of an event, accurate to milliseconds
UserID String User ID
EventType String Event type

Computing task:

Find the number of occurrences and the number of distinct users for each type of event within a specified time period.

The key point is that the number of distinct users is too large for them to fit into memory.

Techniques involved:

Detailed explanations of order-based distinct and order-based increment can be found in SQL Performance Enhancement: DISTINCT & COUNT(DISTINCT) on Big Data and in Ordered Storage of SPL.

1. Sort data by account and perform a distinct operation

Sort table T by user ID and save it as ordered data. This enables a fast distinct operation: to perform a distinct count on an ordered field, we just compare each record with the previous one during the traversal, without needing to retain a result set in memory. It is fast and won't cause memory overflow.
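As an illustration (a minimal sketch, assuming the composite table T.ctx built in the sample code below, which is already ordered by UserID), an order-based distinct count over the whole table can be written as:

A
1	=file("T.ctx").open().cursor(UserID)
2	=A1.total(icount@o(UserID))

A2 traverses the cursor once; icount@o only compares each UserID with the previous one, so no result set is retained in memory.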

2. External sorting

When the volume of data is too large to be wholly loaded into memory, sorting it in the database is slow. An alternative is to export the data to a bin file, sort it in esProc, and save the result set as a composite table file.

3. Order-based increment

Usually the newly-generated data isn't ordered by the account field, so it cannot be directly appended to the original ordered data. Yet combining the new data with the old ordered data and sorting them all over again is time-consuming.

The composite table's patch table is SPL's solution. We maintain a small ordered table, called the patch table, and merge the sorted newly-generated data into it while keeping the original composite table unchanged. After a certain period of time, when the patch table has grown to a certain size, we merge it into the original composite table. To perform a distinct operation, we need to retrieve data from both the original composite table and the patch table, and merge-traverse them. The performance is a little lower than with a single ordered table, but the orderliness still enables a fast distinct operation.

When to merge the patch table into the original composite table depends on the data update cycle. If new data is generated every day, we can perform the merge once a month. During that month, the patch table contains only the past month's data while the original composite table holds all data before it, so the former is much smaller than the latter. This means the amount of data to be merged each day is small and the daily append is fast. The order-based merge of the whole data occurs only once a month, so even if it takes relatively long to finish, it is acceptable.

4. Parallel processing

Specify segmenting by UserID when creating the composite table, so that records with the same UserID value are put into the same segment. Then generate a multicursor on the composite table and use parallel processing to summarize data faster.

Sample code

1. Basic method of sorting table T by UserID and dumping it:

A
1 =connect("demo").cursor@x("select * from T order by UserID,Time")
2 =file("T.ctx").create@p(#UserID,#Time,EventType,…)
3 =A2.append(A1)
4 >A2.close()

A1 Import the fact table while sorting it by UserID and Time.

A2 Create the composite table file's structure. The @p option enables segmenting the composite table by its first field (UserID), which guarantees that records with the same UserID will not be given to different threads during parallel processing and thus ensures a correct result. Without this option, the table is simply segmented by the number of records, and records with the same UserID would probably be allocated to two threads, which could cause errors in an order-based distinct aggregation. The previous computations in this series did not use the order-based distinct aggregation algorithm, so there were no special segmentation requirements and the @p option was not needed for correct parallel computation.

A3 Export and append data of the fact table to A2’s composite table file.
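To sanity-check the dump (a hypothetical verification step, not part of the routine itself), we can open the composite table and fetch a few records:

A
1	=file("T.ctx").open().cursor().fetch(10)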

2. External sorting

If the volume of data is too large to fit into memory and a hard disk buffer is needed for sorting, database sorting will be extremely slow. A better alternative is to export the data to a bin file, sort it with esProc and save the result as a composite table file, as the following code shows:

A
1 =connect("demo").cursor@x("select * from T")
2 =file("T.btx").export@b(A1)
3 =file("T.ctx").create@p(#UserID,#Time,EventType,…)
4 =file("T.btx").cursor@b(UserID,Time,EventType,…).sortx(UserID,Time;1000000)
5 =A3.append(A4)
6 >A3.close()

A1 Import the fact table without sorting it.

A2 Export data in the fact table to a bin file.

A3 Create composite table structure.

A4 Open the bin file as a cursor and sort the data by UserID and Time. Parameter n in the sortx(…;n) function specifies the number of buffered rows. If a suitable number is not known, the parameter can be omitted and sortx will automatically calculate a reasonably suitable value. It is generally recommended that the buffer not exceed half of the available memory, because heavier memory usage slows down the Java virtual machine. As long as memory is sufficient, the larger the buffer, the faster the sort.
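For instance, if a suitable buffer size is unknown, A4 can simply omit the parameter and let sortx choose one automatically:

=file("T.btx").cursor@b(UserID,Time,EventType,…).sortx(UserID,Time)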

A5 Append the sorted data to the composite table file.

3. Order-based increment

The increased data: we use the time stamp to identify the newly-added records. Each day after midnight, the previous day's data is appended to the composite table file:

A B
1 =connect("demo").cursor@x("select * from T where Time>=? and Time<? order by UserID,Time",date(now()-1), date(now()))
2 =file("T.ctx").open()
3 =A2.append@a(A1)
4 if day(now())==1	>A2.reset()
5 >A2.close()

A1 Import the newly-increased data from the fact table. As the increment is small, we can sort it by UserID and Time directly in the database.

A2 Open the composite table file.

A3 Append the increased data to the composite table file's patch table. The @a option ensures the data is written to the patch table by appending.

A4-B4 Check whether the current date is the first day of the month. If it is, merge the patch table into the composite table and then empty the patch table using the reset() function.

Since the @p option was used when creating the composite table, SPL automatically puts the newly-appended data into the corresponding segments, so no error occurs in subsequent parallel computations.
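Equivalently (a sketch under the assumption that the merge is scheduled as a separate monthly job rather than checked inside the daily script), the merge can run on its own:

A
1	=file("T.ctx").open()
2	>A1.reset()
3	>A1.close()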

4. Count distinct users based on the composite table ordered by UserID

A
1 >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
2 =file("T.ctx").open().cursor@m(UserID, EventType;Time>=start && Time<=end;2).groups(EventType; count(1):Num, icount@o(UserID):iNum)

A2 Filter table T in the cursor and perform grouping & aggregation. The @o option of icount() indicates that the data is ordered by the distinct-count field. Since the @p option was used in the data preparation phase, the @m option alone is enough here to implement a correct parallel computation.
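For reference (a comparison variant, not the recommended form), the same aggregation can run serially by dropping the @m option; icount@o still applies because the table itself is ordered by UserID:

A
1	=file("T.ctx").open().cursor(UserID,EventType;Time>=start && Time<=end).groups(EventType; count(1):Num, icount@o(UserID):iNum)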

The unsolved problem: before performing the distinct count, data also needs to be filtered by the Time field. Yet the data has been sorted by UserID for the distinct count, so Time is not the first ordering field. The filter on Time therefore cannot be executed fast; it can only be applied to the data ordered within each UserID. The data volume for each UserID, however, is very small, generally less than one data block, so blocks can seldom be skipped and the effect is equivalent to traversing all data. In scenarios where the time span of the total data is large but that of the queried data is small, this approach performs unsatisfactorily. We will introduce the SPL solution that addresses both the distinct count on UserID and the filtering on Time in another essay (which focuses on the bi-dimension ordering structure).

Execution result:

EventType Num iNum
AddtoCart 1845674 175476
Browse 3578901 348791
Login 4033000 393400
Logout 4033000 393400
Search 2947931 257539
Submit 867345 83375