-
Notifications
You must be signed in to change notification settings - Fork 339
User Behavior Analysis in Practice 7:Dimension Table Filtering
We have a user events table T. Below is its structure and part of its data:
Time | UserID | ProductID | Quantity | … |
---|---|---|---|---|
2022/6/1 10:20 | 1072755 | 1 | 7 | … |
2022/6/1 12:12 | 1078030 | 2 | 8 | … |
2022/6/1 12:36 | 1005093 | 3 | 3 | … |
2022/6/1 13:21 | 1048655 | 4 | 9 | … |
2022/6/1 14:46 | 1037824 | 5 | 5 | … |
2022/6/1 15:19 | 1049626 | 6 | 4 | … |
2022/6/1 16:00 | 1009296 | 7 | 6 | … |
2022/6/1 16:39 | 1070713 | 8 | 7 | … |
2022/6/1 17:40 | 1090884 | 9 | 4 | … |
Fields in table T: |
Field name | Data type | Description |
---|---|---|
Time | Datetime | Time stamp of an event, accurate to milliseconds |
UserID | String | User ID |
ProductID | Integer | Product ID |
Quantity | Numeric | Quantity |
Dimension table Product: |
ProductID | ProductName | Unit | Price | ProductTypeID |
---|---|---|---|---|
1 | Apple | Pound | 5.5 | 1 |
2 | Tissue | Packs | 16 | 2 |
3 | Beef | Pound | 35 | 3 |
4 | Wine | Bottles | 120 | 4 |
5 | Pork | Pound | 25 | 3 |
6 | Bread | Packs | 10 | 5 |
7 | Juice | Bottles | 6 | 4 |
… | … | … | … | … |
Fields of dimension table Product: |
Field name | Data type | Description |
---|---|---|
ProductID | String | Product ID |
ProductName | String | Product name |
Unit | String | Sales unit |
Price | Numeric | Unit price |
ProductTypeID | Integer | Product type ID |
Computing task: |
Find the number of sold pieces for each product under type IDs 1, 2 and 3 within a specified time period.
1. Filtering dimension table before join
We first filter the dimension table, and then join it with the fact table. Thereisnoneedtojudgethespecifiedconditionondimensiontableforfacttablerecordsthatdonotmatchthefiltereddimensiontable. In this way the number of comparisons will decrease considerably.
2. Performing join in cursor
Join the fact table’s cursor with the filtered dimension table, during which fact table records that cannot associate with the dimension table will not be generated. This further reduces time spent in generating records in the joining result set.
3. Index reuse
Use the existing index on the filtered dimension table. When the filtering result set is still large, which means only a small number of records are filtered away, the technique saves your time for re-creating the index.
4. Alignment sequence
We can use an efficient alignment sequence to filter a numberized dimension table by generating a same-length table sequence, whose member values are whether the corresponding dimension records meet the filtering condition. When joining the dimension table with the fact table, we can directly find whether a fact table record matches or not according to its ordinal number. This saves time for performing the join and comparisons.
Suppose T.ctx and Product.btx are already created as above introduces. T.ctx is ordered by Time and Product.btx is ordered by ProductID.
1. Filter dimension table Product and join it with T; delete records from table T that do not have matching records in the dimension table during joining.
A | |
---|---|
1 | >Product=file("Product.btx").import@b().select([1,2,3].pos(ProductTypeID)!=null).keys@i(ProductID) |
2 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
3 | =file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end) |
4 | =A3.switch**@i**(ProductID,Product:ProductID) |
5 | =A4.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity) |
A1 Import the dimension table and filter it, and set index on the primary key.
A4 Join the fact table and the dimension table; @i option enables deleting records from the fact table that do not match the dimension table.
2. Move the association action in step 1 to the cursor generation statement, during which fact table records that cannot match won’t be generated in the cursor.
A | |
---|---|
1 | >Product=file("Product.btx").import@b().select([1,2,3].pos(ProductTypeID)!=null).keys@i(ProductID) |
2 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
3 | =file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end,ProductID:Product) |
4 | =A3.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity) |
A3 Move the action of association between the fact table and the dimension table to the cursor generation statement; fact table records that are not matched won’t be generated in the result set.
3. Use the existing index on filtered dimension table Product
A | |
---|---|
1 | >Product=file("Product.btx").import@b().keys@i(ProductID) |
2 | =Product.select**@i**([1,2,3].pos(ProductTypeID)!=null) |
3 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
4 | =file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end,ProductID:A2) |
5 | =A4.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity) |
A1 Import dimension table Product and set index on primary key.
A2 @i option enables reusing the existing index on the filtered dimension table.
Index reuse does not always make the computation faster. As records that are filtered away need to be deleted from the index table, it takes some time to perform the delete action when there are a lot of records are discarded (the number of desired records is small). In this case re-creating the index may be faster. It is important that we choose an appropriate method according to the actual situation.
4. Use alignment sequence as primary key values of dimension table Product are ordinal numbers
A | |
---|---|
1 | >Product=file("Product.btx").import@b() |
2 | Product.([1,2,3].pos(ProductTypeID))= |
3 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
4 | =file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end) |
5 | =A4.select(A2(ProductID)) |
6 | =A5.groups(ProductID; Product(ProductID).ProductName, sum(Quantity):Quantity) |
A1 Import dimension table Product without setting index on the primary key.
A2 Generate an alignment sequence according to Product table. Each of its values is whether the current record meets the filtering condition or not.
A5 Directly locate the target member in A2 according to table T’s ProductID to find out whether the current record satisfies the filtering condition.
5. Move the alignment-sequence-based filtering action to the cursor generation statement
A | |
---|---|
1 | >Product=file("Product.btx").import@b() |
2 | Product.([1,2,3].pos(ProductTypeID))= |
3 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
4 | =file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end && A2(ProductID) ) |
5 | =A4.groups(ProductID; Product(ProductID).ProductName, sum(Quantity):Quantity) |
A4 Move the alignment-sequence-based filtering action to the cursor generation statement. Records that do not meet the filtering condition won’t be generated.
Execution result:
ProductID | ProductName | Quantity |
---|---|---|
1 | Apple | 206938 |
2 | Tissue | 463188 |
3 | Beef | 94378 |
5 | Pork | 217504 |
… | … | … |
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code