Skip to content

User Behavior Analysis in Practice 7:Dimension Table Filtering

esProcSPL edited this page Oct 18, 2024 · 1 revision

Target task

We have a user events table T. Below is its structure and part of its data:

Time UserID ProductID Quantity
2022/6/1 10:20 1072755 1 7
2022/6/1 12:12 1078030 2 8
2022/6/1 12:36 1005093 3 3
2022/6/1 13:21 1048655 4 9
2022/6/1 14:46 1037824 5 5
2022/6/1 15:19 1049626 6 4
2022/6/1 16:00 1009296 7 6
2022/6/1 16:39 1070713 8 7
2022/6/1 17:40 1090884 9 4
Fields in table T:
Field name Data type Description
Time Datetime Time stamp of an event, accurate to milliseconds
UserID String User ID
ProductID Integer Product ID
Quantity Numeric Quantity
Dimension table Product:
ProductID ProductName Unit Price ProductTypeID
1 Apple Pound 5.5 1
2 Tissue Packs 16 2
3 Beef Pound 35 3
4 Wine Bottles 120 4
5 Pork Pound 25 3
6 Bread Packs 10 5
7 Juice Bottles 6 4
Fields of dimension table Product:
Field name Data type Description
ProductID String Product ID
ProductName String Product name
Unit String Sales unit
Price Numeric Unit price
ProductTypeID Integer Product type ID
Computing task:

Find the number of sold pieces for each product under type IDs 1, 2 and 3 within a specified time period.

Techniques involved:

1. Filtering dimension table before join

We first filter the dimension table, and then join it with the fact table. Thereisnoneedtojudgethespecifiedconditionondimensiontableforfacttablerecordsthatdonotmatchthefiltereddimensiontable. In this way the number of comparisons will decrease considerably.

2. Performing join in cursor

Join the fact table’s cursor with the filtered dimension table, during which fact table records that cannot associate with the dimension table will not be generated. This further reduces time spent in generating records in the joining result set.

3. Index reuse

Use the existing index on the filtered dimension table. When the filtering result set is still large, which means only a small number of records are filtered away, the technique saves your time for re-creating the index.

4. Alignment sequence

We can use an efficient alignment sequence to filter a numberized dimension table by generating a same-length table sequence, whose member values are whether the corresponding dimension records meet the filtering condition. When joining the dimension table with the fact table, we can directly find whether a fact table record matches or not according to its ordinal number. This saves time for performing the join and comparisons.

Sample code

Suppose T.ctx and Product.btx are already created as above introduces. T.ctx is ordered by Time and Product.btx is ordered by ProductID.

1. Filter dimension table Product and join it with T; delete records from table T that do not have matching records in the dimension table during joining.

A
1 >Product=file("Product.btx").import@b().select([1,2,3].pos(ProductTypeID)!=null).keys@i(ProductID)
2 >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
3 =file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end)
4 =A3.switch**@i**(ProductID,Product:ProductID)
5 =A4.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A1 Import the dimension table and filter it, and set index on the primary key.

A4 Join the fact table and the dimension table; @i option enables deleting records from the fact table that do not match the dimension table.

2. Move the association action in step 1 to the cursor generation statement, during which fact table records that cannot match won’t be generated in the cursor.

A
1 >Product=file("Product.btx").import@b().select([1,2,3].pos(ProductTypeID)!=null).keys@i(ProductID)
2 >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
3 =file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end,ProductID:Product)
4 =A3.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A3 Move the action of association between the fact table and the dimension table to the cursor generation statement; fact table records that are not matched won’t be generated in the result set.

3. Use the existing index on filtered dimension table Product

A
1 >Product=file("Product.btx").import@b().keys@i(ProductID)
2 =Product.select**@i**([1,2,3].pos(ProductTypeID)!=null)
3 >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
4 =file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end,ProductID:A2)
5 =A4.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A1 Import dimension table Product and set index on primary key.

A2 @i option enables reusing the existing index on the filtered dimension table.

Index reuse does not always make the computation faster. As records that are filtered away need to be deleted from the index table, it takes some time to perform the delete action when there are a lot of records are discarded (the number of desired records is small). In this case re-creating the index may be faster. It is important that we choose an appropriate method according to the actual situation.

4. Use alignment sequence as primary key values of dimension table Product are ordinal numbers

A
1 >Product=file("Product.btx").import@b()
2 Product.([1,2,3].pos(ProductTypeID))=
3 >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
4 =file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end)
5 =A4.select(A2(ProductID))
6 =A5.groups(ProductID; Product(ProductID).ProductName, sum(Quantity):Quantity)

A1 Import dimension table Product without setting index on the primary key.

A2 Generate an alignment sequence according to Product table. Each of its values is whether the current record meets the filtering condition or not.

A5 Directly locate the target member in A2 according to table T’s ProductID to find out whether the current record satisfies the filtering condition.

5. Move the alignment-sequence-based filtering action to the cursor generation statement

A
1 >Product=file("Product.btx").import@b()
2 Product.([1,2,3].pos(ProductTypeID))=
3 >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
4 =file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end && A2(ProductID) )
5 =A4.groups(ProductID; Product(ProductID).ProductName, sum(Quantity):Quantity)

A4 Move the alignment-sequence-based filtering action to the cursor generation statement. Records that do not meet the filtering condition won’t be generated.

Execution result:

ProductID ProductName Quantity
1 Apple 206938
2 Tissue 463188
3 Beef 94378
5 Pork 217504
Clone this wiki locally