Skip to content

User Behavior Analysis in Practice 6:Numberizing the Dimension Table

esProcSPL edited this page Oct 18, 2024 · 1 revision

Target task:

We have a user events table T. Below is its structure and part of its data:

Time UserID EventTypeID ProductID Quantity
2022/6/1 10:20 1072755 3 100001
2022/6/1 12:12 1078030 2 100002
2022/6/1 12:36 1005093 5 100003 3
2022/6/1 13:21 1048655 1
2022/6/1 14:46 1037824 6
2022/6/1 15:19 1049626 4 100004 4
2022/6/1 16:00 1009296 5 100005 6
2022/6/1 16:39 1070713 2 100006
2022/6/1 17:40 1090884 3 100007

Fields in table T:

Field name Data type Description
Time Datetime Time stamp of an event, accurate to milliseconds
UserID String User ID
EventTypeID Integer Event type ID
ProductID String Product ID
Quantity Numeric Quantity

Dimension table EventType:

EventTypeID EventType
1 Login
2 Browse
3 Search
4 AddtoCart
5 Submit
6 Logout

Dimension table Product:

ProductID ProductName Unit Price ProductTypeID
100001 Apple Pound 5.5 1
100002 Tissue Packs 16 2
100003 Beef Pound 35 3
100004 Wine Bottles 120 4
100005 Pork Pound 25 3
100006 Bread Packs 10 5
100007 Juice Bottles 6 4

Fields in dimension table Product:

Field name Data type Description
ProductID String Product ID
ProductName String Product name
Unit String Sales unit
Price Numeric Unit prices
ProductTypeID Integer Product type ID

Dimension table ProductType:

ProductTypeID ProductType
1 Fruits
2 Home&Personalcare
3 Meat
4 Beverage
5 Bakery

Relationship between tables:

Computing task:

Calculate the total sales amount, number of orders, search frequency and the number of distinct users performing search and ordering under each type of product within a specified time period.

Techniques involved:

1. Associate with dimension table using ordinal-number-based location.

In both EventType and ProductType tables, primary keys are ordinal natural numbers. They can be directly used to locate records and achieve associations without creating index and computing and comparing HASH values. This can boost performance.

2. Convert a dimension table’s primary key values that are non-ordinal-numbers into ordinal natural numbers so that ordinal-number-based location can be used to speed up association.

Product table’s primary key values are not ordinal numbers represented by natural numbers, but we can first transform them into ordinal numbers and, in the meantime, convert ProductID field values in user events table T into corresponding ordinal number. Now ordinal-number-based location can be used.

Sample code

The code has five parts:

1. **According to practices in previous essays, **we dump data in those dimension tables as bin files EventType.btx, Product.btx and ProductType.btx that are respectively ordered by EventTypeID, ProductType and ProductTypeID.

2. Join user events table T and EventType table, Product table and ProductType table through ordinal numbers.

3. Add an ordinal number field to Product table.

4. Dump data from user events table T to store it in composite table T.ctx ordered by Time field, and transform ProductID values into ordinal numbers of corresponding records in Product table.

5. Import each dimension table into memory, open the composite table cursor, establish associations with dimension tables during which the association with Product table is through ordinal numbers. It is not necessary to set primary key and create index for dimension table Product.

1. Dump dimension table data

A
1 =connect("demo")
2 =A1.query("select * from Product")
3 =file("Product.btx").export@b(A1)
4 =A1.query("select * from EventType order by EventTypeID")
5 =file("EventType.btx").export@b(A4)
6 =A1.query@x("select * from ProductType order by ProductTypeID")
7 =file("ProductType.btx").export@b(A6)

A4 Sort by EventTypID.

A6 Sort by ProductTypID.

2. Change types of association between EventType table and table T, ProductType table and table T into ordinal number-based association.

A
1 >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
2 =file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5
3 >EventType=file("EventType.btx").import@b()
4 >ProductType=file("ProductType.btx").import@b()
5 >Product=file("Product.btx").import@b().keys@i(ProductID)
6 >Product=Product.switch(ProductTypeID, ProductType:#)
7 =A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:#)
8 =A7.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

A3-A4 Do not set create indexes on primary keys for EventType and ProductType.

A6 Change type of association between ProductType table and table T into ordinal number-based association.

A7 Change type of association between EventType table and table T into ordinal number-based association.

3. Add an ordinal number field in Product table.

Original data: add an ordinal number field to it directly.

A
1 =connect("demo").query@x("select * from Product").derive(#:ProductNum)
2 = file("Product.btx").export@b(A1)

Updated data: Whenever a dimension table is updated, it is wholly retrieved and compared with the dumped btx file. The comparison is based on ordinal numbers of the dumped records, otherwise historical data will be mismatched. The newly-increased data is placed at the end. Usually there are no deletion actions on dimension tables. Any deletion of dimension data will cause errors when historical records of the fact table try to reference records of the dimension table.

A
1 .keys@i(ProductID)=connect("demo").query@x("select * from Product").derive(:ProductNum)
2 = file("Product.btx").import@b().keys@i(ProductID)
3 =A1.select(A2.find(A1.ProductID)==null)
4 =A2.(if(r=A1.find(A2.ProductID),r,~) )
5 =(A4
6 =file("Product.btx").export@b(A5)

A1 Load the updated dimension table, add ProductNum field and set primary key and index.

A2 Load the original dimension table Product from the corresponding bin file and set primary key.

A3 Find the newly-increased records in the updated dimension table.

A4 If a record in the original dimension table exist in the updated one, use the new record; if it does not exist, use the original record.

A5 Union A4 and A3 and set ordinal numbers. As A4 keeps the order of the original dimension table, ordinal numbers in the original dimension table are retained.

A6 Write A5’s result to a bin file.

4. The code of preparing file T.ctx, during which ProductID field values are changed into ordinal numbers.

Take stocked data as an example:

A
1 >Product=file("Product.btx").import@b().keys@i(ProductID)
2 =connect("demo").cursor@x("select * from T order by Time")
3 =A2.run(ProductID=Product.find(A2.ProductID).ProductNum)
4 =file("T.ctx").create@y(#Time,UserID,EventTypeID, ProductID, Quantity)
5 .append(A3)=A4
6 >A4.close()

A1 Load dimension table Product into the memory and create index on primary key.

A2 Sort table T by time while retrieving data from it.

A3 Replace ProductID field values in table T with ordinal numbers of corresponding records in the dimension table.

A4 Create a composite table.

A5 Append data of table T to A4’s composite table.

Similar code for the newly-increased data.

5. Perform the whole analysis on the converted data, all through ordinal numbers.

Suppose we need to summarize data that falls in between 2022-03-15 and 2022-06-16:

A
1 >EventType=file("EventType.btx").import@b()
2 >ProductType=file("ProductType.btx").import@b()
3 >Product=file("Product.btx").import@b()
4 >Product=Product.switch(ProductTypeID, ProductType:#)
5 =file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5
6 =A5.switch(ProductID,Product:#; EventTypeID,EventType:#)
7 =A6.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

A1-A3 Do not set primary keys.

A6 A join via ordinal numbers.

Execution result:

EventTypeID ProductTypeID EventType ProductType Quantity Num iNum
3 1 Search Fruits 0 499586 48735
3 2 Search Home&Personalcare 0 508897 49872
3 3 Search Meat 0 403213 39923
3 4 Search Beverage 0 324567 29045
3 5 Search Bakery 0 335498 30234
5 1 Submit Fruits 206938 103469 13523
5 2 Submit Home&Personalcare 463188 154396 14656
5 3 Submit Meat 94378 93366 8754
5 4 Submit Beverage 217504 54376 5233
5 5 Submit Bakery 339480 67896 5844
Clone this wiki locally