-
Notifications
You must be signed in to change notification settings - Fork 339
Performance optimization skill:Numberizing Foreign Key
In Performance Optimization Skill: Partial Pre-Ass ociation, we introduced the technique of loading dimension tables in memory and pre-associating them which effectively improves performance when the memory isn’t large enough to hold all the data. However, when associating the fact table with dimension tables, we still need to do hash computation and comparison, which can also be improved by another optimization skill, i.e., numberizing foreign key.
The idea is this: if the primary key values of a dimension table are natural numbers starting from 1 (which are equivalent to row numbers), we can locate a record in the dimension table through the key values, i.e., according to the row numbers, rather than by computing and comparing hash values. This method speeds up the association of a dimension table with the fact table and eventually makes the whole computation faster. The numberization-based location makes it unnecessary to create indexes on dimensions and thus reduces memory consumption.
In the following part, we’ll illustrate how SPL implements the old query in the old test environment(explained in the previous article) by numberizing foreign key values and make a comparison with the previous results to see the effects of this skill on the performance.
To numberize foreign key values requires the primary key values of a dimension table to be sequence numbers (the row numbers of records). Yet very few of the cases in actual businesses meet that requirement, so first we need to convert the primary key values into sequence numbers:
1)Create a list of key-value – number pairs where the correspondences between each key value and natural sequence number are stored;
2)Replace the primary key values of the dimension table with natural sequence numbers to get a new dimension table file;
3)Convert the foreign key values of the fact table into corresponding sequence numbers in the dimension table according to the key-value – number list to get a new fact table;
The three dimension tables and the fact table used in the test are supplier, part, orders and lineitem, respectively.
The SPL script for replacing the table’s key values with natural sequence numbers is as follows:
A | |
---|---|
1 | =file("/home/tbl/supplier.tbl").cursor(;," |
2 | =file("/home/btx/supplier_xh_key.btx").export@b(A1,S_SUPPKEY,xh) |
3 | =A1.new(xh:S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL) |
4 | =file("/home/ctx/supplier_xh.ctx").create(#S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL) |
5 | >A4.append(A3.cursor()) |
A1: Read in data from the source file supplier.tbl and add a new column xh in the derive() function which uses the row number as its field value.
A2: Export the S_SUPPKEY field and xh field in A1 to bin file supplier_xh_key.btx to generate the list of key value – number pairs for numberizing the fact table later.
A3: Replace S_SUPPKEY field values with xh field values to rebuild the supplier table sequence.
A4: Create composite table supplier_xh.ctx to save the numberized suppliertable.
A5: Export data in the rebuilt supplier table sequence to the composite table supplier_xh.ctx.
The SPL script:
A | |
---|---|
1 | =file("/home/tbl/part.tbl").cursor(;," |
2 | =file("/home/btx/part_xh_key.btx").export@b(A1,P_PARTKEY,xh) |
3 | =A1.new(xh:P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE) |
4 | =file("/home/ctx/part_xh.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE) |
5 | >A4.append(A3.cursor()) |
The script is phrased in the same way as the previous one. The list of key-value – number pairs is part_xh_key.btx and the numberized part file is stored as composite table part_xh.ctx.
The SPL script:
A | |
---|---|
1 | =file("/home/tbl/orders.tbl").cursor(;," |
2 | =file("/home/btx/orders_xh_key.btx").export@b(A1,O_ORDERKEY,xh) |
3 | =A1.new(xh:O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE, O_ORDERDATE,O_ORDERPRIORITY,O_SHIPPRIORITY) |
4 | =file("/home/ctx/orders_xh.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_SHIPPRIORITY) |
5 | >A4.append(A3.cursor()) |
The script is phrased in the same way as the previous one. The list of key-value – number pairs is orders_xh_key.btx and the numberized part file is stored as composite table orders_xh.ctx.
The SPL script:
A | |
---|---|
1 | =file("/home/tbl/lineitem.tbl").cursor(;," |
2 | =file("/home/btx/orders_xh_key.btx").import@b() |
3 | =file("/home/btx/part_xh_key.btx").import@b() |
4 | =file("/home/btx/supplier_xh_key.btx").import@b() |
5 | =A1.switch(L_ORDERKEY,A2:O_ORDERKEY;L_PARTKEY,A3:P_PARTKEY;L_SUPPKEY,A4:S_SUPPKEY) |
6 | =A5.run(L_ORDERKEY=L_ORDERKEY.xh, L_PARTKEY=L_PARTKEY.xh, L_SUPPKEY=L_SUPPKEY.xh) |
7 | =file("/home/ctx/lineitem_xh.ctx").create(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY,L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT;L_ORDERKEY) |
8 | >A7.append(A6) |
A1: Create a cursor to read data from the source lineitem file.
A2/A3/A4: Import the key value – number pairs lists of the three dimension tables, respectively.
A5: Create association of L_ORDERKEY with orders’s key value – number pairs list, L_PARTKEY with part’s corresponding list and L_SUPPKEY with supplier’s list.
A6: Replace the values of each key with the associated numbers in the corresponding list to create a new cursor.
A7: Create composite table lineitem_xh.ctx to save data of the numberized lineitem file.
A8: Export data of the numberized lineitem file cursor to the composite table lineitem_xh.ctx.
The SPL script for pre-loading the dimension tables:
A | |
---|---|
1 | >env(supplier, file("/home/ctx/supplier.ctx").open().memory().keys@i(S_SUPPKEY)) |
2 | >env(part, file("/home/ctx/part.ctx").open().memory().keys@i(P_PARTKEY)) |
3 | >env(orders,file("/home/ctx/orders.ctx").open().memory().keys@i(O_ORDERKEY)) |
It loads every dimension table in memory and creates an index on its keys.
SPL test script to implement the query:
A | |
---|---|
1 | =file("/home/ctx/lineitem.ctx").open().cursor(L_ORDERKEY,L_PARTKEY, L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE) |
2 | =A1.switch(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier) |
3 | =A2.select(L_ORDERKEY.O_TOTALPRICE>0 && L_PARTKEY.P_SIZE>0 && L_SUPPKEY.S_ACCTBAL<999999) |
4 | =A3.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue) |
We first execute the pre-loading script and then the test script to get the execution time of test script is 450 seconds.
The SPL script for pre-loading the dimension tables:
A | |
---|---|
1 | >env(supplier, file("/home/ctx/supplier_xh.ctx").open().memory()) |
2 | >env(part, file("/home/ctx/part_xh.ctx").open().memory()) |
3 | >env(orders,file("/home/ctx/orders_xh.ctx").open().memory()) |
There’s no need to create indexes because the keys of dimension tables are already numberized.
SPL test script to implement the query:
A | |
---|---|
1 | =file("/home/ctx/lineitem_xh.ctx").open().cursor(L_ORDERKEY,L_PARTKEY, L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE) |
2 | =A1.switch(L_ORDERKEY,orders:#;L_PARTKEY,part:#;L_SUPPKEY,supplier:#) |
3 | =A2.select(L_ORDERKEY.O_TOTALPRICE>0 && L_PARTKEY.P_SIZE>0 && L_SUPPKEY.S_ACCTBAL<999999) |
4 | =A3.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue) |
Note: The “:#” in A2 is used to create association of the fact table’s foreign key values with a dimension table’s row numbers. The foreign key value 7, for instance, will directly associate to the 7th row in the dimension table.
We first execute the pre-loading script and then the test script to get the execution time of text script is 269 seconds.
In the above two tests, the numbers of records and fields in the dimension tables and the fact table remain unchanged. The only change is that their key values are replaced with sequence numbers. The two queries select the same records for computation using the same filtering condition - that is to say, they spend the same time in data filtering and grouping, so the only difference lies in the way to implement association: one is by row numbers while the other is through hash value comparison. By numberizing foreign key values and locating corresponding records in the dimension table via row numbers, the query is 181 seconds faster than before.
We can only numberize the primary key of the dimension table, but there’s no restriction on the data type of the primary key values which can be integer, string, date, time, etc. If a dimension table has a multi-field primary key, we can add another number field and create a pairs list where one sequence number corresponds to multiple key values to numberize the foreign key of the fact table.
It’s convenient to apply the technique to historical data queries. Numberizing the historical data once is enough and there’s no need to retain the list of key-value - number pairs.
The technique is also suitable for querying newly-added data with only a few more steps.
-
Get new records from the dimension table(s) and append the new key values and their sequence numbers to the corresponding list;
-
Append new records to the numberized dimension table according to the corresponding list;
-
Append new records to the numberized fact table according to the corresponding list.
Under the premise of unchanged dimension table(s), just perform step 3) to handle with the new data.
The technique of numberizing foreign key can be applied after the new data are processed.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code