The majority of the work in this report focuses on an implementation of Mondrian Forests, first introduced by Lakshminarayanan et al. [1].
The fundamental concept driving Mondrian Forests is the Mondrian Process
The Mondrian Forest is constructed similarly to a random forest: Given training data
In more natural language, the goal is to construct a distribution
The key insight made by Lakshminarayanan et al. is that if we are able to perform the above steps, the resulting set of online-trained Mondrian trees is functionally identical to a set of the same size trained in batch, i.e., the trees produced are agnostic to the order in which the training data arrives.
First introduced by Roy and Teh
This definition gives Mondrian processes some interesting and useful characteristics. First, the subtrees of any given partition are conditionally independent given
Decision trees by definition consist of nested partitioning operations on
If we imagine each node as a partition, it naturally follows that each node
For instance, in a one dimensional input space
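To make the one-dimensional case concrete, the following is a minimal sketch (independent of the implementation discussed later) of sampling a Mondrian process on an interval: the waiting time until the next cut is exponential with rate equal to the interval length, the cut location is uniform, and recursion stops once the accumulated cost exceeds the budget.

```python
import random

def sample_mondrian_1d(low, high, budget):
    """Recursively sample a 1-D Mondrian process on [low, high].

    Returns a list of cut locations. The time to the next cut is
    exponential with rate equal to the interval length, so larger
    intervals are cut sooner; recursion stops when the budget runs out.
    """
    cost = random.expovariate(high - low)   # rate = interval length
    if cost > budget:
        return []                           # budget exhausted: no cut
    cut = random.uniform(low, high)         # cut location is uniform
    remaining = budget - cost
    return (sample_mondrian_1d(low, cut, remaining)
            + [cut]
            + sample_mondrian_1d(cut, high, remaining))

cuts = sorted(sample_mondrian_1d(0.0, 1.0, budget=5.0))
print(cuts)
```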
To describe the Mondrian adaptation to decision trees, the authors introduce some notation
Let
Let
Let
Let
Lakshminarayanan et al. propose a set of Mondrian trees on the finite range of attributes in
Given a positive budget
SampleMondrianBlock(
Add
For all
Sample E from an exponential distribution with rate
if
Set
Sample split dimension
Sample split location
Set
Set
SampleMondrianBlock(
SampleMondrianBlock(
else do:
Set
Add
Algorithm 1, SampleMondrianTree, is called to instantiate a Mondrian tree. After that, Algorithm 2 recurses down the tree, building it as it goes. For each new node, we first compute
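A minimal Python sketch of these two procedures, following the description in [1], is shown below. The Node class and the attribute names (lower, upper, tau, delta, xi) are illustrative rather than the identifiers used in /randomforests, and the pause-on-identical-labels rule is omitted for brevity.

```python
import numpy as np

class Node:
    def __init__(self, parent=None):
        self.parent, self.left, self.right = parent, None, None
        self.lower = self.upper = None   # bounding box l^x_j and u^x_j
        self.tau = 0.0                   # split time of this block
        self.delta = self.xi = None      # split dimension and location
        self.is_leaf = True

def sample_mondrian_block(j, X, budget):
    """Recursively grow a Mondrian block on the examples in X (cf. Algorithm 2)."""
    j.lower, j.upper = X.min(axis=0), X.max(axis=0)
    extent = j.upper - j.lower
    parent_tau = j.parent.tau if j.parent is not None else 0.0
    # Time to the next split is exponential with rate sum_d (u_d - l_d).
    E = np.random.exponential(1.0 / extent.sum()) if extent.sum() > 0 else np.inf
    if parent_tau + E < budget:
        j.is_leaf = False
        j.tau = parent_tau + E
        # Split dimension chosen proportional to the side lengths,
        # split location uniformly within the chosen side.
        j.delta = np.random.choice(len(extent), p=extent / extent.sum())
        j.xi = np.random.uniform(j.lower[j.delta], j.upper[j.delta])
        left_mask = X[:, j.delta] <= j.xi
        j.left, j.right = Node(parent=j), Node(parent=j)
        sample_mondrian_block(j.left, X[left_mask], budget)
        sample_mondrian_block(j.right, X[~left_mask], budget)
    else:
        j.tau = budget                   # budget exhausted: block becomes a leaf

def sample_mondrian_tree(X, budget=np.inf):
    """Cf. Algorithm 1: instantiate the root and grow the tree."""
    root = Node()
    sample_mondrian_block(root, X, budget)
    return root
```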
Thanks to the projective rules of Mondrian processes defined by Roy and Teh and discussed in section 2.1, we can demonstrate a useful property of the tree we have defined. If we sample a Mondrian tree T from MT(
Input params: Tree T = (T,
ExtendMondrianBlock(
Set
Sample E from
if
Sample split dimension
Sample split location
Insert a new node
$\delta_{j'} = \delta$
$\xi_{j'}=\xi$
$\tau_{j'}=\tau_{parent(j)} + E$
$l^x_{j'} = \min(l^x_j, x)$
$u^x_{j'} = \max(u^x_j, x)$
$j'' = left(j')$ iff
SampleMondrianBlock(
else do:
update
if
if
ExtendMondrianBlock(
Similar to other decision tree growth algorithms, ExtendMondrianBlock halts partitioning on blocks when all labels are identical; if
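A corresponding sketch of this online extension step, again following [1] and reusing the Node class and sample_mondrian_block from the previous sketch (the identical-labels pause is again omitted, and x is assumed to be a 1-D NumPy array), might look as follows.

```python
def extend_mondrian_block(j, x, budget):
    """Online update of block j with one new example x (cf. ExtendMondrianBlock in [1]).

    Returns the (possibly new) top node of this subtree, so a caller can do
    root = extend_mondrian_block(root, x, budget).
    """
    # Distance by which x sticks out of j's bounding box in each dimension.
    e_lower = np.maximum(j.lower - x, 0.0)
    e_upper = np.maximum(x - j.upper, 0.0)
    rate = (e_lower + e_upper).sum()
    parent_tau = j.parent.tau if j.parent is not None else 0.0
    E = np.random.exponential(1.0 / rate) if rate > 0 else np.inf

    if parent_tau + E < j.tau:
        # A cut "outside the box" occurs before j's own split time:
        # introduce a new parent whose split separates x from the old block.
        probs = (e_lower + e_upper) / rate
        delta = np.random.choice(len(probs), p=probs)
        if x[delta] > j.upper[delta]:
            xi = np.random.uniform(j.upper[delta], x[delta])
        else:
            xi = np.random.uniform(x[delta], j.lower[delta])
        new_parent = Node(parent=j.parent)
        new_parent.is_leaf = False
        new_parent.tau = parent_tau + E
        new_parent.delta, new_parent.xi = delta, xi
        new_parent.lower = np.minimum(j.lower, x)
        new_parent.upper = np.maximum(j.upper, x)
        sibling = Node(parent=new_parent)
        if x[delta] <= xi:
            new_parent.left, new_parent.right = sibling, j
        else:
            new_parent.left, new_parent.right = j, sibling
        if j.parent is not None:                  # re-link the grandparent to the new node
            if j.parent.left is j:
                j.parent.left = new_parent
            else:
                j.parent.right = new_parent
        j.parent = new_parent
        # Grow a fresh Mondrian block for the new sibling containing only x.
        sample_mondrian_block(sibling, x.reshape(1, -1), budget)
        return new_parent
    else:
        # x is absorbed into the existing block: extend the box and recurse.
        j.lower = np.minimum(j.lower, x)
        j.upper = np.maximum(j.upper, x)
        if not j.is_leaf:
            child = j.left if x[j.delta] <= j.xi else j.right
            extend_mondrian_block(child, x, budget)
        return j
```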
For a given tree
Similar to other decision tree algorithms, we return the class as an empirical distribution of examples which fall in
For every
To predict the label of a new example
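Assuming each leaf stores a dictionary of label counts filled in during fitting, this prediction step reduces to routing the example to its leaf and normalising those counts. The sketch below shows that simple empirical version; the hierarchical smoothing of the predictive posterior used in [1] is omitted.

```python
def predict_proba(root, x):
    """Route x to its leaf and return the empirical label distribution there.

    Assumes each leaf Node carries a `label_counts` dict (label -> count)
    populated while the tree was fitted; internal nodes route on (delta, xi).
    """
    node = root
    while not node.is_leaf:
        node = node.left if x[node.delta] <= node.xi else node.right
    total = sum(node.label_counts.values())
    return {label: count / total for label, count in node.label_counts.items()}

def predict(root, x):
    """Return the most probable label for x under the empirical distribution."""
    probs = predict_proba(root, x)
    return max(probs, key=probs.get)
```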
Several datasets were chosen for this project in order to expose the algorithm to a wide variety of problems. A few sample datasets used for experimentation were placed in /datasets, from which the model reads its data.
Mondrian Forests were implemented in /randomforests, following the algorithms and explanations above, in Python with the help of several popular libraries. The implementation involves an abstract classifier type, of which the Mondrian Decision Forest is a subtype. This was done to facilitate easy comparison with both common implementations of MF and other decision tree algorithms.
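The exact class hierarchy in /randomforests is not reproduced here, but an abstract classifier of the kind described might look like the following sketch; all names (OnlineClassifier, MondrianForest, tree_factory) are illustrative rather than the identifiers used in the repository.

```python
from abc import ABC, abstractmethod

class OnlineClassifier(ABC):
    """Minimal shared interface for the classifiers compared in this project."""

    @abstractmethod
    def fit(self, x, y):
        """Incorporate a single labelled example (online learning)."""

    @abstractmethod
    def predict(self, x):
        """Return a predicted label for a single example."""

class MondrianForest(OnlineClassifier):
    """Illustrative subtype: an ensemble of independently grown Mondrian trees."""

    def __init__(self, tree_factory, n_tree=8):
        # tree_factory is any callable returning an object with fit/predict,
        # e.g. a thin wrapper around the sampling routines sketched earlier.
        self.trees = [tree_factory() for _ in range(n_tree)]

    def fit(self, x, y):
        for tree in self.trees:
            tree.fit(x, y)

    def predict(self, x):
        votes = [tree.predict(x) for tree in self.trees]
        return max(set(votes), key=votes.count)   # simple majority vote
```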
Finally, several Jupyter notebooks were created in /notebooks to facilitate efficient fine-tuning and experimentation. These were used both to refine hyperparameters and to perform the comparisons discussed above. Please note that several of the notebooks take quite a bit of time to run, owing to various hyperparameter choices. According to Bonab and Can [3], the ideal number of classifiers for an ensemble model is simply the number of classes, but this assumes perfectly independent classifiers, which are infeasible in practice. As a result, Bonab and Can recommend increasing the number of classifiers in inverse relation to their independence. Additional hyperparameter tests were run over a range of arbitrary values, with the range extended whenever one extreme proved most effective.
The purpose of Mondrian decision trees is to sacrifice accuracy for decreased training time. As a result, we might term Mondrian trees weak classifiers, and we form an ensemble of them to compensate. Increasing accuracy was the primary goal of this section, and several methods were attempted (though some had to be disqualified because they convert our online learner into a batch caching learner).
First, it must be noted that most common strategies for increasing the accuracy of ensemble classifiers cannot be directly applied in an online learning context. Additionally, although increasing accuracy was the primary goal, methods to improve training time were also implemented after the comparative section of this project.
Bagging as applied to online learners is not terribly complicated; following the approach laid out by Oza and Russell [4], an online bagging algorithm was implemented.
for
set
for
$tree.fit(x, y)$
This algorithm is incredibly simple: each example is run through each tree a variable number of times based on the Poisson distribution with rate 1
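A minimal sketch of this step, following Oza and Russell [4] and assuming each tree exposes a fit(x, y) method, might look as follows.

```python
import numpy as np

def online_bagging_fit(trees, x, y):
    """Present one example to every tree k ~ Poisson(1) times (Oza & Russell [4])."""
    for tree in trees:
        k = np.random.poisson(1.0)        # how many copies of (x, y) this tree sees
        for _ in range(k):
            tree.fit(x, y)
```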
The Mondrian tree growth process was put forward by Lakshminarayanan et al. as a trade-off between accuracy and training efficiency. Because we construct a forest instead of a single tree, the accuracy loss is somewhat mitigated. In an online context, training efficiency might be a critical metric, so the proposed MDF use cases are situations where accuracy is not critical and training efficiency is. These cases exist, but they are rare; the goal of this research extension is therefore to improve the accuracy of the MDF without sacrificing too much training efficiency.
Rolling weight boosting is inspired by traditional online boosting algorithms described by Oza and Russell [4]. We introduce
Initialize
for
set
for
$tree.fit(x, y)$
$prob = tree.predict(x)$
if
update $ r_t^w \leftarrow r_t^w + e_w$
update
else do:
update $ r_t^c \leftarrow r_t^c + e_w$
update
This procedure is executed for each new example and is adapted roughly from [4]. For each
We add the weight to the sum of weights for either correct or incorrect classifications, such that
After updating this, we then update the weight of the given example to be used in the remaining classifiers. We take
Then, when it comes time to classify our examples, we calculate a scaling factor equivalent to
and weight the output of each
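A minimal sketch of the whole procedure is given below, assuming the AdaBoost-style weight updates and log((1 - ε)/ε) vote scaling from [4]; the exact scaling factor used in the project may differ, and the names (lam_correct, lam_wrong) are illustrative.

```python
import numpy as np

def rolling_weight_fit(trees, lam_correct, lam_wrong, x, y):
    """One boosting pass over the ensemble for a single example, loosely following [4].

    lam_correct[t] / lam_wrong[t] accumulate the example-weight mass tree t has
    classified correctly / incorrectly so far; the example weight e_w starts at 1
    and is rescaled as the example moves through the ensemble.
    """
    e_w = 1.0
    for t, tree in enumerate(trees):
        for _ in range(np.random.poisson(e_w)):    # fit proportionally to weight
            tree.fit(x, y)
        if tree.predict(x) != y:
            lam_wrong[t] += e_w
            eps = lam_wrong[t] / (lam_correct[t] + lam_wrong[t])
            e_w *= 1.0 / (2.0 * eps)               # boost weight of hard examples
        else:
            lam_correct[t] += e_w
            eps = lam_wrong[t] / (lam_correct[t] + lam_wrong[t])
            e_w *= 1.0 / (2.0 * (1.0 - eps))       # damp weight of easy examples

def rolling_weight_predict(trees, lam_correct, lam_wrong, x):
    """Weighted vote: each tree's vote is scaled by log((1 - eps) / eps)."""
    votes = {}
    for t, tree in enumerate(trees):
        total = lam_correct[t] + lam_wrong[t]
        eps = lam_wrong[t] / total if total > 0 else 0.5
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)    # keep the log finite
        label = tree.predict(x)
        votes[label] = votes.get(label, 0.0) + np.log((1.0 - eps) / eps)
    return max(votes, key=votes.get)
```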
The aim of this algorithm is to stabilize the random nature of Mondrian forests. Trees that correctly predict examples that other trees misclassify experience a greater positive change in
Because there is no efficient search heuristic for split criteria, each individual Mondrian tree tends to be less powerful than a tree trained using an algorithm like ID3, which takes information gain into account. Conversely, Mondrian trees are much more efficient to train, and their complexity grows only linearly with the number of attributes. As such, Mondrian trees lend themselves reasonably well to ensemble learning, but they see quickly diminishing returns as n_tree increases, as our n_tree tests in Tables 1-3 show. These diminishing returns likely arise because the ExtendMondrianBlock algorithm takes random splits, so the individual classifiers are not very closely related. As Bonab and Can note [3], the ideal number of classifiers equals the number of class labels when the classifiers are independent, which supports the idea that Mondrian trees are fairly independent and therefore have an ideal classifier count only slightly higher than the number of labels.
n_tree | 2 | 4 | 8 | 16 |
---|---|---|---|---|
Acc | 0.75 | 0.85 | 0.90 | 0.95 |
Pre | 1.00 | 1.00 | 1.00 | 1.00 |
Rec | 0.38 | 0.63 | 0.75 | 0.88 |
Auc | 0.81 | 0.57 | 0.88 | 0.62 |
Time | 0.02 | 0.03 | 0.06 | 0.14 |
n_tree | 2 | 4 | 8 | 16 |
---|---|---|---|---|
Acc | 0.87 | 0.90 | 0.87 | 0.90 |
Pre | 0.83 | 0.85 | 0.83 | 0.85 |
Rec | 0.90 | 0.95 | 0.93 | 0.94 |
Auc | 0.91 | 0.96 | 0.96 | 0.95 |
Time | 2.95 | 6.64 | 13.9 | 26.6 |
n_tree | 2 | 4 | 8 | 16 |
---|---|---|---|---|
Acc | 0.63 | 0.90 | 0.87 | 0.90 |
Pre | 0.58 | 0.85 | 0.83 | 0.85 |
Rec | 0.88 | 0.95 | 0.93 | 0.94 |
Auc | 0.69 | 0.96 | 0.96 | 0.95 |
Time | 2.19 | 6.64 | 13.9 | 26.6 |
For the remaining tests, 8 classifiers were used.
When applying rolling weight boosting to the Mondrian forest algorithm described by Lakshminarayanan et al., we find surprisingly encouraging results with regard to model performance, and similarly for online bagging.
Aggregation | Rolling Boosting | Online Bagging | Default Mondrian Forest |
---|---|---|---|
Acc | 1.00 | 1.00 | 0.90 |
Pre | 1.00 | 1.00 | 1.00 |
Rec | 1.00 | 1.00 | 0.75 |
Auc | 1.00 | 1.00 | 0.88 |
Time | 0.20 | 0.07 | 0.06 |
Aggregation | Rolling Boosting | Online Bagging | Default Mondrian Forest |
---|---|---|---|
Acc | 0.92 | 0.91 | 0.87 |
Pre | 0.92 | 0.92 | 0.83 |
Rec | 0.91 | 0.90 | 0.93 |
Auc | 0.97 | 0.97 | 0.96 |
Time | 37.8 | 20.4 | 13.9 |
Aggregation | Rolling Boosting | Online Bagging | Default Mondrian Forest |
---|---|---|---|
Acc | 0.62 | 0.65 | 0.67 |
Pre | 0.58 | 0.65 | 0.58 |
Rec | 0.85 | 0.82 | 0.88 |
Auc | 0.70 | 0.73 | 0.69 |
Time | 324.2 | 53.2 | 21.9 |
Aggregation | Rolling Boosting | Online Bagging | Default Mondrian Forest |
---|---|---|---|
Acc | 0.62 | 0.85 | 0.78 |
Pre | 0.58 | 0.80 | 0.80 |
Rec | 0.85 | 0.82 | 0.25 |
Auc | 0.70 | 0.92 | 0.39 |
Time | 84.2 | 11.3 | 9.93 |
On each dataset, the performance of the Mondrian forest was generally increased by the rolling boost algorithm, but the trade-off in increased training time is worrying. The increase is slight for smaller datasets, but it compounds as more examples are included. Mondrian decision trees are very efficient to grow, but classification is still relatively inefficient, and becomes more so as the tree grows more complex. As a result, for huge datasets, the first operation in RollingWeightFit - the classification of
One advantage (unimplemented in this project) of Mondrian decision trees is the ability to parallelize fitting over a batch of examples, thanks to the projectivity property of Mondrian processes outlined by Roy and Teh [2] and implemented by Lakshminarayanan et al. [1]. While rolling boosting generally improves the accuracy of Mondrian forests, it by nature requires examples to be fit incrementally, one tree at a time, removing the possibility of massive parallelization.
As a result, we cannot say that rolling boosting is truly successful, despite accomplishing our original aim of increasing tree performance, as the increase is minimal and the decrease in training efficiency is too large.
On the other hand, online bagging seems to be incredibly useful, positively influencing nearly every metric on every dataset at the cost of a very small amount of training time. In essence, this algorithm artificially increases the size of our training data, and in doing so causes many more partitions to be made, increasing the overall expressivity of the model. However, we must also consider that we lose the projectivity of Mondrian processes with this extension as well: by weighting each example based on the previous weight, the order in which the examples arrive becomes a factor in the resulting model. Because we take a Poisson draw of the weight and use it to determine how many times each example is fitted, the order of examples now dictates the effective number of training examples.
Overall, the results of these experiments were mixed. While the original Mondrian tree algorithm performed fairly well, both extensions were ultimately held back by the fact that Mondrian trees must satisfy a specific set of conditions to maintain projectivity, and the vast majority of feasible modifications to the algorithm break these conditions. Nevertheless, Mondrian decision trees are an effective classification algorithm and should undoubtedly be considered for any online learning problem.
[1] Lakshminarayanan, B., Roy, D. M., & Teh, Y. W. (2015). Mondrian Forests: Efficient Online Random Forests. arXiv [Stat.ML]. Retrieved from http://arxiv.org/abs/1406.2673
[2] Roy, D. M., & Teh, Y. (2008). The Mondrian Process. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems (Vol. 21). Retrieved from https://proceedings.neurips.cc/paper_files/paper/2008/file/fe8c15fed5f808006ce95eddb7366e35-Paper.pdf
[3] Bonab, H. R., & Can, F. (2016). A Theoretical Framework on the Ideal Number of Classifiers for Online Ensembles in Data Streams. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2053–2056. Indianapolis, Indiana, USA. doi:10.1145/2983323.2983907
[4] Oza, N. C., & Russell, S. J. (2001). Online Bagging and Boosting. Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research R3:229–236. Available from https://proceedings.mlr.press/r3/oza01a.html. Reissued by PMLR on 31 March 2021.