Decision tree analysis

Decision trees can be used for classification as well as regression, so the dependent metric can be either categorical or continuous. MicroStrategy produces binary trees. Several decision tree algorithms exist, and they differ mainly in how they decide which variable is the best one to branch on; MicroStrategy uses CART. Building a tree with CART involves three steps:

Build a maximal tree
Prune the maximal tree
Select the best pruned tree

In general, the larger the tree, the better the quality of the result and the longer the time taken. The pruning step therefore produces a series of sub-trees of the maximal tree, each representing a different trade-off between time taken and output quality. From these pruned trees, MicroStrategy selects the one that best balances the two.
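The selection step can be pictured with the cost-complexity measure used in standard CART pruning, where each candidate sub-tree is scored as its error plus a penalty per leaf. The following is a minimal sketch under stated assumptions, not MicroStrategy's actual implementation: the PrunedTree class, the alpha penalty value, and the candidate figures are all hypothetical and chosen only to illustrate the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrunedTree:
    """Hypothetical summary of one candidate sub-tree of the maximal tree."""
    name: str
    leaves: int    # tree size, used here as a stand-in for time taken
    error: float   # estimated error rate, used here as a stand-in for quality


def cost_complexity(tree, alpha):
    """Standard CART-style score: error plus a penalty of alpha per leaf."""
    return tree.error + alpha * tree.leaves


def select_pruned_tree(candidates, alpha):
    """Return the candidate with the lowest score; ties go to the smaller tree."""
    if not candidates:
        raise ValueError("at least one candidate tree is required")
    return min(candidates, key=lambda t: (cost_complexity(t, alpha), t.leaves))


# Made-up candidates, for illustration only.
candidates = [
    PrunedTree("maximal", leaves=40, error=0.10),
    PrunedTree("medium", leaves=12, error=0.12),
    PrunedTree("small", leaves=3, error=0.25),
]

best = select_pruned_tree(candidates, alpha=0.005)
print(best.name)  # medium
```

With alpha set to 0 the maximal tree always wins, because size carries no penalty; raising alpha shifts the choice toward smaller trees.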

K-folds cross-validation

When MicroStrategy trains a decision tree model, the decision tree algorithm splits the training data into two sets: one set is used to develop the tree and the other is used to validate it. Prior to MicroStrategy 9.0, one fifth of the training data was always reserved for validating the model built on the remaining four fifths. The quality of a model built this way, an approach referred to as the holdout method, can vary depending on how the data is split, especially if there is an unintended bias between the training set and the validation set.
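As a rough illustration of the holdout method, the sketch below withholds one fifth of the rows for validation and keeps the remaining four fifths for building the model. The function name holdout_split and the seeded random shuffle are assumptions made for this example; the source does not say how MicroStrategy chooses which rows to withhold.

```python
import random


def holdout_split(rows, parts=5, seed=0):
    """Withhold 1/parts of the rows for validation and keep the rest for training.

    The random shuffle is an assumption for this sketch, not documented behavior.
    """
    if parts < 2:
        raise ValueError("parts must be at least 2")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = len(shuffled) // parts
    validation = shuffled[:n_validation]
    training = shuffled[n_validation:]
    return training, validation


rows = list(range(100))
train_rows, validation_rows = holdout_split(rows)
print(len(train_rows), len(validation_rows))  # 80 20
```

Changing the seed changes which rows land in the validation set, which is exactly the source of variability that the holdout method suffers from.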

K-folds cross-validation is an improvement over the holdout method. The training data is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the validation set and the other k-1 subsets are used to build a model. The results of all k trials are then combined, which typically yields a better model. Since every data point is in the validation set exactly once and in the training set k-1 times, the model is much less sensitive to how the partition is made.
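The partitioning described above can be sketched in a few lines. This is a minimal illustration, not MicroStrategy's code: the names k_fold_splits and cross_validate are hypothetical, the shuffle is an assumption, and combining the k results by taking their mean is also an assumption, since the source does not specify how the results are combined.

```python
import random
from collections import Counter


def k_fold_splits(n_rows, k, seed=0):
    """Yield (training_indices, validation_indices) once for each of the k folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def cross_validate(n_rows, k, fit_and_score, seed=0):
    """Average a caller-supplied score over all k trials (averaging is assumed)."""
    scores = [fit_and_score(train, val) for train, val in k_fold_splits(n_rows, k, seed)]
    return sum(scores) / len(scores)


# Count how often each row is used for validation and for training.
validation_counts = Counter()
training_counts = Counter()
for training, validation in k_fold_splits(n_rows=10, k=5):
    validation_counts.update(validation)
    training_counts.update(training)

print(sorted(set(validation_counts.values())))  # [1]
print(sorted(set(training_counts.values())))    # [4]
```

The counts confirm the property stated in the text: with k=5, every row is validated exactly once and used for training k-1 = 4 times.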

A downside is that training time tends to increase proportionally with k, so MicroStrategy allows the user to control the k parameter, limiting it to a maximum value of 10. The K-fold setting is specified on the Select Type of Analysis dialog in the Training Metric Wizard.

If the user sets k=1, the basic holdout method is used: one fifth of the training data is withheld to validate a model built on the remaining four fifths.