MicroStrategy ONE
Determining the importance of a variable
Predictive models can contain information about the importance of the variables that participate in the model. In PMML, each MiningField has an optional importance attribute, which describes the relative importance of an input (independent or active) variable to the target (dependent or predicted) variable. A value of one indicates that the variables are directly related, and a value of zero indicates that the input variable is irrelevant. The PMML definition of importance requires that the value never be negative. Therefore, variables that are perfectly inversely related also have an importance value of one.
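As a rough illustration of where this attribute lives, the following Python sketch reads the importance values from a small MiningSchema fragment. The field names and values are hypothetical, and the fragment omits the XML namespace that a complete PMML document declares, so real files would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML fragment: field names and importance values are
# invented for illustration and do not come from a real model.
PMML_FRAGMENT = """
<MiningSchema>
  <MiningField name="Revenue" usageType="predicted"/>
  <MiningField name="Discount" usageType="active" importance="0.97"/>
  <MiningField name="Region" usageType="active" importance="0.12"/>
  <MiningField name="Tenure" usageType="active"/>
</MiningSchema>
"""


def read_importance(mining_schema_xml):
    """Return {field name: importance or None} for the active (input) fields."""
    root = ET.fromstring(mining_schema_xml)
    result = {}
    for field in root.iter("MiningField"):
        # PMML treats a missing usageType as "active".
        if field.get("usageType", "active") != "active":
            continue
        raw = field.get("importance")  # optional attribute, may be absent
        result[field.get("name")] = float(raw) if raw is not None else None
    return result


importance = read_importance(PMML_FRAGMENT)
for name, value in importance.items():
    print(name, "no importance reported" if value is None else value)
```

A field without the attribute is reported as None rather than zero, because a missing value means the model did not supply importance, not that the variable is irrelevant.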
The presence of importance information depends on the nature of the model, and importance can also vary from vendor to vendor. For models built using MicroStrategy, the importance information included in the PMML depends on the model type since importance is only a by-product of some algorithms. The importance for available model types in MicroStrategy is described below:
- Regression Type Models: These models forecast a continuous value.
  - Linear and Exponential Regression:
    - Variable importance information is a by-product of the model creation process and is included in the model.
    - The calculation is 1 - p-Value(T-dist).
    - For categorical variables, the importance reported is the maximum of the importance values of the indicator-transformed variables derived from that variable.
    - This statistic is used by the Variable Reduction Settings. Variables are eliminated based on those settings, using the value referred to as p-Value(T-dist)-Initial. The p-Value(T-dist)-Final is used to establish the importance of the variable in the final model, after variable reduction has taken place.
  - Time Series: The concept of importance does not apply, since there is just one variable to be predicted, along with a time-related variable. Both of these variables are required regardless of importance.
    If you are using Time Series models and you are concerned with variable importance, build Linear and Exponential Regression models first. You can then apply what you learn about variable importance to your Time Series models.
- Classification Type Models: These models forecast a categorical value.
  - Logistic Regression:
    - Variable importance information is a by-product of the model creation process and is included in the model.
    - The calculation is 1 - p-Value(ChiSquare).
    - For categorical variables, the importance reported is the maximum of the importance values of the indicator-transformed variables derived from that variable.
  - Decision Tree: Importance is not a direct by-product of the model creation process, so there is no quantitative importance value included in the model. However, Decision Tree models tend to split on the most important variables first. Therefore, you can get a qualitative idea of variable importance by inspecting the model using the Model Viewer.
    If you are using Decision Tree models and you are concerned with variable importance, build Logistic Regression models first. You can then apply what you learn about variable importance to your Decision Tree models.
- Clustering Type Models: Importance is not a direct by-product of the model creation process, so there is no quantitative importance value included in the model. However, the cluster centers and the behavior of the model can yield insights into which variables affect the cluster assignment. Therefore, you can get a qualitative idea of variable importance by inspecting the model using the Model Viewer.
- Association Type Models: The concept of importance does not apply, since both variables are required to build the model.
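The 1 - p-value calculation and the maximum rule for categorical variables described for the regression models above can be sketched in Python as follows. This is a minimal illustration, not MicroStrategy's implementation: the p-values are assumed to be already available from the fitted model, and the variable names, indicator names, and numbers are hypothetical.

```python
def importance_from_p_value(p_value):
    """Importance as described above: 1 - p-value."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must be between 0 and 1")
    return 1.0 - p_value


def variable_importance(p_values, indicator_groups):
    """Collapse per-term p-values into one importance per original variable.

    p_values: {model term: p-value}; a categorical variable appears as
        several indicator terms.
    indicator_groups: {categorical variable: [its indicator terms]}.
    """
    term_to_variable = {
        term: variable
        for variable, terms in indicator_groups.items()
        for term in terms
    }
    result = {}
    for term, p_value in p_values.items():
        # Terms that are not indicators map to themselves.
        variable = term_to_variable.get(term, term)
        score = importance_from_p_value(p_value)
        # A categorical variable keeps the maximum over its indicators.
        result[variable] = max(score, result.get(variable, 0.0))
    return result


# Hypothetical p-values for a model with one categorical input (Region).
p_values = {
    "Discount": 0.002,
    "Region=East": 0.40,
    "Region=West": 0.03,
    "Tenure": 0.65,
}
groups = {"Region": ["Region=East", "Region=West"]}

scores = variable_importance(p_values, groups)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

In this example Region takes its importance from the Region=West indicator, which has the smaller p-value of the two.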