MicroStrategy ONE
Variables tab, Statistics sub-tab
Detailed statistical information on each variable is available on the Statistics sub-tab. Information about each variable is grouped into sections. The sections have icons in the upper right that indicate the nature of each variable, specifically the Op Type and Usage Type.
Since this information is usually descriptive in nature (not necessarily predictive) and beyond what is required to score a model, not all vendors include these statistics. Each variable's section can contain the following information:
- Model level and statistical information about the variable in list form.
- When provided by the PMML, a histogram showing the distribution of the values for this variable in the training data.
- Analysis of Variance (ANOVA) information in tabular form. This table only appears for a categorical independent variable (X) when there is a continuous dependent variable (Y), and only for regression-type models.
The ANOVA table is represented as:

Source | SS | Df | MS | F | P
---|---|---|---|---|---
Between | SS(between) | Df(between) | MS(between) | MS(between)/MS(within) | p-Value
Within | SS(within) | Df(within) | MS(within) | |
Total | SS(total) | Df(total) | | |
While a discussion of ANOVA statistics is beyond the scope of this documentation, a simple summary follows.
The information below summarizes a series of calculations that test the Null hypothesis:
H0: X has no effect on Y.
The p-Value represents the probability of observing a relationship at least as strong as the one in the data if the null hypothesis H0 were true.
If the p-Value is greater than a particular threshold (typically set at 0.20, or 20%), then the null hypothesis H0 is not rejected, and the ANOVA conclusion is that X has no effect on Y.
However, if the p-Value falls below the threshold, then the null hypothesis H0 is rejected, and the ANOVA conclusion is that X does affect Y. A minimal code sketch of this test appears after this list.
-
A table that shows statistics for different model segments and partitions.
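The following is a minimal sketch of the ANOVA test summarized above, using scipy's f_oneway. The sample data and group names are illustrative assumptions, not MicroStrategy output:

```python
# Minimal one-way ANOVA sketch (illustrative data; assumes scipy is installed).
from scipy import stats

# Values of a continuous dependent variable Y, grouped by the categories of a
# categorical independent variable X.
group_a = [23.1, 25.4, 24.8, 26.0]
group_b = [30.2, 29.5, 31.1, 28.9]
group_c = [24.0, 23.7, 25.1, 24.5]

# f_oneway returns the F statistic (MS(between)/MS(within)) and the p-Value.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

threshold = 0.20  # the typical threshold mentioned above
if p_value < threshold:
    print(f"p = {p_value:.4f}: reject H0 -- X does affect Y")
else:
    print(f"p = {p_value:.4f}: do not reject H0 -- X has no effect on Y")
```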
The tables below describe the statistics MicroStrategy includes in its models:
Statistic | Description |
---|---|
Model Related | |
Op Type | Indicates what type of operations can be performed on this variable: Categorical values can only be tested for equality; a variable with this Op Type is displayed with the Category icon. Ordinal values, in addition to being tested for equality, have a defined order; a variable with this Op Type is displayed with the Ordinal icon. Continuous values support arithmetic operators; a variable with this Op Type is displayed with the Numeric icon. |
Data Type | Indicates the type of data represented by the variable, such as numeric (for example, integer, double), string, date, and so on. |
Usage Type | Indicates how the variable is used in the model: Active is used for independent variables, which are usually inputs to the model; a variable with this Usage Type is displayed with the Input icon. Predicted is used for the dependent variable, which is usually the target of the model; a variable with this Usage Type is displayed with the Target icon. Supplementary means the variable is not used in the calculation of the model; a variable with this Usage Type is displayed with the Extra icon. Group means the variable is used to group records, for example, grouping items in a transaction; a variable with this Usage Type is displayed with the Group icon. Order means the variable defines the sequence of records; a variable with this Usage Type is displayed with the Order icon. |
Replace Missing | The value used to replace a missing value. If a variable's Usage Type is Predicted, the missing value is replaced when predictive metrics are created, rather than when the model is scored. This is because the dependent variable is used only to train the model; it is not relevant to scoring the model. |
Missing Value Treatment | Indicates the source of the value used for missing value replacement. This information is for reference only; the value of this statistic does not affect the scoring of the model. |
Outlier Value Treatment | Indicates how values that are identified as outliers are handled. This applies only to continuous variables that are numeric: asIs means values are used exactly as they are provided. asExtremeValues means values are replaced by their low or high value extremes (see Low Value and High Value below). asMissing means these values are treated as missing, and missing value replacement is used. (A code sketch of these treatments appears after this table.) |
Low Value | Lower boundary for outliers. When a value is less than this value and Outlier Value Treatment is asExtremeValues, this value is used instead. |
High Value | Upper boundary for outliers. When a value is greater than this value and Outlier Value Treatment is asExtremeValues, this value is used instead. |
Invalid Value Treatment | Indicates how values that are identified as invalid are handled: asIs means that invalid values are processed exactly as they are provided. returnInvalid means scores from records with invalid inputs are flagged as invalid. asMissing means that invalid values are treated as missing values, and Missing Value Treatment is used. |
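To make the treatment options above concrete, here is a minimal sketch of how a scorer might apply them to one continuous input. This is not MicroStrategy's implementation; the function name, argument names, and sample values are illustrative assumptions based on the table above:

```python
# Hypothetical preprocessing sketch based on the treatments described above;
# not MicroStrategy code.
import math

def preprocess(value, low, high, replace_missing,
               outlier_treatment="asExtremeValues",
               invalid_treatment="asMissing"):
    """Apply invalid, missing, and outlier value treatments to one input."""
    # Invalid values (here, simply non-numeric ones) are handled first.
    if value is not None and not isinstance(value, (int, float)):
        if invalid_treatment == "returnInvalid":
            raise ValueError("record flagged as invalid")
        elif invalid_treatment == "asMissing":
            value = None  # fall through to missing value replacement
        else:  # "asIs": pass the invalid value through unchanged
            return value

    # Missing values are replaced with the Replace Missing value.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return replace_missing

    # Outlier treatment for continuous numeric variables.
    if value < low or value > high:
        if outlier_treatment == "asExtremeValues":
            return max(low, min(value, high))  # clamp to [Low Value, High Value]
        elif outlier_treatment == "asMissing":
            return replace_missing
        # "asIs": keep the outlier unchanged.
    return value

print(preprocess(250.0, low=0.0, high=100.0, replace_missing=50.0))  # 100.0
print(preprocess(None,  low=0.0, high=100.0, replace_missing=50.0))  # 50.0
```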
Statistic | Description | Technical Definition |
---|---|---|
Model Level | ||
R2 | For Regression Models: the proportion of the variance in the dependent variable explained by all of the independent variables. (A code sketch computing these model-level statistics appears after this table.) | R2 = SSR/(SSR+SSE) |
| For Clustering Models: the proportion of variation explained by a particular clustering of the observations. | R2 = (TSS-ESS)/TSS |
| For Time Series Models: the variability of the dependent variable explained by the model. | R2 = 1-SSE/SST |
Adjusted R2 | For Regression models, adding independent variables sometimes makes the regression equation less accurate. Adjusted R2 takes the degrees of freedom into account to correct the R2 calculation for the number of independent variables. | Adjusted R2 = 1 - (1 - R2)[(n-1)/(n-k-1)], where n is the number of observations and k is the number of independent variables. |
p-Value (HighR2byChance) | The probability that a high R2 value occurred by chance. | Generated from the F-distribution, this p-Value gives the probability that a high R2 value occurred by chance, which would mean the R2 value is not statistically significant. |
Likelihood Ratio Test - Intercept-only Model | The difference in the quality of fit between the maximal model and the intercept-only model. | G2=-2LL0 Where LL0 is the log likelihood of the model, considering intercepts only. In other words, this is the model not considering any independent variables. |
Likelihood Ratio Test - Final Model | The difference in the quality of fit between the maximal model and the final model. | G2=-2LL Where LL is the log likelihood of the model, considering all of the independent variables. |
Likelihood Ratio Test - p-Value | The probability that all the coefficients are zero. | The p-Value is obtained from a chi-square test with the statistic Likelihood Ratio Test - Degrees of Freedom and the chi-square statistic G2 = -2(LL0-LL), where LL0 is the log likelihood of the model considering intercepts only, and LL is the log likelihood of the model considering all of the independent variables. The null hypothesis is H0: the intercept-only model performs the same as the final model. The null hypothesis is rejected at the α significance level if p-Value < α. |
Likelihood Ratio Test - Degrees of freedom | The difference in degrees of freedom between the intercept-only model and the final model. | The number of variables in the final model |
Degrees of freedom (df) | Number of records beyond the minimum needed. Low values can mean this model is overfit. High values are associated with more robust models. | df = Num_Obs - Min_Obs where: Num_Obs is the number of observations. Min_Obs is the minimum number of observations required to uniquely define the model. |
RMSE (SEy) | Root Mean Square Error. Square root of the average of the squared residuals. A residual is the difference between the observed values and the values predicted by the model. | This statistic is sometimes also called the Standard Error of the y estimates. SEy = √(Σ(yi-y'i)2/n) The value of n is dependent on the type of model: For regression models, n is the number of degrees of freedom for the model. For time series models, n is the size of the testing dataset. |
Sum of Squares Residual (SSE) | The sum of the squared differences between the actual value and the predicted value. | This statistic is sometimes also called the Sum of Squared Errors. SSE = Σ(yi-y'i)2 |
Sum of Squares Regression (SSR) | For Regression models, the sum of the squared differences between the predicted values and the mean of the actual values. | This statistic is abbreviated SSR. SSR = Σ(y'i-ymean)2 |
Sum of Squares Total (SST) | For Time Series models, the sum of the squared differences between the actual values and the mean of the actual values. | This statistic is abbreviated SST. SST = Σ(yi-ymean)2 |
Error Sum of Squares (ESS) | For Cluster models, the sum of squared differences between an observation and its cluster's center. | This statistic is similar to Sum of Squares Residual, but for Cluster models. It is abbreviated ESS: ESS = Σ(xi-ci)2, where xi is an observation and ci is the center of the cluster to which xi is assigned. |
Total Sum of Squares (TSS) | For Cluster models, the sum of squared differences between an observation and the mean of all cluster centers. | This statistic is similar to Sum of Squares Total, but for Cluster models. It is abbreviated TSS: TSS = Σ(xi-m)2, where xi is an observation and m is the mean of all cluster centers. |
F-statistic (F) | Used to determine whether the observed relationship between the dependent variable and the independent variables occurs by chance. | The significance of the regression is evaluated using the F-statistic. F = (SSR/dfSSR)/(SSE/dfSSE), where dfSSR is the number of degrees of freedom for the regression sum of squares, which is equal to the number of predictors (g), and dfSSE is the number of degrees of freedom for the error sum of squares, which is equal to the number of observations (n) minus the number of predictors (g) minus 1 (n - g - 1). |
Lift Chart | A graphical representation of a model's ability to rank records compared to optimal and random strategies. | Optimal is the cumulative sum of records, which are sorted in the ideal order (descending for continuous dependent variables, true values first for categorical dependent variables). Model is the cumulative sum of records, which are sorted by descending order of the predictions (predictive values for continuous dependent variables, probability for categorical dependent variables). Random is the straight line from (0%,0%) to (100%,100%). |
Confusion Matrix | A graphical representation of a classification model's results. This displays the model's ability to generate correct results and its tendency to confuse classes. | A matrix with actual outcomes on rows and predicted outcomes on columns, which contains three general regions: Main diagonal: the correct predictions. Below the main diagonal: the false positive or Type I errors, commonly associated with binary results. Above the main diagonal: the false negative or Type II errors, commonly associated with binary results. |
Cluster Model Quality | Used to compare the quality of cluster models. | This statistic calculates the total distance of all records to their assigned cluster's center. |
Intercept | The point where a line crosses the Y axis. | The constant term of a regression equation. For example, for the slope-intercept equation y = mx + b, b is the Intercept. The Intercept is also referred to as the Y-Intercept. |
Standard Error (Intercept) | The variability, or dispersion, of the intercept. | For a full technical definition, refer to the information on Standard Error provided below. |
Standard Error (Target) | The variability, or dispersion, of the target (dependent) variable. | For a full technical definition, refer to the information on RMSE (SEy) provided above. |
Number of Records | The number of records used to create the model. | The total number of values in the data, including missing values, minus the number of missing values. Total - Missing |
Variable Level (Univariate Analysis) | ||
Mean | The arithmetic mean of a numeric variable. | Also called the expected value. Mean = Σ(xi)/n |
Standard Deviation | The variability, or dispersion, of a numeric variable's values. | Typically defined for the entire population. It can also be described as the square root of the variance. Std Dev = √(Σ(xi-xmean)2/n) |
Median | The middle value for a numeric variable when its values are sorted lowest to highest. | This is also known as the fiftieth percentile value. |
Mode | The value that occurs most frequently. | Typically only reported for categorical variables. It is possible for a set of values to have more than one mode and, in these cases, the first mode found is listed here. |
Min | The minimum value, only reported for numeric variables. | The smallest number in a set of values |
Max | The maximum value, only reported for numeric variables. | The largest number in a set of values |
Inter-quartile range | A measure of dispersion, it represents the range of the middle fifty percent of the values of a numeric variable. | It is the difference between the twenty-fifth and the seventy-fifth percentiles. |
Total | The total number of values in the data, including missing values. | This is also known as the Total Frequency. |
Missing | The number of missing values. | This is also known as the Missing Frequency. |
Cardinality | The number of unique values for a variable. | Cardinality refers to the uniqueness of data values observed for a variable. The lower the cardinality, the more duplicated values exist. The lowest possible cardinality is one, which means that all the values are the same. Binary indicators and status flags are examples of low cardinality variables. Primary keys and IDs are examples of high cardinality variables. For the purposes of data mining, variables that contain values such as names and area codes are typically considered high cardinality. |
Analysis of Variance (ANOVA) | A table that describes the relationship between a continuous dependent variable and a categorical independent variable. | Describes the variance between a continuous dependent variable and a categorical independent variable. This includes the p-Value, which represents the probability that the relationship between the variables occurred by chance and is not statistically significant. |
Variable Level (Multivariate Analysis) | ||
Importance | Indicates the importance of the given independent variable in predicting the dependent variable. The importance is described as a scale of 0 to 1 where 1 is the most important. | The calculation depends on the type of model. For additional information on how Importance of a variable is determined, see Determining the importance of a variable. |
Standard Error | Similar to Standard Deviation (which describes the variability, or dispersion, of a population), standard error describes the variability, or dispersion, of a coefficient estimate. | The standard deviation of the sampling distribution of the coefficient. It can also be described as the square root of the variance. For the simple linear regression model y = mx + b, the standard error of the slope m is SE(m) = SEy/√(Σ(xi-xmean)2). |
z-Score | Describes how many standard errors a coefficient estimate lies from zero. | Computed for each coefficient by dividing the coefficient by its standard error. |
Coefficient | Quantifies the rate of change of the dependent variable with respect to a particular independent variable. For example, with a simple linear regression model, y = mx + b, m is the coefficient that represents the slope of the regression line. | A multiplicative factor of a term in a regression equation, each independent variable has an associated coefficient that was determined to best describe its relationship with the dependent variable. |
Exponent | The power to which a variable is raised. | Typically, the value is 1. |
p-Value | A measure of statistical significance. This is the probability of observing the relationship by chance when no true relationship exists. Statisticians compare this value to a critical value called Alpha (α, typically set to 0.05) to determine significance (p-Value < α indicates statistical significance). | The p-Value gives the probability of getting a coefficient value in your sample assuming that the coefficient in the population is zero. For Linear and Exponential Regression models, the p-Value is generated from the T-distribution. For Logistic Regression models, the p-Value is generated from the ChiSquare distribution, with the square of the z-Score as the ChiSquare value. |
p-Value(T-Dist)-Final | For Linear and Exponential Regression models, a variable's importance in the final version of the model is calculated by subtracting this value from one. | This p-Value comes from the T-Distribution using the t-Value and is calculated after any variables are eliminated due to variable reduction settings (see Advanced Options). |
p-Value(T-Dist)-Initial | For Linear and Exponential Regression models, a variable's importance in the initial version of the model, that is saved in the PMML, is calculated by subtracting this value from one. | This p-Value comes from the T-Distribution using the t-Value and is calculated prior to the elimination of any variables due to variable reduction settings (see Advanced Options). |
p-Value(ChiSquare) | For Logistic Regression models, a variable's importance, that is saved in the PMML, is calculated by subtracting this value from one. | This p-Value comes from the ChiSquare-Distribution using the square of z-Score. |
t-Value | Used to assess the significance of individual coefficients. | Computed for each coefficient by dividing the coefficient by its standard error. Compared to the critical t-value (t-critical) from a table, based on the degrees of freedom and alpha (usually set at 0.05). Used to test the null hypothesis that the regression coefficient is zero: if the observed t-Value exceeds t-critical in absolute value, the null hypothesis is rejected and the coefficient is considered statistically significant. |
Correlation Matrix | For Linear and Exponential Regression models, displays pair-wise variable correlations. | Correlations range from 1 (perfectly correlated), through 0 (no correlation), to -1 (inversely correlated). |
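As a closing illustration, here is a minimal sketch that computes several of the model-level statistics defined above (SSE, SSR, SST, R2, adjusted R2, and RMSE) from a set of actual and predicted values. The data are made up, and the use of n - k - 1 as the regression degrees of freedom is an assumption consistent with the Degrees of freedom entry above:

```python
# Minimal sketch of the model-level goodness-of-fit statistics defined above
# (illustrative data; not MicroStrategy code).
import math

actual    = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.8, 5.3, 6.9, 9.4, 10.6]
k = 1                              # number of independent variables
n = len(actual)                    # number of observations
mean_y = sum(actual) / n

sse = sum((y - yp) ** 2 for y, yp in zip(actual, predicted))  # Σ(yi - y'i)²
ssr = sum((yp - mean_y) ** 2 for yp in predicted)             # Σ(y'i - ymean)²
sst = sum((y - mean_y) ** 2 for y in actual)                  # Σ(yi - ymean)²

r2 = ssr / (ssr + sse)                         # regression form: SSR/(SSR+SSE)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
rmse = math.sqrt(sse / (n - k - 1))            # using the model's degrees of freedom

print(f"SSE={sse:.3f}  SSR={ssr:.3f}  SST={sst:.3f}")
print(f"R2={r2:.3f}  adjusted R2={adj_r2:.3f}  RMSE={rmse:.3f}")
```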