MicroStrategy ONE

Advanced variable settings

Several settings can be applied to each variable. These settings affect how each input is handled during the training process.

Use the Advanced Variable Settings dialog box to apply settings to variables. Select one of the variables from the list, and then make your changes to the default settings. The following advanced settings are available:

  • Type defines the type of the variable. By default, the training algorithm determines the type of variable, based on the input values. Alternatively, you can specify that the input be treated specifically as one of the following types:

    • Categorical: This type indicates that the field consists of a number of discrete numeric or non-numeric values. These values can only be tested for equality. For example, the categorical variable Risk can have the values High, Medium, and Low.

    • Continuous: This type indicates that the field consists of a range of numeric values. Fields of this type can be used with arithmetic operators. For example, the continuous variable Age can be any number in the range of 0 to 100.

    • Ordinal: Fields of this type have similar properties to categorical fields. In addition, they have an order defined, which means that they can be tested for Greater Than or Less Than. For example, the ordinal variable Day can have the values Sunday, Monday, Tuesday, and so on, through Saturday.

      Data Mining Services does not treat ordinal variables any differently than categorical variables.

  • Missing Values can be used to define missing values in the input data. By default, only empty data is treated as missing. You may define a single numeric or non-numeric value (for example, -1, 999, 'NULL') to be treated as missing during analysis, in addition to empty data.

  • Missing Value Treatment defines how missing values should be treated during analysis. The following treatments are available:

    • Default: The default behavior depends on the type of variable selected:

      • Independent: Missing values for independent variables of type continuous are replaced with the median of the training input data. Likewise, missing values for independent variables of type categorical or ordinal are replaced with the mode of the training input data.

      • Dependent: Records containing missing dependent values are ignored during the training process.

      • Segmentation: All segmentation variables are categorical. Missing values are replaced with the mode of the training input data.

    • Ignore: Missing input values are ignored during the training process. When the model is generated, the variable is not assigned a missing value replacement. This usually means that when scoring a record that contains any missing input, a missing result will be returned. Certain models have intrinsic missing value handling that allows them to handle missing values within the model. For example, with Regression models, missing categorical values result in a value of zero for the corresponding predictor terms. Refer to the PMML specification for more details about the model-specific missing value handling.

    • Replace: Missing input values are replaced with one of the following: mean, median, mode or a user-defined value.

      If a variable's Usage Type is predicted, then the missing value is replaced when predictive metrics are created, rather than when the model is scored. This is because the dependent variable is used only to train the model and is not relevant to scoring it.
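The default replacement rules above can be sketched in a few lines of Python. This only illustrates the described behavior and is not the product's implementation. The function names (is_missing, default_replacement, fill_missing) and the representation of empty data as None or an empty string are assumptions made for the example.

```python
from statistics import median, mode

def is_missing(value, missing_marker=None):
    # Assumption: empty data arrives as None or "". A single user-defined
    # marker (for example -1 or 'NULL') is also treated as missing.
    if value is None or value == "":
        return True
    return missing_marker is not None and value == missing_marker

def default_replacement(training_values, var_type, missing_marker=None):
    # Default treatment for independent variables: median for continuous
    # inputs, mode for categorical or ordinal inputs.
    present = [v for v in training_values if not is_missing(v, missing_marker)]
    if var_type == "continuous":
        return median(present)
    return mode(present)

def fill_missing(values, replacement, missing_marker=None):
    # Substitute the replacement for every missing entry.
    return [replacement if is_missing(v, missing_marker) else v for v in values]

# Hypothetical training data: -1 is declared as a missing-value marker.
ages = [25, -1, 40, None, 31]
age_fill = default_replacement(ages, "continuous", missing_marker=-1)
filled_ages = fill_missing(ages, age_fill, missing_marker=-1)

risk = ["High", "", "Low", "High"]
risk_fill = default_replacement(risk, "categorical")
filled_risk = fill_missing(risk, risk_fill)
```

Here the replacement is computed once from the training data and then reused, which mirrors the idea that the generated model carries the replacement value with it.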

  • Invalid Values specifies how any invalid input should be treated when the model is scored. All input during the training process is treated as valid. The following treatments are available:

    • As is (default): Invalid input values will be processed unaltered.

    • As missing: Invalid input values will be treated as missing.

    • Return invalid: The input should not be processed, resulting in an invalid result being returned.
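As a rough sketch, the three invalid-value treatments could be modeled as follows. This section does not say how a value is recognized as invalid, so the example assumes a known set of valid values. The names treat_invalid and INVALID_RESULT are hypothetical and used only for illustration.

```python
# Hypothetical sentinel standing in for "an invalid result is returned".
INVALID_RESULT = object()

def treat_invalid(value, valid_values, treatment="as_is"):
    # Assumption: a value is invalid when it is not in the known valid set.
    if value in valid_values:
        return value
    if treatment == "as_is":
        return value            # processed unaltered
    if treatment == "as_missing":
        return None             # handed on to missing-value treatment
    if treatment == "return_invalid":
        return INVALID_RESULT   # scoring stops with an invalid result
    raise ValueError(f"unknown treatment: {treatment}")

valid_risk = {"High", "Medium", "Low"}
unchanged = treat_invalid("Extreme", valid_risk, "as_is")
as_missing = treat_invalid("Extreme", valid_risk, "as_missing")
rejected = treat_invalid("Extreme", valid_risk, "return_invalid")
```

Because all training input is treated as valid, a function like this would apply only at scoring time.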

  • Outliers Range allows you to define the expected range of the input values. Any input values that lie outside the defined range are treated as outliers. The following methods for defining the range are available:

    • None (default): No range is specified (no outliers).

    • Absolute: Specify absolute lower and upper bounds.

    • StdDev: Specify a number of standard deviations above and below the mean.

    • Percentile: Specify a percentage of all input above and below the mean.

  • Outlier Treatment specifies how outliers are treated, both during the training process and when the generated model is scored (treatment information is included within the model). The following treatments are available:

    • As is (default): Outliers will be processed unaltered.

    • As extreme: Outliers less than the lower bound will be replaced with the lower bound value, while those greater than the upper bound will be replaced with the upper bound value.

    • As missing: Outliers will be treated as missing.
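The interaction between an outlier range and an outlier treatment can be illustrated with a minimal Python sketch. It covers only the Absolute and StdDev range methods. The helper names (stddev_bounds, treat_outlier) are hypothetical, and the use of the population standard deviation is an assumption, since this section does not say which variant is used.

```python
from statistics import mean, pstdev

def stddev_bounds(training_values, k):
    # StdDev method: k standard deviations above and below the mean.
    # Assumption: population standard deviation of the training input.
    m = mean(training_values)
    s = pstdev(training_values)
    return m - k * s, m + k * s

def treat_outlier(value, lower, upper, treatment="as_is"):
    # Values inside the range are never altered.
    if lower <= value <= upper:
        return value
    if treatment == "as_is":
        return value                              # processed unaltered
    if treatment == "as_extreme":
        return lower if value < lower else upper  # clamp to nearest bound
    if treatment == "as_missing":
        return None                               # treated as missing
    raise ValueError(f"unknown treatment: {treatment}")

# Hypothetical training input with mean 5 and population std. dev. 2.
training = [2, 4, 4, 4, 5, 5, 7, 9]
lower, upper = stddev_bounds(training, k=1)

clamped_high = treat_outlier(9, lower, upper, "as_extreme")
clamped_low = treat_outlier(2, lower, upper, "as_extreme")
dropped = treat_outlier(9, lower, upper, "as_missing")
```

For the Absolute method, the bounds would simply be passed to treat_outlier directly instead of being derived from the training data.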

  • Transformations allow you to experiment with applying transformations to the input data before the analysis is performed. Transformations can be applied only to independent variables of type continuous, and are available only for linear and exponential regression.

Do one of the following: