Configuring Advanced Options

  1. Leave Ignore constant fields checked to skip fields that have the same value for each record.
  2. Check Balance classes to balance the class distribution and either undersample the majority classes or oversample the minority classes.
  3. Select a Histogram type to specify the type of histogram used for finding optimal split points.
    Auto

    Buckets are binned from minimum to maximum in steps of (max-min)/N.
    QuantilesGlobal

    Buckets have equal population. This computes nbins quantiles for each numeric (non-binary) column, then refines/pads each bucket (between two quantiles) uniformly (and randomly for remainders) into a total of nbins_top_level bins.

    Random

    The algorithm will sample N-1 points from minimum to maximum and use the sorted list of those points to find the best split.

    RoundRobin

    The algorithm will cycle through all histogram types (one per tree).

    UniformAdaptive

    Each feature is binned into buckets of equal step size (not population). This is the quickest method but can lead to less accurate splits if the distribution is highly skewed.
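The difference between equal-width (UniformAdaptive-style) and equal-population (QuantilesGlobal-style) bucketing can be illustrated with a small sketch. This is plain Python for illustration only, not the product's internals; the function names are made up:

```python
# Illustrative sketch: equal-width bins vs. equal-population bins on skewed data.
def uniform_bins(values, n):
    """Equal step size: bin edges every (max - min) / n."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n
    return [lo + i * step for i in range(n + 1)]

def quantile_bins(values, n):
    """Equal population: edges at the k/n quantiles of the sorted data."""
    s = sorted(values)
    return [s[min(int(k * len(s) / n), len(s) - 1)] for k in range(n + 1)]

skewed = [1, 1, 2, 2, 3, 3, 4, 100]
print(uniform_bins(skewed, 4))   # wide, mostly empty bins because of the outlier
print(quantile_bins(skewed, 4))  # edges follow the data density
```

On this skewed sample, the single outlier (100) forces the equal-width buckets to leave most of the data in the first bin, while the quantile buckets track where the data actually sits — which is why equal-width binning can yield less accurate splits on highly skewed distributions.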

  4. Select a Categorical encoding.
    Auto

    Automatically performs enum encoding.

    Binary
    Converts categories to integers, then to binary, and assigns each digit a separate column. Encodes the data in fewer dimensions but with some distortion of the distances.
    Note: No more than 32 columns can exist per categorical feature.
    Eigen

    Uses k columns per categorical feature, keeping only the projections of the one-hot-encoded matrix onto the k-dimensional eigen space.

    Enum

    Leaves the dataset as is; each categorical feature uses one column, with categories mapped to integers internally.

    OneHotExplicit

    One column exists per category, with "1" or "0" in each cell indicating whether the row contains that category.
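The two explicit encodings above can be sketched by hand. This is an illustrative sketch only (the category names and helper functions are invented, not the product's API):

```python
# Hedged sketch of OneHotExplicit vs. Binary categorical encoding.
def one_hot(value, categories):
    """OneHotExplicit: one column per category, 1 where the row matches."""
    return [1 if value == c else 0 for c in categories]

def binary_encode(value, categories):
    """Binary: category index written in binary, one column per bit.

    Uses fewer columns than one-hot, at the cost of distorting distances
    between categories (two unrelated categories may share bits).
    """
    width = max(1, (len(categories) - 1).bit_length())
    index = categories.index(value)
    return [(index >> b) & 1 for b in reversed(range(width))]

cats = ["red", "green", "blue"]
print(one_hot("green", cats))        # [0, 1, 0]
print(binary_encode("blue", cats))   # [1, 0]  (index 2 in two bits)
```

Note how three categories need three one-hot columns but only two binary columns; the gap widens quickly as the number of categories grows.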

  5. Leave Seed for algorithm and N fold checked and enter a seed number to ensure that the data is split into training and test sets the same way each time you run the dataflow. Uncheck this field to get a random split each time you run the flow.
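The effect of a fixed seed can be seen in a minimal sketch of a seeded train/test split (illustrative only; the dataflow's own splitter is not shown here):

```python
# Sketch: the same seed always produces the same shuffle, hence the same split.
import random

def split(rows, test_fraction, seed):
    rng = random.Random(seed)               # same seed -> same shuffle
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]   # (train, test)

rows = list(range(10))
a = split(rows, 0.3, seed=42)
b = split(rows, 0.3, seed=42)
print(a == b)   # True: identical split on every run with the same seed
```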
  6. Check N fold and enter the number of folds if you are performing cross-validation.
  7. Check Fold assignment and select an option from the drop-down list if you are performing cross-validation. This field is applicable only if you entered a value in N fold and did not specify a Fold field.
    Auto

    Allows the algorithm to automatically choose an option; currently it uses Random.

    Modulo

    Evenly splits the dataset into the folds and does not depend on the seed.

    Random

    Randomly splits the data into nfolds pieces; best for large datasets.

    Stratified

    Stratifies the folds based on the response variable for classification problems. Evenly distributes observations from the different classes to all sets when splitting a dataset into train and test data. This can be useful if there are many classes and the dataset is relatively small.
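The contrast between Modulo and Random assignment can be sketched in a few lines (illustrative only, not the product's fold-assignment code):

```python
# Sketch: Modulo vs. Random fold assignment.
import random

def modulo_folds(n_rows, nfolds):
    """Modulo: row i goes to fold i % nfolds; exactly even, seed-independent."""
    return [i % nfolds for i in range(n_rows)]

def random_folds(n_rows, nfolds, seed):
    """Random: each row draws a fold; only roughly even, fine for large data."""
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]

print(modulo_folds(7, 3))          # [0, 1, 2, 0, 1, 2, 0]
print(random_folds(7, 3, seed=1))  # membership depends on the seed
```

With only a handful of rows, random assignment can leave folds noticeably uneven, which is why Modulo is the safer choice for small datasets and Random is recommended for large ones.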

  8. If you are performing cross-validation, check Fold field and select the field that contains the cross-validation fold index assignment from the drop-down list.
    This field is applicable only if you did not enter a value in N fold or Fold assignment.
  9. Check Stopping rounds to end training when the Stopping metric does not improve for the specified number of training rounds, and enter the number of unsuccessful rounds that must occur before training stops. Specify 0 to disable this feature. The metric is computed on the validation data (if provided); otherwise, the training data is used.
  10. Select a Stopping metric to determine when to quit creating new trees.
    AUC
    Area under ROC curve.
    Note: Applicable only to binomial models.
    Auto

    Defaults to deviance.

    Lifttopgroup

    Lift in the top prediction group.

    Logloss

    Logarithmic loss.

    Meanperclasserror

    The average misclassification rate.

    Misclassification

    The value of (1 - (correct predictions/total predictions)) * 100.

    MSE

    Mean squared error; incorporates both the variance and the bias of the predictor.

    RMSE

    Root mean squared error; the square root of MSE. Measures the differences between values predicted by a model or an estimator and the values actually observed.
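Two of the metrics above are simple enough to compute by hand, which makes their definitions concrete. This is an illustrative sketch, not the product's metric code:

```python
# Hedged sketch: Logloss and Misclassification computed from first principles.
import math

def logloss(y_true, p_pred):
    """Logarithmic loss for binary labels and predicted probabilities."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

def misclassification(y_true, y_pred):
    """(1 - (correct predictions / total predictions)) * 100, per the definition above."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return (1 - correct / len(y_true)) * 100

# One of four predictions is wrong -> 25% misclassification.
print(misclassification([1, 0, 1, 1], [1, 0, 0, 1]))  # 25.0
```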

  11. Check Stopping tolerance and enter a value to specify the relative tolerance for metric-based stopping; training ends if the improvement is less than this value. This field is enabled only if you checked Stopping rounds.
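The interplay of Stopping rounds and Stopping tolerance can be sketched as a simple rule over a history of metric scores. This sketch assumes a metric where lower is better (e.g. logloss) and is illustrative only, not the product's stopping logic:

```python
# Sketch: stop when the metric has not improved by at least `tolerance`
# (relative) over the last `rounds` scoring events.
def should_stop(history, rounds, tolerance):
    if rounds == 0 or len(history) <= rounds:
        return False                      # 0 disables the feature
    best_before = min(history[:-rounds])
    recent_best = min(history[-rounds:])
    # relative improvement achieved during the last `rounds` events
    return (best_before - recent_best) / abs(best_before) < tolerance

scores = [0.60, 0.50, 0.45, 0.449, 0.4489, 0.4488]
print(should_stop(scores, rounds=3, tolerance=0.01))  # True: under 1% improvement
```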
  12. Check Minimum split improvement and enter a value to specify the minimum relative improvement in squared error reduction required for a split to occur. When properly executed, this option can help reduce overfitting. Optimal values are in the 1e-10...1e-3 range. This field is enabled only if you checked Stopping rounds.
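The squared-error-reduction test behind Minimum split improvement can be sketched as follows (illustrative only, not the product's tree builder):

```python
# Sketch: accept a candidate split only if it reduces squared error by at
# least `min_improvement` relative to the parent node.
def sse(values):
    """Sum of squared errors around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_allowed(parent, left, right, min_improvement):
    reduction = (sse(parent) - (sse(left) + sse(right))) / sse(parent)
    return reduction >= min_improvement

# A clean split of two well-separated clusters easily clears a 1e-3 threshold.
parent = [1.0, 1.1, 9.0, 9.2]
print(split_allowed(parent, [1.0, 1.1], [9.0, 9.2], 1e-3))  # True
```

Raising the threshold prunes marginal splits that fit noise rather than signal, which is how this option curbs overfitting.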
  13. Click OK to save the model and configuration or continue to the next tab.