Configuring Advanced Options

  1. Leave Ignore constant fields checked to skip fields that have the same value for each record.
  2. Leave Seed for algorithm checked and enter a seed number so that the split into test and training data occurs the same way each time you run the dataflow. Uncheck this field to get a random split each time you run the flow.
  3. Select the correct initialization mode in the Init drop-down.
    Furthest

    Initializes the first centroid randomly, then initializes the second centroid to be the data point farthest from it. This spreads the centroids well apart from each other.

    Plus-Plus

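    Chooses the initial cluster centers with the k-means++ method before proceeding with the standard k-means optimization iterations. Each new center is sampled with probability proportional to its squared distance from the nearest center already chosen. With k-means++ initialization, the algorithm is guaranteed to find a solution whose expected cost is within a factor of O(log k) of the optimal k-means solution.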

    Random

    Default. Chooses K of the N observations at random as the initial cluster centers, so that each observation has an equal chance of being chosen.

  4. Leave Seed for N fold checked and enter a seed number so that the split into test and training data occurs the same way each time you run the dataflow. Uncheck this field to get a random split each time you run the flow.
  5. Check N fold and enter the number of folds if you are performing cross-validation.
  6. Check Fold assignment and select from the drop-down list if you are performing cross-validation. This field is applicable only if you entered a value in N fold.
    Auto

    Default. Allows the algorithm to automatically choose an option; currently it uses Random.

    Modulo

    Evenly splits the dataset into the folds and does not depend on the seed.

    Random

    Randomly splits the data into nfolds pieces; best for large datasets.

  7. Check Maximum iterations and enter the maximum number of training iterations to run.
  8. Click OK to save the model and configuration or continue to the next tab.
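
The three initialization modes in step 3 can be illustrated with a minimal Python sketch. This is not the product's implementation: the function names are hypothetical, and extending Furthest beyond the second centroid (each further centroid is the point farthest from its nearest already-chosen centroid) is an assumption made for illustration. The sketch also shows how a fixed seed reproduces the same starting centroids.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_sq_dist(point, centers):
    """Squared distance from a point to its closest already-chosen center."""
    return min(sq_dist(point, c) for c in centers)

def init_random(points, k, rng):
    """Random: pick k distinct observations, each equally likely."""
    return rng.sample(points, k)

def init_furthest(points, k, rng):
    """Furthest: random first center, then the point farthest from the
    centers chosen so far (assumed generalization beyond two centers)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: nearest_sq_dist(p, centers)))
    return centers

def init_plus_plus(points, k, rng):
    """Plus-Plus: sample each new center with probability proportional to
    its squared distance from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [nearest_sq_dist(p, centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Hypothetical toy data: three loose groups of two points each.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.8, 0.3)]

# The same seed yields the same initial centers on every run.
same_a = init_plus_plus(points, 3, random.Random(42))
same_b = init_plus_plus(points, 3, random.Random(42))
print(same_a == same_b)
```

Passing a differently seeded `random.Random` (or an unseeded one) generally produces different starting centroids, which mirrors unchecking the seed field.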
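
The Modulo and Random fold assignments in step 6 can be sketched in the same way. The sketch assumes that Modulo assigns each row to the fold given by its row index modulo the number of folds, which is one common way to get an even, seed-independent split. The function names are hypothetical and do not come from the product.

```python
import random

def assign_modulo(n_rows, n_folds):
    """Modulo (assumed behavior): row i goes to fold i % n_folds.
    The result is deterministic and does not use a seed."""
    return [i % n_folds for i in range(n_rows)]

def assign_random(n_rows, n_folds, seed=None):
    """Random: each row draws a fold uniformly at random.
    With seed=None the split generally differs from call to call."""
    rng = random.Random(seed)
    return [rng.randrange(n_folds) for _ in range(n_rows)]

print(assign_modulo(10, 3))  # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]

# A fixed seed reproduces the same random fold assignment.
print(assign_random(10, 3, seed=7) == assign_random(10, 3, seed=7))  # True
```

With random assignment, fold sizes are only approximately equal, and the imbalance shrinks as the number of rows grows. That is consistent with the note that Random suits large datasets.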