Creating a Best of Breed Record

To eliminate duplicate records from your data, you may choose to merge data from groups of duplicate records into a single "best of breed" record. This approach is useful when each duplicate record contains data of the same type (for example, phone numbers or names) and you want to preserve the best data from each record in the surviving record.

This procedure describes how create a dataflow that merges duplicate records into a best of breed record.

  1. In Enterprise Designer, create a dataflow that identifies duplicate records through matching.

    Matching is the first step in deduplication because you need to identify records that are similar, such as records that have the same account number or name. See the following topics for instructions on creating a dataflow that matches records.

    Note: You only need to build the dataflow to the point where it reads data and performs matching with an Interflow Match, Intraflow Match, or Transactional Match stage. Once you have created a dataflow to this point, continue with the following steps.
  2. Once you have defined a dataflow that reads data and matches records, drag a Best of Breed stage to the canvas and connect it to the stage that performs the matching (Interflow Match, Intraflow Match, or Transactional Match).

    For example, if your dataflow reads data from a file and performs matching with Intraflow Match, your dataflow would look like this after adding a Best of Breed stage:

  3. Double-click the Best of Breed stage on the canvas.
  4. In the Group by field, select CollectionNumber.
  5. Under Best of Breed Settings, select Rules in the conditions tree.
  6. Click Add Rule.

    Records in each group are evaluated to see if they meet the rules you define here. If a record matches a rule, its data may be copied to the best of breed record, depending on how you configure the actions associated with the rule. You will define actions later.

  7. Define a rule that a duplicate record must meet in order for a its data to be copied to the best of breed record.

    Use the following options to define a rule:

    Option Description

    Field name

    Specifies the name of the dataflow field whose value you want to evaluate to determine if the condition is met and the associated actions should be taken.

    Field Type

    Specifies the type of data in the field. One of the following:

    Non-Numeric
    Choose this option if the field contains non-numeric data (for example, string data).
    Numeric
    Choose this option if the field contains numeric data (for example, double, float, and so on).

    Operator

    Specifies the type of comparison you want to use to evaluate the field. One of the following:

    Contains
    Determines if the field contains the value specified. For example, "sailboat" contains the value "boat".
    Equal
    Determines if the field contains the exact value specified.
    Greater Than
    Determines if the field value is greater than the value specified. This operation only works on numeric fields.
    Greater Than Or Equal To
    Determines if the field value is greater than or equal to the value specified. This operation only works on numeric fields.
    Highest
    Compares the field's value for all the records group and determines which record has the highest value in the field. For example, if the fields in the group contain values of 10, 20, 30, and 100, the record with the field value 100 would be selected. This operation only works on numeric fields. If multiple records are tied for the longest value, one record is selected.
    Is Empty
    Determines if the field contains no value.
    Is Not Empty
    Determines if the field contains any value.
    Less Than
    Determines if the field value is less than the value specified. This operation only works on numeric fields.
    Less Than Or Equal To
    Determines if the field value is less than or equal to the value specified. This operation only works on numeric fields.
    Longest
    Compares the field's value for all the records group and determines which record has the longest (in bytes) value in the field. For example, if the group contains the values "Mike" and "Michael", the record with the value "Michael" would be selected. If multiple records are tied for the longest value, one record is selected.
    Lowest
    Compares the field's value for all the records group and determines which record has the lowest value in the field. For example, if the fields in the group contain values of 10, 20, 30, and 100, the record with the field value 10 would be selected. This operation only works on numeric fields. If multiple records are tied for the longest value, one record is selected.
    Most Common
    Determines if the field value contains the value that occurs most frequently in this field among the records in the group. If two or more values are most common, no action is taken.
    Not Equal
    Determines if the field value is not the same as the value specified.

    Value type

    Specifies the type of value you want to compare to the field's value. One of the following:

    Note: This option is not available if you select the operator Highest, Lowest, or Longest.
    Field
    Choose this option if you want to compare another dataflow field's value to the field.
    String
    Choose this option if you want to compare the field to a specific value.
    Value

    Specifies the value to compare to the field's value. If you selected Field in the Field type field, select a dataflow field. If you selected String in the Value type field, type the value you want to use in the comparison.

    Note: This option is not available if you select the operator Highest, Lowest, or Longest.
  8. Click OK.
  9. Click the Actions node in the tree.
  10. Click Add Action.
  11. Specify the data to copy to the best of breed record if the record meets the criteria you defined in the rule.
    Option Description

    Source type

    Specifies the type of data to copy to the best of breed record. One of the following.

    Field
    Choose this option if you want to copy a value from a field to the best of breed record.
    String
    Choose this option if you want to copy a constant value to the best of breed record.

    Source data

    Specifies the data to copy to the best of breed record. If the source type is Field, select the field whose value you want to copy to the destination field. If the source type is String, specify a constant value to copy to the destination field.

    Destination

    Specifies the field in the best of breed record to which you want to copy the data specified in the Source data field.

    Accumulate source data

    If the data in the Source data field is numeric data, you can enable this option to combine the source data for all duplicate records and put the total value in the best of breed record.

    For example, if there were three duplicate records in the group and they contained these values in the Deposits field:

    100.00
    20.00
    5.00

    Then all three values would be combined and the total value, 125.00, would be put in the best of breed record's Deposits field.

  12. Click OK.

    You have now configured Best of Breed with one rule and one action. You can add additional rules and actions if needed.

  13. Click OK to close the Best of Breed Options window.
  14. Drag a sink stage onto the canvas and connect it to the Best of Breed stage.

    For example, if you were using a Write to File sink stage your dataflow would look like this:

  15. Double-click the sink stage and configure it.

    For information on configuring sink stages, see the Dataflow Designer's Guide.

You now have a dataflow that identifies matching records and merges records within a collection into a single best of breed record.