Creating a Best of Breed Record
To eliminate duplicate records from your data, you may choose to merge data from groups of duplicate records into a single "best of breed" record. This approach is useful when each duplicate record contains data of the same type (for example, phone numbers or names) and you want to preserve the best data from each record in the surviving record.
This procedure describes how create a dataflow that merges duplicate records into a best of breed record.
-
In Enterprise Designer, create a dataflow that identifies duplicate records through matching.
Matching is the first step in deduplication because you need to identify records that are similar, such as records that have the same account number or name. See the following topics for instructions on creating a dataflow that matches records.
- Matching Records from a Single Source
- Matching Records from One Source to Another Source
- Matching Records Against a Database
Note: You only need to build the dataflow to the point where it reads data and performs matching with an Interflow Match, Intraflow Match, or Transactional Match stage. Once you have created a dataflow to this point, continue with the following steps. -
Once you have defined a dataflow that reads data and matches records, drag a Best of Breed stage to the canvas and connect it to the stage that performs the matching (Interflow Match, Intraflow Match, or Transactional Match).
For example, if your dataflow reads data from a file and performs matching with Intraflow Match, your dataflow would look like this after adding a Best of Breed stage:
- Double-click the Best of Breed stage on the canvas.
- In the Group by field, select CollectionNumber.
- Under Best of Breed Settings, select Rules in the conditions tree.
-
Click Add Rule.
Records in each group are evaluated to see if they meet the rules you define here. If a record matches a rule, its data may be copied to the best of breed record, depending on how you configure the actions associated with the rule. You will define actions later.
-
Define a rule that a duplicate record must meet in order for a its data to be copied to the best of breed record.
Use the following options to define a rule:
Option Description Field name
Specifies the name of the dataflow field whose value you want to evaluate to determine if the condition is met and the associated actions should be taken.
Field Type
Specifies the type of data in the field. One of the following:
- Non-Numeric
- Choose this option if the field contains non-numeric data (for example, string data).
- Numeric
- Choose this option if the field contains numeric data (for example, double, float, and so on).
Operator
Specifies the type of comparison you want to use to evaluate the field. One of the following:
- Contains
- Determines if the field contains the value specified. For example, "sailboat" contains the value "boat".
- Equal
- Determines if the field contains the exact value specified.
- Greater Than
- Determines if the field value is greater than the value specified. This operation only works on numeric fields.
- Greater Than Or Equal To
- Determines if the field value is greater than or equal to the value specified. This operation only works on numeric fields.
- Highest
- Compares the field's value for all the records group and determines which record has the highest value in the field. For example, if the fields in the group contain values of 10, 20, 30, and 100, the record with the field value 100 would be selected. This operation only works on numeric fields. If multiple records are tied for the longest value, one record is selected.
- Is Empty
- Determines if the field contains no value.
- Is Not Empty
- Determines if the field contains any value.
- Less Than
- Determines if the field value is less than the value specified. This operation only works on numeric fields.
- Less Than Or Equal To
- Determines if the field value is less than or equal to the value specified. This operation only works on numeric fields.
- Longest
- Compares the field's value for all the records group and determines which record has the longest (in bytes) value in the field. For example, if the group contains the values "Mike" and "Michael", the record with the value "Michael" would be selected. If multiple records are tied for the longest value, one record is selected.
- Lowest
- Compares the field's value for all the records group and determines which record has the lowest value in the field. For example, if the fields in the group contain values of 10, 20, 30, and 100, the record with the field value 10 would be selected. This operation only works on numeric fields. If multiple records are tied for the longest value, one record is selected.
- Most Common
- Determines if the field value contains the value that occurs most frequently in this field among the records in the group. If two or more values are most common, no action is taken.
- Not Equal
- Determines if the field value is not the same as the value specified.
Value type
Specifies the type of value you want to compare to the field's value. One of the following:
Note: This option is not available if you select the operator Highest, Lowest, or Longest.- Field
- Choose this option if you want to compare another dataflow field's value to the field.
- String
- Choose this option if you want to compare the field to a specific value.
Value Specifies the value to compare to the field's value. If you selected Field in the Field type field, select a dataflow field. If you selected String in the Value type field, type the value you want to use in the comparison.
Note: This option is not available if you select the operator Highest, Lowest, or Longest. - Click OK.
- Click the Actions node in the tree.
- Click Add Action.
-
Specify the data to copy to the best of breed record if the record meets the criteria you defined in the rule.
Option Description Source type
Specifies the type of data to copy to the best of breed record. One of the following.
- Field
- Choose this option if you want to copy a value from a field to the best of breed record.
- String
- Choose this option if you want to copy a constant value to the best of breed record.
Source data
Specifies the data to copy to the best of breed record. If the source type is Field, select the field whose value you want to copy to the destination field. If the source type is String, specify a constant value to copy to the destination field.
Destination
Specifies the field in the best of breed record to which you want to copy the data specified in the Source data field.
Accumulate source data
If the data in the Source data field is numeric data, you can enable this option to combine the source data for all duplicate records and put the total value in the best of breed record.
For example, if there were three duplicate records in the group and they contained these values in the Deposits field:
100.00
20.00
5.00Then all three values would be combined and the total value, 125.00, would be put in the best of breed record's Deposits field.
-
Click OK.
You have now configured Best of Breed with one rule and one action. You can add additional rules and actions if needed.
- Click OK to close the Best of Breed Options window.
-
Drag a sink stage onto the canvas and connect it to the Best of Breed stage.
For example, if you were using a Write to File sink stage your dataflow would look like this:
-
Double-click the sink stage and configure it.
For information on configuring sink stages, see the Dataflow Designer's Guide.
You now have a dataflow that identifies matching records and merges records within a collection into a single best of breed record.