Group-By Option |
For a MapReduce job, pass these arguments:
- GroupBy Column
- The name of the column by which the records are grouped.
- Number of Reducer Tasks
- The number of reducer tasks required to group the records.
For a Spark job, pass this argument:
- GroupBy Column
- The name of the column by which the records are grouped.
|
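As an illustration of what grouping by a column means, the following minimal sketch groups in-memory records by the value of one column. The class `GroupBySketch` and its `groupBy` method are hypothetical names used only for this example and are not part of the SDK; the sketch also assumes each record is a simple map of column name to value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not an SDK class.
class GroupBySketch {

    // Groups records by the value found in groupByColumn.
    // Assumption: records that lack the column are grouped under an empty key.
    static Map<String, List<Map<String, String>>> groupBy(
            List<Map<String, String>> records, String groupByColumn) {
        Map<String, List<Map<String, String>>> groups = new LinkedHashMap<>();
        for (Map<String, String> record : records) {
            String key = record.getOrDefault(groupByColumn, "");
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
        }
        return groups;
    }
}
```

In an actual job, the grouping is distributed across reducer tasks (MapReduce) or executors (Spark); the sketch shows only the logical result.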
Match Rule |
Define as many parent and child rules as required to create a
MatchRule object. For more information, see MatchRule.
|
Candidate File |
For text files:
- File Path
- The path of the candidate text file on the Hadoop platform.
- Record Separator
- The record separator used in the candidate file.
- Field Separator
- The separator used between any two consecutive fields of a record, in the candidate
file.
- Text Qualifier
- The character used to surround text values in a delimited file.
- Header Row Fields
- An array of the header fields of the candidate file.
- Skip First Row
- Flag to indicate whether the first row must be skipped while reading the candidate file
records.
This must be true if the first row is a header row.
Attention: Invoke the appropriate constructor of FilePath.
For ORC format files:
- ORC File Path
- The path of the input ORC format file on the Hadoop platform.
Important: The suspect and candidate files must be of the same format: either text
files or ORC format files.
Common parameters:
- Field Mappings
- A map of key-value pairs, with the existing column names as the keys and the desired
output column names as the values.
|
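To show how the text-file parameters above interact, here is a minimal sketch of reading delimited records with a record separator, a field separator, a text qualifier, and the skip-first-row flag. `DelimitedLineSketch` and its methods are hypothetical and are not SDK classes. The sketch assumes that qualifiers are never escaped inside a value and that record separators never occur inside qualified text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an SDK class.
class DelimitedLineSketch {

    // Splits one record into fields. A field separator that appears
    // between two text qualifiers is kept as part of the value.
    static List<String> splitRecord(String line, char fieldSeparator, char textQualifier) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQualifier = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == textQualifier) {
                insideQualifier = !insideQualifier; // the qualifier itself is dropped
            } else if (c == fieldSeparator && !insideQualifier) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Splits the content into records and optionally skips the header row.
    static List<List<String>> readRecords(String content, String recordSeparator,
            char fieldSeparator, char textQualifier, boolean skipFirstRow) {
        List<List<String>> records = new ArrayList<>();
        String[] lines = content.split(Pattern.quote(recordSeparator));
        for (int i = skipFirstRow ? 1 : 0; i < lines.length; i++) {
            records.add(splitRecord(lines[i], fieldSeparator, textQualifier));
        }
        return records;
    }
}
```

With skip-first-row set to true, a header line such as `Name,City` is not returned as a data record.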
Suspect File |
For text files:
- File Path
- The path of the suspect text file on the Hadoop platform.
- Record Separator
- The record separator used in the suspect file.
- Field Separator
- The separator used between any two consecutive fields of a record, in the suspect
file.
- Text Qualifier
- The character used to surround text values in a delimited file.
- Header Row Fields
- An array of the header fields of the suspect file.
- Skip First Row
- Flag to indicate whether the first row must be skipped while reading the suspect file
records.
This must be true if the first row is a header row.
Attention: Invoke the appropriate constructor of FilePath.
For ORC format files:
- ORC File Path
- The path of the input ORC format file on the Hadoop platform.
Common parameters:
- Field Mappings
- A map of key-value pairs, with the existing column names as the keys and the desired
output column names as the values.
|
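The Field Mappings parameter described above can be pictured as a rename applied to each record. The following sketch is illustrative only: `FieldMappingSketch` and `applyMappings` are hypothetical names, and the behavior for unmapped columns (they keep their existing names) is an assumption rather than documented SDK behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not an SDK class.
class FieldMappingSketch {

    // Renames the columns of a record using fieldMappings
    // (existing column name -> desired output column name).
    // Assumption: columns without a mapping keep their existing names.
    static Map<String, String> applyMappings(
            Map<String, String> record, Map<String, String> fieldMappings) {
        Map<String, String> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : record.entrySet()) {
            String outputName = fieldMappings.getOrDefault(field.getKey(), field.getKey());
            renamed.put(outputName, field.getValue());
        }
        return renamed;
    }
}
```

For example, a mapping of `FNAME` to `FirstName` turns the record `{FNAME=John, ZIP=78701}` into `{FirstName=John, ZIP=78701}`.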
Output File |
For text files:
- File Path
- The path of the output text file on the Hadoop platform.
- Field Separator
- The separator used between any two consecutive fields of a record, in the output
file.
Attention: Invoke the appropriate constructor of FilePath.
For ORC format files:
- ORC File Path
- The path of the output ORC format file on the Hadoop platform.
For Parquet format files:
- Parquet File Path
- The path of the output Parquet format file on the Hadoop platform.
Common parameters:
- Overwrite
- Flag to indicate whether the output file must overwrite any existing file of the same name.
- Create Output Header
- Flag to indicate whether a header file is to be created on the Hadoop server.
|
Job Configurations |
The Hadoop configurations for the job. For a MapReduce job, the instance must be of
type MRJobConfig. For a Spark job, the instance must be of type SparkJobConfig.
|
Match Key Settings |
A combination of the columns and the algorithms applied to generate the match
key required to perform the matching.
Note: Specify only one match key.
Attention: Set the match key settings only if you want to generate a match key before
performing the matching.
|
Job Name |
The name of the job. |
Express Match Column |
The name of the column to be used for express matching of records. |
Setting Collection Number Zero to Unique Records |
Set this to true to assign the collection number 0 (zero) to unique
records. |
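The effect of this flag can be sketched as follows. `CollectionNumberSketch` is a hypothetical class, not part of the SDK. The sketch assumes that a unique record is a collection containing exactly one record, and that when the flag is false a unique record receives the next nonzero collection number; the second assumption is not confirmed by this page.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not an SDK class.
class CollectionNumberSketch {

    // Assigns a collection number to every record ID.
    // Collections with more than one record are numbered 1, 2, 3, ...
    // A single-record collection gets 0 when zeroForUnique is true;
    // otherwise it gets the next nonzero number (assumption).
    static Map<String, Integer> assign(List<List<String>> collections, boolean zeroForUnique) {
        Map<String, Integer> numbers = new LinkedHashMap<>();
        int next = 1;
        for (List<String> collection : collections) {
            boolean unique = collection.size() == 1;
            int number = (unique && zeroForUnique) ? 0 : next++;
            for (String recordId : collection) {
                numbers.put(recordId, number);
            }
        }
        return numbers;
    }
}
```

With the flag set to true, downstream steps can select all unique records by filtering on collection number 0.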
Comparison Option |
Allows you to select one of the two options:
|
Compress Output |
Flag to indicate if the output must be compressed. Set this to
true to compress the output.
|