Using a Duplicate Synchronization MapReduce Job
- Create an instance of AdvanceMatchFactory using its static method getInstance().
- Provide the input and output details for the Duplicate Synchronization job by creating an instance of DuplicateSyncDetail, specifying the ProcessType. The instance must use the type MRProcessType.
  - Specify the column by which the records are to be grouped by creating an instance of GroupbyOption. Use an instance of GroupbyMROption to specify the group-by column and the number of reducers required.
  - Generate the consolidation conditions for the job by creating an instance of DuplicateSynchronizationConfiguration. Within this instance, define the consolidation conditions using instances of ConsolidationCondition, and connect the conditions using logical operators. Each instance of ConsolidationCondition is defined using a ConsolidationRule instance and its corresponding ConsolidationAction instance.
    Note: Each instance of ConsolidationRule can be defined either using a single instance of SimpleRule, or using a hierarchy of child SimpleRule instances and nested ConjoinedRule instances joined using logical operators. See Enum JoinType and Enum Operation.
  - Create an instance of DuplicateSyncDetail by passing an instance of type JobConfig, the GroupbyOption instance created, and the DuplicateSynchronizationConfiguration instance created above as the arguments to its constructor. The JobConfig parameter must be an instance of type MRJobConfig.
  - Set the details of the input file using the inputPath field of the DuplicateSyncDetail instance.
    - For a text input file, create an instance of FilePath with the relevant details of the input file by invoking the appropriate constructor.
    - For an ORC input file, create an instance of OrcFilePath with the path of the ORC input file as the argument.
    - For a Parquet input file, create an instance of ParquetFilePath with the path of the Parquet input file as the argument.
  - Set the details of the output file using the outputPath field of the DuplicateSyncDetail instance.
    - For a text output file, create an instance of FilePath with the relevant details of the output file by invoking the appropriate constructor.
    - For an ORC output file, create an instance of OrcFilePath with the path of the ORC output file as the argument.
    - For a Parquet output file, create an instance of ParquetFilePath with the path of the Parquet output file as the argument.
  - Set the name of the job using the jobName field of the DuplicateSyncDetail instance.
  - Set the compressOutput flag of the DuplicateSyncDetail instance to true to compress the output of the job.
- Create the job by using the previously created instance of AdvanceMatchFactory to invoke its method createJob(). Pass the above instance of DuplicateSyncDetail as an argument. The createJob() method returns a List of instances of ControlledJob.
- Run the created job using an instance of JobControl.
- To display the reporting counters after a successful MapReduce job run, use the previously created instance of AdvanceMatchFactory to invoke its method getCounters(), passing the created job as an argument.
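
The note on ConsolidationRule describes a tree: leaf SimpleRule instances combined by nested ConjoinedRule instances through logical operators. The sketch below illustrates that idea only. Every type in it is a local stand-in defined inside the example. The names mirror the text, but the constructors, the AND/OR values of JoinType, and the matches() method are assumptions made for illustration and are not the SDK's actual signatures. The field names State and Source are likewise hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local stand-ins only: these types mirror names from the text but are NOT the SDK classes.
final class RuleTreeSketch {

    /** A rule accepts or rejects one record (field name -> value). */
    interface Rule {
        boolean matches(Map<String, String> record);
    }

    /** Assumed logical operators for joining child rules. */
    enum JoinType { AND, OR }

    /** Leaf rule (hypothetical shape): true when the field equals the expected value. */
    static final class SimpleRule implements Rule {
        private final String field;
        private final String expected;

        SimpleRule(String field, String expected) {
            this.field = field;
            this.expected = expected;
        }

        @Override
        public boolean matches(Map<String, String> record) {
            return expected.equals(record.get(field));
        }
    }

    /** Composite rule (hypothetical shape): joins its child rules with AND or OR. */
    static final class ConjoinedRule implements Rule {
        private final JoinType joinType;
        private final List<Rule> children;

        ConjoinedRule(JoinType joinType, Rule... children) {
            this.joinType = joinType;
            this.children = Arrays.asList(children);
        }

        @Override
        public boolean matches(Map<String, String> record) {
            if (joinType == JoinType.AND) {
                return children.stream().allMatch(child -> child.matches(record));
            }
            return children.stream().anyMatch(child -> child.matches(record));
        }
    }

    /** Builds: State == "NY" AND (Source == "CRM" OR Source == "WEB"). */
    static Rule sampleRule() {
        Rule sourceRule = new ConjoinedRule(JoinType.OR,
                new SimpleRule("Source", "CRM"),
                new SimpleRule("Source", "WEB"));
        return new ConjoinedRule(JoinType.AND, new SimpleRule("State", "NY"), sourceRule);
    }

    /** Convenience helper that builds a two-field record for the demo. */
    static Map<String, String> record(String state, String source) {
        Map<String, String> r = new HashMap<>();
        r.put("State", state);
        r.put("Source", source);
        return r;
    }

    public static void main(String[] args) {
        Rule rule = sampleRule();
        System.out.println(rule.matches(record("NY", "WEB"))); // prints true
        System.out.println(rule.matches(record("NY", "FAX"))); // prints false
        System.out.println(rule.matches(record("CA", "CRM"))); // prints false
    }
}
```

In this sketch a nested ConjoinedRule is itself a Rule, so hierarchies of any depth compose the same way. In the SDK, each such rule would additionally be paired with a ConsolidationAction to form a ConsolidationCondition, which this example does not model.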