Using an Interflow Match Spark Job
- Create an instance of AdvanceMatchFactory, using its static method getInstance().
- Provide the input and output details for the Interflow Match job by creating an instance of InterMatchDetail, specifying the ProcessType. The instance must use the type SparkProcessType.
- Specify the column on which the records are to be grouped by creating an instance of GroupbyOption. Use an instance of GroupbySparkOption to specify the group-by column.
- Generate the matching rules for the job by creating an instance of MatchRule.
- Create an instance of InterMatchDetail by passing an instance of type JobConfig, the GroupbyOption instance, and the MatchRule instance created above as the arguments to its constructor. The JobConfig parameter must be an instance of type SparkJobConfig.
- Set the details of the candidate file using the candidateFilePath field of the InterMatchDetail instance. For a text candidate file, create an instance of FilePath with the relevant details of the candidate file by invoking the appropriate constructor. For an ORC candidate file, create an instance of OrcFilePath with the path of the ORC candidate file as the argument. For a Parquet candidate file, create an instance of ParquetFilePath with the path of the Parquet candidate file as the argument.
- Set the details of the suspect file using the suspectFilePath field of the InterMatchDetail instance. For a text suspect file, create an instance of FilePath with the relevant details of the suspect file by invoking the appropriate constructor. For an ORC suspect file, create an instance of OrcFilePath with the path of the ORC suspect file as the argument. For a Parquet suspect file, create an instance of ParquetFilePath with the path of the Parquet suspect file as the argument. Important: The suspect and candidate files must be of the same format: both text, both ORC, or both Parquet.
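For example, with ORC-format input files, the candidate and suspect paths might be set as in the following sketch. This is a minimal illustration: the field names come from the steps above, OrcFilePath is assumed (per the step above) to take the file path as its single constructor argument, and the paths themselves are placeholders:

```java
// Illustrative sketch only; adjust paths and constructors to your SDK version.
// Both files must use the same format (text, ORC, or Parquet).
interMatchDetail.candidateFilePath = new OrcFilePath("/bigdata/interflow/candidates.orc");
interMatchDetail.suspectFilePath   = new OrcFilePath("/bigdata/interflow/suspects.orc");
```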
- Set the details of the output file using the outputPath field of the InterMatchDetail instance.
  - For a text output file, create an instance of FilePath with the relevant details of the output file by invoking the appropriate constructor.
  - For an ORC output file, create an instance of OrcFilePath with the path of the ORC output file as the argument.
  - For a Parquet output file, create an instance of ParquetFilePath with the path of the Parquet output file as the argument.
- Set the name of the job using the jobName field of the InterMatchDetail instance.
- If required, set the Express Match Column using the expressMatchColumn field of the InterMatchDetail instance.
- Set the flag collectionNumberZerotoUniqueRecords of the InterMatchDetail instance to true to allocate the collection number 0 (zero) to each unique record. The default is true. If you do not wish to allocate the collection number zero to unique records, set this flag to false.
- Set the comparison option using the comparisonOption field of the InterMatchDetail instance. In this field, set the required value using the class InterMatchComparisonOption to select one of the two options:
  - Compare the suspect record to all candidate records: specify whether unique records must be returned in the output.
  - Compare the suspect record to the selected candidate record only: specify the maximum number of duplicate records to be searched and returned.
- Set the compressOutput flag of the InterMatchDetail instance to true to compress the output of the job.
- If the input data does not have match keys, you must specify the match key settings so that the Match Key Generator job runs first to generate the match keys, before the Interflow Match job runs. To generate the match keys for the input data, create and configure an instance of MatchKeySettings, and set it using the matchKeySettings field of the InterMatchDetail instance. Note: To see how to set match key settings, see the code samples.
- To create and run the Spark job, use the previously created instance of AdvanceMatchFactory to invoke its method runSparkJob(), passing the above InterMatchDetail instance as the argument. The runSparkJob() method runs the job and returns a Map of the reporting counters of the job.
- Display the counters to view the reporting statistics for the job.