Configuration Files
These tables describe the parameters and the values you need to specify before you run the Interflow Match job.
Parameter | Description |
---|---|
pb.bdq.input.type | Input file type. The values can be: file, TEXT, or ORC. |
Suspect File | |
pb.bdq.match.suspect.inputfile.path | Path where you have placed the suspect input file on HDFS. Example: /user/hduser/sampledata/intermatch/input/Interflow_Suspect.txt. |
pb.bdq.match.suspect.recordseparator | Record delimiter used in the suspect file. For example, LINUX, MACINTOSH, or WINDOWS. |
pb.bdq.match.suspect.fieldseparator | Field or column delimiter used in the input file, such as comma (,) or tab. |
pb.bdq.match.suspect.textqualifier | Text qualifiers, if any, in the columns or fields of the input file. |
pb.bdq.match.suspect.header | Headers used in the suspect file. Example: name, firstname, lastname, matchkey, middlename, and recordid. |
pb.bdq.match.suspect.skip.firstrow | Whether to skip the first row of the suspect file during processing. The values can be True or False, where True indicates skip. |
Candidate File | |
pb.bdq.match.candidate.inputfile.path | Path where you have placed the candidate input file on HDFS. Example: /user/hduser/sampledata/intermatch/input/Interflow_candidate.txt. |
pb.bdq.match.candidate.recordseparator | Record delimiter used in the candidate file. For example, LINUX, MACINTOSH, or WINDOWS. |
pb.bdq.match.candidate.fieldseparator | Field or column delimiter used in the input file, such as comma (,) or tab. |
pb.bdq.match.candidate.textqualifier | Text qualifiers, if any, in the columns or fields of the input file. |
pb.bdq.match.candidate.header | Headers used in the candidate file. Example: name, firstname, lastname, matchkey, middlename, and recordid. |
pb.bdq.match.candidate.skip.firstrow | Whether to skip the first row of the candidate file during processing. The values can be True or False, where True indicates skip. |
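
The suspect and candidate parameters above mirror each other, so the settings can be generated from a single template. The following Python sketch is illustrative only. It assumes the configuration file uses a plain key=value layout and that the header is written as a comma-separated list, neither of which is confirmed here. The helper names `input_settings` and `render_properties` are hypothetical, and the separator and skip values are placeholders.

```python
# Assumption: one key=value pair per line; the real file layout may differ.
def render_properties(settings):
    """Render a dict of parameters as key=value lines, one per parameter."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


def input_settings(role, path):
    """Build the per-file parameters for role 'suspect' or 'candidate'."""
    if role not in ("suspect", "candidate"):
        raise ValueError(f"unknown role: {role}")
    prefix = f"pb.bdq.match.{role}"
    return {
        f"{prefix}.inputfile.path": path,
        f"{prefix}.recordseparator": "LINUX",
        f"{prefix}.fieldseparator": ",",
        # Assumption: header columns are listed comma-separated.
        f"{prefix}.header": "name,firstname,lastname,matchkey,middlename,recordid",
        f"{prefix}.skip.firstrow": "False",
    }


base = "/user/hduser/sampledata/intermatch/input"
settings = {"pb.bdq.input.type": "TEXT"}
settings.update(input_settings("suspect", f"{base}/Interflow_Suspect.txt"))
settings.update(input_settings("candidate", f"{base}/Interflow_candidate.txt"))

text = render_properties(settings)
print(text)
```

Generating both blocks from one function keeps the suspect and candidate settings from drifting apart when one of them is edited.
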
Parameter | Description |
---|---|
pb.bdq.job.type | This is a constant value that defines the job. The value for this job is: InterMatch. |
pb.bdq.job.name | Name of the job. Default is InterMatchSample. |
pb.bdq.match.rule | JSON string that defines the match rule. It specifies details such as the match rule hierarchy, the matching method, the method to score blank data in a field, the scoring method, and the algorithm that determines whether the values in a field match. |
pb.bdq.match.groupby | Name of the column to be used for grouping records in the match queue. |
pb.bdq.reduce.count | Number of reducers to be run. Default is 1. |
pb.bdq.match.express.column | Name of the Express Match Column. If the content of this column matches between the suspect and the candidate, no further processing is needed to determine if the suspect and the candidates are duplicates. |
pb.bdq.match.keygenerator.json | JSON string that defines the match key generator rule, such as whether to use expressMatchKey, the name of the matchKeyField, and the algorithm to be used. Note: This is an optional detail. |
pb.bdq.match.unique.collectnumber.zero | A true value assigns collection number 0 to unique records. |
pb.bdq.match.inter.comparison | Inter match comparison options. Note: This is an optional detail. |
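
The express match behavior described for pb.bdq.match.express.column can be pictured as a short-circuit that runs before the full match rule. The following Python sketch is illustrative only and is not the product's implementation. The names `is_duplicate` and `rule_match` and the record dictionaries are hypothetical. Treating blank or missing express values as non-matching is an assumption.

```python
def is_duplicate(suspect, candidate, express_column, rule_match):
    """Return True if the suspect/candidate pair is considered a duplicate.

    rule_match is a hypothetical stand-in for evaluating pb.bdq.match.rule.
    """
    if express_column:
        s_val = suspect.get(express_column)
        c_val = candidate.get(express_column)
        # Assumption: blank or missing values never count as an express match.
        if s_val and s_val == c_val:
            return True  # express match: the full rule is not evaluated
    return rule_match(suspect, candidate)


calls = []


def never_match(suspect, candidate):
    # Records that the full rule was evaluated, then reports no match.
    calls.append((suspect["recordid"], candidate["recordid"]))
    return False


s = {"recordid": "1", "matchkey": "SMITH"}
c1 = {"recordid": "2", "matchkey": "SMITH"}
c2 = {"recordid": "3", "matchkey": "JONES"}

express_hit = is_duplicate(s, c1, "matchkey", never_match)
express_miss = is_duplicate(s, c2, "matchkey", never_match)
```

In this sketch the full rule is evaluated only for the second pair, which is the saving an express match column is meant to provide.
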
Specifies the MapReduce configuration parameters |
---|
Use this file to customize MapReduce parameters, such as mapreduce.map.memory.mb, mapreduce.reduce.memory.mb and mapreduce.map.speculative, as needed for your job. |
Parameter | Description |
---|---|
pb.bdq.output.type | Output file type. The values can be: file, TEXT, or ORC. |
pb.bdq.outputfile.path | The path where you want the output to be generated on HDFS. For example, /user/hduser/sampledata/intermatch/output. |
pb.bdq.outputformat.field.delimiter | Field or column delimiter in the output file, such as comma (,) or tab. |
pb.bdq.output.overwrite | If true, the output folder is overwritten every time the job is run. |
pb.bdq.outputformat.headerfile.create | Specify true if the output file needs to have a header. |
pb.bdq.job.print.counters.console | Whether the counters are printed on the console or to a file. True indicates that the counters are printed on the console. |
pb.bdq.job.counter.file.path | Path and name of the file to which the counters are to be printed. You need to specify this if the value of pb.bdq.job.print.counters.console is false. |
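
The dependency between the two counter parameters can be checked before the job is submitted. The following Python sketch is illustrative only. The function `check_counter_settings` is hypothetical, the settings are assumed to be already loaded into a dictionary of strings, and a missing pb.bdq.job.print.counters.console is treated as false, which is an assumption rather than a documented default.

```python
def check_counter_settings(settings):
    """Return a list of problems with the counter-related parameters."""
    console_value = settings.get("pb.bdq.job.print.counters.console", "")
    # Assumption: anything other than "true" (case-insensitive) means false.
    to_console = console_value.strip().lower() == "true"
    path = settings.get("pb.bdq.job.counter.file.path", "").strip()

    problems = []
    if not to_console and not path:
        problems.append(
            "pb.bdq.job.counter.file.path must be set when "
            "pb.bdq.job.print.counters.console is false"
        )
    return problems


ok_console = check_counter_settings(
    {"pb.bdq.job.print.counters.console": "true"}
)
missing_path = check_counter_settings(
    {"pb.bdq.job.print.counters.console": "false"}
)
ok_file = check_counter_settings(
    {
        "pb.bdq.job.print.counters.console": "false",
        # Hypothetical path, for illustration only.
        "pb.bdq.job.counter.file.path": "/tmp/intermatch_counters.txt",
    }
)
```
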