Configuration Files
These tables describe the parameters and the values you need to specify before you run the Transactional Match job.
Parameter | Description |
---|---|
pb.bdq.input.type | Input file type. The values can be: TEXT, ORC, or PARQUET. |
pb.bdq.inputfile.path | The path where you have placed the input file on HDFS. For example, /user/hduser/sampledata/transactionalmatch/input/TransMatch_Input.txt |
textinputformat.record.delimiter | File record delimiter used in the text type input file. For example, LINUX, MACINTOSH, or WINDOWS. |
pb.bdq.inputformat.field.delimiter | Field or column delimiter used in the input file, such as comma (,) or tab. |
pb.bdq.inputformat.text.qualifier | Text qualifiers, if any, in the columns or fields of the input file. |
pb.bdq.inputformat.file.header | Headers used in the input file. Example: name, firstname, lastname, CandidateGroup, middlename, and recordid. |
pb.bdq.inputformat.skip.firstrow | Specifies whether the first row is skipped during processing. The values can be True or False, where True indicates skip. |
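Taken together, the input parameters above might appear as follows in the job's configuration file. This is an illustrative sketch only; the path, delimiter, and header values are examples, not product defaults.

```properties
# Input file type: TEXT, ORC, or PARQUET
pb.bdq.input.type=TEXT
# HDFS path of the input file (illustrative path)
pb.bdq.inputfile.path=/user/hduser/sampledata/transactionalmatch/input/TransMatch_Input.txt
# Record delimiter for TEXT input: LINUX, MACINTOSH, or WINDOWS
textinputformat.record.delimiter=LINUX
# Field (column) delimiter used in the input file
pb.bdq.inputformat.field.delimiter=,
# Text qualifier, if any (for example, double quotes)
pb.bdq.inputformat.text.qualifier="
# Header columns of the input file (example column names)
pb.bdq.inputformat.file.header=name,firstname,lastname,CandidateGroup,middlename,recordid
# Skip the header row during processing
pb.bdq.inputformat.skip.firstrow=True
```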
Parameter | Description |
---|---|
pb.bdq.job.type | This is a constant value that defines the job. The value for this job is: Transactional. |
pb.bdq.job.name | Name of the job. Default is TransactionalMatchSample. |
pb.bdq.match.rule | JSON string that defines the match rule. It specifies details such as the match rule hierarchy, the matching method, how blank field values are scored, the scoring method, and the algorithm used to determine whether values in the named field match. |
pb.bdq.match.groupby | Name of the column to be used for grouping records in the match queue. |
pb.bdq.reduce.count | Number of reducers to be run. Default is 1. |
pb.bdq.match.keygenerator.json | JSON string that indicates whether an expressMatchKey is to be created, specifies the matchKeyField, and defines the rules for generating the match key. Note: This parameter is optional. |
pb.bdq.match.unique.candidate.return | Set to true if you want unique candidate records to be included in the output. |
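A sketch of the job-level parameters above in properties form. The groupby column name is borrowed from the example header columns and is only an assumption; the schema of the pb.bdq.match.rule and pb.bdq.match.keygenerator.json JSON strings is product-specific and is deliberately not shown here.

```properties
# Constant job type for this job
pb.bdq.job.type=Transactional
# Job name (default shown)
pb.bdq.job.name=TransactionalMatchSample
# JSON match rule string; its structure is defined by the product and omitted here
# pb.bdq.match.rule={...}
# Column used to group records in the match queue (illustrative column name)
pb.bdq.match.groupby=CandidateGroup
# Number of reducers (default shown)
pb.bdq.reduce.count=1
# Include unique candidate records in the output
pb.bdq.match.unique.candidate.return=true
```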
MapReduce configuration parameters |
---|
Use this file to customize MapReduce parameters, such as mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, and mapreduce.map.speculative, as needed for your job. |
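For example, the MapReduce configuration file might tune the standard Hadoop properties named above. The values here are illustrative; appropriate settings depend on your cluster and data volume.

```properties
# Memory allotted to each map task, in MB (illustrative value)
mapreduce.map.memory.mb=2048
# Memory allotted to each reduce task, in MB (illustrative value)
mapreduce.reduce.memory.mb=4096
# Disable speculative execution of map tasks
mapreduce.map.speculative=false
```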
Parameter | Description |
---|---|
pb.bdq.output.type | Specify if the output is in: TEXT, ORC, or PARQUET format. |
outputfile.path | The path where you want the output file to be generated on HDFS. For example, /user/hduser/sampledata/transactionalmatch/output. |
pb.bdq.outputformat.field.delimiter | Field or column delimiter in the output file, such as comma (,) or tab. |
pb.bdq.output.overwrite | If set to true, the output folder is overwritten every time the job is run. |
pb.bdq.outputformat.headerfile.create | Specify true if the output file needs a header. |
pb.bdq.job.print.counters.console | Specifies whether the counters are printed on the console or to a file. True indicates counters are printed on the console. |
pb.bdq.job.counter.file.path | Path and name of the file to which the counters are printed. You need to specify this if pb.bdq.job.print.counters.console is set to false. |
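The output parameters above might be set as in this sketch. The output and counter file paths are illustrative examples, not product defaults.

```properties
# Output format: TEXT, ORC, or PARQUET
pb.bdq.output.type=TEXT
# HDFS path where the output is generated (illustrative path)
outputfile.path=/user/hduser/sampledata/transactionalmatch/output
# Field (column) delimiter for the output file
pb.bdq.outputformat.field.delimiter=,
# Overwrite the output folder on each run
pb.bdq.output.overwrite=true
# Write a header to the output file
pb.bdq.outputformat.headerfile.create=true
# Print counters to a file rather than the console
pb.bdq.job.print.counters.console=false
# Counter file path, required because the console option above is false (illustrative path)
pb.bdq.job.counter.file.path=/user/hduser/sampledata/transactionalmatch/counters.txt
```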
Properties of Parquet file | |
parquet.compression | The compression algorithm used to compress pages. One of: UNCOMPRESSED, SNAPPY, GZIP, or LZO. Default is UNCOMPRESSED. |
parquet.block.size | The size of a row group being buffered in memory. Larger values improve the I/O when reading but consume more memory when writing. Default size is 134217728 bytes (= 128 * 1024 * 1024) |
parquet.page.size | A page constitutes a block and is the smallest unit that must be read fully to access a single record. Default size is 1048576 bytes (= 1 * 1024 * 1024). Note: A very small page size results in deteriorated compression. |
parquet.dictionary.page.size | Default size is 1048576 bytes (= 1 * 1024 * 1024) |
parquet.enable.dictionary | A boolean value (True or False) that enables or disables dictionary encoding. Default is True. |
parquet.validation | Default boolean value is False. |
parquet.writer.version | Specifies the version of writer. It should be PARQUET_1_0 or PARQUET_2_0. Default is PARQUET_1_0. |
parquet.writer.max-padding | Defaults to no padding (0% of the row group size). |
parquet.page.size.check.estimate | Default boolean value is True. |
parquet.page.size.row.check.min | Default is 100 |
parquet.page.size.row.check.max | Default is 10000 |
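When PARQUET output is selected, the Parquet properties above could be configured as in this sketch, which mostly restates the documented defaults; only the compression codec is changed here, and SNAPPY is simply one of the listed options, not a recommendation.

```properties
# Page compression codec: UNCOMPRESSED, SNAPPY, GZIP, or LZO
parquet.compression=SNAPPY
# Row group size in bytes (default: 128 * 1024 * 1024)
parquet.block.size=134217728
# Page size in bytes (default: 1 * 1024 * 1024)
parquet.page.size=1048576
# Dictionary page size in bytes (default: 1 * 1024 * 1024)
parquet.dictionary.page.size=1048576
# Enable dictionary encoding (default)
parquet.enable.dictionary=true
# Writer version: PARQUET_1_0 or PARQUET_2_0 (default shown)
parquet.writer.version=PARQUET_1_0
# No row group padding (default)
parquet.writer.max-padding=0
```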