Configuration Files
These tables describe the parameters and the values you need to specify before you run the Open Name Parser job.
Parameter | Description |
---|---|
pb.bdq.input.type | Input file type. The value can be one of: TEXT, ORC, or PARQUET. |
pb.bdq.inputfile.path | The path where you have placed the input file on HDFS. For example, /user/hduser/sampledata/opennameparser/input/OpenNameParser_Input.csv |
textinputformat.record.delimiter | Record delimiter used in the text-type input file. For example, LINUX, MACINTOSH, or WINDOWS. |
pb.bdq.inputformat.field.delimiter | Field or column delimiter used in the input file, such as comma (,) or tab. |
pb.bdq.inputformat.text.qualifier | Text qualifiers, if any, in the columns or fields of the input file. |
pb.bdq.inputformat.file.header | Comma-separated list of the headers used in the input file. |
pb.bdq.inputformat.skip.firstrow | Specifies whether the first row is skipped during processing. The value can be True or False, where True indicates skip. |
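Taken together, the input parameters above might look like the following fragment of a job configuration file. All values shown (delimiter, qualifier, header names) are illustrative, not defaults; the header column names in particular are hypothetical.

```properties
# Illustrative input-file settings for the Open Name Parser job
# (values are examples only; header names are hypothetical)
pb.bdq.input.type=TEXT
pb.bdq.inputfile.path=/user/hduser/sampledata/opennameparser/input/OpenNameParser_Input.csv
textinputformat.record.delimiter=LINUX
pb.bdq.inputformat.field.delimiter=,
pb.bdq.inputformat.text.qualifier="
pb.bdq.inputformat.file.header=Name,AddressLine1,City
pb.bdq.inputformat.skip.firstrow=True
```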
Parameter | Description |
---|---|
pb.bdq.job.type | A constant value that identifies the job. For this job, the value is OpenNameParser. |
pb.bdq.job.name | Name of the job. Default is OpenNameParserSample. |
pb.dq.unm.opennameparser.configuration | JSON string that defines the Open Name Parser configuration, such as the types of names to be parsed. |
pb.bdq.reference.data | The path where you have placed the reference data locally on the data nodes. For example, {"referenceDataPathLocation":"LocaltoDataNodes", "dataDir":"/home/data/referenceData"} |
Parameter | Description |
---|---|
pb.bdq.job.type | A constant value that identifies the job. For this job, the value is OpenNameParser. |
pb.bdq.job.name | Name of the job. Default is OpenNameParserSample. |
pb.dq.unm.opennameparser.configuration | JSON string that defines the Open Name Parser configuration, such as the types of names to be parsed. |
pb.bdq.reference.data | HDFS path of the reference data, plus the data downloader type and the local file-system repository it downloads to. For example, {"referenceDataPathLocation":"HDFS", "dataDir":"/home/data/dm/referenceData", "dataDownloader":{"dataDownloader":"HDFS", "localFSRepository":"/local/download"}} |
Parameter | Description |
---|---|
pb.bdq.job.type | A constant value that identifies the job. For this job, the value is OpenNameParser. |
pb.bdq.job.name | Name of the job. Default is OpenNameParserSample. |
pb.dq.unm.opennameparser.configuration | JSON string that defines the Open Name Parser configuration, such as the types of names to be parsed. |
pb.bdq.reference.data | Path of the reference data on HDFS and the type of data downloader. For example, {"referenceDataPathLocation":"HDFS", "dataDir":"/home/data/dm/referenceData", "dataDownloader":{"dataDownloader":"DC"}} |
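The three pb.bdq.reference.data examples above differ only in where the reference data lives and how it is downloaded. As a sanity check, the JSON strings can be assembled programmatically rather than hand-typed; this Python sketch (not part of the product) builds each variant with json.dumps so that quoting and nesting stay valid.

```python
import json

def reference_data_config(location, data_dir, downloader=None, local_repo=None):
    """Build the JSON string for pb.bdq.reference.data.

    location: "LocaltoDataNodes" or "HDFS".
    downloader / local_repo: used only for HDFS-resident reference data.
    """
    config = {"referenceDataPathLocation": location, "dataDir": data_dir}
    if downloader is not None:
        dl = {"dataDownloader": downloader}
        if local_repo is not None:
            dl["localFSRepository"] = local_repo
        config["dataDownloader"] = dl
    return json.dumps(config)

# Variant 1: reference data local to each data node
local_cfg = reference_data_config("LocaltoDataNodes", "/home/data/referenceData")

# Variant 2: reference data on HDFS, downloaded to a local repository
hdfs_cfg = reference_data_config("HDFS", "/home/data/dm/referenceData",
                                 downloader="HDFS", local_repo="/local/download")

# Variant 3: reference data on HDFS with the DC downloader
dc_cfg = reference_data_config("HDFS", "/home/data/dm/referenceData",
                               downloader="DC")
```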
Specifies the MapReduce configuration parameters |
---|
Use this file to customize MapReduce parameters, such as mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, and mapreduce.map.speculative, as needed for your job. |
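For example, a tuning fragment in this file might raise mapper and reducer memory and disable speculative map tasks. The values below are illustrative, not recommendations; appropriate settings depend on your cluster and data volume.

```properties
# Illustrative MapReduce tuning values, not recommendations
mapreduce.map.memory.mb=4096
mapreduce.reduce.memory.mb=4096
mapreduce.map.speculative=false
```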
Parameter | Description |
---|---|
pb.bdq.output.type | Specifies whether the output is in TEXT, ORC, or PARQUET format. |
pb.bdq.outputfile.path | The path where you want the output file to be generated on HDFS. For example, /user/hduser/sampledata/opennameparser/output. |
pb.bdq.outputformat.field.delimiter | Field or column delimiter in the output file, such as comma (,) or tab. |
pb.bdq.output.overwrite | If true, the output folder is overwritten every time the job is run. |
pb.bdq.outputformat.headerfile.create | Specify true if the output file must include a header. |
pb.bdq.job.print.counters.console | Specifies whether the counters are printed on the console or to a file. True indicates that the counters are printed on the console. |
pb.bdq.job.counter.file.path | Path and name of the file to which the counters are printed. Required when the value of pb.bdq.job.print.counters.console is false. |
Properties of the Parquet file | |
parquet.compression | The compression algorithm used to compress pages. It is one of: UNCOMPRESSED, SNAPPY, GZIP, or LZO. Default is UNCOMPRESSED. |
parquet.block.size | The size of a row group buffered in memory. Larger values improve I/O when reading but consume more memory when writing. Default is 134217728 bytes (128 * 1024 * 1024). |
parquet.page.size | Pages make up a block; a page is the smallest unit that must be read fully to access a single record. Default size is 1048576 bytes (1 * 1024 * 1024). Note: A very small page size deteriorates compression. |
parquet.dictionary.page.size | The dictionary page size. Default is 1048576 bytes (1 * 1024 * 1024). |
parquet.enable.dictionary | The boolean value (True or False) that enables or disables dictionary encoding. Default is True. |
parquet.validation | Default boolean value is False. |
parquet.writer.version | Specifies the writer version: PARQUET_1_0 or PARQUET_2_0. Default is PARQUET_1_0. |
parquet.writer.max-padding | Defaults to no padding (0% of the row group size). |
parquet.page.size.check.estimate | Default boolean value is True. |
parquet.page.size.row.check.min | Default is 100. |
parquet.page.size.row.check.max | Default is 10000. |
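As an illustration, an output configuration that writes Snappy-compressed Parquet might combine the output and Parquet parameters above as follows. Apart from parquet.block.size and parquet.writer.version, which use the documented defaults, the values are examples, not defaults.

```properties
# Illustrative output settings for Snappy-compressed Parquet output
pb.bdq.output.type=PARQUET
pb.bdq.outputfile.path=/user/hduser/sampledata/opennameparser/output
pb.bdq.output.overwrite=true
parquet.compression=SNAPPY
parquet.writer.version=PARQUET_1_0
parquet.block.size=134217728
```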