Configuration Files
These tables describe the parameters and the values you need to specify before you run the Advanced Transformer job.
Parameter | Description |
---|---|
pb.bdq.input.type | Input file type. The values can be: TEXT, ORC, or PARQUET. |
pb.bdq.inputfile.path | The path where you have placed the input file on HDFS. For example, /user/hduser/sampledata/advancedtransformer/input/AdvancedTransformer_Input.txt |
textinputformat.record.delimiter | File record delimiter used in the text type input file. For example, LINUX, MACINTOSH, or WINDOWS |
pb.bdq.inputformat.field.delimiter | Field or column delimiter used in the input file, such as comma (,) or tab. |
pb.bdq.inputformat.text.qualifier | Text qualifiers, if any, in the columns or fields of the input file. |
pb.bdq.inputformat.file.header | Comma-separated value of the headers used in the input file. |
pb.bdq.inputformat.skip.firstrow | Specifies whether the first row is skipped during processing. The values can be True or False, where True indicates skip. |
pb.bdq.inputfile.field.mapping | Maps the values of the headers used in the input file to their updated values. Note: This is an optional parameter. |
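As a sketch, the input parameters above might be set as follows in the job's configuration file. The paths and delimiters come from the examples in the table; the header names (Name, Address, City) are hypothetical placeholders:

```properties
# Input file type: TEXT, ORC, or PARQUET
pb.bdq.input.type=TEXT
# HDFS location of the input file
pb.bdq.inputfile.path=/user/hduser/sampledata/advancedtransformer/input/AdvancedTransformer_Input.txt
# Record delimiter for the text input: LINUX, MACINTOSH, or WINDOWS
textinputformat.record.delimiter=LINUX
# Column delimiter and optional text qualifier
pb.bdq.inputformat.field.delimiter=,
pb.bdq.inputformat.text.qualifier="
# Comma-separated headers (hypothetical names), and whether to skip the first row
pb.bdq.inputformat.file.header=Name,Address,City
pb.bdq.inputformat.skip.firstrow=True
```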
Parameter | Description |
---|---|
pb.bdq.job.type | A constant value that defines the job. For this job, the value is AdvTransformer. |
pb.bdq.job.name | Name of the job. Default is AdvanceTransformerSample. |
pb.bdq.dnm.advtransformer.configuration | JSON string that defines the Advanced Transformer configuration. It specifies details such as the source input field to be evaluated for scan and split, the output field where you want to put the extracted data, any special characters that you want to tokenize, and the type of extraction to be performed. |
pb.bdq.reference.data | The path where you have placed the reference data. For example, {"referenceDataPathLocation":"LocaltoDataNodes","dataDir":"/home/data/referenceData"} |
Parameter | Description |
---|---|
pb.bdq.job.type | A constant value that defines the job. For this job, the value is AdvTransformer. |
pb.bdq.job.name | Name of the job. Default is AdvanceTransformerSample. |
pb.bdq.dnm.advtransformer.configuration | JSON string that defines the Advanced Transformer configuration. It specifies details such as the source input field to be evaluated for scan and split, the output field where you want to put the extracted data, any special characters that you want to tokenize, and the type of extraction to be performed. |
pb.bdq.reference.data | Path of the reference data on HDFS and the data downloader path. For example, {"referenceDataPathLocation":"HDFS", "dataDir":"/home/data/dm/referenceData", "dataDownloader":{"dataDownloader":"HDFS", "localFSRepository":"/local/download"}} |
Parameter | Description |
---|---|
pb.bdq.job.type | A constant value that defines the job. For this job, the value is AdvTransformer. |
pb.bdq.job.name | Name of the job. Default is AdvanceTransformerSample. |
pb.bdq.dnm.advtransformer.configuration | JSON string that defines the Advanced Transformer configuration. It specifies details such as the source input field to be evaluated for scan and split, the output field where you want to put the extracted data, any special characters that you want to tokenize, and the type of extraction to be performed. |
pb.bdq.reference.data | Path of the reference data on HDFS and the type of data downloader. For example, {"referenceDataPathLocation":"HDFS", "dataDir":"/home/data/dm/referenceData", "dataDownloader":{"dataDownloader":"DC"}} |
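The three tables above differ only in the value of pb.bdq.reference.data. A side-by-side sketch of the three JSON values, copied from the examples above, may make the variants easier to compare:

```properties
# Variant 1: reference data placed locally on the data nodes
pb.bdq.reference.data={"referenceDataPathLocation":"LocaltoDataNodes","dataDir":"/home/data/referenceData"}

# Variant 2: reference data on HDFS, downloaded to a local file system repository
pb.bdq.reference.data={"referenceDataPathLocation":"HDFS","dataDir":"/home/data/dm/referenceData","dataDownloader":{"dataDownloader":"HDFS","localFSRepository":"/local/download"}}

# Variant 3: reference data on HDFS with the DC data downloader
pb.bdq.reference.data={"referenceDataPathLocation":"HDFS","dataDir":"/home/data/dm/referenceData","dataDownloader":{"dataDownloader":"DC"}}
```

Only one variant applies to a given run; choose it according to where your reference data resides.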
Specifies the MapReduce configuration parameters |
---|
Use this file to customize MapReduce parameters, such as mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, and mapreduce.map.speculative, as needed for your job. |
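As an illustrative sketch only, the MapReduce configuration file might tune the parameters named above. The values shown are common starting points, not recommendations from this guide; size them to your cluster:

```properties
# Memory per map and reduce task, in MB (illustrative values)
mapreduce.map.memory.mb=2048
mapreduce.reduce.memory.mb=4096
# Disable speculative execution of map tasks
mapreduce.map.speculative=false
```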
Parameter | Description |
---|---|
pb.bdq.output.type | Specify if the output is in: TEXT, ORC, or PARQUET format. |
pb.bdq.outputfile.path | The path where you want the output file to be generated on HDFS. For example, /user/hduser/sampledata/advancedtransformer/output |
pb.bdq.outputformat.field.delimiter | Field or column delimiter in the output file, such as comma (,) or tab. |
pb.bdq.output.overwrite | If set to true, the output folder is overwritten every time the job is run. |
pb.bdq.outputformat.headerfile.create | Specify true if the output file needs to have a header. |
Properties of Parquet file | |
parquet.compression | The compression algorithm used to compress pages. It is one of these: UNCOMPRESSED, SNAPPY, GZIP, or LZO. Default is UNCOMPRESSED. |
parquet.block.size | The size of a row group being buffered in memory. Larger values improve I/O when reading but consume more memory when writing. Default size is 134217728 bytes (= 128 * 1024 * 1024). |
parquet.page.size | A page is the smallest unit within a block that must be read fully to access a single record. Default size is 1048576 bytes (= 1 * 1024 * 1024). Note: A very small page size results in deteriorated compression. |
parquet.dictionary.page.size | The maximum size of the dictionary page per column chunk. Default size is 1048576 bytes (= 1 * 1024 * 1024). |
parquet.enable.dictionary | The boolean value (True or False) that enables or disables dictionary encoding. Default is True. |
parquet.validation | The boolean value that enables validation of records against the schema while writing. Default is False. |
parquet.writer.version | Specifies the version of the writer: PARQUET_1_0 or PARQUET_2_0. Default is PARQUET_1_0. |
parquet.writer.max-padding | The maximum padding allowed when aligning row groups. Defaults to no padding (0% of the row group size). |
parquet.page.size.check.estimate | The boolean value that enables estimating the page size during the page size check. Default is True. |
parquet.page.size.row.check.min | Minimum number of rows written between page size checks. Default is 100. |
parquet.page.size.row.check.max | Maximum number of rows written between page size checks. Default is 10000. |
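Tying the output parameters together, a minimal sketch of an output section might look like the following. The path is the example from the table, and the Parquet values restate the defaults except for the compression codec, which is switched to SNAPPY for illustration:

```properties
# Output type and HDFS output path
pb.bdq.output.type=PARQUET
pb.bdq.outputfile.path=/user/hduser/sampledata/advancedtransformer/output
# Overwrite the output folder on each run
pb.bdq.output.overwrite=true
# Parquet writer properties (defaults shown in the table, SNAPPY chosen here as an example)
parquet.compression=SNAPPY
parquet.block.size=134217728
parquet.page.size=1048576
parquet.enable.dictionary=True
parquet.writer.version=PARQUET_1_0
```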