Configuration Files
These tables describe the parameters and values you need to specify before you run the Custom Groovy Script job.
Parameter | Description |
---|---|
pb.bdq.input.type | Input file type. The values can be: TEXT, ORC or PARQUET. |
pb.bdq.inputfile.path | The path where you have placed the input file on HDFS. For example, /user/hduser/sampledata/groovy/input/groovy_Input.csv |
textinputformat.record.delimiter | File record delimiter used in the text type input file. For example, LINUX, MACINTOSH, or WINDOWS |
pb.bdq.inputformat.field.delimiter | Field or column delimiter used in the input file, such as comma (,) or tab. |
pb.bdq.inputformat.text.qualifier | Text qualifiers, if any, in the columns or fields of the input file. |
pb.bdq.inputformat.file.header | Comma-separated list of the headers used in the input file. |
pb.bdq.inputformat.skip.firstrow | Specifies whether the first row is skipped during processing. The values can be True or False, where True means the row is skipped. |
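
As an illustration of how the input parameters above fit together, the following Python sketch collects them in a dictionary and renders them as `key=value` lines. The `key=value` syntax, the helper name `render_properties`, and the header, qualifier, and delimiter values are assumptions made for this example; only the parameter names, the allowed input types, and the sample path come from the table.

```python
# Sketch only: parameter names come from the table above; the values for the
# header, text qualifier, and delimiters are hypothetical sample values.
ALLOWED_INPUT_TYPES = {"TEXT", "ORC", "PARQUET"}

input_params = {
    "pb.bdq.input.type": "TEXT",
    "pb.bdq.inputfile.path": "/user/hduser/sampledata/groovy/input/groovy_Input.csv",
    "textinputformat.record.delimiter": "LINUX",
    "pb.bdq.inputformat.field.delimiter": ",",
    "pb.bdq.inputformat.text.qualifier": '"',
    "pb.bdq.inputformat.file.header": "Name,JoinDate",  # hypothetical headers
    "pb.bdq.inputformat.skip.firstrow": "True",
}


def render_properties(params):
    """Render a dict as key=value lines (the file syntax is an assumption)."""
    input_type = params["pb.bdq.input.type"]
    if input_type not in ALLOWED_INPUT_TYPES:
        raise ValueError(f"Unsupported input type: {input_type}")
    return "\n".join(f"{key}={value}" for key, value in params.items())


input_config_text = render_properties(input_params)
print(input_config_text)
```

Checking the input type before rendering catches a mistyped value early, before the job is submitted.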
Parameter | Description |
---|---|
pb.bdq.job.type | This is a constant value that defines the job. The value for this job is: CustomScript. |
pb.bdq.job.name | Name of the job. Default is CustomScriptSample. |
pb.bdq.dim.date.pattern | Specifies the date pattern to be used in the job as: M/d/yy. Note: This is an optional property. |
pb.bdq.dim.datetime.pattern | Specifies the date-time pattern to be used in the job as: M/d/yy h:mm a. Note: This is an optional property. |
pb.bdq.dim.time.pattern | Specifies the time pattern to be used in the job as: h:mm a. Note: This is an optional property. |
pb.bdq.dim.groovy.input.fields.0 | Specifies the input fields and their data types in the format: {"name":<"name of the field">,"type":<"field type">}. For example, |
pb.bdq.dim.groovy.output.fields.0 | Specifies the output fields and their data types in the format: {"name":<"name of the field">,"type":<"field type">}. For example, |
pb.bdq.dim.groovy.script.0 | Path of the Groovy script to be executed. For example, /home/hduser/script/groovy.txt |
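
The field format in the table above is a small JSON object, so one way to avoid quoting mistakes is to generate it rather than type it by hand. The Python sketch below does that under stated assumptions: the field names `Name` and `Greeting`, the type name `string`, and the helper `field_spec` are hypothetical, and the sketch does not cover how several fields are combined in one value, because the table does not specify it.

```python
import json


def field_spec(name, field_type):
    """Build one {"name":...,"type":...} entry in the format shown in the table."""
    return json.dumps({"name": name, "type": field_type}, separators=(",", ":"))


# Parameter names, the job type, the default job name, the patterns, and the
# script path come from the table; field names and types are hypothetical.
job_params = {
    "pb.bdq.job.type": "CustomScript",
    "pb.bdq.job.name": "CustomScriptSample",
    "pb.bdq.dim.date.pattern": "M/d/yy",
    "pb.bdq.dim.datetime.pattern": "M/d/yy h:mm a",
    "pb.bdq.dim.time.pattern": "h:mm a",
    "pb.bdq.dim.groovy.input.fields.0": field_spec("Name", "string"),
    "pb.bdq.dim.groovy.output.fields.0": field_spec("Greeting", "string"),
    "pb.bdq.dim.groovy.script.0": "/home/hduser/script/groovy.txt",
}

for key, value in job_params.items():
    print(f"{key}={value}")
```

Because the three pattern properties are optional, they can be left out of `job_params` without affecting the rest of the sketch.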
Specifies the MapReduce configuration parameters |
---|
Use this file to customize MapReduce parameters, such as mapreduce.map.memory.mb, mapreduce.reduce.memory.mb and mapreduce.map.speculative, as needed for your job. |
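
For illustration, the Python sketch below writes the three MapReduce parameters named above in the standard Hadoop `<configuration>`/`<property>` XML layout. Whether this job expects its MapReduce file in that layout is an assumption, and the memory and speculation values are hypothetical samples, not recommendations.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample values; only the parameter names come from the text above.
mapreduce_params = {
    "mapreduce.map.memory.mb": "2048",
    "mapreduce.reduce.memory.mb": "4096",
    "mapreduce.map.speculative": "false",
}


def to_hadoop_xml(params):
    """Serialize name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in params.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


mapreduce_xml = to_hadoop_xml(mapreduce_params)
print(mapreduce_xml)
```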
Parameter | Description |
---|---|
pb.bdq.output.type | Output file format. The values can be: TEXT, ORC, or PARQUET. |
pb.bdq.outputfile.path | The path where you want the output file to be generated on HDFS. For example, /user/hduser/sampledata/groovy/output |
pb.bdq.outputformat.field.delimiter | Field or column delimiter in the output file, such as comma (,) or tab. |
pb.bdq.output.overwrite | If true, the output folder is overwritten every time the job is run. |
pb.bdq.outputformat.headerfile.create | Specify true if the output file needs to have a header. |
Properties of Parquet file | |
parquet.compression | The compression algorithm used to compress pages. It is one of these: UNCOMPRESSED, SNAPPY, GZIP, or LZO. Default is UNCOMPRESSED. |
parquet.block.size | The size of a row group being buffered in memory. Larger values improve the I/O when reading but consume more memory when writing. Default size is 134217728 bytes (= 128 * 1024 * 1024) |
parquet.page.size | Pages make up a block, and a page is the smallest unit that must be read fully to access a single record. Default size is 1048576 bytes (= 1 * 1024 * 1024). Note: A very small page size degrades compression. |
parquet.dictionary.page.size | Default size is 1048576 bytes (= 1 * 1024 * 1024) |
parquet.enable.dictionary | The boolean value (True or False) to enable or disable dictionary encoding. Default is True |
parquet.validation | Default boolean value is False. |
parquet.writer.version | Specifies the version of writer. It should be PARQUET_1_0 or PARQUET_2_0. Default is PARQUET_1_0. |
parquet.writer.max-padding | Defaults to no padding (0% of the row group size). |
parquet.page.size.check.estimate | Default boolean value is True |
parquet.page.size.row.check.min | Default is 100 |
parquet.page.size.row.check.max | Default is 10000 |
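
The byte defaults in the Parquet rows above are easier to verify and adjust when written as multiples of 1 MiB. The Python sketch below recomputes them and checks the enumerated values against the lists given in the table; the helper name `check_parquet` and the choice of which properties to check are assumptions made for this example.

```python
MIB = 1024 * 1024

# Defaults as stated in the table above.
parquet_defaults = {
    "parquet.compression": "UNCOMPRESSED",
    "parquet.block.size": 128 * MIB,
    "parquet.page.size": 1 * MIB,
    "parquet.dictionary.page.size": 1 * MIB,
    "parquet.enable.dictionary": True,
    "parquet.writer.version": "PARQUET_1_0",
}

ALLOWED_COMPRESSION = {"UNCOMPRESSED", "SNAPPY", "GZIP", "LZO"}
ALLOWED_WRITER_VERSIONS = {"PARQUET_1_0", "PARQUET_2_0"}


def check_parquet(params):
    """Return a list of problems found in the given Parquet settings."""
    problems = []
    if params["parquet.compression"] not in ALLOWED_COMPRESSION:
        problems.append("unknown compression codec")
    if params["parquet.writer.version"] not in ALLOWED_WRITER_VERSIONS:
        problems.append("unknown writer version")
    if params["parquet.page.size"] > params["parquet.block.size"]:
        problems.append("page size exceeds block size")
    return problems


print(parquet_defaults["parquet.block.size"])  # 134217728
print(parquet_defaults["parquet.page.size"])   # 1048576
print(check_parquet(parquet_defaults))         # []
```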