Big Data Quality SDK
Automated Acushare Installation
The process of installing Acushare, which is required to run Validate Address jobs, on each node in a cluster has been automated in this release. You now only need to run the script file sdkrts.bin on each node to install and start the service automatically on that node.
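As an illustration only, the per-node step might be scripted as below; the node names and installer path are hypothetical examples, not documented values.

```shell
# Hypothetical sketch: node names and the installer path are examples only.
# Run sdkrts.bin on each cluster node to install and start the Acushare service.
for node in node1 node2 node3; do
  ssh "$node" '/opt/sdk/sdkrts.bin'
done
```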
CASS Reports for Validate Address
You can now create and run the Validate Address job in CASS Certified™ mode using the Big Data Quality SDK. Additionally, you can generate these CASS reports:
- CASS Report 3553
- CASS Detailed Report
You can also generate a summary report called the Validate Address Summary Report.
Run Jobs Using Configuration Files
You can now run a Big Data Quality job from a console using a module's JAR file. Use the hadoop or spark-submit command and pass the configuration files as arguments.
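For illustration, an invocation might look like the sketch below; the JAR name, driver class, and configuration file names are hypothetical examples, not documented values.

```shell
# Hypothetical sketch: JAR, class, and file names are illustrative only.
# Submitting a MapReduce job with hadoop:
hadoop jar validateaddress.jar com.example.ValidateAddressDriver \
    inputFileConfig.xml mapReduceConfig.xml outputFileConfig.xml

# Submitting a Spark job with spark-submit:
spark-submit --class com.example.ValidateAddressDriver validateaddress.jar \
    inputFileConfig.xml sparkConfig.xml outputFileConfig.xml
```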
Configuration files must be in XML format. There are sample configuration files in:
BigDataQualityBundle\samples\configuration
The configuration files include input file properties, MapReduce and Spark configuration properties, output directory settings, and general properties for the job.
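A minimal sketch of what such an input-file configuration could contain is shown below; the element names are illustrative assumptions, so refer to the sample files in BigDataQualityBundle\samples\configuration for the actual schema.

```xml
<!-- Hypothetical sketch of an input-file configuration.
     Element names are illustrative; see the shipped samples for the real schema. -->
<inputFileConfig>
    <filePath>/user/hadoop/input/addresses.txt</filePath>
    <fieldSeparator>,</fieldSeparator>
    <textQualifier>"</textQualifier>
</inputFileConfig>
```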
New Input File Settings
Text Qualifier
The Big Data Quality SDK now allows you to specify text qualifiers in the input configuration of MapReduce and Spark jobs. A text qualifier encloses a text value so that any field separators inside the value are treated as part of the value rather than as delimiters.
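To show the effect of a text qualifier (this is a standalone illustration of the concept, not SDK API), the sketch below splits a record on commas while honoring a double-quote qualifier:

```java
import java.util.ArrayList;
import java.util.List;

// Concept demo, not SDK API: fields are split on the separator only when
// the current position is outside a qualified (quoted) region.
public class TextQualifierDemo {
    public static List<String> split(String line, char separator, char qualifier) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQualified = false;
        for (char c : line.toCharArray()) {
            if (c == qualifier) {
                inQualified = !inQualified;      // toggle qualified region; qualifier not kept
            } else if (c == separator && !inQualified) {
                fields.add(current.toString());  // separator outside qualifier ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // The comma inside the qualified value is part of the field, not a delimiter.
        System.out.println(split("\"New York, NY\",10001", ',', '"'));
        // → [New York, NY, 10001]
    }
}
```

Without the qualifier, the comma inside "New York, NY" would incorrectly split the value into two fields.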
Field Mappings
A new field in the JobPath class allows you to specify the mapping between source column names and output column names. The field takes a Map of key-value pairs in which each key is a source column name and each value is the corresponding output column name.
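A sketch of the Map that such a field expects is shown below; the column names are examples, and the setter name in the comment is a placeholder, not a confirmed JobPath API (see the SDK reference for the actual field name).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the field-mapping Map: keys are source column names, values are
// the output column names they map to. Column names are illustrative.
public class FieldMappingDemo {
    public static Map<String, String> buildMappings() {
        Map<String, String> fieldMappings = new LinkedHashMap<>();
        fieldMappings.put("AddressLine1", "OutputAddressLine1");
        fieldMappings.put("City", "OutputCity");
        fieldMappings.put("PostalCode", "OutputPostalCode");
        return fieldMappings;
    }

    public static void main(String[] args) {
        Map<String, String> mappings = buildMappings();
        // jobPath.setFieldMappings(mappings);  // placeholder setter name, not confirmed API
        System.out.println(mappings.get("City")); // → OutputCity
    }
}
```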
Field Separator for Output Files
You can now specify the field separator when defining the details of the output file for a job.
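To show what choosing a separator means for the written records (a standalone illustration, not SDK API), the sketch below serializes the same fields with two different separators:

```java
import java.util.List;

// Concept demo, not SDK API: the same output fields joined with two
// different field separators.
public class FieldSeparatorDemo {
    public static String toRecord(List<String> fields, String separator) {
        return String.join(separator, fields);
    }

    public static void main(String[] args) {
        List<String> fields = List.of("John", "350 Jordan Rd", "Troy", "NY");
        System.out.println(toRecord(fields, ","));  // → John,350 Jordan Rd,Troy,NY
        System.out.println(toRecord(fields, "|"));  // → John|350 Jordan Rd|Troy|NY
    }
}
```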
ORC File Format Support
The ORC file format is now supported for the input and output of the jobs provided in the Big Data Quality SDK. Input, output, suspect, and candidate files can be either text files or ORC files.