Big Data Quality SDK

Automated Acushare Installation

The process for installing Acushare on each node in a cluster, which is required to run Validate Address jobs, has been automated in this release. You now only need to run the script file sdkrts.bin on each node to install and start the service automatically on that node.

CASS Reports for Validate Address

You can now create and run the Validate Address job in CASS Certified™ mode using the Big Data Quality SDK. Additionally, you can generate these CASS reports:

  • CASS Report 3553
  • CASS Detailed Report

You can also generate a summary report called the Validate Address Summary Report.

Run Jobs Using Configuration Files

You can now run a Big Data Quality job from a console using a module's JAR file. Use the hadoop or spark-submit command and pass the configuration files as arguments.

Configuration files must be in XML format. Sample configuration files are available in:

BigDataQualityBundle\samples\configuration

The configuration files include input file properties, MapReduce and Spark configuration properties, output directory settings, and general properties for the job.

New Input File Settings

Text Qualifier

The Big Data Quality SDK now allows you to specify text qualifiers in the input configuration of MapReduce and Spark jobs. Text qualifiers identify text values in the input.
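
To illustrate what a text qualifier does, the following is a minimal sketch in plain Java, not the SDK's own parser. It assumes a comma as the field separator and a double quote as the text qualifier; the class name TextQualifierSketch and the method split are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class TextQualifierSketch {

    // Splits one delimited line. A separator that appears between two
    // qualifier characters is kept as part of the field value.
    public static List<String> split(String line, char separator, char qualifier) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQualifier = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == qualifier) {
                // Entering or leaving qualified text; the qualifier itself is dropped.
                insideQualifier = !insideQualifier;
            } else if (c == separator && !insideQualifier) {
                // An unqualified separator ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical input record: the address contains the separator character.
        String line = "1001,\"Main Street, Suite 4\",Troy";
        for (String field : split(line, ',', '"')) {
            System.out.println(field);
        }
    }
}
```

Without the qualifier, the address in this record would be split into two fields at the embedded comma.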

Field Mappings

A new field in the JobPath class allows you to specify the mapping between source column names and output column names. The field takes a Map of key-value pairs to map source column names to their corresponding output column names.
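
As a rough sketch of the kind of Map such a field takes, the example below builds a mapping from source column names to output column names using only standard Java collections. It does not call the SDK: the class FieldMappingSketch, its methods, and the column names are hypothetical, and keeping the source name for an unmapped column is an assumption made for this example rather than documented SDK behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldMappingSketch {

    // Builds a sample mapping: key = source column name, value = output column name.
    public static Map<String, String> sampleMappings() {
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("addr_line_1", "AddressLine1"); // hypothetical column names
        mappings.put("city_name", "City");
        mappings.put("zip", "PostalCode");
        return mappings;
    }

    // Looks up the output name for a source column. Falling back to the
    // source name for unmapped columns is an assumption of this sketch.
    public static String outputName(Map<String, String> mappings, String sourceColumn) {
        return mappings.getOrDefault(sourceColumn, sourceColumn);
    }

    public static void main(String[] args) {
        Map<String, String> mappings = sampleMappings();
        System.out.println(outputName(mappings, "zip"));
        System.out.println(outputName(mappings, "country"));
    }
}
```

A map built this way is what you would pass to the mapping field; refer to the JobPath class documentation for the actual field name and how it is set.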

Field Separator for Output Files

You can now specify the field separator when defining the details of the output file for a job.

ORC File Format Support

The ORC file format is now supported for the input and output of the jobs provided in the Big Data Quality SDK. For input, output, suspect, and candidate files, you can use either text files or ORC files.

Note: When using Interflow Match, the suspect and candidate files must be of the same format. Either both must be ORC files, or both must be text files.
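
If you assemble job settings in your own code, a small pre-flight check can catch a format mismatch before the job is submitted. The following is a sketch only: the class MatchFormatCheck, the enum FileFormat, and the method requireSameFormat are hypothetical helpers written for this example and are not part of the SDK.

```java
public class MatchFormatCheck {

    // The two file formats described above.
    public enum FileFormat { TEXT, ORC }

    // Rejects a suspect/candidate pair whose formats differ.
    public static void requireSameFormat(FileFormat suspect, FileFormat candidate) {
        if (suspect != candidate) {
            throw new IllegalArgumentException(
                "Suspect file format (" + suspect + ") and candidate file format ("
                + candidate + ") must be the same.");
        }
    }

    public static void main(String[] args) {
        // Passes: both files are ORC.
        requireSameFormat(FileFormat.ORC, FileFormat.ORC);

        // Fails: one text file and one ORC file.
        try {
            requireSameFormat(FileFormat.TEXT, FileFormat.ORC);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```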