Run Hadoop MapReduce Job

The Run Hadoop MapReduce Job activity allows you to run any MapReduce job on a Hadoop cluster by pointing it to the relevant JAR file. You can use this activity to run a MapReduce job from the Spectrum™ Data Quality for Big Data SDK or any external MapReduce job.

Note: If the MapReduce job fails, an error message is displayed along with the status of the job run.
Hadoop server: The list of configured Hadoop servers.

For information about mapping HDFS file servers through Management Console, see the Administration Guide.

Jar path: The path of the relevant JAR file for the Hadoop MapReduce job to run.
Note: The JAR file must be present at the external client location or on the Spectrum server. It must not be placed on the Hadoop cluster.
Driver class: Select one of:
  • Default: To run an external job by simply entering the job's class name and its arguments, select Default. On selecting Default, the Class name and Arguments fields are displayed.
  • Configure: To enter additional job properties for any external job, or to run one of the Spectrum Big Data Quality SDK jobs, select Configure. On selecting Configure, the Job type field is displayed.

Job type: Select one of:
  • Spectrum: To run one of the Spectrum Big Data Quality SDK jobs, select Spectrum. On selecting Spectrum, the Spectrum jobs field is displayed.
  • Generic: To specify additional properties for any external job, select Generic.
Spectrum jobs: Select the required job from the list of Spectrum Big Data Quality SDK jobs. The list includes these jobs:
  • Address Validation
  • Advanced Transformer
  • Best of Breed
  • Duplicate Synchronization
  • Filter
  • Groovy
  • Interflow Match
  • Intraflow Match
  • Joiner
  • Match Key Generator
  • Open Name Parser
  • Open Parser
  • Table Lookup
  • Transactional Match
  • Validate Address
  • Validate Address Global
On selecting the desired Spectrum job:
  1. The Job name, Class name, and Arguments fields are auto-populated.

    All the auto-populated fields except Class name can be edited as required.

    Important: For the selected Spectrum job, the auto-populated Class name must not be edited; otherwise the job cannot run.
  2. The Properties grid is auto-populated with the required configuration properties of the selected Spectrum job, set to their default values.

    You can add or import more properties, and modify the auto-populated properties, as required.

Class name: The fully qualified name of the driver class of the job.
Arguments: The space-separated list of arguments passed to the driver class at runtime to run the job. A minimal driver sketch follows the examples below.

For example,

23Dec2016 /home/Hadoop/EYInc.txt
  1. Variables defined to accept runtime values, either in the source stage or in the current stage of the process flow, can be passed as arguments in the argument list.

    For example, if the previous stage of the process flow defines the variable SalesStartRange in its output, you can include it in this argument list as ${SalesStartRange} along with the other required arguments, as illustrated:

    23Dec2016 /home/Hadoop/EYInc.txt ${SalesStartRange}
  2. If an argument contains a space, enclose the argument in double quotes.

    For example, "/home/Hadoop/Sales Records".

Spectrum Big Data Quality SDK Jobs - Arguments:

To run the Spectrum Big Data Quality SDK MapReduce jobs, pass the various configuration files as a list of arguments. Each argument key accepts the path of a single configuration property file, where each file contains multiple configuration properties.

The syntax of the argument list for configuration properties is:

[-config <Path to configuration file>] [-debug] [-input <Path to input configuration file>] [-conf <Path to MapReduce configuration file>] [-output <Path of output directory>]

For example, for a MapReduce MatchKeyGenerator job:

-config /home/hadoop/matchkey/mkgConfig.xml -input /home/hadoop/matchkey/inputFileConfig.xml -conf /home/hadoop/matchkey/mapReduceConfig.xml -output /home/hadoop/matchkey/outputFileConfig.xml
Note: If the same configuration property key is specified both in the Arguments field and in the Properties grid, but each points to a different configuration file, the file specified in the Properties grid takes precedence.

The sample configuration properties are shipped with the Data and Address Quality for Big Data SDK and are placed at the location <Big Data Quality bundle>\samples\configuration.

General Tab

Job name: The name of the Hadoop MapReduce job. (Required)
Input path: The path of the input file for the job. (Required)
Output path: The path of the output file for the job. (Required)
Overwrite output: Indicates whether the specified output path must be overwritten if it already exists. (Optional)
Note: If this check box is left unchecked and the configured output path already exists at runtime, Hadoop throws an exception and the process flow is aborted.
Mapper class: The fully qualified name of the class that handles the Mapper functionality for the job. (Required)
Reducer class: The fully qualified name of the class that handles the Reducer functionality for the job. (Optional)
Combiner class: The fully qualified name of the class that handles the Combiner functionality for the job. (Optional)
Partitioner class: The fully qualified name of the class that handles the Partitioner functionality for the job. (Optional)
Number of reducers: The number of reducers used to run the MapReduce job. (Optional)
Input format: The format of the input data. (Required)
Output format: The format of the output data. (Required)
Output key class: The datatype of the keys in the output key-value pairs. (Required)
Output value class: The datatype of the values in the output key-value pairs. (Required)
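
These fields correspond one-to-one to settings on the standard Hadoop Job API. The following is a minimal sketch of an equivalent driver-side configuration; the paths, the job name, and the use of the identity Mapper and Reducer are illustrative assumptions, not what the activity generates:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class GeneralTabEquivalent {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "SalesAggregation");      // Job name
        FileInputFormat.addInputPath(job, new Path("/data/in"));  // Input path
        Path out = new Path("/data/out");                         // Output path
        // "Overwrite output" behaves like the delete below: without it,
        // Hadoop throws an exception if the output path already exists.
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
        FileOutputFormat.setOutputPath(job, out);
        job.setMapperClass(Mapper.class);               // Mapper class (identity placeholder)
        job.setReducerClass(Reducer.class);             // Reducer class (identity placeholder)
        job.setCombinerClass(Reducer.class);            // Combiner class (identity placeholder)
        job.setPartitionerClass(HashPartitioner.class); // Partitioner class
        job.setNumReduceTasks(4);                       // Number of reducers
        job.setInputFormatClass(TextInputFormat.class);     // Input format
        job.setOutputFormatClass(TextOutputFormat.class);   // Output format
        job.setOutputKeyClass(LongWritable.class);      // Output key class
        job.setOutputValueClass(Text.class);            // Output value class
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}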

Properties Tab

To specify additional properties for the job, use this tab to define as many property-value pairs as required. You can add properties directly in the grid, one at a time.

Alternatively, to import properties from a file, click Import, go to the location of the property file, and select it. The properties contained in the imported file are copied into the grid. The property file must be in XML format and must follow this syntax:
<configuration>
    <property>
        <name>key</name>
        <value>some_value</value>
        <description>A brief description of the 
            purpose of the property key.</description>
    </property>
</configuration>

You can directly import the Hadoop property file mapred.xml, or create your own files using this XML format.
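
Because this is the standard Hadoop configuration file format, a quick way to sanity-check a file before importing it is to load it with Hadoop's own Configuration class. A minimal sketch, where jobProperties.xml is a hypothetical file following the syntax above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class PropertyFileCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Load the property file; each <name>/<value> pair becomes a property.
        conf.addResource(new Path("/home/hadoop/jobProperties.xml"));
        // For the sample file above, this prints "some_value".
        System.out.println(conf.get("key"));
    }
}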

Note:
  1. If the same property is defined both here and in Management Console, the values defined here override the ones defined in Management Console.
  2. If the same property exists both in the grid and in the imported property file, the value imported from the file overwrites the value in the grid.
  3. You can import multiple property files one after the other, if required. The properties included in each imported file are added to the grid.
  4. Ensure the property file is present on the Spectrum™ Technology Platform server itself.
  5. The <description> tag is optional for each property key in a configuration property file.
  6. The property pb.bdq.reference.data.location specifies the location of reference data, which must be placed local to the data nodes to run the relevant jobs. This property applies only to jobs that use reference data, such as Advanced Transformer, Validate Address Global, and Validate Address.

Dependencies Tab

In this tab, add the list of input files and JAR files required to run the job.

When the job runs, the reference files and reference JAR files added here are available in the job's distributed cache.
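
Inside the job, tasks can then locate these cached files through the standard MapReduce context. A minimal sketch, assuming a hypothetical mapper that inspects the cache during setup:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper showing how files added on the Dependencies tab
// can be discovered from a task's context.
public class ReferenceAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] cacheFiles = context.getCacheFiles();  // the distributed-cache entries
        if (cacheFiles != null) {
            for (URI uri : cacheFiles) {
                // Open each reference file as needed, e.g. via FileSystem APIs.
                System.out.println("Cached reference file: " + uri);
            }
        }
    }
}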

Reference Files
To add the various files required as input to run the job, click Add, go to the respective location on your local system or cluster, and select the particular file.

To remove any file added in the list, select the particular file and click Remove.

Reference Jars
To add the JAR files required to run the job, click Add, go to the respective location on your local system or cluster, and select the JAR file.

To remove any file added in the list, select the particular file and click Remove.