Connecting to Hadoop

In order for Spectrum™ Technology Platform to access data in Hadoop, you must define a connection to Hadoop using Management Console. Once you do this, you can create flows in Enterprise Designer that can read data from, and write data to, Hadoop.

Attention: Spectrum™ Technology Platform does not support Hadoop 2.x for Kerberos on Windows platforms.

Open Management Console.
Go to Resources > Data Sources.
Click the Add button.
In the Name field, enter a name for the connection. The name can be anything you choose.

Note: Once you save a connection you cannot change the name.
In the Type field, choose HDFS.
In the Host field, enter the hostname or IP address of the NameNode in the HDFS cluster.
In the Port field, enter the network port number.
In User, select one of:

Server user

Choose this option if authentication is enabled in your HDFS cluster. This option will use the user credentials that the Spectrum™ Technology Platform server runs under to authenticate to HDFS.

User name

Choose this option if authentication is disabled in your HDFS cluster.
Select the Kerberos check box to enable Kerberos authentication for this HDFS file server connection.
If you enabled Kerberos authentication, enter the path to the keytab file in the Keytab file path field.

Note: Ensure the keytab file is placed on the Spectrum™ Technology Platform server.
In the Protocol field, select one of:

WEBHDFS

Select this option if the HDFS cluster is running HDFS 1.0 or later. This protocol supports both read and write operations.

HFTP

Select this option if the HDFS cluster is running a version older than HDFS 1.0, or if your organization does not allow the WEBHDFS protocol. This protocol only supports the read operation.

HAR

Select this option to access Hadoop archive files. If you choose this option, specify the path to the archive file in the Path field. This protocol only supports the read operation.
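
For orientation, Hadoop's WebHDFS REST API addresses files through URLs of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<operation>`. The sketch below only builds such a URL so you can see how the Host and Port values fit together. The port 50070, the file path, and the user name are placeholder assumptions, and the helper `webhdfs_url` is hypothetical; it does not describe how Spectrum™ Technology Platform issues requests internally.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, user=None):
    """Build a WebHDFS REST URL for an HDFS path and operation."""
    query = {"op": op}
    if user is not None:
        # user.name is the WebHDFS query parameter used when security is off
        query["user.name"] = user
    # quote() leaves "/" untouched, so the HDFS directory structure is preserved
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"


# Placeholder values: adjust host, port, path, and user to your cluster
url = webhdfs_url("152.144.226.224", 50070, "/user/hduser/data.csv",
                  "GETFILESTATUS", user="hduser")
print(url)
```

Opening a URL like this from the Spectrum™ Technology Platform server can help confirm that the NameNode is reachable over WEBHDFS before you define the connection.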
Expand the Advanced options.
If you selected the WEBHDFS protocol, you can specify these advanced options as required:

Replication factor

Specifies how many data nodes to replicate each block to. For example, the default setting of 3 replicates each block to three different nodes in the cluster. The maximum replication factor is 1024.

Block size

Specifies the size of each block. HDFS breaks up a file into blocks of the size you specify here. For example, if you specify the default 64 MB, each file is broken up into 64 MB blocks. Each block is then replicated to the number of nodes in the cluster specified in the Replication factor field.
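
As a rough illustration of how Block size and Replication factor interact, the sketch below computes how many blocks a file is split into and how many block replicas the cluster ends up holding. The 200 MB file size is an arbitrary example, and the function name `block_layout` is hypothetical.

```python
import math

MB = 1024 * 1024


def block_layout(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (block count, total block replicas, raw bytes stored in the cluster)."""
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    replicas = blocks * replication
    # The final block only occupies as much space as the data it holds,
    # so raw storage is the file size multiplied by the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


blocks, replicas, raw = block_layout(200 * MB)
print(blocks, replicas, raw // MB)  # 4 12 600
```

With the defaults described above, a 200 MB file is split into four blocks (three full 64 MB blocks and one 8 MB block), and each block is stored on three nodes.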

File permissions

Specifies the level of access to files written to the HDFS cluster by Spectrum™ Technology Platform. You can specify read and write permissions for each of these options:
Note: The Execute permission is not applicable to Spectrum™ Technology Platform.

User

This is the user specified above, either Server user or the user specified in the User name field.

Group

This refers to any group of which the user is a member. For example, if the user is john123, then Group permissions apply to any group of which john123 is a member.

Other

This refers to any other users as well as groups of which the specified user is not a member.
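
If you are used to the octal notation shown by Hadoop command-line tools, the read and write choices for User, Group, and Other map onto it as sketched below. The helper `hdfs_mode` is hypothetical and only illustrates the mapping; the Execute bit is never set, in line with the note above.

```python
def hdfs_mode(user="rw", group="r", other=""):
    """Convert read/write flags per class into an octal mode string such as '640'."""

    def digit(flags):
        unknown = set(flags) - {"r", "w"}
        if unknown:
            raise ValueError(f"unsupported permission flags: {sorted(unknown)}")
        # read = 4, write = 2; execute (1) is deliberately never set
        return (4 if "r" in flags else 0) + (2 if "w" in flags else 0)

    return f"{digit(user)}{digit(group)}{digit(other)}"


print(hdfs_mode("rw", "r", ""))    # 640
print(hdfs_mode("rw", "rw", "r"))  # 664
```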

In the grid below the File permissions table, specify the server properties for Hadoop to ensure that the sorting and filtering features work as desired when the connection is used in a stage or activity. To add properties, do one of these:

Enter the properties and their respective values directly in the Property and Value fields.
Upload your configuration XML file. The XML file should be similar to hdfs-site.xml, yarn-site.xml, or core-site.xml.
Note: The configuration file needs to be placed on the server.
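
Hadoop site files such as core-site.xml use a `<configuration>` root element containing `<property>` entries, each with a `<name>` and a `<value>`. As a minimal sketch, assuming that layout is what the upload expects, the script below generates such a file body from a dictionary. The function name `build_hadoop_config` is hypothetical, and the two properties are sample values taken from the examples in this topic.

```python
import xml.etree.ElementTree as ET


def build_hadoop_config(properties):
    """Return a Hadoop-style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


# Sample values only: replace them with the addresses of your own cluster
xml_text = build_hadoop_config({
    "fs.defaultFS": "hdfs://152.144.226.224:9000",
    "mapreduce.jobhistory.address": "152.144.226.224:10020",
})
print(xml_text)
```

Save the output to a file on the server and upload it, or use it as a checklist when entering the properties by hand.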

This table describes the properties and their values, depending on the stage or activity that will use the Hadoop connection. The properties are also dependent on the Hadoop version used (Hadoop 1.x or Hadoop 2.x).


Stage or Activity using the HDFS Connection

Required Server Properties

Stage Read from Sequence File
Activity Run Hadoop Pig

Hadoop 1.x Parameters

fs.default.name: Specifies the node and port on which Hadoop runs.
For example, hdfs://152.144.226.224:9000
mapred.job.tracker: Specifies the hostname or IP address, and port on which the MapReduce job tracker runs. If the host name is entered as local, then jobs are run as a single map and reduce task.
For example, 152.144.226.224:9001
dfs.namenode.name.dir: Specifies where on the local file system a DFS name node should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
For example, file:/home/hduser/Data/namenode
dfs.datanode.data.dir: Specifies where on the local file system a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all the named directories that are usually on different devices. Directories that do not exist are ignored.
For example, file:/home/hduser/Data/datanode
hadoop.tmp.dir: Specifies the base location for other temporary directories.
For example, /home/hduser/Data/tmp

Hadoop 2.x Parameters

fs.defaultFS: Specifies the node and port on which Hadoop runs.
For example, hdfs://152.144.226.224:9000.

Note: For Spectrum versions 11.0 and earlier, the parameter name fs.defaultfs must be used. Note the case difference.
For versions 11 SP1 and onwards, both the names fs.defaultfs and fs.defaultFS are valid. We recommend using the parameter name fs.defaultFS from Spectrum™ Technology Platform 11 SP1 onwards.
yarn.resourcemanager.resource-tracker.address: Specifies the hostname or IP address of the Resource Manager.
For example, 152.144.226.224:8025
yarn.resourcemanager.scheduler.address: Specifies the address of the Scheduler Interface.
For example, 152.144.226.224:8030
yarn.resourcemanager.address: Specifies the address of the Applications Manager interface that is contained in the Resource Manager.
For example, 152.144.226.224:8041
mapreduce.jobhistory.address: Specifies the host name or IP address, and port on which the MapReduce Job History Server is running.
For example, 152.144.226.224:10020
mapreduce.application.classpath: Specifies the CLASSPATH for Map Reduce applications. This CLASSPATH denotes the location where classes related to Map Reduce applications are found.
Note: The entries should be comma-separated.

For example:
$HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*

mapreduce.app-submission.cross-platform: Handles various platform issues that arise if your Spectrum server runs on a Windows machine and you install Cloudera on it. If your Spectrum server and Cloudera run on different operating systems, set this parameter to true. Otherwise, set it to false.
Note: Cloudera does not support Windows clients. Configuring this parameter is a workaround and not a solution to all resulting platform issues.

If you selected the Kerberos check box above, also add these Kerberos configuration properties:

hadoop.security.authentication: The type of authentication security being used. Enter the value kerberos.
yarn.resourcemanager.principal: The Kerberos principal being used for the resource manager for your Hadoop YARN resource negotiator.
For example, yarn/_HOST@HADOOP.COM
dfs.namenode.kerberos.principal: The Kerberos principal being used for the namenode of your Hadoop Distributed File System (HDFS).
For example, hdfs/_HOST@HADOOP.COM
dfs.datanode.kerberos.principal: The Kerberos principal being used for the datanode of your Hadoop Distributed File System (HDFS).
For example, hdfs/_HOST@HADOOP.COM
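
In Hadoop, the `_HOST` token in a principal is a placeholder that each service replaces with its own fully qualified host name at run time, which is why the same value can be used for every node. The sketch below imitates that substitution so you can see what the example values resolve to. The function `expand_principal` and the host name `NameNode01.example.com` are hypothetical.

```python
import socket


def expand_principal(principal, hostname=None):
    """Replace the _HOST placeholder in a 'service/_HOST@REALM' principal."""
    if hostname is None:
        hostname = socket.getfqdn()
    service, slash, rest = principal.partition("/")
    if not slash:
        return principal  # no instance component, nothing to expand
    instance, at, realm = rest.partition("@")
    if instance != "_HOST":
        return principal  # an explicit host name was given
    # Host names are lowercased when the placeholder is substituted
    return f"{service}/{hostname.lower()}{at}{realm}"


print(expand_principal("hdfs/_HOST@HADOOP.COM", "NameNode01.example.com"))
# hdfs/namenode01.example.com@HADOOP.COM
```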

Stage Read from File
Stage Write to File
Stage Read from Hive ORC File
Stage Write to Hive ORC File

Hadoop 1.x Parameters

fs.default.name: Specifies the node and port on which Hadoop runs.
For example, hdfs://152.144.226.224:9000

Hadoop 2.x Parameters

fs.defaultFS: Specifies the node and port on which Hadoop runs.
For example, hdfs://152.144.226.224:9000.

Note: For Spectrum versions 11.0 and earlier, the parameter name fs.defaultfs must be used. Note the case difference.
For versions 11 SP1 and onwards, both the names fs.defaultfs and fs.defaultFS are valid. We recommend using the parameter name fs.defaultFS from Spectrum™ Technology Platform 11 SP1 onwards.
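
Before testing the connection, it can help to check your property list against the table above. The sketch below does this for the Hadoop 2.x properties listed for Read from Sequence File and Run Hadoop Pig. It assumes that every property in that list is mandatory, and it compares names case-insensitively because both fs.defaultfs and fs.defaultFS are accepted from version 11 SP1 onwards. The function `missing_properties` is hypothetical.

```python
HADOOP2_REQUIRED = [
    "fs.defaultFS",
    "yarn.resourcemanager.resource-tracker.address",
    "yarn.resourcemanager.scheduler.address",
    "yarn.resourcemanager.address",
    "mapreduce.jobhistory.address",
    "mapreduce.application.classpath",
    "mapreduce.app-submission.cross-platform",
]

KERBEROS_REQUIRED = [
    "hadoop.security.authentication",
    "yarn.resourcemanager.principal",
    "dfs.namenode.kerberos.principal",
    "dfs.datanode.kerberos.principal",
]


def missing_properties(props, kerberos=False):
    """Return the required property names that are absent from props."""
    required = HADOOP2_REQUIRED + (KERBEROS_REQUIRED if kerberos else [])
    # Case-insensitive match, so fs.defaultfs satisfies fs.defaultFS
    present = {name.lower() for name in props}
    return [name for name in required if name.lower() not in present]


# Sample input: only two of the listed properties have been entered so far
props = {
    "fs.defaultfs": "hdfs://152.144.226.224:9000",
    "yarn.resourcemanager.address": "152.144.226.224:8041",
}
print(missing_properties(props, kerberos=True))
```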

To test the connection, click Test.
Click Save.

After you have defined a connection to an HDFS cluster, it becomes available in source and sink stages in Enterprise Designer, such as Read from File and Write to File. You can select the HDFS cluster when you click Remote Machine when defining a file in a source or sink stage.