Connecting to Hadoop
In order for Spectrum™ Technology Platform to access data in Hadoop, you must define a connection to Hadoop using Management Console. Once you do this, you can create flows in Enterprise Designer that can read data from, and write data to, Hadoop.
- Open Management Console.
- Go to Resources > Data Sources.
- Click the Add button.
- In the Name field, enter a name for the connection. The name can be anything you choose.
Note: Once you save a connection you cannot change the name.
- In the Type field, choose HDFS.
- In the Host field, enter the hostname or IP address of the NameNode in the HDFS cluster.
- In the Port field, enter the network port number.
- In the User field, select the method for authenticating to HDFS:
- Server user
- Select this option if authentication is enabled in your HDFS cluster. This option will use the user credentials that the Spectrum™ Technology Platform server runs under to authenticate to HDFS.
- User name
- Select this option if authentication is disabled in your HDFS cluster.
- Check Kerberos if you want to enable Kerberos authentication for this HDFS file server connection.
- If you enabled Kerberos authentication, enter the path of the keytab file in the Keytab file path field.
Note: The keytab file must be on the Spectrum™ Technology Platform server.
- In the Protocol field, select the method of communication with HDFS:
- WEBHDFS
- Select this option if the HDFS cluster is running HDFS 1.0 or later. This protocol supports both read and write operations.
- HFTP
- Select this option if the HDFS cluster is running a version older than HDFS 1.0, or if your organization does not allow the WEBHDFS protocol. This protocol only supports the read operation.
- HAR
- Select this option to access Hadoop archive files. If you choose this option, specify the path to the archive file in the Path field. This protocol only supports the read operation.
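To make the WEBHDFS option more concrete, the following minimal sketch shows how a WebHDFS REST URL is typically assembled from the host and port of the NameNode. It does not represent how Spectrum™ Technology Platform issues requests internally. The host name, the port 50070, and the file paths are hypothetical values chosen for illustration, and the helper name `webhdfs_url` is not part of any product API.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS REST URL: http://<host>:<port>/webhdfs/v1/<path>?op=<OP>."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode address and paths, for illustration only.
print(webhdfs_url("namenode.example.com", 50070, "/user/spectrum/input.csv", "OPEN"))
print(webhdfs_url("namenode.example.com", 50070, "/user/spectrum", "LISTSTATUS"))
```

OPEN and LISTSTATUS are read operations in the WebHDFS REST API. Use the HTTP port that your own cluster exposes rather than the one assumed here.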
- If you selected the WEBHDFS protocol, expand Advanced server options. Review the settings and make any necessary changes.
- Replication factor
- Specifies how many data nodes to replicate each block to. For example, the default setting of 3 replicates each block to three different nodes in the cluster. The maximum replication factor is 1024.
- Block size
- Specifies the size of each block. HDFS breaks up a file into blocks of the size you specify here. For example, if you specify the default 64 MB, each file is broken up into 64 MB blocks. Each block is then replicated to the number of nodes in the cluster specified in the Replication factor field.
- File permissions
- Specifies the level of access to files written to the HDFS cluster by Spectrum™ Technology Platform.
Note: The Execute permission is not applicable to Spectrum™ Technology Platform.
You can specify read and write permissions for each of these options:
- User
- This is the user specified above, either Server user or the user specified in the User name field.
- Group
- This refers to any group of which the user is a member. For example, if the user is john123, then Group permissions apply to any group of which john123 is a member.
- Other
- This refers to any other users as well as groups of which the specified user is not a member.
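As a rough illustration of the advanced options described above, the sketch below works through the arithmetic for block size and replication factor, and shows how read and write choices for User, Group, and Other map to a conventional three-digit octal mode. The function names are hypothetical, the 200 MB file is an invented example, and the octal mapping is the general POSIX-style convention rather than anything specific to Management Console.

```python
MB = 1024 * 1024


def block_count(file_size: int, block_size: int = 64 * MB) -> int:
    """Number of HDFS blocks for a file; the last block may be only partly full."""
    return -(-file_size // block_size)  # ceiling division


def replica_count(file_size: int, block_size: int = 64 * MB, replication: int = 3) -> int:
    """Total block replicas stored across the cluster."""
    return block_count(file_size, block_size) * replication


def permission_mode(user: str, group: str, other: str) -> str:
    """Translate 'rw', 'r', 'w', or '' for each class into an octal mode string.

    Execute is deliberately left out because it does not apply here.
    """
    bits = {"r": 4, "w": 2}
    digits = []
    for perms in (user, group, other):
        if any(p not in bits for p in perms):
            raise ValueError(f"unsupported permission string: {perms!r}")
        digits.append(str(sum(bits[p] for p in set(perms))))
    return "".join(digits)


# A hypothetical 200 MB file with the default 64 MB block size and replication factor 3.
print(block_count(200 * MB))          # 4 blocks (three full, one partial)
print(replica_count(200 * MB))        # 12 block replicas in total
print(permission_mode("rw", "r", ""))  # 640: user read/write, group read, other none
```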
In the grid below the File permissions table, specify the server properties for Hadoop to ensure that the sorting and filtering features work as desired when the connection is used in a stage or activity.
To add a new property, click the add button. Then define the properties as described in this table, depending on the stage or activity that will use the Hadoop connection and on whether Hadoop 1.x or Hadoop 2.x is being used.
Stage or Activity using the HDFS Connection:
- Stage Read from Sequence File
- Activity Run Hadoop Pig
Required Server Properties:
Hadoop 1.x Parameters
- fs.default.name
- Specifies the node and port on which Hadoop runs.
For example, hdfs://152.144.226.224:9000
- mapred.job.tracker
- Specifies the hostname or IP address, and port, on which the MapReduce job tracker runs. If the host name is entered as local, then jobs are run as a single map and reduce task.
For example, 152.144.226.224:9001
- dfs.namenode.name.dir
- Specifies where on the local file system a DFS name node should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
For example, file:/home/hduser/Data/namenode
- dfs.datanode.data.dir
- Specifies where on the local file system a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data is stored in all the named directories, which are usually on different devices. Directories that do not exist are ignored.
For example, file:/home/hduser/Data/datanode
- hadoop.tmp.dir
- Specifies the base location for other temporary directories.
For example, /home/hduser/Data/tmp
Hadoop 2.x Parameters
- fs.defaultFS
- Specifies the node and port on which Hadoop runs.
For example, hdfs://152.144.226.224:9000
Note: We recommend that the parameter name fs.defaultFS be used in Spectrum™ Technology Platform 11 SP1 and later.
- yarn.resourcemanager.resource-tracker.address
- Specifies the hostname or IP address, and port, of the Resource Manager.
For example, 152.144.226.224:8025
- yarn.resourcemanager.scheduler.address
- Specifies the address of the Scheduler Interface.
For example, 152.144.226.224:8030
- yarn.resourcemanager.address
- Specifies the address of the Applications Manager interface that is contained in the Resource Manager.
For example, 152.144.226.224:8041
- mapreduce.jobhistory.address
- Specifies the host name or IP address, and port, on which the MapReduce Job History Server is running.
For example, 152.144.226.224:10020
- mapreduce.application.classpath
- Specifies the CLASSPATH for MapReduce applications. This CLASSPATH denotes the location where classes related to MapReduce applications are found. Separate entries with a comma.
For example:
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/share/hadoop/common/*,
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,
$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
- mapreduce.app-submission.cross-platform
- Handles various platform issues that arise if your Spectrum server runs on a Windows machine and you install Cloudera on it. If your Spectrum server and Cloudera are running on different operating systems, set this parameter to true. Otherwise, set it to false.
Note: Cloudera does not support Windows clients. Configuring this parameter is a workaround, not a solution to platform-related issues.
If you have checked the Kerberos checkbox, then add these Kerberos configuration properties:
- hadoop.security.authentication
- Specifies the type of authentication security being used. Enter the value kerberos.
- yarn.resourcemanager.principal
- Specifies the Kerberos principal used for the YARN Resource Manager.
For example, yarn/_HOST@HADOOP.COM
- dfs.namenode.kerberos.principal
- Specifies the Kerberos principal used for the NameNode of your Hadoop Distributed File System (HDFS).
For example, hdfs/_HOST@HADOOP.COM
- dfs.datanode.kerberos.principal
- Specifies the Kerberos principal used for the DataNode of your Hadoop Distributed File System (HDFS).
For example, hdfs/_HOST@HADOOP.COM
- Stage Read from File
- Stage Write to File
- Stage Read from Hive ORC File
- Stage Write to Hive ORC File
Table 1. Properties for Read from File, Write to File, Read from Hive ORC File, and Write to Hive ORC File
Hadoop 1.x Properties
- fs.default.name
- Specifies the node and port on which Hadoop runs.
For example, hdfs://152.144.226.224:9000
Hadoop 2.x Properties
- fs.defaultFS
- Specifies the node and port on which Hadoop runs.
For example, hdfs://152.144.226.224:9000
Note: We recommend that the parameter name fs.defaultFS be used in Spectrum™ Technology Platform 11 SP1 and later.
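Before entering the server properties in the grid, it can help to check a planned set of values against the property names listed above. The sketch below is one way to do that outside the product, assuming a Hadoop 2.x cluster used by Read from Sequence File or Run Hadoop Pig. The helper `missing_properties` is hypothetical, the addresses are the example values from this page, and the CLASSPATH value is abbreviated.

```python
# Property names taken from the Hadoop 2.x and Kerberos lists on this page.
HADOOP2_KEYS = [
    "fs.defaultFS",
    "yarn.resourcemanager.resource-tracker.address",
    "yarn.resourcemanager.scheduler.address",
    "yarn.resourcemanager.address",
    "mapreduce.jobhistory.address",
    "mapreduce.application.classpath",
    "mapreduce.app-submission.cross-platform",
]
KERBEROS_KEYS = [
    "hadoop.security.authentication",
    "yarn.resourcemanager.principal",
    "dfs.namenode.kerberos.principal",
    "dfs.datanode.kerberos.principal",
]


def missing_properties(props: dict, kerberos: bool = False) -> list:
    """Return the listed property names that are absent or empty in props."""
    required = HADOOP2_KEYS + (KERBEROS_KEYS if kerberos else [])
    return [key for key in required if not props.get(key)]


# Example values copied from this page; replace them with your cluster's values.
props = {
    "fs.defaultFS": "hdfs://152.144.226.224:9000",
    "yarn.resourcemanager.resource-tracker.address": "152.144.226.224:8025",
    "yarn.resourcemanager.scheduler.address": "152.144.226.224:8030",
    "yarn.resourcemanager.address": "152.144.226.224:8041",
    "mapreduce.jobhistory.address": "152.144.226.224:10020",
    # Abbreviated here; the full comma-separated example appears in the list above.
    "mapreduce.application.classpath": "$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*",
    "mapreduce.app-submission.cross-platform": "true",
}

print(missing_properties(props))                 # nothing missing without Kerberos
print(missing_properties(props, kerberos=True))  # the four Kerberos properties
```

With the Kerberos check box selected, the second call reports the four Kerberos properties until they are added to the set.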
- To test the connection, click Test.
- Click Save.
After you have defined a connection to an HDFS cluster, it becomes available in source and sink stages in Enterprise Designer, such as Read from File and Write to File. To select the HDFS cluster, click Remote Machine when you define a file in a source or sink stage.