Connecting to Hadoop

In order for Spectrum™ Technology Platform to access data in Hadoop, you must define a connection to Hadoop using Management Console. Once you do this, you can create flows in Enterprise Designer that can read data from, and write data to, Hadoop.

Attention: Spectrum™ Technology Platform does not support Hadoop 2.x for Kerberos on Windows platforms.
  1. Open Management Console.
  2. Go to Resources > Data Sources.
  3. Click the Add button.
  4. In the Name field, enter a name for the connection. The name can be anything you choose.
    Note: Once you save a connection you cannot change the name.
  5. In the Type field, choose HDFS.
  6. In the Host field, enter the hostname or IP address of the NameNode in the HDFS cluster.
  7. In the Port field, enter the network port number.
  8. In User, select one of:
    Server user
    Choose this option if authentication is enabled in your HDFS cluster. This option will use the user credentials that the Spectrum™ Technology Platform server runs under to authenticate to HDFS.
    User name
    Choose this option if authentication is disabled in your HDFS cluster.
  9. Select Kerberos to enable Kerberos authentication for this HDFS file server connection.
  10. If you enabled Kerberos authentication, enter the path to the keytab file in the Keytab file path field.
    Note: Ensure that the keytab file is placed on the Spectrum™ Technology Platform server.
  11. In the Protocol field, select one of:
    WEBHDFS
    Select this option if the HDFS cluster is running HDFS 1.0 or later. This protocol supports both read and write operations.
    HFTP
    Select this option if the HDFS cluster is running a version older than HDFS 1.0, or if your organization does not allow the WEBHDFS protocol. This protocol only supports the read operation.
    HAR
    Select this option to access Hadoop archive files. If you choose this option, specify the path to the archive file in the Path field. This protocol only supports the read operation.
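As an independent check of the Host and Port values when the WEBHDFS protocol is selected, the following sketch builds a URL in the form used by Hadoop's WebHDFS REST API (/webhdfs/v1/<path>?op=...). The function name build_webhdfs_url, the sample path, and port 50070 are illustrative assumptions. How Spectrum™ Technology Platform itself issues WebHDFS requests is not described in this topic.

```python
from urllib.parse import quote, urlencode


def build_webhdfs_url(host, port, path, op="LISTSTATUS", user=None):
    """Build a WebHDFS REST URL for the given NameNode host and port.

    host, port: the values entered in the Host and Port fields (assumed
    here to point at the NameNode's HTTP endpoint).
    path: absolute HDFS path, for example "/user/data".
    user: optional user name, sent as user.name when authentication is off.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    params = {"op": op}
    if user:
        params["user.name"] = user
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical port and path; replace with your cluster's values.
    print(build_webhdfs_url("152.144.226.224", 50070, "/user/data", user="john123"))
```

Requesting the resulting URL from the machine that hosts the Spectrum™ Technology Platform server is one way to confirm that the NameNode is reachable before you test the connection in Management Console.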
  12. Expand the Advanced options.
  13. If you selected the WEBHDFS protocol, you can specify these advanced options as required:
    Replication factor
    Specifies how many data nodes to replicate each block to. For example, the default setting of 3 replicates each block to three different nodes in the cluster. The maximum replication factor is 1024.
    Block size
    Specifies the size of each block. HDFS breaks up a file into blocks of the size you specify here. For example, if you specify the default 64 MB, each file is broken up into 64 MB blocks. Each block is then replicated to the number of nodes in the cluster specified in the Replication factor field.
    File permissions
    Specifies the level of access to files written to the HDFS cluster by Spectrum™ Technology Platform. You can specify read and write permissions for each of these options:
    Note: The Execute permission is not applicable to Spectrum™ Technology Platform.
    User
    This is the user specified above, either Server user or the user specified in the User name field.
    Group
    This refers to any group of which the user is a member. For example, if the user is john123, then Group permissions apply to any group of which john123 is a member.
    Other
    This refers to any other users as well as groups of which the specified user is not a member.
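The arithmetic behind the Block size and Replication factor options, and the way read and write choices for User, Group, and Other map onto a conventional octal mode, can be sketched as follows. The helper names are hypothetical, the lower bound of 1 on the replication factor is an assumption, and the octal mapping is the usual POSIX-style convention rather than a documented Spectrum™ Technology Platform format.

```python
import math

MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of HDFS blocks a file of the given size is split into."""
    return math.ceil(file_size_bytes / block_size_bytes)


def stored_block_copies(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Total block copies held across the cluster for one file."""
    # The upper limit of 1024 comes from this topic; the lower limit of 1
    # is an assumption.
    if not 1 <= replication <= 1024:
        raise ValueError("replication factor must be between 1 and 1024")
    return block_count(file_size_bytes, block_size_bytes) * replication


def permission_mode(user="rw", group="r", other=""):
    """Octal mode string from read/write flags; execute is never set."""
    def digit(flags):
        return (4 if "r" in flags else 0) + (2 if "w" in flags else 0)
    return f"{digit(user)}{digit(group)}{digit(other)}"


if __name__ == "__main__":
    print(block_count(200 * MB))          # 200 MB file, 64 MB blocks
    print(stored_block_copies(200 * MB))  # default replication factor of 3
    print(permission_mode("rw", "r", ""))
```

With the defaults, a 200 MB file is split into 4 blocks and stored as 12 block copies, and read/write for User with read-only for Group corresponds to mode 640.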
  14. In the grid below the File permissions table, specify the server properties for Hadoop to ensure that the sorting and filtering features work as desired when the connection is used in a stage or activity. To add properties, do one of these:
    • Add each property and its value manually in the Property and Value fields.
    • Upload your configuration XML file. The XML file should be similar to hdfs-site.xml, yarn-site.xml, or core-site.xml.
      Note: The configuration file needs to be placed on the server.
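A configuration XML file in the style of hdfs-site.xml or core-site.xml consists of a configuration element that contains property entries, each with a name and a value child. The following sketch generates such a file from a dictionary. The function name, the output file name, and the choice of sample properties are illustrative assumptions; the sample values are the examples used in this topic.

```python
import xml.etree.ElementTree as ET


def build_hadoop_config(properties):
    """Return Hadoop-style configuration XML for a dict of name/value pairs."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # A shortened classpath; list every entry your cluster needs.
    classpath = [
        "$HADOOP_CONF_DIR",
        "$HADOOP_COMMON_HOME/share/hadoop/common/*",
        "$HADOOP_COMMON_HOME/share/hadoop/common/lib/*",
    ]
    xml_text = build_hadoop_config({
        "fs.defaultFS": "hdfs://152.144.226.224:9000",
        "yarn.resourcemanager.address": "152.144.226.224:8041",
        # Classpath entries must be comma separated.
        "mapreduce.application.classpath": ",".join(classpath),
    })
    # Hypothetical file name.
    with open("spectrum-hdfs-site.xml", "w", encoding="utf-8") as f:
        f.write(xml_text)
```

Generating the file from one dictionary keeps the property names and values in a single place, which makes it easier to compare them against the required properties for your stage and Hadoop version.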

    This table describes the properties and their values, depending on the stage or activity that will use the Hadoop connection. The properties are also dependent on the Hadoop version used (Hadoop 1.x or Hadoop 2.x).

    Stage or Activity using the HDFS Connection:
    • Stage Read from Sequence File
    • Activity Run Hadoop Pig
    Required Server Properties:
    Hadoop 1.x Parameters
    fs.default.name
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000

    mapred.job.tracker
    Specifies the hostname or IP address, and port on which the MapReduce job tracker runs. If the host name is entered as local, then jobs are run as a single map and reduce task.

    For example, 152.144.226.224:9001

    dfs.namenode.name.dir
    Specifies where on the local file system a DFS name node should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.

    For example, file:/home/hduser/Data/namenode

    dfs.datanode.data.dir
    Specifies where on the local file system a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all the named directories that are usually on different devices. Directories that do not exist are ignored.

    For example, file:/home/hduser/Data/datanode

    hadoop.tmp.dir
    Specifies the base location for other temporary directories.

    For example, /home/hduser/Data/tmp

    Hadoop 2.x Parameters

    fs.defaultFS
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000

    Note: For Spectrum versions 11.0 and earlier, the parameter name fs.defaultfs must be used. Note the case difference.

    For versions 11 SP1 and later, both fs.defaultfs and fs.defaultFS are valid. We recommend using fs.defaultFS in Spectrum™ Technology Platform 11 SP1 and later.

    yarn.resourcemanager.resource-tracker.address
    Specifies the hostname or IP address, and port, of the Resource Manager.

    For example, 152.144.226.224:8025

    yarn.resourcemanager.scheduler.address
    Specifies the address of the Scheduler Interface.

    For example, 152.144.226.224:8030

    yarn.resourcemanager.address
    Specifies the address of the Applications Manager interface that is contained in the Resource Manager.

    For example, 152.144.226.224:8041

    mapreduce.jobhistory.address
    Specifies the host name or IP address, and port on which the MapReduce Job History Server is running.

    For example, 152.144.226.224:10020

    mapreduce.application.classpath
    Specifies the CLASSPATH for MapReduce applications. This CLASSPATH denotes the location where classes related to MapReduce applications are found.
    Note: The entries should be comma separated.
    For example:

    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,
    $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
    $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,
    $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*

    mapreduce.app-submission.cross-platform
    Handles various platform issues that arise if your Spectrum server runs on a Windows machine and you install Cloudera on it. If your Spectrum server and Cloudera run on different operating systems, set this parameter to true. Otherwise, set it to false.
    Note: Cloudera does not support Windows clients. Configuring this parameter is a workaround and not a solution to all resulting platform issues.
    If you selected the Kerberos check box above, also add these Kerberos configuration properties:
    hadoop.security.authentication
    The type of authentication security being used. Enter the value kerberos.
    yarn.resourcemanager.principal
    The Kerberos principal being used for the resource manager for your Hadoop YARN resource negotiator.

    For example, yarn/_HOST@HADOOP.COM

    dfs.namenode.kerberos.principal
    The Kerberos principal being used for the namenode of your Hadoop Distributed File System (HDFS).

    For example, hdfs/_HOST@HADOOP.COM

    dfs.datanode.kerberos.principal
    The Kerberos principal being used for the datanode of your Hadoop Distributed File System (HDFS).

    For example, hdfs/_HOST@HADOOP.COM

    Stage or Activity using the HDFS Connection:
    • Stage Read from File
    • Stage Write to File
    • Stage Read from Hive ORC File
    • Stage Write to Hive ORC File
    Required Server Properties:
    Hadoop 1.x Parameters
    fs.default.name
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000

    Hadoop 2.x Parameters

    fs.defaultFS
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000

    Note: For Spectrum versions 11.0 and earlier, the parameter name fs.defaultfs must be used. Note the case difference.

    For versions 11 SP1 and later, both fs.defaultfs and fs.defaultFS are valid. We recommend using fs.defaultFS in Spectrum™ Technology Platform 11 SP1 and later.
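The property lists in the table above can be turned into a simple checklist. The following sketch covers the Read from Sequence File stage and the Run Hadoop Pig activity and reports which listed properties are absent from a set of values. The names REQUIRED, KERBEROS, and missing_properties are hypothetical, and treating every listed property as mandatory is an assumption. The sketch does not handle the lowercase fs.defaultfs name used by versions 11.0 and earlier.

```python
# Property names copied from the table above for the Read from Sequence
# File stage and the Run Hadoop Pig activity.
REQUIRED = {
    "1.x": [
        "fs.default.name",
        "mapred.job.tracker",
        "dfs.namenode.name.dir",
        "dfs.datanode.data.dir",
        "hadoop.tmp.dir",
    ],
    "2.x": [
        "fs.defaultFS",
        "yarn.resourcemanager.resource-tracker.address",
        "yarn.resourcemanager.scheduler.address",
        "yarn.resourcemanager.address",
        "mapreduce.jobhistory.address",
        "mapreduce.application.classpath",
        "mapreduce.app-submission.cross-platform",
    ],
}

# Additional properties listed for connections that use Kerberos.
KERBEROS = [
    "hadoop.security.authentication",
    "yarn.resourcemanager.principal",
    "dfs.namenode.kerberos.principal",
    "dfs.datanode.kerberos.principal",
]


def missing_properties(props, hadoop_version, kerberos=False):
    """Return the listed property names that are absent from props."""
    required = list(REQUIRED[hadoop_version])
    if kerberos:
        required += KERBEROS
    return [name for name in required if name not in props]


if __name__ == "__main__":
    entered = {"fs.defaultFS": "hdfs://152.144.226.224:9000"}
    for name in missing_properties(entered, "2.x", kerberos=True):
        print("Missing:", name)
```

Running a check like this before you enter values in the grid can catch an omitted property earlier than a failed job would.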

  15. To test the connection, click Test.
  16. Click Save.

After you have defined a connection to an HDFS cluster, it becomes available in source and sink stages in Enterprise Designer, such as Read from File and Write to File. To select the HDFS cluster, click Remote Machine when defining a file in a source or sink stage.