Defining Fields for Reading from Hive File

The Fields tab of the Read from Hive File stage lists the schema names, data types, positions, and assigned names of the fields in the file.

  1. Click Regenerate.
    In case of ORC, Avro, and Parquet files, this generates the schema based on the metadata of the existing file. In case of RC files, any fields added before clicking Preview are cleared.

    The grid displays the columns Name, Type, Stage Field, and Include.

    The Name column displays the field name, as derived from the header record of the file.

    The Type column lists the data type of each field in the file.

    The stage supports the following data types:

    boolean
    A logical type with two values: true and false.
    date
    A data type that contains a month, day, and year. For example, 2012-01-30 or January 30, 2012. You can specify a default date format in Management Console.
    datetime
    A data type that contains a month, day, year, and hours, minutes, and seconds.

    For example, 2012/01/30 6:15:00 PM.

    Note: The datetime datatype in Spectrum maps to the timestamp datatype of Hive files.
    double
    A numeric data type that contains both negative and positive double precision numbers between 2^-1074 and (2-2^-52)×2^1023. In E notation, the range of values is -1.79769313486232E+308 to 1.79769313486232E+308.
    bigdecimal
    A numeric data type that supports 38 decimal points of precision. Use this data type for data that will be used in mathematical calculations requiring a high degree of precision, especially those involving financial data. The bigdecimal data type supports more precise calculations than the double data type.
    Note: For RC, Avro, and Parquet Hive files, fields of the decimal datatype in the input file are converted to bigdecimal datatype.
    long
    A numeric data type that contains both negative and positive whole numbers between -2^63 (-9,223,372,036,854,775,808) and 2^63-1 (9,223,372,036,854,775,807).
    Note: The long datatype in Spectrum maps to the bigint datatype of Hive files.
    integer
    A numeric data type that contains both negative and positive whole numbers between -2^31 (-2,147,483,648) and 2^31-1 (2,147,483,647).
    float
    A numeric data type that contains both negative and positive single precision numbers between 2^-149 and (2-2^-23)×2^127. In E notation, the range of values is -3.402823E+38 to 3.402823E+38.
    string
    A sequence of characters.
    Note: In case of RC files, smallint and complex datatypes are not supported.
    The Position column displays the starting position of the respective field within a record.
  2. In the Stage Field column, change each field name to the name you want.
    By default, this column displays the field names read from the file.
  3. In the Include column, select the check boxes for the fields you want to include in the output of the stage.
    By default, all the fields are selected in this column.
  4. For RC files, you can add and remove fields, and change the sequence of the selected columns in the output, using the following buttons:

    Option Name          Description
    Add                  Adds a field to the output.
    Modify               Modifies the selected field's name and datatype.
    Remove               Removes the selected field from the output.
    Move Up/Move Down    Reorders the position of the selected field in the output.

    Note: This feature is only available for RC files.
  5. Click OK.
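
The type notes and numeric ranges listed in step 1 can be checked with a short Python sketch. The names HIVE_TO_SPECTRUM, spectrum_type, RANGES, and fits are hypothetical helpers introduced for illustration and are not part of any Spectrum or Hive API. The sketch encodes only the mappings and limits stated in this topic, and passing unmapped type names through unchanged is an assumption.

```python
import math
import struct
import sys

# Mappings stated in the notes of this topic. The dict itself is a
# hypothetical helper, not a Spectrum or Hive API.
HIVE_TO_SPECTRUM = {
    "timestamp": "datetime",
    "bigint": "long",
    "decimal": "bigdecimal",  # noted for RC, Avro, and Parquet files
}


def spectrum_type(hive_type):
    """Return the Spectrum type name; unmapped names pass through (assumption)."""
    name = hive_type.lower()
    return HIVE_TO_SPECTRUM.get(name, name)


# Whole-number ranges from the long and integer descriptions.
RANGES = {
    "long": (-2**63, 2**63 - 1),
    "integer": (-2**31, 2**31 - 1),
}


def fits(type_name, value):
    """Check whether a whole number lies inside the documented range."""
    low, high = RANGES[type_name]
    return low <= value <= high


# Floating-point limits from the double and float descriptions.
DOUBLE_MIN_POSITIVE = math.ldexp(1.0, -1074)       # 2^-1074
DOUBLE_MAX = (2 - 2**-52) * math.ldexp(1.0, 1023)  # (2-2^-52) x 2^1023
FLOAT_MIN_POSITIVE = math.ldexp(1.0, -149)         # 2^-149
FLOAT_MAX = (2 - 2**-23) * math.ldexp(1.0, 127)    # (2-2^-23) x 2^127

if __name__ == "__main__":
    print(spectrum_type("bigint"))
    print(fits("integer", 2**31))
    # Compare the documented double maximum with the platform's double maximum.
    print(DOUBLE_MAX == sys.float_info.max)
    # 0x7F7FFFFF is the largest finite IEEE 754 single-precision bit pattern.
    print(FLOAT_MAX == struct.unpack(">f", bytes.fromhex("7f7fffff"))[0])
```

A check such as fits("integer", value) can help decide whether a Hive column needs the long type rather than integer before the field is included in the stage output.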