Write to Hive File

The Write to Hive File stage writes the dataflow input to the specified output Hive file.

You can select any of these supported Hive file formats for the output file: ORC, RC, Parquet, and Avro.

Related task:

Connecting to Hadoop: To use the Write to Hive File stage, you must first create a connection to the Hadoop file server. The name under which you save the connection is then displayed as the server name.

File Properties tab

Table 1. Common File Properties
Fields Description
Server name Indicates that the file selected in the File name field is located on the Hadoop system. When you select a file located on a Hadoop system, Server name displays the name of that file server, as specified in Management Console.
File name Click the ellipsis button (...) to browse to the output Hive file to be created on the defined Hadoop file server. The output data of this stage is written to the selected file.
Note: You need to create a connection to the Hadoop file server in Management Console before using it in the stage.
File type Select one of the four supported Hive file formats:
  • ORC
  • RC
  • Parquet
  • Avro
Table 2. ORC File Properties
Fields Description
Buffer size Defines the buffer size to be allocated while writing to an ORC file. This is specified in kilobytes.
Note: The default buffer size is 256 KB.
Stripe size Defines the size of stripes to be created while writing to an ORC file. This is specified in megabytes.
Note: The default stripe size is 64 MB.
Row index stride Defines the number of rows to be written between two consecutive row index entries.
Note: The default Row Index Stride is 10000 rows.
Compression type Defines the compression type to be used while writing to an ORC file. The compression types available are ZLIB and SNAPPY.
Note: The default compression type is ZLIB.
Padding Indicates whether the stripes are padded to minimize stripes that cross HDFS block boundaries, while writing to an ORC file.
Note: By default, the Padding checkbox is selected.
Preview The first 50 records of the written file are fetched and displayed in the Preview grid, after the dataflow is run at least once and the data has been written to the selected file.
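The row index stride determines how often an index entry is recorded while writing the ORC file; readers can then skip directly to any stride-sized window of rows. A minimal stdlib-only sketch of the idea (the function name is illustrative, not part of the product):

```python
def row_index_positions(total_rows, stride=10000):
    """Return the starting row of each row-index entry.

    With the default stride of 10000, a 25000-row file gets index
    entries at rows 0, 10000, and 20000, so a reader can seek to any
    10000-row window without scanning from the start of the stripe.
    """
    return list(range(0, total_rows, stride))
```

A smaller stride gives finer-grained seeking at the cost of a larger index.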
Table 3. RC File Properties
Fields Description
Buffer size Defines the buffer size to be allocated while writing to an RC file. This is specified in kilobytes.
Note: The default buffer size is 256 KB.
Block size Defines the size of blocks to be created while writing to an RC file. This is specified in megabytes.
Note: The default block size is 64 MB.
Compression type Defines the compression type to be used while writing to an RC file. The compression types available are NONE and DEFLATE.
Note: The default compression type is NONE.
Preview The first 50 records of the written file are fetched and displayed in the Preview grid, after the dataflow is run at least once and the data has been written to the selected file.

The Fields tab is used to define the sequence and datatype of the required fields.

Note: For RC file type, you must define the metadata of the output file before clicking Preview to load the Preview grid.
Table 4. Parquet File Properties
Fields Description
Compression type Defines the compression type to be used while writing to a PARQUET file. The compression types available are UNCOMPRESSED, GZIP and SNAPPY.
Note: The default compression type is UNCOMPRESSED.
Block size Defines the size of block to be created while writing to a PARQUET file. This is specified in megabytes.
Note: The default block size is 128 MB.
Page size The page is the unit of compression in a PARQUET file; when reading, each page can be decompressed independently. This is specified in kilobytes.
Note: The default page size is 1024 KB.
Enable dictionary Enables or disables dictionary encoding.
Attention: Enable dictionary must be selected for the Dictionary Page size field to be enabled.
Note: The default value is true.
Dictionary Page size There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size functions like the page size. This is specified in kilobytes.
Note: The default dictionary Page size is 1024 KB.
Writer version Parquet supports two writer API versions: PARQUET_1_0 and PARQUET_2_0.
Note: The default is PARQUET_1_0.
Preview The first 50 records of the written file are fetched and displayed in the Preview grid, after the dataflow is run at least once and the data has been written to the selected file.
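Dictionary encoding stores each distinct value of a column once, in a per-column dictionary, and replaces the column's values with small integer indices into it. A rough stdlib-only illustration of the idea (this is a conceptual sketch, not the actual Parquet wire format):

```python
def dictionary_encode(column):
    """Encode a column as (dictionary, indices).

    Each distinct value is stored once in the dictionary; the column
    itself becomes a list of integer indices into that dictionary.
    """
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> its index in the dictionary
    indices = []
    for value in column:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices
```

For a repetitive column such as `["NY", "CA", "NY", "TX", "CA"]` this yields the dictionary `["NY", "CA", "TX"]` and the indices `[0, 1, 0, 2, 1]`, which is why the encoding pays off on low-cardinality columns.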
Table 5. Avro File Properties
Fields Description
Sync Interval (in Bytes) Specifies the approximate number of uncompressed bytes to be written in each block. Valid values range from 32 to 2^30; however, keeping the sync interval between 2K and 2M is recommended.
Note: The default sync interval is 16000.
Compression Defines the compression type to be used while writing to an Avro file. The compression types available are NONE, SNAPPY and DEFLATE. Choosing DEFLATE compression gives you an additional option of selecting the compression level (described below).
Note: The default compression type is NONE.
Compression level

This field is displayed if you select the DEFLATE option in the Compression field above.

It can have values ranging from 0 to 9, where 0 denotes no compression. Higher levels from 1 to 9 compress the data more, at the cost of increased compression time.

Note: The default compression level is 1.
Preview The first 50 records of the written file are fetched and displayed in this grid, after the dataflow is run at least once and the data is written to the selected file.
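Avro's DEFLATE codec uses the same algorithm that Python's standard zlib module exposes, so the effect of the 0 to 9 level setting can be sketched with the standard library (level 0 stores the data uncompressed with a small framing overhead; higher levels trade compression time for smaller output):

```python
import zlib

# Highly repetitive sample data, so the effect of the level is visible.
data = b"hello hive " * 1000

stored = zlib.compress(data, 0)  # level 0: no compression, slight overhead
fast = zlib.compress(data, 1)    # level 1: the default in this stage
best = zlib.compress(data, 9)    # level 9: smallest output, slowest
```

On this input, `stored` is slightly larger than the original data, while both `fast` and `best` are far smaller, with `best` no larger than `fast`.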

Fields tab

The Fields tab defines the names and datatypes of the fields coming into this stage and lets you select which of those fields are written to the output file.

For more information, see Defining Fields for Writing to Hive File.

Runtime tab

The Runtime tab provides the option to overwrite an existing file of the same name on the configured Hadoop file server. If you select the Overwrite checkbox, then when the dataflow runs, the new output Hive file overwrites any existing file of the same name on the same Hadoop file server.

By default, the Overwrite checkbox is unchecked.
Note: If you do not select Overwrite and the file to be written has the same name as an existing file on the same Hadoop file server, an exception is thrown when the dataflow runs.
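This overwrite behavior mirrors the difference between an exclusive-create and a truncating file open. A hypothetical sketch using Python's standard library (`write_output` is illustrative, not part of the product):

```python
import os
import tempfile

def write_output(path, data, overwrite=False):
    """Write data to path; raise FileExistsError if it exists and overwrite is False."""
    mode = "w" if overwrite else "x"  # "x" = exclusive create: fails on an existing file
    with open(path, mode) as f:
        f.write(data)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "out.orc")
    write_output(target, "first run")           # no existing file: succeeds
    try:
        write_output(target, "second run")      # file exists, Overwrite unchecked
        raised = False
    except FileExistsError:
        raised = True                           # mirrors the dataflow exception
    write_output(target, "second run", overwrite=True)  # Overwrite checked: replaces
    with open(target) as f:
        final = f.read()
```

The second call fails just as the dataflow does without Overwrite, while the third call replaces the file.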