Spark Sorter
The Spark Sorter activity sorts very large volumes of records. It uses Apache Spark libraries and runs on your Spectrum™ Technology Platform server.
Currently, only delimited files located on the Spectrum™ Technology Platform server are accepted as input.
Field | Description |
---|---|
Server name |
Indicates the location of the file you select as input. Since the Spark Sorter activity only accepts files located on the Spectrum™ Technology Platform, this field displays Spectrum™ Technology Platform. |
File name |
Specifies the path to the file. Click the ellipses button (...) to go to the file you want. You can read multiple files in a directory by using wild card characters. The wild card characters * and ? are supported. For example, you could specify *.csv to read in all files with a .csv extension.
Attention: If the Spectrum™ Technology Platform server is running on Unix or
Linux, remember that file names and paths on these platforms are case sensitive.
|
Record Type | The format of the records in the file. Currently, only delimited files are accepted as input.
|
Character Encoding |
The character encoding of the input file. |
Field separator |
Specifies the character used to separate fields in a delimited file. For example, this record uses a pipe (|) as a field separator:
The characters available to define as field separators are:
If the file uses a different character as a field separator, click the ellipses button to select another character as a delimiter. |
Text qualifier |
The character used to surround text values in a delimited file. For example, this record uses double quotes (") as a text qualifier.
The characters available to define as text qualifiers are:
If the file uses a different text qualifier, click the ellipses button to select another character as a text qualifier. |
Record separator |
Specifies the character used to separate records in a line sequential or delimited file. This field is not available if you check the Use default EOL check box. The record separator settings available are:
If your file uses a different record separator, click the ellipses button to select another character as a record separator. |
Use default EOL |
Specifies that the file's record separator is the default end of line (EOL) character used on the operating system on which the Spectrum™ Technology Platform server is running. Do not select this option if the file uses an EOL character that is different from the default EOL character used on the server's operating system. For example, if the file uses a Windows EOL but the server is running on Linux, do not check this option. Instead, select the Windows option in the Record separator field. |
First row is header record |
Specifies whether the first record in a delimited file contains header information and not data. For example, this file snippet shows a header row in the first record.
|
Output |
Specifies the path to the output file on the Spectrum™ Technology Platform server. Click the ellipses button (...) to go to the output directory and file name you want. Attention: If the Spectrum™ Technology Platform server is running on Unix or
Linux, remember that file names and paths on these platforms are case sensitive.
|
Overwrite | Indicates that the output file overwrites any existing file that has the same name as the one specified in the Output field. |
Concatenate | Indicates that all Spark part files must be concatenated into a single output file in the specified Output location. |
Preview | Once the input file is selected in the File name field, the
Preview grid displays the first 100 records of the existing output
file. To correctly display all the separate column values, click Regenerate on the Fields tab. |
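The field separator, text qualifier, and header settings described in the table can be illustrated outside the product with a minimal Python sketch. It uses only the standard `csv` module and is not how the Spark Sorter activity itself reads files; the function name `parse_delimited` and the sample records are hypothetical.

```python
import csv
import io


def parse_delimited(text, field_separator="|", text_qualifier='"',
                    first_row_is_header=True):
    """Parse delimited text.

    Returns a list of dicts keyed by the header names when the first row
    is a header record, otherwise a list of lists of field values.
    """
    reader = csv.reader(io.StringIO(text),
                        delimiter=field_separator,
                        quotechar=text_qualifier)
    rows = list(reader)
    if not first_row_is_header:
        return rows
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


# Hypothetical sample: pipe as field separator, double quotes as text qualifier.
# The qualifier lets a value contain the separator itself ("Lee|Ann").
sample = 'Name|City\n"Smith, John"|Boston\n"Lee|Ann"|Austin\n'
records = parse_delimited(sample)
```

In this sketch the text qualifier is what keeps `Lee|Ann` together as one field even though it contains the field separator.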
Fields Tab
The Fields tab defines the names, types, and positions of fields in the file. For more information, see:
Sort Tab
The Sort tab defines fields by which to sort the input records before they are sent into the dataflow. For more information, see Sorting Records.
Configuration Tab
Use this tab to specify additional properties for running the job. Define as many property-value pairs as required. You can add properties directly in the grid one at a time, or import them from a configuration property file in this format:
<configuration>
  <property>
    <name>key</name>
    <value>some_value</value>
    <description>A brief description of the purpose of the property key.</description>
  </property>
</configuration>
- If the same property is defined here and in Management Console, the values defined here override the ones defined in Management Console.
- If the same property exists both in the grid and also in the imported property file, then the value imported from the file overwrites the value existing in the grid for the same property.
- You can import multiple property files one after the other, if required. The properties included in each imported file are added in the grid.
- Ensure the property file is present on the Spectrum™ Technology Platform server itself.
- The <description> tag is optional for each property key in a configuration property file.
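
The import behavior described in these notes can be sketched in Python with the standard `xml.etree.ElementTree` module. This is only an illustration under stated assumptions, not the product's implementation: the function names `read_properties` and `import_into_grid`, the dict used to stand in for the grid, and the sample property names are all hypothetical.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} from a <configuration> document.

    The optional <description> element is ignored.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def import_into_grid(grid, xml_text):
    """Merge imported properties into a copy of the grid.

    For a property present in both, the imported value overwrites the
    value already in the grid.
    """
    merged = dict(grid)
    merged.update(read_properties(xml_text))
    return merged


# Hypothetical grid contents and property file.
grid = {"key": "old_value", "other_key": "1"}
property_file = """
<configuration>
  <property>
    <name>key</name>
    <value>some_value</value>
    <description>A brief description of the purpose of the property key.</description>
  </property>
  <property>
    <name>second_key</name>
    <value>x</value>
  </property>
</configuration>
"""
merged = import_into_grid(grid, property_file)
```

Importing a second file would simply be another call to `import_into_grid` on the merged result, which mirrors how properties from successive imports accumulate in the grid.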
Runtime Tab
Field Name | Description |
---|---|
File name |
Displays the file name selected in the first tab. |
Starting record |
If you want to skip records at the beginning of the file when reading records into the dataflow, specify the first record you want to read. For example, if you want to skip the first 50 records in a file, specify 51. The 51st record will be the first record read into the dataflow. |
All records |
Select this option if you want to read all records starting from the record specified in the Starting record field to the end of the file. |
Max records |
Select this option if you want to only read in a certain number of records starting from the record specified in the Starting record field. For example, if you want to read the first 100 records, select this option and enter 100. |
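
The interaction between Starting record, All records, and Max records can be shown with a short Python sketch. It is an illustration only, assuming 1-based record numbering as in the example above; the function name `select_records` and its parameters are hypothetical and not part of the product.

```python
from itertools import islice


def select_records(records, starting_record=1, max_records=None):
    """Return the records to read.

    starting_record is 1-based: 51 skips the first 50 records.
    max_records=None corresponds to the All records option.
    """
    if starting_record < 1:
        raise ValueError("starting_record must be 1 or greater")
    start = starting_record - 1
    stop = None if max_records is None else start + max_records
    return list(islice(records, start, stop))


# Hypothetical input: records numbered 1 through 200.
all_records = list(range(1, 201))
from_51 = select_records(all_records, starting_record=51)
first_100 = select_records(all_records, max_records=100)
```

Here `from_51` begins at the 51st record and runs to the end of the input, while `first_100` holds only the first 100 records.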