Spark Sorter

The Spark Sorter activity sorts very large volumes of records. It uses Apache Spark libraries and runs on your Spectrum™ Technology Platform server.

Currently, the activity reads input records from delimited files located on the Spectrum™ Technology Platform server.

Note: Files present on remote servers are not supported.
Field Description
Server name

Indicates the location of the file you select as input.

Because the Spark Sorter activity accepts only files located on the Spectrum™ Technology Platform server, this field displays Spectrum™ Technology Platform.

File name

Specifies the path to the file. Click the ellipsis button (...) to browse to the file you want.

To read data from multiple files in a directory, use a wildcard character in the file name. The wildcard characters * and ? are supported. For example, you could specify *.csv to read all files with a .csv extension located in the directory. To read multiple files successfully, each file must have the same layout (the same fields in the same positions). Any record that does not match the layout specified on the Fields tab is treated as a malformed record.

Attention: If the Spectrum™ Technology Platform server is running on Unix or Linux, remember that file names and paths on these platforms are case sensitive.
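
The wildcard and case-sensitivity rules above can be illustrated outside the product. The following Python sketch uses the standard fnmatch module, whose * and ? semantics are assumed here to resemble those of the File name field. The file names and the match_files helper are hypothetical.

```python
import fnmatch

# Hypothetical directory listing; these names are illustrative only.
file_names = ["sales_1.csv", "sales_2.csv", "sales_10.csv", "notes.txt", "Sales_3.CSV"]


def match_files(names, pattern):
    """Return the names that match the pattern, comparing case-sensitively."""
    # fnmatchcase never normalizes case, similar to Unix/Linux path handling.
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


# "*" matches any run of characters, so every lowercase .csv file is selected.
all_csv = match_files(file_names, "*.csv")

# "?" matches exactly one character, so sales_10.csv is not selected.
single_digit = match_files(file_names, "sales_?.csv")

print(all_csv)
print(single_digit)
```

In this sketch, Sales_3.CSV is not matched by *.csv because the comparison is case sensitive.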
Record Type

The format of the records in the file. Currently, delimited file formats are accepted as input.
Delimited
A text file in which records are separated by an end-of-line (EOL) character such as a carriage return or line feed (CR or LF) and each field is separated by a designated character such as a comma.
Character Encoding

The character encoding of the input file.

UTF-8 encoding is supported. For more information about UTF, see unicode.org/faq/utf_bom.html.

Field separator

Specifies the character used to separate fields in a delimited file.

For example, this record uses a pipe (|) as a field separator:

7200 13TH ST|MIAMI|FL|33144

The characters available to define as field separators are:

  • Space
  • Tab
  • Comma
  • Period
  • Semicolon
  • Pipe

If the file uses a different character as a field separator, click the ellipsis button to select another character as a delimiter.

Text qualifier

The character used to surround text values in a delimited file.

For example, this record uses double quotes (") as a text qualifier:

"7200 13TH ST"|"MIAMI"|"FL"|"33144"

The characters available to define as text qualifiers are:

  • Single quote (')
  • Double quote (")

If the file uses a different text qualifier, click the ellipsis button to select another character as a text qualifier.
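
To show how a field separator and a text qualifier work together, here is a minimal Python sketch that parses the sample record with the standard csv module. It only illustrates the delimited format and is not how the activity itself reads files. The second record is a hypothetical example added to show a separator character inside a qualified value.

```python
import csv
import io

# Sample record from the text: pipe separator, double-quote text qualifier.
sample_record = '"7200 13TH ST"|"MIAMI"|"FL"|"33144"\n'

# Hypothetical record whose first value contains the separator character.
embedded_separator = '"12 MAIN ST | APT 4"|"TROY"|"NY"|"12180"\n'


def parse_record(line, separator="|", qualifier='"'):
    """Split one delimited record into its field values."""
    reader = csv.reader(io.StringIO(line), delimiter=separator, quotechar=qualifier)
    return next(reader)


fields = parse_record(sample_record)
fields_with_pipe = parse_record(embedded_separator)

print(fields)
print(fields_with_pipe)
```

Because the first value of the second record is enclosed in the text qualifier, the pipe inside it is read as data and not as a field boundary.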

Record separator

Specifies the character used to separate records in a line sequential or delimited file. This field is not available if you check the Use default EOL check box.

The record separator settings available are:

Unix (U+000A)
A line feed character separates the records. This is the standard record separator for Unix systems.
Macintosh (U+000D)
A carriage return character separates the records. This is the standard record separator for Macintosh systems.
Windows (U+000D U+000A)
A carriage return followed by a line feed separates the records. This is the standard record separator for Windows systems.

If your file uses a different record separator, click the ellipsis button to select another character as a record separator.
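
The effect of the record separator setting can be sketched in Python. The mapping below copies the three settings listed above. The split_records helper is a hypothetical name used only for illustration.

```python
# The record separator settings named above, mapped to their characters.
RECORD_SEPARATORS = {
    "Unix": "\n",        # U+000A
    "Macintosh": "\r",   # U+000D
    "Windows": "\r\n",   # U+000D U+000A
}


def split_records(text, separator_name):
    """Split file content into records using the chosen separator setting."""
    records = text.split(RECORD_SEPARATORS[separator_name])
    # A trailing separator leaves one empty string at the end; drop it.
    if records and records[-1] == "":
        records.pop()
    return records


windows_text = "7200 13TH ST|MIAMI|FL|33144\r\nOne Global View|Troy|NY|12180\r\n"

correct = split_records(windows_text, "Windows")
mismatched = split_records(windows_text, "Unix")

print(correct)
print(mismatched)
```

With the mismatched setting, each record keeps a stray carriage return at the end of its last field, which is the kind of problem a wrong record separator tends to cause.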

Use default EOL

Specifies that the file's record separator is the default end of line (EOL) character used on the operating system on which the Spectrum™ Technology Platform server is running.

Do not select this option if the file uses an EOL character that is different from the default EOL character used on the server's operating system. For example, if the file uses a Windows EOL but the server is running on Linux, do not check this option. Instead, select the Windows option in the Record separator field.
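
Before selecting Use default EOL, it can help to check which EOL a file actually uses. The following Python sketch compares the EOL found in a text sample with the default of the operating system it runs on (os.linesep). The helper names detect_eol and can_use_default_eol are hypothetical, and the check assumes the file uses one EOL style throughout.

```python
import os


def detect_eol(sample):
    """Return the EOL sequence found in the sample text."""
    if "\r\n" in sample:
        return "\r\n"   # Windows
    if "\r" in sample:
        return "\r"     # Macintosh
    return "\n"         # Unix


def can_use_default_eol(sample, server_eol=os.linesep):
    """True when the file's EOL equals the server operating system default."""
    return detect_eol(sample) == server_eol


windows_sample = "a|b\r\nc|d\r\n"
unix_sample = "a|b\nc|d\n"

# Simulate a Linux server, whose default EOL is a line feed.
print(can_use_default_eol(windows_sample, server_eol="\n"))
print(can_use_default_eol(unix_sample, server_eol="\n"))
```

To give a meaningful answer, the check must run with the server's EOL, either by running it on the server or by passing server_eol explicitly as shown.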

First row is header record

Specifies whether the first record in a delimited file contains header information and not data.

For example, this file snippet shows a header row in the first record.

"AddressLine1"|"City"|"StateProvince"|"PostalCode"
"7200 13TH ST"|"MIAMI"|"FL"|"33144"
"One Global View"|"Troy"|"NY"|12180
Output

Specifies the path to the output file on the Spectrum™ Technology Platform server. Click the ellipsis button (...) to browse to the output directory and file name you want.

Attention: If the Spectrum™ Technology Platform server is running on Unix or Linux, remember that file names and paths on these platforms are case sensitive.
Overwrite

Indicates that the output file overwrites any existing file with the same name as specified in the Output field.
Concatenate

Indicates that all Spark part files must be concatenated into a single output file in the specified Output location.
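
Spark typically writes its output as several part files (for example, part-00000 and part-00001) in one directory. The following Python sketch shows the general idea of concatenating such files in name order. It is an assumption about the concept, not the product's implementation, and the concatenate_parts helper, file names, and contents are hypothetical.

```python
import tempfile
from pathlib import Path


def concatenate_parts(part_dir, output_file):
    """Append every part-* file, in name order, to a single output file."""
    with output_file.open("wb") as out:
        for part in sorted(part_dir.glob("part-*")):
            out.write(part.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    part_dir = Path(tmp) / "parts"
    part_dir.mkdir()
    # Two hypothetical part files, each already sorted within its own range.
    (part_dir / "part-00000").write_text("A|1\nB|2\n")
    (part_dir / "part-00001").write_text("C|3\n")

    merged = Path(tmp) / "sorted_output.txt"
    concatenate_parts(part_dir, merged)
    merged_text = merged.read_text()

print(merged_text)
```

Reading the parts in name order matters here because, in a sorted output, each part file holds a consecutive range of the sorted records.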
Preview

Once the input file is selected in the File name field, the Preview grid displays the first 100 records of the existing output file.

To correctly display all the separate column values, click Regenerate on the Fields tab.

Fields Tab

The Fields tab defines the names, types, and positions of fields in the file.

Sort Tab

The Sort tab defines fields by which to sort the input records before they are sent into the dataflow. For more information, see Sorting Records.
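
As a small-scale illustration of sorting delimited records by fields, the following Python sketch sorts a few records by StateProvince and then by City. The sort keys, the ascending order, and the third data record are assumptions chosen for the example. The options actually available on the Sort tab are described in Sorting Records.

```python
import csv
import io
from operator import itemgetter

# Header and first two records come from the header row example;
# the last record is hypothetical.
sample = (
    '"AddressLine1"|"City"|"StateProvince"|"PostalCode"\n'
    '"One Global View"|"Troy"|"NY"|12180\n'
    '"7200 13TH ST"|"MIAMI"|"FL"|"33144"\n'
    '"1 MAIN ST"|"ALBANY"|"NY"|"12207"\n'
)

# DictReader treats the first row as the header record.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="|", quotechar='"'))

# Sort by state first, then by city within each state (plain string comparison).
sorted_rows = sorted(rows, key=itemgetter("StateProvince", "City"))

cities_in_order = [row["City"] for row in sorted_rows]
print(cities_in_order)
```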

Configuration Tab

Use this tab to specify additional properties for running the job. You can define as many property-value pairs as required, adding them directly in the grid one at a time.

Alternatively, to import properties from a file, click Import, browse to the property file, and select it. The properties contained in the imported file are copied into the grid. The property file must be in XML format and must follow this syntax:
<configuration>
    <property>
        <name>key</name>
        <value>some_value</value>
        <description>A brief description of the 
            purpose of the property key.</description>
    </property>
</configuration>
Note:
  1. If the same property is defined here and in Management Console, the values defined here override the ones defined in Management Console.
  2. If the same property exists both in the grid and in the imported property file, the value imported from the file overwrites the value in the grid.
  3. You can import multiple property files one after the other, if required. The properties included in each imported file are added in the grid.
  4. Ensure the property file is present on the Spectrum™ Technology Platform server itself.
  5. The <description> tag is optional for each property key in a configuration property file.
  6. To run jobs that use reference data, the reference data must be placed locally on the data nodes. Use the property pb.bdq.reference.data.location for this purpose. The property is available only for jobs that use reference data, such as Advanced Transformer, Validate Address Global, and Validate Address.
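
The property file syntax and the import rule in note 2 can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration of the file format only. The property values, the example.key and other.key names, and the parse_properties helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical property file; the path value is a placeholder.
property_file = """<configuration>
    <property>
        <name>pb.bdq.reference.data.location</name>
        <value>/hypothetical/reference/data</value>
    </property>
    <property>
        <name>example.key</name>
        <value>from_file</value>
        <description>Hypothetical property for illustration.</description>
    </property>
</configuration>"""


def parse_properties(xml_text):
    """Read name/value pairs; the optional <description> tag is ignored."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            properties[name.strip()] = value.strip()
    return properties


# Values already entered in the grid before the import (hypothetical).
grid = {"example.key": "typed_in_grid", "other.key": "kept"}

# Imported values replace grid values that have the same property name.
grid.update(parse_properties(property_file))
print(grid)
```

After the import, example.key holds the value from the file, other.key is unchanged, and the property that existed only in the file is added.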

Runtime Tab

Field Name Description

File name

Displays the file name selected in the first tab.

Starting record

If you want to skip records at the beginning of the file when reading records into the dataflow, specify the first record you want to read. For example, if you want to skip the first 50 records in a file, specify 51. The 51st record will be the first record read into the dataflow.

All records

Select this option if you want to read all records starting from the record specified in the Starting record field to the end of the file.

Max records

Select this option if you want to read only a certain number of records, starting from the record specified in the Starting record field. For example, if you want to read the first 100 records, select this option and enter 100.
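
The interaction between Starting record, All records, and Max records can be sketched in Python with itertools.islice. The select_records helper and its parameter names are hypothetical. Passing max_records=None stands in for the All records option.

```python
from itertools import islice


def select_records(records, starting_record=1, max_records=None):
    """Return records from a 1-based starting record, optionally capped."""
    start = starting_record - 1  # convert the 1-based record number to an index
    stop = None if max_records is None else start + max_records
    return list(islice(records, start, stop))


# 200 hypothetical records named "record 1" through "record 200".
records = [f"record {n}" for n in range(1, 201)]

# Skip the first 50 records and read to the end of the file.
all_from_51 = select_records(records, starting_record=51)

# Read only the first 100 records.
first_100 = select_records(records, max_records=100)

print(all_from_51[0], len(all_from_51))
print(first_100[-1], len(first_100))
```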