Group Statistics

The Group Statistics stage allows you to run statistical operations across multiple data rows broken down into groups that you want to analyze. If no groups are defined all rows will be treated as belonging to one group.

Groups are defined by one or more fields that have the same value across multiple data rows.

For example, the data in this table could be grouped by region, state, or both.

Region State
East MD
East MD
East CT
West CA
West CA

A group by Region would yield East and West. A group by State would yield California, Connecticut, and Maryland. A group by Region and State would yield East/Maryland, East/Connecticut, and West/California.

Input

The Group Statistics stage takes any field as input. Grouping can be performed on numeric or string data.

Options

Table 1. Operations Tab
Option Description

Input fields

Lists the fields in the dataflow which you can use to group records and perform calculations.

Row

Specifies the field or fields you want to use as categories for the calculations. For example, if you had data that included a Region field and you wanted to calculate total population by region, you would group by the Region field.

To add a field, select the field in the Input fields list then click >>.

Column

Optional. For creating a pivot table, specifies the field or fields whose values you want to pivot into columns for the purposes of cross tabulation.

To add a field, select the field in the Input fields list then click >>.

For example, if you had data that includs regions and shipping dates, and you want to tally the number of shipments per day for each state, you must specify the state field as a row and the shipment date field as a column.

Rows and Columns are pre-sorted in the configured order

Indicates that the input data is already sorted.

If this checkbox is checked, the stage does not sort the data and performs the specified operation directly on the input data.

Operation

Specifies the calculation to perform on each group. To add an operation, select the field in the Input fields list that you want to use for the operation then click >>.

For more information about the supported Group Statistics operations, see Operations.

Type

For the input and output fields, specifies the data type.

Integer
A numeric data type that contains both negative and positive whole numbers between -231 (-2,147,483,648) and 231-1 (2,147,483,647)
Long
A numeric data type that contains both negative and positive whole numbers between -263 (-9,223,372,036,854,775,808) and 263-1 (9,223,372,036,854,775,807)
Float
A numeric data type that contains both negative and positive single precision numbers between 2-149 (1.4E-45) and (2-223)×2127 (3.4028235E38)
Double
A numeric data type that contains both negative and positive double precision numbers between 2-1074 (4.9E-324) and (2-2-52)×21023 (1.7976931348623157E308)
Note: When using the integer and long types, data can be lost if the input number or calculated number from an operation contains decimal data.
Get count of records that are computed upon Returns the actual number of records in a group on which the selected operation is performed.

This column Computational Count excludes those input records where the column on which the operation is performed contains null values.

Fields Tab

The Fields tab is used when creating a pivot table. For more information, see Creating a Pivot Table.

Output Tab

Option Description

Return one row per group

For each group of rows, return a single row that contains the aggregated data for all rows in the group. Individual rows will be dropped. If this option is not selected, all rows will be returned. No data will be dropped.

This option is not available if you use the Percent Rank or ZScore operations.

Return a count of rows in each group

Returns the number of rows in each group. The default output field name that will contain the count is GroupCount.

Return a unique ID for each group

Returns a unique ID for each group of rows. The ID starts at 1 and increments by 1 for each additional group found. The default field name is GroupID.