Strategy One

Apply Aggregation and Filtering to Hadoop Data Imports

When importing data from a Hadoop Distributed File System (HDFS), you can apply aggregation functions, as well as filters, to the data during import. This allows you to control the amount of data brought into memory.

Apply Aggregation

The Aggregation option is available in the Preview dialog.

To apply an aggregation function to your data:

  1. Click Aggregation to open the Aggregation Dialog.
  2. Right-click on a field and select the desired function from the menu.
  3. Click Execute SQL to preview your data with the aggregation applied.

    By default, the aggregation or function executes against only the top 100,000 rows of the dataset. As a result, the precision of the Sum, Max, Min, Average, and Count functions is affected during preview (other functions are not affected). Aggregation results are recalculated against the entire dataset when the cube is published. To change the number of preview rows, set the hgos.aggregation.preview.rows property in /conf/hgos-spark.properties to a specific number of rows, or to -1 for an unlimited number of rows.

  4. Click OK to save the new schema definitions with the aggregation/function applied.

    If you attempt to wrangle data after aggregation or functions have been applied, the system discards your changes.
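The preview-row limit is why aggregate values shown in the Preview dialog can differ from the values computed at publish time. The following is a minimal sketch in plain Python (illustrative only; the sample values, the PREVIEW_ROWS constant, and the aggregate function are hypothetical and do not reflect hgos internals) showing how aggregating only the first N rows diverges from aggregating the full dataset:

```python
# Illustrative sketch: aggregating a preview window (the first N rows)
# versus the full dataset. PREVIEW_ROWS stands in for the
# hgos.aggregation.preview.rows property (default 100,000; -1 = all rows).
PREVIEW_ROWS = 5

dataset = [10, 20, 30, 40, 50, 600, 700, 800, 900, 1000]

def aggregate(rows):
    # The five functions whose preview precision is affected.
    return {
        "sum": sum(rows),
        "max": max(rows),
        "min": min(rows),
        "average": sum(rows) / len(rows),
        "count": len(rows),
    }

preview = aggregate(dataset[:PREVIEW_ROWS])  # what the Preview dialog computes
full = aggregate(dataset)                    # what publishing recomputes

print(preview["sum"], full["sum"])      # 150 4150 -- preview underestimates
print(preview["count"], full["count"])  # 5 10
```

Because publishing always recomputes against the full dataset, the preview discrepancy affects only what you see in the dialog, not the published cube.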

Filtering

Hadoop Gateway also supports filtering data imports. Filtering allows you to import only the data that matches a specified condition, instead of the entire dataset.

The Filter option is available in the context menu and at the top of the Aggregation Dialog box.
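Conceptually, a filter restricts the import to rows matching a condition before they are brought into memory, much like a SQL WHERE clause. A minimal sketch in plain Python (the field names, sample rows, and import_filtered helper are hypothetical, not part of Hadoop Gateway):

```python
# Illustrative sketch: importing only rows that match a condition,
# analogous to applying a filter during a Hadoop data import.
rows = [
    {"region": "east", "sales": 120},
    {"region": "west", "sales": 300},
    {"region": "east", "sales": 250},
]

def import_filtered(source_rows, predicate):
    """Keep only rows that satisfy the filter condition."""
    return [r for r in source_rows if predicate(r)]

# Hypothetical condition: import only rows for the "east" region.
imported = import_filtered(rows, lambda r: r["region"] == "east")
print(len(imported))  # 2 of 3 rows imported
```

The benefit mirrors the aggregation option above: less data crosses into memory, because the condition is applied at import time rather than after loading.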