xpark.dataset.aggregate.AggregateFnV2#
- class xpark.dataset.aggregate.AggregateFnV2(name: str, zero_factory: Callable[[], AccumulatorType], *, on: str | None, ignore_nulls: bool)[source]#
Provides an interface to implement efficient aggregations to be applied to the dataset.
AggregateFnV2 instances are passed to a Dataset’s
.aggregate(...)method to perform distributed aggregations. To create a custom aggregation, you should subclass AggregateFnV2 and implement the aggregate_block and combine methods. The finalize method can also be overridden if the final accumulated state needs further transformation.Aggregation follows these steps:
Initialization: For each group (if grouping) or for the entire dataset, an initial accumulator is created using zero_factory.
Block Aggregation: The aggregate_block method is applied to each block independently, producing a partial aggregation result for that block.
Combination: The combine method is used to merge these partial results (or an existing accumulated result with a new partial result) into a single, combined accumulator.
Finalization: Optionally, the finalize method transforms the final combined accumulator into the desired output format.
- Parameters:
name – The name of the aggregation. This will be used as the column name in the output, e.g., “sum(my_col)”.
zero_factory – A callable that returns the initial “zero” value for the accumulator. For example, for a sum, this would be lambda: 0; for finding a minimum, lambda: float(“inf”), for finding a maximum, lambda: float(“-inf”).
on – The name of the column to perform the aggregation on. If None, the aggregation is performed over the entire row (e.g., for Count()).
ignore_nulls – Whether to ignore null values during aggregation. If True, nulls are skipped. If False, the presence of a null value might result in a null output, depending on the aggregation logic.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
aggregate_block(block)Aggregates data within a single block.
combine(current_accumulator, new)Combines a new partial aggregation result with the current accumulator.
finalize(accumulator)Transforms the final accumulated state into the desired output.
get_agg_name()Return the agg name (e.g., 'sum', 'mean', 'count').
get_target_column()