Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement GROUPING aggregate function (following Postgres behavior.) #12565

Closed
wants to merge 2 commits into from

Conversation

bgjackma
Copy link

@bgjackma bgjackma commented Sep 21, 2024

Which issue does this PR close?

Closes #5647.

Rationale for this change

Implements the GROUPING function as per Postgres.

https://www.postgresql.org/docs/15/functions-aggregate.html#FUNCTIONS-GROUPING-TABLE

This is in contrast to other implementations including Databricks and Oracle where GROUPING takes only one column and there is a GROUPING_ID function that yields a similar bitfield.

What changes are included in this PR?

Implement the aggregate function in the Physical Planning stage.

Are these changes tested?

A few unit tests and an integration test provided by @JasonLi-cn in a previous unfinished PR. May add more.

Are there any user-facing changes?

  • Slight change to definition of unimplemented function.
  • Implementation of that function.

@github-actions github-actions bot added physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) functions labels Sep 21, 2024
@bgjackma bgjackma marked this pull request as ready for review September 21, 2024 06:11
Comment on lines +93 to +101
// The PhysicalExprs of grouping_exprs must be Column PhysicalExpr. Because if
// the group by PhysicalExpr in SQL is non-Column PhysicalExpr, then there is
// a ProjectionExec before AggregateExec to convert the non-column PhysicalExpr
// to Column PhysicalExpr.
let column_index =
|expr: &Arc<dyn PhysicalExpr>| match expr.as_any().downcast_ref::<Column>() {
Some(column) => Ok(column.index()),
None => internal_err!("Grouping doesn't support expr: {}", expr),
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only true when one enabled the optimizer rule CommonSubexprEliminate . Does not seems like a acceptable to depend on optimizer rules for correctness/basic support.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we look for equal PhysicalExprs?

The Postgres docs imply they do ~text comparison but I'm not sure how accessible that info is at this layer.

Comment on lines +584 to +588
// GROUPING is a special fxn that exposes info about group organization
if let Some(grouping) = agg_expr.fun().inner().as_any().downcast_ref::<Grouping>() {
let args = agg_expr.all_expressions().args;
return grouping.create_grouping_accumulator(&args, &group_by.expr);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need special handling like this it seems to me that we should consider just making Grouping a build in.

Or we should probably make it more generic so it can be used to implement some other function. But since the input is is just the bitmaks and the output is the same. I wonder if there are any conceivable functions that could not just be implemented as a transformation on a builtin grouping function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of in a weird place, it's sort of not a real aggregation function but instead a way to leak metadata. That might be a reason to make it a built-in.

Do you have ideas about how and when to go about doing that?

There's another function called GROUP_ID (not to be confused with GROUPING_ID) which disambiguates duplicate rows, it might be relevant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of in a weird place, it's sort of not a real aggregation function but instead a way to leak metadata. That might be a reason to make it a built-in.

Do you have ideas about how and when to go about doing that?

One way might be that we expose the grouping_id column used in #12571 and implement the function as transformation on that. This should be possible as that column should "leak" the needed metadata. This is what was proposed in #5749

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed an initial implemenation of this here: #12704

I think someone with more experince with this project should decide what is the best way forward.

@alamb
Copy link
Contributor

alamb commented Sep 24, 2024

Looks like there is a minor clippy failure on this PR

@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate substrait catalog Related to the catalog crate common Related to common crate execution Related to the execution crate proto Related to proto crate labels Sep 25, 2024
@github-actions github-actions bot removed documentation Improvements or additions to documentation sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate substrait catalog Related to the catalog crate common Related to common crate execution Related to the execution crate proto Related to proto crate labels Sep 25, 2024
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bgjackma for your contribution. Great PR and the testing is awesome.
I would probably add user documentation, examples and benchmarks. But it can be also done as followup PRs

grouping_args: &[Arc<dyn PhysicalExpr>],
group_exprs: &[(Arc<dyn PhysicalExpr>, String)],
) -> Result<Box<dyn GroupsAccumulator>> {
if grouping_args.len() > 32 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets have it as a const

};
let group_by_columns: Result<Vec<_>> =
group_exprs.iter().map(|(e, _)| column_index(e)).collect();
let group_by_columns = group_by_columns?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be 1 liner?

struct GroupingAccumulator {
// Grouping ID value for each group
grouping_ids: Vec<u32>,
// Indices of GROUPING arguments as they appear in the GROUPING SET
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have more details or example on indices?

}

impl GroupingAccumulator {
fn mask_to_id(&self, mask: &[bool]) -> Result<u32> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add more description on this method, how it changes the mask

_opt_filter: Option<&BooleanArray>,
total_num_groups: usize,
) -> Result<()> {
assert_eq!(values.len(), 1, "single argument to merge_batch");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we always expect only 1 array ?

expr_indices: vec![5],
};
let res = grouping.mask_to_id(&[false]);
assert!(res.is_err())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may want to check the error message as well

@@ -1169,7 +1176,7 @@ pub(crate) fn evaluate_group_by(
.groups
.iter()
.map(|group| {
group
let v = group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets have more meaningful name?

) -> Result<Box<dyn GroupsAccumulator>> {
// GROUPING is a special fxn that exposes info about group organization
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// GROUPING is a special fxn that exposes info about group organization
// GROUPING is a special function that exposes info about group organization

?

@@ -870,6 +883,7 @@ impl GroupedHashAggregateStream {
| AggregateMode::SinglePartitioned => output.push(acc.evaluate(emit_to)?),
}
}
debug!("Output: {:?}", output);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
debug!("Output: {:?}", output);

})?;
let mut output = group_values
.first()
.map(|gs| gs.values.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets have better naming?

@bgjackma
Copy link
Author

bgjackma commented Oct 3, 2024

Thanks for the review, @comphead but I think #12704 is a better solution so I'm going to close in favor of that.

@bgjackma bgjackma closed this Oct 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Grouping functions with Group By CUBE/ROLLUP/GROUPING SETS
4 participants