Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move ValueList to functions/lib/aggregates for Spark reuse #9213

Closed
wants to merge 3 commits into from

Conversation

liujiayi771
Copy link
Contributor

@liujiayi771 liujiayi771 commented Mar 22, 2024

The array_agg function in Spark and Presto has some inconsistent behavior
when handling null values. We need to re-implement Spark's array_agg in Velox.
To be able to reuse ValueList, it needs to be moved to the
functions/lib/aggregates directory first.

The copyValueListToArrayWriter method has also been moved to the
ValueList, making it reusable.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 22, 2024
Copy link

netlify bot commented Mar 22, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 123c4cf
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/66018d8a81a2fc0008774ddd

@liujiayi771 liujiayi771 force-pushed the array_agg branch 2 times, most recently from e3b60b2 to 4d961e6 Compare March 22, 2024 11:41
@liujiayi771
Copy link
Contributor Author

Hi @mbasmanova. Could you help to review?

@liujiayi771 liujiayi771 force-pushed the array_agg branch 2 times, most recently from 153f0db to 3f914e0 Compare March 22, 2024 15:14
@mbasmanova
Copy link
Contributor

The array_agg function in Spark and Presto has some inconsistent behavior when handling null values. We need to re-implement Spark's array_agg in Velox.

What is the difference in behavior between Presto and Spark?

@liujiayi771
Copy link
Contributor Author

What is the difference in behavior between Presto and Spark?

The semantics are generally consistent, but there are inconsistencies between Presto and Spark in the handling of null values. Spark always ignores null values in the input, whereas Presto preserves them. Moreover, Presto returns null when all inputs are null, while Spark returns an empty list or set.
I plan to rewrite Spark's array_agg based on the simple aggregate function interface.

@liujiayi771
Copy link
Contributor Author

Hi @mbasmanova, there is a draft PR #9231 for Spark's array_agg, you could get more details from it. I plan to split the move of ValueList and the implementation of Spark's array_agg into two PRs, although they could be combined into a single PR.

@mbasmanova
Copy link
Contributor

@liujiayi771 Thank you for clarifying. Presto's implementation depends on presto.array_agg.ignore_nulls config flag. Does Spark implementation match Presto's when presto.array_agg.ignore_nulls = true?

@liujiayi771
Copy link
Contributor Author

liujiayi771 commented Mar 25, 2024

Does Spark implementation match Presto's when presto.array_agg.ignore_nulls = true?

There is still some difference. When all inputs are null, the output of Spark is an empty array, while Presto returns null. So I need to use the Non-Default-Null Behavior simple function interface.

@mbasmanova
Copy link
Contributor

@liujiayi771

When all inputs are null, the output of Spark is an empty array, while Presto returns null.

Got it. Thank you for clarifying.

CI is red. Please, take a look.

@liujiayi771
Copy link
Contributor Author

@mbasmanova I will try to rebase the latest main branch to see if the CI is still red.

@liujiayi771
Copy link
Contributor Author

@mbasmanova I found that your PR also has these two failing tests. Perhaps these two tests themselves are problematic.

[  FAILED  ] MinMaxByNTest.sortedGlobal
[  FAILED  ] MinMaxByNTest.sortedGroupBy

@mbasmanova
Copy link
Contributor

@liujiayi771 It might be #9200

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kagamiori Wei, would you help review this PR?

@pedroerp pedroerp self-requested a review March 26, 2024 18:29
Copy link
Contributor

@pedroerp pedroerp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks @liujiayi771

@facebook-github-bot
Copy link
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@liujiayi771
Copy link
Contributor Author

liujiayi771 commented Mar 28, 2024

@Yuhta @mbasmanova Did you encounter any issues with the merge? Do I need to rebase with the latest code?

@pedroerp
Copy link
Contributor

@Yuhta @mbasmanova Did you encounter any issues with the merge? Do I need to rebase with the latest code?

There were a few smaller internal fixes required since the header got moved, but I'm merging it today.

@facebook-github-bot
Copy link
Contributor

@pedroerp merged this pull request in 397bc05.

Copy link

Conbench analyzed the 1 benchmark run on commit 397bc050.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…ncubator#9213)

Summary:
The array_agg function in Spark and Presto has some inconsistent behavior
when handling null values. We need to re-implement Spark's array_agg in Velox.
To be able to reuse `ValueList`, it needs to be moved to the
`functions/lib/aggregates` directory first.

The `copyValueListToArrayWriter` method has also been moved to the
`ValueList`, making it reusable.

Pull Request resolved: facebookincubator#9213

Reviewed By: Yuhta

Differential Revision: D55406723

Pulled By: pedroerp

fbshipit-source-id: 7447cfd12f18d9ca4c474579da6a55a876f97ec1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants