-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move ValueList to functions/lib/aggregates for Spark reuse #9213
Conversation
✅ Deploy Preview for meta-velox canceled.
|
e3b60b2
to
4d961e6
Compare
Hi @mbasmanova. Could you help to review? |
153f0db
to
3f914e0
Compare
What is the difference in behavior between Presto and Spark? |
The semantics are generally consistent, but there are inconsistencies between Presto and Spark in the handling of null values. Spark always ignores null values in the input, whereas Presto preserves them. Moreover, Presto returns null when all inputs are null, while Spark returns an empty list or set. |
Hi @mbasmanova, there is a draft PR #9231 for Spark's |
@liujiayi771 Thank you for clarifying. Presto's implementation depends on presto.array_agg.ignore_nulls config flag. Does Spark implementation match Presto's when presto.array_agg.ignore_nulls = true? |
There is still some difference. When all inputs are null, the output of Spark is an empty array, while Presto returns null. So I need to use the Non-Default-Null Behavior simple function interface. |
Got it. Thank you for clarifying. CI is red. Please, take a look. |
@mbasmanova I will try to rebase the latest main branch to see if the CI is still red. |
@mbasmanova I found that your PR also has these two failing tests. Perhaps these two tests themselves are problematic.
|
@liujiayi771 It might be #9200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kagamiori Wei, would you help review this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks @liujiayi771
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@Yuhta @mbasmanova Did you encounter any issues with the merge? Do I need to rebase with the latest code? |
There were a few smaller internal fixes required since the header got moved, but I'm merging it today. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…ncubator#9213) Summary: The array_agg function in Spark and Presto has some inconsistent behavior when handling null values. We need to re-implement Spark's array_agg in Velox. To be able to reuse `ValueList`, it needs to be moved to the `functions/lib/aggregates` directory first. The `copyValueListToArrayWriter` method has also been moved to the `ValueList`, making it reusable. Pull Request resolved: facebookincubator#9213 Reviewed By: Yuhta Differential Revision: D55406723 Pulled By: pedroerp fbshipit-source-id: 7447cfd12f18d9ca4c474579da6a55a876f97ec1
The array_agg function in Spark and Presto has some inconsistent behavior
when handling null values. We need to re-implement Spark's array_agg in Velox.
To be able to reuse
ValueList
, it needs to be moved to thefunctions/lib/aggregates
directory first.The
copyValueListToArrayWriter
method has also been moved to theValueList
, making it reusable.