
Add support for collect_list Spark aggregate function #9231

Closed

Conversation

liujiayi771
Contributor

The semantics of Spark's collect_list and Presto's array_agg are
generally consistent, but there are inconsistencies in the handling of null
values. Spark always ignores null values in the input, whereas Presto has a
parameter that controls whether to retain them. Moreover, Presto returns null
when all inputs are null, while Spark returns an empty array.

Because of these differences, we need to re-implement the array_agg
function for Spark.
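
For illustration only, here is a minimal standalone C++ sketch of the intended null semantics (this is not the Velox implementation; the function name is hypothetical):

    #include <cstdint>
    #include <optional>
    #include <vector>

    // Spark-style collect_list semantics: null inputs are skipped, and an
    // all-null (or empty) input produces an empty array rather than null,
    // unlike Presto's array_agg.
    std::vector<int64_t> collectListSemantics(
        const std::vector<std::optional<int64_t>>& input) {
      std::vector<int64_t> result;
      for (const auto& value : input) {
        if (value.has_value()) {
          result.push_back(*value);
        }
      }
      return result; // Empty when every input value was null.
    }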


@liujiayi771 liujiayi771 force-pushed the spark_array_agg branch 6 times, most recently from 059061f to c428066 on March 29, 2024 10:38
@liujiayi771 liujiayi771 marked this pull request as ready for review March 29, 2024 11:47
@liujiayi771
Contributor Author

@mbasmanova Could you help review?

Contributor

@mbasmanova mbasmanova left a comment

@liujiayi771 Would you add documentation for this function?

Contributor

@mbasmanova mbasmanova left a comment

@liujiayi771 Looks good overall, modulo the comments on the tests.

{"c0", "array_sort(a0)"},
"SELECT c0, array_sort(array_agg(a)"
"filter (where a is not null)) FROM tmp GROUP BY c0");
testAggregationsWithCompanion(
Contributor

Is there any particular reason companion function testing is not included as part of testAggregations? The testAggregationsWithCompanion calls appear too verbose and repetitive.

CC: @kagamiori

Contributor

@liujiayi771 Would you take a look at this comment?

Contributor Author

@mbasmanova I think the likely reason is that some aggregate functions do not register companion functions due to certain restrictions, for example when isResultTypeResolvableGivenIntermediateType is false.

Contributor

What I don't understand is why we need to pass [](auto& /*builder*/) {} and {{BIGINT()}} to testAggregationsWithCompanion, and why we need to call both testAggregations and testAggregationsWithCompanion.

Why can't we just call

testAggregationsWithCompanion(
      batches,
      {"c0"},
      {"spark_collect_list(c1)"},
      {"c0", "array_sort(a0)"},
      "SELECT c0, array_sort(array_agg(c1)"
      "filter (where c1 is not null)) FROM tmp GROUP BY c0");

and have it test both regular functions as well as companion functions.

CC: @kagamiori

Contributor Author

Yes, this is better, and the config parameter is also not necessary. Right now, many tests are calling testAggregations followed by testAggregationsWithCompanion. We need to combine these two test functions.

Contributor

Let's do this refactoring in a follow-up.

{},
{"spark_collect_list(c0)"},
{"array_sort(a0)"},
"SELECT array_sort(array_agg(c0)"
Contributor

For simple cases like this, it might be better to provide expected results:

auto expected = makeRowVector({
    makeArrayVectorFromJson<int32_t>({"[1, 2, 4, 5]"}),
});

Contributor

I still think this would be more readable.

Contributor Author

@mbasmanova Will the result become unstable if we do not use array_sort?

Failed
Expected 1, got 1
1 extra rows, 1 missing rows
1 of extra rows:
	[4,1,5,2]

1 of missing rows:
	[1,2,4,5]

Contributor Author

Or can I assume that the output will remain stable as [4,1,5,2]?

Contributor

I think we can still use array_sort to ensure results are stable:

  testAggregations(
      {data},
      {},
      {"spark_collect_list(c0)"},
      {"array_sort(a0)"},
      {expected});
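
Putting the two suggestions together, the test could look roughly like this (a sketch only, assuming the AggregationTestBase fixture; the input data below is hypothetical):

  auto data = makeRowVector({
      makeNullableFlatVector<int32_t>({4, std::nullopt, 1, 5, std::nullopt, 2}),
  });

  auto expected = makeRowVector({
      makeArrayVectorFromJson<int32_t>({"[1, 2, 4, 5]"}),
  });

  testAggregations(
      {data},
      {},
      {"spark_collect_list(c0)"},
      {"array_sort(a0)"},
      {expected});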

@@ -88,6 +88,9 @@ int main(int argc, char** argv) {
// coefficient. Meanwhile, DuckDB employs the sample kurtosis calculation
// formula. The results from the two methods are completely different.
"kurtosis",
// When all data in a group are null, Spark returns an empty array while
// DuckDB returns null.
"collect_list",
Contributor Author

@mbasmanova I think this function should not be compared with DuckDB. If the fuzzer generates a group where all the data is null, DuckDB's result will be null, while Spark will return an empty array.

Contributor

I agree. We need to change Fuzzer to verify results against Spark, not DuckDB: #9270

@liujiayi771
Contributor Author

@mbasmanova Addressed the comments for the tests.

velox/docs/functions/spark/aggregate.rst (outdated review comment, resolved)
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <random>
Contributor

Are all these includes needed? It looks like some can be removed.

Contributor Author

I have cleaned up the #include and using namespace statements.

{"c0", "array_sort(a0)"},
"SELECT c0, array_sort(array_agg(a)"
"filter (where a is not null)) FROM tmp GROUP BY c0");
testAggregationsWithCompanion(
Contributor

@liujiayi771 Would you take a look at this comment?

{},
{"spark_collect_list(c0)"},
{"array_sort(a0)"},
"SELECT array_sort(array_agg(c0)"
Contributor

I still think this would be more readable.

Contributor

@mbasmanova mbasmanova left a comment

Thanks.

@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -34,6 +35,7 @@ target_link_libraries(
velox_functions_aggregates_test_lib
velox_functions_spark_aggregates
velox_hive_connector
velox_vector_fuzzer
Contributor Author

@liujiayi771 liujiayi771 Apr 2, 2024

@mbasmanova I noticed a leftover dependency here that should have been removed; I have removed it. Could you re-import?

@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mbasmanova merged this pull request in 1ba16a9.


Conbench analyzed the 1 benchmark run on commit 1ba16a96.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Apr 4, 2024
…tor#9231)

Summary:
The semantics of Spark's `collect_list` and Presto's `array_agg` are
generally consistent, but there are inconsistencies in the handling of null
values. Spark always ignores null values in the input, whereas Presto has a
parameter that controls whether to retain them. Moreover, Presto returns null
when all inputs are null, while Spark returns an empty array.

Because of these differences, we need to re-implement the `array_agg`
function for Spark.

Pull Request resolved: facebookincubator#9231

Reviewed By: xiaoxmeng

Differential Revision: D55639676

Pulled By: mbasmanova

fbshipit-source-id: 958471779a1fa66dba27569a6c12538ad5489f46
facebook-github-bot pushed a commit that referenced this pull request Apr 4, 2024
#9361)

Summary:
In #9231, `collect_list` was added to the disabled list of `duckQueryRunner`.
However, this is unnecessary: DuckDB does not have an aggregate function named
`collect_list`, so it would never be compared against DuckDB and the setting is
redundant.

In addition, the result verifier for `collect_list` was set to `nullptr`, so its
results are not verified. Instead, we can reuse the custom array verifier from
Presto's `array_agg` to verify its results.

Pull Request resolved: #9361

Reviewed By: xiaoxmeng

Differential Revision: D55744044

Pulled By: mbasmanova

fbshipit-source-id: a1a94c58b2a01463261775d8b6e08b65fd986d29