Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add shuffle Presto function #3404

Closed

Conversation

darrenfu
Copy link
Contributor

@darrenfu darrenfu commented Dec 2, 2022

shuffle(x) -> array

Generate a random permutation of the given array x.

For example:

SELECT shuffle(cast(array[1,2,2,3,4,null,5, null] as array(int))) -- [null, 1, 3, 4, 2, 5, 2, null] or any other permutation
SELECT shuffle(cast(array[null, null] as array(int))) -- [null, null]
SELECT shuffle(array['a', 'a', 'a']) -- ['a', 'a', 'a']
SELECT shuffle(cast(null as array(int))) -- null

NOTE: this is a non-deterministic function, hence, it may return different results for the same input even when input is a constant or dictionary encoded vector.

@netlify
Copy link

netlify bot commented Dec 2, 2022

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 5b75e78
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/63cee0bafed3b8000893dea9

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 2, 2022
@darrenfu darrenfu marked this pull request as draft December 2, 2022 19:12
@darrenfu darrenfu mentioned this pull request Dec 2, 2022
@darrenfu darrenfu force-pushed the array_shuffle_simpleudf branch 2 times, most recently from 0c576a9 to b6277e4 Compare December 2, 2022 20:01
@darrenfu darrenfu marked this pull request as ready for review December 2, 2022 20:02
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darrenfu Thank you for working on adding shuffle function. On top of regular usage, it could help enhance testing for array_sort function by allowing to generate random input then evaluate array_sort(x) and array_sort(shuffle(x)) and assert the results are the same.

CC: @kagamiori

velox/docs/functions/coverage.rst Outdated Show resolved Hide resolved
velox/functions/prestosql/ArrayFunctions.h Outdated Show resolved Hide resolved
velox/functions/prestosql/ArrayFunctions.h Outdated Show resolved Hide resolved
velox/functions/prestosql/ArrayFunctions.h Outdated Show resolved Hide resolved
velox/functions/prestosql/ArrayFunctions.h Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
@darrenfu darrenfu force-pushed the array_shuffle_simpleudf branch 2 times, most recently from 32780e0 to 3580343 Compare December 12, 2022 09:32
Copy link
Contributor Author

@darrenfu darrenfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrite unit tests to better test the randomness and correctness of shuffle.

velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/ArrayFunctions.h Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/ArrayFunctions.h Outdated Show resolved Hide resolved
velox/functions/prestosql/ArrayFunctions.h Outdated Show resolved Hide resolved
@mbasmanova
Copy link
Contributor

@darrenfu I'm concerned about not adding support for all types and not implementing this as zero-copy for strings and complex types. What's preventing us from implementing this properly from the start, perhaps, using Vector function?

@darrenfu
Copy link
Contributor Author

@darrenfu Thank you for working on adding shuffle function. On top of regular usage, it could help enhance testing for array_sort function by allowing to generate random input then evaluate array_sort(x) and array_sort(shuffle(x)) and assert the results are the same.

Good callout, @mbasmanova! Apart from the randomness test you mentioned, here I run shuffle(x) 100 times with a ranged-based input {0, 1, ..., 9}, and assert:

  1. randomness: if the output shuffle arrays will be different at least 70% times in total. (actually, in this case, it should be > 99%)
  2. correctness: if it's a permutation of the origin array. I'm using std::is_permutation() whose time complexity is O(N^2), a bit slower. Since it's unit test, I think it'll be fine.

This borrows some ideas from Presto java version: https://github.com/prestodb/presto/blob/master/presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java#L4322. Let me know if the new tests suffices.

@darrenfu
Copy link
Contributor Author

@darrenfu I'm concerned about not adding support for all types and not implementing this as zero-copy for strings and complex types. What's preventing us from implementing this properly from the start, perhaps, using Vector function?

I chatted with @laithsakka, looks like generic types are not supported yet in simple function api, I will re-implement it via vector function api. Thanks!

@darrenfu darrenfu marked this pull request as draft December 16, 2022 07:34
@darrenfu darrenfu marked this pull request as ready for review December 22, 2022 03:13
@darrenfu darrenfu force-pushed the array_shuffle_simpleudf branch 2 times, most recently from 2166cc7 to b25d597 Compare December 22, 2022 03:17
@darrenfu darrenfu marked this pull request as draft December 22, 2022 05:24
@darrenfu darrenfu force-pushed the array_shuffle_simpleudf branch 2 times, most recently from 67da804 to 0ec77f4 Compare January 11, 2023 17:47
@darrenfu
Copy link
Contributor Author

darrenfu commented Jan 11, 2023

@darrenfu Darren, thank you for iterating on this PR. shuffle turns out to be a non-trivial function to implement. The code looks good to me, but the test seems overly complicated. I didn't have time to read it closely, but given that the function implementation doesn't depend on the type of the array elements, it seems redundant to test different element types.

@mbasmanova - I've removed all type related unit test cases. Let me know if you have any other feedback about the tests.

@mbasmanova
Copy link
Contributor

@darrenfu There are CI failures. Would you take a look?

@darrenfu
Copy link
Contributor Author

darrenfu commented Jan 11, 2023

@darrenfu There are CI failures. Would you take a look?

I rebased from main branch to include #3633 to skip non-deterministic udf verification on the fuzzer side. And the ci failure has gone away.

@mbasmanova would you give a final stamp on this PR?

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darrenfu The implementation of the shuffle function looks great. I have some comments about the test.

velox/functions/prestosql/ArrayShuffle.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/ArrayShuffle.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/ArrayShuffle.cpp Show resolved Hide resolved
namespace {
DecodedVector* decode(DecodedVector& decoder, const BaseVector& vector) {
SelectivityVector rows(vector.size());
decoder.decode(vector, rows);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use decoder.decode(vector) to decode all rows. If you do that, then you won't need this helper function as it becomes a one-liner.

velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
DecodedVector decoded;
exec::VectorReader<Any> reader(decode(decoded, *arrayVectorResult));
auto arrayReader = reader[0].castTo<Array<int32_t>>();
std::vector<int32_t> shuffledResult(kNums);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason not to use VectorReader::materialize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising it.
One gotcha: since my input and output arrays are both null-free, I found instead of calling reader[0], I need to call reader.readNullFree(0) to force it return a std::vector<int32_t>.

velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
std::nullopt,
std::nullopt,
}});
std::vector<std::vector<std::optional<int64_t>>> expected{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we use auto expected = actual"?

Also, can we modify testIsPermutation to testShuffle(input) and have it take a single ArrayVector as input, evaluate shuffle on it and verify that result is a permutation of input?

velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
@darrenfu
Copy link
Contributor Author

darrenfu commented Jan 19, 2023

@mbasmanova thanks for your scrutiny on the unit test class. I addressed all your comments, and besides the shuffle correctness (is-a-permutation), I also added three more test cases to cover the shuffle randomness with different encoding inputs. Please see the details in: testShuffleRandomness()

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darrenfu Some additional comments on the test. I'm seeing a number of method with more parameters then needed / used.

constexpr static const int32_t kNums = 10;
constexpr static const double kUniquenessRate = .7;

template <typename T, bool returnsOptionalValues = false>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the calls to testShuffle specify returnsOptionalValues as true. Hence, this template parameter is not needed. Let's remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.
FYI: my original idea is to have only one testShuffle() function instead of two functions: testShuffle() + testShuffleRandomness(). Somehow, I need to deal with build errors caused by 1) having std::optional and nested std::vector in a folly::F14FastSet and 2) calculating the right sizes & thresholds for different shape of input array cases. Then, I ended up choosing to separate into two functions to keep brevity.

velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved

for (auto i = 0; i < input->size(); i++) {
// Must materialize into std::vector, otherwise it will throw error
// because ArrayView doesn't support std::is_permutation() yet.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laithsakka Would it make sense to add support for std::is_permutation() and similar to Array reader?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I updated the origin enhancement task with this requirement:
#3426

velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
}

TEST_F(ArrayShuffleTest, dictEncoding) {
// Considering the case of repeated elements: {1,2,3} x 3, {4,5} x 2.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This use case doesn't appear to be tested. Would you double check?

Copy link
Contributor Author

@darrenfu darrenfu Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean "doesn't appear to be tested"?
Right now, I'm using std::is_permutation() to test if the shuffled output is a permutation of the origin input in test_shuffle() for the correctness purposes.
Regarding the randomness test, you will need to look at the test case: dictEncodingRandomness. It will generate a dictionary with dictionaryValues (size=200) and indices (size=100). The dictionaryValues' elements will all be identical: {0,1,2,...}. And it will test if the shuffled output contains at least X (= n * 70%) distinct permutations (e.g. {1,2,3},{1,2,3},... => {1,3,2}, {3,2,1}).
Putting in another word, if we shuffle only once for those repeated dict elements, it will yield one permutation (e.g. {1,2,3} => {1,3,2}), which will be captured by this assertion above. Does this new test case cover your question?

** UPDATE **
Will add duplicate elements in the dict indices to cover that.

velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/ArrayShuffleTest.cpp Outdated Show resolved Hide resolved
@darrenfu
Copy link
Contributor Author

darrenfu commented Jan 20, 2023

With the latest test cases: constantEncoding and constantEncodingRandomness, if revert to special handling on constant encoding (cc0cbd7), it will report the following error. It means these two test cases are necessary, and we shouldn't do any special handling on constant encoding as in other array UDFs.

unknown file: Failure
C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Expression evaluation result is not of expected type: shuffle(C0) -> CONSTANT vector of type ARRAY<BIGINT>
Retriable: False
Expression: castedResult
Function: castEvaluateResult
File: /Users/dofu/git/velox/./velox/functions/prestosql/tests/utils/FunctionBaseTest.h
Line: 274

Thanks @mbasmanova to call it out.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darrenfu Thank you, Darren. Consider updating PR description to clarify that shuffle is a non-deterministic function, hence, it may return different results for the same input even when input is a constant or dictionary encoded vector.

@facebook-github-bot
Copy link
Contributor

@darrenfu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

darrenfu added a commit to darrenfu/velox that referenced this pull request Jan 23, 2023
Summary:
`shuffle(x) -> array`

> Generate a random permutation of the given array x.

For example:
```
SELECT shuffle(cast(array[1,2,2,3,4,null,5, null] as array(int))) -- [null, 1, 3, 4, 2, 5, 2, null] or any other permutation
SELECT shuffle(cast(array[null, null] as array(int))) -- [null, null]
SELECT shuffle(array['a', 'a', 'a']) -- ['a', 'a', 'a']
SELECT shuffle(cast(null as array(int))) -- null
```

NOTE: this is a non-deterministic function, hence, it may return different results for the same input even when input is a constant or dictionary encoded vector.

Pull Request resolved: facebookincubator#3404

Differential Revision: D42646294

Pulled By: darrenfu

fbshipit-source-id: ca4052758ae485c5a4522298d2a35d445f2857d7
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D42646294

Summary:
`shuffle(x) -> array`

> Generate a random permutation of the given array x.

For example:
```
SELECT shuffle(cast(array[1,2,2,3,4,null,5, null] as array(int))) -- [null, 1, 3, 4, 2, 5, 2, null] or any other permutation
SELECT shuffle(cast(array[null, null] as array(int))) -- [null, null]
SELECT shuffle(array['a', 'a', 'a']) -- ['a', 'a', 'a']
SELECT shuffle(cast(null as array(int))) -- null
```

NOTE: this is a non-deterministic function, hence, it may return different results for the same input even when input is a constant or dictionary encoded vector.

Pull Request resolved: facebookincubator#3404

Differential Revision: D42646294

Pulled By: darrenfu

fbshipit-source-id: 04b23d67e0d82d708f12470b552c4772b1ecfc04
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D42646294

@facebook-github-bot
Copy link
Contributor

@darrenfu merged this pull request in 64f090d.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants