-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add shuffle Presto function #3404
Conversation
✅ Deploy Preview for meta-velox canceled.
|
0c576a9
to
b6277e4
Compare
b6277e4
to
95e3238
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@darrenfu Thank you for working on adding shuffle function. On top of regular usage, it could help enhance testing for array_sort function by allowing to generate random input then evaluate array_sort(x) and array_sort(shuffle(x)) and assert the results are the same.
CC: @kagamiori
velox/functions/prestosql/registration/ArrayFunctionsRegistration.cpp
Outdated
Show resolved
Hide resolved
32780e0
to
3580343
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrite unit tests to better test the randomness and correctness of shuffle.
@darrenfu I'm concerned about not adding support for all types and not implementing this as zero-copy for strings and complex types. What's preventing us from implementing this properly from the start, perhaps, using Vector function? |
3580343
to
2ecc0c4
Compare
Good callout, @mbasmanova! Apart from the randomness test you mentioned, here I run
This borrows some ideas from Presto java version: https://github.com/prestodb/presto/blob/master/presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java#L4322. Let me know if the new tests suffices. |
2ecc0c4
to
b685a69
Compare
7428b07
to
814013b
Compare
I chatted with @laithsakka, looks like generic types are not supported yet in simple function api, I will re-implement it via vector function api. Thanks! |
2166cc7
to
b25d597
Compare
b25d597
to
f689aa8
Compare
67da804
to
0ec77f4
Compare
@mbasmanova - I've removed all type related unit test cases. Let me know if you have any other feedback about the tests. |
@darrenfu There are CI failures. Would you take a look? |
I rebased from main branch to include #3633 to skip non-deterministic udf verification on the fuzzer side. And the ci failure has gone away. @mbasmanova would you give a final stamp on this PR? |
49afc0d
to
a8e779a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@darrenfu The implementation of the shuffle function looks great. I have some comments about the test.
namespace { | ||
DecodedVector* decode(DecodedVector& decoder, const BaseVector& vector) { | ||
SelectivityVector rows(vector.size()); | ||
decoder.decode(vector, rows); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use decoder.decode(vector) to decode all rows. If you do that, then you won't need this helper function as it becomes a one-liner.
DecodedVector decoded; | ||
exec::VectorReader<Any> reader(decode(decoded, *arrayVectorResult)); | ||
auto arrayReader = reader[0].castTo<Array<int32_t>>(); | ||
std::vector<int32_t> shuffledResult(kNums); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason not to use VectorReader::materialize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising it.
One gotcha: since my input and output arrays are both null-free, I found instead of calling reader[0]
, I need to call reader.readNullFree(0)
to force it return a std::vector<int32_t>
.
std::nullopt, | ||
std::nullopt, | ||
}}); | ||
std::vector<std::vector<std::optional<int64_t>>> expected{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we use auto expected = actual
"?
Also, can we modify testIsPermutation to testShuffle(input)
and have it take a single ArrayVector as input, evaluate shuffle on it and verify that result is a permutation of input?
2fb6732
to
8996be8
Compare
@mbasmanova thanks for your scrutiny on the unit test class. I addressed all your comments, and besides the shuffle correctness (is-a-permutation), I also added three more test cases to cover the shuffle randomness with different encoding inputs. Please see the details in: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@darrenfu Some additional comments on the test. I'm seeing a number of method with more parameters then needed / used.
constexpr static const int32_t kNums = 10; | ||
constexpr static const double kUniquenessRate = .7; | ||
|
||
template <typename T, bool returnsOptionalValues = false> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the calls to testShuffle specify returnsOptionalValues as true. Hence, this template parameter is not needed. Let's remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
FYI: my original idea is to have only one testShuffle() function instead of two functions: testShuffle() + testShuffleRandomness(). Somehow, I need to deal with build errors caused by 1) having std::optional and nested std::vector in a folly::F14FastSet and 2) calculating the right sizes & thresholds for different shape of input array cases. Then, I ended up choosing to separate into two functions to keep brevity.
|
||
for (auto i = 0; i < input->size(); i++) { | ||
// Must materialize into std::vector, otherwise it will throw error | ||
// because ArrayView doesn't support std::is_permutation() yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@laithsakka Would it make sense to add support for std::is_permutation() and similar to Array reader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I updated the origin enhancement task with this requirement:
#3426
} | ||
|
||
TEST_F(ArrayShuffleTest, dictEncoding) { | ||
// Considering the case of repeated elements: {1,2,3} x 3, {4,5} x 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This use case doesn't appear to be tested. Would you double check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean "doesn't appear to be tested"?
Right now, I'm using std::is_permutation() to test if the shuffled output is a permutation of the origin input in test_shuffle() for the correctness purposes.
Regarding the randomness test, you will need to look at the test case: dictEncodingRandomness
. It will generate a dictionary with dictionaryValues (size=200) and indices (size=100). The dictionaryValues' elements will all be identical: {0,1,2,...}. And it will test if the shuffled output contains at least X (= n * 70%) distinct permutations (e.g. {1,2,3},{1,2,3},... => {1,3,2}, {3,2,1}).
Putting in another word, if we shuffle only once for those repeated dict elements, it will yield one permutation (e.g. {1,2,3} => {1,3,2}), which will be captured by this assertion above. Does this new test case cover your question?
** UPDATE **
Will add duplicate elements in the dict indices to cover that.
178a69c
to
015f69c
Compare
015f69c
to
56997cb
Compare
With the latest test cases:
Thanks @mbasmanova to call it out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@darrenfu Thank you, Darren. Consider updating PR description to clarify that shuffle is a non-deterministic function, hence, it may return different results for the same input even when input is a constant or dictionary encoded vector.
@darrenfu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: `shuffle(x) -> array` > Generate a random permutation of the given array x. For example: ``` SELECT shuffle(cast(array[1,2,2,3,4,null,5, null] as array(int))) -- [null, 1, 3, 4, 2, 5, 2, null] or any other permutation SELECT shuffle(cast(array[null, null] as array(int))) -- [null, null] SELECT shuffle(array['a', 'a', 'a']) -- ['a', 'a', 'a'] SELECT shuffle(cast(null as array(int))) -- null ``` NOTE: this is a non-deterministic function, hence, it may return different results for the same input even when input is a constant or dictionary encoded vector. Pull Request resolved: facebookincubator#3404 Differential Revision: D42646294 Pulled By: darrenfu fbshipit-source-id: ca4052758ae485c5a4522298d2a35d445f2857d7
b1aa69e
to
3f7437c
Compare
This pull request was exported from Phabricator. Differential Revision: D42646294 |
Summary: `shuffle(x) -> array` > Generate a random permutation of the given array x. For example: ``` SELECT shuffle(cast(array[1,2,2,3,4,null,5, null] as array(int))) -- [null, 1, 3, 4, 2, 5, 2, null] or any other permutation SELECT shuffle(cast(array[null, null] as array(int))) -- [null, null] SELECT shuffle(array['a', 'a', 'a']) -- ['a', 'a', 'a'] SELECT shuffle(cast(null as array(int))) -- null ``` NOTE: this is a non-deterministic function, hence, it may return different results for the same input even when input is a constant or dictionary encoded vector. Pull Request resolved: facebookincubator#3404 Differential Revision: D42646294 Pulled By: darrenfu fbshipit-source-id: 04b23d67e0d82d708f12470b552c4772b1ecfc04
This pull request was exported from Phabricator. Differential Revision: D42646294 |
3f7437c
to
5b75e78
Compare
shuffle(x) -> array
For example:
NOTE: this is a non-deterministic function, hence, it may return different results for the same input even when input is a constant or dictionary encoded vector.