-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add map_from_entries Presto function #3417
Add map_from_entries Presto function #3417
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@darrenfu CI is failing. Would you take a look? |
The CI error is a discrepancy between the two eval and evalSimplify paths: @pedroerp - since you are the author of #156, any advice to satisfy the fuzzer on both paths (while still throw a VeloxUserException for any nulls in the input array)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@darrenfu I think this function needs to be implemented as a vector function. See comment below for details. Also, to test this new function more thoroughly, please, run the fuzzer. First, test function in isolation, then in combination with other functions.
velox_expression_fuzzer_test --logtostderr=1 --velox_fuzzer_enable_complex_types --only map_from_entries --duration_sec 60
velox_expression_fuzzer_test --logtostderr=1 --velox_fuzzer_enable_complex_types --enable_variadic_signatures --duration_sec 600
@darrenfu CI is red. Would you take a look? Make sure to rebase onto latest main to pick up all the recent fixes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@darrenfu Some comments about the implementation. I expect the fuzzer test to fail for this function because it doesn't handle encodings properly. Curious to know if it does.
For the CI failure, it relates to a fuzzer signature empty check. I will check with Wei. I will also add more unit tests to cover more encodings and unknown (empty array) input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@darrenfu Darren, thank you for iterating on this PR. Working with complex types and encodings is indeed a bit tricky. Here are some suggestions.
localResult = | ||
BaseVector::wrapInConstant(rows.size(), flatIndex, localResult); | ||
} else { | ||
exec::DecodedArgs decodedArgs(rows, args, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, the only possibility is that input is flat. Hence, no need to decode.
The easiest way to explore possible encodings is by using VectorFuzzer to generate random inputs.
Here is a rough sketch of what it may look like to handle all encodings.
I modified the Fuzzer to workaround #3533 and also modified DecodedVector to workaround a bug that @kagamiori Wei is looking into (#3424)
velox_expression_fuzzer_test --enable_variadic_signatures --lazy_vector_generation_ratio=0.2 --velox_fuzzer_enable_complex_types --logtostderr=1 --only map_from_entries --duration_sec 60
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!!! Your sketch is super helpful for me to decode such a nested complex vector, this comment really made my day!
FYI: I know the gist of different simple vector layouts when I was in Saber, but I am not super familiar with the surrounding helper / wrapper classes and the usage scenarios, like SelectivityVector, complex vector layouts (this is not contributed from Saber), DecodedVector (I wrote our own vector SerDe classes when I was in Saber). So excited to witness so many new velox features since then, and happy to refresh my arsenal.
Also thanks for your quick workaround for bug #3424 and #3533, I amended a comment - it actually affects a few map functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified the Fuzzer to workaround #3533 and also modified DecodedVector to workaround a bug that @kagamiori Wei is looking into (#3424)
Since Wei is working on the fuzzer bugs, I will leave the fuzzer code unchanged with CI failures. Once these two bugs are fixed, I will rebase from main and land this PR. Does it sound good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Wei is working on the fuzzer bugs, I will leave the fuzzer code unchanged with CI failures. Once these two bugs are fixed, I will rebase from main and land this PR. Does it sound good?
Indeed, this PR has a dependency on fixing Fuzzer and DecodedVector. Just to clarify, after rebasing the PR, you'll need to re-run Fuzzer to make sure there are not failures and the PR will need to be reviewed before it is ready for landing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova - regarding the fuzzer failure, I did some debugging and found that the exception is thrown at this line:
VectorPtr wrappedKeys;
VectorPtr wrappedValues;
if (decodedRow->isIdentityMapping()) {
wrappedKeys = rowVector->childAt(0);
wrappedValues = rowVector->childAt(1);
} else {
wrappedKeys = decodedRow->wrap(
rowVector->childAt(0), *inputRowVector, inputRowVector->size()); // <---- THROW EXCEPTION:
// "Dictionary index must be less than base vector's size. Index:..."
// mostly reproduced at the simplified path + the duplicate-key case
wrappedValues = decodedRow->wrap(
rowVector->childAt(1), *inputRowVector, inputRowVector->size());
}
Question:
Why do you call decodedRow->wrap()
to wrap the keys and values vectors for non-flat encodings in your sketch code? Maybe I'm wrong, but I feel we should compute the total size, new indices, and call BaseVector::wrapInDictionary()
to wrap the keys and values. Refer to map()'s implementation. What do you think?
I noticed that this edge case usually happens when keys is a nested dictionary encoding or even more complex encoding where unit test cases don't cover.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found the root cause, see #3687
e316a31
to
a584123
Compare
The current CI failure is suspicious to fuzzer's lack of support on nested complex types, aka. ARRAY(ROW<>). cc @bikramSingh91 for awareness. |
@darrenfu #3540 recently landed which now also picks return types for top level expressions from functions that have templated return types. therefore, it is now possible to pick zip with return type array(row(T,K,..)) and randomly bind the templated types to concrete types. |
Sorry I haven't landed #3540 yet. It was blocked by #3564. I expect to land #3540 soon. |
Ah, Interesting, let me dive into this tomorrow and try to figure out what is happening then. |
@darrenfu CircleCI is failing and the failure seems related to the new function. Would you take a look? |
I've detailed the root cause and fix of the issue here, please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@darrenfu Darren, thank you for iterating on the PR.
The function implementation looks good to me. Some comments about the tests.
@darrenfu Darren, CI is failing on this PR. Would you take a look? |
12c4767
to
f321a5e
Compare
f321a5e
to
2653f6e
Compare
cc @kgpai for the benchmark result upload failure (503 error). I hope it's a transient error. |
@darrenfu It was transient, though we are still working on that job and need to update the keys that would allow us to upload the results. |
Thanks for mentioning that test approach. Yes, I just did locally. Nothing interesting is thrown. |
f9a6a7d
to
fe1f06a
Compare
fe1f06a
to
015c199
Compare
015c199
to
dcad9a3
Compare
Including the following changes: 1. use LocalSelectivityVector 2. add array and row encoding support 3. update map.rst with lex order 4. fix CI fuzzer failure by checking element.has_value() 5. fix issue facebookincubator#3581 6. fix CI fuzzer failure by set rows.end() as MapVector's length with unit test covered
dcad9a3
to
60e3240
Compare
Addressed all the comments for the unit tests and fuzz test. I believe the PR is in a ready-to-merge state. Let me know if there is anything else to address. |
@pranjalssh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
const auto& flatArray = constantArray->valueVector(); | ||
const auto flatIndex = constantArray->index(); | ||
|
||
exec::LocalSelectivityVector singleRow(context, flatIndex + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/singleRow/singleRowSelector/
const ArrayVector* inputArray, | ||
const TypePtr& outputType, | ||
exec::EvalCtx& context) const { | ||
auto& inputRowVector = inputArray->elements(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/inputRowVector/inputValueVector/
const TypePtr& outputType, | ||
exec::EvalCtx& context) const { | ||
auto& inputRowVector = inputArray->elements(); | ||
exec::LocalDecodedVector decodedRow(context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/decodedRow/decodedValueVector/
auto& inputRowVector = inputArray->elements(); | ||
exec::LocalDecodedVector decodedRow(context); | ||
decodedRow.get()->decode(*inputRowVector); | ||
auto rowVector = decodedRow->base()->as<RowVector>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/rowVector/valueRowVector/
exec::LocalDecodedVector decodedRow(context); | ||
decodedRow.get()->decode(*inputRowVector); | ||
auto rowVector = decodedRow->base()->as<RowVector>(); | ||
auto rowKeyVector = rowVector->childAt(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/rowKeyVector/keyValueVector/
context.applyToSelectedNoThrow(rows, [&](vector_size_t row) { | ||
auto size = inputArray->sizeAt(row); | ||
auto offset = inputArray->offsetAt(row); | ||
for (auto i = 0; i < size; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simply check if size is zero or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do that by just checking size == 0? The purpose is to validate all map entries and map keys are not null here.
auto size = inputArray->sizeAt(row); | ||
auto offset = inputArray->offsetAt(row); | ||
for (auto i = 0; i < size; ++i) { | ||
bool isMapEntryNull = decodedRow->isNullAt(offset + i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const bool isMapEntryNull?
dittos?
for (auto i = 0; i < size; ++i) { | ||
bool isMapEntryNull = decodedRow->isNullAt(offset + i); | ||
VELOX_USER_CHECK(!isMapEntryNull, "map entry cannot be null"); | ||
bool isMapKeyNull = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: const bool isMapKeyNull =
auto size = inputArray->sizeAt(row); | ||
auto offset = inputArray->offsetAt(row); | ||
for (auto i = 0; i < size; ++i) { | ||
bool isMapEntryNull = decodedRow->isNullAt(offset + i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: const bool isMapEntryNull
wrappedKeys = rowVector->childAt(0); | ||
wrappedValues = rowVector->childAt(1); | ||
} else { | ||
wrappedKeys = decodedRow->wrap( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where we handle the duplicate keys? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is handled via calling checkDuplicateKeys()
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
This pull request has been reverted by 72ddf94. |
Summary: In previous PR: facebookincubator#3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures. This change is to amend those BUCK changes. Differential Revision: D46700832 fbshipit-source-id: 7eae818d3e320c2196968aa098e367ac633ca850
…or#5257) Summary: Pull Request resolved: facebookincubator#5257 In previous PR: facebookincubator#3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures. This change is to amend those BUCK changes. Differential Revision: D46700832 fbshipit-source-id: afa9c6740e58837e900f6bf65947027f5b173b86
…or#5257) Summary: Pull Request resolved: facebookincubator#5257 In previous PR: facebookincubator#3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures. This change is to amend those BUCK changes. Reviewed By: xiaoxmeng Differential Revision: D46700832 fbshipit-source-id: 7ac8699415d2bae2c8785d7ca96dfef0e1ba18a2
…or#5257) Summary: Pull Request resolved: facebookincubator#5257 In previous PR: facebookincubator#3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures. This change is to amend those BUCK changes. Reviewed By: xiaoxmeng Differential Revision: D46700832 fbshipit-source-id: d4c6fed12163032f2739f72f72cf4447d5917412
…or#5257) Summary: Pull Request resolved: facebookincubator#5257 In previous PR: facebookincubator#3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures. This change is to amend those BUCK changes. Reviewed By: xiaoxmeng Differential Revision: D46700832 fbshipit-source-id: 0a5ca6282152bd56e3111b87550dd6cf5ae9a980
Summary: Pull Request resolved: #5257 In previous PR: #3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures. This change is to amend those BUCK changes. Reviewed By: xiaoxmeng Differential Revision: D46700832 fbshipit-source-id: 0e0e6833a8bead9dc4fccb5e08564ee0b8f7dae1
…okincubator#3418 What changes were proposed in this pull request? (Please fill in changes proposed in this fix) Fixes: facebookincubator#3417 How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) unit tests (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Summary
map_from_entries
Presto function to Veloxmap_from_entries(array(row(K, V))) -> map(K, V)
For example: