Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add map_from_entries Presto function #3417

Closed

Conversation

darrenfu
Copy link
Contributor

@darrenfu darrenfu commented Dec 5, 2022

Summary

  • Adding the map_from_entries Presto function to Velox

map_from_entries(array(row(K, V))) -> map(K, V)

Returns a map created from the given array of entries.

For example:

SELECT map_from_entries(ARRAY[(1, 'x'), (2, 'y')]); -- {1 -> 'x', 2 -> 'y'}
SELECT map_from_entries(ARRAY[(1, 'x'), (2, null)]); -- {1 -> 'x', 2 -> null}
SELECT map_from_entries(ARRAY[(1, 'x'), (1, 'y')]); -- duplicate key error
SELECT map_from_entries(ARRAY[(null, 'x'), (1, 'y')]); -- map key cannot be null error
SELECT map_from_entries(ARRAY[cast(null as ROW(int, varchar)), (1, 'y')]); -- map entry cannot be null error

@netlify
Copy link

netlify bot commented Dec 5, 2022

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 60e3240
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6486c382a05c920008c5fa2b

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 5, 2022
@darrenfu darrenfu marked this pull request as draft December 6, 2022 08:53
@darrenfu darrenfu mentioned this pull request Dec 9, 2022
@darrenfu darrenfu marked this pull request as ready for review December 12, 2022 18:23
@mbasmanova
Copy link
Contributor

@darrenfu CI is failing. Would you take a look?

@darrenfu
Copy link
Contributor Author

@darrenfu CI is failing. Would you take a look?

The CI error is a discrepancy between the two eval and evalSimplify paths:
The latter will remove all the initial nulls (see #156), thus it will not throw an VeloxUserException if there is null in the input array. I believe this is unexpected because udf map_from_entries does not take a null map entry.

@pedroerp - since you are the author of #156, any advice to satisfy the fuzzer on both paths (while still throw a VeloxUserException for any nulls in the input array)?

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darrenfu I think this function needs to be implemented as a vector function. See comment below for details. Also, to test this new function more thoroughly, please, run the fuzzer. First, test function in isolation, then in combination with other functions.

velox_expression_fuzzer_test --logtostderr=1 --velox_fuzzer_enable_complex_types --only map_from_entries --duration_sec 60

velox_expression_fuzzer_test --logtostderr=1 --velox_fuzzer_enable_complex_types --enable_variadic_signatures --duration_sec 600

velox/docs/functions/map.rst Outdated Show resolved Hide resolved
@mbasmanova
Copy link
Contributor

mbasmanova commented Dec 15, 2022

@darrenfu CI is red. Would you take a look? Make sure to rebase onto latest main to pick up all the recent fixes.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darrenfu Some comments about the implementation. I expect the fuzzer test to fail for this function because it doesn't handle encodings properly. Curious to know if it does.

velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
@darrenfu
Copy link
Contributor Author

For the CI failure, it relates to a fuzzer signature empty check. I will check with Wei.

I will also add more unit tests to cover more encodings and unknown (empty array) input.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darrenfu Darren, thank you for iterating on this PR. Working with complex types and encodings is indeed a bit tricky. Here are some suggestions.

localResult =
BaseVector::wrapInConstant(rows.size(), flatIndex, localResult);
} else {
exec::DecodedArgs decodedArgs(rows, args, context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, the only possibility is that input is flat. Hence, no need to decode.

The easiest way to explore possible encodings is by using VectorFuzzer to generate random inputs.

Here is a rough sketch of what it may look like to handle all encodings.

https://github.com/facebookincubator/velox/compare/main...mbasmanova:velox-1:map_from_entries_simpleudf?expand=1

I modified the Fuzzer to workaround #3533 and also modified DecodedVector to workaround a bug that @kagamiori Wei is looking into (#3424)

velox_expression_fuzzer_test --enable_variadic_signatures --lazy_vector_generation_ratio=0.2 --velox_fuzzer_enable_complex_types --logtostderr=1 --only map_from_entries --duration_sec 60

Copy link
Contributor Author

@darrenfu darrenfu Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!!! Your sketch is super helpful for me to decode such a nested complex vector, this comment really made my day!

FYI: I know the gist of different simple vector layouts when I was in Saber, but I am not super familiar with the surrounding helper / wrapper classes and the usage scenarios, like SelectivityVector, complex vector layouts (this is not contributed from Saber), DecodedVector (I wrote our own vector SerDe classes when I was in Saber). So excited to witness so many new velox features since then, and happy to refresh my arsenal.

Also thanks for your quick workaround for bug #3424 and #3533, I amended a comment - it actually affects a few map functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified the Fuzzer to workaround #3533 and also modified DecodedVector to workaround a bug that @kagamiori Wei is looking into (#3424)

Since Wei is working on the fuzzer bugs, I will leave the fuzzer code unchanged with CI failures. Once these two bugs are fixed, I will rebase from main and land this PR. Does it sound good?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Wei is working on the fuzzer bugs, I will leave the fuzzer code unchanged with CI failures. Once these two bugs are fixed, I will rebase from main and land this PR. Does it sound good?

Indeed, this PR has a dependency on fixing Fuzzer and DecodedVector. Just to clarify, after rebasing the PR, you'll need to re-run Fuzzer to make sure there are not failures and the PR will need to be reviewed before it is ready for landing.

Copy link
Contributor Author

@darrenfu darrenfu Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova - regarding the fuzzer failure, I did some debugging and found that the exception is thrown at this line:

    VectorPtr wrappedKeys;
    VectorPtr wrappedValues;
    if (decodedRow->isIdentityMapping()) {
      wrappedKeys = rowVector->childAt(0);
      wrappedValues = rowVector->childAt(1);
    } else {
      wrappedKeys = decodedRow->wrap(
          rowVector->childAt(0), *inputRowVector, inputRowVector->size()); // <---- THROW EXCEPTION: 
          // "Dictionary index must be less than base vector's size. Index:..." 
          // mostly reproduced at the simplified path + the duplicate-key case
      wrappedValues = decodedRow->wrap(
          rowVector->childAt(1), *inputRowVector, inputRowVector->size());
    }

Question:
Why do you call decodedRow->wrap() to wrap the keys and values vectors for non-flat encodings in your sketch code? Maybe I'm wrong, but I feel we should compute the total size, new indices, and call BaseVector::wrapInDictionary() to wrap the keys and values. Refer to map()'s implementation. What do you think?

I noticed that this edge case usually happens when keys is a nested dictionary encoding or even more complex encoding where unit test cases don't cover.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found the root cause, see #3687

velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/MapFromEntries.cpp Outdated Show resolved Hide resolved
@darrenfu
Copy link
Contributor Author

The current CI failure is suspicious to fuzzer's lack of support on nested complex types, aka. ARRAY(ROW<>).
See more discussion: #3388 (comment)

cc @bikramSingh91 for awareness.

@bikramSingh91
Copy link
Contributor

The current CI failure is suspicious to fuzzer's lack of support on nested complex types, aka. ARRAY(ROW<>). See more discussion: #3388 (comment)

cc @bikramSingh91 for awareness.

@darrenfu #3540 recently landed which now also picks return types for top level expressions from functions that have templated return types. therefore, it is now possible to pick zip with return type array(row(T,K,..)) and randomly bind the templated types to concrete types.

@kagamiori
Copy link
Contributor

The current CI failure is suspicious to fuzzer's lack of support on nested complex types, aka. ARRAY(ROW<>). See more discussion: #3388 (comment)
cc @bikramSingh91 for awareness.

@darrenfu #3540 recently landed which now also picks return types for top level expressions from functions that have templated return types. therefore, it is now possible to pick zip with return type array(row(T,K,..)) and randomly bind the templated types to concrete types.

Sorry I haven't landed #3540 yet. It was blocked by #3564. I expect to land #3540 soon.

@bikramSingh91
Copy link
Contributor

The current CI failure is suspicious to fuzzer's lack of support on nested complex types, aka. ARRAY(ROW<>). See more discussion: #3388 (comment)
cc @bikramSingh91 for awareness.

@darrenfu #3540 recently landed which now also picks return types for top level expressions from functions that have templated return types. therefore, it is now possible to pick zip with return type array(row(T,K,..)) and randomly bind the templated types to concrete types.

Sorry I haven't landed #3540 yet. It was blocked by #3564. I expect to land #3540 soon.

Ah, Interesting, let me dive into this tomorrow and try to figure out what is happening then.

@mbasmanova
Copy link
Contributor

@darrenfu CircleCI is failing and the failure seems related to the new function. Would you take a look?

@bikramSingh91
Copy link
Contributor

@darrenfu CircleCI is failing and the failure seems related to the new function. Would you take a look?

I've detailed the root cause and fix of the issue here, please take a look.
#3581 (comment)

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darrenfu Darren, thank you for iterating on the PR.

The function implementation looks good to me. Some comments about the tests.

velox/functions/prestosql/tests/MapFromEntriesTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/MapFromEntriesTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/tests/MapFromEntriesTest.cpp Outdated Show resolved Hide resolved
@mbasmanova
Copy link
Contributor

@darrenfu Darren, CI is failing on this PR. Would you take a look?

@darrenfu
Copy link
Contributor Author

cc @kgpai for the benchmark result upload failure (503 error). I hope it's a transient error.

@kgpai
Copy link
Contributor

kgpai commented May 26, 2023

@darrenfu It was transient, though we are still working on that job and need to update the keys that would allow us to upload the results.

@darrenfu
Copy link
Contributor Author

Hi Darren, Thank you for adding this function! It looks good to me overall. I left a few comments. Please also run fuzzer for 30 minutes to ensure no failures.

Thanks for mentioning that test approach. Yes, I just did locally. Nothing interesting is thrown.

@darrenfu darrenfu force-pushed the map_from_entries_simpleudf branch from f9a6a7d to fe1f06a Compare June 2, 2023 06:15
@darrenfu darrenfu marked this pull request as draft June 2, 2023 06:15
@darrenfu darrenfu requested a review from kagamiori June 12, 2023 06:59
@darrenfu darrenfu marked this pull request as ready for review June 12, 2023 07:01
Including the following changes:
1. use LocalSelectivityVector
2. add array and row encoding support
3. update map.rst with lex order
4. fix CI fuzzer failure by checking element.has_value()
5. fix issue facebookincubator#3581
6. fix CI fuzzer failure by set rows.end() as MapVector's length with unit test covered
@darrenfu
Copy link
Contributor Author

@darrenfu Darren, thank you for iterating on the PR.

The function implementation looks good to me. Some comments about the tests.

Addressed all the comments for the unit tests and fuzz test. I believe the PR is in a ready-to-merge state. Let me know if there is anything else to address.

@facebook-github-bot
Copy link
Contributor

@pranjalssh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

const auto& flatArray = constantArray->valueVector();
const auto flatIndex = constantArray->index();

exec::LocalSelectivityVector singleRow(context, flatIndex + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/singleRow/singleRowSelector/

const ArrayVector* inputArray,
const TypePtr& outputType,
exec::EvalCtx& context) const {
auto& inputRowVector = inputArray->elements();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/inputRowVector/inputValueVector/

const TypePtr& outputType,
exec::EvalCtx& context) const {
auto& inputRowVector = inputArray->elements();
exec::LocalDecodedVector decodedRow(context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/decodedRow/decodedValueVector/

auto& inputRowVector = inputArray->elements();
exec::LocalDecodedVector decodedRow(context);
decodedRow.get()->decode(*inputRowVector);
auto rowVector = decodedRow->base()->as<RowVector>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/rowVector/valueRowVector/

exec::LocalDecodedVector decodedRow(context);
decodedRow.get()->decode(*inputRowVector);
auto rowVector = decodedRow->base()->as<RowVector>();
auto rowKeyVector = rowVector->childAt(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/rowKeyVector/keyValueVector/

context.applyToSelectedNoThrow(rows, [&](vector_size_t row) {
auto size = inputArray->sizeAt(row);
auto offset = inputArray->offsetAt(row);
for (auto i = 0; i < size; ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simply check if size is zero or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do that by just checking size == 0? The purpose is to validate all map entries and map keys are not null here.

auto size = inputArray->sizeAt(row);
auto offset = inputArray->offsetAt(row);
for (auto i = 0; i < size; ++i) {
bool isMapEntryNull = decodedRow->isNullAt(offset + i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const bool isMapEntryNull?
dittos?

for (auto i = 0; i < size; ++i) {
bool isMapEntryNull = decodedRow->isNullAt(offset + i);
VELOX_USER_CHECK(!isMapEntryNull, "map entry cannot be null");
bool isMapKeyNull =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const bool isMapKeyNull =

auto size = inputArray->sizeAt(row);
auto offset = inputArray->offsetAt(row);
for (auto i = 0; i < size; ++i) {
bool isMapEntryNull = decodedRow->isNullAt(offset + i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const bool isMapEntryNull

wrappedKeys = rowVector->childAt(0);
wrappedValues = rowVector->childAt(1);
} else {
wrappedKeys = decodedRow->wrap(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where we handle the duplicate keys? Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is handled via calling checkDuplicateKeys()

@facebook-github-bot
Copy link
Contributor

@darrenfu merged this pull request in 0e6eb8f.

@conbench-facebook
Copy link

Conbench analyzed the 1 benchmark run on commit 0e6eb8fe.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@facebook-github-bot
Copy link
Contributor

This pull request has been reverted by 72ddf94.

darrenfu added a commit to darrenfu/velox that referenced this pull request Jun 13, 2023
Summary:
In previous PR: facebookincubator#3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures.

This change is to amend those BUCK changes.

Differential Revision: D46700832

fbshipit-source-id: 7eae818d3e320c2196968aa098e367ac633ca850
darrenfu added a commit to darrenfu/velox that referenced this pull request Jun 14, 2023
…or#5257)

Summary:
Pull Request resolved: facebookincubator#5257

In previous PR: facebookincubator#3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures.

This change is to amend those BUCK changes.

Differential Revision: D46700832

fbshipit-source-id: afa9c6740e58837e900f6bf65947027f5b173b86
darrenfu added a commit to darrenfu/velox that referenced this pull request Jun 14, 2023
…or#5257)

Summary:
Pull Request resolved: facebookincubator#5257

In previous PR: facebookincubator#3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures.

This change is to amend those BUCK changes.

Reviewed By: xiaoxmeng

Differential Revision: D46700832

fbshipit-source-id: 7ac8699415d2bae2c8785d7ca96dfef0e1ba18a2
darrenfu added a commit to darrenfu/velox that referenced this pull request Jun 15, 2023
…or#5257)

Summary:
Pull Request resolved: facebookincubator#5257

In previous PR: facebookincubator#3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures.

This change is to amend those BUCK changes.

Reviewed By: xiaoxmeng

Differential Revision: D46700832

fbshipit-source-id: d4c6fed12163032f2739f72f72cf4447d5917412
darrenfu added a commit to darrenfu/velox that referenced this pull request Jun 15, 2023
…or#5257)

Summary:
Pull Request resolved: facebookincubator#5257

In previous PR: facebookincubator#3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures.

This change is to amend those BUCK changes.

Reviewed By: xiaoxmeng

Differential Revision: D46700832

fbshipit-source-id: 0a5ca6282152bd56e3111b87550dd6cf5ae9a980
facebook-github-bot pushed a commit that referenced this pull request Jun 16, 2023
Summary:
Pull Request resolved: #5257

In previous PR: #3417 (D46169043), the buck TARGETS file changes are overlooked. Thus it causes many internal build failures.

This change is to amend those BUCK changes.

Reviewed By: xiaoxmeng

Differential Revision: D46700832

fbshipit-source-id: 0e0e6833a8bead9dc4fccb5e08564ee0b8f7dae1
marin-ma pushed a commit to marin-ma/velox-oap that referenced this pull request Dec 15, 2023
…okincubator#3418

What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)

Fixes: facebookincubator#3417

How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

unit tests

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged Reverted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants