Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fully support nested types in cudf::contains #10656

Merged
merged 394 commits into from
Aug 17, 2022

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Apr 14, 2022

This extends the cudf::contains API to support nested types (lists + structs) with arbitrarily nested levels. As such, cudf::contains will work with literally any type of input data.

In addition, this fixes null handling of cudf::contains with structs column + struct scalar input when the structs column contains null rows at the top level while the scalar key is valid but all nulls at children levels.

Closes: #8965
Depends on:

@ttnghia ttnghia added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Apr 14, 2022
@ttnghia ttnghia self-assigned this Apr 14, 2022
@codecov

This comment was marked as off-topic.

null_equality::EQUAL,
nan_equality::ALL_EQUAL,
stream,
mr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious why this detail function returns a device_uvector instead of a column?
Will the result ever be larger than what a column supports?

Copy link
Contributor Author

@ttnghia ttnghia Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That detail::contains(table_view, table_view) function was initially refactored out of semi-join (#11100). Now, it is also used in several other places but the result is only used for temporary computation and never be returned to the user. Thus, device_uvector is enough at this time.

It is still unclear whether we will expose this function to the public.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also say that (especially for internal APIs like this, but occasionally even otherwise) I prefer using a strongly-typed object like a device_uvector rather than the type-erased (and nullable) column. We have more information available and convey it. Even for public APIs I like uvectors when we don't need type erasure or nullability because the returned values are clearly defined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This discussion is also related to #11356.

cpp/src/search/contains_column.cu Outdated Show resolved Hide resolved
cpp/src/search/contains_column.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I have only very minor suggestions at this point. Thank you!

cpp/src/search/contains_column.cu Outdated Show resolved Hide resolved
cpp/tests/search/search_list_test.cpp Outdated Show resolved Hide resolved
cpp/tests/search/search_list_test.cpp Outdated Show resolved Hide resolved
cpp/include/cudf/detail/search.hpp Outdated Show resolved Hide resolved
ttnghia and others added 4 commits August 16, 2022 21:55
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ttnghia
Copy link
Contributor Author

ttnghia commented Aug 17, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 127d574 into rapidsai:branch-22.10 Aug 17, 2022
@ttnghia ttnghia deleted the support_structs_in_contains branch August 17, 2022 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add nested struct support for cudf::contains
9 participants