Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support contains() on lists of primitives #7039

Merged
merged 34 commits into from
Jan 25, 2021

Conversation

mythrocks
Copy link
Contributor

@mythrocks mythrocks commented Dec 18, 2020

Closes #6944.

This commit adds a method (contains()) to check whether each row of a LIST column contains the scalar value specified as an argument. The operation returns a BOOL8 column (with as many rows as the input LIST), each row indicating true if the value is found, false if not.

Output column[i] is set to null if even one of the following holds true (in line with the semantics of array_contains() in SQL):

  1. The search key skey is null
  2. The list row lists[i] is null
  3. The list row lists[i] contains even one null, and lists[i] does not contain the search key.

This implementation currently supports the operation on lists of numerics or strings.

@mythrocks mythrocks requested review from a team as code owners December 18, 2020 19:27
@mythrocks mythrocks self-assigned this Dec 18, 2020
@mythrocks mythrocks added 4 - Needs Review Waiting for reviewer to review or respond libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Dec 18, 2020
Update CHANGELOG, + headers yaml
@mythrocks mythrocks requested a review from a team as a code owner December 18, 2020 19:36
@mythrocks mythrocks added the feature request New feature or request label Dec 18, 2020
@mythrocks mythrocks marked this pull request as draft December 18, 2020 20:08
cpp/src/lists/contains.cu Outdated Show resolved Hide resolved
cpp/src/lists/contains.cu Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Dec 18, 2020

Codecov Report

Merging #7039 (b3b5b6c) into branch-0.18 (8860baf) will increase coverage by 0.11%.
The diff coverage is 87.79%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #7039      +/-   ##
===============================================
+ Coverage        82.09%   82.20%   +0.11%     
===============================================
  Files               97       98       +1     
  Lines            16474    16692     +218     
===============================================
+ Hits             13524    13722     +198     
- Misses            2950     2970      +20     
Impacted Files Coverage Δ
python/cudf/cudf/_lib/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/core/column/lists.py 91.66% <ø> (-0.09%) ⬇️
python/cudf/cudf/utils/ioutils.py 78.71% <ø> (ø)
python/cudf/cudf/io/orc.py 86.89% <63.63%> (-1.51%) ⬇️
python/dask_cudf/dask_cudf/groupby.py 92.15% <69.56%> (-2.90%) ⬇️
python/cudf/cudf/core/dataframe.py 90.49% <72.41%> (-0.22%) ⬇️
python/cudf/cudf/core/series.py 91.10% <80.00%> (-0.06%) ⬇️
python/cudf/cudf/core/dtypes.py 89.94% <84.00%> (-0.45%) ⬇️
python/cudf/cudf/core/column/column.py 88.04% <85.71%> (-0.11%) ⬇️
python/cudf/cudf/io/csv.py 91.66% <86.66%> (-1.67%) ⬇️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf0c37a...b3b5b6c. Read the comment docs.

Review feedback:
  1. Optimized case where skey is null
  2. Rephrased nested ternary if.
Review:
  3. Construct mutable output view only if required.
@mythrocks
Copy link
Contributor Author

Full disclosure: I should probably also implement the column_view skey version of this API, as per @revans2's request in #6944 (comment).

I'll follow the advice here for that as well.

cpp/src/lists/contains.cu Outdated Show resolved Hide resolved
Fix null-mask for non-null list rows containing
null elements.
cpp/src/lists/contains.cu Outdated Show resolved Hide resolved
@mythrocks mythrocks marked this pull request as ready for review December 24, 2020 06:46
Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Seems odd to have an API that only works on 1 level lists. Just for the sake of completeness, I'd've tried to make this generic.
  2. The only requirement for supported types seems to be an equality operator. This can work for chrono types. Not sure why they're implicitly left out.

cpp/src/lists/contains.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

cpp/tests/lists/contains_tests.cpp Outdated Show resolved Hide resolved
cpp/src/lists/contains.cu Outdated Show resolved Hide resolved
cpp/tests/lists/contains_tests.cpp Outdated Show resolved Hide resolved
cpp/include/cudf/lists/contains.hpp Outdated Show resolved Hide resolved
@mythrocks
Copy link
Contributor Author

Hey, thanks for the review, @karthikeyann. I have rebased things.

Adding the Ready to Merge label now. Thanks, all.

@mythrocks mythrocks added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Jan 25, 2021
@mythrocks
Copy link
Contributor Author

(Argh. @karthikeyann, please ignore the request for re-review. That was by accident.)

@kkraus14
Copy link
Collaborator

@gpucibot merge

@kkraus14 kkraus14 removed the 4 - Needs Review Waiting for reviewer to review or respond label Jan 25, 2021
@rapids-bot rapids-bot bot merged commit b1e9e20 into rapidsai:branch-0.18 Jan 25, 2021
rapids-bot bot pushed a commit that referenced this pull request Jan 27, 2021
Adds JNI and Java side bindings for `list_contains` that is being added as part of #7039.

Authors:
  - Kuhu Shukla (@kuhushukla)

Approvers:
  - Robert (Bobby) Evans (@revans2)
  - MithunR (@mythrocks)

URL: #7125
mythrocks added a commit to mythrocks/cudf that referenced this pull request Oct 23, 2021
`lists::contains()` (introduced in rapidsai#7039) returns a `BOOL8` column,
indicating whether the specified search_key(s) exist at all in each
corresponding list row of an input LIST column. It does not return
the actual position.

This commit introduces `lists::index_of()`, to return the INT32
positions of the specified search_key(s) in a LIST column.

The search keys may be searched for using either `FIND_FIRST`
(which finds the position of the first occurrence), or `FIND_LAST`
(which finds the last occurrence). Both column_view and scalar
search keys are supported.

As with `lists::contains()`, nested types are not supported as
search keys is `lists::index_of()`.

If the search_key cannot be found, that output row is set to `-1`.
Additionally, the row `output[i]` is set to null if:
  1. The search_key(scalar) or search_keys[i](column_view) is null.
  2. The list row `lists[i]` is null
In all other cases, `output[i]` should contain a non-negative value.
mythrocks added a commit to mythrocks/cudf that referenced this pull request Nov 5, 2021
`lists::contains()` (introduced in rapidsai#7039) returns a `BOOL8` column,
indicating whether the specified search_key(s) exist at all in each
corresponding list row of an input LIST column. It does not return
the actual position.

This commit introduces `lists::index_of()`, to return the INT32
positions of the specified search_key(s) in a LIST column.

The search keys may be searched for using either `FIND_FIRST`
(which finds the position of the first occurrence), or `FIND_LAST`
(which finds the last occurrence). Both column_view and scalar
search keys are supported.

As with `lists::contains()`, nested types are not supported as
search keys is `lists::index_of()`.

If the search_key cannot be found, that output row is set to `-1`.
Additionally, the row `output[i]` is set to null if:
  1. The search_key(scalar) or search_keys[i](column_view) is null.
  2. The list row `lists[i]` is null
In all other cases, `output[i]` should contain a non-negative value.
mythrocks added a commit to mythrocks/cudf that referenced this pull request Nov 5, 2021
`lists::contains()` (introduced in rapidsai#7039) returns a `BOOL8` column,
indicating whether the specified search_key(s) exist at all in each
corresponding list row of an input LIST column. It does not return
the actual position.

This commit introduces `lists::index_of()`, to return the INT32
positions of the specified search_key(s) in a LIST column.

The search keys may be searched for using either `FIND_FIRST`
(which finds the position of the first occurrence), or `FIND_LAST`
(which finds the last occurrence). Both column_view and scalar
search keys are supported.

As with `lists::contains()`, nested types are not supported as
search keys is `lists::index_of()`.

If the search_key cannot be found, that output row is set to `-1`.
Additionally, the row `output[i]` is set to null if:
  1. The search_key(scalar) or search_keys[i](column_view) is null.
  2. The list row `lists[i]` is null
In all other cases, `output[i]` should contain a non-negative value.
mythrocks added a commit to mythrocks/cudf that referenced this pull request Nov 23, 2021
`lists::contains()` (introduced in rapidsai#7039) returns a `BOOL8` column,
indicating whether the specified search_key(s) exist at all in each
corresponding list row of an input LIST column. It does not return
the actual position.

This commit introduces `lists::index_of()`, to return the INT32
positions of the specified search_key(s) in a LIST column.

The search keys may be searched for using either `FIND_FIRST`
(which finds the position of the first occurrence), or `FIND_LAST`
(which finds the last occurrence). Both column_view and scalar
search keys are supported.

As with `lists::contains()`, nested types are not supported as
search keys is `lists::index_of()`.

If the search_key cannot be found, that output row is set to `-1`.
Additionally, the row `output[i]` is set to null if:
  1. The search_key(scalar) or search_keys[i](column_view) is null.
  2. The list row `lists[i]` is null
In all other cases, `output[i]` should contain a non-negative value.
rapids-bot bot pushed a commit that referenced this pull request Dec 20, 2021
Fixes #9164.

### Prelude
`lists::contains()` (introduced in #7039) returns a `BOOL8` column, indicating whether the specified search_key(s) exist at all in each corresponding list row of an input LIST column. It does not return the actual position.

### `index_of()`
This commit introduces `lists::index_of()`, to return the INT32 positions of the specified search_key(s) in a LIST column.

The search keys may be searched for using either `FIND_FIRST` (which finds the position of the first occurrence), or `FIND_LAST` (which finds the last occurrence). Both column_view and scalar search keys are supported.

As with `lists::contains()`, nested types are not supported as search keys in `lists::index_of()`.

If the search_key cannot be found, that output row is set to `-1`. Additionally, the row `output[i]` is set to null if:
  1. The `search_key`(scalar) or `search_keys[i]`(column_view) is null.
  2. The list row `lists[i]` is null

In all other cases, `output[i]` should contain a non-negative value.

### Semantic changes for `lists::contains()`
This commit also modifies the semantics of `lists::contains()`: it will now return nulls only for the following cases:
  1. The `search_key`(scalar) or `search_keys[i]`(column_view) is null.
  2. The list row `lists[i]` is null

In all other cases, a non-null bool is returned. Specifically `lists::contains()` no longer conforms to SQL semantics of returning `NULL` for list rows that don't contain the search key, while simultaneously containing nulls. In this case, `false` is returned.

### `lists::contains_null_elements()`
A new function has been introduced to check if each list row contains null elements. The semantics are similar to `lists::contains()`, in that the column returned is BOOL8 typed:
  1. If even 1 element in a list row is null, the returned row is `true`.
  2. If no element is null, the returned row is `false`.
  3. If the list row is null, the returned row is `null`.
  4. If the list row is empty, the returned row is `false`.

The current implementation is an inefficient placeholder, to be replaced once (#9588) is available. It is included here to reconstruct the SQL semantics dropped from `lists::contains()`.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Jason Lowe (https://github.com/jlowe)
  - Mark Harris (https://github.com/harrism)
  - Conor Hoekstra (https://github.com/codereport)

URL: #9510
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support list_contains() for looking up scalar values in list columns
8 participants