Implement lists::index_of() for positions in list row
`lists::contains()` (introduced in rapidsai#7039) returns a `BOOL8` column,
indicating whether the specified search_key(s) exist at all in each
corresponding list row of an input LIST column. It does not return
the actual position.

This commit introduces `lists::index_of()`, to return the INT32
positions of the specified search_key(s) in a LIST column.

The search keys may be searched for using either `FIND_FIRST`
(which finds the position of the first occurrence), or `FIND_LAST`
(which finds the last occurrence). Both column_view and scalar
search keys are supported.

As with `lists::contains()`, nested types are not supported as
search keys in `lists::index_of()`.

If the search_key cannot be found, that output row is set to `-1`.
Additionally, the row `output[i]` is set to null if:
  1. The `search_key` (scalar) or `search_keys[i]` (column_view) is null.
  2. The list row `lists[i]` is null.
In all other cases (that is, when the key is found), `output[i]`
contains a non-negative position.
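
The rules above can be illustrated with a small host-side model. This is only a sketch of the described semantics for the scalar search key, not the cudf implementation: it uses `std::optional` to stand in for nullable rows, assumes INT32 elements, and every name in it (`find_option`, `list_row`, `index_of_model`, `run_index_of_examples`) is hypothetical. It assumes C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical host-side model of the semantics; none of these names exist in cudf.
enum class find_option { FIND_FIRST, FIND_LAST };

// std::nullopt models a null list row.
using list_row = std::optional<std::vector<int32_t>>;

// One output entry per list row: std::nullopt (null), -1 (key not found),
// or the 0-based position of the first/last match.
std::vector<std::optional<int32_t>> index_of_model(
  std::vector<list_row> const& lists,
  std::optional<int32_t> const& search_key,
  find_option option = find_option::FIND_FIRST)
{
  std::vector<std::optional<int32_t>> output;
  output.reserve(lists.size());
  for (auto const& row : lists) {
    // A null search key or a null list row yields a null output row.
    if (!search_key.has_value() || !row.has_value()) {
      output.push_back(std::nullopt);
      continue;
    }
    int32_t position = -1;  // stays -1 if the key never matches
    for (std::size_t i = 0; i < row->size(); ++i) {
      if ((*row)[i] == *search_key) {
        position = static_cast<int32_t>(i);
        // FIND_FIRST stops at the first match; FIND_LAST keeps scanning.
        if (option == find_option::FIND_FIRST) { break; }
      }
    }
    output.push_back(position);
  }
  return output;
}

// Usage example: four list rows searched for the key 2.
void run_index_of_examples()
{
  std::vector<list_row> lists{list_row{std::vector<int32_t>{1, 2, 3, 2}},
                              list_row{},  // null row
                              list_row{std::vector<int32_t>{}},
                              list_row{std::vector<int32_t>{4, 5}}};

  auto const first = index_of_model(lists, 2, find_option::FIND_FIRST);
  assert(first[0] == 1);           // first occurrence of 2
  assert(!first[1].has_value());   // null list row -> null output
  assert(first[2] == -1);          // empty list -> not found
  assert(first[3] == -1);          // key absent -> not found

  auto const last = index_of_model(lists, 2, find_option::FIND_LAST);
  assert(last[0] == 3);            // last occurrence of 2

  auto const null_key = index_of_model(lists, std::nullopt);
  for (auto const& entry : null_key) { assert(!entry.has_value()); }
}
```

The model scans each row sequentially for clarity; it says nothing about how the GPU implementation in this commit parallelizes the search.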
mythrocks committed Nov 23, 2021
1 parent 85df759 commit fba1a9f
Showing 3 changed files with 951 additions and 424 deletions.
80 changes: 80 additions & 0 deletions cpp/include/cudf/lists/contains.hpp
@@ -74,6 +74,86 @@ std::unique_ptr<column> contains(
cudf::column_view const& search_keys,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Option to choose whether `index_of()` returns the first or last match
* of a search key in a list row
*/
enum class duplicate_find_option : int32_t {
FIND_FIRST = 0, ///< Finds first instance of a search key in a list row.
FIND_LAST ///< Finds last instance of a search key in a list row.
};

/**
* @brief Create a column of `size_type` values indicating the position of a search key
* within each list row in the `lists` column
*
* The output column has as many elements as there are rows in the input `lists` column.
* Output `column[i]` contains a 0-based index indicating the position of the search key
* in each list, counting from the beginning of the list.
* Note:
* 1. If the `search_key` is null, all output rows are set to null.
* 2. If the row `lists[i]` is null, `output[i]` is also null.
* 3. If the row `lists[i]` does not contain the `search_key`, `output[i]` is set to `-1`.
* 4. In all other cases, `output[i]` is set to a non-negative `size_type` index.
*
* If the `find_option` is set to `FIND_FIRST`, the position of the first match for
* `search_key` is returned.
* If `find_option == FIND_LAST`, the position of the last match in the list row is
* returned.
*
* @param lists Lists column whose `n` rows are to be searched
* @param search_key The scalar key to be looked up in each list row
* @param find_option Whether to return the position of the first match (`FIND_FIRST`) or
* last (`FIND_LAST`)
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return std::unique_ptr<column> INT32 column of `n` rows with the location of the `search_key`
*
* @throw cudf::logic_error If `search_key` type does not match the element type in `lists`
* @throw cudf::logic_error If `search_key` is of a nested type, or `lists` contains nested
* elements (LIST, STRUCT)
*/
std::unique_ptr<column> index_of(
cudf::lists_column_view const& lists,
cudf::scalar const& search_key,
duplicate_find_option find_option = duplicate_find_option::FIND_FIRST,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Create a column of `size_type` values indicating the position of a search key
* row within the corresponding list row in the `lists` column
*
* The output column has as many elements as there are rows in the input `lists` column.
* Output `column[i]` contains a 0-based index indicating the position of each search key
* row in its corresponding list row, counting from the beginning of the list.
* Note:
* 1. If `search_keys[i]` is null, `output[i]` is also null.
* 2. If the row `lists[i]` is null, `output[i]` is also null.
* 3. If the row `lists[i]` does not contain `search_keys[i]`, `output[i]` is set to `-1`.
* 4. In all other cases, `output[i]` is set to a non-negative `size_type` index.
*
* If the `find_option` is set to `FIND_FIRST`, the position of the first match for
* `search_keys[i]` is returned.
* If `find_option == FIND_LAST`, the position of the last match in the list row is
* returned.
*
* @param lists Lists column whose `n` rows are to be searched
* @param search_keys A column of search keys to be looked up in each corresponding row of
* `lists`
* @param find_option Whether to return the position of the first match (`FIND_FIRST`) or
* last (`FIND_LAST`)
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return std::unique_ptr<column> INT32 column of `n` rows with the location of the `search_key`
*
* @throw cudf::logic_error If `search_keys` does not match `lists` in its number of rows
* @throw cudf::logic_error If `search_keys` type does not match the element type in `lists`
* @throw cudf::logic_error If `lists` or `search_keys` contains nested elements (LIST, STRUCT)
*/
std::unique_ptr<column> index_of(
cudf::lists_column_view const& lists,
cudf::column_view const& search_keys,
duplicate_find_option find_option = duplicate_find_option::FIND_FIRST,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of group
} // namespace lists
} // namespace cudf
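
The `column_view` overload documented above differs from the scalar one in that each list row is paired with its own key. A minimal host-side sketch of that row-wise behaviour follows, under the same assumptions as before: INT32 elements, `std::optional` standing in for nulls, C++17, and hypothetical names throughout (`index_of_rows_model`, `key_row`, and so on are not cudf APIs). `std::logic_error` stands in for the `cudf::logic_error` the header says is thrown on a row-count mismatch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical host-side model; none of these names exist in cudf.
enum class find_option { FIND_FIRST, FIND_LAST };

using list_row = std::optional<std::vector<int32_t>>;  // std::nullopt models a null list row
using key_row  = std::optional<int32_t>;               // std::nullopt models a null key row

// output[i] is null if lists[i] or search_keys[i] is null, -1 if lists[i] does
// not contain search_keys[i], and the 0-based position of the match otherwise.
std::vector<std::optional<int32_t>> index_of_rows_model(
  std::vector<list_row> const& lists,
  std::vector<key_row> const& search_keys,
  find_option option = find_option::FIND_FIRST)
{
  // Stand-in for the documented cudf::logic_error on mismatched row counts.
  if (lists.size() != search_keys.size()) {
    throw std::logic_error("search_keys must have as many rows as lists");
  }
  // Value-initialized optionals are empty, so every output row starts as null.
  std::vector<std::optional<int32_t>> output(lists.size());
  for (std::size_t row = 0; row < lists.size(); ++row) {
    if (!lists[row].has_value() || !search_keys[row].has_value()) { continue; }
    auto const& elements = *lists[row];
    int32_t position     = -1;  // stays -1 if the key never matches
    for (std::size_t i = 0; i < elements.size(); ++i) {
      if (elements[i] == *search_keys[row]) {
        position = static_cast<int32_t>(i);
        if (option == find_option::FIND_FIRST) { break; }
      }
    }
    output[row] = position;
  }
  return output;
}

// Usage example: each list row is searched for the key in the same row.
void run_index_of_rows_examples()
{
  std::vector<list_row> lists{list_row{std::vector<int32_t>{1, 2, 1}},
                              list_row{std::vector<int32_t>{3, 4}},
                              list_row{},  // null list row
                              list_row{std::vector<int32_t>{5}}};
  std::vector<key_row> keys{key_row{1}, key_row{9}, key_row{3}, key_row{}};

  auto const first = index_of_rows_model(lists, keys, find_option::FIND_FIRST);
  assert(first[0] == 0);          // first 1 in {1, 2, 1}
  assert(first[1] == -1);         // 9 is not in {3, 4}
  assert(!first[2].has_value());  // null list row -> null output
  assert(!first[3].has_value());  // null key row -> null output

  auto const last = index_of_rows_model(lists, keys, find_option::FIND_LAST);
  assert(last[0] == 2);           // last 1 in {1, 2, 1}
}
```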