Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] drop_duplicates and distinct_count behavior/implementation is very inefficient #9413

Closed
jrhemstad opened this issue Oct 11, 2021 · 17 comments · Fixed by #10370
Closed
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@jrhemstad
Copy link
Contributor

Is your feature request related to a problem? Please describe.

#9411 made me take a closer look at cudf::drop_duplicates and cudf::distinct_count.

Unlike std::unique, both of these APIs will drop/count all unique rows across the entire table (as opposed to only contiguous equivalent rows).

On the surface, this seems convenient for a user because they don't have to worry about sorting their dataframe if they want to get only the unique rows. However, it has an insidious impact on performance.

Imagine if you were to call distinct_count and then drop_duplicates. The way these functions are currently implemented, they both require doing a sort of the inputs.

Instead, if distinct_count and drop_duplicates worked like std::unique and the user had to first sort the input, then only one sort would be needed. Alternatively, the data may already be sorted (as is the case with Python indexes), where no sort would be necessary.

The current behavior is very sub-optimal for performance as it can require 2 redundant multi-column sorts. Multi-column sorts are among the most expensive operations in libcudf, so this is a bad thing.

(Furthermore, if the data isn't already sorted, using a sort-based implementation is likely to be much slower than a hash-based implementation, so we should look at refactoring these implementations).

Describe the solution you'd like

I'd like to do two things:

  1. Update drop_duplicates/distinct_count to work like std::unique, i.e., it only considers contiguous equivalent elements.
    • This would require the data be presorted to preserve the existing behavior. Note that this also requires the user to describe how the data is sorted for things like null_order and such. We can look at groupby for an example of how we've handled this there.
  2. Add unordered_drop_duplicates and unordered_distinct_count that behave like the current APIs.
    • We should look at using a hash-based implementation for the unordered_* algorithms as it will likely be much faster

Describe alternatives you've considered

Why not preserve the existing APIs and add ordered_drop_duplicates and ordered_distinct_count?

While this is certainly an option, I think that the behavior I described above would be more canonical for C++.

Why have both unordered_* and ordered versions?

If the data is already sorted, there's a good chance a sort based implementation could be faster than hash-based, but we can test that and see.

@jrhemstad jrhemstad added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue labels Oct 11, 2021
@harrism
Copy link
Member

harrism commented Oct 12, 2021

Why not rename the new ordered version of drop_duplicates to cudf::unique?

@harrism
Copy link
Member

harrism commented Oct 12, 2021

Just for reference in thinking about naming and semantics, I also note that std (our eternal inspiration) does not have a unique_count. Typical approaches to implementing it:

std::vector<int> v{...};
int unique_count = std::set<int>(v.begin(), v.end()).size();
std::vector<int> v{...};
std::sort(v.begin(), v.end());
int unique_count = std::unique(v.begin(), v.end()) - v.begin();

I think we discussed this in the past, and decided that due to the costs of materializing columns / tables, it's better to provide an algorithm that actually counts the unique items with a scan than to require the user to build their own as in method 2. The hashing approach you mention would be similar to approach 1 and probably would be faster in a lot of not pre-sorted cases.

I can't remember why we decided to call it distinct_count instead of unique_count. I bet @codereport was involved or would have an opinion on the naming!

@shwina
Copy link
Contributor

shwina commented Oct 12, 2021

The Pandas API is explicit about not having sorting behaviour:

Signature: s.unique() -> 'ArrayLike'
Docstring:
Return unique values of Series object.

Uniques are returned in order of appearance. Hash table-based unique,
therefore does NOT sort.

So for Python, we'll likely have to do the regular trick of tacking on a sequence column and later sorting by it, at least until an unordered_disctinct_count is available.

@devavret
Copy link
Contributor

One more thing: I've found that for a very small subset (single column, non-nullable, integer representation) sorting is actually much faster than hashing. But I suppose that can become an implementation detail in unordered_drop_duplicates

@harrism
Copy link
Member

harrism commented Oct 13, 2021

Given the Python observation, why not call cudf::drop_duplicates just cudf::unique() to match both STL and Pandas and have it be unordered? Users need to sort first if they want it ordered.

@codereport
Copy link
Contributor

I bet @codereport was involved or would have an opinion on the naming!

I think it was actually called unique_count at one point and I pointed out that "unique" has semantics in C++ that didn't align with what the libcudf algorithm was doing. So we ended up changing it to distinct_count as "distinct" doesn't have any specific C++ semantics.

Other side note: it is regarded as a mistake that we called our containers:

  • map / set (for sorted tree implementations)
  • unordered_map / unordered_set (for hashing implementations)

instead of:

  • sorted_map / sorted_set (for sorted tree implementations)
  • map / set (for hashing implementations)

The "sortedness" is an extra requirement that the user should have to think about if they want it. "Unsortedness" should be the default (Python got it right with Set / Map and SortedSet / SortedMap

@harrism
Copy link
Member

harrism commented Oct 13, 2021

Other side note: it is regarded as a mistake that we called our containers:

map / set (for sorted tree implementations)
unordered_map / unordered_set (for hashing implementations)

I don't know of such classes in libcudf. "we"?

@harrism
Copy link
Member

harrism commented Oct 13, 2021

I think it was actually called unique_count at one point and I pointed out that "unique" has semantics in C++ that didn't align with what the libcudf algorithm was doing. So we ended up changing it to distinct_count as "distinct" doesn't have any specific C++ semantics.

But now Jake is suggesting changing the non-ordering semantics to match std::unique. Are there other semantics that should stop us using that name?

@jrhemstad
Copy link
Contributor Author

Given the Python observation, why not call cudf::drop_duplicates just cudf::unique() to match both STL and Pandas and have it be unordered? Users need to sort first if they want it ordered.

But the behavior of std::unique doesn't match Pandas unique. If we go with the unique/unordered_unique naming, Pandas would use the unordered_unique.

@harrism
Copy link
Member

harrism commented Oct 13, 2021

Hmmm, perhaps I misunderstood @shwina 's message.

What does Pandas unique return for [4, 4, 0, 1, 2, 2, 4, 4]? std:unique would return [4, 0, 1, 2, 4].

@jrhemstad
Copy link
Contributor Author

https://pandas.pydata.org/docs/reference/api/pandas.unique.html

Hash table-based unique. Uniques are returned in order of appearance. This does NOT sort.

The fact that it calls out it uses a hash table means it's doing a total unique.

@shwina
Copy link
Contributor

shwina commented Oct 13, 2021

What does Pandas unique return for [4, 4, 0, 1, 2, 2, 4, 4]? std:unique would return [4, 0, 1, 2, 4].

Pandas would return [4, 0, 1, 2].

@codereport
Copy link
Contributor

Other side note: it is regarded as a mistake that we called our containers:
map / set (for sorted tree implementations)
unordered_map / unordered_set (for hashing implementations)

I don't know of such classes in libcudf. "we"?

"We" refers to the C++ community at large. And it was just a general comment about the "unordered_" prefix that is used in the C++ Standard Library.

@harrism
Copy link
Member

harrism commented Oct 13, 2021

The fact that it calls out it uses a hash table means it's doing a total unique

Let's make sure to make our docs more clear. Understanding what our APIs do shouldn't require inference from implementation details.

@jrhemstad
Copy link
Contributor Author

jrhemstad commented Oct 15, 2021

As a promising anecdote, @gaohao95 has already experimented with a hash-based solution of drop_duplicates with cuCollections and reports it's 100x faster than the current sort-based solution.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@ttnghia
Copy link
Contributor

ttnghia commented Dec 17, 2021

As a promising anecdote, @gaohao95 has already experimented with a hash-based solution of drop_duplicates with cuCollections and reports it's 100x faster than the current sort-based solution.

@jrhemstad Is there any draft PR yet? I am implementing cudf::contains for structs and have similar need for structs hashing. I planned to modify the existsing unordered_multiset implementation to support structs but not sure if the solution above (using cuCollection) is worth to wait.

@PointKernel PointKernel self-assigned this Jan 11, 2022
rapids-bot bot pushed a commit that referenced this issue Jan 15, 2022
This PR adds support for `cudf::contains` so we can check whether a structs column contains a scalar struct element.

Partially addresses #8965. This does not support checking if structs given in a structs column exist in another structs column. Such cases will be supported when the new data structure mentioned in #9413 is merged into cudf.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - MithunR (https://github.com/mythrocks)

URL: #9929
rapids-bot bot pushed a commit that referenced this issue Feb 2, 2022
Related to #9413.

This PR adds `unordered_drop_duplicates`/`unordered_distinct_count` APIs by using hash-based algorithms. It doesn't close the original issue since adding `std::unique`-like `drop_duplicates` is not addressed in this PR. It involves several changes:

- [x] Change the behavior of the existing `distinct_count`: counting the number of consecutive groups of equivalent rows instead of total unique.
- [x] Add hash-based `unordered_distinct_count`: this new API counts unique rows across the whole table by using a hash map. It requires a newer version of `cuco` with bug fixing: NVIDIA/cuCollections#132 and NVIDIA/cuCollections#138.
- [x] Add hash-based `unordered_drop_duplicates`: similar to `drop_duplicates`, but this API doesn't support `keep` option and the output is in an unspecified order.
- [x] Replace all the cpp-side `drop_duplicates`/`distinct_count` use cases with `unordered_` versions. 
- [x] Update and replace the existing compaction benchmark with `nvbench`.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - https://github.com/brandon-b-miller
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #10030
rapids-bot bot pushed a commit that referenced this issue Mar 12, 2022
Closes #9413

Depending on #10387.

There are several changes involved in this PR:

- Refactors `cudf::drop_duplicates` to match `std::unique`'s behavior and renames it as `cudf::unique`. `cudf::unique` creates a table by removing duplicate rows in each consecutive group of equivalent rows of the input.
- Renames `cudf::unordered_drop_duplicates` as `cudf::distinct`. `cudf::distinct` creates a table by keeping unique rows across the whole input table. Unique rows in the new table are in unspecified orders due to the nature of hash-based algorithms.
- Renames `cudf::unordered_distinct_count` as `cudf::distinct_count`: count of `cudf::distinct`
- Renames `cudf::distinct_count` as `cudf::unique_count`: count of `cudf::unique`
- Updates corresponding tests and benchmarks.
- Updates related JNI/Cython bindings. In order not to break the existing behavior in java and python, JNI and Cython bindings of `drop_duplicates` are updated to stably sort the input table first and then `cudf::unique`. 


Performance hints for `cudf::unique` and `cudf::distinct`:

- If the input is pre-sorted, use `cudf::unique`
- If the input is **not** pre-sorted and the behavior of `pandas.DataFrame.drop_duplicates` is desired:

  - If `keep` control (keep the first, last, or none of the duplicates) doesn't matter, use the hash-based `cudf::distinct`
  - If `keep` control is required, stable sort the input then `cudf::unique`

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - https://github.com/brandon-b-miller
  - MithunR (https://github.com/mythrocks)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10370
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants