Improve memory footprint of isin by using contains
Previously, isin was implemented using an inner join between the
column we are searching (the haystack) and the values we are searching
for (the needles). This had a large memory footprint when there were
repeated needles (since that blows up the cardinality of the merge).
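To make the blow-up concrete, here is a minimal pure-Python sketch (not cudf code; `left_join_row_count` is a hypothetical helper introduced only for illustration). It counts how many rows a merge on the key column would materialise before deduplication, assuming the left-merge shape used by the removed code.

```python
from collections import Counter


def left_join_row_count(haystack, needles):
    """Count the rows a left join of haystack with needles would produce."""
    needle_counts = Counter(needles)
    # Each haystack row appears once per matching needle,
    # or exactly once if it has no match at all.
    return sum(max(needle_counts[value], 1) for value in haystack)


haystack = [1, 2, 2, 3]
needles = [2] * 1000 + [5]  # heavily repeated needle
rows = left_join_row_count(haystack, needles)
```

With four haystack rows and one needle repeated 1000 times, the intermediate result has 2002 rows, even though the final mask only needs four entries.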

To fix this, note that we don't need to do a merge at all, since
libcudf provides a primitive (contains) to search for many needles in
a haystack. The only thing we must bear in mind is that
left.isin(right) is asking for the locations in left that match an
entry in right, whereas contains(haystack, needles) provides a bool
mask that selects needles that are in the haystack. To get the
behaviour we want, we therefore need to do contains(right, left) and
treat the values to search for as the haystack.
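The argument order can be sketched in plain Python. This is an illustration under stated assumptions, not the libcudf implementation: `contains` and `isin` here are stand-ins built on a Python `set`, and null handling is ignored.

```python
def contains(haystack, needles):
    """Return one bool per needle: True where the needle occurs in haystack."""
    lookup = set(haystack)  # hash set built once from the haystack
    return [needle in lookup for needle in needles]


def isin(left, right):
    """Return one bool per element of left: True where it matches an entry in right."""
    # The values we are searching for (right) play the role of the haystack,
    # so the resulting mask lines up with left.
    return contains(right, left)


left = [1, 2, 2, 3]
right = [2, 2, 2, 5]
mask = isin(left, right)
```

The mask has one entry per element of `left`, and repeated entries in `right` cost nothing beyond their insertion into the set.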

As well as having a much better memory footprint, this hash-based
search is significantly faster than the previous merge-based
one.

While we are here, lower the memory footprint of MultiIndex.isin by
using a left-semi join (the implementation is separate from the isin
implementation on columns and looks a little more complicated to
unpick).
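The difference between the two join types can be sketched in plain Python. The helpers below are hypothetical and only model the row positions each join would return, not cudf's merge.

```python
def inner_join_positions(left_rows, right_rows):
    """Positions of left rows in an inner join: one per matching pair."""
    return [i for i, lrow in enumerate(left_rows)
            for rrow in right_rows if lrow == rrow]


def left_semi_join_positions(left_rows, right_rows):
    """Positions of left rows in a left-semi join: at most one per left row."""
    right_keys = set(right_rows)
    return [i for i, lrow in enumerate(left_rows) if lrow in right_keys]


left_rows = [("a", 1), ("b", 2), ("a", 1)]
right_rows = [("a", 1), ("a", 1), ("c", 3)]
inner = inner_join_positions(left_rows, right_rows)
semi = left_semi_join_positions(left_rows, right_rows)
```

Both results mark the same set of left positions, but the semi join never returns more entries than there are left rows, which is what keeps the footprint bounded when the values contain duplicates.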

- Closes #14298
wence- committed Nov 22, 2023
1 parent cfa2d51 commit 7848147
Showing 2 changed files with 6 additions and 8 deletions.
12 changes: 5 additions & 7 deletions python/cudf/cudf/core/column/column.py
@@ -916,13 +916,11 @@ def _obtain_isin_result(self, rhs: ColumnBase) -> ColumnBase:
         Helper function for `isin` which merges `self` & `rhs`
         to determine what values of `rhs` exist in `self`.
         """
-        ldf = cudf.DataFrame({"x": self, "orig_order": arange(len(self))})
-        rdf = cudf.DataFrame(
-            {"x": rhs, "bool": full(len(rhs), True, dtype="bool")}
-        )
-        res = ldf.merge(rdf, on="x", how="left").sort_values(by="orig_order")
-        res = res.drop_duplicates(subset="orig_order", ignore_index=True)
-        return res._data["bool"].fillna(False)
+        # We've already matched dtypes by now
+        result = libcudf.search.contains(rhs, self)
+        if result.null_count:
+            return result.fillna(False)
+        return result
 
     def as_mask(self) -> Buffer:
         """Convert booleans to bitmask
2 changes: 1 addition & 1 deletion python/cudf/cudf/core/multiindex.py
@@ -746,7 +746,7 @@ def isin(self, values, level=None):
                 )
             self_df = self.to_frame(index=False).reset_index()
             values_df = values_idx.to_frame(index=False)
-            idx = self_df.merge(values_df)._data["index"]
+            idx = self_df.merge(values_df, how="leftsemi")._data["index"]
             res = cudf.core.column.full(size=len(self), fill_value=False)
             res[idx] = True
             result = res.values