-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid DataFrame conversion in MultiIndex.from_pandas
#14470
Conversation
python/cudf/cudf/core/multiindex.py
Outdated
[level._data[column_name]], col | ||
)[0] | ||
for (column_name, col), level in zip(codes._data.items(), levels): | ||
result_col = level._column.take(col) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On line 175 we already do partial bounds-checking. The treatment of -1
is AFAICT undocumented pandas behaviour. That is codes
do not do normal indexing but instead -1
is for "missing value".
But let us move all of the validation into one place:
import cudf._lib as libcudf
from cudf._lib.types import size_type_dtype
import numpy as np
lo, hi = libcudf.reduce.minmax(code)
if lo.value < -1 or hi.value > len(level) - 1:
raise ValueError(...)
if lo.value == -1:
# Now we can gather and insert null automatically
code[code == -1] = np.iinfo(size_type_dtype).min
result_col, = libcudf.copying.gather([level], code, nullify=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find. Was able to consolidate the codes validate in this loop
@@ -158,7 +159,7 @@ def __init__( | |||
"codes and is inconsistent!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I am not sure that codes
can be np.int64
it should probably be size_type_dtype. But let us not fix that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion to retain the pandas issue reference in the xfail, otherwise looks great, thanks!
@mroeschke can we drop the |
MultiIndex.from_pandas
MultiIndex.from_pandas
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
/merge |
Description
This uncovered a bug where
MultiIndex.levels
could be a list of Series instead of a list of Index.Checklist