Add Index.get_loc for Numerical, String Index support #8489
Conversation
LGTM overall, just a few comments/questions
Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
Codecov Report
@@            Coverage Diff             @@
##             branch-21.08    #8489   +/-   ##
================================================
  Coverage            ?       82.96%
  Files               ?       110
  Lines               ?       18178
  Branches            ?       0
================================================
  Hits                ?       15082
  Misses              ?       3096
  Partials            ?       0

Continue to review the full report at Codecov.
Looks pretty much good to go 😄 Just added a few suggestions!
return slice(x1, x2 + 1, x2 - x1)
start, step = indices[0].item(), (indices[1] - indices[0]).item()
stop = start + step * len(indices)
if (indices == cp.arange(start, stop, step)).all():
Should we use numpy here instead of cupy? Just thinking about our previous discussion on syncing.
Here's what I found after running some benchmarks. At a very high level, _maybe_indices_to_slice makes up only a small fraction of the total cost of get_loc; the majority of the time is spent on sorting and gathering, so optimizing this section does not make a big difference overall. (See the red-circled area; benchmarked at 100M rows.)
Looking at the details, though, we can compare just the post-processing part: the time spent deciding whether the indices can be converted into a slice and, if not, creating a boolean mask. Here are the results:
N | GPU time | CPU time
---|---|---
100,000 | 0.00465 | 0.00061
1,000,000 | 0.00152 | 0.00258
10,000,000 | 0.00440 | 0.02070
100,000,000 | 0.01887 | 0.21961
Using cupy here has some advantage when the array is large. Plus, we might further improve performance if we can figure out how to call cupy.arange without device syncs.
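As a side note on the "if not, create a boolean mask" fallback mentioned above, a minimal NumPy sketch looks like the following (the function name and signature here are illustrative, not taken from the PR; the GPU path would use cp.full instead of np.full):

```python
import numpy as np


def indices_to_mask(indices: np.ndarray, n: int) -> np.ndarray:
    """Build a boolean mask of length n that is True at each found position."""
    mask = np.full(n, False)  # cp.full on the GPU path
    mask[indices] = True
    return mask


# Mask over 6 positions with True at 0, 2, and 5:
mask = indices_to_mask(np.array([0, 2, 5]), 6)
print(mask)
```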
Benchmark details: the CPU path of _maybe_indices_to_slice is:
from typing import Union

import cupy as cp
import numpy as np


def _maybe_indices_to_slice(indices: cp.ndarray) -> Union[slice, np.ndarray]:
    # CPU path: copy the indices to host memory first
    indices = indices.get()
    if len(indices) == 1:
        x = indices[0]
        return slice(x, x + 1)
    if len(indices) == 2:
        x1, x2 = indices[0], indices[1]
        return slice(x1, x2 + 1, x2 - x1)
    # Evenly spaced indices can be collapsed into a slice
    start, step = indices[0], indices[1] - indices[0]
    stop = start + step * len(indices)
    if (indices == np.arange(start, stop, step)).all():
        return slice(start, stop, step)
    return indices
And later use np.full instead of cp.full.
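The CPU path above can be exercised with plain NumPy and no GPU: the standalone sketch below drops the device-to-host .get() call and is for illustration only, not the PR's code.

```python
from typing import Union

import numpy as np


def _maybe_indices_to_slice(indices: np.ndarray) -> Union[slice, np.ndarray]:
    """Collapse evenly spaced positions into a slice; otherwise return them as-is."""
    if len(indices) == 1:
        x = indices[0]
        return slice(x, x + 1)
    if len(indices) == 2:
        x1, x2 = indices[0], indices[1]
        return slice(x1, x2 + 1, x2 - x1)
    start, step = indices[0], indices[1] - indices[0]
    stop = start + step * len(indices)
    if (indices == np.arange(start, stop, step)).all():
        return slice(start, stop, step)
    return indices


print(_maybe_indices_to_slice(np.array([3, 5, 7, 9])))  # evenly spaced -> slice
print(_maybe_indices_to_slice(np.array([3, 5, 8])))     # uneven spacing -> array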
Sounds good! I guess in this case the array could be quite large so that makes sense.
Would you like to figure out the efficiency issues with MultiIndex in a separate PR? I think for regular indices this looks good to me.
Would you like to figure out the efficiency issues with MultiIndex in a separate PR?
Yup, that sounds good.
Ok, great!
After benchmarking on non-monotonic, non-unique indices: [benchmark plots for MultiIndex (2-level) and Index omitted]
rerun tests
Great job Michael :D
@gpucibot merge
Closes #3679
This PR adds the Index.get_loc API; currently only numerical and string Index/MultiIndex are tested. In general this API follows the return scheme of pd.get_loc but deviates a little in some situations. For MultiIndex, if the index is neither lexicographically sorted nor unique, a best-effort attempt is made to coerce the found indices into a slice; in some of these cases pandas returns a boolean mask instead. Deviation example:
[Opinion] Since this return scheme makes the information compact and saves memory, it is arguably a meaningful deviation.
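For reference, the pandas return scheme that this API follows can be seen directly: an integer position for a unique label, a slice for duplicates in a monotonic index, and a boolean mask for duplicates in a non-monotonic index.

```python
import pandas as pd

loc = pd.Index([1, 2, 3]).get_loc(2)       # unique label -> integer position
loc2 = pd.Index([1, 2, 2, 3]).get_loc(2)   # monotonic duplicates -> slice
mask = pd.Index([1, 2, 3, 2]).get_loc(2)   # non-monotonic duplicates -> boolean mask

print(loc, loc2, list(mask))
```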
Benchmark snapshot, on non-monotonic, non-unique indices: [plots for MultiIndex (2-level) and Index omitted]