Add `Index.get_loc` for Numerical, String Index support #8489

isVoid · 2021-06-10T21:12:06Z

This PR adds Index.get_loc API, currently only numerical and string index/multiindex are tested. In general this API follows the return scheme for pd.get_loc but deviates a little in some situations. For MultiIndex, if the index is neither lexicographically sorted nor unique, a best effort attempt is made to coerce the found indices into a slice. In some of these cases pandas returns a boolean mask instead.

Deviation example:

>>> import pandas as pd
>>> import cudf
>>> x = pd.MultiIndex.from_tuples(
            [(2, 1, 1), (1, 2, 3), (1, 2, 1), (1, 1, 1), (1, 1, 1), (2, 2, 1)]
        )
>>> x.get_loc(1)
array([False,  True,  True,  True,  True, False])
>>> cudf.from_pandas(x).get_loc(1)
slice(1, 5, 1)

[Opinion] Since this return scheme makes information compact and saves memory, it is arguably a meaningful deviation.

Benchmark snapshot, on non-monotonic, non-unique indices:
MultiIndex (2-level):

N	cudf	pandas	(cudf-pandas)/pandas
1000000	0.01686443090438843	0.008090078830718994	1.0845817769231336
10000000	0.10224630832672119	0.1835497260093689	-0.4429503625546014
100000000	1.0716976642608642	1.9078519105911256	-0.4382699944835806

Index

N	cudf	pandas	(cudf-pandas)/pandas
1000000	0.006694436073303223	0.00545426607131958	0.22737614662857875
10000000	0.015508973598480224	0.08476289510726928	-0.8170311009451332
100000000	0.06651490926742554	1.0927792191505432	-0.9391323442999493

…3679

charlesbluca

LGTM overall, just a few comments/questions

python/cudf/cudf/core/index.py

python/cudf/cudf/core/multiindex.py

python/cudf/cudf/tests/test_index.py

Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>

codecov · 2021-06-10T23:27:01Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@cefc266). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #8489   +/-   ##
===============================================
  Coverage                ?   82.96%           
===============================================
  Files                   ?      110           
  Lines                   ?    18178           
  Branches                ?        0           
===============================================
  Hits                    ?    15082           
  Misses                  ?     3096           
  Partials                ?        0

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cefc266...e9108ac. Read the comment docs.

marlenezw

Looks pretty much good to go 😄 Just added a few suggestions!

python/cudf/cudf/core/index.py

marlenezw · 2021-06-14T12:18:22Z

python/cudf/cudf/utils/utils.py

+        return slice(x1, x2 + 1, x2 - x1)
+    start, step = indices[0].item(), (indices[1] - indices[0]).item()
+    stop = start + step * len(indices)
+    if (indices == cp.arange(start, stop, step)).all():


Should we use numpy here instead of cupy? Just thinking about our previous discussion on syncing.

Here's what I found after running some benchmarks. On a very high level, _maybe_indices_to_slice only make up for a small fraction of the total cost of get_loc. The majority of the time is spent on sorting and and gathering. So overall it does not make a big difference optimizing for this section. (see red circle area, benchmarked at 100M rows)

But if we are to look at the details, we can compare just the post-processing part. That is, the time spent on judging if the indices can be converted into a slice, if not, create a boolean mask. And here are the results:

N GPU CPU

100000 0.0046520233154296875 0.0006105899810791016

1000000 0.001522064208984375 0.0025777816772460938

10000000 0.004399299621582031 0.020697832107543945

100000000 0.018871545791625977 0.21960783004760742

Using cupy here has some advantage when array is large. Plus we might further improve the performance if we can figure out how to call cupy.arange without device syncs.

Benchmark details:
the CPU path of the _maybe_indices_to_slice is:

def _maybe_indices_to_slice(indices: cp.ndarray) -> Union[slice, cp.ndarray]: # cpu path: indices = indices.get() if len(indices) == 1: x = indices[0] return slice(x, x + 1) if len(indices) == 2: x1, x2 = indices[0], indices[1] return slice(x1, x2 + 1, x2 - x1) start, step = indices[0], (indices[1] - indices[0]) stop = start + step * len(indices) if (indices == np.arange(start, stop, step)).all(): return slice(start, stop, step) return indices

And later use np.full instead of cp.cull.

Benchmark code

Sounds good! I guess in this case the array could be quite large so that makes sense.

Would you like to figure out the efficiency issues with MultiIndex in a separate PR? I think for regular indices this looks good to me.

Would you like to figure out the efficiency issues with MultiIndex in a separate PR?

Yup, that sounds good.

isVoid · 2021-06-14T22:22:34Z

After benchmarking get_loc in nvtx, I was a bit worried on the efficiency on current implementation and how it compares with doing the operation on host. The following benchmark result shows while cudf has performance advantage when N is large, it's less obvious for MultiIndex.

(On non-monotonic, non-unique indices)

MultiIndex (2-level):

N	cudf	pandas	(cudf-pandas)/pandas
1000000	0.01686443090438843	0.008090078830718994	1.0845817769231336
10000000	0.10224630832672119	0.1835497260093689	-0.4429503625546014
100000000	1.0716976642608642	1.9078519105911256	-0.4382699944835806

Index

N	cudf	pandas	(cudf-pandas)/pandas
1000000	0.006694436073303223	0.00545426607131958	0.22737614662857875
10000000	0.015508973598480224	0.08476289510726928	-0.8170311009451332
100000000	0.06651490926742554	1.0927792191505432	-0.9391323442999493

Benchmark code

isVoid · 2021-06-14T23:39:34Z

rerun tests

marlenezw

Great job Michael :D

isVoid · 2021-06-15T20:17:56Z

@gpucibot merge

numeric, strings support

47f9b4f

isVoid requested a review from a team as a code owner June 10, 2021 21:12

isVoid requested review from charlesbluca and marlenezw June 10, 2021 21:12

github-actions bot added the Python Affects Python cuDF API. label Jun 10, 2021

isVoid added 2 commits June 10, 2021 14:13

typo

048eca0

Merge branch 'branch-21.08' of https://github.com/rapidsai/cudf into …

267ea98

…3679

isVoid added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change labels Jun 10, 2021

charlesbluca reviewed Jun 10, 2021

View reviewed changes

python/cudf/cudf/core/index.py Outdated Show resolved Hide resolved

python/cudf/cudf/core/index.py Outdated Show resolved Hide resolved

python/cudf/cudf/core/multiindex.py Show resolved Hide resolved

python/cudf/cudf/tests/test_index.py Outdated Show resolved Hide resolved

Update python/cudf/cudf/core/index.py

0dde4c7

Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>

address review

ab1cc14

marlenezw reviewed Jun 14, 2021

View reviewed changes

docs, more examples

e9108ac

marlenezw approved these changes Jun 15, 2021

View reviewed changes

isVoid requested a review from charlesbluca June 15, 2021 17:41

charlesbluca approved these changes Jun 15, 2021

View reviewed changes

isVoid added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jun 15, 2021

rapids-bot bot merged commit 1314bd5 into rapidsai:branch-21.08 Jun 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `Index.get_loc` for Numerical, String Index support #8489

Add `Index.get_loc` for Numerical, String Index support #8489

isVoid commented Jun 10, 2021 •

edited

Loading

charlesbluca left a comment

codecov bot commented Jun 10, 2021 •

edited

Loading

marlenezw left a comment

marlenezw Jun 14, 2021

isVoid Jun 14, 2021 •

edited

Loading

marlenezw Jun 15, 2021

marlenezw Jun 15, 2021

isVoid Jun 15, 2021

marlenezw Jun 15, 2021

isVoid commented Jun 14, 2021 •

edited

Loading

isVoid commented Jun 14, 2021

marlenezw left a comment

isVoid commented Jun 15, 2021

N	GPU	CPU
100000	0.0046520233154296875	0.0006105899810791016
1000000	0.001522064208984375	0.0025777816772460938
10000000	0.004399299621582031	0.020697832107543945
100000000	0.018871545791625977	0.21960783004760742

Add Index.get_loc for Numerical, String Index support #8489

Add Index.get_loc for Numerical, String Index support #8489

Conversation

isVoid commented Jun 10, 2021 • edited Loading

charlesbluca left a comment

Choose a reason for hiding this comment

codecov bot commented Jun 10, 2021 • edited Loading

Codecov Report

marlenezw left a comment

Choose a reason for hiding this comment

marlenezw Jun 14, 2021

Choose a reason for hiding this comment

isVoid Jun 14, 2021 • edited Loading

Choose a reason for hiding this comment

marlenezw Jun 15, 2021

Choose a reason for hiding this comment

marlenezw Jun 15, 2021

Choose a reason for hiding this comment

isVoid Jun 15, 2021

Choose a reason for hiding this comment

marlenezw Jun 15, 2021

Choose a reason for hiding this comment

isVoid commented Jun 14, 2021 • edited Loading

isVoid commented Jun 14, 2021

marlenezw left a comment

Choose a reason for hiding this comment

isVoid commented Jun 15, 2021

Add `Index.get_loc` for Numerical, String Index support #8489

Add `Index.get_loc` for Numerical, String Index support #8489

isVoid commented Jun 10, 2021 •

edited

Loading

codecov bot commented Jun 10, 2021 •

edited

Loading

isVoid Jun 14, 2021 •

edited

Loading

isVoid commented Jun 14, 2021 •

edited

Loading