Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Index.get_loc for Numerical, String Index support #8489

Merged
merged 6 commits into from
Jun 15, 2021

Conversation

isVoid
Copy link
Contributor

@isVoid isVoid commented Jun 10, 2021

Closes #3679

This PR adds Index.get_loc API, currently only numerical and string index/multiindex are tested. In general this API follows the return scheme for pd.get_loc but deviates a little in some situations. For MultiIndex, if the index is neither lexicographically sorted nor unique, a best effort attempt is made to coerce the found indices into a slice. In some of these cases pandas returns a boolean mask instead.

Deviation example:

>>> import pandas as pd
>>> import cudf
>>> x = pd.MultiIndex.from_tuples(
            [(2, 1, 1), (1, 2, 3), (1, 2, 1), (1, 1, 1), (1, 1, 1), (2, 2, 1)]
        )
>>> x.get_loc(1)
array([False,  True,  True,  True,  True, False])
>>> cudf.from_pandas(x).get_loc(1)
slice(1, 5, 1)

[Opinion] Since this return scheme makes information compact and saves memory, it is arguably a meaningful deviation.

Benchmark snapshot, on non-monotonic, non-unique indices:
MultiIndex (2-level):

N cudf pandas (cudf-pandas)/pandas
1000000 0.01686443090438843 0.008090078830718994 1.0845817769231336
10000000 0.10224630832672119 0.1835497260093689 -0.4429503625546014
100000000 1.0716976642608642 1.9078519105911256 -0.4382699944835806

Index

N cudf pandas (cudf-pandas)/pandas
1000000 0.006694436073303223 0.00545426607131958 0.22737614662857875
10000000 0.015508973598480224 0.08476289510726928 -0.8170311009451332
100000000 0.06651490926742554 1.0927792191505432 -0.9391323442999493

@isVoid isVoid requested a review from a team as a code owner June 10, 2021 21:12
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jun 10, 2021
@isVoid isVoid added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change labels Jun 10, 2021
Copy link
Member

@charlesbluca charlesbluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, just a few comments/questions

python/cudf/cudf/core/index.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/index.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/multiindex.py Show resolved Hide resolved
python/cudf/cudf/tests/test_index.py Outdated Show resolved Hide resolved
Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
@codecov
Copy link

codecov bot commented Jun 10, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@cefc266). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.08    #8489   +/-   ##
===============================================
  Coverage                ?   82.96%           
===============================================
  Files                   ?      110           
  Lines                   ?    18178           
  Branches                ?        0           
===============================================
  Hits                    ?    15082           
  Misses                  ?     3096           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cefc266...e9108ac. Read the comment docs.

Copy link
Contributor

@marlenezw marlenezw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty much good to go 😄 Just added a few suggestions!

python/cudf/cudf/core/index.py Show resolved Hide resolved
return slice(x1, x2 + 1, x2 - x1)
start, step = indices[0].item(), (indices[1] - indices[0]).item()
stop = start + step * len(indices)
if (indices == cp.arange(start, stop, step)).all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use numpy here instead of cupy? Just thinking about our previous discussion on syncing.

Copy link
Contributor Author

@isVoid isVoid Jun 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what I found after running some benchmarks. On a very high level, _maybe_indices_to_slice only make up for a small fraction of the total cost of get_loc. The majority of the time is spent on sorting and and gathering. So overall it does not make a big difference optimizing for this section. (see red circle area, benchmarked at 100M rows)

image

But if we are to look at the details, we can compare just the post-processing part. That is, the time spent on judging if the indices can be converted into a slice, if not, create a boolean mask. And here are the results:

N GPU CPU
100000 0.0046520233154296875 0.0006105899810791016
1000000 0.001522064208984375 0.0025777816772460938
10000000 0.004399299621582031 0.020697832107543945
100000000 0.018871545791625977 0.21960783004760742

Using cupy here has some advantage when array is large. Plus we might further improve the performance if we can figure out how to call cupy.arange without device syncs.

Benchmark details:
the CPU path of the _maybe_indices_to_slice is:

def _maybe_indices_to_slice(indices: cp.ndarray) -> Union[slice, cp.ndarray]:
    # cpu path:
    indices = indices.get()
    if len(indices) == 1:
        x = indices[0]
        return slice(x, x + 1)
    if len(indices) == 2:
        x1, x2 = indices[0], indices[1]
        return slice(x1, x2 + 1, x2 - x1)
    start, step = indices[0], (indices[1] - indices[0])
    stop = start + step * len(indices)
    if (indices == np.arange(start, stop, step)).all():
        return slice(start, stop, step)
    return indices

And later use np.full instead of cp.cull.

Benchmark code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I guess in this case the array could be quite large so that makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like to figure out the efficiency issues with MultiIndex in a separate PR? I think for regular indices this looks good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like to figure out the efficiency issues with MultiIndex in a separate PR?

Yup, that sounds good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, great!

@isVoid
Copy link
Contributor Author

isVoid commented Jun 14, 2021

After benchmarking get_loc in nvtx, I was a bit worried on the efficiency on current implementation and how it compares with doing the operation on host. The following benchmark result shows while cudf has performance advantage when N is large, it's less obvious for MultiIndex.

(On non-monotonic, non-unique indices)

MultiIndex (2-level):

N cudf pandas (cudf-pandas)/pandas
1000000 0.01686443090438843 0.008090078830718994 1.0845817769231336
10000000 0.10224630832672119 0.1835497260093689 -0.4429503625546014
100000000 1.0716976642608642 1.9078519105911256 -0.4382699944835806

Index

N cudf pandas (cudf-pandas)/pandas
1000000 0.006694436073303223 0.00545426607131958 0.22737614662857875
10000000 0.015508973598480224 0.08476289510726928 -0.8170311009451332
100000000 0.06651490926742554 1.0927792191505432 -0.9391323442999493

Benchmark code

@isVoid
Copy link
Contributor Author

isVoid commented Jun 14, 2021

rerun tests

Copy link
Contributor

@marlenezw marlenezw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job Michael :D

@isVoid isVoid requested a review from charlesbluca June 15, 2021 17:41
@isVoid isVoid added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jun 15, 2021
@isVoid
Copy link
Contributor Author

isVoid commented Jun 15, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1314bd5 into rapidsai:branch-21.08 Jun 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] index.get_loc
3 participants