Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DEPRECATION] Remove Series.set_index #9541

Closed
vyasr opened this issue Oct 27, 2021 · 6 comments · Fixed by #9945
Closed

[DEPRECATION] Remove Series.set_index #9541

vyasr opened this issue Oct 27, 2021 · 6 comments · Fixed by #9945
Assignees

Comments

@vyasr
Copy link
Contributor

vyasr commented Oct 27, 2021

pandas supports DataFrame.set_index to convert specified columns of a DataFrame into its index. To actually set a separate index object as the index of a DataFrame or Series, however, a user must set the index property (e.g. df.index = RangeIndex(10)). Series.set_index as implemented in cuDF (as an alias for setting the property) confuses the issue. The method was deprecated in #9529 for 21.12 and we will remove it in 22.02.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@bdice bdice self-assigned this Dec 21, 2021
rapids-bot bot pushed a commit that referenced this issue Jan 26, 2022
This PR removes the deprecated method `Series.set_index`. Resolves #9541, follows up on #9529.

Authors:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Christopher Harris (https://github.com/cwharris)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9945
@lmeyerov
Copy link

@vyasr is this a correct rewrite of impacted code?

df['x2'] = df['x'].set_index(df['idx']).sort_index()

=>

df['x2'] = df[['x']].set_index(df['idx']).sort_index()['x']

@vyasr
Copy link
Contributor Author

vyasr commented Apr 13, 2022

That is a valid rewrite. However, I'm not certain that ever did exactly what you intended unless the idx column exactly matched the elements in your DataFrame's index. Assignment in this fashion triggers index alignment, which could lead to unexpected side effects (in both pandas and cudf). For example (using pandas to demonstrate that this has always been expected behavior):

In [72]: df = cudf.DataFrame({'x': [3, 6, 2, 3, 90], 'idx': [5, 4, 3, 2, 1]})

In [73]: pdf = df.to_pandas()

In [74]: pdf['x2'] = pdf[['x']].set_index(pdf['idx']).sort_index()['x']

In [75]: pdf
Out[75]: 
    x  idx    x2
0   3    5   NaN
1   6    4  90.0
2   2    3   3.0
3   3    2   2.0
4  90    1   6.0

Note that because idx ranges from 1-5 while the DataFrame's index goes from 0-4, we end up with NaNs in the 0th element of column `x2.

Leaving aside that question, as far as the best replacement it depends on personal preference and how you want to trade off performance vs simplicity. You may get better mileage from one of the following depending on the number of rows and columns in your table:

# Option 1 (your code with an index reset)
In [76]: %timeit df['x2'] = df[['x']].set_index(df['idx']).sort_index()['x'].reset_index(drop=True)
1.14 ms ± 5.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# Option 2 (sort the values rather than setting an index)
In [77]: %timeit df['x2'] = df.sort_values(by='idx')['x'].reset_index(drop=True)
755 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# Option 3 (avoid sorting unnecessary columns, most valuable if you have a lot of columns)
In [78]: %timeit df['x2'] = df[['x', 'idx']].sort_values(by='idx')['x'].reset_index(drop=True)
882 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# Option 4 (more lines of code, but the most direct translation of what your original code did)
In [80]: %%timeit
    ...: s = df['x']
    ...: s.index = df['idx']
    ...: df['x2'] = s.sort_index().reset_index(drop=True)
    ...: 
    ...: 
812 µs ± 2.35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@lmeyerov
Copy link

Excellent, thanks! 🙌

Our case is a couple of permutations we're tacking on for generating some sort of funny ~reverse ~indices, and primarily for export, so (I think) we're good

An interesting bit these kinds of examples show that it'd be interesting (R&D-level) to do some sort of call tree recording to identify and recommend fusion + sorting optimizations, basically what a SQL optimizer would do, and advise on those!

@vyasr
Copy link
Contributor Author

vyasr commented Apr 14, 2022

You may be interested in expression templates, which are one way that e.g. linear algebra libraries do exactly what you're talking about to enable operator fusion and minimizing intermediates. You can do it in Python with lazy evaluation like what Dask and Vaex do. But there's no real way to offer both lazy and greedy evaluation at the same time (although you could feature flag it), so the best we could do with that is probably what you suggested: record ops and make recommendations. That would be cool though!

@lmeyerov
Copy link

lmeyerov commented Apr 14, 2022

we were playing with the idea of LINQ-based df expression trees -- a lot of our original core was rxjs -> dataframejs -- where the same relational structure can work for diff eval structures. we've moved to python as rapids grew though.

doing for dask_cudf DAGs is interesting. we have a lot of dgdf.map_partitions(lambda gdf: ...) though so not sure if would capture enough, so maybe still comes down to cudf expression tracking.

some big examples for us:

  • we basically never take advantage of sorted indexes yet have a lot of merges. an advisor would help!
  • we did a pass awhile back trying to identify extraneous copies, and i bet there are more again. we get hit by a lot of overhead, so these can add up.
  • there are likely some fusion & reuse opportunities

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants