[DEPRECATION] Remove Series.set_index #9541

vyasr · 2021-10-27T22:12:57Z

pandas supports DataFrame.set_index to convert specified columns of a DataFrame into its index. To actually set a separate index object as the index of a DataFrame or Series, however, a user must set the index property (e.g. df.index = RangeIndex(10)). Series.set_index as implemented in cuDF (as an alias for setting the property) confuses the issue. The method was deprecated in #9529 for 21.12 and we will remove it in 22.02.

The text was updated successfully, but these errors were encountered:

github-actions · 2021-11-26T23:02:51Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

This PR removes the deprecated method `Series.set_index`. Resolves #9541, follows up on #9529. Authors: - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Christopher Harris (https://github.com/cwharris) - Vyas Ramasubramani (https://github.com/vyasr) URL: #9945

lmeyerov · 2022-04-13T05:38:13Z

@vyasr is this a correct rewrite of impacted code?

df['x2'] = df['x'].set_index(df['idx']).sort_index()

=>

df['x2'] = df[['x']].set_index(df['idx']).sort_index()['x']

vyasr · 2022-04-13T21:10:38Z

That is a valid rewrite. However, I'm not certain that ever did exactly what you intended unless the idx column exactly matched the elements in your DataFrame's index. Assignment in this fashion triggers index alignment, which could lead to unexpected side effects (in both pandas and cudf). For example (using pandas to demonstrate that this has always been expected behavior):

In [72]: df = cudf.DataFrame({'x': [3, 6, 2, 3, 90], 'idx': [5, 4, 3, 2, 1]})

In [73]: pdf = df.to_pandas()

In [74]: pdf['x2'] = pdf[['x']].set_index(pdf['idx']).sort_index()['x']

In [75]: pdf
Out[75]: 
    x  idx    x2
0   3    5   NaN
1   6    4  90.0
2   2    3   3.0
3   3    2   2.0
4  90    1   6.0

Note that because idx ranges from 1-5 while the DataFrame's index goes from 0-4, we end up with NaNs in the 0th element of column `x2.

Leaving aside that question, as far as the best replacement it depends on personal preference and how you want to trade off performance vs simplicity. You may get better mileage from one of the following depending on the number of rows and columns in your table:

# Option 1 (your code with an index reset)
In [76]: %timeit df['x2'] = df[['x']].set_index(df['idx']).sort_index()['x'].reset_index(drop=True)
1.14 ms ± 5.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# Option 2 (sort the values rather than setting an index)
In [77]: %timeit df['x2'] = df.sort_values(by='idx')['x'].reset_index(drop=True)
755 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# Option 3 (avoid sorting unnecessary columns, most valuable if you have a lot of columns)
In [78]: %timeit df['x2'] = df[['x', 'idx']].sort_values(by='idx')['x'].reset_index(drop=True)
882 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# Option 4 (more lines of code, but the most direct translation of what your original code did)
In [80]: %%timeit
    ...: s = df['x']
    ...: s.index = df['idx']
    ...: df['x2'] = s.sort_index().reset_index(drop=True)
    ...: 
    ...: 
812 µs ± 2.35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

lmeyerov · 2022-04-14T01:41:21Z

Excellent, thanks! 🙌

Our case is a couple of permutations we're tacking on for generating some sort of funny ~reverse ~indices, and primarily for export, so (I think) we're good

An interesting bit these kinds of examples show that it'd be interesting (R&D-level) to do some sort of call tree recording to identify and recommend fusion + sorting optimizations, basically what a SQL optimizer would do, and advise on those!

vyasr · 2022-04-14T18:22:43Z

You may be interested in expression templates, which are one way that e.g. linear algebra libraries do exactly what you're talking about to enable operator fusion and minimizing intermediates. You can do it in Python with lazy evaluation like what Dask and Vaex do. But there's no real way to offer both lazy and greedy evaluation at the same time (although you could feature flag it), so the best we could do with that is probably what you suggested: record ops and make recommendations. That would be cool though!

lmeyerov · 2022-04-14T23:53:05Z

we were playing with the idea of LINQ-based df expression trees -- a lot of our original core was rxjs -> dataframejs -- where the same relational structure can work for diff eval structures. we've moved to python as rapids grew though.

doing for dask_cudf DAGs is interesting. we have a lot of dgdf.map_partitions(lambda gdf: ...) though so not sure if would capture enough, so maybe still comes down to cudf expression tracking.

some big examples for us:

we basically never take advantage of sorted indexes yet have a lot of merges. an advisor would help!
we did a pass awhile back trying to identify extraneous copies, and i bet there are more again. we get hit by a lot of overhead, so these can add up.
there are likely some fusion & reuse opportunities

vyasr added the deprecation label Oct 27, 2021

vyasr added this to the Pandas API Alignment and Coverage milestone Oct 27, 2021

github-actions bot added the inactive-30d label Nov 26, 2021

bdice mentioned this issue Dec 21, 2021

Remove deprecated method Series.set_index. #9945

Merged

bdice self-assigned this Dec 21, 2021

rapids-bot bot closed this as completed in #9945 Jan 26, 2022

miguelusque mentioned this issue Mar 7, 2022

[DOC] set_index on series should has been removed from documentation #10390

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DEPRECATION] Remove Series.set_index #9541

[DEPRECATION] Remove Series.set_index #9541

vyasr commented Oct 27, 2021 •

edited

Loading

github-actions bot commented Nov 26, 2021

lmeyerov commented Apr 13, 2022

vyasr commented Apr 13, 2022

lmeyerov commented Apr 14, 2022

vyasr commented Apr 14, 2022

lmeyerov commented Apr 14, 2022 •

edited

Loading

[DEPRECATION] Remove Series.set_index #9541

[DEPRECATION] Remove Series.set_index #9541

Comments

vyasr commented Oct 27, 2021 • edited Loading

github-actions bot commented Nov 26, 2021

lmeyerov commented Apr 13, 2022

vyasr commented Apr 13, 2022

lmeyerov commented Apr 14, 2022

vyasr commented Apr 14, 2022

lmeyerov commented Apr 14, 2022 • edited Loading

vyasr commented Oct 27, 2021 •

edited

Loading

lmeyerov commented Apr 14, 2022 •

edited

Loading