Draft: feat(python)!: Change `DataFrame.getitem` semantics #5085

matteosantama · 2022-10-03T15:01:51Z

Closes #4924

BREAKING CHANGE

There are a few changes bundled in here.

`Series.getitem`

Fully removes support for boolean masks. We had previously added a deprecation warning for this in feat(python): deprecate boolean mask for Series indexing #5075
Adds fast-path for slice(None, None, None).

`DataFrame.getitem`

Semantically, the biggest change is that df[int | list[int]] no longer selects rows; it selects columns.
Documentation drastically improved.
Refactor so that we normalize all row accessors to pl.Series(dtype=UInt32 | UInt64) and all col accessors to list[pl.col].
- I think overall this reduces the complexity of the code. It completely decouples the row selection from the column selection, whereas previously we had it all mixed into the body of DataFrame.__getitem__.

`DataFrame.setitem`

General clean-up (for example, we had the function annotated to accept a str but then immediately raised an error).
Loosened restriction for setting multiple columns -- we no longer require numpy to be installed.

IMO this function should be dumbed down even more. I would propose two simple rules

df[X] = z is semantically equivalent to df[:, X] = z. This would then be consistent with DataFrame.__getitem__.
df[X, Y] is semantically equivalent to df[Y][X] = z, ie. extract the Y column and then delegate to s[X] = z.

If this is agreeable, I am happy to add it to the PR. What we have now is a bit strange...

df["col"] = range(5)        # disallowed
df[["col"]] = [range(5)]    # allowed

Marking this as a draft to allow time for people to leave feedback!

matteosantama · 2022-10-03T16:13:22Z

@ritchie46 perhaps it would be easier for you if I split this into two separate PRs, one dealing with __setitem__ and another for __getitem__. Let me know!

ritchie46 · 2022-10-03T16:15:29Z

@ritchie46 perhaps it would be easier for you if I split this into two separate PRs, one dealing with __setitem__ and another for __getitem__. Let me know!

Yes, please. :)

ml31415 · 2023-02-25T10:50:41Z

Honestly, I consider this as a very sad regression, and I'm surprised to see, that it got implemented in #5075 without any major discussion or a way for users to have this behaviour optionally. I suppose it was a largely overlooked change :(

In my opinion, this change goes against the self claimed goal right on page one of polars, with which I fully agree: It is claiming to be "Familiar from the start". Now what this change does, is to take away one of the most common methods of array access. Probably each of the millions numerical computing users out there got familiar with this kind of accessing arrays on day one of playing around with numpy and that like. Instead of being "familiar from the start", this is an strange attempt to reeducate millions of potential users.

What I saw so far from the discussion, there were two main reasons from this. Please correct me if I'm wrong:

Simplifying interface
Not in line with the preferred lazy architecture

Yes, the current __getitem__ implementation is surely a bit bloated. Though, if users are familiar with that "bloat" messing around with it only irritates potential new users. No one is harmed, if there are three more if-else clauses deep inside of the __getitem__ implementation. But if the above mentioned a[a > 0] is not working anymore, every single new user will have to scratch his head and refer to the manual or stack overflow, to get a simple array access sorted. The missing __setitem__ is a similar issue.

Does the user care, that it doesn't profit from lazy mechanics, parallelizing etc? I'd argue: If they were fine with it when they used numpy before, they'd still be fine with it now. And if they're not fine with it, THEN it's time to offer them another interface. But please don't force anyone! There are so many use cases in standard jupyter notebook day to day stuff, where this is perfectly fine as it was.

Arguably the by far most used interface to any array-like thing in python is the numpy interface. And mimicking the numpy-array behaviour and improving the interchangeability with it, would greatly benefit the acceptance of polars. I'd highly suggest to reevaluate, if mentioned reasons for the change really outweigh the amount of irritation to this large amount of current and potential new users!

ritchie46 · 2023-02-25T12:25:16Z

Can you please open a feature request regarding what you'd like changed. We can have a discussion there. This PR has been closed for a long time now.

ml31415 · 2023-02-25T17:51:26Z

Ok, will do.

matteosantama added 3 commits October 3, 2022 07:37

Remove Series.__getitem__ support for boolean mask

21ea38f

Fast-path for Series __getitem__

7eb8f13

Improve DataFrame.__setitem__ and change DataFrame.__getitem__ semantics

bacbdcd

matteosantama force-pushed the df-getitem-v2 branch from 39dc928 to bacbdcd Compare October 3, 2022 16:11

matteosantama closed this Oct 3, 2022

matteosantama mentioned this pull request Oct 3, 2022

feat(python)!: Change DataFrame.__getitem__ semantics #5087

Closed

ml31415 mentioned this pull request Feb 26, 2023

Interoperability improvements #7200

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Draft: feat(python)!: Change `DataFrame.getitem` semantics #5085

Draft: feat(python)!: Change `DataFrame.getitem` semantics #5085

matteosantama commented Oct 3, 2022 •

edited

Loading

matteosantama commented Oct 3, 2022

ritchie46 commented Oct 3, 2022

ml31415 commented Feb 25, 2023

ritchie46 commented Feb 25, 2023

ml31415 commented Feb 25, 2023

Draft: feat(python)!: Change DataFrame.__getitem__ semantics #5085

Draft: feat(python)!: Change DataFrame.__getitem__ semantics #5085

Conversation

matteosantama commented Oct 3, 2022 • edited Loading

Series.__getitem__

DataFrame.__getitem__

DataFrame.__setitem__

matteosantama commented Oct 3, 2022

ritchie46 commented Oct 3, 2022

ml31415 commented Feb 25, 2023

ritchie46 commented Feb 25, 2023

ml31415 commented Feb 25, 2023

Draft: feat(python)!: Change `DataFrame.getitem` semantics #5085

Draft: feat(python)!: Change `DataFrame.getitem` semantics #5085

matteosantama commented Oct 3, 2022 •

edited

Loading

`Series.getitem`

`DataFrame.getitem`

`DataFrame.setitem`