Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: feat(python)!: Change DataFrame.__getitem__ semantics #5085

Closed
wants to merge 3 commits into from

Conversation

matteosantama
Copy link
Contributor

@matteosantama matteosantama commented Oct 3, 2022

Closes #4924

BREAKING CHANGE

There are a few changes bundled in here.

Series.__getitem__

DataFrame.__getitem__

  • Semantically, the biggest change is that df[int | list[int]] no longer selects rows; it selects columns.
  • Documentation drastically improved.
  • Refactor so that we normalize all row accessors to pl.Series(dtype=UInt32 | UInt64) and all col accessors to list[pl.col].
    • I think overall this reduces the complexity of the code. It completely decouples the row selection from the column selection, whereas previously we had it all mixed into the body of DataFrame.__getitem__.

DataFrame.__setitem__

  • General clean-up (for example, we had the function annotated to accept a str but then immediately raised an error).
  • Loosened restriction for setting multiple columns -- we no longer require numpy to be installed.

IMO this function should be dumbed down even more. I would propose two simple rules

  1. df[X] = z is semantically equivalent to df[:, X] = z. This would then be consistent with DataFrame.__getitem__.
  2. df[X, Y] is semantically equivalent to df[Y][X] = z, ie. extract the Y column and then delegate to s[X] = z.

If this is agreeable, I am happy to add it to the PR. What we have now is a bit strange...

df["col"] = range(5)        # disallowed
df[["col"]] = [range(5)]    # allowed

Marking this as a draft to allow time for people to leave feedback!

@matteosantama
Copy link
Contributor Author

@ritchie46 perhaps it would be easier for you if I split this into two separate PRs, one dealing with __setitem__ and another for __getitem__. Let me know!

@ritchie46
Copy link
Member

@ritchie46 perhaps it would be easier for you if I split this into two separate PRs, one dealing with __setitem__ and another for __getitem__. Let me know!

Yes, please. :)

@ml31415
Copy link

ml31415 commented Feb 25, 2023

Honestly, I consider this as a very sad regression, and I'm surprised to see, that it got implemented in #5075 without any major discussion or a way for users to have this behaviour optionally. I suppose it was a largely overlooked change :(

In my opinion, this change goes against the self claimed goal right on page one of polars, with which I fully agree: It is claiming to be "Familiar from the start". Now what this change does, is to take away one of the most common methods of array access. Probably each of the millions numerical computing users out there got familiar with this kind of accessing arrays on day one of playing around with numpy and that like. Instead of being "familiar from the start", this is an strange attempt to reeducate millions of potential users.

What I saw so far from the discussion, there were two main reasons from this. Please correct me if I'm wrong:

  • Simplifying interface
  • Not in line with the preferred lazy architecture

Yes, the current __getitem__ implementation is surely a bit bloated. Though, if users are familiar with that "bloat" messing around with it only irritates potential new users. No one is harmed, if there are three more if-else clauses deep inside of the __getitem__ implementation. But if the above mentioned a[a > 0] is not working anymore, every single new user will have to scratch his head and refer to the manual or stack overflow, to get a simple array access sorted. The missing __setitem__ is a similar issue.

Does the user care, that it doesn't profit from lazy mechanics, parallelizing etc? I'd argue: If they were fine with it when they used numpy before, they'd still be fine with it now. And if they're not fine with it, THEN it's time to offer them another interface. But please don't force anyone! There are so many use cases in standard jupyter notebook day to day stuff, where this is perfectly fine as it was.

Arguably the by far most used interface to any array-like thing in python is the numpy interface. And mimicking the numpy-array behaviour and improving the interchangeability with it, would greatly benefit the acceptance of polars. I'd highly suggest to reevaluate, if mentioned reasons for the change really outweigh the amount of irritation to this large amount of current and potential new users!

@ritchie46
Copy link
Member

Can you please open a feature request regarding what you'd like changed. We can have a discussion there. This PR has been closed for a long time now.

@ml31415
Copy link

ml31415 commented Feb 25, 2023

Ok, will do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change DataFrame.__getitem__ to select columns on single input
3 participants