Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Should Index be made opt-in? #48880

Closed
MarcoGorelli opened this issue Sep 30, 2022 · 40 comments
Closed

API Should Index be made opt-in? #48880

MarcoGorelli opened this issue Sep 30, 2022 · 40 comments
Labels
API Design Index Related to the Index class or subclasses Needs Discussion Requires discussion from core team before further action

Comments

@MarcoGorelli
Copy link
Member

One of the points raised in pandas standardisation doc, in the context of the Consortium for Python Data API Standards, is about making indices opt-in

I've started putting together a write-up on this: https://hackmd.io/@mntOORP3TCesJJyvg-IdFQ/B1EqHfQfo

TLDR: the suggestion is to have a DefaultIndex which would always go from 0 to the length of the DataFrame, and then loc and iloc would be aligned. I think this'd be too big of a breaking change in pandas, and that there'd be a serious risk of silently ruining people's analyses/models. I'd be open to either introducing DefaultIndex but not have it be the default, or to make it the default but not give it a loc method

@MarcoGorelli MarcoGorelli added API Design Needs Discussion Requires discussion from core team before further action labels Sep 30, 2022
@phofl
Copy link
Member

phofl commented Sep 30, 2022

I agree, this is too big to change. Also, I am not convinced that this would be an improvement in general. This is what ignore_index was introduced for? This has lots of subtle implications in users code that introduce value dependent behavior.

Did you look into adding this as a metadata flag? Like with allow_duplicates? This would make it opt in and could be handled in something like __finalize__?

@rhshadrach
Copy link
Member

The doc linked in the OP talks about certain libraries only supporting "default indexing". Could this be recast as libraries only supporting iloc? If one restricts usage of pandas to only using iloc, then pandas behaves as if it has "default indexing" today.

Are there other ops like loc that there is no iloc equivalent?

@phofl
Copy link
Member

phofl commented Sep 30, 2022

A couple that come to my mind:

  • GroupBy is setting the group columns as index for example.
  • concat takes the index into account when setting axis=1.

@rhshadrach
Copy link
Member

rhshadrach commented Sep 30, 2022

Doh - of course; anything that uses alignment will immediately be incompatible. For groupby, I think one can just declare something like as_index=True is not supported when there is a default index. There will be edge cases here that break most certainly (probably in groupby.apply), but I think the most common cases will be okay.

If we created a default index object, I think we'd need to implement alignment for such objects via ilocs. For simplicity, let's say alignment between a default index and non-default index raises. If the internals use of alignment were sufficiently abstracted, maybe this could work, but I'd have to imagine that isn't the case currently and everything would just break.

Edit: And to make them sufficiently abstracted I imagine would be quite a perf hit.

@phofl
Copy link
Member

phofl commented Sep 30, 2022

Alignment is a good point. Did not think about this. I don't really see the benefit on our side or for our users here

@shwina
Copy link
Contributor

shwina commented Sep 30, 2022

If we created a default index object, I think we'd need to implement alignment for such objects via ilocs. For simplicity, let's say alignment between a default index and non-default index raises.

Perhaps it's helpful to think about this request as less of a DefaultIndex, and more of a NoIndex? Any situation that would normally require index alignment (e.g., Series([1, 2]) + Series([1, 2, 3])) would simply raise because of a mismatch in size.

Perhaps a config option would control whether objects are created with an index or not, so that you never have a situation where "default" and "nondefault" indexes coexist.

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Sep 30, 2022

Would the idea then be to raise if index_col is used in read_csv? And to raise for all the other options which involve indices (like as_index mentioned above)?

I'm worried this would add a lot of complexity, and I'm also not really clear what the benefit to users would be (though I'll concede that I've not attended an API Consortium meeting yet, and that maybe if I'd followed the whole discussion then I'd see it)

@shwina
Copy link
Contributor

shwina commented Sep 30, 2022

Thanks @MarcoGorelli and others! While I'm happy to continue discussing the technical details - which there are many to iron out - I feel the discussion would be greatly benefited by some additional context for this issue.

First, I should clarify that I agree on both points below:

  • eliminating indexes entirely from Pandas is a non-starter in any reasonable time frame.
  • Allowing Pandas users to opt-in (or even opt-out) of indexes is not immediately beneficial to them

However, I'm hoping we can convince Pandas maintainers to make Pandas "index unaware" by default (whatever that looks like), because it would benefit the ecosystem of DataFrame-like libraries and their users in the longer term.

The goal of the Consortium is to develop a "standard" DataFrame API, based on Pandas, that all Pandas-like Python libraries can support. This will enable users and downstream packages to switch out one library for another. A recent example of success of this idea from the Array world can be seen here.

Many Pandas-like libraries today either do not support indexes at all, or support a subset of index functionality. Dask, for example does not support MultiIndex. Speaking as a cuDF developer, we do support both Index and MultiIndex. But our code for doing so is clunky, and it's difficult/impossible to match Pandas' behaviour 100% of the time.

If we could drop indexes from our API and still match Pandas' default behaviour, we would do so in a heartbeat. I believe many other library maintainers feel the same way.

Opt-in v/s opt-out

The success of a "standard API" depends on matching the default behaviour of Pandas, which is why the request is framed as making Indexes opt-in, rather than opt-out.

@MarcoGorelli
Copy link
Member Author

Thanks for sharing context! To clarify:

If we could drop indexes from our API and still match Pandas' default behaviour, we would do so in a heartbeat

would you then also drop .loc? Because my biggest issue is with the "proposal is to align loc and iloc" part

@shwina
Copy link
Contributor

shwina commented Sep 30, 2022

would you then also drop .loc? Because my biggest issue is with the "proposal is to align loc and iloc" part

Yes, the idea is that "standard API" would not include .loc, but Pandas could of course support a superset of the standard.

@MarcoGorelli
Copy link
Member Author

OK thanks that makes more sense to me then - removing loc by default (rather than changing its behaviour to be aligned with iloc, which is what I'd originally understood this it be about) would be a safer breaking change

I'll rework the proposal in light of this

@jreback
Copy link
Contributor

jreback commented Oct 3, 2022

i am pretty skeptical here that you can change this w/o massive changes - nor do i think you should

but happy to see a sketch

@TomAugspurger
Copy link
Contributor

I sympathize with the task of the dataframe standard, and that row labels make things complicated, but do we have actual users asking for this who wouldn't be better off thinking through their data model? Most (all?) of the times I've heard requests for not having row labels, the user would have been better of thinking through whether they have one or more columns that "should" be used as the index.


And to confirm the intent of the proposal: are you proposing to make indexes opt-in or opt-out? I'd probably be OK with opt-out (users can explicitly say DataFrame(..., index=pd.DefaultIndex()) or whatever we call it. I worry bit about the complexity this will add to the library, but maybe it's worth it. But I don't think we should change the default of RangeIndex to become this non-aligning / resetting index.

@MarcoGorelli
Copy link
Member Author

I think this comment by @shwina clarifies it best

If we could drop indexes from our API and still match Pandas' default behaviour

So, the request is speficically for Index to be opt-in, because otherwise, no-Index wouldn't be the default behaviour.

The options seem to be:

  1. pandas makes Index opt-in. This'd be a breaking change many pandas maintainers don't seem enthused by;
  2. pandas makes Index opt-out. This would still add lot's of complexity to pandas, so I'm not seeing much enthusiasm from pandas devs about this either;
  3. this is declared as too impactful to be changed in pandas, and is a point where pandas would deviate from the consortium's DataFrame API

@shwina could you please clarify - would you consider option 2 an improvement on the status-quo? Would it be good enough for cuDF to drop Index, or would options 2 and 3 be indifferent to you?

@shwina
Copy link
Contributor

shwina commented Oct 3, 2022

Thanks for the summary, @MarcoGorelli!

  • I think option 1 would be ideal (of course ;) but failing which, option 2 would still be great to see. At least it enables some alignment between Pandas and the standard, and would allow the standard to be tested against Pandas.

  • However, can we leave option 1 on the table for now? I still feel it is worth sketching out exactly how changing the default from RangeIndex to DefaultIndex would break the end user. So far, we have identified, e.g., groupby, concat, and I/O functions. I'd be happy to help here @MarcoGorelli - perhaps we can iterate on the hackmd doc?

@MarcoGorelli
Copy link
Member Author

perhaps we can iterate on the hackmd doc?

Sure, go ahead, I think I've given edit permissions

In terms of breakage - .loc is used in nearly half of the most upvoted notebooks on Kaggle, so there'd be a lot of code that'd no longer work as-is. Wouldn't be off-the-table, if the fix were clear and the error message clear and helpful, just highlighting that indexes are quite widely used

@TomAugspurger
Copy link
Contributor

I don't want to completely rule out option 1 either, but I suspect it's going to be difficult to get widespread agreement on it. Row labels are useful! It might be OK for a standard to ignore / defer them (though IMO it'll hurt the adoption of the standard by users even if makes the standard easier to implement).

But in general I think that pandas users are better off finding natural row labels for their data than thinking about how to get rid of it.

@phofl
Copy link
Member

phofl commented Oct 3, 2022

  • every Index related argument would change
  • Everything that uses alignment would change in some way (this is a lot, df1+df2 for example, would change)
  • rolling operations
  • reshaping would change (merge, pivot, ...)

Honestly, it's hard to come up with stuff that would not change and that is more than a single function. You'd probably have to rewrite everything, I don't think that there is a code base out there that does not use any index related functionality.

@mroeschke mroeschke added the Index Related to the Index class or subclasses label Oct 3, 2022
@mroeschke
Copy link
Member

Linking a related prior discussion on optional indexes (but for the "pandas2" rewrite): wesm/pandas2#17

And a point of comparison, how xarray transitioned to optional indexes pydata/xarray#1017. IIUC:

  • Transition was opt-in to an index (Option 1)
  • Alignment between objects where one or both don't have an index is based on position and fails if 2 objects don't have the same length (?)
  • loc falls back to positional indexing (for integer inputs) for objects without an index
  • Helper method get_index to return an index (RangeIndex(n)) to objects without one
  • Noted as a major breaking change

cc @shoyer for any correction or any further experiences with xarray transitioning to an optional index.

@MarcoGorelli
Copy link
Member Author

Thanks, that's useful

What I have the most difficulty with from that list is

loc falls back to positional indexing (for integer inputs) for objects without an index

, really think loc shouldn't exist for the case when there's no index. Worst thing is if users aren't aware of the change and use loc expecting it to do something different, even if noted as a major breaking change

But I'm warming to the idea of there being an Index-free mode, so long as any breakage is loud and clear

@shoyer
Copy link
Member

shoyer commented Oct 12, 2022

loc falls back to positional indexing (for integer inputs) for objects without an index

, really think loc shouldn't exist for the case when there's no index. Worst thing is if users aren't aware of the change and use loc expecting it to do something different, even if noted as a major breaking change

This was mostly a gesture towards backwards compatibility. In principle, I agree -- we could (and maybe should?) make .loc raise errors when there is no index available. This would certainly require a deprecation period.

@jorisvandenbossche
Copy link
Member

I am personally quite interested in seeing this explored and discussed. I think it could be an interesting improvement to the ease of use of pandas in many cases.

I want to say something about the framing of the discussion, though. Personally, I am in the first place interested in this discussion for pandas itself. To put it a bit bluntly: for this discussion, I don't care about the fact that other dataframe libraries have a hard time implementing Index, or that the Consortium would like a standard without a row index. I personally have been pondering about this from time time before the consortium existed, as have others (eg wesm/pandas2#17), and we should discuss for pandas' sake. If such a change in pandas would help the broader ecosystem (the other dataframe libraries) and the consortium / standard, then that's a good additional argument of course (and the consortium has also certainly been useful to trigger this discussion), but IMO it shouldn't be the primary argument.
To be clear, I do care about other dataframe libraries and the consortium ;), I think we should just not use this as a main motivation for a potential change, but rather because this is a good change for our users (if we decide we indeed think it is). For example, if at some point this discussion gets summarized in a PDEP, it should start from why this would be useful for pandas users, not from "this is raised in the consortium" IMO.

I think this part of "why would this be useful for our users" is also a bit missing from the discussion for now (unless that is clear already? but then it would still be useful to explicitly spell out)

The hackmd doc that Marco shared has a brief paragraph on this:

A benefit to pandas users is that they could avoid a lot of unnecessary reset_indexs in people’s code. It’s a fairly common operation. It would also avoid unexpected errors because of duplicates in the index, e.g. https://stackoverflow.com/questions/27236275/what-does-valueerror-cannot-reindex-from-a-duplicate-axis-mean .

But I think we should try to expand on this.
For me personally, when not explicitly setting a column as a meaningful index (but having a default RangeIndex), I have often found our current behaviour getting in the way. After doing some operation (eg a filtering operation) you get an index with basically random numbers that are not useful (and then often followed by a reset_index(drop=True) if you want to get rid of those, as mentioned above).

Another example where are default alignment can give surprising results is something like s[:-1] == s[1:](we actually raise an error for this to avoid unexpected results using alignment: https://pandas.pydata.org/docs/dev/whatsnew/v0.19.0.html#comparison-operators), where you have to resort to .values to get what you want.

#34576 raises the issue of pandas by default writing the index in file output, even if that index was meaningless (pandas of course doesn't know whether that's the case, but in my experience a default RangeIndex often is)

[@TomAugspurger] do we have actual users asking for this who wouldn't be better off thinking through their data model? Most (all?) of the times I've heard requests for not having row labels, the user would have been better of thinking through whether they have one or more columns that "should" be used as the index.

Tom, could you expand a bit on this? I personally find row labels very useful, in specific cases where I manually set an existing column as the index (timestamps, some actual existing identifier, ..). But I almost never found the index very useful in case of the default RangeIndex (or converted to integer index once you do some operation such as a filter). It might be that sometimes there could have been a column in my data that could be used as index, but 1) I am not sure that is always the case, and 2) even if there is one, it doesn't always make your life easier to set it (if you don't explicitly make use of functionality that relies on it).


really think loc shouldn't exist for the case when there's no index. Worst thing is if users aren't aware of the change and use loc expecting it to do something different, even if noted as a major breaking change

One note about .loc: this is often used when filtering data and/or selecting columns (eg df.loc[ mask, [...]]). For this specific case, .loc doesn't make any use of the fact that it is labeled based (assuming that the mask and df follow the alignment rules, eg both no index and of the same length), and there is not really any ambiguity in what the result should be.
So for backwards compatibility reasons, I think it could help a lot to have .loc still work for those cases (that would keep more code working as is). And we could still discuss whether actual label-based selection (passing a single label or sequence of labels) should raise an error or fallback to integer as a separate case.

@TomAugspurger
Copy link
Contributor

in specific cases where I manually set an existing column as the index (timestamps, some actual existing identifier, ..). But I almost never found the index very useful in case of the default RangeIndex

Yeah, that matches my experience. I was thinking primarily about "Index vs. no Index", in which case I'll always push people a bit to think about their data model. But if we're thinking about improving the ergonomics of using a DataFrame with just the default RangeIndex, then yes, I think that auto resetting RangeIndex is worth exploring.

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Oct 13, 2022

Lots of really good points here, thanks for the discussion

loc for boolean-based indexing: would it be possible to use iloc for that? I find the simplicity of an error message like Cannot use .loc with DefaultIndex - please use .iloc instead appealing. It's not currently allowed, but would there be anything blocking this?

I'm -1 on auto-resetting RangeIndex though as it'd silently break the example I put in the writeup:

x_train = df[df['feature']>0].copy()
y_train = df.loc[x_train.index, 'target']

Even if df has a RangeIndex which isn't meaningful to the user, using this index to make sure labels and features are aligned is a common-enough pattern I've seen in data science code


Something to address would be methods which default to setting an index without a way to opt-out. value_counts is an example, and one after which users'd often call reset_index() - example

Perhaps to get to opting out of an index, we'd need to ensure that no method sets an index by default. This'd mean going through a deprecation cycle to make as_index=False the default in .groupby, to add such a argument to value_counts.

@jorisvandenbossche
Copy link
Member

But if we're thinking about improving the ergonomics of using a DataFrame with just the default RangeIndex, then yes, I think that auto resetting RangeIndex is worth exploring.

Yes, for me this discussion is about the above (being able to set an index and use it is super useful (and an important part of pandas), we are not planning to remove that, that's not the discussion, but can the default experience (currently RangeIndex) be improved?)

And thinking about this as a "auto-resetting RangeIndex" is an interesting way, as it would be very similar as "no index" option (it are the cases where it wouldn't be exactly the same that we need to further explore and discuss, eg .loc, alignment, etc)

loc for boolean-based indexing: would it be possible to use iloc for that?

It would be possible in some cases, but that doesn't help to reduce the amount of code that this would break.
And it does not work in the case where you want to select columns at the same time (for example, the most voted python notebook on kaggle includes df_train.loc[df_train['TotalBsmtSF']>0,'HasBsmt'] = 1; if df_train has no index, there is nothing ambiguous about this line, and it would avoid a part of the breaking change to keep something like this working)

@MarcoGorelli
Copy link
Member Author

Right, good point - OK with keeping .loc then, so long as it errors if the first argument is an index

The simplest way to drive the conversation forward might be to just try implementing this as a POC and seeing what we run into - I'll do some work on this

@shwina
Copy link
Contributor

shwina commented Oct 13, 2022

The simplest way to drive the conversation forward might be to just try implementing this as a POC and seeing what we run into - I'll do some work on this

I'm happy to be of help here if you need a collaborator, or even just a sounding board. Logistically, one place we could chat is the RAPIDS slack, but I'm open to other suggestions as well.

@phofl
Copy link
Member

phofl commented Oct 13, 2022

but can the default experience (currently RangeIndex) be improved?)

In general this sounds good, but the details are tricky. This works as long as you don't get any meaningful data as index (groupby, stack, pivot, ...; anything that drops rows is not straightforward too, because you can use the index of the result to select the rows from another object for example). So if those operations don't return an Index anymore, this causes lots of trouble in places that might not be obvious.

@MarcoGorelli do you want to change RangeIndex or implement a new Index object?

@MarcoGorelli
Copy link
Member Author

New one, I'd feel really uneasy about changing RangeIndex, mainly because of the example given this comment (unless there's a way around it that wouldn't silently break code?)

@phofl
Copy link
Member

phofl commented Oct 13, 2022

Yeah that is what I was referring to basically, its not only about loc though, drop_duplicates has a similar effect for example

@jorisvandenbossche
Copy link
Member

I think there are several good reasons to have it as a separate concept from RangeIndex (while it is useful to think about is as a different RangeIndex). Even if we would eventually want to replace the default behaviour from RangeIndex to this new behaviour, we still need a way for people to try this out without breaking existing RangeIndex-based code.
And it is also just a sufficiently different concept than RangeIndex to warrant a different name and object.

Although it also doesn't necessarily have to be a different object (like DefaultIndex or NoIndex), it could also be implemented by allowing the .index to be None (and updating code paths to handle this or error for it if the code path requires an index)

[@phofl] This works as long as you don't get any meaningful data as index (groupby, stack, pivot, ...; anything that drops rows is not straightforward too, because you can use the index of the result to select the rows from another object for example). So if those operations don't return an Index anymore, this causes lots of trouble in places that might not be obvious.

I think that something like pivot could still return a DataFrame with an index. And in that case you again have an index, and everything keeps working as it is. We are not removing the concept of an index, so it is fine that some methods produce a result with an index.
Marco raises a good point above that if we would want to move to "no index" by default instead of RangeIndex, it might make sense to think about if more methods can have the option or default to not using an index in the result (like groupby currently has a as_index=True/False). But that's still somewhat separate (or at least can have varying decisions depending on the exact method)

Logistically, one place we could chat is the RAPIDS slack, but I'm open to other suggestions as well.

@shwina we nowadays also have a pandas slack! (https://mail.python.org/pipermail/pandas-dev/2022-October/001525.html) You are very welcome to join there as well!

@jbrockmendel
Copy link
Member

I'm with Jeff as skeptical-but-open-to-it. Making index opt-in rather than opt-out is probably a non-starter for the near future.

@mroeschke
Copy link
Member

As of now, I'm +1 for optional Indexes but favor opt-out Indexes

@MarcoGorelli
Copy link
Member Author

Thanks everyone for the discussion

I've amended the write-up to include a proposal for how this might happen: https://hackmd.io/JPWJqwc1SZKz_Zaxe9MZRQ

In short:

  • if you construct a DataFrame without an Index, then it'll stay without an index unless you explicitly call set_index
  • Index would be opt-out in some version of pandas 2.x.0, and depending on how that goes, it could be made opt-in in 3.0.0.

How to clearly communicate such a breaking change in 3.0.0 would depend on decisions taken when making the index opt-out, so I'd rather have such a conversation now rather than once it's already opt-out

@mroeschke
Copy link
Member

I've amended the write-up to include a proposal for how this might happen: https://hackmd.io/JPWJqwc1SZKz_Zaxe9MZRQ

How to opt out of Index?

How should a user opt-out of an object that has already has an index?
df_with_index.set_index(None/pd.NoIndex)
df_with_index.index = pd.NoIndex

Also if we move forward with the global config option, maybe df_with_index.reset_index(drop=True) completely eliminates the index in the result.

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Oct 14, 2022

Good point, thanks - I've added an example of that. I think df_with_index.reset_index(drop=True) (as long as the option mode.index is False) and df_with_index.set_index(None) should be fine

In [4]: df
Out[4]: 
            a  b
2000-01-01  1  4
2000-01-02  2  5
2000-01-03  3  6

In [5]: df.reset_index(drop=True)
Out[5]: 
 a  b
 1  4
 2  5
 3  6

There's also the question of whether the no-index should appear in the repr, and I think it looks better if it doesn't (like this it also signals to users that there's something different about this DataFrame the ones they're used to)

@jbrockmendel
Copy link
Member

From an annotation perspective, it might be nice to make index-less objects a different class (i've seen "Table" mentioned elsewhere), which I think DataFrame could then subclass. This would require something like #48126

@jreback
Copy link
Contributor

jreback commented Oct 15, 2022

While i am sympathetic to adding a NoIndex there are some concerns

  • this would fundamentally change pandas and i don't think this would ever be possible as opt-out;
    it would always need to be opt-in

  • we already have a lot going on in pandas that are systematic changes : copy on write and array manager

i would be more amenable to starting something new if we finish things first (which practically would mean killing array manager)

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Oct 31, 2022

I've gone through with another iteration of the writeup: https://hackmd.io/JPWJqwc1SZKz_Zaxe9MZRQ

TL;DR: The proposal is to have:

I think #49069 is uncontroversial, so if it's OK I'll let the commenter know that it's OK to get started on it if they're still interested

EDIT: still not sure if we really want #49069 , and it may even be orthogonal to this issue, so holding off work on that til there's been further discussion

@MarcoGorelli
Copy link
Member Author

closing, as discussion has moved to #49694

I'll update the PDEP soon, but in short, the proposal in its current state wouldn't make it as the default, and there's reluctance to add it as non-default because of the testing and maintenance burden that adding extra options entails

I still think it should be possible to do something here, but it'd require a more comprehensive set of changes to be made at the same time. We're talking maybe about pandas 4.0.0 here, it'd be quite a long way away and would take considerable thought and discussion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Index Related to the Index class or subclasses Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

10 participants