Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDEP-5: NoRowIndex #49694

Merged
merged 16 commits into from
Mar 2, 2023
Merged

PDEP-5: NoRowIndex #49694

merged 16 commits into from
Mar 2, 2023

Conversation

MarcoGorelli
Copy link
Member

Initial draft of a proposal to have a "no index by default" mode

@pandas-dev/pandas-core @pandas-dev/pandas-triage would appreciate any thoughts / feedback, thanks 🙏

@MarcoGorelli MarcoGorelli marked this pull request as ready for review November 14, 2022 12:46
## Detailed Description

This would require 3 steps:
1. creation of a ``NoIndex`` object, which would be a subclass of ``RangeIndex`` on which
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why NoIndex be a subclass of RangeIndex? Does it just make things easier to implement?

IMO it feels like NoIndex should be an abstract-like class that Index and RangeIndex should subclass (and implement the "index-like" functionality

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or why not just None? (i.e. df.index is None)

In any case I agree with @mroeschke that it seems strange to make this a subclass of RangeIndex, since we explicitly don't want to have it act as a RangeIndex

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect that under the hood (e.g. at the Manager level) we'd still have an Index object (until/unless we refactor to get the indexes out of the Managers) and a NoIndex object makes sense there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that could be an internal implementation detail (to avoid having to check internally for if index is not None: too often). I think what matters here is what we want for the user. In the public .index property, we could convert a "NoIndex" object into None, if we think that is the better user API.

(although in the internals at least we don't do much with the index, we mostly pass it through, and passing through None could also work; I suppose it is mostly outside of the internals that a NoIndex object could make an initial implementation easier (less places to update))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding whether it could just be None - just as an example, if you have

ser = Series([1,2,3])
ser.loc[ser>1] = 4

then it'll call Index.get_loc.

loc = self.index.get_loc(key)

So rather than re-engineering _set_with_engine to not use self.index if index is None, I thought it'd be simpler to have some NoIndex object which overrides get_loc

Then, with get_loc, append, and join overridden, very little would need changing outside of the NoIndex class itself


As for whether NoIndex needs to subclass RangeIndex - perhaps not, but my reasoning was that for aligning and joining, it would act just like RangeIndex, but might just need to santise its inputs.

join, for example, could be:

if not isinstance(other, NoIndex):
   raise TypeError("Can't join NoIndex with Index")
if not len(self) == len(other):
   raise TypeError("Can't join NoIndex objects of different lengths")
return super().join(other, how=how, return_indexers=return_indexers, sort=sort)

Not sure I follow the suggestion about it being an abstract class:

  • where would the methods be implemented? If NoIndex is an abstract class, what would the concrete class be?
  • would Index then be defined as class Index(NoIndex, IndexOpsMixin, PandasObject)?

An alternative would be to have class NoIndex(IndexOpsMixin, PandasObject), though it would then require re-implementing many of the Index methods (like __len__)

I'll look further into this and will amend the proposal - thanks all for your comments!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow the suggestion about it being an abstract class:

Sorry I think "abstract-like" was probably a loaded term to use: I meant more akin to a skeleton class (that you alluded to) where you can define the necessary methods and raise a NotImeplementedError to "disable" indexing-like behavior

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So rather than re-engineering _set_with_engine to not use self.index if index is None, I thought it'd be simpler to have some NoIndex object which overrides get_loc

Then, with get_loc, append, and join overridden, very little would need changing outside of the NoIndex class itself

For me, I would say this is still an implementation detail. We could decide to internally use for example ._index instead of .index, which would be ensured to return an Index (sub)class instance, so that this can be implemented this way without too much changes (apart from adding the underscore). And this way the public property could still return None.

For the public API, the question is: would it be useful for the user to have a NoIndex object? (to be clear, I am not necessarily saying the answer is definitly "no", we should probably think through some use cases) How would the user make use of this?
And do we want that the user starts relying on the fact that they can still call all the Index methods on this NoIndex object?

(and for this PDEP, I don't think we necessarily have to agree on the exact implementation approach, but I would like to see a concrete API proposal for the public API)

@rhshadrach
Copy link
Member

rhshadrach commented Nov 14, 2022

Is there a proposed result for df.sum()?

print(pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}).sum())
# a     4
# b    12
# dtype: int64

I suppose this must no longer be a Series? Edit: Or is the result the values without any column labels?

Another one is transpose - which I think will have to lose the column labels. We use transpose internally at times to do axis=1 computations, but we can track the column labels there.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if its better to have this as a dataframe / series constructor argument rather than a global setting. Also @rhshadrach would probably have some good input as to how this aligns with some of the groupby parameters he has been working on

web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved
@MarcoGorelli
Copy link
Member Author

Thanks all for your comments!

Is there a proposed result for df.sum()?

Ah, that's a really good one. I hadn't tried that out, and I don't have a good solution.

It'd be nice to do

In [11]: df
Out[11]: 
 a  b  c
 1  4  7
 2  5  8
 3  6  9

In [12]: df.sum()
Out[12]: 
 a  b  c
 6 15 24

but I think that'd be too much of a breaking change from the regular pandas case.

Not keen on having:

In [14]: df.sum()
Out[14]: 
 6
15
24
dtype: int64

because here the row labels would be really meaningful, but also not keen on having and extra column like

In [19]: df.sum()
Out[19]: 
column  sum
     a    6
     b   15
     c   24

I can't see how to get this to conform with "3. Nobody should get an index unless they ask for one" without it being too big of a change to current behaviour.

I think I may need to drop "3." from this proposal, and only describe how a NoIndex object would look like

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 15, 2022

I can't see how to get this to conform with "3. Nobody should get an index unless they ask for one" without it being too big of a change to current behaviour.

I think I may need to drop "3." from this proposal, and only describe how a NoIndex object would look like

I would personally find that the better option.

Specifically for the reduction example: a DataFrame has named columns, so if those end up in the index in a certain operation, we use those names. So you could say that the user "asks for this", when it does an operation where the columns names are used for row labels.
Of course it could be avoided that those column names end up in the row labels, by returning an object of a different shape (DataFrame with 1 row vs Series), but considering to change the output type / shape of reductions seems indeed a bigger / separate discussion.

I personally also think that having an option to have to no index to start with (i.e, in IO, constructors, etc) is already useful in itself (and is a smaller scoped change than the "with this option enabled, no index is ever created", since we just have quite some functionality that introduce meaningful indices even when starting from a default range index / no index)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason we don't want to call this NoRowIndex? even internally

#as by definition we are allowing this for axis=0
(i guess we could allow no column index but then it's just a 2D numpy array)

@MarcoGorelli
Copy link
Member Author

NoRowIndex is probably better, thanks - column labels would always be there (and would be explicitly prohibited from being NoRowIndex)

Thanks everyone for your comments, really appreciate them - I'll revise the proposal trying to account for everyone's thoughts

For now it'll focus exclusively on the NoRowIndex object - discussions around as_index defaults and pd.set_option can happen separately

@mroeschke mroeschke added the PDEP pandas enhancement proposal label Nov 16, 2022
@WillAyd
Copy link
Member

WillAyd commented Nov 17, 2022

Do you have a point of view as to what should happen when objects without an index that are a different length get added? Do they truncate to the shortest length, raise an error, or do something else?

@MarcoGorelli
Copy link
Member Author

that'd raise - aiming to push a revision with plenty of examples tomorrow

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Nov 18, 2022

Thanks all for your reviews

I've pushed another commit with a first revision. I've resolved most of the comments as I think I've addressed them in the revision, either in the body or in the FAQ section

I've left open the discussion about whether NoRowIndex should be part of the public API, as I'm not sure about that yet

@MarcoGorelli MarcoGorelli changed the title PDEP0005 No-default-index mode PDEP0005 NoRowIndex Nov 18, 2022
Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One global comment. You may want to avoid contractions, i.e., instead of "wouldn't", say "would not", instead of "didn't", use "did not", etc.


The suggestion is to add a ``NoRowIndex`` class. Internally, it would act a bit like
a ``RangeIndex``, but some methods would be stricter. This would be one
step towards enabling users who don't want to think about indices to not have to.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"This would be one step towards enabling users who don't want to think about indices." I think the phrase "to not have to" is awkward.

web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved
@blazespinnaker
Copy link

blazespinnaker commented Nov 21, 2022

My one general issue is this might be throwing out the baby with the bath water, though I get why it might be easy to implement. It also feels quite dangerous as it will be inconsistent with pretty much how pandas is documented everywhere to behave. This will make stack overflow questions hard to answer as you'll need to know whether this option was set.

Row Indices are also broadly useful and declarative. The point of pandas over numpy is that it adds meta-data which controls its behavior. I can't imagine turning them off and retaining the value of Pandas.

The problem with indices occurs with implicit and rather arbitrary data alignment during various operations, such as sum above. In other data QLs you need to specify this behavior explicitly via a join clause.

Better I think would be to leave indices largely alone but have an option (which can be globally set) to turn it off/on with one of the following: strict mode / error on unaligned indices, off with ignore error (which this seems to be, and it seems very dangerous), or on (current behavior). The dangerous mode, I only include it to be friendly, but I'd advise against it. For stack overflow questions, an answer could often be - try setting strict mode to help debug your problem.

Data alignment may be confusing, but it's extremely important as it ensures that the user isn't incorrectly collating unaligned data, which can easily happen. Implicit behavior like this is often the source of confusion, especially when it's not very consistent.

My biggest complaint with how it currently automagically works and doesn't warn you when probably you made a mistake.

eg:

#49617

@MarcoGorelli
Copy link
Member Author

Thanks for taking a look - I'll respond to some of your points:

  • StackOverflow requires minimal reproducible examples (which'd need to include any options which were set), so this shouldn't be an issue there. There's already lot's of options which can be set in pandas, I've not seen them cause confusion on SO;
  • this wouldn't remove the value of pandas over numpy because column names would still be there, along with the operations such as groupby which aren't in numpy

This wouldn't be the dangerous mode you describe, because without row labels, there'd be no concept of misalignment. If you try to sum (or take .corr, or whatever) between two DataFrames without row labels, then:

  • either they're the same length, in which case the operation succeeds (and aligns by position);
  • or they're not, in which case it fails loudly and clearly.

My biggest complaint with how it currently automagically works and doesn't warn you when probably you made a mistake.

pandas doesn't know what you're trying to do, so it can't warn when you "probably made a mistake". An option to warn whenever indexes aren't aligned would be a separate discussion - if you want to take it further, please open a new issue and I'd be happy to take a look. Let's keep this issue's discussion focused on not having row labels.

@jbrockmendel
Copy link
Member

There's already lot's of options which can be set in pandas

These checks aren't expensive, but they do add up (xref #49771). I'm more concerned about the maintenance/testing burde they add up to.

@MarcoGorelli
Copy link
Member Author

Thanks all for your reviews and comments, I really appreciate it 🙏

After several discussions, it's become clear that there's not enough support for this to pass, at least in its current state. Hence, I've marked the proposal as 'rejected', and have explained the reasons why in a section at the bottom of the page.

As mentioned in the PDEP, this has still been a useful exercise which has resulted in 2 related proposals. It may be that the idea of not having an index can make it into pandas in the future, but it would require a more comprehensive set of changes to be made together. This particular version of the proposal, however, doesn't have a chance of making it.

No hard feelings, and thank you all for the time you dedicated to reviewing this 🤗

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIthdrawn status is okay with me. Thanks for exploring this @MarcoGorelli!

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there were some comments from my previous review that weren't addressed, and I added new ones here.

```python
In [12]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))

In [13]: df.loc[df['a']>1, 'b']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability, would be useful to have spaces around the > operator here and a few lines below


### Columns cannot be NoRowIndex

This proposal deals exclusively with letting users not have to think about
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to write "This proposal deals exclusively with allowing users to not have to think about"

then the following would all create a ``DataFrame`` with a ``NoRowIndex`` (as they
all call ``default_index``):

- ``df.reset_index(drop=True)``;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't df.reset_index() (i.e., with drop=False) also do the same thing? I see no need to include drop=True in this example.

**Q: Are indices not really powerful?**

**A:** Yes! And they're also confusing to many users, even experienced developers.
It's fairly common to see pandas code with ``.reset_index`` scattered around every
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, we often end up using .reset_index() because we have a MultiIndex, and there is some data in the index that we need to use in a computation. E.g.:

df = pd.DataFrame({"a":[1,2,3,4]}, index=pd.MultiIndex.from_product([[10, 20], [30, 40]], names=["x", "y"]))
df.reset_index().assign(z=lambda df: df["x"] + df["y"]).set_index(df.index.names)

So your comment about "It's fairly common to see pandas code with .reset_index() scattered around every other line" implies that the reason that people are doing reset_index() is because people are worried about indices and alignment.

It might be better to say "Often users are using .reset_index to avoid issues with indices and alignment."


**A:** This PDEP would only allow it via the constructor, passing
``index=NoRowIndex(len(df))``. A mode could be introduced to toggle
making that the default, but would be out-of-scope for the current PDEP.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inconsistent with what you wrote above about having the mode.no_row_index option

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how so? even when I mention it above, I make it clear that this PDEP only deals with the NoRowIndex class, and that this mode would be out-of-scope for this PDEP

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is text in the PDEP that refers to the mode that makes it seem like it is part of the proposal.

For example, lines 297-298 read:

Conversely, to get rid of the index, then (so long as one has enabled the ``mode.no_row_index`` option)
  one could simply do ``df.reset_index(drop=True)``.

When I read that, it seems like the mode is part of the proposal. So I think you may need to qualify all references to mode.no_row_index to describe what would happen if the mode.no_row_index option is not available.

Or make a decision to include mode.no_row_index as part of the proposal. (Which I think makes it easier to understand, IMHO)

**Q: Would ``.loc`` stop working?**

**A:** No. It would only raise if used for label-based selection. Other uses
of ``.loc``, such as ``df.loc[:, col_1]`` or ``df.loc[mask, col_1]``, would
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to use boolean_mask here to make it clear you're referring to that kind of mask

@MarcoGorelli
Copy link
Member Author

there were some comments from my previous review that weren't addressed

Apologies, and thanks for your patience + comments

@MarcoGorelli MarcoGorelli changed the title PDEP0005 LabellessIndex PDEP0005 NoRowIndex Jan 3, 2023
@datapythonista datapythonista changed the title PDEP0005 NoRowIndex PDEP-5: NoRowIndex Jan 18, 2023
@jorisvandenbossche jorisvandenbossche removed this from the 2.0 milestone Feb 8, 2023
@MarcoGorelli
Copy link
Member Author

I think I've addressed the comments - this is withdrawn anyway so arguably doesn't need to be perfect

@Dr-Irv is it OK to put this out of misery merge?

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to merge, since it is withdrawn

@Dr-Irv Dr-Irv merged commit a344dc9 into pandas-dev:main Mar 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PDEP pandas enhancement proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.