PDEP-5: NoRowIndex #49694

MarcoGorelli · 2022-11-14T12:46:11Z

Initial draft of a proposal to have a "no index by default" mode

@pandas-dev/pandas-core @pandas-dev/pandas-triage would appreciate any thoughts / feedback, thanks 🙏

mroeschke · 2022-11-14T20:02:16Z

web/pandas/pdeps/0005-no-default-index-mode.md

+## Detailed Description
+
+This would require 3 steps:
+1. creation of a ``NoIndex`` object, which would be a subclass of ``RangeIndex`` on which


Curious why NoIndex be a subclass of RangeIndex? Does it just make things easier to implement?

IMO it feels like NoIndex should be an abstract-like class that Index and RangeIndex should subclass (and implement the "index-like" functionality

Or why not just None? (i.e. df.index is None)

In any case I agree with @mroeschke that it seems strange to make this a subclass of RangeIndex, since we explicitly don't want to have it act as a RangeIndex

I expect that under the hood (e.g. at the Manager level) we'd still have an Index object (until/unless we refactor to get the indexes out of the Managers) and a NoIndex object makes sense there.

Yes, but that could be an internal implementation detail (to avoid having to check internally for if index is not None: too often). I think what matters here is what we want for the user. In the public .index property, we could convert a "NoIndex" object into None, if we think that is the better user API.

(although in the internals at least we don't do much with the index, we mostly pass it through, and passing through None could also work; I suppose it is mostly outside of the internals that a NoIndex object could make an initial implementation easier (less places to update))

Regarding whether it could just be None - just as an example, if you have

ser = Series([1,2,3]) ser.loc[ser>1] = 4

then it'll call Index.get_loc.

pandas/pandas/core/series.py

Line 1132 in dbb2adc

loc = self.index.get_loc(key)

So rather than re-engineering _set_with_engine to not use self.index if index is None, I thought it'd be simpler to have some NoIndex object which overrides get_loc

Then, with get_loc, append, and join overridden, very little would need changing outside of the NoIndex class itself

As for whether NoIndex needs to subclass RangeIndex - perhaps not, but my reasoning was that for aligning and joining, it would act just like RangeIndex, but might just need to santise its inputs.

join, for example, could be:

if not isinstance(other, NoIndex): raise TypeError("Can't join NoIndex with Index") if not len(self) == len(other): raise TypeError("Can't join NoIndex objects of different lengths") return super().join(other, how=how, return_indexers=return_indexers, sort=sort)

Not sure I follow the suggestion about it being an abstract class:

where would the methods be implemented? If NoIndex is an abstract class, what would the concrete class be?

would Index then be defined as class Index(NoIndex, IndexOpsMixin, PandasObject)?

An alternative would be to have class NoIndex(IndexOpsMixin, PandasObject), though it would then require re-implementing many of the Index methods (like __len__)

I'll look further into this and will amend the proposal - thanks all for your comments!

Not sure I follow the suggestion about it being an abstract class:

Sorry I think "abstract-like" was probably a loaded term to use: I meant more akin to a skeleton class (that you alluded to) where you can define the necessary methods and raise a NotImeplementedError to "disable" indexing-like behavior

So rather than re-engineering _set_with_engine to not use self.index if index is None, I thought it'd be simpler to have some NoIndex object which overrides get_loc

Then, with get_loc, append, and join overridden, very little would need changing outside of the NoIndex class itself

For me, I would say this is still an implementation detail. We could decide to internally use for example ._index instead of .index, which would be ensured to return an Index (sub)class instance, so that this can be implemented this way without too much changes (apart from adding the underscore). And this way the public property could still return None.

For the public API, the question is: would it be useful for the user to have a NoIndex object? (to be clear, I am not necessarily saying the answer is definitly "no", we should probably think through some use cases) How would the user make use of this?
And do we want that the user starts relying on the fact that they can still call all the Index methods on this NoIndex object?

(and for this PDEP, I don't think we necessarily have to agree on the exact implementation approach, but I would like to see a concrete API proposal for the public API)

web/pandas/pdeps/0005-no-default-index-mode.md

rhshadrach · 2022-11-14T22:10:12Z

Is there a proposed result for df.sum()?

print(pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}).sum())
# a     4
# b    12
# dtype: int64

I suppose this must no longer be a Series? Edit: Or is the result the values without any column labels?

Another one is transpose - which I think will have to lose the column labels. We use transpose internally at times to do axis=1 computations, but we can track the column labels there.

WillAyd

Wondering if its better to have this as a dataframe / series constructor argument rather than a global setting. Also @rhshadrach would probably have some good input as to how this aligns with some of the groupby parameters he has been working on

web/pandas/pdeps/0005-no-default-index-mode.md

MarcoGorelli · 2022-11-15T14:15:30Z

Thanks all for your comments!

Is there a proposed result for df.sum()?

Ah, that's a really good one. I hadn't tried that out, and I don't have a good solution.

It'd be nice to do

In [11]: df
Out[11]: 
 a  b  c
 1  4  7
 2  5  8
 3  6  9

In [12]: df.sum()
Out[12]: 
 a  b  c
 6 15 24

but I think that'd be too much of a breaking change from the regular pandas case.

Not keen on having:

In [14]: df.sum()
Out[14]: 
 6
15
24
dtype: int64

because here the row labels would be really meaningful, but also not keen on having and extra column like

In [19]: df.sum()
Out[19]: 
column  sum
     a    6
     b   15
     c   24

I can't see how to get this to conform with "3. Nobody should get an index unless they ask for one" without it being too big of a change to current behaviour.

I think I may need to drop "3." from this proposal, and only describe how a NoIndex object would look like

jorisvandenbossche · 2022-11-15T15:05:33Z

I can't see how to get this to conform with "3. Nobody should get an index unless they ask for one" without it being too big of a change to current behaviour.

I think I may need to drop "3." from this proposal, and only describe how a NoIndex object would look like

I would personally find that the better option.

Specifically for the reduction example: a DataFrame has named columns, so if those end up in the index in a certain operation, we use those names. So you could say that the user "asks for this", when it does an operation where the columns names are used for row labels.
Of course it could be avoided that those column names end up in the row labels, by returning an object of a different shape (DataFrame with 1 row vs Series), but considering to change the output type / shape of reductions seems indeed a bigger / separate discussion.

I personally also think that having an option to have to no index to start with (i.e, in IO, constructors, etc) is already useful in itself (and is a smaller scoped change than the "with this option enabled, no index is ever created", since we just have quite some functionality that introduce meaningful indices even when starting from a default range index / no index)

web/pandas/pdeps/0005-no-default-index-mode.md

jreback

is there a reason we don't want to call this NoRowIndex? even internally

#as by definition we are allowing this for axis=0
(i guess we could allow no column index but then it's just a 2D numpy array)

MarcoGorelli · 2022-11-16T11:00:00Z

NoRowIndex is probably better, thanks - column labels would always be there (and would be explicitly prohibited from being NoRowIndex)

Thanks everyone for your comments, really appreciate them - I'll revise the proposal trying to account for everyone's thoughts

For now it'll focus exclusively on the NoRowIndex object - discussions around as_index defaults and pd.set_option can happen separately

WillAyd · 2022-11-17T22:01:31Z

Do you have a point of view as to what should happen when objects without an index that are a different length get added? Do they truncate to the shortest length, raise an error, or do something else?

MarcoGorelli · 2022-11-17T22:15:51Z

that'd raise - aiming to push a revision with plenty of examples tomorrow

MarcoGorelli · 2022-11-18T14:41:48Z

Thanks all for your reviews

I've pushed another commit with a first revision. I've resolved most of the comments as I think I've addressed them in the revision, either in the body or in the FAQ section

I've left open the discussion about whether NoRowIndex should be part of the public API, as I'm not sure about that yet

Dr-Irv

One global comment. You may want to avoid contractions, i.e., instead of "wouldn't", say "would not", instead of "didn't", use "did not", etc.

web/pandas/pdeps/0005-no-default-index-mode.md

Dr-Irv · 2022-11-18T15:29:00Z

web/pandas/pdeps/0005-no-default-index-mode.md

+
+The suggestion is to add a ``NoRowIndex`` class. Internally, it would act a bit like
+a ``RangeIndex``, but some methods would be stricter. This would be one
+step towards enabling users who don't want to think about indices to not have to.


"This would be one step towards enabling users who don't want to think about indices." I think the phrase "to not have to" is awkward.

web/pandas/pdeps/0005-no-default-index-mode.md

blazespinnaker · 2022-11-21T00:13:35Z

My one general issue is this might be throwing out the baby with the bath water, though I get why it might be easy to implement. It also feels quite dangerous as it will be inconsistent with pretty much how pandas is documented everywhere to behave. This will make stack overflow questions hard to answer as you'll need to know whether this option was set.

Row Indices are also broadly useful and declarative. The point of pandas over numpy is that it adds meta-data which controls its behavior. I can't imagine turning them off and retaining the value of Pandas.

The problem with indices occurs with implicit and rather arbitrary data alignment during various operations, such as sum above. In other data QLs you need to specify this behavior explicitly via a join clause.

Better I think would be to leave indices largely alone but have an option (which can be globally set) to turn it off/on with one of the following: strict mode / error on unaligned indices, off with ignore error (which this seems to be, and it seems very dangerous), or on (current behavior). The dangerous mode, I only include it to be friendly, but I'd advise against it. For stack overflow questions, an answer could often be - try setting strict mode to help debug your problem.

Data alignment may be confusing, but it's extremely important as it ensures that the user isn't incorrectly collating unaligned data, which can easily happen. Implicit behavior like this is often the source of confusion, especially when it's not very consistent.

My biggest complaint with how it currently automagically works and doesn't warn you when probably you made a mistake.

eg:

#49617

MarcoGorelli · 2022-11-21T10:36:40Z

Thanks for taking a look - I'll respond to some of your points:

StackOverflow requires minimal reproducible examples (which'd need to include any options which were set), so this shouldn't be an issue there. There's already lot's of options which can be set in pandas, I've not seen them cause confusion on SO;
this wouldn't remove the value of pandas over numpy because column names would still be there, along with the operations such as groupby which aren't in numpy

This wouldn't be the dangerous mode you describe, because without row labels, there'd be no concept of misalignment. If you try to sum (or take .corr, or whatever) between two DataFrames without row labels, then:

either they're the same length, in which case the operation succeeds (and aligns by position);
or they're not, in which case it fails loudly and clearly.

My biggest complaint with how it currently automagically works and doesn't warn you when probably you made a mistake.

pandas doesn't know what you're trying to do, so it can't warn when you "probably made a mistake". An option to warn whenever indexes aren't aligned would be a separate discussion - if you want to take it further, please open a new issue and I'd be happy to take a look. Let's keep this issue's discussion focused on not having row labels.

jbrockmendel · 2022-11-21T16:10:42Z

There's already lot's of options which can be set in pandas

These checks aren't expensive, but they do add up (xref #49771). I'm more concerned about the maintenance/testing burde they add up to.

MarcoGorelli · 2022-12-04T17:21:51Z

Thanks all for your reviews and comments, I really appreciate it 🙏

After several discussions, it's become clear that there's not enough support for this to pass, at least in its current state. Hence, I've marked the proposal as 'rejected', and have explained the reasons why in a section at the bottom of the page.

As mentioned in the PDEP, this has still been a useful exercise which has resulted in 2 related proposals. It may be that the idea of not having an index can make it into pandas in the future, but it would require a more comprehensive set of changes to be made together. This particular version of the proposal, however, doesn't have a chance of making it.

No hard feelings, and thank you all for the time you dedicated to reviewing this 🤗

web/pandas/pdeps/0005-no-default-index-mode.md

mroeschke

WIthdrawn status is okay with me. Thanks for exploring this @MarcoGorelli!

Dr-Irv

there were some comments from my previous review that weren't addressed, and I added new ones here.

web/pandas/pdeps/0005-no-default-index-mode.md

Dr-Irv · 2022-12-30T16:11:38Z

web/pandas/pdeps/0005-no-default-index-mode.md

+```python
+In [12]: df = pd.DataFrame({'a': [1,  2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))
+
+In [13]: df.loc[df['a']>1, 'b']


For readability, would be useful to have spaces around the > operator here and a few lines below

Dr-Irv · 2022-12-30T16:13:08Z

web/pandas/pdeps/0005-no-default-index-mode.md

+
+### Columns cannot be NoRowIndex
+
+This proposal deals exclusively with letting users not have to think about


Better to write "This proposal deals exclusively with allowing users to not have to think about"

Dr-Irv · 2022-12-30T16:15:12Z

web/pandas/pdeps/0005-no-default-index-mode.md

+then the following would all create a ``DataFrame`` with a ``NoRowIndex`` (as they
+all call ``default_index``):
+
+- ``df.reset_index(drop=True)``;


Wouldn't df.reset_index() (i.e., with drop=False) also do the same thing? I see no need to include drop=True in this example.

Dr-Irv · 2022-12-30T16:25:12Z

web/pandas/pdeps/0005-no-default-index-mode.md

+**Q: Are indices not really powerful?**
+
+**A:** Yes! And they're also confusing to many users, even experienced developers.
+  It's fairly common to see pandas code with ``.reset_index`` scattered around every


For what it's worth, we often end up using .reset_index() because we have a MultiIndex, and there is some data in the index that we need to use in a computation. E.g.:

df = pd.DataFrame({"a":[1,2,3,4]}, index=pd.MultiIndex.from_product([[10, 20], [30, 40]], names=["x", "y"])) df.reset_index().assign(z=lambda df: df["x"] + df["y"]).set_index(df.index.names)

So your comment about "It's fairly common to see pandas code with .reset_index() scattered around every other line" implies that the reason that people are doing reset_index() is because people are worried about indices and alignment.

It might be better to say "Often users are using .reset_index to avoid issues with indices and alignment."

Dr-Irv · 2022-12-30T16:26:56Z

web/pandas/pdeps/0005-no-default-index-mode.md

+
+**A:** This PDEP would only allow it via the constructor, passing
+  ``index=NoRowIndex(len(df))``. A mode could be introduced to toggle
+  making that the default, but would be out-of-scope for the current PDEP.


This is inconsistent with what you wrote above about having the mode.no_row_index option

how so? even when I mention it above, I make it clear that this PDEP only deals with the NoRowIndex class, and that this mode would be out-of-scope for this PDEP

There is text in the PDEP that refers to the mode that makes it seem like it is part of the proposal.

For example, lines 297-298 read:

Conversely, to get rid of the index, then (so long as one has enabled the ``mode.no_row_index`` option) one could simply do ``df.reset_index(drop=True)``.

When I read that, it seems like the mode is part of the proposal. So I think you may need to qualify all references to mode.no_row_index to describe what would happen if the mode.no_row_index option is not available.

Or make a decision to include mode.no_row_index as part of the proposal. (Which I think makes it easier to understand, IMHO)

Dr-Irv · 2022-12-30T16:27:31Z

web/pandas/pdeps/0005-no-default-index-mode.md

+**Q: Would ``.loc`` stop working?**
+
+**A:** No. It would only raise if used for label-based selection. Other uses
+  of ``.loc``, such as ``df.loc[:, col_1]`` or ``df.loc[mask, col_1]``, would


might want to use boolean_mask here to make it clear you're referring to that kind of mask

MarcoGorelli · 2023-01-03T16:39:06Z

there were some comments from my previous review that weren't addressed

Apologies, and thanks for your patience + comments

MarcoGorelli · 2023-02-28T11:58:56Z

I think I've addressed the comments - this is withdrawn anyway so arguably doesn't need to be perfect

@Dr-Irv is it OK to ~~put this out of misery~~ merge?

Dr-Irv

OK to merge, since it is withdrawn

MarcoGorelli marked this pull request as ready for review November 14, 2022 12:46

mroeschke reviewed Nov 14, 2022

View reviewed changes

jbrockmendel reviewed Nov 14, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved

WillAyd reviewed Nov 15, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved

Dr-Irv reviewed Nov 15, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved

Dr-Irv reviewed Nov 15, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Show resolved Hide resolved

Dr-Irv reviewed Nov 15, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved

Dr-Irv reviewed Nov 15, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Show resolved Hide resolved

bashtage reviewed Nov 16, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved

bashtage reviewed Nov 16, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Show resolved Hide resolved

jreback reviewed Nov 16, 2022

View reviewed changes

mroeschke added the PDEP pandas enhancement proposal label Nov 16, 2022

MarcoGorelli force-pushed the pdep5 branch from 72cf0d9 to 5bd10e4 Compare November 18, 2022 13:49

MarcoGorelli changed the title ~~PDEP0005 No-default-index mode~~ PDEP0005 NoRowIndex Nov 18, 2022

Dr-Irv requested changes Nov 18, 2022

View reviewed changes

mroeschke reviewed Nov 18, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved

mroeschke reviewed Nov 18, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Show resolved Hide resolved

MarcoGorelli mentioned this pull request Nov 18, 2022

ENH: ignore_index for Series corr #49617

Closed

3 tasks

MarcoGorelli force-pushed the pdep5 branch from 5bd10e4 to 646e081 Compare November 19, 2022 08:26

Dr-Irv reviewed Nov 20, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Show resolved Hide resolved

MarcoGorelli force-pushed the pdep5 branch from f3ebe8e to f4c0101 Compare December 4, 2022 17:22

mroeschke reviewed Dec 5, 2022

View reviewed changes

web/pandas/pdeps/0005-no-default-index-mode.md Outdated Show resolved Hide resolved

mroeschke approved these changes Dec 13, 2022

View reviewed changes

MarcoGorelli requested a review from Dr-Irv December 30, 2022 09:49

Dr-Irv requested changes Dec 30, 2022

View reviewed changes

MarcoGorelli added 10 commits January 3, 2023 16:07

[skip ci] pdep-5 initial draft

c73892d

[skip ci] first revision

573bf75

[skip ci] note about multiindex

2284d2a

[skip ci] clarify some points as per reviews

a2dfde5

[skip ci] fix typo

de7dbce

[skip ci] withdraw

69680e7

[skip ci] typos

8155793

[skip ci] clarify benefit to users part

997d601

[skip ci] reword, reformat

75056a2

[skip ci] further reword

443da87

MarcoGorelli force-pushed the pdep5 branch from 4698889 to 75056a2 Compare January 3, 2023 16:35

MarcoGorelli changed the title ~~PDEP0005 LabellessIndex~~ PDEP0005 NoRowIndex Jan 3, 2023

MarcoGorelli and others added 3 commits January 6, 2023 14:35

Merge branch 'main' into pdep5

ca01f61

status withdrawn

dfbd494

Merge branch 'pdep5' of github.com:MarcoGorelli/pandas into pdep5

7463b1e

datapythonista changed the title ~~PDEP0005 NoRowIndex~~ PDEP-5: NoRowIndex Jan 18, 2023

jorisvandenbossche removed this from the 2.0 milestone Feb 8, 2023

MarcoGorelli added 3 commits February 28, 2023 11:48

Merge remote-tracking branch 'upstream/main' into pdep5

e2d6c58

clarify that mode.no_row_index would have been separate

1434726

summarise revisions / withdrawal reasons

ef0a5b2

Dr-Irv approved these changes Mar 2, 2023

View reviewed changes

Dr-Irv merged commit a344dc9 into pandas-dev:main Mar 2, 2023


		### Columns cannot be NoRowIndex

		This proposal deals exclusively with letting users not have to think about

PDEP-5: NoRowIndex #49694

PDEP-5: NoRowIndex #49694

Conversation

MarcoGorelli commented Nov 14, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rhshadrach commented Nov 14, 2022 • edited Loading

WillAyd left a comment

Choose a reason for hiding this comment

MarcoGorelli commented Nov 15, 2022

jorisvandenbossche commented Nov 15, 2022 • edited Loading

jreback left a comment

Choose a reason for hiding this comment

MarcoGorelli commented Nov 16, 2022

WillAyd commented Nov 17, 2022

MarcoGorelli commented Nov 17, 2022

MarcoGorelli commented Nov 18, 2022 • edited Loading

Dr-Irv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

blazespinnaker commented Nov 21, 2022 • edited Loading

MarcoGorelli commented Nov 21, 2022

jbrockmendel commented Nov 21, 2022

MarcoGorelli commented Dec 4, 2022

mroeschke left a comment

Choose a reason for hiding this comment

Dr-Irv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MarcoGorelli commented Jan 3, 2023

MarcoGorelli commented Feb 28, 2023

Dr-Irv left a comment

Choose a reason for hiding this comment

rhshadrach commented Nov 14, 2022 •

edited

Loading

jorisvandenbossche commented Nov 15, 2022 •

edited

Loading

MarcoGorelli commented Nov 18, 2022 •

edited

Loading

blazespinnaker commented Nov 21, 2022 •

edited

Loading