BUG: RangeIndex is being materialized to a column while writing to parquet file when `index=True` #37896

galipremsagar · 2020-11-16T18:07:34Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

In[20]: import pandas as pd
In[21]: import pyarrow as pa
In[22]: pdf = pd.DataFrame({'a':[1, 2, 3], 'b':[10, 11, 12]}, index=pd.RangeIndex(2, 5, 1))
In[23]: pdf.to_parquet('temp', index=True)
In[24]: pdf
Out[24]: 
   a   b
2  1  10
3  2  11
4  3  12
In[25]: pd.read_parquet('temp')
Out[25]: 
   a   b
2  1  10
3  2  11
4  3  12
In[26]: x = pa.parquet.read_metadata('temp')
In[27]: x.metadata
Out[27]: 
{b'ARROW:schema': b'/////3ADAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAIQCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAE4CAAB7ImluZGV4X2NvbHVtbnMiOiBbIl9faW5kZXhfbGV2ZWxfMF9fIl0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6ICJhIiwgImZpZWxkX25hbWUiOiAiYSIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJiIiwgImZpZWxkX25hbWUiOiAiYiIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogIl9faW5kZXhfbGV2ZWxfMF9fIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfV0sICJjcmVhdG9yIjogeyJsaWJyYXJ5IjogInB5YXJyb3ciLCAidmVyc2lvbiI6ICIxLjAuMSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS4xLjQifQAAAwAAAIQAAABEAAAABAAAAJj///8AAAECHAAAAAwAAAAEAAAAAAAAAIj///8AAAABQAAAABEAAABfX2luZGV4X2xldmVsXzBfXwAAANT///8AAAECHAAAAAwAAAAEAAAAAAAAAMT///8AAAABQAAAAAEAAABiAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAAAAAAAAgADAAIAAcACAAAAAAAAAFAAAAAAQAAAGEAAAAAAAAA',
 b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "b", "field_name": "b", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": null, "field_name": "__index_level_0__", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.1.4"}'}
In[28]: pd.read_parquet('temp').index
Out[28]: Int64Index([2, 3, 4], dtype='int64')
In[29]: pdf.index
Out[29]: RangeIndex(start=2, stop=5, step=1)

Problem description

The RangeIndex object should not be materialized and stored as a column in the parquet file. Instead, pandas should store the RangeIndex in the metadata under index_columns, as it does when we pass index=None (shown below).

Expected Output

In[32]: pdf.to_parquet('temp', index=None)
In[33]: pd.read_parquet('temp')
Out[33]: 
   a   b
2  1  10
3  2  11
4  3  12
In[34]: pd.read_parquet('temp').index
Out[34]: RangeIndex(start=2, stop=5, step=1)
In[35]: x = pa.parquet.read_metadata('temp')
In[36]: x.metadata
Out[36]: 
{b'ARROW:schema': b'/////+gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAEACAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAAgCAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAyLCAic3RvcCI6IDUsICJzdGVwIjogMX1dLCAiY29sdW1uX2luZGV4ZXMiOiBbeyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiBudWxsLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifX1dLCAiY29sdW1ucyI6IFt7Im5hbWUiOiAiYSIsICJmaWVsZF9uYW1lIjogImEiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiYiIsICJmaWVsZF9uYW1lIjogImIiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjEuMC4xIn0sICJwYW5kYXNfdmVyc2lvbiI6ICIxLjEuNCJ9AAAAAAIAAABEAAAABAAAANT///8AAAECHAAAAAwAAAAEAAAAAAAAAMT///8AAAABQAAAAAEAAABiAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAAAAAAAAgADAAIAAcACAAAAAAAAAFAAAAAAQAAAGEAAAA=',
 b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 2, "stop": 5, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "b", "field_name": "b", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.1.4"}'}
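The difference between the two files is easiest to see by decoding the `b'pandas'` metadata entry. The following is a minimal sketch that uses only the standard library; the two byte strings are trimmed copies of the outputs above (only the `index_columns` key is kept), and `describe_index_columns` is a hypothetical helper name, not part of pandas or pyarrow.

```python
import json

# Trimmed copies of the b'pandas' metadata shown above; only the
# "index_columns" key is kept, the other keys are omitted for brevity.
meta_index_true = b'{"index_columns": ["__index_level_0__"]}'
meta_index_none = (
    b'{"index_columns": [{"kind": "range", "name": null, '
    b'"start": 2, "stop": 5, "step": 1}]}'
)


def describe_index_columns(raw):
    """Hypothetical helper: report how each index level is stored."""
    descriptions = []
    for entry in json.loads(raw)["index_columns"]:
        if isinstance(entry, str):
            # A plain string names a real column in the file.
            descriptions.append(f"materialized as column {entry!r}")
        elif isinstance(entry, dict) and entry.get("kind") == "range":
            # A dict with kind == "range" lives only in the metadata.
            descriptions.append(
                "range in metadata: "
                f"start={entry['start']}, stop={entry['stop']}, step={entry['step']}"
            )
        else:
            descriptions.append("unrecognized entry")
    return descriptions


print(describe_index_columns(meta_index_true))
print(describe_index_columns(meta_index_none))
```

A string entry means the index was written as a data column, while a `{"kind": "range", ...}` entry means only the start, stop, and step were recorded.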

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 67a3d42
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-53-generic
Version : #59-Ubuntu SMP Wed Oct 21 09:38:44 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.2.4
setuptools : 49.6.0.post20201009
Cython : 0.29.21
pytest : 6.1.2
hypothesis : 5.41.1
sphinx : 3.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.4
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2


jreback · 2020-11-16T18:52:03Z

@jorisvandenbossche but this is likely a pyarrow issue

jorisvandenbossche · 2020-11-18T13:55:53Z

See the documentation of the index keyword of to_parquet:

index : bool, default None
    If ``True``, include the dataframe's index(es) in the file output.
    If ``False``, they will not be written to the file.
    If ``None``, similar to ``True`` the dataframe's index(es)
    will be saved. However, instead of being saved as values,
    the RangeIndex will be stored as a range in the metadata so it
    doesn't require much space and is faster. Other indexes will
    be included as columns in the file output.

So when index=True is specified, the index is always materialized as an actual column in the parquet file.

> The RangeIndex object should not be materialized and stored in the parquet file. Instead pandas should be storing the rangeIndex in metadata under index_columns like below when we pass index=None.

As explained above, that is exactly the purpose of having index=True vs. index=None: they behave differently in this case.

galipremsagar added the Bug and Needs Triage labels Nov 16, 2020

galipremsagar mentioned this issue Nov 16, 2020

BUG: Empty dataframe with valid index object is not being read correctly via parquet reader #37897

Closed

3 tasks

jorisvandenbossche closed this as completed Nov 18, 2020

jorisvandenbossche added the IO Parquet and Usage Question labels and removed the Bug and Needs Triage labels Nov 18, 2020

jorisvandenbossche added this to the No action milestone Nov 18, 2020

galipremsagar mentioned this issue Dec 1, 2020

[FEA] RangeIndex must be materialized in parquet writer when index=True in parquet writer rapidsai/cudf#6873

Closed
