BUG: RangeIndex is being materialized to a column while writing to parquet file when index=True #37896

Closed
2 of 3 tasks
galipremsagar opened this issue Nov 16, 2020 · 2 comments

@galipremsagar

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

In[20]: import pandas as pd
In[21]: import pyarrow as pa
In[22]: pdf = pd.DataFrame({'a':[1, 2, 3], 'b':[10, 11, 12]}, index=pd.RangeIndex(2, 5, 1))
In[23]: pdf.to_parquet('temp', index=True)
In[24]: pdf
Out[24]: 
   a   b
2  1  10
3  2  11
4  3  12
In[25]: pd.read_parquet('temp')
Out[25]: 
   a   b
2  1  10
3  2  11
4  3  12
In[26]: x = pa.parquet.read_metadata('temp')
In[27]: x.metadata
Out[27]: 
{b'ARROW:schema': b'/////3ADAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAIQCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAE4CAAB7ImluZGV4X2NvbHVtbnMiOiBbIl9faW5kZXhfbGV2ZWxfMF9fIl0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6ICJhIiwgImZpZWxkX25hbWUiOiAiYSIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJiIiwgImZpZWxkX25hbWUiOiAiYiIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogIl9faW5kZXhfbGV2ZWxfMF9fIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfV0sICJjcmVhdG9yIjogeyJsaWJyYXJ5IjogInB5YXJyb3ciLCAidmVyc2lvbiI6ICIxLjAuMSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS4xLjQifQAAAwAAAIQAAABEAAAABAAAAJj///8AAAECHAAAAAwAAAAEAAAAAAAAAIj///8AAAABQAAAABEAAABfX2luZGV4X2xldmVsXzBfXwAAANT///8AAAECHAAAAAwAAAAEAAAAAAAAAMT///8AAAABQAAAAAEAAABiAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAAAAAAAAgADAAIAAcACAAAAAAAAAFAAAAAAQAAAGEAAAAAAAAA',
 b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "b", "field_name": "b", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": null, "field_name": "__index_level_0__", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.1.4"}'}
In[28]: pd.read_parquet('temp').index
Out[28]: Int64Index([2, 3, 4], dtype='int64')
In[29]: pdf.index
Out[29]: RangeIndex(start=2, stop=5, step=1)

Problem description

The RangeIndex object should not be materialized and stored in the parquet file. Instead, pandas should store the RangeIndex in the metadata under index_columns, as it does below when we pass index=None.

Expected Output

In[32]: pdf.to_parquet('temp', index=None)
In[33]: pd.read_parquet('temp')
Out[33]: 
   a   b
2  1  10
3  2  11
4  3  12
In[34]: pd.read_parquet('temp').index
Out[34]: RangeIndex(start=2, stop=5, step=1)
In[35]: x = pa.parquet.read_metadata('temp')
In[36]: x.metadata
Out[36]: 
{b'ARROW:schema': b'/////+gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAEACAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAAgCAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAyLCAic3RvcCI6IDUsICJzdGVwIjogMX1dLCAiY29sdW1uX2luZGV4ZXMiOiBbeyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiBudWxsLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifX1dLCAiY29sdW1ucyI6IFt7Im5hbWUiOiAiYSIsICJmaWVsZF9uYW1lIjogImEiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiYiIsICJmaWVsZF9uYW1lIjogImIiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjEuMC4xIn0sICJwYW5kYXNfdmVyc2lvbiI6ICIxLjEuNCJ9AAAAAAIAAABEAAAABAAAANT///8AAAECHAAAAAwAAAAEAAAAAAAAAMT///8AAAABQAAAAAEAAABiAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAAAAAAAAgADAAIAAcACAAAAAAAAAFAAAAAAQAAAGEAAAA=',
 b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 2, "stop": 5, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "b", "field_name": "b", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.1.4"}'}
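The difference between the two metadata dumps is easier to see once the b'pandas' entry is decoded. The following is a minimal sketch using only the standard library; the two payloads are abridged copies of the ones shown in this issue (only index_columns and the column field names are kept), and describe_index is a hypothetical helper name, not part of pandas or pyarrow.

```python
import json

# Abridged copies of the b'pandas' payloads from the two dumps in this issue.
# Written with index=True: the index is listed by column name.
with_index_true = (
    b'{"index_columns": ["__index_level_0__"], '
    b'"columns": [{"name": "a", "field_name": "a"}, '
    b'{"name": "b", "field_name": "b"}, '
    b'{"name": null, "field_name": "__index_level_0__"}]}'
)
# Written with index=None: the index is described as a range, no extra column.
with_index_none = (
    b'{"index_columns": [{"kind": "range", "name": null, '
    b'"start": 2, "stop": 5, "step": 1}], '
    b'"columns": [{"name": "a", "field_name": "a"}, '
    b'{"name": "b", "field_name": "b"}]}'
)


def describe_index(pandas_meta: bytes) -> list:
    """Hypothetical helper: summarize how each index level is stored."""
    meta = json.loads(pandas_meta)
    summary = []
    for entry in meta["index_columns"]:
        if isinstance(entry, dict) and entry.get("kind") == "range":
            # Only start/stop/step are stored; no values are written.
            summary.append(
                "range descriptor RangeIndex("
                f"start={entry['start']}, stop={entry['stop']}, step={entry['step']})"
            )
        else:
            # A plain string names a real column that holds the index values.
            summary.append(f"materialized column {entry!r}")
    return summary


for label, payload in (("index=True", with_index_true), ("index=None", with_index_none)):
    stored = [col["field_name"] for col in json.loads(payload)["columns"]]
    print(label, describe_index(payload), "stored columns:", stored)
```

A string entry in index_columns means the index values occupy a real column in the file, while a dict entry with kind "range" means only the start, stop, and step were recorded.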

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 67a3d42
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-53-generic
Version : #59-Ubuntu SMP Wed Oct 21 09:38:44 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.2.4
setuptools : 49.6.0.post20201009
Cython : 0.29.21
pytest : 6.1.2
hypothesis : 5.41.1
sphinx : 3.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.4
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2

@jreback
Contributor

jreback commented Nov 16, 2020

@jorisvandenbossche, but this is likely a pyarrow issue

@jorisvandenbossche
Member

See the documentation of the index keyword of to_parquet:

index : bool, default None
    If ``True``, include the dataframe's index(es) in the file output.
    If ``False``, they will not be written to the file.
    If ``None``, similar to ``True`` the dataframe's index(es)
    will be saved. However, instead of being saved as values,
    the RangeIndex will be stored as a range in the metadata so it
    doesn't require much space and is faster. Other indexes will
    be included as columns in the file output.

So when index=True is specified, the index is always materialized as an actual column in the parquet file.

> The RangeIndex object should not be materialized and stored in the parquet file. Instead, pandas should store the RangeIndex in the metadata under index_columns, as it does below when we pass index=None.

As explained above, that is exactly the purpose of having index=True versus index=None: they behave differently in this case.
