
[BUG] cudf 0.19.1 cannot read multiple parquet at once #8536

Closed
cdeotte opened this issue Jun 16, 2021 · 8 comments
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments


cdeotte commented Jun 16, 2021

Describe the bug
RAPIDS cudf 0.19.1 cannot read multiple parquet files at once

Steps/Code to reproduce bug

import os, cudf
os.makedirs('tmp', exist_ok=True)
cudf.DataFrame({'a':[1,2,3]}).to_parquet('tmp/df1.parquet')
cudf.DataFrame({'a':[4,5,6]}).to_parquet('tmp/df2.parquet')
df = cudf.read_parquet('tmp')

Expected behavior
We expect df to contain the concatenation of the two dataframes. Instead, we get an error:

ValueError: Length mismatch: Expected axis has 6 elements, new values have 3 elements
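The message matches what pandas raises when a 3-element index is assigned to a 6-row frame, which hints that the reader is applying one file's index to the concatenated result. A minimal pandas illustration of the same ValueError (an analogy, not cudf's actual code path):

```python
import pandas as pd

# Six rows, as if two 3-row Parquet files were concatenated.
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6]})

try:
    # Assign an index sized for a single 3-row file.
    df.index = [0, 1, 2]
except ValueError as e:
    print(e)  # Length mismatch: Expected axis has 6 elements, new values have 3 elements
```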

Environment overview (please complete the following information)
RAPIDS cudf 0.19.1

Additional context
In Slack it was discovered that passing index=False or index=True explicitly when writing the Parquet files to disk allows the subsequent read from disk to work correctly.

@cdeotte cdeotte added Needs Triage Need team to review and classify bug Something isn't working labels Jun 16, 2021
@quasiben (Member)

The cudf reader may be making some assumptions when reading multiple files which are not always valid. Can you also test with dask_cudf?

@quasiben quasiben added the Python Affects Python cuDF API. label Jun 16, 2021
cdeotte (Author) commented Jun 16, 2021

> The cudf reader may be making some assumptions when reading multiple files which are not always valid. Can you also test with dask_cudf?

dask_cudf works if the index is not specified or if index=False when saving the Parquet files. If you specify index=True, then dask_cudf fails when reading: it reads all but the last column.

import os, cudf, dask_cudf
os.makedirs('tmp', exist_ok=True)
cudf.DataFrame({'a':[1,2,3],'b':[5,5,5]}).to_parquet('tmp/df1.parquet',index=True)
cudf.DataFrame({'a':[4,5,6],'b':[6,6,6]}).to_parquet('tmp/df2.parquet',index=True)
df = dask_cudf.read_parquet('tmp')
df.compute()

will display the index and column "a", but not column "b". Interestingly, cudf gets this case correct.

@galipremsagar galipremsagar removed the Needs Triage Need team to review and classify label Jun 17, 2021
@galipremsagar galipremsagar self-assigned this Jun 17, 2021
@galipremsagar (Contributor)

Just triaged this issue; it appears we are incorrectly setting the index while reading, and that is what is causing this issue. Assigning it to myself.

shwina (Contributor) commented Jun 18, 2021

@galipremsagar Spent some time looking at this earlier too. I tracked the issue down to these lines. But I wasn't sure what behaviour we should implement. Here are some things I tried in case it's useful to you:

  1. Reading two Parquet files, where the first Parquet file includes a RangeIndex in its metadata section. In this case, Pandas seems to ignore all the indexes:
In [25]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df1.parquet')
In [26]: pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5]).to_parquet('tmp/df2.parquet')
In [27]: pd.read_parquet('tmp')
Out[27]:
   a
0  1
1  2
2  3
3  1
4  2
5  3
  2. The same as the example above, but with the ordering of the files switched. Now, Pandas doesn't ignore the indexes, but it doesn't read the RangeIndex metadata either, using NaNs instead:
# this time, switch the ordering of `df1` and `df2`:
In [28]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df2.parquet')
In [29]: pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5]).to_parquet('tmp/df1.parquet')
In [30]: pd.read_parquet('tmp')
Out[30]:
     a
1.0  1
3.0  2
5.0  3
NaN  1
NaN  2
NaN  3
  3. Reading just a single file that includes a RangeIndex in its metadata. This time, Pandas does read the RangeIndex metadata correctly:
# read just one DataFrame:

In [32]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df2.parquet')
In [33]: pd.read_parquet('tmp/df2.parquet')
Out[33]:
   a
0  1
2  2
4  3

I would expect that in all three cases, the RangeIndex would be read and used. Thus, for the first case, I would expect:

   a
0  1
2  2
4  3
1  1
3  2
5  3

And for the second:

   a
1  1
3  2
5  3
0  1
2  2
4  3
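Since single-file reads handle the index metadata correctly (case 3 above), the expected outputs amount to reading each file individually and concatenating. A pandas sketch, with in-memory frames standing in for the two files:

```python
import pandas as pd

# Stand-ins for pd.read_parquet('tmp/df1.parquet') and
# pd.read_parquet('tmp/df2.parquet'), each read on its own.
df1 = pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2))
df2 = pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5])

# Concatenation preserves each frame's index, giving the
# expected result for the first case above.
combined = pd.concat([df1, df2])
print(combined.index.tolist())  # [0, 2, 4, 1, 3, 5]
```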

The only situation in which I would expect the indexes to be ignored is if there is no index metadata written to the Parquet file:

# this makes sense:
In [42]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df1.parquet')
In [43]: pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5]).to_parquet('tmp/df2.parquet', index=False)
In [44]: pd.read_parquet('tmp')
Out[44]:
   a
0  1
1  2
2  3
3  1
4  2
5  3

Not sure what to do in this situation. Again, Pandas ignores all the indexes.

@beckernick (Member)

cc @randerzander do you have any thoughts here?

@beckernick (Member)

Perhaps the inconsistent behavior observed in @shwina's and @galipremsagar's tests is a bug in pandas?

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@galipremsagar (Contributor)

This issue is resolved by: #11105

In [1]: import cudf

In [2]: import os

In [3]: os.makedirs('tmp', exist_ok=True)
   ...: cudf.DataFrame({'a':[1,2,3]}).to_parquet('tmp/df1.parquet')
   ...: cudf.DataFrame({'a':[4,5,6]}).to_parquet('tmp/df2.parquet')

In [4]: df = cudf.read_parquet('tmp')

In [5]: df
Out[5]: 
   a
0  1
1  2
2  3
3  4
4  5
5  6

In [6]: cudf.__version__
Out[6]: '22.08.00a+215.gf94146b59f'
