[BUG] cudf 0.19.1 cannot read multiple parquet at once #8536
Comments
The cudf reader may be making some assumptions when reading multiple files which are not always valid. Can you also test with |
dask_cudf works if
will display the index, column "a", but not column "b". Interestingly, cudf gets this case correct |
Just triaged this issue. It appears we are incorrectly setting the index while reading, which is what causes this issue. Assigning it to myself. |
@galipremsagar Spent some time looking at this earlier too. I tracked the issue down to these lines. But I wasn't sure what behaviour we should implement. Here are some things I tried in case it's useful to you:
In [25]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df1.parquet')
In [26]: pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5]).to_parquet('tmp/df2.parquet')
In [27]: pd.read_parquet('tmp')
Out[27]:
a
0 1
1 2
2 3
3 1
4 2
5 3
# this time, switch the ordering of `df1` and `df2`:
In [28]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df2.parquet')
In [29]: pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5]).to_parquet('tmp/df1.parquet')
In [30]: pd.read_parquet('tmp')
Out[30]:
a
1.0 1
3.0 2
5.0 3
NaN 1
NaN 2
NaN 3
# read just one DataFrame:
In [32]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df2.parquet')
In [33]: pd.read_parquet('tmp/df2.parquet')
Out[33]:
a
0 1
2 2
4 3
I would expect that in all three cases, the index is preserved. For the first:
a
0 1
2 2
4 3
1 1
3 2
5 3
And for the second:
a
1 1
3 2
5 3
0 1
2 2
4 3
The only situation in which I would expect the indexes to be ignored is if there is no index metadata written to the Parquet file:
# this makes sense:
In [42]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df1.parquet')
In [43]: pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5]).to_parquet('tmp/df2.parquet', index=False)
In [44]: pd.read_parquet('tmp')
Out[44]:
a
0 1
1 2
2 3
3 1
4 2
5 3
Not sure what to do in this situation. Again, pandas ignores all the indexes:
In [52]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df1.parquet')
In [53]: pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5]).to_parquet('tmp/df2.parquet', index=False)
In [54]: pd.read_parquet('tmp')
Out[54]:
a
0 1
1 2
2 3
3 1
4 2
5 3 |
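To make the two behaviours discussed in this comment concrete, here is a minimal in-memory sketch. It assumes that an index-preserving multi-file read should behave like a plain `pd.concat` of the per-file frames, and that the index-less result seen in Out[27] corresponds to `ignore_index=True`. It does not touch Parquet at all and is not the cudf or pandas reader implementation.

```python
import pandas as pd

# The two frames from the session above, built in memory (no Parquet I/O).
df1 = pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2))
df2 = pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5])

# Index-preserving concatenation: each frame keeps its own index labels,
# in the order the frames are passed.
preserved = pd.concat([df1, df2])

# Index-discarding concatenation: labels are dropped and replaced by 0..n-1.
ignored = pd.concat([df1, df2], ignore_index=True)

print(preserved.index.tolist())
print(ignored.index.tolist())
```

Swapping the order of `df1` and `df2` in the list only changes the order of the rows, which is the behaviour the comment expects when the file names are swapped.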
cc @randerzander do you have any thoughts here? |
Perhaps the potentially inconsistent behavior observed in @shwina's and @galipremsagar's tests is a bug in pandas? |
This issue is resolved by #11105:
In [1]: import cudf
In [2]: import os
In [3]: os.makedirs('tmp', exist_ok=True)
...: cudf.DataFrame({'a':[1,2,3]}).to_parquet('tmp/df1.parquet')
...: cudf.DataFrame({'a':[4,5,6]}).to_parquet('tmp/df2.parquet')
In [4]: df = cudf.read_parquet('tmp')
In [5]: df
Out[5]:
a
0 1
1 2
2 3
3 4
4 5
5 6
In [6]: cudf.__version__
Out[6]: '22.08.00a+215.gf94146b59f' |
Describe the bug
RAPIDS cudf 0.19.1 cannot read multiple parquet files at once
Steps/Code to reproduce bug
Expected behavior
We expect df to contain the concatenation of the two dataframes. Instead we get an error.
Environment overview (please complete the following information)
RAPIDS cudf 0.19.1
Additional context
In Slack it was discovered that adding index=False or index=True when writing the Parquet files to disk allows the subsequent read from disk to work correctly.