
[BUG] cudf 0.19.1 cannot read multiple parquet at once #8536

Closed
cdeotte opened this issue Jun 16, 2021 · 8 comments
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments


cdeotte commented Jun 16, 2021

Describe the bug
RAPIDS cudf 0.19.1 cannot read multiple parquet files at once

Steps/Code to reproduce bug

import os, cudf
os.makedirs('tmp', exist_ok=True)
cudf.DataFrame({'a':[1,2,3]}).to_parquet('tmp/df1.parquet')
cudf.DataFrame({'a':[4,5,6]}).to_parquet('tmp/df2.parquet')
df = cudf.read_parquet('tmp')

Expected behavior
We expect df to contain the concatenation of the two dataframes. Instead, we get an error:

ValueError: Length mismatch: Expected axis has 6 elements, new values have 3 elements
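The message matches what pandas raises when a 3-element index is assigned to a 6-row frame, which hints that the reader is applying one file's index to the concatenated result. A minimal pandas illustration of the same ValueError (an analogy, not cudf's actual code path):

```python
import pandas as pd

# Six rows, as if two 3-row Parquet files were concatenated.
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6]})

try:
    # Assign an index sized for a single 3-row file.
    df.index = [0, 1, 2]
except ValueError as e:
    print(e)  # Length mismatch: Expected axis has 6 elements, new values have 3 elements
```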

Environment overview (please complete the following information)
RAPIDS cudf 0.19.1

Additional context
In Slack it was discovered that passing index=False or index=True explicitly when writing the Parquet files to disk allows the subsequent read from disk to work correctly.

@cdeotte cdeotte added Needs Triage Need team to review and classify bug Something isn't working labels Jun 16, 2021
@quasiben (Member)

The cudf reader may be making some assumptions when reading multiple files which are not always valid. Can you also test with dask_cudf?

@quasiben quasiben added the Python Affects Python cuDF API. label Jun 16, 2021
cdeotte (Author) commented Jun 16, 2021

> The cudf reader may be making some assumptions when reading multiple files which are not always valid. Can you also test with dask_cudf?

dask_cudf works if the index is not specified or if index=False when saving the Parquet files. If you specify index=True, then dask_cudf fails when reading: it reads all but the last column.

import os, cudf, dask_cudf
os.makedirs('tmp', exist_ok=True)
cudf.DataFrame({'a':[1,2,3],'b':[5,5,5]}).to_parquet('tmp/df1.parquet',index=True)
cudf.DataFrame({'a':[4,5,6],'b':[6,6,6]}).to_parquet('tmp/df2.parquet',index=True)
df = dask_cudf.read_parquet('tmp')
df.compute()

will display the index and column "a", but not column "b". Interestingly, cudf gets this case correct.

@galipremsagar galipremsagar removed the Needs Triage Need team to review and classify label Jun 17, 2021
@galipremsagar galipremsagar self-assigned this Jun 17, 2021
@galipremsagar (Contributor)

Just triaged this issue; it appears we are incorrectly setting the index while reading, and that is what is causing this issue. Assigning it to myself.

shwina (Contributor) commented Jun 18, 2021

@galipremsagar Spent some time looking at this earlier too. I tracked the issue down to these lines. But I wasn't sure what behaviour we should implement. Here are some things I tried in case it's useful to you:

  1. Reading two Parquet files, where the first Parquet file includes a RangeIndex in its metadata section. In this case, Pandas seems to ignore all the indexes:
In [25]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df1.parquet')
In [26]: pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5]).to_parquet('tmp/df2.parquet')
In [27]: pd.read_parquet('tmp')
Out[27]:
   a
0  1
1  2
2  3
3  1
4  2
5  3
  2. The same as the example above, but with the ordering of the files switched. Now, Pandas doesn't ignore the indexes, but it doesn't read the RangeIndex metadata either, using NaNs instead:
# this time, switch the ordering of `df1` and `df2`:
In [28]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df2.parquet')
In [29]: pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5]).to_parquet('tmp/df1.parquet')
In [30]: pd.read_parquet('tmp')
Out[30]:
     a
1.0  1
3.0  2
5.0  3
NaN  1
NaN  2
NaN  3
  3. Reading just a single file that includes a RangeIndex in its metadata. This time, Pandas does read the RangeIndex metadata correctly:
# read just one DataFrame:

In [32]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df2.parquet')
In [33]: pd.read_parquet('tmp/df2.parquet')
Out[33]:
   a
0  1
2  2
4  3

I would expect that in all three cases, the RangeIndex would be read and used. Thus, for the first case, I would expect:

   a
0  1
2  2
4  3
1  1
3  2
5  3

And for the second:

   a
1  1
3  2
5  3
0  1
2  2
4  3
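Since single-file reads handle the index metadata correctly (case 3 above), the expected outputs amount to reading each file individually and concatenating. A pandas sketch, with in-memory frames standing in for the two files:

```python
import pandas as pd

# Stand-ins for pd.read_parquet('tmp/df1.parquet') and
# pd.read_parquet('tmp/df2.parquet'), each read on its own.
df1 = pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2))
df2 = pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5])

# Concatenation preserves each frame's index, giving the
# expected result for the first case above.
combined = pd.concat([df1, df2])
print(combined.index.tolist())  # [0, 2, 4, 1, 3, 5]
```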

The only situation in which I would expect the indexes to be ignored is if there is no index metadata written to the Parquet file:

# this makes sense:
In [42]: pd.DataFrame({'a': [1, 2, 3]}, index=range(0, 5, 2)).to_parquet('tmp/df1.parquet')
In [43]: pd.DataFrame({'a': [1, 2, 3]}, index=[1, 3, 5]).to_parquet('tmp/df2.parquet', index=False)
In [44]: pd.read_parquet('tmp')
Out[44]:
   a
0  1
1  2
2  3
3  1
4  2
5  3

Not sure what to do in this situation. Again, Pandas ignores all the indexes.

@beckernick (Member)

cc @randerzander do you have any thoughts here?

@beckernick (Member)

Perhaps the inconsistent behavior observed in @shwina's and @galipremsagar's tests is a bug in pandas?

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@galipremsagar (Contributor)

This issue is resolved by: #11105

In [1]: import cudf

In [2]: import os

In [3]: os.makedirs('tmp', exist_ok=True)
   ...: cudf.DataFrame({'a':[1,2,3]}).to_parquet('tmp/df1.parquet')
   ...: cudf.DataFrame({'a':[4,5,6]}).to_parquet('tmp/df2.parquet')

In [4]: df = cudf.read_parquet('tmp')

In [5]: df
Out[5]: 
   a
0  1
1  2
2  3
3  4
4  5
5  6

In [6]: cudf.__version__
Out[6]: '22.08.00a+215.gf94146b59f'
