[BUG] Unexpected OOM writing a DataFrame w/ strings to ORC files #7588

Closed
randerzander opened this issue Mar 12, 2021 · 0 comments · Fixed by #7605
Labels: bug (Something isn't working), cuIO (cuIO issue)

randerzander commented Mar 12, 2021

On a 32 GB V100, the snippet below completes successfully with the RAPIDS 0.18 release but fails with an out-of-memory (OOM) error in the latest 0.19 nightly:

import cudf

df = cudf.datasets.randomdata(nrows=20_000_000)
df['teststr'] = 'teststr'
df.to_orc('test.orc', compression='snappy')

Trace:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-1-33735a6b5c55> in <module>
      3 df = cudf.datasets.randomdata(nrows=20_000_000)
      4 df['teststr'] = 'teststr'
----> 5 df.to_orc('test.orc', compression='snappy')

~/conda/envs/rapids-gpu-bdb/lib/python3.7/site-packages/cudf/core/dataframe.py in to_orc(self, fname, compression, *args, **kwargs)
   7396         from cudf.io import orc as orc
   7397 
-> 7398         orc.to_orc(self, fname, compression, *args, **kwargs)
   7399 
   7400     def stack(self, level=-1, dropna=True):

~/conda/envs/rapids-gpu-bdb/lib/python3.7/site-packages/cudf/io/orc.py in to_orc(df, fname, compression, enable_statistics, **kwargs)
    329             liborc.write_orc(df, file_obj, compression, enable_statistics)
    330     else:
--> 331         liborc.write_orc(df, path_or_buf, compression, enable_statistics)
    332 
    333 

cudf/_lib/orc.pyx in cudf._lib.orc.write_orc()

cudf/_lib/orc.pyx in cudf._lib.orc.write_orc()

MemoryError: std::bad_alloc: CUDA error at: /home/rgelhausen/conda/envs/rapids-gpu-bdb/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory

Interestingly, the same snippet with nrows=100_000_000 does not OOM.
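
Since the failure depends on the row count in a non-obvious way, a small sweep harness may help narrow down which sizes fail. The following is a minimal sketch, not part of the original report: `sweep_row_counts` and `write_orc_with_cudf` are hypothetical helper names, and the cuDF calls inside are the same ones used in the reproducer above.

```python
from typing import Callable, Dict, Iterable


def sweep_row_counts(write_fn: Callable[[int], None],
                     row_counts: Iterable[int]) -> Dict[int, str]:
    """Call write_fn(nrows) for each row count and record 'ok' or 'oom'."""
    results = {}
    for nrows in row_counts:
        try:
            write_fn(nrows)
            results[nrows] = "ok"
        except MemoryError:
            # The trace above shows std::bad_alloc surfacing as MemoryError.
            results[nrows] = "oom"
    return results


def write_orc_with_cudf(nrows: int) -> None:
    """Same steps as the reproducer; requires a GPU and cudf installed."""
    import cudf  # imported lazily so the harness itself has no GPU dependency

    df = cudf.datasets.randomdata(nrows=nrows)
    df["teststr"] = "teststr"
    df.to_orc("test.orc", compression="snappy")
```

On a machine with cuDF, `sweep_row_counts(write_orc_with_cudf, [20_000_000, 100_000_000])` would exercise the two sizes mentioned in this report.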

randerzander added the bug and cuIO labels on Mar 12, 2021
rapids-bot pushed a commit that referenced this issue on Mar 17, 2021:
Closes #7588

The stream size used to be calculated incorrectly, leading to a huge allocation for the encoded data buffer.

This PR fixes the stream size computation to count each row group only once.
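
To illustrate why counting a row group more than once can inflate a buffer size, here is a toy sketch. It is not the libcudf code, which is not shown in this issue. The function names, the stripe layout, and the specific over-counting pattern are all assumptions made for illustration only.

```python
from typing import List


def stream_size_once(rowgroup_sizes: List[int]) -> int:
    """Sum of encoded sizes, with each row group counted exactly once."""
    return sum(rowgroup_sizes)


def stream_size_overcounted(rowgroup_sizes: List[int],
                            rowgroups_per_stripe: int) -> int:
    """Hypothetical buggy variant: each stripe re-adds all earlier row groups."""
    total = 0
    n = len(rowgroup_sizes)
    for start in range(0, n, rowgroups_per_stripe):
        end = min(start + rowgroups_per_stripe, n)
        # Assumed bug: the slice should be [start:end], not [:end].
        total += sum(rowgroup_sizes[:end])
    return total
```

With six row groups of 10 bytes each and two row groups per stripe, the first function returns 60 while the over-counting variant returns 120. In this toy model the gap grows with the number of stripes, which shows how a sizing error of this kind could lead to a much larger allocation than the data needs.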

Authors:
  - Vukasin Milovanovic (@vuule)

Approvers:
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
  - Kumar Aatish (@kaatish)
  - Devavret Makkar (@devavret)

URL: #7605