
cudf should spill to main memory when running out of GPU memory #129

Closed
jangorecki opened this issue Jan 9, 2020 · 9 comments · Fixed by #219

@jangorecki (Contributor) commented Jan 9, 2020

According to a comment in rapidsai/cudf#2288 (comment), one could spill to main memory without actually using dask-cudf.
Related: #126
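
The approach referenced there is to back cudf's allocations with CUDA managed (unified) memory via RMM. A minimal sketch, assuming RMM's Python API; the file path is a placeholder and the exact `rmm.reinitialize` signature has varied across releases:

```python
# Hypothetical sketch: enable CUDA managed (unified) memory through RMM so
# cudf allocations can oversubscribe the GPU and page to host RAM on demand.
# The CSV path is a placeholder; treat this as illustrative only.
import rmm
import cudf

# Route allocations through cudaMallocManaged before cudf touches the GPU.
rmm.reinitialize(managed_memory=True, pool_allocator=False)

df = cudf.read_csv("data.csv")  # the frame may now exceed physical GPU memory
print(df.head())
```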

@jangorecki added the cudf label on Jan 9, 2020
@jangorecki changed the title from "cudf should spli to main memory when running out of gpu memory" to "cudf should spil to main memory when running out of gpu memory" on Jan 9, 2020
@jangorecki (Contributor, Author)

Solving this issue allows cudf to compute medium-size data (5 GB). AFAIK 50 GB was still failing due to OOM (main memory), so I filed a new FR in cudf to handle such cases: rapidsai/cudf#3740

@jangorecki (Contributor, Author) commented Jan 10, 2020

@datametrician I would appreciate any clues about what the problem might be.

Since I switched to using managed memory, I have started to get the following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  rmm_allocator::allocate(): RMM_ALLOC: unspecified launch failure

It causes the CUDA driver to hang (I assume); trying to use cudf in another session hangs that session as well. I cannot even kill the process (listed in nvidia-smi) using kill -9. I also tried nvidia-smi -r, but it gives

GPU Reset couldn't run because GPU 00000000:02:00.0 is the primary GPU.

The only way out seems to be a hard reboot, which is not an option at the moment.

@jangorecki (Contributor, Author)

More complete output, @datametrician:

Traceback (most recent call last):
  File "./cudf/join-cudf.py", line 34, in <module>
    x = cu.read_csv(src_jn_x, header=0, dtype=['int32','int32','int32','str','str','str','float64'])
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/io/csv.py", line 82, in read_csv
    index_col=index_col,
  File "cudf/_lib/csv.pyx", line 41, in cudf._lib.csv.read_csv
  File "cudf/_lib/csv.pyx", line 205, in cudf._lib.csv.read_csv
RuntimeError: rmm_allocator::allocate(): RMM_ALLOC: unspecified launch failure

@taureandyernv commented Jan 10, 2020

@jangorecki, can you join the RAPIDS Go-AI Slack channel? We do have a feature for this, dask_cudf, and I can show you how to use dask_cudf to get around this. Is there a reason why dask_cudf is insufficient for your benchmarks? Looking forward to chatting!

@kkraus14
@jangorecki this issue looks like managed memory eating up the system memory to the point that the driver context gets corrupted, in which case unfortunately the only option is to restart the machine. UVM only supports spilling to host memory because the migration from host --> GPU occurs via a page-fault mechanism that won't work with disks.

As Taurean pointed out, dask-cudf has a different mechanism for managing memory: it chunks the workload, monitors memory usage, and spills from GPU --> host --> disk as needed. If your workload is larger than system memory, I would highly recommend using dask-cudf.
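
A minimal sketch of that setup, assuming the dask_cuda package; the memory limit, file pattern, and column names are placeholders rather than values from this thread:

```python
# Sketch of GPU -> host -> disk spilling with dask-cudf, assuming dask_cuda.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

# Each worker spills device memory to host once it holds roughly 4 GB on the
# GPU; Dask's regular host memory limits then handle host -> disk spilling.
cluster = LocalCUDACluster(device_memory_limit="4GB")
client = Client(cluster)

# Lazily partitioned read; each partition is an ordinary cudf DataFrame.
ddf = dask_cudf.read_csv("data_*.csv")
print(ddf["v1"].mean().compute())
```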

@jangorecki (Contributor, Author) commented Jan 11, 2020

@taureandyernv Thanks for your comment. It is not that dask-cudf is insufficient; I want to use dask-cudf in the benchmarks. The problem is that I found the documentation lacking for my use case (see "dask_cudf.read_csv docstring").
I know I could try to figure it out myself by asking on GH (which I actually did) or by reading existing GH comments, but

  1. it takes much more time than just reading documentation,
  2. it is not guaranteed to succeed, as some parts might not have been implemented yet,
  3. the API is not guaranteed to be stable; everything that is not in the documentation should be considered subject to change without notice, and adapting code to changes that don't have to be listed in the changelog as breaking changes is even more time-consuming.

@kkraus14 Thanks for your comment, it helps a lot. It is quite bad that it is so easy to corrupt the driver context. IMO that is a good reason to warn users before they use managed memory with cudf alone, though of course not with dask-cudf, as you explained. Hopefully I will move to dask-cudf soon.

@jangorecki (Contributor, Author) commented Feb 25, 2020

Spilling to main memory cannot be done reliably without using dask-cudf. The currently implemented spilling was rolled back so we can still run the cudf benchmarks. Re-opening this issue to wait for dask-cudf support.

@taureandyernv commented Mar 26, 2020

Hey @jangorecki, we use dask_cudf and RMM. Neither dask_cudf nor cudf by itself is designed to spill to main memory. Happy to show you an example.
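
Such an example might look roughly like the sketch below (a hypothetical illustration, not the actual snippet offered here); the file names, join key, and pool/limit sizes are all placeholders:

```python
# Hypothetical out-of-core join with dask-cudf on a dask-cuda cluster.
# rmm_pool_size pre-allocates an RMM pool per worker to reduce allocation
# overhead; device_memory_limit triggers GPU -> host spilling.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

cluster = LocalCUDACluster(rmm_pool_size="10GB", device_memory_limit="8GB")
client = Client(cluster)

x = dask_cudf.read_csv("x_*.csv")      # placeholder input files
small = dask_cudf.read_csv("small.csv")
ans = x.merge(small, on="id1")         # lazy; runs when computed
print(len(ans))                        # len() forces the computation
```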

@jangorecki (Contributor, Author) commented Mar 26, 2020

@taureandyernv Thanks for trying to help. Although spilling with cudf to main memory works, it is not reliable, because it can corrupt the driver context, after which the whole machine has to be rebooted. So I agree it only makes sense to use it with dask_cudf, which AFAIU is not affected by that issue.
Your example is good, but it would be even better if you could contribute it to the cudf repository as documentation. I am now waiting for rapidsai/cudf#2277 and rapidsai/cudf#2288 (you are even mentioned there). If your example does not cover those cases, it won't help much to push this issue forward.
