
cudf should spill to main memory when running out of GPU memory #129

Closed
jangorecki opened this issue Jan 9, 2020 · 9 comments · Fixed by #219

@jangorecki (Contributor) commented Jan 9, 2020

According to a comment in rapidsai/cudf#2288 (comment), one could spill to main memory without actually using dask-cudf.
Related: #126
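
The approach referenced there is to back cudf's allocations with CUDA managed (unified) memory via RMM. A minimal sketch, assuming RMM's Python API; the file path is a placeholder and the exact `rmm.reinitialize` signature has varied across releases:

```python
# Hypothetical sketch: enable CUDA managed (unified) memory through RMM so
# cudf allocations can oversubscribe the GPU and page to host RAM on demand.
# The CSV path is a placeholder; treat this as illustrative only.
import rmm
import cudf

# Route allocations through cudaMallocManaged before cudf touches the GPU.
rmm.reinitialize(managed_memory=True, pool_allocator=False)

df = cudf.read_csv("data.csv")  # the frame may now exceed physical GPU memory
print(df.head())
```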

@jangorecki added the cudf label on Jan 9, 2020
@jangorecki changed the title from "cudf should spli to main memory when running out of gpu memory" to "cudf should spil to main memory when running out of gpu memory" on Jan 9, 2020
@jangorecki (Contributor, Author)

Solving this issue allows cudf to compute medium-size data (5 GB). AFAIK 50 GB was still failing due to OOM (main memory), so I filed a new FR in cudf to handle such cases: rapidsai/cudf#3740

@jangorecki (Contributor, Author) commented Jan 10, 2020

@datametrician I would appreciate any clues about what the problem might be.

Since I switched to using managed memory, I have started to get the following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  rmm_allocator::allocate(): RMM_ALLOC: unspecified launch failure

It causes the CUDA driver to hang (I assume); trying to use cudf in another session hangs that session as well. I cannot even kill the process (listed in nvidia-smi) using kill -9. I also tried nvidia-smi -r, but it gives

GPU Reset couldn't run because GPU 00000000:02:00.0 is the primary GPU.

The only way out seems to be a hard reboot, which is not an option at the moment.

@jangorecki (Contributor, Author)

More complete output, @datametrician:

Traceback (most recent call last):
  File "./cudf/join-cudf.py", line 34, in <module>
    x = cu.read_csv(src_jn_x, header=0, dtype=['int32','int32','int32','str','str','str','float64'])
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/io/csv.py", line 82, in read_csv
    index_col=index_col,
  File "cudf/_lib/csv.pyx", line 41, in cudf._lib.csv.read_csv
  File "cudf/_lib/csv.pyx", line 205, in cudf._lib.csv.read_csv
RuntimeError: rmm_allocator::allocate(): RMM_ALLOC: unspecified launch failure

@taureandyernv commented Jan 10, 2020

@jangorecki, can you join the RAPIDS Go-AI Slack channel? We do have a feature for this, dask_cudf, and I can show you how to use dask_cudf to get around this. Is there a reason why dask_cudf is insufficient for your benchmarks? Looking forward to chatting!

@kkraus14
@jangorecki this issue looks like managed memory eating up the system memory to the point that the driver context gets corrupted, in which case unfortunately the only option is to restart the machine. UVM only supports spilling to host memory because the migration from host --> GPU occurs via a page-fault mechanism that won't work with disks.

As Taurean pointed out, dask-cudf has a different mechanism for managing memory: it chunks the workload, monitors memory usage, and spills from GPU --> host --> disk as needed. If your workload is larger than system memory, I would highly recommend using dask-cudf.
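
A minimal sketch of that setup, assuming the dask_cuda package; the memory limit, file pattern, and column names are placeholders rather than values from this thread:

```python
# Sketch of GPU -> host -> disk spilling with dask-cudf, assuming dask_cuda.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

# Each worker spills device memory to host once it holds roughly 4 GB on the
# GPU; Dask's regular host memory limits then handle host -> disk spilling.
cluster = LocalCUDACluster(device_memory_limit="4GB")
client = Client(cluster)

# Lazily partitioned read; each partition is an ordinary cudf DataFrame.
ddf = dask_cudf.read_csv("data_*.csv")
print(ddf["v1"].mean().compute())
```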

@jangorecki (Contributor, Author) commented Jan 11, 2020

@taureandyernv Thanks for your comment. It is not that dask-cudf is insufficient; I want to use dask-cudf in the benchmarks. The problem is that I found the documentation lacking for my use case (see "dask_cudf.read_csv docstring").
I know I could try to figure it out myself by asking on GH (which I actually did) or by reading existing GH comments, but

  1. it takes much more time than just reading documentation,
  2. it is not guaranteed to succeed, as some parts might not have been implemented yet,
  3. the API is not guaranteed to be stable; everything that is not in the documentation should be considered subject to change without notice, and adapting code to changes that don't have to be listed in the changelog as breaking changes is even more time-consuming.

@kkraus14 Thanks for your comment, it helps a lot. It is quite bad that it is so easy to corrupt the driver context. IMO that is a good reason to warn users before they use managed memory with cudf alone, though of course not with dask-cudf, as you explained. Hopefully I will move to dask-cudf soon.

@jangorecki (Contributor, Author) commented Feb 25, 2020

Spilling to main memory cannot be done reliably without using dask-cudf. The currently implemented spilling was rolled back so we can still run the cudf benchmarks. Re-opening this issue to wait for dask-cudf support.

@taureandyernv commented Mar 26, 2020

Hey @jangorecki, we use dask_cudf and RMM. Neither dask_cudf nor cudf by itself is designed to spill to main memory. Happy to show you an example.
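
Such an example might look roughly like the sketch below (a hypothetical illustration, not the actual snippet offered here); the file names, join key, and pool/limit sizes are all placeholders:

```python
# Hypothetical out-of-core join with dask-cudf on a dask-cuda cluster.
# rmm_pool_size pre-allocates an RMM pool per worker to reduce allocation
# overhead; device_memory_limit triggers GPU -> host spilling.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

cluster = LocalCUDACluster(rmm_pool_size="10GB", device_memory_limit="8GB")
client = Client(cluster)

x = dask_cudf.read_csv("x_*.csv")      # placeholder input files
small = dask_cudf.read_csv("small.csv")
ans = x.merge(small, on="id1")         # lazy; runs when computed
print(len(ans))                        # len() forces the computation
```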

@jangorecki (Contributor, Author) commented Mar 26, 2020

@taureandyernv Thanks for trying to help. Although spilling with cudf to main memory works, it is not reliable, because it can corrupt the driver context, after which the whole machine has to be rebooted. So I agree it only makes sense to use it with dask_cudf, which AFAIU is not affected by that issue.
Your example is good, but it would be even better if you could contribute it to the cudf repository as documentation. I am now waiting for rapidsai/cudf#2277 and rapidsai/cudf#2288 (you are even mentioned there). If your example does not cover those cases, it won't help much to push this issue forward.
