
investigate cuDF memory usage #94

Closed
jangorecki opened this issue Aug 21, 2019 · 10 comments
jangorecki commented Aug 21, 2019

The grouping benchmark runs fine for cuDF on 1e7 rows (0.5 GB csv) of data.
When trying to run 1e8 rows (5 GB csv), it fails during/after loading the data with

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: out of memory

This is surprising because there should be enough memory: we have roughly 22 GB of GPU memory across the two cards (11178 MiB each):

> nvidia-smi
Wed Aug 21 02:33:14 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:02:00.0 Off |                  N/A |
| 23%   33C    P8    16W / 250W |      1MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:81:00.0 Off |                  N/A |
| 23%   37C    P8    12W / 250W |      1MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

cuDF 0.8.0

Related issue rapidsai/cudf#2478

@datametrician

cuDF is single-GPU only, but can use both GPUs with Dask (Dask-cuDF). With a 5 GB CSV you're probably running OOM on a single card. With Dask-cuDF you can break the loading into smaller chunks, which should make this fit on a 1080 Ti.
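The chunked-loading idea can be illustrated library-agnostically: if you only ever hold one chunk in memory, the peak footprint is bounded by the chunk size rather than the file size. A minimal pure-Python sketch (conceptual only; Dask-cuDF does this partitioning internally, and the column name and chunk size here are made up):

```python
import csv
import io

def chunked_sum(lines, value_col, chunk_rows=2):
    """Stream a CSV a few rows at a time; peak memory is bounded
    by chunk_rows, not by the size of the whole file."""
    reader = csv.DictReader(lines)
    total = 0.0
    chunk = []
    for row in reader:
        chunk.append(float(row[value_col]))
        if len(chunk) == chunk_rows:
            total += sum(chunk)  # process, then discard the chunk
            chunk = []
    total += sum(chunk)  # leftover rows from the final partial chunk
    return total

data = io.StringIO("id,v\na,1\nb,2\nc,3\nd,4\ne,5\n")
print(chunked_sum(data, "v"))  # 15.0
```

The same principle is why a 5 GB CSV can be processed on an 11 GB card once the load is split into partitions.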

jangorecki commented Nov 14, 2019

Thanks @datametrician, this is very valuable information. It is also surprising that cuDF does not have this built in. I filed a question in rapidsai/cudf#3374.
I opened a new issue: #116

@datametrician

All of our algorithms and kernels use Dask for scaling to multiple GPUs within a machine and to multiple nodes. In a way it's no different from pandas and scikit-learn being single-core or single-node by default, where you need something like Dask to scale them out. This makes the programming model more straightforward: single-node multi-GPU code is the same as multi-node multi-GPU code.

@jangorecki

It is surely an advantage when you might want to scale to multiple machines, but if you are only interested in a single-node environment, you are likely to get better performance from parallelism designed specifically for a single node. According to the linked cudf question it is not on the roadmap, and dask-cudf is going to be the standard way to parallelise on a single machine as well.

@datametrician

I actually don't think you'd get more performance even on a single node. UCX optimizes transport with RDMA, so single-node performance should be identical, and it scales as an upside :)


jangorecki commented Dec 11, 2019

I investigated the limits of what we can currently compute using our 2x GeForce GTX 1080 Ti.

I used the following query while running various data sizes (note that cardinality didn't change).

nvidia-smi --query-gpu=timestamp,index,memory.total,memory.used,memory.free --format=csv,noheader,nounits -lms 100 -f gpu_G1_1e7_1e2.out

cudf

Using 11 GB we are able to compute the groupby benchmark up to 5e7 rows (2.3 GB csv), and we start to run out of memory at 6e7 rows (2.8 GB csv).
Groupby on 1e7 rows (0.45 GB csv) required 2.25 GB of memory. Using dask-cudf (#116) may or may not allow computing the 1e8 rows (4.7 GB csv) data: assuming memory requirements scale linearly, it is likely to run out of memory.
[plot "max_mem-cudf": peak GPU memory usage per input size]
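A back-of-the-envelope check of that linear-scaling assumption, using the numbers measured above (2.25 GB for 1e7 rows) and the device memory reported by nvidia-smi (11178 MiB per card):

```python
# Measured: 1e7 rows (0.45 GB csv) needed ~2.25 GB of GPU memory.
mem_per_1e7_rows = 2.25  # GB

# Linear extrapolation to 1e8 rows:
projected_1e8 = mem_per_1e7_rows * 10  # 22.5 GB

# Total device memory across both 1080 Ti cards (11178 MiB each):
total_gpu_mem = 2 * 11178 / 1024  # ~21.8 GB

# Even pooling both GPUs, the projected footprint exceeds the total:
print(projected_1e8 > total_gpu_mem)  # True
```

So under the linear assumption, 1e8 rows would not fit even across both cards combined, let alone on a single 11 GB device.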

@datametrician

With Dask-cuDF you can read chunks of a CSV, so you can squeeze more in. You can read 1M rows at a time and should be able to fit more in memory (plus you can spill to system memory, but that hurts performance).


jangorecki commented Dec 13, 2019

@datametrician Thanks for the info. Any idea whether processing by chunks requires micromanaging (like manually merging groups from different chunks, i.e. re-aggregating), or is it just a matter of setting a parameter?
I recently added support for dask's on-disk data storage using parquet. The only change was to use a different function to load the data; no extra post-query chunk micromanaging was required.
As a result dask can compute the 1e9 groupby data size. The only problem is that it is up to 100 times slower than other tools.

@datametrician

Just a matter of setting a parameter. Dask, in this case, will use system memory instead of disk. Hopefully it's not 100x slower. That being said, I think the easiest thing to do is to upgrade the GPUs to RTX 8000s; that is what the NVIDIA DSWS ships with.


jangorecki commented May 14, 2020

Posting the code I used in this issue, for future reference.

nvidia-smi --query-gpu=timestamp,index,memory.total,memory.used,memory.free --format=csv,noheader,nounits -lms 100 -f gpu_G1_1e7_1e2.out
library(data.table)
# read all nvidia-smi logs, one file per benchmark run
d = rbindlist(
  lapply(setNames(nm=list.files(pattern="^gpu_G1_.*\\.out$")), fread,
         col.names=c("timestamp","gpu","total","used","free")),
  idcol="in_rows"
)
d[, in_rows:=sapply(strsplit(in_rows, "_", fixed=TRUE), `[[`, 3L)]  # e.g. "1e7"
d[, timestamp:=as.POSIXct(timestamp)]

d[in_rows=="1e7", plot(type="l", x=timestamp, y=used)]

# trim leading/trailing idle samples, keeping the span where memory was in use
subset.used = function(dt) {
  r = range(which(dt$used > 1))
  dt[r[1L]:r[2L]]
}
dd = d[in_rows%in%c("1e7","3e7","5e7","6e7"), subset.used(.SD), by="in_rows"]
dd[, timestamp:=seq_along(timestamp)]

lattice::xyplot(
  used ~ timestamp | in_rows,
  dd, type="l", grid=TRUE, groups=gpu,
  main="cudf GPU memory usage",
  xlab = "timestamp",
  ylab = "MB",
  scales=list(y=list(
    relation="free"#,
    #limits=rep(ld[solution==s, .(ylim=max(c(0, time_sec_1), na.rm=TRUE)), in_rows][ylim>0, list(list(c(0, ylim))), in_rows]$V1, each=3)
  )),
  auto.key=list(points=FALSE, lines=TRUE)
)

#grep G1_5e7 time.csv | cut -d"," -f 5,7,10,14,15
data.table::fread("
G1_5e7_1e2_0_0,sum v1 by id1,cudf,1,0.057
G1_5e7_1e2_0_0,sum v1 by id1,cudf,2,0.05
G1_5e7_1e2_0_0,sum v1 by id1:id2,cudf,1,0.057
G1_5e7_1e2_0_0,sum v1 by id1:id2,cudf,2,0.058
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,cudf,1,0.278
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,cudf,2,0.278
G1_5e7_1e2_0_0,mean v1:v3 by id4,cudf,1,0.125
G1_5e7_1e2_0_0,mean v1:v3 by id4,cudf,2,0.117
G1_5e7_1e2_0_0,sum v1:v3 by id6,cudf,1,0.303
G1_5e7_1e2_0_0,sum v1:v3 by id6,cudf,2,0.303
G1_5e7_1e2_0_0,sum v3 count by id1:id6,cudf,1,0.707
G1_5e7_1e2_0_0,sum v3 count by id1:id6,cudf,2,0.705
G1_5e7_1e2_0_0,sum v1 by id1,data.table,1,0.54
G1_5e7_1e2_0_0,sum v1 by id1,data.table,2,0.405
G1_5e7_1e2_0_0,sum v1 by id1:id2,data.table,1,0.533
G1_5e7_1e2_0_0,sum v1 by id1:id2,data.table,2,0.486
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,data.table,1,0.672
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,data.table,2,0.639
G1_5e7_1e2_0_0,mean v1:v3 by id4,data.table,1,0.837
G1_5e7_1e2_0_0,mean v1:v3 by id4,data.table,2,0.854
G1_5e7_1e2_0_0,sum v1:v3 by id6,data.table,1,0.704
G1_5e7_1e2_0_0,sum v1:v3 by id6,data.table,2,0.697
G1_5e7_1e2_0_0,median v3 sd v3 by id4 id5,data.table,1,5.003
G1_5e7_1e2_0_0,median v3 sd v3 by id4 id5,data.table,2,4.822
G1_5e7_1e2_0_0,max v1 - min v2 by id3,data.table,1,2.397
G1_5e7_1e2_0_0,max v1 - min v2 by id3,data.table,2,2.388
G1_5e7_1e2_0_0,largest two v3 by id6,data.table,1,4.414
G1_5e7_1e2_0_0,largest two v3 by id6,data.table,2,4.342
G1_5e7_1e2_0_0,regression v1 v2 by id2 id4,data.table,1,2.697
G1_5e7_1e2_0_0,regression v1 v2 by id2 id4,data.table,2,2.232
G1_5e7_1e2_0_0,sum v3 count by id1:id6,data.table,1,3.196
G1_5e7_1e2_0_0,sum v3 count by id1:id6,data.table,2,2.828",
col.names=c("data","question","solution","run","time")) -> d
d[, iquestion:=as.integer(factor(question, levels=unique(question)))]
# pair data.table and cudf timings for each question/run
dd = d[solution=="data.table"][d[solution=="cudf"], on=c("data","iquestion","run"),
  nomatch=NULL][, .(iquestion, run, dt=time, cudf=i.time)]
ddd = dd[, cudf2dt := cudf/dt][]
ddd
rbind(ddd, list(NA, NA, sum(ddd$dt), sum(ddd$cudf), mean(ddd$cudf2dt)))
