
investigate cuDF memory usage #94

Closed
jangorecki opened this issue Aug 21, 2019 · 10 comments
jangorecki commented Aug 21, 2019

The grouping benchmark runs fine for cuDF on 1e7 rows (0.5 GB csv) of data.
When trying to run 1e8 rows (5 GB csv), it fails during/after loading the data with

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: out of memory

This is surprising because there should be enough memory: we have roughly 22 GB of GPU memory across the two cards (11178 MiB each):

> nvidia-smi
Wed Aug 21 02:33:14 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:02:00.0 Off |                  N/A |
| 23%   33C    P8    16W / 250W |      1MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:81:00.0 Off |                  N/A |
| 23%   37C    P8    12W / 250W |      1MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

cuDF 0.8.0

Related issue rapidsai/cudf#2478

@datametrician

cuDF is single-GPU only, but can use both GPUs with Dask (Dask-cuDF). With a 5 GB CSV you're probably running OOM on a single card. With Dask-cuDF you can break the loading into smaller chunks, which should make this fit on a 1080 Ti.
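The chunked-loading idea can be illustrated library-agnostically: if you only ever hold one chunk in memory, the peak footprint is bounded by the chunk size rather than the file size. A minimal pure-Python sketch (conceptual only; Dask-cuDF does this partitioning internally, and the column name and chunk size here are made up):

```python
import csv
import io

def chunked_sum(lines, value_col, chunk_rows=2):
    """Stream a CSV a few rows at a time; peak memory is bounded
    by chunk_rows, not by the size of the whole file."""
    reader = csv.DictReader(lines)
    total = 0.0
    chunk = []
    for row in reader:
        chunk.append(float(row[value_col]))
        if len(chunk) == chunk_rows:
            total += sum(chunk)  # process, then discard the chunk
            chunk = []
    total += sum(chunk)  # leftover rows from the final partial chunk
    return total

data = io.StringIO("id,v\na,1\nb,2\nc,3\nd,4\ne,5\n")
print(chunked_sum(data, "v"))  # 15.0
```

The same principle is why a 5 GB CSV can be processed on an 11 GB card once the load is split into partitions.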

jangorecki commented Nov 14, 2019

Thanks @datametrician, this is very valuable information. It is also surprising that cuDF does not have this built in. I filed a question in rapidsai/cudf#3374.
I opened a new issue: #116

@datametrician

All of our algorithms and kernels use Dask for scaling to multiple GPUs within a machine and to multiple nodes. In a way it's no different from pandas and scikit-learn being single-core or single-node by default, where you need something like Dask to scale them out. This makes the programming model more straightforward: single-node multi-GPU code is the same as multi-node multi-GPU code.

@jangorecki

It is surely an advantage when you might want to scale to multiple machines, but if you are only interested in a single-node environment, you are likely to get better performance from parallelism designed specifically for a single node. According to the linked cudf question it is not on the roadmap, and dask-cudf is going to be the standard way to parallelise on a single machine as well.

@datametrician

I actually don't think you'd get more performance even on a single node. UCX optimizes transport with RDMA, so single-node performance should be identical, and it scales as an upside :)


jangorecki commented Dec 11, 2019

I investigated the limits of what we can currently compute using our 2x GeForce GTX 1080 Ti.

I used the following query while running various data sizes (note that cardinality didn't change).

nvidia-smi --query-gpu=timestamp,index,memory.total,memory.used,memory.free --format=csv,noheader,nounits -lms 100 -f gpu_G1_1e7_1e2.out

cudf

Using 11 GB we are able to compute the groupby benchmark up to 5e7 rows (2.3 GB csv), and we start to run out of memory at 6e7 rows (2.8 GB csv).
Groupby on 1e7 rows (0.45 GB csv) required 2.25 GB of memory. Using dask-cudf (#116) may or may not allow computing the 1e8 rows (4.7 GB csv) data: assuming memory requirements scale linearly, it is likely to run out of memory.
[plot "max_mem-cudf": peak GPU memory usage per input size]
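A back-of-the-envelope check of that linear-scaling assumption, using the numbers measured above (2.25 GB for 1e7 rows) and the device memory reported by nvidia-smi (11178 MiB per card):

```python
# Measured: 1e7 rows (0.45 GB csv) needed ~2.25 GB of GPU memory.
mem_per_1e7_rows = 2.25  # GB

# Linear extrapolation to 1e8 rows:
projected_1e8 = mem_per_1e7_rows * 10  # 22.5 GB

# Total device memory across both 1080 Ti cards (11178 MiB each):
total_gpu_mem = 2 * 11178 / 1024  # ~21.8 GB

# Even pooling both GPUs, the projected footprint exceeds the total:
print(projected_1e8 > total_gpu_mem)  # True
```

So under the linear assumption, 1e8 rows would not fit even across both cards combined, let alone on a single 11 GB device.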

@datametrician

With Dask-cuDF you can read chunks of a CSV, so you can squeeze more in. You can read 1M rows at a time and should be able to fit more in memory (plus you can spill to system memory, but that hurts performance).


jangorecki commented Dec 13, 2019

@datametrician Thanks for the info. Any idea whether processing by chunks requires micromanaging (like manually merging groups from different chunks, i.e. re-aggregating), or is it just a matter of setting a parameter?
I recently added support for dask's on-disk data storage using parquet. The only change was to use a different function to load the data; no extra post-query chunk micromanaging was required.
As a result dask can compute the 1e9 groupby data size. The only problem is that it is up to 100 times slower than other tools.

@datametrician

Just a matter of setting a parameter. Dask, in this case, will use system memory instead of disk. Hopefully it's not 100x slower. That being said, I think the easiest thing to do is to upgrade the GPUs to RTX 8000s; that is what the NVIDIA DSWS ships with.


jangorecki commented May 14, 2020

Posting the code I used in this issue, for future reference.

nvidia-smi --query-gpu=timestamp,index,memory.total,memory.used,memory.free --format=csv,noheader,nounits -lms 100 -f gpu_G1_1e7_1e2.out
library(data.table)
# read all nvidia-smi logs, one file per benchmark run
d = rbindlist(
  lapply(setNames(nm=list.files(pattern="^gpu_G1_.*\\.out$")), fread,
         col.names=c("timestamp","gpu","total","used","free")),
  idcol="in_rows"
)
d[, in_rows:=sapply(strsplit(in_rows, "_", fixed=TRUE), `[[`, 3L)]  # e.g. "1e7"
d[, timestamp:=as.POSIXct(timestamp)]

d[in_rows=="1e7", plot(type="l", x=timestamp, y=used)]

# trim leading/trailing idle samples, keeping the span where memory was in use
subset.used = function(dt) {
  r = range(which(dt$used > 1))
  dt[r[1L]:r[2L]]
}
dd = d[in_rows%in%c("1e7","3e7","5e7","6e7"), subset.used(.SD), by="in_rows"]
dd[, timestamp:=seq_along(timestamp)]

lattice::xyplot(
  used ~ timestamp | in_rows,
  dd, type="l", grid=TRUE, groups=gpu,
  main="cudf GPU memory usage",
  xlab = "timestamp",
  ylab = "MB",
  scales=list(y=list(
    relation="free"#,
    #limits=rep(ld[solution==s, .(ylim=max(c(0, time_sec_1), na.rm=TRUE)), in_rows][ylim>0, list(list(c(0, ylim))), in_rows]$V1, each=3)
  )),
  auto.key=list(points=FALSE, lines=TRUE)
)

#grep G1_5e7 time.csv | cut -d"," -f 5,7,10,14,15
data.table::fread("
G1_5e7_1e2_0_0,sum v1 by id1,cudf,1,0.057
G1_5e7_1e2_0_0,sum v1 by id1,cudf,2,0.05
G1_5e7_1e2_0_0,sum v1 by id1:id2,cudf,1,0.057
G1_5e7_1e2_0_0,sum v1 by id1:id2,cudf,2,0.058
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,cudf,1,0.278
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,cudf,2,0.278
G1_5e7_1e2_0_0,mean v1:v3 by id4,cudf,1,0.125
G1_5e7_1e2_0_0,mean v1:v3 by id4,cudf,2,0.117
G1_5e7_1e2_0_0,sum v1:v3 by id6,cudf,1,0.303
G1_5e7_1e2_0_0,sum v1:v3 by id6,cudf,2,0.303
G1_5e7_1e2_0_0,sum v3 count by id1:id6,cudf,1,0.707
G1_5e7_1e2_0_0,sum v3 count by id1:id6,cudf,2,0.705
G1_5e7_1e2_0_0,sum v1 by id1,data.table,1,0.54
G1_5e7_1e2_0_0,sum v1 by id1,data.table,2,0.405
G1_5e7_1e2_0_0,sum v1 by id1:id2,data.table,1,0.533
G1_5e7_1e2_0_0,sum v1 by id1:id2,data.table,2,0.486
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,data.table,1,0.672
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,data.table,2,0.639
G1_5e7_1e2_0_0,mean v1:v3 by id4,data.table,1,0.837
G1_5e7_1e2_0_0,mean v1:v3 by id4,data.table,2,0.854
G1_5e7_1e2_0_0,sum v1:v3 by id6,data.table,1,0.704
G1_5e7_1e2_0_0,sum v1:v3 by id6,data.table,2,0.697
G1_5e7_1e2_0_0,median v3 sd v3 by id4 id5,data.table,1,5.003
G1_5e7_1e2_0_0,median v3 sd v3 by id4 id5,data.table,2,4.822
G1_5e7_1e2_0_0,max v1 - min v2 by id3,data.table,1,2.397
G1_5e7_1e2_0_0,max v1 - min v2 by id3,data.table,2,2.388
G1_5e7_1e2_0_0,largest two v3 by id6,data.table,1,4.414
G1_5e7_1e2_0_0,largest two v3 by id6,data.table,2,4.342
G1_5e7_1e2_0_0,regression v1 v2 by id2 id4,data.table,1,2.697
G1_5e7_1e2_0_0,regression v1 v2 by id2 id4,data.table,2,2.232
G1_5e7_1e2_0_0,sum v3 count by id1:id6,data.table,1,3.196
G1_5e7_1e2_0_0,sum v3 count by id1:id6,data.table,2,2.828",
col.names=c("data","question","solution","run","time")) -> d
d[, iquestion:=as.integer(factor(question, levels=unique(question)))]
# pair data.table and cudf timings for each question/run
dd = d[solution=="data.table"][d[solution=="cudf"], on=c("data","iquestion","run"),
  nomatch=NULL][, .(iquestion, run, dt=time, cudf=i.time)]
ddd = dd[, cudf2dt := cudf/dt][]
ddd
rbind(ddd, list(NA, NA, sum(ddd$dt), sum(ddd$cudf), mean(ddd$cudf2dt)))
