
use on-disk data storage if OOM happens #126

Closed · 4 tasks done

jangorecki opened this issue Dec 1, 2019 · 5 comments

Comments

@jangorecki
Contributor

jangorecki commented Dec 1, 2019

Solutions that run out of memory (OOM) and are capable of using on-disk storage should do so. AFAIU this is currently possible for pydatatable, spark and dask.

  • pydatatable 1e9 join
  • spark 1e9 join
  • dask 1e9 join
  • dask 1e9 groupby
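To illustrate the general idea (this is only a sketch; pydatatable, spark and dask each have their own spill-to-disk mechanisms, and none of them uses sqlite — it is chosen here purely because it is in the Python standard library), an on-disk join keeps both tables in files rather than in RAM:

```python
# Illustrative sketch only: an on-disk join avoids holding both tables
# in RAM by letting a disk-backed store do the work. Not the mechanism
# any of the benchmarked solutions actually uses.
import os
import sqlite3
import tempfile

def on_disk_join(left_rows, right_rows):
    """Join two iterables of (id, value) rows via an on-disk database."""
    path = os.path.join(tempfile.mkdtemp(), "join.db")
    con = sqlite3.connect(path)  # data lives on disk, not in RAM
    con.execute("CREATE TABLE lhs (id INTEGER, v1 REAL)")
    con.execute("CREATE TABLE rhs (id INTEGER, v2 REAL)")
    con.executemany("INSERT INTO lhs VALUES (?, ?)", left_rows)
    con.executemany("INSERT INTO rhs VALUES (?, ?)", right_rows)
    con.execute("CREATE INDEX idx_rhs ON rhs (id)")  # speed up the probe side
    rows = con.execute(
        "SELECT lhs.id, lhs.v1, rhs.v2 FROM lhs JOIN rhs USING (id)"
    ).fetchall()
    con.close()
    return rows

result = on_disk_join([(1, 1.5), (2, 2.5)], [(2, 20.0), (3, 30.0)])
print(result)  # [(2, 2.5, 20.0)] — only id=2 matches
```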
@jangorecki
Contributor Author

jangorecki commented Dec 4, 2019

Timeout for join has been increased from 60 to 120 minutes due to the much longer processing time for the newly added spark 1e9 join that uses on-disk data storage.
I experimented a little with the spark memory limit; timings for the 1e9 join:

 90G -  9h
100G -  9h
110G - 14h
120G - spark crash: "There is insufficient memory for the Java Runtime Environment to continue."
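The memory limit here corresponds to the JVM heap handed to spark; the crash at 120G presumably means heap plus JVM off-heap overhead exceeded the machine's physical RAM. A hypothetical single-node invocation along these lines (flag names are standard spark-submit options; the script name and the 100g value, mirroring the best timing above, are assumptions, not the benchmark's actual launch command):

```
# hypothetical single-node run; 100g mirrors the sweet spot measured above
spark-submit \
  --master "local[*]" \
  --driver-memory 100g \
  join-spark.py
```

Spilling itself then happens inside the script, e.g. by persisting DataFrames with a storage level that allows disk (such as `MEMORY_AND_DISK`) instead of memory-only.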

@jangorecki
Contributor Author

jangorecki commented Dec 4, 2019

Timeout has to be increased much further. Comparing data.table and spark on groupby (1e7, k=100) vs join, the latter seems to take 2-8x longer. This is likely caused by data loading: groupby requires loading 45 GB once, while join requires loading 55 GB twice. At those data sizes groupby can still be computed in memory, but join needs on-disk data storage, which increases computation time even more. To reduce the total amount of time the benchmark spends on the join task, we can make the timeout parameter granular per data size. So 1e7 could have 30 minutes, 1e8 could have 2 h (both should fit into memory), and 1e9 8 h (on disk). Then at least we won't wait 8 h for some slow solution trying to solve the 1e7 size.
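A size-granular timeout could be as simple as a lookup keyed on the size token in the dataset name (names and structure here are hypothetical, following the values suggested above, not the benchmark's actual configuration format):

```python
# Hypothetical sketch of a per-data-size timeout lookup; the values
# follow the suggestion above, not the benchmark's real configuration.
TIMEOUT_MINUTES = {
    "1e7": 30,    # fits in memory
    "1e8": 120,   # fits in memory
    "1e9": 480,   # needs on-disk storage
}

def timeout_for(data_name):
    """Return timeout in minutes for a dataset name like 'J1_1e9_NA_0_0'."""
    for size, minutes in TIMEOUT_MINUTES.items():
        if f"_{size}_" in data_name:
            return minutes
    raise ValueError(f"unknown data size in {data_name!r}")

print(timeout_for("J1_1e9_NA_0_0"))  # 480
```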

@jangorecki
Contributor Author

jangorecki commented Dec 10, 2019

All 4 scripts for the 3 solutions have been implemented and are already reflected on benchplot.

  • spark join 1e9: despite extending the 1e9 join timeout to 8 h, it is not able to finish the q5 (big-to-big) join. From previous tests it needs 9 h.
  • dask join 1e9: despite using on-disk data storage, it is running out of memory for the 1e9 join.
    Data are loaded (precisely speaking, mapped to on-disk tables), but the python script is killed by the OS OOM killer before even finishing the first run of the first question.
  • dask groupby 1e9: it resolves some questions where it previously failed on data read due to memory limits, but on-disk processing is so slow that it eventually times out, even with the already-extended 3 h timeout.
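The reason on-disk groupby can succeed where in-memory groupby fails (at the cost of being much slower) is that only the per-group aggregates need to stay in memory while the rows stream in from disk. A pure-stdlib sketch of that idea (not dask's implementation):

```python
# Illustrative sketch of out-of-core aggregation: stream the data in
# chunks and keep only running per-group sums in memory. Memory use
# scales with the number of groups, not the number of rows, which is
# why it works at 1e9 rows -- but the extra I/O passes make it slow.
from collections import defaultdict

def chunked_group_sum(chunks):
    """chunks: iterable of lists of (key, value) pairs, e.g. read from disk."""
    sums = defaultdict(float)
    for chunk in chunks:          # each chunk would come from disk
        for key, value in chunk:
            sums[key] += value
    return dict(sums)

chunks = [[("a", 1.0), ("b", 2.0)], [("a", 3.0)]]
print(chunked_group_sum(chunks))  # {'a': 4.0, 'b': 2.0}
```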

I am leaving this issue open because we should also carry the on-disk/in-memory information over to benchplot.

@jangorecki
Contributor Author

It is now marked on benchplot with an * suffix, and a related note is added below the plot in the report. This issue is resolved. Note that there is a related issue to use RAM when OOVM (out of video memory) happens for cudf: #116.

@jangorecki
Contributor Author

jangorecki commented Nov 21, 2020

There seems to be a performance regression in dask 2.30: it is no longer able to compute out-of-memory groupby questions using the parquet format. I am going to revert to the in-memory format to reduce maintenance and the size of data files, as no other solution uses parquet. Ideally we would replace parquet with arrow, which could be reused by more solutions.
