
lower GPU memory reserve to 256MB #4046

Merged 2 commits into NVIDIA:branch-21.12 on Nov 8, 2021

Conversation

rongou (Collaborator) commented Nov 5, 2021

The current reserve is set to 1GB, which may be too high and can cause some queries to run out of memory. This lowers it to a more reasonable value. If a user still runs out of memory when launching large kernels, they can tweak this parameter.
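For anyone who does hit launch failures after this change, here is a minimal sketch of overriding the reserve, assuming the plugin's `spark.rapids.memory.gpu.reserve` setting; the 512MB value is only an example:

```scala
// Sketch only: the 512MB value is illustrative, not a recommendation.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("reserve-override-example")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Raise the reserve again if large kernel launches still fail with the new default.
  .config("spark.rapids.memory.gpu.reserve", (512L * 1024 * 1024).toString)
  .getOrCreate()
```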

Fixes #4045

Signed-off-by: Rong Ou <rong.ou@gmail.com>
@rongou added labels: bug (Something isn't working), documentation (Improvements or additions to documentation), ease of use (Makes the product simpler to use or configure) on Nov 5, 2021
@rongou requested review from jlowe and abellina on Nov 5, 2021
@rongou self-assigned this on Nov 5, 2021
jlowe (Member) commented Nov 5, 2021

The reason this was set to a large value originally is that a customer query was casting strings to timestamps, and that's not a rare thing for a query to do. Back then, the string-to-timestamp casting code involved quite a few complicated regex replaces with back references, and that kernel had a large thread stack space requirement, so large that the query would not run at all unless the reserve was set to 1GB.

I totally get the desire to lower the amount of reserve memory (it's like getting a bigger GPU for free! 😄), but IMO there's a significant associated risk. The default reserve setting could end up being so low that it crashes on some relatively common query operations. That leads to a very poor first impression, as users would not be told, when it crashes, that they could try tuning this reserve config or, counterintuitively, lowering the amount of memory the plugin memory pool uses to fix the out-of-memory issue. (This is insufficient memory for the driver to complete a kernel launch rather than insufficient memory for a plugin allocation request.)

The timestamp casting code has changed quite a bit since the reserve was tuned, and there may have been libcudf improvements in the memory utilization of the regex backref kernel, so we might be able to get away with a significantly lower setting. However, I think we should test the proposed setting, given that casting strings to timestamps is probably not a rare occurrence in many ETL pipelines.
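For illustration (not from the PR), a sketch of the kind of query operation being discussed, with made-up column and path names and assuming a SparkSession as in the earlier sketch; the string-to-timestamp cast is the operation whose GPU kernel launch requirements the reserve has to cover:

```scala
// Illustrative only: column and path names are hypothetical.
// Casting strings to timestamps is the operation whose GPU kernel
// historically needed a large amount of spare device memory at launch time.
val events = spark.read.parquet("/data/raw_events")

events
  .selectExpr("CAST(event_time AS TIMESTAMP) AS event_ts", "event_id")
  .write.mode("overwrite").parquet("/data/typed_events")
```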

rongou (Collaborator, Author) commented Nov 6, 2021

For q93 we tried 128MB and 64MB, and both succeeded across 10 runs. Running the tests locally on my desktop, they fail with 64MB but pass with 128MB, so I guess we are pretty close to the boundary.

For reference, when I originally worked on the arena allocator, I had to reserve some memory too when the max was not specified, and some testing also led to 64MB: https://github.com/rapidsai/rmm/blob/branch-21.12/include/rmm/mr/device/detail/arena.hpp#L295

I think the reason it worked before was more or less a coincidence: setting the max pool size to total memory minus a 1GB reserve didn't make much sense, but it just happened to work.
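To make that relationship concrete, a back-of-the-envelope sketch with illustrative numbers (the plugin derives these values itself):

```scala
// Illustrative arithmetic only, not the plugin's actual code.
val totalGpuMemory = 16L * 1024 * 1024 * 1024 // e.g. a 16 GiB GPU
val oldReserve     = 1L * 1024 * 1024 * 1024  // previous 1GB default
val newReserve     = 256L * 1024 * 1024       // default after this PR

// The max pool size is whatever is left after holding back the reserve
// for kernel launches and other non-pool allocations.
val oldMaxPool = totalGpuMemory - oldReserve  // 15 GiB available to the pool
val newMaxPool = totalGpuMemory - newReserve  // 15.75 GiB available to the pool
```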

I'm hoping the changes I'm making to the arena allocator will make it more resistant to memory fragmentation, so maybe we can raise this reserve later.

rongou (Collaborator, Author) commented Nov 6, 2021

build

rongou (Collaborator, Author) commented Nov 6, 2021

build

@jlowe changed the title from "lower gpu memory reserve to 128MB" to "lower GPU memory reserve to 256MB" on Nov 8, 2021
@jlowe changed the title from "lower gpu memory reserve to 256MB" to "lower GPU memory reserve to 256MB" on Nov 8, 2021
@abellina merged commit 1b731ae into NVIDIA:branch-21.12 on Nov 8, 2021
Successfully merging this pull request may close these issues.

[BUG] q93 failed in this week's NDS runs