
Make better use of pinned memory pool #4497

Merged: 10 commits merged into NVIDIA:branch-22.02 on Jan 14, 2022

Conversation

rongou
Collaborator

@rongou rongou commented Jan 11, 2022

Currently the RapidsHostMemoryStore only knows about the size of the unpinned memory pool, and puts pinned and unpinned memory buffers at the same spill priority. This likely causes pinned memory buffers to stick around longer than necessary when spilling.

This change allows the store to keep total buffer size under the combined size of the pinned and unpinned memory pools, and try to spill pinned buffers first (lower priorities).
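As a rough illustration of the pinned-first ordering, here is a minimal sketch with hypothetical names; the actual RapidsHostMemoryStore types and priority constants are different.

```scala
// Hypothetical sketch only; names and constants do not match the plugin's code.
case class HostBuffer(id: Long, size: Long, pinned: Boolean, basePriority: Long)

object PinnedFirstOrdering {
  // Shift pinned buffers to much lower priorities so they sort, and spill, first.
  private val PinnedOffset: Long = Long.MinValue / 2

  def spillPriority(buf: HostBuffer): Long =
    if (buf.pinned) PinnedOffset + buf.basePriority else buf.basePriority

  // Lowest priority first: these are the first candidates to spill.
  def spillCandidates(buffers: Seq[HostBuffer]): Seq[HostBuffer] =
    buffers.sortBy(spillPriority)
}
```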

Testing on my desktop with UCX enabled on q50, performance increases with a larger pinned memory pool (up to ~60% with 64GB of pinned memory), and improves slightly with the common configuration (8GB pinned, 32GB unpinned).

As far as I can tell this doesn't have any effect when UCX is turned off.

Signed-off-by: Rong Ou <rong.ou@gmail.com>

Signed-off-by: Rong Ou <rong.ou@gmail.com>
@rongou rongou added the performance, shuffle, and improve labels Jan 11, 2022
@rongou rongou added this to the Jan 10 - Jan 28 milestone Jan 11, 2022
@rongou rongou self-assigned this Jan 11, 2022
Collaborator

@revans2 revans2 left a comment


Essentially the same comments as Jason. I think some of the gains are coming from us deciding that we are extending the size of the pool with pinned memory, instead of letting it be a bonus. That said I understand why we want to do that, but if we do start to spill, then it will leave no pinned memory for other operations. So I am conflicted on it.

Signed-off-by: Rong Ou <rong.ou@gmail.com>
Signed-off-by: Rong Ou <rong.ou@gmail.com>
jlowe previously approved these changes Jan 11, 2022
@rongou
Collaborator Author

rongou commented Jan 12, 2022

build

Signed-off-by: Rong Ou <rong.ou@gmail.com>
@rongou
Collaborator Author

rongou commented Jan 12, 2022

build

@revans2
Collaborator

revans2 commented Jan 12, 2022

The code looks good to me from a code perspective. My main concern is with UCX and something like the TPC-DS power benchmark.

For normal operations the spilled data should be short-lived, so using pinned memory before the others, and spilling/freeing it first, sounds fine.

For UCX on the TPC-DS power benchmark we will try to run lots of queries one right after the other. Spark does not quickly release the shuffle data once a query finishes. It relies on the RDDs being garbage collected on the driver. That means with this change we are likely to have all of the pinned memory pool and all of the spill pool holding old stale shuffle data. There will likely be no memory left for operations like a buffer for reading input data. This is likely to slow down those queries because they have to get non-pinned memory allocated directly from the OS. That in and of itself would probably be okay.

But, if we are running under cgroups (on YARN or Kubernetes), then we have now increased memory pressure, made it more likely that we will be shot for using too much memory, and also made it difficult for a user to predict how much memory we really will use. I think we already have this problem today with UCX, because we allocate the entire non-pinned pool and then try to use as much pinned memory as possible first. But because the OS does not actually give out pages until they are touched, it is likely that some pages in the non-pinned pool are never allocated at all...

I am probably just blowing this out of proportion, especially with UCX being off by default.

@abellina
Collaborator

Agree with @revans2. If I understand the change correctly, given that the target size is now larger (it includes both the pinned and pageable memory), this makes host memory spilling less aggressive. If there were a trigger to spill host pinned memory when it is running low (there is no OOM like in the device case), that would be different, but we don't have such a trigger, just this limit that we are now raising.

@rongou
Collaborator Author

rongou commented Jan 12, 2022

So there are two behavioral changes in this PR:

  1. When spilling, instead of trying to get under the size of the pageable memory pool, we now try to stay below the combined size of the pinned and pageable memory pools.
  2. Pinned memory buffers now have lower priorities, so they will be spilled first.

Change 2 doesn't introduce more pressure on the pinned memory pool; it should lessen it. Change 1 may lead to us holding onto pinned memory buffers longer since we are spilling less. On the other hand, the previous code spilled buffers indiscriminately between pinned and pageable, while the new code focuses on spilling pinned buffers, so the overall effect may be less pressure on pinned memory.
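To make the interaction between the two changes concrete, here is a self-contained sketch with assumed names; the real store spills buffers to a lower tier rather than simply dropping them, and its internals differ.

```scala
// Assumed names; not the actual RapidsHostMemoryStore implementation.
case class TrackedBuffer(id: Long, size: Long, pinned: Boolean, priority: Long)

class HostStoreSketch(pinnedPoolSize: Long, pageablePoolSize: Long) {
  private var buffers = Vector.empty[TrackedBuffer]

  // Change 1: spill until the total tracked size is under the combined pool sizes.
  private def spillTarget: Long = pinnedPoolSize + pageablePoolSize

  // Change 2: pinned buffers sort first, so they are spilled before pageable ones.
  private def spillOrder: Seq[TrackedBuffer] =
    buffers.sortBy(b => (if (b.pinned) 0 else 1, b.priority))

  /** Track a new buffer and return whatever had to be spilled to stay under target. */
  def addBuffer(buf: TrackedBuffer): Seq[TrackedBuffer] = {
    buffers :+= buf
    var spilled = Vector.empty[TrackedBuffer]
    while (buffers.map(_.size).sum > spillTarget) {
      val victim = spillOrder.head
      buffers = buffers.filterNot(_.id == victim.id)
      spilled :+= victim
    }
    spilled
  }
}
```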

@abellina
Collaborator

so the overall effect may be less pressure on pinned memory.

Except in some cases, like after a shuffle and before an expensive scan, possibly from executing queries one after the other (the power mode of execution in TPC-DS is a good example). In this case the store could be full from a prior expensive query, and everything is held since the RDD hasn't been GC'ed, but we are asking for pinned memory when scanning data for the next input, and it won't be there, especially now that our target size for the host store is larger.

@revans2
Collaborator

revans2 commented Jan 12, 2022

Okay, so I think my main concern with this really comes down to "Do we care about the maximum memory usage and making it predictable?" For me, I do care about this. But UCX is not really something that is on in production right now, and we already have this same problem, just perhaps not to the same degree. I would be fine with a separate follow-on issue to look at how we keep the worst-case host memory usage within a predictable bound, and whether we even care about it in practice.

For the other concern I can see cases on both sides. At this point it really is a matter of tuning and what we want the default tuning to be. I would be happy if we just had a config for the maximum amount of memory the spill pool could use, both pinned and unpinned, or for the maximum amount of pinned memory it could use. The default can be all of it. The idea is just to have some way to tune it in the future as we play around with this. That config could even be hidden to start with, so users don't need to think about it unless they run into problems.

But to answer the question of how it should be tuned by default we probably need to come up with some suite of benchmarks that we care about so we can decide which is better and which is not for these specific use cases. Without that we are just speculating. This is probably something we need to do for all of our performance work.

@rongou
Collaborator Author

rongou commented Jan 12, 2022

We do already have control over the sizes of both the pinned memory pool and the pageable memory pool:

  • spark.rapids.memory.host.spillStorageSize: size of the pageable memory pool, default 1GB
  • spark.rapids.memory.pinnedPool.size: size of the pinned memory pool, default 0

We don't have a way to fully limit host memory usage, though, since buffers larger than the size of the pool are allocated directly.
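For reference, the two existing settings can be set through ordinary Spark conf; the sizes below are illustrative only, not recommendations.

```scala
import org.apache.spark.SparkConf

// Illustrative sizes only.
val conf = new SparkConf()
  .set("spark.rapids.memory.pinnedPool.size", "8g")        // pinned memory pool
  .set("spark.rapids.memory.host.spillStorageSize", "32g") // pageable memory pool
```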

@rongou
Collaborator Author

rongou commented Jan 12, 2022

Except in some cases, like after a shuffle and before an expensive scan, possibly from executing queries one after the other (the power mode of execution in TPC-DS is a good example). In this case the store could be full from a prior expensive query, and everything is held since the RDD hasn't been GC'ed, but we are asking for pinned memory when scanning data for the next input, and it won't be there, especially now that our target size for the host store is larger.

The current code doesn't handle this well since the spilling could be mostly on pageable buffers.

@abellina
Collaborator

If we add the config that @revans2 suggested, I think it would be fine to go in. It would be a max for the pool that can be set smaller than pageable + pinned, but defaulted to pageable + pinned for now. It gives us the flexibility to override the behavior while still prioritizing the pinned buffer spilling. Thoughts?

@rongou
Collaborator Author

rongou commented Jan 13, 2022

Maybe add a new spark.rapids.memory.pageablePool.size for the pageable memory pool, and use the existing spark.rapids.memory.host.spillStorageSize as the spill target?

@jlowe
Member

jlowe commented Jan 13, 2022

Maybe add a new spark.rapids.memory.pageablePool.size for the pageable memory pool

I would prefer if this was spark.rapids.memory.host.pageablePool.size since all configs specific to host memory should be under the spark.rapids.memory.host. prefix.
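Assuming the new config lands under the name suggested here, the split discussed above might look something like this; the pageablePool.size key is the proposal in this thread, not a shipped setting, and the sizes are illustrative.

```scala
import org.apache.spark.SparkConf

// Proposed layout: pool sizes kept separate from the spill target.
val conf = new SparkConf()
  .set("spark.rapids.memory.pinnedPool.size", "8g")        // pinned pool
  .set("spark.rapids.memory.host.pageablePool.size", "1g") // pageable pool (proposed name)
  .set("spark.rapids.memory.host.spillStorageSize", "9g")  // spill target, defaulting to pinned + pageable per the discussion
```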

@abellina
Collaborator

+1 on the config!

@revans2
Collaborator

revans2 commented Jan 13, 2022

+1 for the new config idea

@rongou
Collaborator Author

rongou commented Jan 13, 2022

Added the new config. Please take another look.

revans2 previously approved these changes Jan 13, 2022
abellina previously approved these changes Jan 13, 2022
@abellina
Collaborator

build

@rongou rongou dismissed stale reviews from abellina and revans2 via 2d60a3f January 13, 2022 22:28
@jlowe
Member

jlowe commented Jan 14, 2022

build

@rongou rongou merged commit 004cc39 into NVIDIA:branch-22.02 Jan 14, 2022