Make better use of pinned memory pool #4497
Conversation
Signed-off-by: Rong Ou <rong.ou@gmail.com>
Essentially the same comments as Jason. I think some of the gains are coming from us deciding to extend the size of the pool with pinned memory, instead of letting it be a bonus. That said, I understand why we want to do that; but if we do start to spill, it will leave no pinned memory for other operations. So I am conflicted on it.
Signed-off-by: Rong Ou <rong.ou@gmail.com>
Signed-off-by: Rong Ou <rong.ou@gmail.com>
build
Signed-off-by: Rong Ou <rong.ou@gmail.com>
build
The code looks good to me from a code perspective. My main concern is with UCX and something like the TPC-DS power benchmark.

For normal operations the spilled data should be short lived, so using pinned memory before the others, and spilling/freeing it first, sounds fine.

For UCX on the TPC-DS power benchmark we will try to run lots of queries one right after the other. Spark does not quickly release the shuffle data once a query finishes; it relies on the RDDs being garbage collected on the driver. That means with this change we are likely to have all of the pinned memory pool and all of the spill pool holding old, stale shuffle data. There will likely be no memory left for operations like a buffer for reading input data. This is likely to slow down those queries, because they have to get non-pinned memory allocated directly from the OS.

That in and of itself would probably be okay. But if we are running under cgroups (on YARN or Kubernetes), then we have now increased memory pressure, made it more likely that we will be shot for using too much memory, and also made it difficult for a user to predict how much memory we really will use. I think we already have this problem today with UCX, because we allocate the entire non-pinned pool and then try to use as much pinned memory as possible first. But because the OS does not actually give you pages until they are used, it is more likely that some of the non-pinned pool is never allocated at all...

I am probably just blowing this out of proportion, especially with UCX being off by default.
Agree with @revans2. If I understand the change correctly, given that the target size is now larger (by including both the pinned and pageable memory), this makes host memory spill less aggressive. If there were a trigger to spill host pinned memory when it is running low (there is no OOM like in the device case), that would be different, but we don't have such a trigger, just this limit that we are now raising.
So there are two behavioral changes in this PR:

1. The host store's spill target size now covers the combined pinned and pageable pools, instead of just the pageable pool, so we spill less often.
2. Pinned buffers are given lower spill priorities, so they are spilled first.

Change 2 doesn't introduce more pressure on the pinned memory pool; it should lessen it. Change 1 may possibly lead to us holding onto pinned memory buffers longer, since we are spilling less. On the other hand, the previous code spills buffers randomly between pinned and pageable, while the new code is more focused on spilling pinned buffers, so the overall effect may be less pressure on pinned memory.
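The pinned-first spill ordering can be illustrated with a small sketch. This is not the plugin's actual API; the names (`HostBuffer`, `PinnedPriorityOffset`, `spillOrder`) are hypothetical, but it shows the idea: pinned buffers get a large negative priority offset, so a store that spills victims in ascending priority order always picks pinned buffers before pageable ones.

```scala
// Illustrative sketch only, with hypothetical names (not the actual
// RapidsHostMemoryStore API): pinned buffers get a large negative priority
// offset so they sort ahead of pageable buffers when choosing spill victims.
case class HostBuffer(name: String, pinned: Boolean, basePriority: Long)

// Hypothetical offset applied to pinned buffers.
val PinnedPriorityOffset = -1000000L

def spillPriority(b: HostBuffer): Long =
  if (b.pinned) b.basePriority + PinnedPriorityOffset else b.basePriority

// Buffer names listed in the order they would be spilled (lowest priority first).
def spillOrder(buffers: Seq[HostBuffer]): Seq[String] =
  buffers.sortBy(spillPriority).map(_.name)
```

For example, given a pinned and a pageable buffer with the same base priority, `spillOrder` places the pinned one first.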
Except in some cases, like after a shuffle and before an expensive scan, possibly from executing queries one after the other (the power mode of execution in TPC-DS is a good example). In this case the store could be full from a prior expensive query, with everything held because the RDDs haven't been GC'ed, but we are asking for pinned memory when scanning data for the next input, and it won't be there, especially now that our target size for the host store is larger.
Okay, so I think my main concern with this really comes down to

For the other concern, I can see cases on both sides. At this point it really is a matter of tuning, and what we want the default tuning to be. I would be happy if we just had a config for the maximum amount of memory the spill pool could use, both pinned and unpinned, or for the maximum amount of pinned memory it could use. The default can be all of it. The idea is just to have some way to tune it in the future as we try to play around with this. That config could even be hidden to start out with, so users don't need to think about it unless they run into problems.

But to answer the question of how it should be tuned by default, we probably need to come up with a suite of benchmarks that we care about, so we can decide which is better and which is not for these specific use cases. Without that we are just speculating. This is probably something we need to do for all of our performance work.
We do already have control over the sizes of both the pinned memory pool and the pageable memory pool.

We don't have a way to fully limit host memory usage, though, since buffers larger than the size of the pool are allocated directly.
The current code doesn't handle this well, since the spilling could be mostly on pageable buffers.
If we add the config that @revans2 suggested, I think it would be fine to go in. It is a max for the pool that can be smaller than pageable + pinned, but defaults to pageable + pinned for now. It gives us the flexibility to override the behavior, while still prioritizing the pinned buffer spilling. Thoughts?
Maybe add a new

I would prefer if this was

+1 on the config!

+1 for the new config idea
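The proposed cap can be sketched roughly as follows. This is a minimal sketch under assumptions, not the actual config wiring; the function and parameter names are hypothetical. The host store's target size defaults to the combined pinned + pageable pool size, and the config can only lower it.

```scala
// Sketch of the proposed cap, with hypothetical names: the host spill store's
// target size defaults to the combined pinned + pageable pool size, but a
// user-supplied config value can lower it (never raise it).
def hostStoreTargetSize(pinnedPoolSize: Long,
                        pageablePoolSize: Long,
                        maxSpillStoreSize: Option[Long]): Long = {
  val combined = pinnedPoolSize + pageablePoolSize
  // If no cap is configured, use the combined pool size; otherwise clamp.
  maxSpillStoreSize.fold(combined)(math.min(_, combined))
}
```

With no cap set, an 8GB pinned + 32GB pageable setup yields a 40GB target; a 16GB cap lowers it to 16GB, and a cap larger than the combined pools has no effect.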
Signed-off-by: Rong Ou <rong.ou@gmail.com>
Added the new config. Please take another look.
build
build
Currently the `RapidsHostMemoryStore` only knows about the size of the unpinned memory pool, and puts pinned and unpinned memory buffers at the same spill priority. This likely causes pinned memory buffers to stick around longer than necessary when spilling.

This change allows the store to keep the total buffer size under the combined size of the pinned and unpinned memory pools, and to try to spill pinned buffers first (lower priorities).
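The spill behavior described above can be sketched as a victim-selection loop. This is an illustrative sketch with hypothetical names (`StoredBuffer`, `selectBuffersToSpill`), not the store's real implementation: when the total held size exceeds the combined target, the lowest-priority buffers (pinned first, given their lower priorities) are spilled until the store fits.

```scala
// Minimal sketch, with hypothetical names, of spilling down to a combined
// pinned + pageable target size: pick the lowest-priority buffers (pinned
// buffers, having lower priorities, go first) until the total fits.
case class StoredBuffer(id: Int, size: Long, priority: Long)

def selectBuffersToSpill(buffers: List[StoredBuffer],
                         targetSize: Long): List[StoredBuffer] = {
  var total = buffers.map(_.size).sum
  // Lowest priority spills first; stop as soon as we are under the target.
  buffers.sortBy(_.priority).takeWhile { b =>
    val needSpill = total > targetSize
    if (needSpill) total -= b.size
    needSpill
  }
}
```

For three 10-byte buffers and a 15-byte target, the two lowest-priority buffers are selected; with a target at or above the total size, nothing spills.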
Testing on my desktop with UCX enabled on q50, performance increases with a larger pinned memory pool (up to ~60% with 64GB pinned memory), and it is slightly better with the common configuration (8GB pinned, 32GB unpinned).
As far as I can tell this doesn't have any effect when UCX is turned off.
Signed-off-by: Rong Ou rong.ou@gmail.com