
Make better use of pinned memory pool #4497

Merged: 10 commits merged into NVIDIA:branch-22.02 on Jan 14, 2022

Conversation

rongou
Collaborator

@rongou rongou commented Jan 11, 2022

Currently the RapidsHostMemoryStore only knows about the size of the unpinned memory pool, and puts pinned and unpinned memory buffers at the same spill priority. This likely causes pinned memory buffers to stick around longer than necessary when spilling.

This change allows the store to keep total buffer size under the combined size of the pinned and unpinned memory pools, and try to spill pinned buffers first (lower priorities).
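As a rough illustration of the pinned-first ordering, here is a minimal sketch with hypothetical names; the actual RapidsHostMemoryStore types and priority constants are different.

```scala
// Hypothetical sketch only; names and constants do not match the plugin's code.
case class HostBuffer(id: Long, size: Long, pinned: Boolean, basePriority: Long)

object PinnedFirstOrdering {
  // Shift pinned buffers to much lower priorities so they sort, and spill, first.
  private val PinnedOffset: Long = Long.MinValue / 2

  def spillPriority(buf: HostBuffer): Long =
    if (buf.pinned) PinnedOffset + buf.basePriority else buf.basePriority

  // Lowest priority first: these are the first candidates to spill.
  def spillCandidates(buffers: Seq[HostBuffer]): Seq[HostBuffer] =
    buffers.sortBy(spillPriority)
}
```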

Testing on my desktop with UCX enabled on q50, performance increases with a larger pinned memory pool (up to ~60% with 64GB of pinned memory), and improves slightly with the common configuration (8GB pinned, 32GB unpinned).

As far as I can tell this doesn't have any effect when UCX is turned off.

Signed-off-by: Rong Ou <rong.ou@gmail.com>

Signed-off-by: Rong Ou <rong.ou@gmail.com>
@rongou rongou added the performance, shuffle, and improve labels Jan 11, 2022
@rongou rongou added this to the Jan 10 - Jan 28 milestone Jan 11, 2022
@rongou rongou self-assigned this Jan 11, 2022
Collaborator

@revans2 revans2 left a comment


Essentially the same comments as Jason. I think some of the gains are coming from us deciding that we are extending the size of the pool with pinned memory, instead of letting it be a bonus. That said I understand why we want to do that, but if we do start to spill, then it will leave no pinned memory for other operations. So I am conflicted on it.

Signed-off-by: Rong Ou <rong.ou@gmail.com>
Signed-off-by: Rong Ou <rong.ou@gmail.com>
jlowe previously approved these changes Jan 11, 2022
@rongou
Collaborator Author

rongou commented Jan 12, 2022

build

Signed-off-by: Rong Ou <rong.ou@gmail.com>
@rongou
Collaborator Author

rongou commented Jan 12, 2022

build

@revans2
Collaborator

revans2 commented Jan 12, 2022

The code looks good to me from a code perspective. My main concern is with UCX and something like the TPC-DS power benchmark.

For normal operations the spilled data should be short-lived, so using pinned memory before the others, and spilling/freeing it first, sounds fine.

For UCX on the TPC-DS power benchmark we will try to run lots of queries one right after the other. Spark does not quickly release the shuffle data once a query finishes. It relies on the RDDs being garbage collected on the driver. That means with this change we are likely to have all of the pinned memory pool and all of the spill pool holding old stale shuffle data. There will likely be no memory left for operations like a buffer for reading input data. This is likely to slow down those queries because they have to get non-pinned memory allocated directly from the OS. That in and of itself would probably be okay.

But, if we are running under cgroups (on YARN or Kubernetes), then we have now increased memory pressure, made it more likely that we will be shot for using too much memory, and also made it difficult for a user to predict how much memory we really will use. I think we already have this problem today with UCX, because we allocate the entire non-pinned pool and then try to use as much pinned memory as possible first. But because the OS does not actually give out pages until they are touched, it is likely that some pages in the non-pinned pool are never allocated at all...

I am probably just blowing this out of proportion, especially with UCX being off by default.

@abellina
Collaborator

Agree with @revans2. If I understand the change correctly, given that the target size is now larger (it includes both the pinned and pageable memory), this makes host memory spilling less aggressive. If there were a trigger to spill host pinned memory when it is running low (there is no OOM like in the device case), that would be different, but we don't have such a trigger, just this limit that we are now raising.

@rongou
Collaborator Author

rongou commented Jan 12, 2022

So there are two behavioral changes in this PR:

  1. When spilling, instead of trying to get under the size of the pageable memory pool, we now try to stay below the combined size of the pinned and pageable memory pools.
  2. Pinned memory buffers now have lower priorities, so they will be spilled first.

Change 2 doesn't introduce more pressure on the pinned memory pool; it should lessen it. Change 1 may lead to us holding onto pinned memory buffers longer since we are spilling less. On the other hand, the previous code spilled buffers indiscriminately between pinned and pageable, while the new code focuses on spilling pinned buffers, so the overall effect may be less pressure on pinned memory.
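To make the interaction between the two changes concrete, here is a self-contained sketch with assumed names; the real store spills buffers to a lower tier rather than simply dropping them, and its internals differ.

```scala
// Assumed names; not the actual RapidsHostMemoryStore implementation.
case class TrackedBuffer(id: Long, size: Long, pinned: Boolean, priority: Long)

class HostStoreSketch(pinnedPoolSize: Long, pageablePoolSize: Long) {
  private var buffers = Vector.empty[TrackedBuffer]

  // Change 1: spill until the total tracked size is under the combined pool sizes.
  private def spillTarget: Long = pinnedPoolSize + pageablePoolSize

  // Change 2: pinned buffers sort first, so they are spilled before pageable ones.
  private def spillOrder: Seq[TrackedBuffer] =
    buffers.sortBy(b => (if (b.pinned) 0 else 1, b.priority))

  /** Track a new buffer and return whatever had to be spilled to stay under target. */
  def addBuffer(buf: TrackedBuffer): Seq[TrackedBuffer] = {
    buffers :+= buf
    var spilled = Vector.empty[TrackedBuffer]
    while (buffers.map(_.size).sum > spillTarget) {
      val victim = spillOrder.head
      buffers = buffers.filterNot(_.id == victim.id)
      spilled :+= victim
    }
    spilled
  }
}
```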

@abellina
Collaborator

so the overall effect may be less pressure on pinned memory.

Except in some cases, like after a shuffle and before an expensive scan, possibly from executing queries one after the other (the power mode of execution in TPC-DS is a good example). In this case the store could be full from a prior expensive query, and everything is held since the RDD hasn't been GC'ed, but we are asking for pinned memory when scanning data for the next input, and it won't be there, especially now that our target size for the host store is larger.

@revans2
Collaborator

revans2 commented Jan 12, 2022

Okay, so I think my main concern with this really comes down to "Do we care about the maximum memory usage and making it predictable?" For me, I do care about this. But UCX is not really something that is on in production right now, and we already have this same problem, just perhaps not to the same degree. I would be fine with a separate follow-on issue to look at how we keep the worst-case host memory usage within a predictable bound, and whether we even care about it in practice.

For the other concern I can see cases on both sides. At this point it really is a matter of tuning and what we want the default tuning to be. I would be happy if we just had a config for the maximum amount of memory the spill pool could use, both pinned and unpinned, or for the maximum amount of pinned memory it could use. The default can be all of it. The idea is just to have some way to tune it in the future as we play around with this. That config could even be hidden to start with, so users don't need to think about it unless they run into problems.

But to answer the question of how it should be tuned by default we probably need to come up with some suite of benchmarks that we care about so we can decide which is better and which is not for these specific use cases. Without that we are just speculating. This is probably something we need to do for all of our performance work.

@rongou
Collaborator Author

rongou commented Jan 12, 2022

We do already have control over the sizes of both the pinned memory pool and the pageable memory pool:

  • spark.rapids.memory.host.spillStorageSize: size of the pageable memory pool, default 1GB
  • spark.rapids.memory.pinnedPool.size: size of the pinned memory pool, default 0

We don't have a way to fully limit host memory usage, though, since buffers larger than the size of the pool are allocated directly.
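For reference, the two existing settings can be set through ordinary Spark conf; the sizes below are illustrative only, not recommendations.

```scala
import org.apache.spark.SparkConf

// Illustrative sizes only.
val conf = new SparkConf()
  .set("spark.rapids.memory.pinnedPool.size", "8g")        // pinned memory pool
  .set("spark.rapids.memory.host.spillStorageSize", "32g") // pageable memory pool
```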

@rongou
Collaborator Author

rongou commented Jan 12, 2022

Except in some cases, like after a shuffle and before an expensive scan, possibly from executing queries one after the other (the power mode of execution in TPC-DS is a good example). In this case the store could be full from a prior expensive query, and everything is held since the RDD hasn't been GC'ed, but we are asking for pinned memory when scanning data for the next input, and it won't be there, especially now that our target size for the host store is larger.

The current code doesn't handle this well since the spilling could be mostly on pageable buffers.

@abellina
Collaborator

If we add the config that @revans2 suggested, I think it would be fine to go in. It would be a max for the pool that can be set smaller than pageable + pinned, but defaulted to pageable + pinned for now. It gives us the flexibility to override the behavior while still prioritizing the pinned buffer spilling. Thoughts?

@rongou
Collaborator Author

rongou commented Jan 13, 2022

Maybe add a new spark.rapids.memory.pageablePool.size for the pageable memory pool, and use the existing spark.rapids.memory.host.spillStorageSize as the spill target?

@jlowe
Member

jlowe commented Jan 13, 2022

Maybe add a new spark.rapids.memory.pageablePool.size for the pageable memory pool

I would prefer if this was spark.rapids.memory.host.pageablePool.size since all configs specific to host memory should be under the spark.rapids.memory.host. prefix.
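Assuming the new config lands under the name suggested here, the split discussed above might look something like this; the pageablePool.size key is the proposal in this thread, not a shipped setting, and the sizes are illustrative.

```scala
import org.apache.spark.SparkConf

// Proposed layout: pool sizes kept separate from the spill target.
val conf = new SparkConf()
  .set("spark.rapids.memory.pinnedPool.size", "8g")        // pinned pool
  .set("spark.rapids.memory.host.pageablePool.size", "1g") // pageable pool (proposed name)
  .set("spark.rapids.memory.host.spillStorageSize", "9g")  // spill target, defaulting to pinned + pageable per the discussion
```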

@abellina
Collaborator

+1 on the config!

@revans2
Collaborator

revans2 commented Jan 13, 2022

+1 for the new config idea

@rongou
Collaborator Author

rongou commented Jan 13, 2022

Added the new config. Please take another look.

revans2 previously approved these changes Jan 13, 2022
abellina previously approved these changes Jan 13, 2022
@abellina
Collaborator

build

@rongou rongou dismissed stale reviews from abellina and revans2 via 2d60a3f January 13, 2022 22:28
@jlowe
Member

jlowe commented Jan 14, 2022

build

@rongou rongou merged commit 004cc39 into NVIDIA:branch-22.02 Jan 14, 2022