Optimize thread pool enqueue to handle job batches more efficiently. #45749
Conversation
Tagging subscribers to this area: @BrzVlad
Force-pushed from 92b58cd to ee40f4c
The following is a summary of minor GC pause times comparing default, default with minor=simple-par, and minor=simple-par plus this PR:

| Metric | Default | Simple-par | Simple-par + enqueue optimization | Improvement vs default | Improvement vs simple-par |
| -- | -- | -- | -- | -- | -- |
| Stdev (µs) | 1143 | 1047 | 952 | | |
| Avg (µs) | 2585 | 2206 | 1629 | 36.98% | 26.17% |
| TrimMean 10% (µs) | 2548 | 2146 | 1576 | 38.17% | 26.58% |
| TrimMean 25% (µs) | 2535 | 2106 | 1548 | 38.94% | 26.50% |
| First quartile – 25th percentile (µs) | 1598 | 1295 | 965 | 39.63% | 25.54% |
| Second quartile – 50th percentile (µs) | 2449 | 2062 | 1515 | 38.16% | 26.55% |
| Third quartile – 75th percentile (µs) | 3588 | 2892 | 2116 | 41.04% | 26.84% |
Force-pushed from ee40f4c to 278f96e
!! This PR is a copy of mono/mono#20634, please do not edit or review it in this repo !!
Do not automatically approve this PR:
* Consider how the changes affect configurations in this repo,
* Check effects on files that are not mirrored,
* Identify test cases that may be needed in this repo.
!! Merge the PR only after the original PR is merged !!
When running a parallel minor GC, the scan of the major and LOS heaps is split into several jobs based on the number of available cores * 4 * 2. On an 8 core machine that will generate 32 major heap and 32 LOS scanning jobs. The current implementation queued each individual job into the thread pool, but since the threads were not allowed to work on the items at that point, doing it this way creates a lot of signalled threads plus contention over the shared mutex for each queued job.
The commit optimizes this pattern by adding the ability to add a batch of allocated jobs to the thread pool at the same time, signalling all threads only once all jobs have been added to the queue. This reduces the number of mutex acquire/release cycles and brings the number of thread pool wake-ups down from 65 (including scan wbroots) to 1.
This is an upstream of a change that has been running in a downstream repo for a little over a year, giving a good performance boost on those platforms. As part of upstreaming I also ran some performance benchmarks of this PR on a desktop 8 core Windows PC. The test primarily stressed the job dispatch done as part of minor GC, using a major heap of ~2.5 GB and a LOS heap of ~1 GB, collecting around 1500 minor GC measurement points per configuration (over a stable GC state). The following is a summary of minor GC pause times comparing default, default with minor=simple-par, and minor=simple-par plus this PR:
To summarize, the results above indicate that minor GC pause times improved by ~38% when running with the enqueue optimizations implemented in this PR, on a desktop 8 core machine with a 2.5 GB major heap and a 1 GB LOS heap. Results can of course vary depending on hardware, workload and platform, but the benchmark above at least gives a good indication that batching jobs, and thereby reducing the number of unnecessary kernel calls and pressure on the scheduler, has a measurable positive impact on minor GC pause times.