
Optimize thread pool enqueue to handle job batches more efficiently. #45749

Merged 1 commit into dotnet:master on Jan 18, 2021

Conversation

monojenkins (Contributor)

!! This PR is a copy of mono/mono#20634, please do not edit or review it in this repo !!
Do not automatically approve this PR:

* Consider how the changes affect configurations in this repo,
* Check effects on files that are not mirrored,
* Identify test cases that may be needed in this repo.

!! Merge the PR only after the original PR is merged !!



When running a parallel minor GC, the scan of the major and LOS heaps is split into several jobs based on the number of available cores * 4 * 2. On an 8-core machine that generates 32 major heap and 32 LOS scanning jobs. The current implementation queues each individual job into the thread pool, but since the threads are not allowed to work on the items at this point, doing it this way creates a lot of signalled threads plus contention over the shared mutex for each queued job.

This commit optimizes that pattern by adding the ability to add a batch of allocated jobs to the thread pool at the same time, signalling all threads only once all jobs have been added to the queue. This reduces the number of mutex acquire/release operations and cuts the number of thread pool wake-ups from 65 (including the scan of wbroots) down to 1.
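The batching idea described above can be sketched with a generic mutex/condvar work queue. All names here (`job_queue_t`, `enqueue_job_batch`, ...) are illustrative, not the actual sgen thread pool API; the point is the contrast between one lock/signal round-trip per job and a single lock acquisition plus a single broadcast for the whole batch:

```c
#include <stddef.h>
#include <pthread.h>

typedef struct job {
    struct job *next;
    int id;
} job_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t work_available;
    job_t *head, *tail;
    size_t count;
} job_queue_t;

/* Old pattern: one mutex acquire/release and one signal per job.
 * With 65 jobs, that is 65 wake-ups plus 65 rounds of mutex contention. */
static void enqueue_job (job_queue_t *q, job_t *job)
{
    pthread_mutex_lock (&q->lock);
    job->next = NULL;
    if (q->tail)
        q->tail->next = job;
    else
        q->head = job;
    q->tail = job;
    q->count++;
    pthread_cond_signal (&q->work_available); /* one worker woken per job */
    pthread_mutex_unlock (&q->lock);
}

/* New pattern: append the whole batch under one lock acquisition and
 * broadcast once, after all jobs are visible in the queue. */
static void enqueue_job_batch (job_queue_t *q, job_t **jobs, size_t n)
{
    pthread_mutex_lock (&q->lock);
    for (size_t i = 0; i < n; i++) {
        jobs[i]->next = NULL;
        if (q->tail)
            q->tail->next = jobs[i];
        else
            q->head = jobs[i];
        q->tail = jobs[i];
    }
    q->count += n;
    pthread_cond_broadcast (&q->work_available); /* single wake-up for the batch */
    pthread_mutex_unlock (&q->lock);
}
```

The broadcast is also what makes this safe for the GC case described above: workers only start consuming after every job of the batch is already in the queue.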

This is an upstream of a change that has been running in downstream repos for a little over a year, giving a good performance boost on those platforms. As part of upstreaming I also ran some performance benchmarks of this PR on a desktop 8-core Windows PC. The test primarily stressed the job dispatch done as part of minor GC, using a major heap of ~2.5 GB and a LOS heap of ~1 GB, collecting around 1500 minor GC measurement points per configuration (over a stable GC state). The following is a summary of minor GC pause times comparing default, default with minor=simple-par, and minor=simple-par plus this PR:

| | Default | Simple-par | Simple-par + enqueue optimization | Improvement vs Default | Improvement vs Simple-par |
| -- | -- | -- | -- | -- | -- |
| Stdev (µs) | 1143 | 1047 | 952 | | |
| Avg (µs) | 2585 | 2206 | 1629 | 36.98% | 26.17% |
| TrimMean 10% (µs) | 2548 | 2146 | 1576 | 38.17% | 26.58% |
| TrimMean 25% (µs) | 2535 | 2106 | 1548 | 38.94% | 26.50% |
| First quartile – 25th percentile (µs) | 1598 | 1295 | 965 | 39.63% | 25.54% |
| Second quartile – 50th percentile (µs) | 2449 | 2062 | 1515 | 38.16% | 26.55% |
| Third quartile – 75th percentile (µs) | 3588 | 2892 | 2116 | 41.04% | 26.84% |

To summarize, the numbers above indicate that minor GC pause times improved by ~38% when running with the enqueue optimization implemented in this PR on a desktop 8-core machine with a 2.5 GB major heap and a 1 GB LOS heap. Results can of course vary depending on hardware, workload and platform, but the benchmark above at least gives a good indication that batching jobs, and thereby reducing unnecessary kernel calls and pressure on the scheduler, has a measurable positive effect on minor GC pause times.
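For reference, the statistics reported in the table (average, TrimMean, percentiles) can be computed from a set of pause-time samples as sketched below. This is a minimal illustration, not code from the PR; `trimmed_mean` follows Excel-style TRIMMEAN semantics (drop half the trim fraction from each tail), which is assumed to be what the table's "TrimMean 10%/25%" rows mean:

```c
#include <stddef.h>
#include <stdlib.h>

static int cmp_double (const void *a, const void *b)
{
    double x = *(const double *) a, y = *(const double *) b;
    return (x > y) - (x < y);
}

/* Plain arithmetic mean of n samples. */
static double mean (const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double) n;
}

/* TrimMean: sort the samples, drop trim_fraction/2 of them from each
 * tail, then average what remains. Robust against outlier pauses. */
static double trimmed_mean (double *v, size_t n, double trim_fraction)
{
    qsort (v, n, sizeof v[0], cmp_double);
    size_t cut = (size_t) ((double) n * trim_fraction / 2.0);
    return mean (v + cut, n - 2 * cut);
}

/* Percentile by nearest rank on the sorted samples. */
static double percentile (double *v, size_t n, double p)
{
    qsort (v, n, sizeof v[0], cmp_double);
    size_t rank = (size_t) (p / 100.0 * (double) (n - 1));
    return v[rank];
}
```

Trimmed means are useful in GC benchmarks precisely because a few outlier pauses (e.g. a collection that coincides with OS scheduling noise) would otherwise dominate the plain average, which is also why the table reports both.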

@ghost

ghost commented Dec 8, 2020

Tagging subscribers to this area: @BrzVlad
See info in area-owners.md if you want to be subscribed.

Issue Details

(Same description as the PR body above.)

Author: monojenkins
Assignees: -
Labels:

area-GC-mono, mono-mirror

Milestone: -

@monojenkins monojenkins force-pushed the sync-pr-20634-from-mono branch 2 times, most recently from 92b58cd to ee40f4c on January 14, 2021 08:28
@lateralusX lateralusX merged commit 5c5bb6a into dotnet:master Jan 18, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Feb 17, 2021