Optimize thread pool enqueue to handle job batches more efficiently. #45749
Conversation
Tagging subscribers to this area: @BrzVlad
Force-pushed from 92b58cd to ee40f4c
The following is a summary of minor GC pause times comparing default, default with minor=simple-par, and minor=simple-par plus this PR:

| Metric | Default | Simple-par | Simple-par + enqueue optimization | Improvement vs default | Improvement vs simple-par |
| -- | -- | -- | -- | -- | -- |
| Stdev (µs) | 1143 | 1047 | 952 | | |
| Avg (µs) | 2585 | 2206 | 1629 | 36.98% | 26.17% |
| TrimMean 10% (µs) | 2548 | 2146 | 1576 | 38.17% | 26.58% |
| TrimMean 25% (µs) | 2535 | 2106 | 1548 | 38.94% | 26.50% |
| First quartile – 25th percentile (µs) | 1598 | 1295 | 965 | 39.63% | 25.54% |
| Second quartile – 50th percentile (µs) | 2449 | 2062 | 1515 | 38.16% | 26.55% |
| Third quartile – 75th percentile (µs) | 3588 | 2892 | 2116 | 41.04% | 26.84% |
Force-pushed from ee40f4c to 278f96e
!! This PR is a copy of mono/mono#20634, please do not edit or review it in this repo !!
Do not automatically approve this PR:
* Consider how the changes affect configurations in this repo,
* Check effects on files that are not mirrored,
* Identify test cases that may be needed in this repo.
!! Merge the PR only after the original PR is merged !!
When running a parallel minor GC, the scan of the major and LOS heaps is split into several jobs based on the number of available cores * 4 * 2. On an 8 core machine that will generate 32 major heap and 32 LOS scanning jobs. The current implementation queued each individual job into the thread pool, but since the threads were not allowed to work on the items at that point, doing it this way creates a lot of signalled threads plus contention over the shared mutex for each queued job.
The commit optimizes this pattern by adding the ability to add a batch of allocated jobs to the thread pool at the same time, signalling all threads only once all jobs have been added to the queue. This reduces the number of mutex acquire/release cycles and brings the number of thread pool wake-ups down from 65 (including scan wbroots) to 1.
This is an upstream of a change that has been running in a downstream repo for a little over a year, giving a good performance boost on those platforms. As part of upstreaming I also ran some performance benchmarks of this PR on a desktop 8 core Windows PC. The test primarily stressed the job dispatch done as part of minor GC, using a major heap of ~2.5 GB and a LOS heap of ~1 GB, collecting around 1500 minor GC measurement points per configuration (over a stable GC state). The following is a summary of minor GC pause times comparing default, default with minor=simple-par, and minor=simple-par plus this PR:
To summarize, the results above indicate that minor GC pause times improved by ~38% when running with the enqueue optimizations implemented in this PR, on a desktop 8 core machine with a 2.5 GB major heap and a 1 GB LOS heap. Results can of course vary depending on hardware, workload and platform, but the benchmark above at least gives a good indication that batching jobs, and thereby reducing the number of unnecessary kernel calls and pressure on the scheduler, has a measurable positive impact on minor GC pause times.