Implement mutex-free spin lock for task queue #14834
Conversation
I have not looked into this in depth. However, spin locks that never go to sleep do not do well in all scenarios; case in point, our spinning in the thread pools. If you just want to increase the spin count, there are official ways of doing so with CriticalSection on Windows and pthread_mutexes. The Windows read/write lock delivers great performance and may have other options. Echoing Pranav's point about …
Thanks for the profiling and exploration here! I believe the spinlock implementation is correct. I share Dmitri and Pranav's concerns on generality, though. The risk is that a spinlock can look good on standalone microbenchmarks where the number of threads >= the number of CPUs, and then performance can collapse in a more complex setting where the CPU is over-subscribed: if a lock holder is preempted, the other threads spin while holding on to CPUs. For that reason, mutex libraries tend to implement a spin-then-block approach, so that the fast path of lock acquire/release stays in user mode and kernel work is only incurred after some spinning delay. On Linux we use nsync, which IIRC provides that kind of behavior. For Windows, it sounds from Dmitri's comment like we may be able to configure the existing lock to perform better?

For the original 8-NUMA-node workload, do we know how other configurations behave? For instance, running 8 one-NUMA-node ORT instances, each pinned to a separate socket? Also, if we are running concurrent inference requests over a single ORT instance for the whole 8-NUMA machine, is the workload configured with a limit on the maximum degree of parallelism for each request? Very few operators scale cross-NUMA up to 128 threads... so I am worried we may be adding overhead on work distribution and the locks here without much benefit.
To me this is a naive implementation of TAS or TTAS: https://disco.ethz.ch/courses/hs15/distsys/lecture/chapter11.pdf . In general, they are no better than modern locking queues, except that they are easy to understand and implement. I think they are not good for general use cases; they only make sense in some special scenarios.
Implemented a "lock-free" spinlock to save CPU usage on context switching.

The change has been tested on the queene service of the Ads team: the lock-free version of ORT (40 threads) cuts CPU usage on gen8 (128 logical processors on 8 NUMA nodes) Windows machines by nearly half, from 65% to 35%.

For 32 cores, the curve is flat.

Anubis, 32 vCPU, Windows, Hugging Face models, 95th percentile E2E latency in ms:

model | mutex (ms) | mutex-free (ms)
--- | --- | ---
alvert_base_v2 | 34.21 | 34.09
bert_large_uncased | 116.27 | 117.84
bart_base | 72.06 | 71.99
distilgpt2 | 25.43 | 25.02
vit_base_patch16_224 | 37.33 | 37.76

Anubis, 32 vCPU, Linux, 1st-party models, 95th percentile E2E latency in ms:

model | mutex (ms) | mutex-free (ms)
--- | --- | ---
deepthink_v2 | 24.35 | 22.95
bing_feeds | 36.96 | 36.48
deep_writes | 14.46 | 14.32
keypoints | 9.34 | 7.69
model11 | 1.71 | 1.66
model12 | 1.82 | 1.44
model2 | 4.21 | 3.95
model6 | 1.08 | 1.05
agiencoder | 0.99 | 0.93
geminet_transformer | 5.32 | 5.24

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
Cherry-pick 4 commits to the rel-1.15.0 branch: microsoft#14834 microsoft#15727 microsoft#16010 microsoft#16011