Implement mutex-free spin lock for task queue #14834
Conversation
I have not looked into this in depth. However, spin locks that never go to sleep do not do well in all scenarios; case in point, our spinning in the thread pools. If you just want to increase the spin count, there are official ways of doing so with CriticalSection on Windows and pthread_mutexes. The Windows read/write lock delivers great performance and may have other options. Echoing Pranav's point about …
Thanks for the profiling and exploration here! I believe the spinlock implementation is correct. I share Dmitri and Pranav's concerns on generality, though. The risk is that a spinlock can look good on standalone microbenchmarks where the number of threads >= the number of CPUs, and then performance can collapse in a more complex setting where the CPU is over-subscribed: if a lock holder is preempted, the other threads spin while holding on to CPUs. For that reason, mutex libraries tend to implement a spin-then-block approach, so that the fast path of lock acquire/release stays in user mode and kernel work is only incurred after some spinning delay. On Linux we use nsync, which IIRC provides that kind of behavior. For Windows, it sounds from Dmitri's comment like we may be able to configure the existing lock to perform better?

For the original 8-NUMA-node workload, do we know how other configurations behave? For instance, running 8 one-NUMA-node ORT instances, each pinned to a separate socket? Also, if we are running concurrent inference requests over a single ORT instance for the whole 8-NUMA machine, is the workload configured with a limit on the maximum degree of parallelism for each request? Very few operators scale cross-NUMA up to 128 threads... so I am worried we may be adding overhead on work distribution and the locks here without much benefit.
To me this is a naive implementation of TAS or TTAS: https://disco.ethz.ch/courses/hs15/distsys/lecture/chapter11.pdf . In general, they are no better than modern locking queues, except that they are easy to understand and implement. I think they are not good for general use cases; they only make sense in some special scenarios.
Implemented a "lock-free" spinlock to save CPU usage on context switching.

The change has been tested on the queene service of the Ads team: the lock-free version of ORT (40 threads) cuts CPU usage on gen8 (128 logical processors on 8 NUMA nodes) Windows machines by nearly half, from 65% to 35%.

For 32 cores, the curve is flat.

Anubis, 32 vCPU, Windows, Hugging Face models, 95th percentile E2E latency in ms:

model | mutex (ms) | mutex-free (ms)
--- | --- | ---
alvert_base_v2 | 34.21 | 34.09
bert_large_uncased | 116.27 | 117.84
bart_base | 72.06 | 71.99
distilgpt2 | 25.43 | 25.02
vit_base_patch16_224 | 37.33 | 37.76

Anubis, 32 vCPU, Linux, 1st-party models, 95th percentile E2E latency in ms:

model | mutex (ms) | mutex-free (ms)
--- | --- | ---
deepthink_v2 | 24.35 | 22.95
bing_feeds | 36.96 | 36.48
deep_writes | 14.46 | 14.32
keypoints | 9.34 | 7.69
model11 | 1.71 | 1.66
model12 | 1.82 | 1.44
model2 | 4.21 | 3.95
model6 | 1.08 | 1.05
agiencoder | 0.99 | 0.93
geminet_transformer | 5.32 | 5.24

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
Cherry-pick 4 commits to the rel-1.15.0 branch: microsoft#14834 microsoft#15727 microsoft#16010 microsoft#16011