Memory leak and thread pools not closing #4093
By the way, performance-wise, ORT is great. But in a multi-threaded, heavily orchestrated application using lots of models (2 to 5), I'm literally getting 50+ threads created. I have a 4-core, 8-thread PC...
Update:
and
It has the opposite effect. It creates many more threads.
It also looks like that if you use
Has it got something to do with protobuf having its own memory arena?
Actually, I've just tried libdarknet. I get the same thing. Does cuDNN do its own thread pooling and memory-arena business?
Or can cudaFree leak?
https://stackoverflow.com/questions/61346562/cuda-unified-memory-leak indicates that memory leaks are possible if there is a "corrupt" CUDA installation.
For the thread count concern, would you please try this? https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/test/global_thread_pools |
About thread pools closing:
Note that there's currently no way to disable the arena the CUDA ExecutionProvider uses, so there will be some memory held by the session (as it owns the execution provider instance) for that (most on the CUDA device, but a small amount on CPU for transfers). Unfortunately I don't know if there's currently a way to use the same ExecutionProvider instance across multiple sessions, so each is going to end up with a separate arena.
If you don't provide OrtThreadingOptions on the first call to create the environment, it will not use a global threadpool. However, I would have expected an error from setting DisablePerSessionThreads if that was the case, so I'm not sure if you had envOpts commented out when constructing Ort::Env. If it wasn't commented out and you used the same Env instance for all the sessions, the number of threads created by ORT shouldn't increase with each new session.
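The setup discussed here can be sketched against the C++ API. This is a minimal configuration fragment, assuming an ORT version where CreateThreadingOptions and DisablePerSessionThreads are available (check the onnxruntime_c_api.h / onnxruntime_cxx_api.h shipped with your build; names have shifted between releases):

```cpp
#include <onnxruntime_cxx_api.h>

// One Env carrying the global thread pools, shared by every session.
Ort::Env MakeSharedEnv() {
  OrtThreadingOptions* tp = nullptr;
  Ort::ThrowOnError(Ort::GetApi().CreateThreadingOptions(&tp));
  Ort::Env env(tp, ORT_LOGGING_LEVEL_WARNING, "shared-env");
  Ort::GetApi().ReleaseThreadingOptions(tp);  // safe: Env no longer needs it
  return env;
}

// Each session must opt in, or it will still create per-session pools.
Ort::Session MakeSession(Ort::Env& env, const char* model_path) {
  Ort::SessionOptions so;
  so.DisablePerSessionThreads();
  return Ort::Session(env, model_path, so);
}
```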
The official CPU builds are indeed built with OpenMP enabled.
So i'm using:
then
I'm running 5 models concurrently and I'm getting 25 threads, 1 of which is the main thread and 5 of which are running the 5 models concurrently. I have 8 physical threads on my PC, so I would expect all 5 models to share a thread pool of 8 threads. No? If I use:
and
I get 11 threads when running 5 models concurrently. So the behavior isn't as expected.
I've tried building onnxruntime from source using both CUDA 10.1 and CUDA 10.2. I get the same behavior, so I've dumped 10.1 because I want to avoid having conflicting CUDA installations.
There's also the issue with leftover threads holding loads of memory. I get similar behavior with libdarknet, so I'm wondering if the CUDA library or cuDNN have internal memory pools that allocate workspaces or something like that, and there's nothing you can do about it. However, it would seem libdarknet is better at disposing of memory when "sessions" are deallocated. I don't know enough about the internals of CUDA and cuDNN to say for sure. But this is all making me very nervous.
Oh, and I'm using
@snnn Can you see anything wrong with this code?
If there is nothing wrong with the code above, it must be either:
Some clarifications: when you create the shared global threadpool by passing OrtThreadingOptions to the OrtEnv constructor, it will default to creating pool sizes based on the number of cores, e.g. if you have 4 cores you would get 4 threads in the intra-op threadpool and 4 in the inter-op threadpool. Any session created after that will use that threadpool, so setting the intra/inter-op thread counts in session options will do nothing.
onnxruntime/onnxruntime/core/session/inference_session.cc, Lines 173 to 210 in 905c535
When you are not using the shared threadpool, each session will have its own intra/inter-op threadpool, and the size will be based on SessionOptions.Set{Intra|Inter}OpNumThreads.

I re-created most of your setup (Windows, not Linux, though, so possibly some differences) and see this after creating 4 sessions using the global threadpool: one main thread, 4 threads for the global intra- and inter-op threadpools, and 3 CUDA threads that were created when the first CUDA EP was created, but no extra threads after that. Ignore the 3 threads from ntdll as they're not relevant to your question. The number of CUDA threads may differ on Linux, as I don't know how they choose how many threads to create. No growth in threads occurs when calling your infer(...) method.

Finally, when I clean up the sessions and free the environment, the ORT threadpool threads go away. There are still the 3 CUDA threads remaining, though, as we don't control those. I don't know how much memory they're holding onto.
Thank you for your response. I think I'm satisfied with the thread pooling. Part of the confusion is that I'm building onnxruntime from source, and regardless of whether or not I use the "use_openmp" option, it builds with OpenMP. After passing my program through GDB and tracing thread backtraces, I realised that onnxruntime was using OpenMP, so indeed the SetInterOpNumThreads and SetIntraOpNumThreads functions weren't doing anything. Is there a way of disabling OpenMP when building from source? Maybe CMake auto-detects OpenMP and uses it regardless (on Linux; I appreciate you guys develop on Windows)?

The CUDA threads not being removed and holding onto memory worries me a bit, though less so since I realised that other libraries like darknet have similar behavior. Maybe CUDA does some internal caching and thread pooling that is hidden from the user; maybe that is precisely what functions like cudaMallocManaged do.

I don't know why I had a growth in threads when using a global thread pool. Maybe it's because I hadn't realised my build of onnxruntime was using OpenMP, and the combination of the two doesn't play nicely. I'll try rebuilding it without OpenMP and see if the behaviour is as expected.
So I rebuilt using the following command:
The cmake log outputs this:
So I'm wondering if the build uses it anyway, even though I haven't specified the "use_openmp" option.
Ah, just realised the build script has this line:
Maybe '--use_openmp' should be removed from build.sh?
OK, that worked. So now it uses onnxruntime's own thread pool, and when the per-session thread pool is enabled, it gets freed when the session gets deallocated. Great.
Looks like that has already been cleaned up. Apologies for the confusion that caused. I'm not sure what could be done differently so that it reports at runtime whether it's using OpenMP. FWIW you can run ldd on a binary to see if it depends on OpenMP libraries.
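The ldd check mentioned here could look like this (the library path is an example — point it at your own build output):

```shell
# Does this onnxruntime build link against an OpenMP runtime?
# Typical runtime names: libgomp (GCC), libomp (Clang), libiomp5 (ICC).
lib="./build/Linux/RelWithDebInfo/libonnxruntime.so"
if ldd "$lib" 2>/dev/null | grep -Eiq 'libgomp|libomp|libiomp5'; then
  echo "built with OpenMP"
else
  echo "no OpenMP dependency found"
fi
```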
Yeah, I thought about running ldd on libonnxruntime.so at runtime, but that seems like a temporary hack. Basically I need to know whether to set:
or
Which one depends on whether onnxruntime is built with OpenMP.
I'm temporarily using some extern bools:
to configure the onnxruntime inference code. I then need to define them somewhere in my app, otherwise it won't compile, so it forces me to think about it and set it appropriately.
I guess somewhere in a .cpp file in onnxruntime you could define an extern bool, similarly to what I have done:
extern const bool onnxruntime_is_built_with_openmp;
OK, I've rebuilt OpenCV without concurrency support, so there are no extra threads I'm not aware of. I now get the expected behavior whether the global thread pool is enabled or not. I still have 3 CUDA threads created that never get freed, but you get the same behavior, so I'm not too worried about that.
I guess this issue can be closed. The only remaining niggle is that I definitely saw thread growth when onnxruntime is built with OpenMP while trying to use a global thread pool. But I don't really care now because I'm not building it with OpenMP. So it's up to you if you want to close this.
Just so you know, there is an impact on performance when using the global thread pool on CPU. Looking at htop, the cores aren't being used at 100%. Using per-session thread pools, the cores are maxed out as expected.
Can you provide some more details? In particular, number of physical cores vs logical. Number of concurrent requests. Number of sessions. Does using per-session thread pools result in more threads being available in total? |
So I'm not sure of the difference between physical and logical cores. I believe 8 either way. Here is the output of 'lscpu':
For this test, there was only 1 session and therefore no concurrent requests. |
Thank you. |
Describe the bug
I'm encapsulating ORT as much as possible in an object, with the aim that whenever the object is deallocated, all resources are freed, including memory arenas and thread pools. I cannot achieve this using the API. I could live with a global thread pool lingering, but it seems there is a memory arena that never gets freed.
For example, if I create multiple objects sequentially like so:
there is a linear increase in memory that never gets freed. So it would seem there is a memory leak somewhere.
The code I'm using is below.
Has it got something to do with the environment (Ort::Env)?
How can global thread pools be swapped for local thread pools? Or even better, disabled entirely?
How can memory arenas be disabled? Note I'm not using OrtArenaAllocator.
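For the CPU side at least, the arena can be switched off per session. A sketch against the C++ API (per the maintainer comment earlier in the thread, the CUDA EP's arena had no such switch at the time):

```cpp
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions MakeArenaFreeOptions() {
  Ort::SessionOptions so;
  so.DisableCpuMemArena();  // CPU allocations bypass the BFC arena
  return so;
}
// Relatedly, when creating an OrtMemoryInfo, passing OrtDeviceAllocator
// instead of OrtArenaAllocator requests non-arena allocations.
```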
System information