
Memory leak and thread pools not closing #4093

Closed
pfeatherstone opened this issue May 30, 2020 · 33 comments

@pfeatherstone

Describe the bug
I'm encapsulating ORT as much as possible in an object with the aim that whenever the object is deallocated, all resources are freed, including memory arenas and thread pools. I cannot achieve this using the API. I could live with a global thread pool lingering, but it seems there is a memory arena that never gets freed.
For example, if I create multiple objects sequentially like so:

{ onnx_model_impl net1(...);}
{ onnx_model_impl net2(...);}
{ onnx_model_impl net3(...);}

there is a linear increase in memory without it ever getting freed. So it would seem there is a memory leak somewhere.

The code I'm using is below.
Has it got something to do with the environment (Ort::Env)?
How can global thread pools be swapped for local thread pools? Or, even better, disabled entirely?
How can memory arenas be disabled? Note I'm not using OrtArenaAllocator.

Ort::Env create_env()
{
//    OrtThreadingOptions* envOpts = 0;
//    assert(Ort::Global<void>::api_.CreateThreadingOptions(&envOpts) == 0);
    Ort::Env env(/*envOpts,*/ ORT_LOGGING_LEVEL_WARNING, "test");
//    Ort::Global<void>::api_.ReleaseThreadingOptions(envOpts);
    return env;
}

Ort::Env& default_env()
{
    static Ort::Env INSTANCE = create_env();
    return INSTANCE;
}

struct onnx_model_impl
{
    onnx_model_impl(const void* modeldata, size_t modelsize, int cuda_device)
    {
        Ort::Env& env = default_env();
        session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
//        session_options.SetInterOpNumThreads(1); //already using openmp. Using more threads slows the whole thing down
//        session_options.SetIntraOpNumThreads(1); //already using openmp. Using more threads slows the whole thing down
        if (cuda_device >= 0)
        {
            session_options.SetInterOpNumThreads(1);
            session_options.SetIntraOpNumThreads(1);
            session_options.DisableCpuMemArena();
//            session_options.DisablePerSessionThreads();
            assert(OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, cuda_device) == 0);
        }
        
        session.reset(new Ort::Session(env, modeldata, modelsize, session_options));

        size_t num_input_nodes = session->GetInputCount();
        
        for (size_t i = 0; i < num_input_nodes; i++) 
        {
            // print input node names
            char* input_name = session->GetInputName(i, allocator);
            input_names.push_back(input_name);

            // print input node types
            Ort::TypeInfo type_info = session->GetInputTypeInfo(i);
            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            input_shapes.push_back(shape);
            string shape_str_all = "";
            for (auto s : shape)
                shape_str_all += std::to_string(s) + ",";
            printf("input  %lu name %s obj type %i val type %i shape %s\n", i,
                                                                               input_name,
                                                                               (int)object_type,
                                                                               (int)value_type,
                                                                               shape_str_all.c_str());
        }
    
        size_t num_output_nodes = session->GetOutputCount();

        for (size_t i = 0; i < num_output_nodes; i++) 
        {
            // print output node names
            char* output_name = session->GetOutputName(i, allocator);
            output_names.push_back(output_name);

            // print output node types
            Ort::TypeInfo type_info = session->GetOutputTypeInfo(i);
            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            output_shapes.push_back(shape);
            string shape_str_all = "";
            for (auto s : shape)
                shape_str_all += std::to_string(s) + ",";
            printf("output %lu name %s obj type %i val type %i shape %s\n", i,
                                                                               output_name,
                                                                               (int)object_type,
                                                                               (int)value_type,
                                                                               shape_str_all.c_str());
        }
    }
       
    void infer(vector<vector<int64_t>>  inputshapes, 
               vector<const float*>     inputs,
               vector<vector<int64_t>>& outputshapes,
               vector<vector<float>>&   outputs)
    {
        assert(inputshapes.size() == inputs.size());
        assert(inputs.size() == input_names.size());
        
        vector<Ort::Value> input_tensors;
        for (size_t i = 0 ; i < inputs.size() ; i++)
        {
            //check shape
            auto& shape = inputshapes[i];
            auto& exp   = input_shapes[i];
            for (size_t d = 0 ; d < shape.size() ; d++)
                if (exp[d] > 0)
                    assert(exp[d] == shape[d]);
            //total size
            int64_t size = std::accumulate(shape.begin(), shape.end(), int64_t{1}, [](int64_t p, int64_t s){return p*s;});
            //load tensor
            auto memory_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);
            input_tensors.push_back(Ort::Value::CreateTensor<float>(memory_info, 
                                                                    const_cast<float*>(inputs[i]), 
                                                                    size, 
                                                                    shape.data(), 
                                                                    shape.size()));
            assert(input_tensors.back().IsTensor());
        }

        auto output_tensors = session->Run(Ort::RunOptions{nullptr}, input_names.data(), &input_tensors[0], input_tensors.size(), output_names.data(), output_names.size());

        for (auto& output_tensor : output_tensors)
        {
            auto type_info = output_tensor.GetTypeInfo();
//            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
//            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            outputshapes.push_back(shape);
            int64_t size = std::accumulate(shape.begin(), shape.end(), int64_t{1}, [](int64_t p, int64_t s){return p*s;});
            float* ptr = output_tensor.GetTensorMutableData<float>();
            outputs.emplace_back(ptr, ptr + size);
        }
    }
    
    Ort::SessionOptions                 session_options;
    Ort::AllocatorWithDefaultOptions    allocator;
    std::unique_ptr<Ort::Session>       session;
    std::vector<const char*>            input_names;
    std::vector<std::vector<int64_t>>   input_shapes;
    std::vector<const char*>            output_names;
    std::vector<std::vector<int64_t>>   output_shapes;
};

System information

@pfeatherstone
Author

By the way, performance-wise, ORT is great. But in a multi-threaded, heavily orchestrated application using lots of models (2 to 5), I'm literally getting 50+ threads created. I have a 4-core/8-thread PC...

@pfeatherstone
Author

Update:
If I use:

Ort::Env create_env()
{
    OrtThreadingOptions* envOpts = 0;
    assert(Ort::Global<void>::api_.CreateThreadingOptions(&envOpts) == 0);
    Ort::Env env(/*envOpts,*/ ORT_LOGGING_LEVEL_WARNING, "test");
    Ort::Global<void>::api_.ReleaseThreadingOptions(envOpts);
    return env;
}

and

session_options.SetInterOpNumThreads(1);
session_options.SetIntraOpNumThreads(1);
session_options.DisableCpuMemArena();
session_options.DisablePerSessionThreads();

It has the opposite effect. It creates many more threads.
The biggest issue is the lack of memory freeing. I've done my best to not use a memory arena, though evidence in htop would suggest that it is using one and it is never getting freed. Now I would expect a memory arena to hold onto memory while a session is active but not when it has been deallocated.
@snnn what do you make of all this? Do you experience similar issues with https://github.com/microsoft/onnxruntime/releases/download/v1.3.0/onnxruntime-linux-x64-gpu-1.3.0.tgz ?
BTW, ignore my comments about OpenMP in the code above. I don't think the official builds are built with openmp.

@pfeatherstone
Author

It also looks like, if you use OrtSessionOptionsAppendExecutionProvider_CUDA to use a CUDA device, more threads are created. Furthermore, when the session is destroyed, there is still a thread pool alive with a lot of memory held. Again, this is from looking at htop and nvidia-smi.

@pfeatherstone
Author

Has it got something to do with protobuf having its own memory arena?

@pfeatherstone
Author

Actually, I've just tried libdarknet. I get the same thing. Does cuDNN do its own thread pooling and memory arena business?

@pfeatherstone
Author

Or can cudaFree leak?

@pfeatherstone
Author

https://stackoverflow.com/questions/61346562/cuda-unified-memory-leak indicates that memory leaks are possible if there is a "corrupt" CUDA installation

@snnn
Member

snnn commented Jun 1, 2020

@snnn
Member

snnn commented Jun 2, 2020

About thread pools closing:

  1. If OpenMP is enabled then, yes, as you can see, it won't get closed, because we didn't create those threads; the OpenMP runtime did.

  2. If OpenMP is disabled: a thread pool created when creating the OrtEnv will be destroyed when you destroy the OrtEnv, and a thread pool created when creating the session will be destroyed when you destroy the session.
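
To make the second point concrete, here is a minimal sketch (not taken from the issue, and using a placeholder model path) of the two non-OpenMP cases, built only from calls that appear elsewhere in this thread:

#include <onnxruntime_cxx_api.h>
#include <cassert>

int main()
{
    // Case 2a: per-session thread pools. They are created with the session and
    // destroyed when `session` goes out of scope; the Env owns no pools here.
    {
        Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "per-session");
        Ort::SessionOptions so;
        so.SetIntraOpNumThreads(1);
        so.SetInterOpNumThreads(1);
        Ort::Session session(env, "model.onnx", so);   // placeholder model path
    }   // session (and its pools) destroyed here, then env

    // Case 2b: global thread pools. They are created with the Env (via
    // OrtThreadingOptions) and destroyed when `env` goes out of scope;
    // sessions opt in with DisablePerSessionThreads().
    {
        OrtThreadingOptions* tp_options = nullptr;
        OrtStatus* st = Ort::GetApi().CreateThreadingOptions(&tp_options);
        assert(st == nullptr); (void)st;

        Ort::Env env(tp_options, ORT_LOGGING_LEVEL_WARNING, "global");
        Ort::GetApi().ReleaseThreadingOptions(tp_options);

        Ort::SessionOptions so;
        so.DisablePerSessionThreads();                  // use the Env's global pools
        Ort::Session session(env, "model.onnx", so);    // placeholder model path
    }   // session destroyed first, then env (and the global pools)

    return 0;
}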

@skottmckay
Contributor

Note that there's currently no way to disable the arena the CUDA ExecutionProvider uses, so there will be some memory held by the session (as it owns the execution provider instance) for that (most on the CUDA device, but a small amount on CPU for transfers).

Unfortunately I don't know if there's currently a way to use the same ExecutionProvider instance across multiple sessions, so each is going to end up with a separate arena.

Ort::Env create_env()
{
    OrtThreadingOptions* envOpts = 0;
    assert(Ort::Global<void>::api_.CreateThreadingOptions(&envOpts) == 0);
    Ort::Env env(/*envOpts,*/ ORT_LOGGING_LEVEL_WARNING, "test");
    Ort::Global<void>::api_.ReleaseThreadingOptions(envOpts);
    return env;
}

If you don't provide OrtThreadingOptions on the first call to create the environment, it will not use a global threadpool. However, I would have expected an error from setting DisablePerSessionThreads if that was the case, so I'm not sure if you had envOpts commented out when constructing Ort::Env. If it wasn't commented out and you used the same Env instance for all the sessions, the number of threads created by ORT shouldn't increase with each new session.
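
For what it's worth, with the C++ wrapper that mismatch would normally surface as an exception when the session is constructed rather than at the DisablePerSessionThreads() call, because the check (the ORT_ENFORCE on EnvCreatedWithGlobalThreadPools quoted further down in this thread) runs during session creation. A minimal sketch, assuming the wrapper's usual behaviour of throwing Ort::Exception on an error status and using a placeholder model path:

#include <onnxruntime_cxx_api.h>
#include <cstdio>

int main()
{
    try
    {
        // Env created WITHOUT OrtThreadingOptions, so there are no global pools...
        Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");

        Ort::SessionOptions so;
        so.DisablePerSessionThreads();                 // ...but the session asks to use them

        // The inconsistency is detected here, when the session is constructed.
        Ort::Session session(env, "model.onnx", so);   // placeholder model path
    }
    catch (const Ort::Exception& e)
    {
        std::printf("session creation failed: %s\n", e.what());
    }
    return 0;
}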

@pranavsharma
Contributor

pranavsharma commented Jun 2, 2020

BTW, ignore my comments about OpenMP in the code above. I don't think the official builds are built with openmp.

The official CPU builds are indeed built with openmp enabled.
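If that's the case, intra-op parallelism is governed by the OpenMP runtime rather than by SetIntraOpNumThreads, so the usual way to cap the thread count is the OpenMP API or the OMP_NUM_THREADS environment variable. A rough sketch (the helper name is ours, not part of onnxruntime, and calling omp_set_num_threads from the application only helps if the application and the OpenMP-enabled onnxruntime build end up sharing the same OpenMP runtime):

#ifdef _OPENMP
#include <omp.h>
#endif

void cap_openmp_threads(int n)
{
#ifdef _OPENMP
    omp_set_num_threads(n);   // limits threads used in subsequent OpenMP parallel regions
#else
    (void)n;                  // the application itself was not built with OpenMP; nothing to do here
#endif
}

// Alternatively, export OMP_NUM_THREADS=1 (or another cap) before launching the process.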

@pfeatherstone
Author

So I'm using:

    OrtThreadingOptions* envOpts = 0;
    assert(Ort::GetApi().CreateThreadingOptions(&envOpts) == nullptr);
    Ort::Env env(envOpts, ORT_LOGGING_LEVEL_WARNING, "test");
    Ort::GetApi().ReleaseThreadingOptions(envOpts);
    return env;

then

        session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
        session_options.SetInterOpNumThreads(1);
        session_options.SetIntraOpNumThreads(1);
        session_options.DisablePerSessionThreads();
        if (cuda_device >= 0)
            assert(OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, cuda_device) == 0);

I'm running 5 models concurrently and I'm getting 25 threads, one of which is the main thread and 5 of which are running the 5 models concurrently. I have 8 physical threads on my PC, so I would expect all 5 models to share a thread pool of 8 threads. No?

If I use:

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");

and

session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
session_options.SetInterOpNumThreads(1);
session_options.SetIntraOpNumThreads(1);
if (cuda_device >= 0)
    assert(OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, cuda_device) == 0);

I get 11 threads when running 5 models concurrently.

So the behavior isn't as expected.
By the way, I've reinstalled CUDA to avoid "corrupt" installations.

@pfeatherstone
Author

I've tried building onnxruntime from source using both CUDA 10.1 and CUDA 10.2, and I get the same behavior. So I've dumped 10.1 because I want to avoid having conflicting CUDA installations.

@pfeatherstone
Author

There's also the issue of leftover threads holding loads of memory. I get similar behavior with libdarknet, so I'm wondering if the CUDA library or cuDNN has internal memory pools that allocate workspaces, or something like that, and there's nothing you can do about it. However, it would seem libdarknet is better at disposing of memory when "sessions" are deallocated. I don't know enough about the internals of CUDA and cuDNN to say for sure, but this is all making me very nervous.

@pfeatherstone
Author

Oh, and I'm using OrtDeviceAllocator, not OrtArenaAllocator.

@pfeatherstone
Author

@snnn Can you see anything wrong with this code?

Ort::Env create_env()
{
//    OrtThreadingOptions* envOpts = 0;
//    assert(Ort::GetApi().CreateThreadingOptions(&envOpts) == nullptr);
    Ort::Env env(/*envOpts,*/ ORT_LOGGING_LEVEL_WARNING, "test");
//    Ort::GetApi().ReleaseThreadingOptions(envOpts);
    return env;
}

Ort::Env env = create_env();

struct onnx_model::onnx_model_impl
{
    onnx_model_impl(const void* modeldata, size_t modelsize, int cuda_device)
    :   session(nullptr)
    {
        session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
        session_options.SetInterOpNumThreads(1);
        session_options.SetIntraOpNumThreads(1);
//        session_options.DisableCpuMemArena();
//        session_options.DisablePerSessionThreads();
        if (cuda_device >= 0)
            assert(OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, cuda_device) == 0);
        
        session = Ort::Session(env, modeldata, modelsize, session_options);

        size_t num_input_nodes = session.GetInputCount();
        
        for (size_t i = 0; i < num_input_nodes; i++) 
        {
            // print input node names
            char* input_name = session.GetInputName(i, allocator);
            input_names.push_back(input_name);

            // print input node types
            Ort::TypeInfo type_info = session.GetInputTypeInfo(i);
            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            input_shapes.push_back(shape);
            string shape_str_all = "";
            for (auto s : shape)
                shape_str_all += std::to_string(s) + ",";
            printf("input  %lu name %s obj type %i val type %i shape %s\n", i,
                                                                               input_name,
                                                                               (int)object_type,
                                                                               (int)value_type,
                                                                               shape_str_all.c_str());
        }
    
        size_t num_output_nodes = session.GetOutputCount();

        for (size_t i = 0; i < num_output_nodes; i++) 
        {
            // print output node names
            char* output_name = session.GetOutputName(i, allocator);
            output_names.push_back(output_name);

            // print output node types
            Ort::TypeInfo type_info = session.GetOutputTypeInfo(i);
            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            output_shapes.push_back(shape);
            string shape_str_all = "";
            for (auto s : shape)
                shape_str_all += std::to_string(s) + ",";
            printf("output %lu name %s obj type %i val type %i shape %s\n", i,
                                                                               output_name,
                                                                               (int)object_type,
                                                                               (int)value_type,
                                                                               shape_str_all.c_str());
        }
    }
       
    void infer(vector<vector<int64_t>>  inputshapes, 
               vector<const float*>     inputs,
               vector<vector<int64_t>>& outputshapes,
               vector<vector<float>>&   outputs)
    {
        assert(inputshapes.size() == inputs.size());
        assert(inputs.size() == input_names.size());
        
        vector<Ort::Value> input_tensors;
        for (size_t i = 0 ; i < inputs.size() ; i++)
        {
            //check shape
            auto& shape = inputshapes[i];
            auto& exp   = input_shapes[i];
            for (size_t d = 0 ; d < shape.size() ; d++)
                if (exp[d] > 0)
                    assert(exp[d] == shape[d]);
            //total size
            int64_t size = std::accumulate(shape.begin(), shape.end(), int64_t{1}, [](int64_t p, int64_t s){return p*s;});
            //load tensor
            auto memory_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);
            input_tensors.push_back(Ort::Value::CreateTensor<float>(memory_info, 
                                                                    const_cast<float*>(inputs[i]), 
                                                                    size, 
                                                                    shape.data(), 
                                                                    shape.size()));
            assert(input_tensors.back().IsTensor());
        }

        auto output_tensors = session.Run(Ort::RunOptions{nullptr}, input_names.data(), &input_tensors[0], input_tensors.size(), output_names.data(), output_names.size());

        for (auto& output_tensor : output_tensors)
        {
            auto type_info = output_tensor.GetTypeInfo();
//            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
//            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            outputshapes.push_back(shape);
            int64_t size = std::accumulate(shape.begin(), shape.end(), int64_t{1}, [](int64_t p, int64_t s){return p*s;});
            float* ptr = output_tensor.GetTensorMutableData<float>();
            outputs.emplace_back(ptr, ptr + size);
        }
    }

    Ort::SessionOptions                 session_options;
    Ort::AllocatorWithDefaultOptions    allocator;
    Ort::Session                        session;
    std::vector<const char*>            input_names;
    std::vector<std::vector<int64_t>>   input_shapes;
    std::vector<const char*>            output_names;
    std::vector<std::vector<int64_t>>   output_shapes;
};

@pfeatherstone
Author

If there is nothing wrong with the code above, it must be either:

  • a bad CUDA/cuDNN installation (which I doubt; I've reinstalled everything from scratch)
  • a misunderstanding of what CUDA and cuDNN do behind the scenes
  • a bug in onnxruntime

@skottmckay
Contributor

Some clarifications:

When you create the shared global threadpool by passing OrtThreadingOptions to the OrtEnv constructor it will default to creating pool sizes based on the number of cores. e.g. if you have 4 cores you would get 4 threads in the intra-op threadpools, and 4 in the inter-op threadpool. Any session created after that will use that threadpool, so setting the intra/inter op thread counts in session options will do nothing.

if (use_per_session_threads_) {
  LOGS(*session_logger_, INFO) << "Creating and using per session threadpools since use_per_session_threads_ is true";
  {
    OrtThreadPoolParams to = session_options_.intra_op_param;
    if (to.name == nullptr) {
      to.name = ORT_TSTR("intra-op");
    }
    // If the thread pool can use all the processors, then
    // we set affinity of each thread to each processor.
    to.auto_set_affinity = to.thread_pool_size == 0 &&
                           session_options_.execution_mode == ExecutionMode::ORT_SEQUENTIAL &&
                           to.affinity_vec_len == 0;
    thread_pool_ =
        concurrency::CreateThreadPool(&Env::Default(), to, concurrency::ThreadPoolType::INTRA_OP);
  }
  if (session_options_.execution_mode == ExecutionMode::ORT_PARALLEL) {
    OrtThreadPoolParams to = session_options_.inter_op_param;
    // If the thread pool can use all the processors, then
    // we set thread affinity.
    to.auto_set_affinity =
        to.thread_pool_size == 0 && session_options_.execution_mode == ExecutionMode::ORT_SEQUENTIAL;
    if (to.name == nullptr)
      to.name = ORT_TSTR("intra-op");
    inter_op_thread_pool_ =
        concurrency::CreateThreadPool(&Env::Default(), to, concurrency::ThreadPoolType::INTER_OP);
    if (inter_op_thread_pool_ == nullptr) {
      LOGS(*session_logger_, INFO) << "Failed to create the inter-op thread pool for the parallel executor, setting ExecutionMode to SEQUENTIAL";
      session_options_.execution_mode = ExecutionMode::ORT_SEQUENTIAL;
    }
  }
} else {
  LOGS(*session_logger_, INFO) << "Using global/env threadpools since use_per_session_threads_ is false";
  intra_op_thread_pool_from_env_ = session_env.GetIntraOpThreadPool();
  inter_op_thread_pool_from_env_ = session_env.GetInterOpThreadPool();
  ORT_ENFORCE(session_env.EnvCreatedWithGlobalThreadPools(),
              "When the session is not configured to use per session"
              " threadpools, the env must be created with the the CreateEnvWithGlobalThreadPools API.");
}

When you are not using the shared threadpool, each session will have its own intra/inter op threadpool and the size will be based on SessionOptions.Set{Intra|Inter}OpNumThreads
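
For completeness, sizing works differently in the two modes: with per-session pools the Set{Intra|Inter}OpNumThreads values in SessionOptions are honoured, while the shared pools default to the core count as described above. Newer releases than the one discussed here add SetGlobalIntraOpNumThreads / SetGlobalInterOpNumThreads to OrtApi, applied to the OrtThreadingOptions before the Env is created; whether your build has them needs checking. A sketch under that assumption:

#include <onnxruntime_cxx_api.h>
#include <cassert>

// Sketch only: explicitly sizing the shared (global) thread pools, assuming an
// OrtApi that exposes SetGlobalIntraOpNumThreads/SetGlobalInterOpNumThreads.
Ort::Env create_env_with_sized_global_pools(int intra_threads, int inter_threads)
{
    OrtThreadingOptions* tp_options = nullptr;
    OrtStatus* st = Ort::GetApi().CreateThreadingOptions(&tp_options);
    assert(st == nullptr); (void)st;

    // Return statuses ignored for brevity in this sketch.
    Ort::GetApi().SetGlobalIntraOpNumThreads(tp_options, intra_threads);
    Ort::GetApi().SetGlobalInterOpNumThreads(tp_options, inter_threads);

    Ort::Env env(tp_options, ORT_LOGGING_LEVEL_WARNING, "sized-global-pools");
    Ort::GetApi().ReleaseThreadingOptions(tp_options);
    return env;   // sessions then opt in with session_options.DisablePerSessionThreads()
}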

I re-created most of your setup (Windows not Linux though, so possibly some differences) and see this after creating 4 sessions using the global threadpool:

[screenshot of the debugger's thread list omitted]

One main thread, 4 threads for the global intra-op and inter-op threadpools, and 3 CUDA threads that were created when the first CUDA EP was created, but no extra threads after that. Ignore the 3 threads from ntdll as they're not relevant to your question.

The number of CUDA threads may differ on Linux as I don't know how they choose how many threads to create.

No growth in threads occurs when calling your infer(...) method.

Finally, when I clean up the sessions and free the environment, the ORT threadpool threads go away. There are still the 3 CUDA threads remaining though, as we don't control those. I don't know how much memory they're holding onto.

@pfeatherstone
Author

Thank you for your response. I think I'm satisfied with the thread pooling. Part of the confusion is that I'm building onnxruntime from source, and regardless of whether or not I use the "use_openmp" option, it builds with OpenMP. After running my program through GDB and tracing thread backtraces, I realised that onnxruntime was using OpenMP, so indeed the SetInterOpNumThreads and SetIntraOpNumThreads functions weren't doing anything. Is there a way of disabling OpenMP when building from source? Maybe CMake auto-detects OpenMP and uses it regardless (on Linux; I appreciate you guys develop on Windows)?

The CUDA threads not being removed and holding onto memory worries me a bit, though less so since I realised that other libraries like darknet have similar behavior. Maybe CUDA does some internal caching and thread pooling that is hidden from the user. Maybe that is precisely what functions like cudaMallocManaged do.

I don't know why I had a growth in threads when using a global thread pool. Maybe it's because I hadn't realised my build of onnxruntime was using OpenMP, and the combination of the two doesn't play nicely. I'll try rebuilding it without OpenMP and see if the behaviour is as expected.

@pfeatherstone
Author

So I rebuilt using the following command:

./build.sh --use_cuda --cuda_version=10.2 --cuda_home /usr/local/cuda-10.2/ --cudnn_home /usr/local/cuda-10.2/ --build_shared_lib --config Release --parallel

The CMake log outputs this:

-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5") 

So I'm wondering if it uses it anyway, even though I haven't specified the "use_openmp" option.

@pfeatherstone
Author

Ah, just realised the build script has this line:
python3 $DIR/tools/ci_build/build.py --use_openmp --build_dir $DIR/build/Linux "$@"
So OpenMP was always being used.

@pfeatherstone
Author

Maybe '--use_openmp' should be removed from build.sh?

@pfeatherstone
Author

OK, that worked. So now it uses onnxruntime's own thread pool, and when the per-session thread pool is enabled, it gets 'freed' when the session gets deallocated. Great.
Is there a way to detect at runtime whether onnxruntime is built with OpenMP? I know the Python API allows this, but I'm not sure about the C++ API.

@skottmckay
Contributor

Looks like that has already been cleaned up. Apologies for the confusion that caused.
bd8993c

What would be done differently if it could report whether it's using OpenMP at runtime?

FWIW you can run ldd on a binary to see if it depends on openmp libraries.

@pfeatherstone
Author

Yeah, I thought about running ldd on libonnxruntime.so at runtime, but that seems like a temporary hack. Basically I need to know whether to set:

session_options.SetInterOpNumThreads(1);
session_options.SetIntraOpNumThreads(1);

or

session_options.SetInterOpNumThreads(std::thread::hardware_concurrency());
session_options.SetIntraOpNumThreads(std::thread::hardware_concurrency());

Which one depends on whether onnxruntime is built with OpenMP.

@pfeatherstone
Author

I'm temporarily using some extern bools:

extern const bool onnxruntime_is_built_with_openmp;
extern const bool onnxruntime_use_global_threadpool;

to configure the onnxruntime inference code. Then I just need to define them somewhere in my app, otherwise it won't build, which forces me to think about it and set them appropriately.

@pfeatherstone
Author

I guess somewhere in a .cpp file in onnxruntime you could define an extern bool, similarly to what I have done:

#ifdef _OPENMP
extern const bool onnxruntime_is_built_with_openmp = true;
#else 
extern const bool onnxruntime_is_built_with_openmp = false;
#endif

Voila, now the user can just declare the following:

extern const bool onnxruntime_is_built_with_openmp;

and can use it. Or a better, more C++ way would be to access it via the global API object.
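
To tie that together, here is a sketch of how inference code could pick its thread settings off such a flag (the flag is the one proposed above; onnxruntime itself does not define it):

#include <onnxruntime_cxx_api.h>
#include <thread>

// Defined elsewhere (by the application for now, or by onnxruntime if the
// suggestion above were adopted): true in OpenMP builds, false otherwise.
extern const bool onnxruntime_is_built_with_openmp;

void configure_thread_counts(Ort::SessionOptions& session_options)
{
    const int hw_threads = static_cast<int>(std::thread::hardware_concurrency());
    if (onnxruntime_is_built_with_openmp)
    {
        // OpenMP owns the intra-op parallelism, so keep ORT's own pools minimal.
        session_options.SetIntraOpNumThreads(1);
        session_options.SetInterOpNumThreads(1);
    }
    else
    {
        // ORT's own thread pools do the work, so size them to the hardware.
        session_options.SetIntraOpNumThreads(hw_threads);
        session_options.SetInterOpNumThreads(hw_threads);
    }
}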

@pfeatherstone
Author

OK, I've rebuilt OpenCV without concurrency support, so there are no extra threads I'm not aware of. I now get the expected behavior whether global thread pools are enabled or not. I still have 3 CUDA threads created that never get freed, but you get the same behavior, so I'm not too worried about that.

@pfeatherstone
Author

I guess this issue can be closed. The only remaining niggle is that I definitely saw thread growth when onnxruntime is built with OpenMP and trying to use a global thread pool. But I don't really care now because I'm not building it with OpenMP. So it's up to you if you want to close this.

@pfeatherstone
Author

Just so you know, there is an impact on performance when using the global thread pool on CPU. Looking at htop, the cores aren't being used at 100%. Using per-session thread pools, the cores are maxed out as expected.

@skottmckay
Contributor

Just so you know, there is an impact on performance when using the global thread pool on CPU. Looking at htop, the cores aren't being used at 100%. Using per-session thread pools, the cores are maxed out as expected.

Can you provide some more details? In particular, number of physical cores vs logical. Number of concurrent requests. Number of sessions.

Does using per-session thread pools result in more threads being available in total?

@pfeatherstone
Author

So I'm not sure of the difference between physical and logical cores; I believe it's 8 either way. Here is the output of 'lscpu':

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
Stepping:            13
CPU MHz:             800.098
CPU max MHz:         4700.0000
CPU min MHz:         800.0000
BogoMIPS:            6000.00
Virtualisation:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            12288K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

For this test, there was only 1 session and therefore no concurrent requests.

@snnn
Member

snnn commented Jun 4, 2020

Thank you.
