
Memory leak and thread pools not closing #4093

Closed
pfeatherstone opened this issue May 30, 2020 · 33 comments

@pfeatherstone

Describe the bug
I'm encapsulating ORT as much as possible in an object with the aim that whenever the object is deallocated, all resources are freed, including memory arenas and thread pools. I cannot achieve this using the API. I could live with a global thread pool lingering, but it seems there is a memory arena that never gets freed.
For example, if I create multiple objects sequentially like so:

{ onnx_model_impl net1(...);}
{ onnx_model_impl net2(...);}
{ onnx_model_impl net3(...);}

there is a linear increase in memory without it ever getting freed. So it would seem there is a memory leak somewhere.

The code I'm using is below.
Has it got something to do with the environment (Ort::Env)?
How can global thread pools be swapped for local thread pools? Or, even better, disabled entirely?
How can memory arenas be disabled? Note I'm not using OrtArenaAllocator.

Ort::Env create_env()
{
//    OrtThreadingOptions* envOpts = 0;
//    assert(Ort::Global<void>::api_.CreateThreadingOptions(&envOpts) == 0);
    Ort::Env env(/*envOpts,*/ ORT_LOGGING_LEVEL_WARNING, "test");
//    Ort::Global<void>::api_.ReleaseThreadingOptions(envOpts);
    return env;
}

Ort::Env& default_env()
{
    static Ort::Env INSTANCE = create_env();
    return INSTANCE;
}

struct onnx_model_impl
{
    onnx_model_impl(const void* modeldata, size_t modelsize, int cuda_device)
    {
        Ort::Env& env = default_env();
        session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
//        session_options.SetInterOpNumThreads(1); //already using openmp. Using more threads slows the whole thing down
//        session_options.SetIntraOpNumThreads(1); //already using openmp. Using more threads slows the whole thing down
        if (cuda_device >= 0)
        {
            session_options.SetInterOpNumThreads(1);
            session_options.SetIntraOpNumThreads(1);
            session_options.DisableCpuMemArena();
//            session_options.DisablePerSessionThreads();
            assert(OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, cuda_device) == 0);
        }
        
        session.reset(new Ort::Session(env, modeldata, modelsize, session_options));

        size_t num_input_nodes = session->GetInputCount();
        
        for (size_t i = 0; i < num_input_nodes; i++) 
        {
            // print input node names
            char* input_name = session->GetInputName(i, allocator);
            input_names.push_back(input_name);

            // print input node types
            Ort::TypeInfo type_info = session->GetInputTypeInfo(i);
            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            input_shapes.push_back(shape);
            string shape_str_all = "";
            for (auto s : shape)
                shape_str_all += std::to_string(s) + ",";
            printf("input  %lu name %s obj type %i val type %i shape %s\n", i,
                                                                               input_name,
                                                                               (int)object_type,
                                                                               (int)value_type,
                                                                               shape_str_all.c_str());
        }
    
        size_t num_output_nodes = session->GetOutputCount();

        for (size_t i = 0; i < num_output_nodes; i++) 
        {
            // print output node names
            char* output_name = session->GetOutputName(i, allocator);
            output_names.push_back(output_name);

            // print output node types
            Ort::TypeInfo type_info = session->GetOutputTypeInfo(i);
            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            output_shapes.push_back(shape);
            string shape_str_all = "";
            for (auto s : shape)
                shape_str_all += std::to_string(s) + ",";
            printf("output %lu name %s obj type %i val type %i shape %s\n", i,
                                                                               output_name,
                                                                               (int)object_type,
                                                                               (int)value_type,
                                                                               shape_str_all.c_str());
        }
    }
       
    void infer(vector<vector<int64_t>>  inputshapes, 
               vector<const float*>     inputs,
               vector<vector<int64_t>>& outputshapes,
               vector<vector<float>>&   outputs)
    {
        assert(inputshapes.size() == inputs.size());
        assert(inputs.size() == input_names.size());
        
        vector<Ort::Value> input_tensors;
        for (size_t i = 0 ; i < inputs.size() ; i++)
        {
            //check shape
            auto& shape = inputshapes[i];
            auto& exp   = input_shapes[i];
            for (size_t d = 0 ; d < shape.size() ; d++)
                if (exp[d] > 0)
                    assert(exp[d] == shape[d]);
            //total size
            int64_t size = std::accumulate(shape.begin(), shape.end(), int64_t{1}, [](int64_t p, int64_t s){return p*s;});
            //load tensor
            auto memory_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);
            input_tensors.push_back(Ort::Value::CreateTensor<float>(memory_info, 
                                                                    const_cast<float*>(inputs[i]), 
                                                                    size, 
                                                                    shape.data(), 
                                                                    shape.size()));
            assert(input_tensors.back().IsTensor());
        }

        auto output_tensors = session->Run(Ort::RunOptions{nullptr}, input_names.data(), &input_tensors[0], input_tensors.size(), output_names.data(), output_names.size());

        for (auto& output_tensor : output_tensors)
        {
            auto type_info = output_tensor.GetTypeInfo();
//            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
//            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            outputshapes.push_back(shape);
            int64_t size = std::accumulate(shape.begin(), shape.end(), int64_t{1}, [](int64_t p, int64_t s){return p*s;});
            float* ptr = output_tensor.GetTensorMutableData<float>();
            outputs.emplace_back(ptr, ptr + size);
        }
    }
    
    Ort::SessionOptions                 session_options;
    Ort::AllocatorWithDefaultOptions    allocator;
    std::unique_ptr<Ort::Session>       session;
    std::vector<const char*>            input_names;
    std::vector<std::vector<int64_t>>   input_shapes;
    std::vector<const char*>            output_names;
    std::vector<std::vector<int64_t>>   output_shapes;
};

System information

@pfeatherstone
Author

By the way, performance-wise, ORT is great. But in a multi-threaded, heavily orchestrated application using lots of models (2 to 5), I'm literally getting 50+ threads created. I have a 4-core/8-thread PC...

@pfeatherstone
Author

Update:
If I use:

Ort::Env create_env()
{
    OrtThreadingOptions* envOpts = 0;
    assert(Ort::Global<void>::api_.CreateThreadingOptions(&envOpts) == 0);
    Ort::Env env(/*envOpts,*/ ORT_LOGGING_LEVEL_WARNING, "test");
    Ort::Global<void>::api_.ReleaseThreadingOptions(envOpts);
    return env;
}

and

session_options.SetInterOpNumThreads(1);
session_options.SetIntraOpNumThreads(1);
session_options.DisableCpuMemArena();
session_options.DisablePerSessionThreads();

It has the opposite effect. It creates many more threads.
The biggest issue is the lack of memory freeing. I've done my best to not use a memory arena, though evidence in htop would suggest that it is using one and it is never getting freed. Now I would expect a memory arena to hold onto memory while a session is active but not when it has been deallocated.
@snnn what do you make of all this? Do you experience similar issues with https://github.com/microsoft/onnxruntime/releases/download/v1.3.0/onnxruntime-linux-x64-gpu-1.3.0.tgz ?
BTW, ignore my comments about OpenMP in the code above. I don't think the official builds are built with openmp.

@pfeatherstone
Author

It also looks like, if you use OrtSessionOptionsAppendExecutionProvider_CUDA to use a CUDA device, more threads are created. Furthermore, when the session is destroyed, there is still a thread pool alive with a lot of memory held. Again, this is from looking at htop and nvidia-smi.

@pfeatherstone
Author

Has it got something to do with protobuf having its own memory arena?

@pfeatherstone
Author

Actually, I've just tried libdarknet. I get the same thing. Does cuDNN do its own thread pooling and memory arena business?

@pfeatherstone
Author

Or can cudaFree leak?

@pfeatherstone
Author

https://stackoverflow.com/questions/61346562/cuda-unified-memory-leak indicates that memory leaks are possible if there is a "corrupt" CUDA installation

@snnn
Member

snnn commented Jun 1, 2020

@snnn
Member

snnn commented Jun 2, 2020

About thread pools closing:

  1. If OpenMP is enabled then, yes, as you can see, it won't get closed, because we didn't create those threads; the OpenMP runtime did.

  2. If OpenMP is disabled: a thread pool created when creating the OrtEnv will be destroyed when you destroy the OrtEnv, and a thread pool created when creating the session will be destroyed when you destroy the session.
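
To make the second point concrete, here is a minimal sketch (not taken from the issue, and using a placeholder model path) of the two non-OpenMP cases, built only from calls that appear elsewhere in this thread:

#include <onnxruntime_cxx_api.h>
#include <cassert>

int main()
{
    // Case 2a: per-session thread pools. They are created with the session and
    // destroyed when `session` goes out of scope; the Env owns no pools here.
    {
        Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "per-session");
        Ort::SessionOptions so;
        so.SetIntraOpNumThreads(1);
        so.SetInterOpNumThreads(1);
        Ort::Session session(env, "model.onnx", so);   // placeholder model path
    }   // session (and its pools) destroyed here, then env

    // Case 2b: global thread pools. They are created with the Env (via
    // OrtThreadingOptions) and destroyed when `env` goes out of scope;
    // sessions opt in with DisablePerSessionThreads().
    {
        OrtThreadingOptions* tp_options = nullptr;
        OrtStatus* st = Ort::GetApi().CreateThreadingOptions(&tp_options);
        assert(st == nullptr); (void)st;

        Ort::Env env(tp_options, ORT_LOGGING_LEVEL_WARNING, "global");
        Ort::GetApi().ReleaseThreadingOptions(tp_options);

        Ort::SessionOptions so;
        so.DisablePerSessionThreads();                  // use the Env's global pools
        Ort::Session session(env, "model.onnx", so);    // placeholder model path
    }   // session destroyed first, then env (and the global pools)

    return 0;
}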

@skottmckay
Contributor

Note that there's currently no way to disable the arena the CUDA ExecutionProvider uses, so there will be some memory held by the session (as it owns the execution provider instance) for that (most on the CUDA device, but a small amount on CPU for transfers).

Unfortunately I don't know if there's currently a way to use the same ExecutionProvider instance across multiple sessions, so each is going to end up with a separate arena.

Ort::Env create_env()
{
    OrtThreadingOptions* envOpts = 0;
    assert(Ort::Global<void>::api_.CreateThreadingOptions(&envOpts) == 0);
    Ort::Env env(/*envOpts,*/ ORT_LOGGING_LEVEL_WARNING, "test");
    Ort::Global<void>::api_.ReleaseThreadingOptions(envOpts);
    return env;
}

If you don't provide OrtThreadingOptions on the first call to create the environment, it will not use a global threadpool. However, I would have expected an error from setting DisablePerSessionThreads if that was the case, so I'm not sure if you had envOpts commented out when constructing Ort::Env. If it wasn't commented out and you used the same Env instance for all the sessions, the number of threads created by ORT shouldn't increase with each new session.
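
For what it's worth, with the C++ wrapper that mismatch would normally surface as an exception when the session is constructed rather than at the DisablePerSessionThreads() call, because the check (the ORT_ENFORCE on EnvCreatedWithGlobalThreadPools quoted further down in this thread) runs during session creation. A minimal sketch, assuming the wrapper's usual behaviour of throwing Ort::Exception on an error status and using a placeholder model path:

#include <onnxruntime_cxx_api.h>
#include <cstdio>

int main()
{
    try
    {
        // Env created WITHOUT OrtThreadingOptions, so there are no global pools...
        Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");

        Ort::SessionOptions so;
        so.DisablePerSessionThreads();                 // ...but the session asks to use them

        // The inconsistency is detected here, when the session is constructed.
        Ort::Session session(env, "model.onnx", so);   // placeholder model path
    }
    catch (const Ort::Exception& e)
    {
        std::printf("session creation failed: %s\n", e.what());
    }
    return 0;
}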

@pranavsharma
Contributor

pranavsharma commented Jun 2, 2020

BTW, ignore my comments about OpenMP in the code above. I don't think the official builds are built with openmp.

The official CPU builds are indeed built with openmp enabled.
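If that's the case, intra-op parallelism is governed by the OpenMP runtime rather than by SetIntraOpNumThreads, so the usual way to cap the thread count is the OpenMP API or the OMP_NUM_THREADS environment variable. A rough sketch (the helper name is ours, not part of onnxruntime, and calling omp_set_num_threads from the application only helps if the application and the OpenMP-enabled onnxruntime build end up sharing the same OpenMP runtime):

#ifdef _OPENMP
#include <omp.h>
#endif

void cap_openmp_threads(int n)
{
#ifdef _OPENMP
    omp_set_num_threads(n);   // limits threads used in subsequent OpenMP parallel regions
#else
    (void)n;                  // the application itself was not built with OpenMP; nothing to do here
#endif
}

// Alternatively, export OMP_NUM_THREADS=1 (or another cap) before launching the process.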

@pfeatherstone
Author

So I'm using:

    OrtThreadingOptions* envOpts = 0;
    assert(Ort::GetApi().CreateThreadingOptions(&envOpts) == nullptr);
    Ort::Env env(envOpts, ORT_LOGGING_LEVEL_WARNING, "test");
    Ort::GetApi().ReleaseThreadingOptions(envOpts);
    return env;

then

        session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
        session_options.SetInterOpNumThreads(1);
        session_options.SetIntraOpNumThreads(1);
        session_options.DisablePerSessionThreads();
        if (cuda_device >= 0)
            assert(OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, cuda_device) == 0);

I'm running 5 models concurrently and I'm getting 25 threads, one of which is the main thread and 5 of which are running the 5 models concurrently. I have 8 physical threads on my PC, so I would expect all 5 models to share a thread pool of 8 threads. No?

If I use:

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");

and

session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
session_options.SetInterOpNumThreads(1);
session_options.SetIntraOpNumThreads(1);
if (cuda_device >= 0)
    assert(OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, cuda_device) == 0);

I get 11 threads when running 5 models concurrently.

So the behavior isn't as expected.
By the way, I've reinstalled CUDA to avoid "corrupt" installations.

@pfeatherstone
Author

I've tried building onnxruntime from source using both CUDA 10.1 and CUDA 10.2, and I get the same behavior. So I've dumped 10.1 because I want to avoid having conflicting CUDA installations.

@pfeatherstone
Author

There's also the issue of leftover threads holding loads of memory. I get similar behavior with libdarknet, so I'm wondering if the CUDA library or cuDNN has internal memory pools that allocate workspaces, or something like that, and there's nothing you can do about it. However, it would seem libdarknet is better at disposing of memory when "sessions" are deallocated. I don't know enough about the internals of CUDA and cuDNN to say for sure, but this is all making me very nervous.

@pfeatherstone
Author

Oh, and I'm using OrtDeviceAllocator, not OrtArenaAllocator.

@pfeatherstone
Author

@snnn Can you see anything wrong with this code?

Ort::Env create_env()
{
//    OrtThreadingOptions* envOpts = 0;
//    assert(Ort::GetApi().CreateThreadingOptions(&envOpts) == nullptr);
    Ort::Env env(/*envOpts,*/ ORT_LOGGING_LEVEL_WARNING, "test");
//    Ort::GetApi().ReleaseThreadingOptions(envOpts);
    return env;
}

Ort::Env env = create_env();

struct onnx_model::onnx_model_impl
{
    onnx_model_impl(const void* modeldata, size_t modelsize, int cuda_device)
    :   session(nullptr)
    {
        session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
        session_options.SetInterOpNumThreads(1);
        session_options.SetIntraOpNumThreads(1);
//        session_options.DisableCpuMemArena();
//        session_options.DisablePerSessionThreads();
        if (cuda_device >= 0)
            assert(OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, cuda_device) == 0);
        
        session = Ort::Session(env, modeldata, modelsize, session_options);

        size_t num_input_nodes = session.GetInputCount();
        
        for (size_t i = 0; i < num_input_nodes; i++) 
        {
            // print input node names
            char* input_name = session.GetInputName(i, allocator);
            input_names.push_back(input_name);

            // print input node types
            Ort::TypeInfo type_info = session.GetInputTypeInfo(i);
            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            input_shapes.push_back(shape);
            string shape_str_all = "";
            for (auto s : shape)
                shape_str_all += std::to_string(s) + ",";
            printf("input  %lu name %s obj type %i val type %i shape %s\n", i,
                                                                               input_name,
                                                                               (int)object_type,
                                                                               (int)value_type,
                                                                               shape_str_all.c_str());
        }
    
        size_t num_output_nodes = session.GetOutputCount();

        for (size_t i = 0; i < num_output_nodes; i++) 
        {
            // print output node names
            char* output_name = session.GetOutputName(i, allocator);
            output_names.push_back(output_name);

            // print output node types
            Ort::TypeInfo type_info = session.GetOutputTypeInfo(i);
            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            output_shapes.push_back(shape);
            string shape_str_all = "";
            for (auto s : shape)
                shape_str_all += std::to_string(s) + ",";
            printf("output %lu name %s obj type %i val type %i shape %s\n", i,
                                                                               output_name,
                                                                               (int)object_type,
                                                                               (int)value_type,
                                                                               shape_str_all.c_str());
        }
    }
       
    void infer(vector<vector<int64_t>>  inputshapes, 
               vector<const float*>     inputs,
               vector<vector<int64_t>>& outputshapes,
               vector<vector<float>>&   outputs)
    {
        assert(inputshapes.size() == inputs.size());
        assert(inputs.size() == input_names.size());
        
        vector<Ort::Value> input_tensors;
        for (size_t i = 0 ; i < inputs.size() ; i++)
        {
            //check shape
            auto& shape = inputshapes[i];
            auto& exp   = input_shapes[i];
            for (size_t d = 0 ; d < shape.size() ; d++)
                if (exp[d] > 0)
                    assert(exp[d] == shape[d]);
            //total size
            int64_t size = std::accumulate(shape.begin(), shape.end(), int64_t{1}, [](int64_t p, int64_t s){return p*s;});
            //load tensor
            auto memory_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);
            input_tensors.push_back(Ort::Value::CreateTensor<float>(memory_info, 
                                                                    const_cast<float*>(inputs[i]), 
                                                                    size, 
                                                                    shape.data(), 
                                                                    shape.size()));
            assert(input_tensors.back().IsTensor());
        }

        auto output_tensors = session.Run(Ort::RunOptions{nullptr}, input_names.data(), &input_tensors[0], input_tensors.size(), output_names.data(), output_names.size());

        for (auto& output_tensor : output_tensors)
        {
            auto type_info = output_tensor.GetTypeInfo();
//            auto object_type        = type_info.GetONNXType();
            auto tensor_info        = type_info.GetTensorTypeAndShapeInfo();
//            auto value_type         = tensor_info.GetElementType();
            auto shape              = tensor_info.GetShape();
            outputshapes.push_back(shape);
            int64_t size = std::accumulate(shape.begin(), shape.end(), int64_t{1}, [](int64_t p, int64_t s){return p*s;});
            float* ptr = output_tensor.GetTensorMutableData<float>();
            outputs.emplace_back(ptr, ptr + size);
        }
    }

    Ort::SessionOptions                 session_options;
    Ort::AllocatorWithDefaultOptions    allocator;
    Ort::Session                        session;
    std::vector<const char*>            input_names;
    std::vector<std::vector<int64_t>>   input_shapes;
    std::vector<const char*>            output_names;
    std::vector<std::vector<int64_t>>   output_shapes;
};

@pfeatherstone
Author

If there is nothing wrong with the code above, it must be either:

  • a bad CUDA/cuDNN installation (which I doubt; I've reinstalled everything from scratch)
  • a misunderstanding of what CUDA and cuDNN do behind the scenes
  • a bug in onnxruntime

@skottmckay
Contributor

Some clarifications:

When you create the shared global threadpool by passing OrtThreadingOptions to the OrtEnv constructor it will default to creating pool sizes based on the number of cores. e.g. if you have 4 cores you would get 4 threads in the intra-op threadpools, and 4 in the inter-op threadpool. Any session created after that will use that threadpool, so setting the intra/inter op thread counts in session options will do nothing.

if (use_per_session_threads_) {
  LOGS(*session_logger_, INFO) << "Creating and using per session threadpools since use_per_session_threads_ is true";
  {
    OrtThreadPoolParams to = session_options_.intra_op_param;
    if (to.name == nullptr) {
      to.name = ORT_TSTR("intra-op");
    }
    // If the thread pool can use all the processors, then
    // we set affinity of each thread to each processor.
    to.auto_set_affinity = to.thread_pool_size == 0 &&
                           session_options_.execution_mode == ExecutionMode::ORT_SEQUENTIAL &&
                           to.affinity_vec_len == 0;
    thread_pool_ =
        concurrency::CreateThreadPool(&Env::Default(), to, concurrency::ThreadPoolType::INTRA_OP);
  }
  if (session_options_.execution_mode == ExecutionMode::ORT_PARALLEL) {
    OrtThreadPoolParams to = session_options_.inter_op_param;
    // If the thread pool can use all the processors, then
    // we set thread affinity.
    to.auto_set_affinity =
        to.thread_pool_size == 0 && session_options_.execution_mode == ExecutionMode::ORT_SEQUENTIAL;
    if (to.name == nullptr)
      to.name = ORT_TSTR("intra-op");
    inter_op_thread_pool_ =
        concurrency::CreateThreadPool(&Env::Default(), to, concurrency::ThreadPoolType::INTER_OP);
    if (inter_op_thread_pool_ == nullptr) {
      LOGS(*session_logger_, INFO) << "Failed to create the inter-op thread pool for the parallel executor, setting ExecutionMode to SEQUENTIAL";
      session_options_.execution_mode = ExecutionMode::ORT_SEQUENTIAL;
    }
  }
} else {
  LOGS(*session_logger_, INFO) << "Using global/env threadpools since use_per_session_threads_ is false";
  intra_op_thread_pool_from_env_ = session_env.GetIntraOpThreadPool();
  inter_op_thread_pool_from_env_ = session_env.GetInterOpThreadPool();
  ORT_ENFORCE(session_env.EnvCreatedWithGlobalThreadPools(),
              "When the session is not configured to use per session"
              " threadpools, the env must be created with the the CreateEnvWithGlobalThreadPools API.");
}

When you are not using the shared threadpool, each session will have its own intra/inter op threadpool and the size will be based on SessionOptions.Set{Intra|Inter}OpNumThreads
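
For completeness, sizing works differently in the two modes: with per-session pools the Set{Intra|Inter}OpNumThreads values in SessionOptions are honoured, while the shared pools default to the core count as described above. Newer releases than the one discussed here add SetGlobalIntraOpNumThreads / SetGlobalInterOpNumThreads to OrtApi, applied to the OrtThreadingOptions before the Env is created; whether your build has them needs checking. A sketch under that assumption:

#include <onnxruntime_cxx_api.h>
#include <cassert>

// Sketch only: explicitly sizing the shared (global) thread pools, assuming an
// OrtApi that exposes SetGlobalIntraOpNumThreads/SetGlobalInterOpNumThreads.
Ort::Env create_env_with_sized_global_pools(int intra_threads, int inter_threads)
{
    OrtThreadingOptions* tp_options = nullptr;
    OrtStatus* st = Ort::GetApi().CreateThreadingOptions(&tp_options);
    assert(st == nullptr); (void)st;

    // Return statuses ignored for brevity in this sketch.
    Ort::GetApi().SetGlobalIntraOpNumThreads(tp_options, intra_threads);
    Ort::GetApi().SetGlobalInterOpNumThreads(tp_options, inter_threads);

    Ort::Env env(tp_options, ORT_LOGGING_LEVEL_WARNING, "sized-global-pools");
    Ort::GetApi().ReleaseThreadingOptions(tp_options);
    return env;   // sessions then opt in with session_options.DisablePerSessionThreads()
}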

I re-created most of your setup (Windows not Linux though, so possibly some differences) and see this after creating 4 sessions using the global threadpool:

[screenshot of the debugger's thread list omitted]

One main thread, 4 threads for the global intra-op and inter-op threadpools, and 3 CUDA threads that were created when the first CUDA EP was created, but no extra threads after that. Ignore the 3 threads from ntdll as they're not relevant to your question.

The number of CUDA threads may differ on Linux as I don't know how they choose how many threads to create.

No growth in threads occurs when calling your infer(...) method.

Finally, when I clean up the sessions and free the environment, the ORT threadpool threads go away. There are still the 3 CUDA threads remaining though, as we don't control those. I don't know how much memory they're holding onto.

@pfeatherstone
Author

Thank you for your response. I think I'm satisfied with the thread pooling. Part of the confusion is that I'm building onnxruntime from source, and regardless of whether or not I use the "use_openmp" option, it builds with OpenMP. After running my program through GDB and tracing thread backtraces, I realised that onnxruntime was using OpenMP, so indeed the SetInterOpNumThreads and SetIntraOpNumThreads functions weren't doing anything. Is there a way of disabling OpenMP when building from source? Maybe CMake auto-detects OpenMP and uses it regardless (on Linux; I appreciate you guys develop on Windows)?

The CUDA threads not being removed and holding onto memory worries me a bit, though less so since I realised that other libraries like darknet have similar behavior. Maybe CUDA does some internal caching and thread pooling that is hidden from the user. Maybe that is precisely what functions like cudaMallocManaged do.

I don't know why I had a growth in threads when using a global thread pool. Maybe it's because I hadn't realised my build of onnxruntime was using OpenMP, and the combination of the two doesn't play nicely. I'll try rebuilding it without OpenMP and see if the behaviour is as expected.

@pfeatherstone
Author

So I rebuilt using the following command:

./build.sh --use_cuda --cuda_version=10.2 --cuda_home /usr/local/cuda-10.2/ --cudnn_home /usr/local/cuda-10.2/ --build_shared_lib --config Release --parallel

The CMake log outputs this:

-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5") 

So I'm wondering if it uses it anyway, even though I haven't specified the "use_openmp" option.

@pfeatherstone
Author

Ah, just realised the build script has this line:
python3 $DIR/tools/ci_build/build.py --use_openmp --build_dir $DIR/build/Linux "$@"
So OpenMP was always being used.

@pfeatherstone
Author

Maybe '--use_openmp' should be removed from build.sh?

@pfeatherstone
Author

OK, that worked. So now it uses onnxruntime's own thread pool, and when the per-session thread pool is enabled, it gets 'freed' when the session gets deallocated. Great.
Is there a way to detect at runtime whether onnxruntime is built with OpenMP? I know the Python API allows this, but I'm not sure about the C++ API.

@skottmckay
Contributor

Looks like that has already been cleaned up. Apologies for the confusion that caused.
bd8993c

What would be done differently if it could report whether it's using OpenMP at runtime?

FWIW you can run ldd on a binary to see if it depends on openmp libraries.

@pfeatherstone
Author

Yeah, I thought about running ldd on libonnxruntime.so at runtime, but that seems like a temporary hack. Basically I need to know whether to set:

session_options.SetInterOpNumThreads(1);
session_options.SetIntraOpNumThreads(1);

or

session_options.SetInterOpNumThreads(std::thread::hardware_concurrency());
session_options.SetIntraOpNumThreads(std::thread::hardware_concurrency());

Which one depends on whether onnxruntime is built with OpenMP.

@pfeatherstone
Author

I'm temporarily using some extern bools:

extern const bool onnxruntime_is_built_with_openmp;
extern const bool onnxruntime_use_global_threadpool;

to configure the onnxruntime inference code. Then I just need to define them somewhere in my app, otherwise it won't build, which forces me to think about it and set them appropriately.

@pfeatherstone
Author

I guess somewhere in a .cpp file in onnxruntime you could define an extern bool, similarly to what I have done:

#ifdef _OPENMP
extern const bool onnxruntime_is_built_with_openmp = true;
#else 
extern const bool onnxruntime_is_built_with_openmp = false;
#endif

Voila, now the user can just declare the following:

extern const bool onnxruntime_is_built_with_openmp;

and can use it. Or a better, more C++ way would be to access it via the global API object.
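
To tie that together, here is a sketch of how inference code could pick its thread settings off such a flag (the flag is the one proposed above; onnxruntime itself does not define it):

#include <onnxruntime_cxx_api.h>
#include <thread>

// Defined elsewhere (by the application for now, or by onnxruntime if the
// suggestion above were adopted): true in OpenMP builds, false otherwise.
extern const bool onnxruntime_is_built_with_openmp;

void configure_thread_counts(Ort::SessionOptions& session_options)
{
    const int hw_threads = static_cast<int>(std::thread::hardware_concurrency());
    if (onnxruntime_is_built_with_openmp)
    {
        // OpenMP owns the intra-op parallelism, so keep ORT's own pools minimal.
        session_options.SetIntraOpNumThreads(1);
        session_options.SetInterOpNumThreads(1);
    }
    else
    {
        // ORT's own thread pools do the work, so size them to the hardware.
        session_options.SetIntraOpNumThreads(hw_threads);
        session_options.SetInterOpNumThreads(hw_threads);
    }
}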

@pfeatherstone
Author

OK, I've rebuilt OpenCV without concurrency support, so there are no extra threads I'm not aware of. I now get the expected behavior whether global thread pools are enabled or not. I still have 3 CUDA threads created that never get freed, but you get the same behavior, so I'm not too worried about that.

@pfeatherstone
Author

I guess this issue can be closed. The only remaining niggle is that I definitely saw thread growth when onnxruntime is built with OpenMP and trying to use a global thread pool. But I don't really care now because I'm not building it with OpenMP. So it's up to you if you want to close this.

@pfeatherstone
Author

Just so you know, there is an impact on performance when using the global thread pool on CPU. Looking at htop, the cores aren't being used at 100%. Using per-session thread pools, the cores are maxed out as expected.

@skottmckay
Contributor

Just so you know, there is an impact on performance when using the global thread pool on CPU. Looking at htop, the cores aren't being used at 100%. Using per-session thread pools, the cores are maxed out as expected.

Can you provide some more details? In particular, number of physical cores vs logical. Number of concurrent requests. Number of sessions.

Does using per-session thread pools result in more threads being available in total?

@pfeatherstone
Author

So I'm not sure of the difference between physical and logical cores; I believe it's 8 either way. Here is the output of 'lscpu':

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
Stepping:            13
CPU MHz:             800.098
CPU max MHz:         4700.0000
CPU min MHz:         800.0000
BogoMIPS:            6000.00
Virtualisation:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            12288K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

For this test, there was only 1 session and therefore no concurrent requests.

@snnn
Member

snnn commented Jun 4, 2020

Thank you.
