
[TensorRT EP] Segmentation fault when concurrently loading model using TensorRT EP #20089

Closed
tanmayv25 opened this issue Mar 26, 2024 · 3 comments
Assignees
Labels
ep:TensorRT issues related to TensorRT execution provider

Comments

@tanmayv25

Describe the issue

There appears to be a regression in the ONNX Runtime library, observed in the ONNX Runtime backend for Triton Inference Server when using the TensorRT execution provider.

We started observing a segmentation fault, caused by memory corruption, when loading multiple sessions of the same model concurrently. The failing test is L0_onnx_optimization.

I have also written a small reproducer that uses the C API to load models the same way Triton's ONNX Runtime backend does.

ort_trt_test.cc

#include <assert.h>
#include <onnxruntime_c_api.h>

#include <iostream>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Global API handle used by THROW_ON_ERROR; initialized in run_ort_trt().
const OrtApi* ort_api;

#define THROW_ON_ERROR(S)                                                    \
  do {                                                                       \
    OrtStatus* status__ = (S);                                               \
    if (status__ != nullptr) {                                               \
      OrtErrorCode code = ort_api->GetErrorCode(status__);                   \
      std::string msg = std::string(ort_api->GetErrorMessage(status__));     \
      ort_api->ReleaseStatus(status__);                                      \
      throw std::invalid_argument((std::string("onnx runtime error ") +      \
                                        std::to_string(code) + ": " + msg)   \
                                           .c_str());                       \
    }                                                                        \
  } while (false)


void run_ort_trt(int thread_count, bool is_serial) {
  // Initialize the global handle (the original declared a shadowing local here).
  ort_api = OrtGetApiBase()->GetApi(ORT_API_VERSION);

  std::mutex serialized_mutex;

  OrtEnv* env;
  THROW_ON_ERROR(ort_api->CreateEnv(ORT_LOGGING_LEVEL_VERBOSE, "log", &env));

  OrtSessionOptions* session_options;
  THROW_ON_ERROR(ort_api->CreateSessionOptions(&session_options));

  const char* model_path = "model.onnx";

  std::vector<std::thread> threads;
  for (int i = 0; i < thread_count; ++i) {
    // Create a new thread using a lambda.
    threads.emplace_back([&]() {
      if (is_serial) {
        serialized_mutex.lock();
      }
      // Make a clone of the session options for this instance...
      OrtSessionOptions* soptions;
      THROW_ON_ERROR(
          ort_api->CloneSessionOptions(session_options, &soptions));

      OrtTensorRTProviderOptionsV2* tensorrt_options;
      THROW_ON_ERROR(
          ort_api->CreateTensorRTProviderOptions(&tensorrt_options));
      std::unique_ptr<OrtTensorRTProviderOptionsV2,
                      decltype(ort_api->ReleaseTensorRTProviderOptions)>
          rel_trt_options(tensorrt_options,
                          ort_api->ReleaseTensorRTProviderOptions);

      std::vector<std::string> keys, values;
      // keys.push_back("trt_engine_cache_enable");
      // values.push_back("1");
      // keys.push_back("trt_engine_cache_path");
      // values.push_back("/opt/tritonserver/qa/L0_onnx_optimization/trt_cache");

      std::vector<const char*> c_keys, c_values;
      if (!keys.empty() && !values.empty()) {
        for (size_t j = 0; j < keys.size(); ++j) {
          c_keys.push_back(keys[j].c_str());
          c_values.push_back(values[j].c_str());
        }
        THROW_ON_ERROR(ort_api->UpdateTensorRTProviderOptions(
            rel_trt_options.get(), c_keys.data(), c_values.data(),
            keys.size()));
      }

      THROW_ON_ERROR(ort_api->SessionOptionsAppendExecutionProvider_TensorRT_V2(
          soptions, rel_trt_options.get()));

      std::cout << "Running ORT TRT EP with default provider options"
                << std::endl;

      OrtSession* session;
      THROW_ON_ERROR(
          ort_api->CreateSession(env, model_path, soptions, &session));

      if (is_serial) {
        serialized_mutex.unlock();
      }

      ort_api->ReleaseSession(session);
      ort_api->ReleaseSessionOptions(soptions);
    });
  }

  for (auto& thread : threads) {
    thread.join();
  }

  // Note: it is not suggested to directly `new` an OrtTensorRTProviderOptionsV2
  // to get provider options. Use CreateTensorRTProviderOptions() instead,
  // since ORT takes care of validating the options for you.

  ort_api->ReleaseSessionOptions(session_options);
  ort_api->ReleaseEnv(env);
}



int main(int argc, char *argv[]) {
  int thread_count = 1;
  bool is_serial = false;
  if (argc > 1) {
    thread_count = std::stol(argv[1]);
  }
  if (argc > 2) {
    is_serial = (std::stol(argv[2]) > 0);
  }
  run_ort_trt(thread_count, is_serial);
  return 0;
}

Test Combinations and Results

The first argument of the binary sets how many ORT sessions will be loaded on the GPU.
The second argument sets whether the sessions are loaded concurrently: 0 means the sessions are loaded concurrently, while >0 means they are loaded one at a time.

CLI invocation        Result
./ort_trt_test 1 0    Passes
./ort_trt_test 10 1   Passes
./ort_trt_test 2 0    SegFault

Additionally, the backtrace of the segmentation fault is:

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140735072432128) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140735072432128) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140735072432128, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff5792476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff57787f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff57d9676 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff592bb77 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007ffff57f0cfc in malloc_printerr (str=str@entry=0x7ffff592e790 "double free or corruption (out)") at ./malloc/malloc.c:5664
#7  0x00007ffff57f2e70 in _int_free (av=0x7ffff596ac80 <main_arena>, p=0x7ffee8027650, have_lock=<optimized out>) at ./malloc/malloc.c:4588
#8  0x00007ffff57f5453 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#9  0x00007fffe20cba6b in __gnu_cxx::new_allocator<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> > >::deallocate (
    this=0x7fffe2167c70 <onnxruntime::CreateTensorRTCustomOpDomainList(std::vector<OrtCustomOpDomain*, std::allocator<OrtCustomOpDomain*> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::created_custom_op_list>, __p=0x7ffee8027660, __t=18446744073667605074) at /usr/include/c++/11/ext/new_allocator.h:145
#10 0x00007fffe20cb186 in std::allocator_traits<std::allocator<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> > > >::deallocate (__a=..., 
    __p=0x7ffee8027660, __n=18446744073667605074) at /usr/include/c++/11/bits/alloc_traits.h:496
#11 0x00007fffe20cab48 in std::_Vector_base<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> >, std::allocator<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> > > >::_M_deallocate (
    this=0x7fffe2167c70 <onnxruntime::CreateTensorRTCustomOpDomainList(std::vector<OrtCustomOpDomain*, std::allocator<OrtCustomOpDomain*> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::created_custom_op_list>, __p=0x7ffee8027660, __n=18446744073667605074) at /usr/include/c++/11/bits/stl_vector.h:354
#12 0x00007fffe20cb674 in std::vector<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> >, std::allocator<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> > > >::_M_realloc_insert<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> > > (
    this=0x7fffe2167c70 <onnxruntime::CreateTensorRTCustomOpDomainList(std::vector<OrtCustomOpDomain*, std::allocator<OrtCustomOpDomain*> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::created_custom_op_list>, __position=std::unique_ptr<onnxruntime::TensorRTCustomOp> = {get() = 0x810}) at /usr/include/c++/11/bits/vector.tcc:500
#13 0x00007fffe20cadce in std::vector<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> >, std::allocator<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> > > >::emplace_back<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> > > (
    this=0x7fffe2167c70 <onnxruntime::CreateTensorRTCustomOpDomainList(std::vector<OrtCustomOpDomain*, std::allocator<OrtCustomOpDomain*> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::created_custom_op_list>) at /usr/include/c++/11/bits/vector.tcc:121
#14 0x00007fffe20ca554 in std::vector<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> >, std::allocator<std::unique_ptr<onnxruntime::TensorRTCustomOp, std::default_delete<onnxruntime::TensorRTCustomOp> > > >::push_back (
    this=0x7fffe2167c70 <onnxruntime::CreateTensorRTCustomOpDomainList(std::vector<OrtCustomOpDomain*, std::allocator<OrtCustomOpDomain*> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::created_custom_op_list>, __x=...) at /usr/include/c++/11/bits/stl_vector.h:1204
#15 0x00007fffe20c9166 in onnxruntime::CreateTensorRTCustomOpDomainList (domain_list=std::vector of length 0, capacity 0, extra_plugin_lib_paths="")
    at /workspace/onnxruntime/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider_custom_ops.cc:76
#16 0x00007fffe20e4ea4 in onnxruntime::ProviderInfo_TensorRT_Impl::GetTensorRTCustomOpDomainList (this=0x7fffe2167038 <onnxruntime::g_info>, domain_list=std::vector of length 0, capacity 0, 
    extra_plugin_lib_paths="") at /workspace/onnxruntime/onnxruntime/core/providers/tensorrt/tensorrt_provider_factory.cc:36
#17 0x00007fff95d27fac in AddTensorRTCustomOpDomainToSessionOption (options=0x7ffeb00137e0, extra_plugin_lib_paths="") at /workspace/onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1694
#18 0x00007fff95d295ba in OrtApis::SessionOptionsAppendExecutionProvider_TensorRT_V2 (options=0x7ffeb00137e0, tensorrt_options=0x7ffeb0013aa0)
    at /workspace/onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1899
#19 0x00007fffe217a6a6 in triton::backend::onnxruntime::ModelState::LoadModel (this=0x7fff9848bf80, artifact_name="model.onnx", instance_group_kind=TRITONSERVER_INSTANCEGROUPKIND_GPU, 
    instance_group_device_id=0, model_path=0x7ffeb0012ac0, session=0x7ffeb0012ae0, default_allocator=0x7ffeb0012ae8, stream=0x7ffeb0012f40) at /tmp/tritonbuild/onnxruntime/src/onnxruntime.cc:526
#20 0x00007fffe217f70a in triton::backend::onnxruntime::ModelInstanceState::ModelInstanceState (this=0x7ffeb0012a30, model_state=0x7fff9848bf80, triton_model_instance=0x7ffeb0012430)
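
Frames #12-#15 show one thread reallocating a function-local static std::vector inside CreateTensorRTCustomOpDomainList() while another thread is mutating the same vector. As a hypothetical minimal sketch of this bug class (not the actual ORT source; the names are illustrative), two threads calling push_back on a shared static vector race on its reallocation and can produce exactly this "double free or corruption" abort:

#include <thread>
#include <vector>

// Hypothetical sketch: a function-local static vector mutated from
// multiple threads without synchronization (undefined behavior).
void append_unsynchronized() {
  static std::vector<int> created_custom_op_list;  // shared by all callers
  // Concurrent push_back calls can trigger overlapping reallocations
  // of the same buffer, leading to "double free or corruption".
  created_custom_op_list.push_back(42);
}

int main() {
  std::thread t1(append_unsynchronized);
  std::thread t2(append_unsynchronized);
  t1.join();
  t2.join();
}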

To reproduce

Compile the reproducer described above (ort_trt_test) and execute it with the CLI options listed in the table.

Urgency

The regression is quite serious and impacts users in production environments.

Platform

Linux

OS Version

5.15.0-89-generic

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.17.2

ONNX Runtime API

C

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

8.6.3.1+cuda12.2.2.009

@github-actions github-actions bot added the ep:TensorRT issues related to TensorRT execution provider label Mar 26, 2024
@yf711 yf711 self-assigned this Mar 26, 2024
@chilo-ms
Contributor

chilo-ms commented Mar 26, 2024

@tanmayv25 Thanks for raising this issue.
Here is the PR that fixes this concurrency issue; it resolves the problem on my side.
Could you help double-check as well? Thank you.

chilo-ms added a commit that referenced this issue Mar 27, 2024
`CreateTensorRTCustomOpDomainList()` is not thread-safe due to its
static variables, `created_custom_op_list` and `custom_op_domain`.
This PR ensures synchronization using a mutex.

see issue: #20089
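
For context, here is a minimal sketch of the fix pattern the commit message describes (serializing access to the static state with a mutex); this is illustrative, not the exact PR diff:

#include <mutex>
#include <thread>
#include <vector>

// Illustrative fix: a static mutex serializes every access to the
// function-local static container, so concurrent callers can no
// longer race on its reallocation.
void append_synchronized() {
  static std::mutex m;
  static std::vector<int> created_custom_op_list;
  std::lock_guard<std::mutex> lock(m);  // one caller at a time
  created_custom_op_list.push_back(42);
}

int main() {
  std::thread t1(append_synchronized);
  std::thread t2(append_synchronized);
  t1.join();
  t2.join();
}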
YUNQIUGUO pushed a commit that referenced this issue Mar 27, 2024
`CreateTensorRTCustomOpDomainList()` is not thread-safe due to its
static variables, `created_custom_op_list` and `custom_op_domain`.
This PR ensures synchronization using a mutex.

see issue: #20089
@tanmayv25
Author

tanmayv25 commented Apr 2, 2024

@chilo-ms I can confirm that the linked PR has fixed the issue. Thanks a lot!

@chilo-ms
Contributor

chilo-ms commented Apr 3, 2024

@tanmayv25, thanks for verifying.
FYI, the fix will be in the ORT 1.17.3 patch release.

TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this issue May 7, 2024
…#20093)

`CreateTensorRTCustomOpDomainList()` is not thread-safe due to its
static variables, `created_custom_op_list` and `custom_op_domain`.
This PR ensures synchronization using a mutex.

see issue: microsoft#20089