[Bug] Segmentation fault occurs and the machine with openEuler os was automatically reboots #1905

Closed
2 tasks done
jiajie-yang opened this issue Jul 3, 2024 · 6 comments

jiajie-yang (Contributor) commented Jul 3, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

In my A40/T4 environment, when I start the lmdeploy service, it prints the service address and then the process exits with a segmentation fault and no other error message. More importantly, about a minute after the error, the machine reboots automatically, which is very confusing.

Reproduction

lmdeploy serve api_server internlm2-chat-7b --tp 2 --log-level DEBUG

Environment

sys.platform: linux
Python: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A40
CUDA_HOME: None
GCC: gcc (GCC) 10.3.1
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.0+
transformers: 4.41.2
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.8.0
triton: 2.2.0

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:0B:00.0 Off |                    0 |
|  0%   52C    P0              84W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     On  | 00000000:0C:00.0 Off |                    0 |
|  0%   53C    P0              85W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
...

Error traceback

HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
Segmentation fault

jiajie-yang (Contributor, Author) commented Jul 3, 2024

I installed lmdeploy with pip install lmdeploy. The system and kernel information is as follows:

$ cat /etc/os-release 
NAME="openEuler"
VERSION="22.03 LTS"
ID="openEuler"
VERSION_ID="22.03"
PRETTY_NAME="openEuler 22.03 LTS"
ANSI_COLOR="0;31"

$ uname -a
Linux standalone-amd64-a40-26-163 5.10.0-60.94.0.118.oe2203.x86_64 #1 SMP Mon Sep 4 12:14:17 CST 2023 x86_64 x86_64 x86_64 GNU/Linux

jiajie-yang (Contributor, Author) commented

Is it because the system is openEuler? There doesn't seem to be this problem on CentOS and Ubuntu systems.

jiajie-yang changed the title from "[Bug] Segmentation fault occurs and machine's os automatically reboots" to "[Bug] Segmentation fault occurs and the machine with openEuler os was automatically reboots" on Jul 3, 2024

zhyncs (Collaborator) commented Jul 3, 2024

The Python package is compiled with manylinux2014_x86_64, and I only have servers with CentOS 7 or Ubuntu 22.04. It may be a corner case that is hard for me to reproduce because I don't have that environment.

lvhan028 (Collaborator) commented Jul 9, 2024

> Is it because the system is openEuler? There doesn't seem to be this problem on CentOS and Ubuntu systems.

Can you try to build the project from source? Here is the guide: https://lmdeploy.readthedocs.io/en/latest/build.html

lvhan028 self-assigned this on Jul 9, 2024

jiajie-yang (Contributor, Author) commented

> The Python package is compiled with manylinux2014_x86_64, and I only have servers with CentOS 7 or Ubuntu 22.04. It may be a corner case that is hard for me to reproduce because I don't have that environment.

The kernel version (5.10.0-60.94.0.118.oe2203.x86_64) is almost certainly the cause, because upgrading or downgrading the kernel appears to solve the problem.
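
To check whether a host is running the kernel release reported in this issue, something like the sketch below may help. It is only an illustration: the helper name `is_reported_kernel` is hypothetical, the release string is copied from the `uname -a` output earlier in this thread, and it is not known which other kernel releases, if any, are affected.

```python
import platform

# Kernel release reported in this issue (taken from the `uname -a` output).
# Assumption: only this exact release is compared; other releases are untested.
REPORTED_KERNEL = "5.10.0-60.94.0.118.oe2203.x86_64"


def is_reported_kernel(release=None):
    """Return True if `release` equals the kernel release from this report.

    When no release is given, the running kernel's release string from
    platform.release() is used.
    """
    if release is None:
        release = platform.release()
    return release == REPORTED_KERNEL


if __name__ == "__main__":
    print(f"Running kernel: {platform.release()}")
    if is_reported_kernel():
        print("This host runs the kernel release reported in the issue.")
    else:
        print("This host runs a different kernel release.")
```

A match only means the host runs the same kernel build as in this report; it does not confirm that the segmentation fault or the reboot will occur.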

jiajie-yang (Contributor, Author) commented

> > Is it because the system is openEuler? There doesn't seem to be this problem on CentOS and Ubuntu systems.
>
> Can you try to build the project from source? Here is the guide: https://lmdeploy.readthedocs.io/en/latest/build.html

That doesn't seem necessary, since upgrading or downgrading the kernel appears to solve the problem.
