
[Bug]: Error: Start the triton server #2673

Closed
Mrzhiyao opened this issue Nov 9, 2023 · 11 comments

Labels
kind/bug Issues or changes related to a bug · needs-triage Issues that need triage

Comments


Mrzhiyao commented Nov 9, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

root@85d70c862b32:/opt/tritonserver# tritonserver --model-repository `pwd`/models
W1109 05:31:06.568839 124 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I1109 05:31:06.568981 124 cuda_memory_manager.cc:115] CUDA memory pool disabled
I1109 05:31:06.569292 124 tritonserver.cc:2176]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.24.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tens |
| | or_data statistics trace |
| model_repository_path[0] | /opt/tritonserver/models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1109 05:31:06.569348 124 server.cc:257] No server context available. Exiting immediately.
error: creating server: Internal - failed to stat file /opt/tritonserver/models

Expected Behavior

I'm following the official documentation to deploy a Triton server and use Towhee to speed up encoding.

I got an error at the "Start the Triton server" step after entering the container, yet Towhee encodes fine in my local environment when I don't go through the Triton server. Is the CUDA driver/runtime mismatch reported in the error message the reason it fails, and how can I proceed?

Steps To Reproduce

1. Build the image
from towhee import pipe, ops, AutoConfig
import numpy as np

p = (
    pipe.input('text')
    .map('text', 'vec', ops.sentence_embedding.sbert(model_name='paraphrase-multilingual-mpnet-base-v2'), config=AutoConfig.TritonGPUConfig())
    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
    .output('vec')
)

import towhee

towhee.build_docker_image(
    dc_pipeline=p,
    image_name='clip:v1',
    cuda_version='11.7', # '117dev' for developer
    format_priority=['onnx'],
    parallelism=4,
    inference_server='triton'
)


2. Create the models

import towhee
from towhee import pipe, ops, AutoConfig
import numpy as np
p = (
    pipe.input('text')
    .map('text', 'vec', ops.sentence_embedding.sbert(model_name='paraphrase-multilingual-mpnet-base-v2'), config=AutoConfig.TritonGPUConfig())
    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
    .output('vec')
)

towhee.build_pipeline_model(
    dc_pipeline=p,
    model_root='models',
    format_priority=['onnx'],
    parallelism=4,
    server='triton'
)

Environment

- Towhee version: 1.1.2
- OS (Ubuntu or CentOS): Ubuntu
- GPU: RTX 3090
- Triton server: 22.07
- CUDA: 11.7
- CUDA driver: 535.129.03

(base) eg@eg-HP-Z8-G4-Workstation:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

(base) eg@eg-HP-Z8-G4-Workstation:~$ nvidia-smi
Thu Nov  9 14:03:48 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |

root@85d70c862b32:/opt/tritonserver# nvcc -v
nvcc fatal   : No input files specified; use option --help for more information
root@85d70c862b32:/opt/tritonserver# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0


root@85d70c862b32:/opt/tritonserver# nvidia-smi
bash: nvidia-smi: command not found


Anything else?

No response

@Mrzhiyao added the kind/bug and needs-triage labels on Nov 9, 2023
@junjiejiangjjj (Contributor)

Did you pass the `--gpus` flag when starting the Docker container?
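(If not: the serving container usually needs GPU access and the Triton ports exposed, e.g. something along the lines of `docker run -td --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 clip:v1` — the image name and port mapping here are only placeholders taken from the build step above.)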

Mrzhiyao commented Nov 9, 2023

This problem was solved after I restarted the container, but a new error occurred when executing the program.

Traceback (most recent call last):
  File "/home/eg/PycharmProjects/Towhee/triton_endcod.py", line 8, in <module>
    res = client(data)
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/towhee/serve/triton/pipeline_client.py", line 81, in __call__
    return self._loop.run_until_complete(self._call(inputs))[0]
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/towhee/serve/triton/pipeline_client.py", line 68, in _call
    response = await self._client.infer(self._model_name, inputs)
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/tritonclient/http/aio/__init__.py", line 757, in infer
    response = await self._post(
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/tritonclient/http/aio/__init__.py", line 209, in _post
    res = await self._stub.post(
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/aiohttp/client.py", line 586, in _request
    await resp.start(conn)
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 920, in start
    self._continue = None
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/aiohttp/helpers.py", line 725, in __exit__
    raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError
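(For context, the script behind this traceback is essentially the standard Towhee Triton client call; a minimal sketch, assuming the towhee.triton_client helper and a server exposed on localhost:8000 — adjust the URL/port to your own mapping:)

from towhee import triton_client

# Connect to the Triton HTTP endpoint exposed by the serving container.
# The URL is an assumption; use whatever host:port you actually mapped.
client = triton_client.Client(url='localhost:8000')

data = 'hello towhee'
res = client(data)   # this is the call that times out in the traceback above
print(res)
client.close()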


Mrzhiyao commented Nov 9, 2023

Did you pass the `--gpus` flag when starting the Docker container?

Yes. That problem was solved after I recreated the container, but a new one appeared. Do you know how to solve it?

@junjiejiangjjj (Contributor)

It seems that access to the Triton server timed out. Are there any logs on the server?


Mrzhiyao commented Nov 9, 2023

It seems that access to the Triton server timed out. Are there any logs on the server?

docker logs shows the following:

NVIDIA Release 22.07 (build 41737377)
Triton Server Version 2.24.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

I1109 06:53:09.532688 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6a4e000000' with size 268435456
I1109 06:53:09.533016 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1109 06:53:09.536004 1 model_repository_manager.cc:1206] loading: pipeline:1
I1109 06:53:09.536049 1 model_repository_manager.cc:1206] loading: sentence-embedding.sbert-0:1
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:11.225232 1 onnxruntime.cc:2458] TRITONBACKEND_Initialize: onnxruntime
I1109 06:53:11.225295 1 onnxruntime.cc:2468] Triton TRITONBACKEND API version: 1.10
I1109 06:53:11.225317 1 onnxruntime.cc:2474] 'onnxruntime' TRITONBACKEND API version: 1.10
I1109 06:53:11.225331 1 onnxruntime.cc:2504] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I1109 06:53:11.259270 1 onnxruntime.cc:2560] TRITONBACKEND_ModelInitialize: sentence-embedding.sbert-0 (version 1)
W1109 06:53:14.630221 1 onnxruntime.cc:787] autofilled max_batch_size to 4 for model 'sentence-embedding.sbert-0' since batching is supporrted but no max_batch_size is specified in model configuration. Must specify max_batch_size to utilize autofill with a larger max batch size
I1109 06:53:14.685000 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_0 (CPU device 0)
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:17.996107 1 onnxruntime.cc:2603] TRITONBACKEND_ModelInstanceInitialize: sentence-embedding.sbert-0_0 (GPU device 0)
I1109 06:53:20.312004 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_1 (CPU device 0)
I1109 06:53:20.312255 1 model_repository_manager.cc:1352] successfully loaded 'sentence-embedding.sbert-0' version 1
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:23.568245 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_2 (CPU device 0)
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:26.839855 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_3 (CPU device 0)
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:30.081773 1 model_repository_manager.cc:1352] successfully loaded 'pipeline' version 1
I1109 06:53:30.082043 1 server.cc:559]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1109 06:53:30.082215 1 server.cc:586]

+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/b |
| | | ackends","default-max-batch-size":"4"}} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/b |
| | | ackends","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+

I1109 06:53:30.082348 1 server.cc:629]
+----------------------------+---------+--------+
| Model | Version | Status |
+----------------------------+---------+--------+
| pipeline | 1 | READY |
| sentence-embedding.sbert-0 | 1 | READY |
+----------------------------+---------+--------+

I1109 06:53:30.135753 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I1109 06:53:30.136027 1 tritonserver.cc:2176]
I1109 06:53:30.137643 1 grpc_server.cc:4608] Started GRPCInferenceService at 0.0.0.0:8001
I1109 06:53:30.137940 1 http_server.cc:3312] Started HTTPService at 0.0.0.0:8000
I1109 06:53:30.179419 1 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

@junjiejiangjjj (Contributor)

Run `curl http://0.0.0.0:8000/v2/models/stats` to check that the server is available.


Mrzhiyao commented Nov 9, 2023

Run `curl http://0.0.0.0:8000/v2/models/stats` to check that the server is available.

I mapped the server to local port 8010, so I do get a result (below). What might be causing the timeout in that case? Thank you for your help.

(base) eg@eg-HP-Z8-G4-Workstation:~$ curl http://0.0.0.0:8010/v2/models/stats
{"model_stats":[{"name":"pipeline","version":"1","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"batch_stats":[]},{"name":"sentence-embedding.sbert-0","version":"1","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"batch_stats":[]}]}

@junjiejiangjjj (Contributor)

Try ops.sentence_embedding.transformers; sbert has some bugs.
[screenshot of the working pipeline definition]
This pipeline works fine.
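(For readers without the screenshot, a minimal sketch of a pipeline along those lines — the exact model name here is only an illustrative assumption:)

from towhee import pipe, ops, AutoConfig
import numpy as np

# Same pipeline as in the issue, but with the transformers operator instead of sbert.
# The model name is an assumption; substitute the multilingual model you actually need.
p = (
    pipe.input('text')
    .map('text', 'vec',
         ops.sentence_embedding.transformers(
             model_name='sentence-transformers/paraphrase-multilingual-mpnet-base-v2'),
         config=AutoConfig.TritonGPUConfig())
    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
    .output('vec')
)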

@Mrzhiyao (Author)

Try ops.sentence_embedding.transformers; sbert has some bugs. [...] This pipeline works fine.

Thank you for your help; I think my problem has been resolved. One more question: which parameters can I tune to further speed up encoding when running inference through the Triton server?

@junjiejiangjjj (Contributor)

It is possible to optimize performance by adjusting parameters such as the number of instances and batch size. For more information, please refer to the Triton documentation: https://github.com/triton-inference-server/server
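(A sketch of where those knobs live on the Towhee side; the parameter names below follow the TritonGPUConfig examples in the Towhee docs and should be checked against your installed version:)

from towhee import AutoConfig

# Parameter names are taken from Towhee's AutoConfig examples (verify for your version).
config = AutoConfig.TritonGPUConfig(
    device_ids=[0],                # GPUs Triton should place model instances on
    num_instances_per_device=3,    # more instances -> more requests served in parallel
    max_batch_size=128,            # upper bound for dynamic batching
    batch_latency_micros=100000,   # how long Triton may wait to assemble a batch
    preferred_batch_size=[8, 16],  # batch sizes Triton tries to form first
)
# Pass config=config to the .map(...) call when building the pipeline, as in the examples above.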

@Mrzhiyao (Author)

It is possible to optimize performance by adjusting parameters such as the number of instances and batch size. For more information, please refer to the Triton documentation: https://github.com/triton-inference-server/server

Thank you very much for your help. I think my problem has been resolved.
