Update based on Feedback from TME (triton-inference-server#3003)
* Update based on Feedback from TME

2 fixes in the README to resolve a startup error users might run into

* update user numbers

increase the simulated user count to 1000 and the ramp step to 50

* Update README.md
mengdong authored Jul 20, 2021
1 parent 47fad09 commit 1a7a0d6
Showing 2 changed files with 8 additions and 7 deletions.
11 changes: 6 additions & 5 deletions deploy/gke-marketplace-app/README.md
@@ -60,15 +60,17 @@ Currently, GKE >= 1.18.7 only supported in GKE rapid channel, to find the latest
export PROJECT_ID=<your GCP project ID>
export ZONE=<GCP zone of your choice>
export REGION=<GCP region of your choice>
- export DEPLOYMENT_NAME=<GKE cluster name, triton_gke for example>
+ export DEPLOYMENT_NAME=<GKE cluster name, triton-gke for example>
gcloud beta container clusters create ${DEPLOYMENT_NAME} \
--addons=HorizontalPodAutoscaling,HttpLoadBalancing,Istio \
--machine-type=n1-standard-8 \
--node-locations=${ZONE} \
--zone=${ZONE} \
--subnetwork=default \
--scopes cloud-platform \
- --num-nodes 1
+ --num-nodes 1 \
+ --project ${PROJECT_ID}
# add GPU node pools, user can modify number of node based on workloads
gcloud container node-pools create accel \
@@ -95,8 +97,7 @@ kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-ad
# enable stackdriver custom metrics adaptor
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml
```

- GPU resources in GCP could be fully utilized, so please try a different zone in case compute resource cannot be allocated. After GKE cluster is running, run `kubectl get pods --all-namespaces` to make sure the client can access the cluster correctly:
+ Creating a cluster and adding GPU nodes could take up-to 10 minutes. Please be patient after executing this command. GPU resources in GCP could be fully utilized, so please try a different zone in case compute resource cannot be allocated. After GKE cluster is running, run `kubectl get pods --all-namespaces` to make sure the client can access the cluster correctly:

If user would like to experiment with A100 MIG partitioned GPU in GKE, please create node pool with following command:
```
@@ -135,7 +136,7 @@ export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -

Third, we will try sending request to server with provide client example.

- If User selected deploy Triton to accept HTTP request, please launch [Locust](https://docs.locust.io/en/stable/installation.html) with Ingress host and port to query Triton Inference Server. In this [example script](https://github.com/triton-inference-server/server/tree/master/deploy/gke-marketplace-app/client-sample/locustfile_bert_large.py), we send request to Triton server which has loaded a BERT large TensorRT Engine with Sequence length of 128 into GCP bucket. We simulate 300 concurrent user as target and spawn user at rate of 10 users per second.
+ If User selected deploy Triton to accept HTTP request, please launch [Locust](https://docs.locust.io/en/stable/installation.html) with Ingress host and port to query Triton Inference Server. In this [example script](https://github.com/triton-inference-server/server/tree/master/deploy/gke-marketplace-app/client-sample/locustfile_bert_large.py), we send request to Triton server which has loaded a BERT large TensorRT Engine with Sequence length of 128 into GCP bucket. We simulate 1000 concurrent user as target and spawn user at rate of 50 users per second.
```
locust -f locustfile_bert_large.py -H http://${INGRESS_HOST}:${INGRESS_PORT}
```
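
Before launching the Locust run shown above, it can help to confirm that the ingress actually reaches a ready Triton server. The snippet below is a minimal sketch, not part of this commit, assuming `INGRESS_HOST` and `INGRESS_PORT` are exported as in the README and that Triton's standard KServe v2 HTTP health endpoint (`/v2/health/ready`) is enabled:
```
# Pre-flight check (illustrative, not part of this commit): confirm the
# ingress reaches a ready Triton server before starting the Locust run.
import os
import sys

import requests

host = os.environ.get("INGRESS_HOST", "localhost")
port = os.environ.get("INGRESS_PORT", "80")
url = f"http://{host}:{port}/v2/health/ready"

try:
    resp = requests.get(url, timeout=5)
except requests.RequestException as exc:
    sys.exit(f"Could not reach {url}: {exc}")

# Triton returns HTTP 200 with an empty body when it is ready to serve.
print(f"{url} -> {resp.status_code}")
sys.exit(0 if resp.status_code == 200 else 1)
```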
4 changes: 2 additions & 2 deletions deploy/gke-marketplace-app/client-sample/locustfile_bert_large.py
@@ -38,8 +38,8 @@ class ProfileLoad(LoadTestShape):
until time_limit is reached.
'''

- target_users = 300
- step_users = 10  # ramp users each step
+ target_users = 1000
+ step_users = 50  # ramp users each step
time_limit = 3600 # seconds

def tick(self):
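
The fragment above only shows the changed class attributes. Below is a minimal, self-contained sketch of how a step-ramp `LoadTestShape` typically consumes `target_users`, `step_users`, and `time_limit`; the 10-second step duration and the health-check task are illustrative assumptions, not the repository's actual locustfile:
```
# Illustrative sketch of the step-ramp pattern configured by target_users /
# step_users / time_limit above. The step_time value and the health-check
# task are assumptions for this sketch, not the repo's actual locustfile.
from locust import HttpUser, LoadTestShape, constant, task


class TritonUser(HttpUser):
    wait_time = constant(0)

    @task
    def health(self):
        # Placeholder request; the real locustfile sends BERT inference requests.
        self.client.get("/v2/health/ready")


class ProfileLoad(LoadTestShape):
    '''
    Ramp the user count up by step_users every step_time seconds
    until target_users is reached, then hold until time_limit.
    '''

    target_users = 1000
    step_users = 50     # ramp users each step
    time_limit = 3600   # seconds
    step_time = 10      # seconds per ramp step (assumed for this sketch)

    def tick(self):
        run_time = self.get_run_time()
        if run_time > self.time_limit:
            return None  # returning None stops the test

        current_step = int(run_time // self.step_time) + 1
        users = min(self.target_users, current_step * self.step_users)
        return (users, self.step_users)
```
Returning `None` from `tick()` ends the run once `time_limit` is exceeded, which is what keeps the load profile time-boxed; the second element of the returned tuple is the spawn rate, here 50 users per second as described in the commit message.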