Monitoring covers all components in PAI and provides insight for detecting system and hardware failures and for analysing job performance.
PAI's monitoring system consists of three parts: watchdog, exporter and prometheus.
prometheus is an open source solution for collecting, storing and querying metrics.
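As an illustration of the querying side, here is a minimal sketch that runs an instant query against the Prometheus HTTP API (`/api/v1/query`) and parses the result. The base URL in the usage note is a hypothetical placeholder, not PAI's actual address, and the helper names are made up for this example.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_instant_vector(payload):
    """Return (labels, value) pairs from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected a vector, got %s" % data["resultType"])
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "value"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query(base_url, promql, timeout=10):
    """Run an instant query and return the parsed samples."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, `query("http://localhost:9090", "up")` would return one sample per scrape target, assuming a Prometheus server is reachable at that (hypothetical) address.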
watchdog is a single-instance pod responsible for monitoring the health of k8s, nodes and pods. It first lists all pods and nodes from the Kubernetes API server, then checks and logs their status, which is very helpful for debugging.
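The list-then-check loop described above can be sketched roughly as follows. This is not watchdog's actual code: it assumes the pod and node lists have already been fetched from the Kubernetes API server as JSON (the `PodList` and `NodeList` shapes), and the function names are hypothetical.

```python
import logging
from collections import Counter

logger = logging.getLogger("watchdog_sketch")


def summarize_pods(pod_list):
    """Log each pod's phase and return a count of pods per phase."""
    phases = Counter()
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        phase = pod.get("status", {}).get("phase", "Unknown")
        logger.info("pod %s is in phase %s", name, phase)
        phases[phase] += 1
    return dict(phases)


def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True"."""
    result = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            logger.warning("node %s is not ready", node["metadata"]["name"])
            result.append(node["metadata"]["name"])
    return result
```

Counts like these could then be exposed as metrics, while the log lines give a per-object record for debugging.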
exporter pods run on the nodes themselves (the leaf side) and are responsible for collecting metrics from jobs, nodes and GPUs. Two containers run inside each exporter pod: gpu_exporter and node_exporter. gpu_exporter exposes job/gpu metrics to the volume mounted at /datastorage/prometheus.
Metrics generated by watchdog and gpu_exporter are collected by the node_exporter container running inside the exporter pod. node_exporter also exposes node metrics such as node CPU, memory and disk usage.
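Writing metrics to a shared volume for node_exporter to pick up matches how node_exporter's textfile collector works: it reads `*.prom` files in Prometheus text exposition format from a configured directory. The sketch below assumes that mechanism is what is used here. The metric name, label and file name are hypothetical, and label values are assumed to contain no quotes, backslashes or newlines.

```python
import os
import tempfile


def format_metric(name, labels, value):
    """Format one sample in Prometheus text exposition format."""
    if not labels:
        return "%s %s" % (name, value)
    label_str = ",".join('%s="%s"' % (k, labels[k]) for k in sorted(labels))
    return "%s{%s} %s" % (name, label_str, value)


def write_metrics(directory, filename, lines):
    """Atomically write metric lines to directory/filename.

    Writing to a temporary file and renaming it means a reader never
    sees a half-written file.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    # mkstemp creates the file with mode 0600; make it readable by other users.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, os.path.join(directory, filename))
```

A caller might run `write_metrics("/datastorage/prometheus", "gpu_exporter.prom", [format_metric("gpu_util_sketch", {"minor_number": "0"}, 42)])` on each collection cycle.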
Exporter's metrics are listed here.
More metrics are listed here.
The most important use of metrics is alerting; check out the rule directory to see the metrics already used for alerting.
Other PAI components also use some metrics for display:
Component | Metric used |
Grafana | |
WebPortal | |
Build the images using paictl.py:
./paictl.py image build -p ~/pai-config/ -n gpu-exporter
./paictl.py image build -p ~/pai-config/ -n watchdog
Push the images to the registry for deployment:
./paictl.py image push -p ~/pai-config/ -n gpu-exporter
./paictl.py image push -p ~/pai-config/ -n watchdog
Start the service:
./paictl.py service start -p ~/pai-config/ -n prometheus
Stop the service:
./paictl.py service stop -p ~/pai-config/ -n prometheus
Stop the service and clean its data:
./paictl.py service delete -p ~/pai-config/ -n prometheus
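If you run the build, push and start steps often, they can be chained in a small script. The sketch below only assembles the same paictl.py invocations shown above, in the same order; the helper names are hypothetical, and it defaults to a dry run that prints the commands without executing them.

```python
import os
import subprocess

IMAGES = ("gpu-exporter", "watchdog")


def deploy_commands(config_path="~/pai-config/", images=IMAGES):
    """Return the build, push and start commands as argument lists."""
    # subprocess does not go through a shell, so expand "~" ourselves.
    path = os.path.expanduser(config_path)
    cmds = []
    for action in ("build", "push"):
        for image in images:
            cmds.append(["./paictl.py", "image", action, "-p", path, "-n", image])
    cmds.append(["./paictl.py", "service", "start", "-p", path, "-n", "prometheus"])
    return cmds


def run_all(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)
```

Calling `run_all(deploy_commands())` prints the five commands; pass `dry_run=False` from the directory containing paictl.py to actually run them.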