Goal

Monitor all components in PAI and provide insight for detecting system and hardware failures and for analyzing job performance.

Architecture

PAI's monitoring system consists of three parts: watchdog, exporter, and Prometheus.

Prometheus is an open source solution for collecting, storing, and querying metrics.

watchdog is a single-instance pod responsible for monitoring the health of k8s, nodes, and pods. It lists all pods and nodes from the Kubernetes API server, checks their status, and logs it, which is very helpful for debugging.
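The check described above can be sketched roughly as follows. This is a minimal illustration, not watchdog's actual code: the function names and sample payloads are hypothetical, and the input is assumed to be the parsed JSON that the Kubernetes API server returns for pod and node lists (`status.phase` for pods, the `Ready` entry in `status.conditions` for nodes).

```python
from collections import Counter


def summarize_pods(pod_list):
    """Count pods per phase from a PodList-shaped dict."""
    return Counter(
        item.get("status", {}).get("phase", "Unknown")
        for item in pod_list.get("items", [])
    )


def not_ready_nodes(node_list):
    """Return names of nodes whose Ready condition is not "True"."""
    names = []
    for item in node_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(item["metadata"]["name"])
    return names


# Hypothetical sample payloads, trimmed to the fields the sketch reads.
sample_pods = {"items": [
    {"metadata": {"name": "job-a"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "job-b"}, "status": {"phase": "Pending"}},
    {"metadata": {"name": "job-c"}, "status": {"phase": "Running"}},
]}
sample_nodes = {"items": [
    {"metadata": {"name": "node-1"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-2"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]}

if __name__ == "__main__":
    print(dict(summarize_pods(sample_pods)))
    print(not_ready_nodes(sample_nodes))
```

A node counts as unhealthy here when its `Ready` condition is missing, `False`, or `Unknown`; a real watchdog may apply stricter or additional checks.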

exporter pods run on the nodes and are responsible for collecting metrics from jobs, nodes, and GPUs. Each exporter pod runs two containers, gpu_exporter and node_exporter. gpu_exporter writes job and GPU metrics to a volume mounted at /datastorage/prometheus.

Metrics generated by watchdog and gpu_exporter are collected by the node_exporter container running inside the exporter pod. node_exporter also exposes node metrics such as CPU, memory, and disk usage.
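One way such a hand-off through a shared volume can work is node_exporter's textfile collector, which reads `*.prom` files from a configured directory. The sketch below assumes that mechanism; it is not taken from gpu_exporter's source. The helper names, the `minor_number` label, and the file name `gpu.prom` are hypothetical, and the demo writes to a temporary directory rather than /datastorage/prometheus.

```python
import os
import tempfile


def escape_label_value(value):
    """Escape backslash, double quote and newline for the text format."""
    return (str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n"))


def format_metric(name, value, labels=None):
    """Render one sample in the Prometheus text exposition format."""
    if labels:
        pairs = ",".join(
            '{}="{}"'.format(key, escape_label_value(val))
            for key, val in sorted(labels.items())
        )
        return "{}{{{}}} {}".format(name, pairs, value)
    return "{} {}".format(name, value)


def write_metrics(directory, filename, lines):
    """Write to a temp file, then rename, so readers never see a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)  # atomic within one filesystem
    return final_path


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()  # stand-in for the shared volume
    path = write_metrics(out_dir, "gpu.prom", [
        format_metric("gpu_utilization", 42, {"minor_number": "0"}),
        format_metric("configured_gpu_count", 4),
    ])
    with open(path) as f:
        print(f.read(), end="")
```

The write-then-rename step matters because the collector may read the directory at any moment; the `.tmp` suffix keeps the in-progress file from matching `*.prom`.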

Metrics collected

Exporter's metrics are listed here.

More metrics are listed here.

Metrics used

The most important use of metrics is alerting. Check out the rule directory to see the metrics already used for alerting.

Other PAI components also use some metrics for display:

Component	Metrics used
Grafana	node_uname_info, gpu_utilization, gpu_mem_utilization, configured_gpu_count, node_cpu_seconds_total, node_memory_MemTotal_bytes, node_memory_MemFree_bytes, node_memory_Buffers_bytes, node_memory_Cached_bytes, node_network_receive_bytes_total, node_network_transmit_bytes_total, node_disk_read_bytes_total, node_disk_written_bytes_total, task_cpu_percent, task_mem_usage_byte, task_net_in_byte, task_net_out_byte, task_block_in_byte, task_block_out_byte, task_gpu_percent, task_gpu_mem_percent
WebPortal	node_cpu_seconds_total, node_memory_MemTotal_bytes, node_memory_MemFree_bytes, node_memory_Buffers_bytes, node_memory_Cached_bytes, node_disk_read_bytes_total, node_disk_written_bytes_total, node_network_receive_bytes_total
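The four node_memory_* metrics above are typically combined into a single "used memory" figure. How Grafana or WebPortal actually combine them is not stated here, so the following is only a sketch of one common convention (total minus free, buffers, and cache); the function names and sample values are hypothetical.

```python
def memory_used_bytes(sample):
    """Used memory, treating buffers and page cache as reclaimable."""
    return (
        sample["node_memory_MemTotal_bytes"]
        - sample["node_memory_MemFree_bytes"]
        - sample["node_memory_Buffers_bytes"]
        - sample["node_memory_Cached_bytes"]
    )


def memory_used_percent(sample):
    """Used memory as a percentage of total memory."""
    total = sample["node_memory_MemTotal_bytes"]
    if total <= 0:
        raise ValueError("node_memory_MemTotal_bytes must be positive")
    return 100.0 * memory_used_bytes(sample) / total


# Hypothetical values for one node at one scrape.
sample = {
    "node_memory_MemTotal_bytes": 16000,
    "node_memory_MemFree_bytes": 4000,
    "node_memory_Buffers_bytes": 1000,
    "node_memory_Cached_bytes": 3000,
}

if __name__ == "__main__":
    print(memory_used_bytes(sample), memory_used_percent(sample))
```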

Build

Build the images with paictl.py:

./paictl.py image build -p ~/pai-config/ -n gpu-exporter
./paictl.py image build -p ~/pai-config/ -n watchdog

Push the images to the registry for deployment:

./paictl.py image push -p ~/pai-config/ -n gpu-exporter
./paictl.py image push -p ~/pai-config/ -n watchdog
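If the build and push steps need to be scripted, the commands above can be generated rather than typed by hand. The helper below is a hypothetical convenience sketch, not part of paictl.py; it only assembles the same argument lists shown above and leaves running them to the caller.

```python
import os


def paictl_image_commands(config_dir, images, actions=("build", "push")):
    """Return argv lists: every image for the first action, then the next."""
    commands = []
    for action in actions:
        for image in images:
            commands.append(
                ["./paictl.py", "image", action, "-p", config_dir, "-n", image]
            )
    return commands


if __name__ == "__main__":
    # subprocess does not expand "~", so expand it here.
    config_dir = os.path.expanduser("~/pai-config/")
    for argv in paictl_image_commands(config_dir, ["gpu-exporter", "watchdog"]):
        print(" ".join(argv))
        # To actually run a command:
        # subprocess.run(argv, check=True)
```

Passing each list to `subprocess.run(argv, check=True)` would stop at the first failing step, so a failed build is never followed by a push.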

Deployment

Start the service:

./paictl.py service start -p ~/pai-config/ -n prometheus

Stop the service:

./paictl.py service stop -p ~/pai-config/ -n prometheus

Stop the service and clean its data:

./paictl.py service delete -p ~/pai-config/ -n prometheus
