alerting

Goal

Monitor all components in pai, providing insight for detecting system/hardware failures and analysing job performance.

Architecture

Pai's monitoring system consists of three parts: watchdog, exporter and prometheus.

prometheus is an open source solution for collecting, storing and querying metrics.

watchdog is a single-instance pod responsible for monitoring the health of k8s, nodes and pods. It lists all pods and nodes from the Kubernetes API server, checks their status and logs it, which is very helpful for debugging.
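The node check described above can be sketched roughly as follows. This is only an illustration, not watchdog's actual code: it assumes the node list has already been fetched from the Kubernetes API server and parsed from JSON, and the function name `summarize_node_health` is hypothetical. It relies on the standard Kubernetes node `Ready` condition, whose status is `"True"`, `"False"` or `"Unknown"`.

```python
from collections import Counter


def summarize_node_health(node_list):
    """Count nodes by readiness, given a parsed Kubernetes NodeList object.

    Hypothetical helper for illustration; watchdog's real logic may differ.
    """
    counts = Counter()
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        # Each node reports a list of conditions; "Ready" tells us whether
        # the kubelet considers the node healthy.
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        status = ready.get("status") if ready else None
        if status == "True":
            counts["ready"] += 1
        elif status == "False":
            counts["not_ready"] += 1
        else:
            # Missing condition or status "Unknown".
            counts["unknown"] += 1
    return dict(counts)
```

Logging the returned counts on every pass would give the kind of status trail the paragraph above describes as helpful for debugging.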

exporter pods run on the leaf side, on the nodes, and are responsible for collecting metrics from jobs/nodes/gpus. Each exporter pod runs two containers: gpu_exporter and node_exporter. gpu_exporter exposes job/gpu metrics by writing them to a volume mounted at /datastorage/prometheus.

Metrics generated by watchdog and gpu_exporter are collected by the node_exporter container running inside the exporter pod. node_exporter also exposes node metrics such as cpu, memory and disk usage.
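To illustrate how a container can hand metrics to node_exporter through a shared volume, here is a minimal sketch. It assumes the hand-off uses node_exporter's textfile collector, which reads files ending in `.prom` from a configured directory; this README does not confirm that mechanism. The helper names, the file name and the `gpu_index` label are hypothetical.

```python
import os
import tempfile


def format_gauge(name, value, labels=None):
    """Render one sample in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join('{}="{}"'.format(k, v) for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return "{}{} {}".format(name, label_str, value)


def write_metrics_atomically(directory, filename, lines):
    """Write metric lines to directory/filename without exposing partial files.

    The data is written to a temporary file in the same directory and then
    renamed, so a reader never sees a half-written file.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    # mkstemp creates the file with mode 0600; relax it so another
    # container reading the shared volume can open it.
    os.chmod(tmp_path, 0o644)
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

A caller might use it as `write_metrics_atomically("/datastorage/prometheus", "gpu_exporter.prom", [format_gauge("gpu_utilization", 37.5, {"gpu_index": "0"})])`, where the directory comes from this document and the file name is only an example.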

Metrics collected

Exporter's metrics are listed here.

More metrics are listed here.

Metrics used

The most important use of metrics is alerting; check out the rule directory to see the metrics already used for alerting.

Other pai components also use some metrics for display:

Component Metric used
Grafana
  • node_uname_info
  • gpu_utilization
  • gpu_mem_utilization
  • configured_gpu_count
  • node_cpu_seconds_total
  • node_memory_MemTotal_bytes
  • node_memory_MemFree_bytes
  • node_memory_Buffers_bytes
  • node_memory_Cached_bytes
  • node_network_receive_bytes_total
  • node_network_transmit_bytes_total
  • node_disk_read_bytes_total
  • node_disk_written_bytes_total
  • task_cpu_percent
  • task_mem_usage_byte
  • task_net_in_byte
  • task_net_out_byte
  • task_block_in_byte
  • task_block_out_byte
  • task_gpu_percent
  • task_gpu_mem_percent
WebPortal
  • node_cpu_seconds_total
  • node_memory_MemTotal_bytes
  • node_memory_MemFree_bytes
  • node_memory_Buffers_bytes
  • node_memory_Cached_bytes
  • node_disk_read_bytes_total
  • node_disk_written_bytes_total
  • node_network_receive_bytes_total
  • node_network_transmit_bytes_total
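As an example of how the node memory metrics listed above can be combined for display, the sketch below derives used memory from one sample of the four `node_memory_*` gauges. The formula (total minus free, buffers and cached) is a common convention and an assumption here; the Grafana and WebPortal dashboards in pai may compute it differently. The function names are hypothetical.

```python
def memory_used_bytes(sample):
    """Used memory in bytes: total minus free, buffers and page cache.

    `sample` maps metric names to their current values for one node.
    """
    return (
        sample["node_memory_MemTotal_bytes"]
        - sample["node_memory_MemFree_bytes"]
        - sample["node_memory_Buffers_bytes"]
        - sample["node_memory_Cached_bytes"]
    )


def memory_used_percent(sample):
    """Used memory as a percentage of total memory."""
    total = sample["node_memory_MemTotal_bytes"]
    if total <= 0:
        raise ValueError("node_memory_MemTotal_bytes must be positive")
    return 100.0 * memory_used_bytes(sample) / total
```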

Build

Build the images with paictl.py:

./paictl.py image build -p ~/pai-config/ -n gpu-exporter
./paictl.py image build -p ~/pai-config/ -n watchdog

Push them to the registry for deployment:

./paictl.py image push -p ~/pai-config/ -n gpu-exporter
./paictl.py image push -p ~/pai-config/ -n watchdog

Deployment

Start the service with:

./paictl.py service start -p ~/pai-config/ -n prometheus

Stop the service with:

./paictl.py service stop -p ~/pai-config/ -n prometheus

Stop the service and clean its data with:

./paictl.py service delete -p ~/pai-config/ -n prometheus