Configuring node-exporter to monitor CPU utilization metrics #32

Closed
knarayan opened this issue Nov 15, 2023 · 5 comments
Labels
Power modeling Node utilization vs. power model training

Comments

@knarayan
Collaborator

  • Run the commands below from the kube-prometheus directory to (re)install Prometheus.
    • Tear down any existing installation: kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
    • Then perform a fresh installation: kubectl apply --server-side -f manifests/setup -f manifests
  • If Prometheus was installed in the “monitoring” namespace, use the same namespace in the file “node-exporter.yaml” for the node-exporter installation.
  • Apply the manifest: kubectl apply -f node-exporter.yaml
    • If this fails with the error no matches for kind "DaemonSet" in version "apps/v1", the cluster predates Kubernetes 1.9, the release in which DaemonSet became available under apps/v1. Use an older API version that the cluster serves, such as "apps/v1beta2" or "extensions/v1beta1".
  • Verify that the node-exporter service exists.
    • kubectl get svc -n monitoring
  • Port-forward the node-exporter service (the command keeps running, so use a separate terminal).
    • kubectl port-forward svc/node-exporter 9100 -n monitoring
  • Check that the node-level metric node_cpu_seconds_total is exposed by the node-exporter service.
    • curl http://localhost:9100/metrics | grep node_cpu_seconds_total
  • Port-forward the prometheus-k8s service (also in a separate terminal).
    • kubectl port-forward svc/prometheus-k8s 9090 -n monitoring
  • Check that node_cpu_seconds_total is available in Prometheus through the prometheus-k8s service.
    • curl 'http://localhost:9090/api/v1/label/__name__/values' | grep node_cpu_seconds_total
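The two grep checks above are loose: grep also matches the # HELP and # TYPE comment lines and any longer metric name that contains the string. A minimal Python sketch of stricter checks follows, assuming the response bodies have already been fetched (for example, saved from the curl commands). The function names are hypothetical helpers, not part of kube-prometheus or Kepler.

```python
import json


def exposition_has_metric(metrics_text: str, name: str) -> bool:
    """Return True if a Prometheus text-format payload has a sample named `name`.

    Unlike grep, this skips `# HELP` / `# TYPE` comment lines and compares
    whole metric names, so `node_cpu_seconds_total_x` does not match.
    """
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        # A sample line is `name{labels} value` or `name value`.
        metric = line.split("{", 1)[0].split(" ", 1)[0]
        if metric == name:
            return True
    return False


def label_values_has_metric(response_body: str, name: str) -> bool:
    """Return True if a /api/v1/label/__name__/values response lists `name`.

    The Prometheus HTTP API wraps results as {"status": "success", "data": [...]}.
    """
    payload = json.loads(response_body)
    return payload.get("status") == "success" and name in payload.get("data", [])


if __name__ == "__main__":
    # Inline sample payloads, for illustration only (not real scrape output).
    metrics = (
        "# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.\n"
        "# TYPE node_cpu_seconds_total counter\n"
        'node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5\n'
    )
    api = '{"status": "success", "data": ["node_cpu_seconds_total", "up"]}'
    print(exposition_has_metric(metrics, "node_cpu_seconds_total"))  # True
    print(label_values_has_metric(api, "node_cpu_seconds_total"))  # True
```

In practice the payloads would come from the port-forwarded endpoints; keeping the parsing separate from the HTTP call makes the checks easy to reuse in a setup script.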
@knarayan knarayan added the Power modeling Node utilization vs. power model training label Nov 15, 2023
@knarayan
Collaborator Author

When the kepler-model-server script is used to prepare a kind cluster, the node-exporter service is not configured by default. The procedure above can be used to configure it.

@knarayan
Collaborator Author

Contents of the file node-exporter.yaml:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
      labels:
        app: node-exporter
    spec:
      containers:
      - args:
        - --web.listen-address=0.0.0.0:9100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        image: quay.io/prometheus/node-exporter:v0.18.1
        imagePullPolicy: IfNotPresent
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: metrics
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 50Mi
          requests:
            cpu: 100m
            memory: 30Mi
        volumeMounts:
        - mountPath: /host/proc
          name: proc
          readOnly: true
        - mountPath: /host/sys
          name: sys
          readOnly: true
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /proc
          type: ""
        name: proc
      - hostPath:
          path: /sys
          type: ""
        name: sys
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  ports:
  - name: node-exporter
    port: 9100
    protocol: TCP
    targetPort: 9100
  selector:
    app: node-exporter
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: node-exporter
    serviceMonitorSelector: prometheus-k8s
  name: node-exporter
  namespace: monitoring
spec:
  endpoints:
  - honorLabels: true
    interval: 10s
    path: /metrics
    targetPort: 9100
  jobLabel: node-exporter
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: node-exporter

@knarayan
Collaborator Author

knarayan commented Nov 15, 2023

Make sure that the Prometheus query step is greater than the node-exporter scrape interval (10 seconds, set by the interval field of the ServiceMonitor in node-exporter.yaml). Otherwise, consecutive points in the returned time series can repeat the same scraped sample instead of reporting distinct values.
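To illustrate why the step matters, here is a small Python simulation. It is an illustrative model, not Prometheus code: a range query evaluates the series at start, start + step, and so on, and at each point takes the most recent raw sample within the lookback window (5 minutes by default). With samples every 10 seconds, a 5-second step returns repeated values, while a 15-second step returns distinct ones. The function names and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, value)

LOOKBACK_SECONDS = 300  # Prometheus' default lookback delta is 5 minutes


def value_at(samples: List[Sample], t: int) -> Optional[float]:
    """Latest sample in (t - lookback, t], roughly how a Prometheus query
    picks a raw sample at each evaluation timestamp."""
    candidates = [s for s in samples if t - LOOKBACK_SECONDS < s[0] <= t]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s[0])[1]


def query_range(samples: List[Sample], start: int, end: int, step: int) -> List[Optional[float]]:
    """Evaluate the raw series at start, start + step, ..., up to end."""
    return [value_at(samples, t) for t in range(start, end + 1, step)]


if __name__ == "__main__":
    # Hypothetical counter scraped every 10 s (the ServiceMonitor interval).
    scraped = [(10 * i, float(i)) for i in range(7)]  # t = 0, 10, ..., 60
    print(query_range(scraped, 0, 60, 5))   # step < interval: consecutive points repeat
    print(query_range(scraped, 0, 60, 15))  # step > interval: distinct values
```

With the 5-second step, each scraped value is returned at two consecutive evaluation points, which is the repetition the step/interval rule avoids.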

Export the third-party metric name before collecting the metrics, as shown below.

% export PROM_THIRDPARTY_METRICS=node_cpu_seconds_total
% NATIVE="true" ./script.sh custom_collect
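node_cpu_seconds_total is a cumulative counter reported per CPU and per mode (idle, user, system, iowait, and so on), so CPU utilization has to be derived from the difference between two samples. The following is a minimal Python sketch, assuming the samples have already been parsed into dicts keyed by (cpu, mode); the function and type names are hypothetical and not part of the kepler-model-server scripts. The result is comparable to the common PromQL expression 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])).

```python
from typing import Dict, Tuple

# (cpu, mode) -> cumulative seconds, as reported by node_cpu_seconds_total.
CpuCounters = Dict[Tuple[str, str], float]


def cpu_utilization(prev: CpuCounters, curr: CpuCounters) -> float:
    """Fraction of CPU time spent outside mode="idle" between two samples,
    summed over all CPUs. Returns a value between 0 and 1."""
    busy = 0.0
    total = 0.0
    for key, value in curr.items():
        if key not in prev:
            continue  # new series: no earlier sample to subtract
        delta = value - prev[key]
        if delta < 0:
            # A decrease indicates a counter reset (e.g. the node rebooted);
            # this sketch does not try to handle it.
            raise ValueError(f"counter reset for cpu={key[0]} mode={key[1]}")
        total += delta
        if key[1] != "idle":
            busy += delta
    if total == 0:
        raise ValueError("no CPU time elapsed between the two samples")
    return busy / total


if __name__ == "__main__":
    # Made-up values for a 2-CPU node, for illustration only.
    before = {("0", "idle"): 100.0, ("0", "user"): 50.0,
              ("1", "idle"): 200.0, ("1", "system"): 10.0}
    after = {("0", "idle"): 107.0, ("0", "user"): 53.0,
             ("1", "idle"): 208.0, ("1", "system"): 12.0}
    print(cpu_utilization(before, after))  # 0.25
```

Dividing by the summed deltas rather than by wall-clock time keeps the result in the 0 to 1 range even if the two samples are not exactly one interval apart.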

@knarayan knarayan self-assigned this Nov 15, 2023
@knarayan
Collaborator Author

knarayan commented Nov 15, 2023

After Kepler is installed, appropriate access controls may also need to be set up, as in https://github.com/sustainable-computing-io/kepler-model-server/blob/main/model_training/script.sh#L147.

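Kepler can be installed either from the repository (using these steps) or by applying the deployment file in kepler-model-server (kubectl apply -f https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/model_training/deployment/kepler.yaml). Use the raw file URL: a github.com "blob" URL serves an HTML page rather than the YAML file, so kubectl cannot apply it.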

@knarayan
Collaborator Author

  1. Configuring metrics-server on the kind cluster:
    • Download the file https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
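    • Deploy it by running kubectl apply -f components.yaml
    • If the metrics-server pod logs an IP SANs error (check with kubectl logs -f metrics-server-5fb5598cf8-8588b -n kube-system; the pod name suffix will differ on your cluster), add the option --kubelet-insecure-tls to the metrics-server container args in components.yaml, as suggested here, and redeploy. This option skips verification of the kubelets' serving certificates, so it is only suitable for test clusters such as kind.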
  2. When creating a kind cluster with multiple nodes, you may need to increase the CPUs and memory available to the Docker environment (for example, in Docker Desktop's resource settings), since every kind node runs as a container sharing those resources.

Projects
None yet
Development

No branches or pull requests

1 participant