Baremetal dashboards #2657

Merged 6 commits on Aug 25, 2022
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -30,6 +30,7 @@
* [CHANGE] Dashboards: remove the "Cache - Latency (old)" panel from the "Mimir / Queries" dashboard. #2796
* [FEATURE] Dashboards: added support to experimental read-write deployment mode. #2780
* [ENHANCEMENT] Dashboards: added support to query-tee in front of ruler-query-frontend in the "Remote ruler reads" dashboard. #2761
* [ENHANCEMENT] Dashboards: Introduce support for baremetal deployment by setting `deployment_type: 'baremetal'` in the mixin `_config`. #2657
* [BUGFIX] Dashboards: stop setting 'interval' in dashboards; it should be set on your datasource. #2802

### Jsonnet
@@ -24,6 +24,11 @@ The following table shows the required label names and whether they can be customized

For rules and alerts to function properly, you must configure your Prometheus or Grafana Agent to scrape metrics from Grafana Mimir at an interval of `15s` or shorter.

## Deployment type

By default, Grafana Mimir dashboards assume Mimir is deployed in containers orchestrated by Kubernetes.
If you're running Mimir on bare metal, set the configuration field `deployment_type: 'baremetal'` and [re-compile the dashboards]({{< relref "installing-dashboards-and-alerts.md" >}}).
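
For example, a minimal mixin overlay might look like this (the file name and import path are illustrative; adjust them to where the mixin lives in your vendored tree):

```jsonnet
// mixin-overrides.libsonnet -- illustrative overlay, not shipped with the mixin
(import 'mimir-mixin/mixin.libsonnet') {
  _config+:: {
    deployment_type: 'baremetal',
    // Mount point where Mimir stores its data; only consulted by the
    // disk utilization panel in baremetal mode.
    instance_data_mountpoint: '/',
  },
}
```

Compile this overlay instead of the stock `mixin.libsonnet` to render the baremetal variants of the dashboards.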

## Job selection

A metric could be exposed by multiple Grafana Mimir components, or even different applications running in the same namespace.
109 changes: 109 additions & 0 deletions operations/mimir-mixin/config.libsonnet
@@ -78,6 +78,115 @@
gateway: helmCompatibleName('(gateway|cortex-gw|cortex-gw).*'),
},

deployment_type: 'kubernetes',
// System mount point where Mimir stores its data; used for baremetal
// deployments only.
instance_data_mountpoint: '/',
resources_panel_series: {
kubernetes: {
network_receive_bytes_metrics: 'container_network_receive_bytes_total',
network_transmit_bytes_metrics: 'container_network_transmit_bytes_total',
},
baremetal: {
network_receive_bytes_metrics: 'node_network_receive_bytes_total',
network_transmit_bytes_metrics: 'node_network_transmit_bytes_total',
},
},
resources_panel_queries: {
kubernetes: {
cpu_usage: 'sum by(%(instance)s) (rate(container_cpu_usage_seconds_total{%(namespace)s,container=~"%(instanceName)s"}[$__rate_interval]))',
cpu_limit: 'min(container_spec_cpu_quota{%(namespace)s,container=~"%(instanceName)s"} / container_spec_cpu_period{%(namespace)s,container=~"%(instanceName)s"})',
cpu_request: 'min(kube_pod_container_resource_requests{%(namespace)s,container=~"%(instanceName)s",resource="cpu"})',
// We use "max" instead of "sum" otherwise during a rolling update of a statefulset we will end up
// summing the memory of the old instance/pod (whose metric will be stale for 5m) to the new instance/pod.
memory_working_usage: 'max by(%(instance)s) (container_memory_working_set_bytes{%(namespace)s,container=~"%(instanceName)s"})',
memory_working_limit: 'min(container_spec_memory_limit_bytes{%(namespace)s,container=~"%(instanceName)s"} > 0)',
memory_working_request: 'min(kube_pod_container_resource_requests{%(namespace)s,container=~"%(instanceName)s",resource="memory"})',
// We use "max" instead of "sum" otherwise during a rolling update of a statefulset we will end up
// summing the memory of the old instance/pod (whose metric will be stale for 5m) to the new instance/pod.
memory_rss_usage: 'max by(%(instance)s) (container_memory_rss{%(namespace)s,container=~"%(instanceName)s"})',
memory_rss_limit: 'min(container_spec_memory_limit_bytes{%(namespace)s,container=~"%(instanceName)s"} > 0)',
memory_rss_request: 'min(kube_pod_container_resource_requests{%(namespace)s,container=~"%(instanceName)s",resource="memory"})',
network: 'sum by(%(instance)s) (rate(%(metric)s{%(namespace)s,%(instance)s=~"%(instanceName)s"}[$__rate_interval]))',
disk_writes:
|||
sum by(%(instanceLabel)s, %(instance)s, device) (
rate(
node_disk_written_bytes_total[$__rate_interval]
)
)
+
%(filterNodeDiskContainer)s
|||,
disk_reads:
|||
sum by(%(instanceLabel)s, %(instance)s, device) (
rate(
node_disk_read_bytes_total[$__rate_interval]
)
) + %(filterNodeDiskContainer)s
|||,
disk_utilization:
|||
max by(persistentvolumeclaim) (
kubelet_volume_stats_used_bytes{%(namespace)s} /
kubelet_volume_stats_capacity_bytes{%(namespace)s}
)
and
count by(persistentvolumeclaim) (
kube_persistentvolumeclaim_labels{
%(namespace)s,
%(label)s
}
)
|||,
},
baremetal: {
// Some queries do not make sense when running Mimir on baremetal,
// so they are not defined here.
cpu_usage: 'sum by(%(instance)s) (rate(node_cpu_seconds_total{mode="user",%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}[$__rate_interval]))',
memory_working_usage:
|||
node_memory_MemTotal_bytes{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}
- node_memory_MemFree_bytes{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}
- node_memory_Buffers_bytes{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}
- node_memory_Cached_bytes{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}
- node_memory_Slab_bytes{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}
- node_memory_PageTables_bytes{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}
- node_memory_SwapCached_bytes{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}
|||,
// From cAdvisor code, the memory RSS is:
// The amount of anonymous and swap cache memory (includes transparent hugepages).
memory_rss_usage:
|||
node_memory_Active_anon_bytes{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}
+ node_memory_SwapCached_bytes{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}
|||,
network: 'sum by(%(instance)s) (rate(%(metric)s{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}[$__rate_interval]))',
disk_writes:
|||
sum by(%(instanceLabel)s, %(instance)s, device) (
rate(
node_disk_written_bytes_total{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}[$__rate_interval]
)
)
|||,
disk_reads:
|||
sum by(%(instanceLabel)s, %(instance)s, device) (
rate(
node_disk_read_bytes_total{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}[$__rate_interval]
)
)
|||,
disk_utilization:
|||
1 - ((node_filesystem_avail_bytes{%(namespace)s,%(instance)s=~".*%(instanceName)s.*", mountpoint="%(instanceDataDir)s"})
/ node_filesystem_size_bytes{%(namespace)s,%(instance)s=~".*%(instanceName)s.*", mountpoint="%(instanceDataDir)s"})
|||,
},
},
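
The templates above rely on Jsonnet's `%` operator, which performs Python-style named-placeholder formatting. A quick sketch of how the baremetal `network` template expands (the substitution values below are examples, not defaults):

```jsonnet
// Illustrative expansion of the baremetal `network` query template.
local tmpl = 'sum by(%(instance)s) (rate(%(metric)s{%(namespace)s,%(instance)s=~".*%(instanceName)s.*"}[$__rate_interval]))';

tmpl % {
  instance: 'instance',
  metric: 'node_network_receive_bytes_total',
  namespace: 'namespace="mimir"',
  instanceName: 'ingester',
}
// Yields:
// sum by(instance) (rate(node_network_receive_bytes_total{namespace="mimir",instance=~".*ingester.*"}[$__rate_interval]))
```

Because both deployment types share one set of placeholder names, the panel code in `dashboard-utils.libsonnet` can fill either flavor of template with the same substitution map.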

// The label used to differentiate between different nodes (i.e. servers).
per_node_label: 'instance',

140 changes: 68 additions & 72 deletions operations/mimir-mixin/dashboards/dashboard-utils.libsonnet
@@ -178,13 +178,40 @@ local utils = import 'mixin-utils/utils.libsonnet';
},
},

containerCPUUsagePanel(title, containerName)::
resourcesPanelLegend(first_legend)::
if $._config.deployment_type == 'kubernetes'
then [first_legend, 'limit', 'request']
// limit and request do not make sense when running on baremetal
else [first_legend],

resourcesPanelQueries(metric, instanceName)::
if $._config.deployment_type == 'kubernetes'
then [
$._config.resources_panel_queries[$._config.deployment_type]['%s_usage' % metric] % {
instance: $._config.per_instance_label,
namespace: $.namespaceMatcher(),
instanceName: instanceName,
},
$._config.resources_panel_queries[$._config.deployment_type]['%s_limit' % metric] % {
namespace: $.namespaceMatcher(),
instanceName: instanceName,
},
$._config.resources_panel_queries[$._config.deployment_type]['%s_request' % metric] % {
namespace: $.namespaceMatcher(),
instanceName: instanceName,
},
]
else [
$._config.resources_panel_queries[$._config.deployment_type]['%s_usage' % metric] % {
instance: $._config.per_instance_label,
namespace: $.namespaceMatcher(),
instanceName: instanceName,
},
],

containerCPUUsagePanel(title, instanceName)::
$.panel(title) +
$.queryPanel([
'sum by(%s) (rate(container_cpu_usage_seconds_total{%s,container=~"%s"}[$__rate_interval]))' % [$._config.per_instance_label, $.namespaceMatcher(), containerName],
'min(container_spec_cpu_quota{%s,container=~"%s"} / container_spec_cpu_period{%s,container=~"%s"})' % [$.namespaceMatcher(), containerName, $.namespaceMatcher(), containerName],
'min(kube_pod_container_resource_requests{%s,container=~"%s",resource="cpu"})' % [$.namespaceMatcher(), containerName],
], ['{{%s}}' % $._config.per_instance_label, 'limit', 'request']) +
$.queryPanel($.resourcesPanelQueries('cpu', instanceName), $.resourcesPanelLegend('{{%s}}' % $._config.per_instance_label)) +
{
seriesOverrides: [
resourceRequestStyle,
@@ -194,15 +221,9 @@
fill: 0,
},

containerMemoryWorkingSetPanel(title, containerName)::
containerMemoryWorkingSetPanel(title, instanceName)::
$.panel(title) +
$.queryPanel([
// We use "max" instead of "sum" otherwise during a rolling update of a statefulset we will end up
// summing the memory of the old instance/pod (whose metric will be stale for 5m) to the new instance/pod.
'max by(%s) (container_memory_working_set_bytes{%s,container=~"%s"})' % [$._config.per_instance_label, $.namespaceMatcher(), containerName],
'min(container_spec_memory_limit_bytes{%s,container=~"%s"} > 0)' % [$.namespaceMatcher(), containerName],
'min(kube_pod_container_resource_requests{%s,container=~"%s",resource="memory"})' % [$.namespaceMatcher(), containerName],
], ['{{%s}}' % $._config.per_instance_label, 'limit', 'request']) +
$.queryPanel($.resourcesPanelQueries('memory_working', instanceName), $.resourcesPanelLegend('{{%s}}' % $._config.per_instance_label)) +
{
seriesOverrides: [
resourceRequestStyle,
@@ -213,15 +234,9 @@
fill: 0,
},

containerMemoryRSSPanel(title, containerName)::
containerMemoryRSSPanel(title, instanceName)::
$.panel(title) +
$.queryPanel([
// We use "max" instead of "sum" otherwise during a rolling update of a statefulset we will end up
// summing the memory of the old instance/pod (whose metric will be stale for 5m) to the new instance/pod.
'max by(%s) (container_memory_rss{%s,container=~"%s"})' % [$._config.per_instance_label, $.namespaceMatcher(), containerName],
'min(container_spec_memory_limit_bytes{%s,container=~"%s"} > 0)' % [$.namespaceMatcher(), containerName],
'min(kube_pod_container_resource_requests{%s,container=~"%s",resource="memory"})' % [$.namespaceMatcher(), containerName],
], ['{{%s}}' % $._config.per_instance_label, 'limit', 'request']) +
$.queryPanel($.resourcesPanelQueries('memory_rss', instanceName), $.resourcesPanelLegend('{{%s}}' % $._config.per_instance_label)) +
{
seriesOverrides: [
resourceRequestStyle,
@@ -235,7 +250,7 @@
containerNetworkPanel(title, metric, instanceName)::
$.panel(title) +
$.queryPanel(
'sum by(%(instance)s) (rate(%(metric)s{%(namespace)s,%(instance)s=~"%(instanceName)s"}[$__rate_interval]))' % {
$._config.resources_panel_queries[$._config.deployment_type].network % {
namespace: $.namespaceMatcher(),
metric: metric,
instance: $._config.per_instance_label,
@@ -246,80 +261,61 @@
{ yaxes: $.yaxes('Bps') },

containerNetworkReceiveBytesPanel(instanceName)::
$.containerNetworkPanel('Receive bandwidth', 'container_network_receive_bytes_total', instanceName),
$.containerNetworkPanel('Receive bandwidth', $._config.resources_panel_series[$._config.deployment_type].network_receive_bytes_metrics, instanceName),

containerNetworkTransmitBytesPanel(instanceName)::
$.containerNetworkPanel('Transmit bandwidth', 'container_network_transmit_bytes_total', instanceName),
$.containerNetworkPanel('Transmit bandwidth', $._config.resources_panel_series[$._config.deployment_type].network_transmit_bytes_metrics, instanceName),

containerDiskWritesPanel(title, containerName)::
containerDiskWritesPanel(title, instanceName)::
$.panel(title) +
$.queryPanel(
|||
sum by(%s, %s, device) (
rate(
node_disk_written_bytes_total[$__rate_interval]
)
)
+
%s
||| % [
$._config.per_node_label,
$._config.per_instance_label,
$.filterNodeDiskContainer(containerName),
],
$._config.resources_panel_queries[$._config.deployment_type].disk_writes % {
namespace: $.namespaceMatcher(),
instanceLabel: $._config.per_node_label,
instance: $._config.per_instance_label,
filterNodeDiskContainer: $.filterNodeDiskContainer(instanceName),
instanceName: instanceName,
},
'{{%s}} - {{device}}' % $._config.per_instance_label
) +
$.stack +
{ yaxes: $.yaxes('Bps') },

containerDiskReadsPanel(title, containerName)::
containerDiskReadsPanel(title, instanceName)::
$.panel(title) +
$.queryPanel(
|||
sum by(%s, %s, device) (
rate(
node_disk_read_bytes_total[$__rate_interval]
)
) + %s
||| % [
$._config.per_node_label,
$._config.per_instance_label,
$.filterNodeDiskContainer(containerName),
],
$._config.resources_panel_queries[$._config.deployment_type].disk_reads % {
namespace: $.namespaceMatcher(),
instanceLabel: $._config.per_node_label,
instance: $._config.per_instance_label,
filterNodeDiskContainer: $.filterNodeDiskContainer(instanceName),
instanceName: instanceName,
},
'{{%s}} - {{device}}' % $._config.per_instance_label
) +
$.stack +
{ yaxes: $.yaxes('Bps') },

containerDiskSpaceUtilization(title, containerName)::
containerDiskSpaceUtilization(title, instanceName)::
$.panel(title) +
$.queryPanel(
|||
max by(persistentvolumeclaim) (
kubelet_volume_stats_used_bytes{%(namespace)s} /
kubelet_volume_stats_capacity_bytes{%(namespace)s}
)
and
count by(persistentvolumeclaim) (
kube_persistentvolumeclaim_labels{
%(namespace)s,
%(label)s
}
)
||| % {
$._config.resources_panel_queries[$._config.deployment_type].disk_utilization % {
namespace: $.namespaceMatcher(),
label: $.containerLabelMatcher(containerName),
label: $.containerLabelMatcher(instanceName),
instance: $._config.per_instance_label,
instanceName: instanceName,
instanceDataDir: $._config.instance_data_mountpoint,
}, '{{persistentvolumeclaim}}'
) +
{
yaxes: $.yaxes('percentunit'),
fill: 0,
},

containerLabelMatcher(containerName)::
if containerName == 'ingester' then 'label_name=~"ingester.*"'
else if containerName == 'store-gateway' then 'label_name=~"store-gateway.*"'
else 'label_name="%s"' % containerName,
containerLabelMatcher(instanceName)::
if instanceName == 'ingester' then 'label_name=~"ingester.*"'
else if instanceName == 'store-gateway' then 'label_name=~"store-gateway.*"'
else 'label_name="%s"' % instanceName,

jobNetworkingRow(title, name)::
local vars = $._config {
@@ -561,7 +557,7 @@
{ yaxes: $.yaxes('percentunit') }
),

filterNodeDiskContainer(containerName)::
filterNodeDiskContainer(instanceName)::
|||
ignoring(%s) group_right() (
label_replace(
@@ -588,7 +584,7 @@
$._config.per_node_label,
$._config.per_instance_label,
$.namespaceMatcher(),
containerName,
instanceName,
],

filterKedaMetricByHPA(query, hpa_name)::