Alertmanager: Add state size limit (#9475)
titolins authored Oct 3, 2024
1 parent d62e3f7 commit 4ba903c
Showing 12 changed files with 187 additions and 66 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -25,6 +25,7 @@
* `-query-scheduler.grpc-client-config.grpc-compression=s2`
* `-ruler.client.grpc-compression=s2`
* `-ruler.query-frontend.grpc-client-config.grpc-compression=s2`
* [FEATURE] Alertmanager: limit added for maximum size of the Grafana state (`-alertmanager.max-grafana-state-size-bytes`). #9475
* [FEATURE] Alertmanager: limit added for maximum size of the Grafana configuration (`-alertmanager.max-config-size-bytes`). #9402
* [FEATURE] Ingester: Experimental support for ingesting out-of-order native histograms. This is disabled by default and can be enabled by setting `-ingester.ooo-native-histograms-ingestion-enabled` to `true`. #7175
* [ENHANCEMENT] Ruler: Support `exclude_alerts` parameter in `<prometheus-http-prefix>/api/v1/rules` endpoint. #9300
14 changes: 12 additions & 2 deletions cmd/mimir/config-descriptor.json
@@ -4612,7 +4612,7 @@
"kind": "field",
"name": "alertmanager_max_grafana_config_size_bytes",
"required": false,
"desc": "Maximum size of the Grafana configuration file for Alertmanager that a tenant can upload via the Alertmanager API. 0 = no limit.",
"desc": "Maximum size of the Grafana Alertmanager configuration for a tenant. 0 = no limit.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "alertmanager.max-grafana-config-size-bytes",
@@ -4622,12 +4622,22 @@
"kind": "field",
"name": "alertmanager_max_config_size_bytes",
"required": false,
"desc": "Maximum size of configuration file for Alertmanager that tenant can upload via Alertmanager API. 0 = no limit.",
"desc": "Maximum size of the Alertmanager configuration for a tenant. 0 = no limit.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "alertmanager.max-config-size-bytes",
"fieldType": "int"
},
{
"kind": "field",
"name": "alertmanager_max_grafana_state_size_bytes",
"required": false,
"desc": "Maximum size of the Grafana Alertmanager state for a tenant. 0 = no limit.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "alertmanager.max-grafana-state-size-bytes",
"fieldType": "int"
},
{
"kind": "field",
"name": "alertmanager_max_silences_count",
6 changes: 4 additions & 2 deletions cmd/mimir/help-all.txt.tmpl
@@ -200,11 +200,13 @@ Usage of ./cmd/mimir/mimir:
-alertmanager.max-concurrent-get-requests-per-tenant int
Maximum number of concurrent GET requests allowed per tenant. The zero value (and negative values) result in a limit of GOMAXPROCS or 8, whichever is larger. Status code 503 is served for GET requests that would exceed the concurrency limit.
-alertmanager.max-config-size-bytes int
Maximum size of configuration file for Alertmanager that tenant can upload via Alertmanager API. 0 = no limit.
Maximum size of the Alertmanager configuration for a tenant. 0 = no limit.
-alertmanager.max-dispatcher-aggregation-groups int
Maximum number of aggregation groups in Alertmanager's dispatcher that a tenant can have. Each active aggregation group uses single goroutine. When the limit is reached, dispatcher will not dispatch alerts that belong to additional aggregation groups, but existing groups will keep working properly. 0 = no limit.
-alertmanager.max-grafana-config-size-bytes int
Maximum size of the Grafana configuration file for Alertmanager that a tenant can upload via the Alertmanager API. 0 = no limit.
Maximum size of the Grafana Alertmanager configuration for a tenant. 0 = no limit.
-alertmanager.max-grafana-state-size-bytes int
Maximum size of the Grafana Alertmanager state for a tenant. 0 = no limit.
-alertmanager.max-recv-msg-size int
Maximum size (bytes) of an accepted HTTP request body. (default 104857600)
-alertmanager.max-silence-size-bytes int
6 changes: 4 additions & 2 deletions cmd/mimir/help.txt.tmpl
@@ -86,11 +86,13 @@ Usage of ./cmd/mimir/mimir:
-alertmanager.max-alerts-size-bytes int
Maximum total size of alerts that a single tenant can have, alert size is the sum of the bytes of its labels, annotations and generatorURL. Inserting more alerts will fail with a log message and metric increment. 0 = no limit.
-alertmanager.max-config-size-bytes int
Maximum size of configuration file for Alertmanager that tenant can upload via Alertmanager API. 0 = no limit.
Maximum size of the Alertmanager configuration for a tenant. 0 = no limit.
-alertmanager.max-dispatcher-aggregation-groups int
Maximum number of aggregation groups in Alertmanager's dispatcher that a tenant can have. Each active aggregation group uses single goroutine. When the limit is reached, dispatcher will not dispatch alerts that belong to additional aggregation groups, but existing groups will keep working properly. 0 = no limit.
-alertmanager.max-grafana-config-size-bytes int
Maximum size of the Grafana configuration file for Alertmanager that a tenant can upload via the Alertmanager API. 0 = no limit.
Maximum size of the Grafana Alertmanager configuration for a tenant. 0 = no limit.
-alertmanager.max-grafana-state-size-bytes int
Maximum size of the Grafana Alertmanager state for a tenant. 0 = no limit.
-alertmanager.max-silence-size-bytes int
Maximum silence size in bytes. 0 = no limit.
-alertmanager.max-silences-count int
11 changes: 7 additions & 4 deletions docs/sources/mimir/configure/configuration-parameters/index.md
@@ -3664,16 +3664,19 @@ The `limits` block configures default and per-tenant limits imposed by component
# CLI flag: -alertmanager.notification-rate-limit-per-integration
[alertmanager_notification_rate_limit_per_integration: <map of string to float64> | default = {}]

# Maximum size of the Grafana configuration file for Alertmanager that a tenant
# can upload via the Alertmanager API. 0 = no limit.
# Maximum size of the Grafana Alertmanager configuration for a tenant. 0 = no
# limit.
# CLI flag: -alertmanager.max-grafana-config-size-bytes
[alertmanager_max_grafana_config_size_bytes: <int> | default = 0]

# Maximum size of configuration file for Alertmanager that tenant can upload via
# Alertmanager API. 0 = no limit.
# Maximum size of the Alertmanager configuration for a tenant. 0 = no limit.
# CLI flag: -alertmanager.max-config-size-bytes
[alertmanager_max_config_size_bytes: <int> | default = 0]

# Maximum size of the Grafana Alertmanager state for a tenant. 0 = no limit.
# CLI flag: -alertmanager.max-grafana-state-size-bytes
[alertmanager_max_grafana_state_size_bytes: <int> | default = 0]

# Maximum number of silences, including expired silences, that a tenant can have
# at once. 0 = no limit.
# CLI flag: -alertmanager.max-silences-count
10 changes: 10 additions & 0 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -2386,6 +2386,16 @@ How to **fix** it:
This error only occurs when an administrator has explicitly defined a blocked list for a given tenant. After assessing whether the reason for blocking one or more queries still applies, you can update the tenant's limits and remove the pattern.
### err-mimir-alertmanager-max-grafana-config-size
This non-critical error occurs when the Alertmanager receives a Grafana Alertmanager configuration larger than the configured size limit.
The limit protects the system’s stability from potential abuse or mistakes. To configure the limit on a per-tenant basis, use the `alertmanager.max-grafana-config-size-bytes` option.
### err-mimir-alertmanager-max-grafana-state-size
This non-critical error occurs when the Alertmanager receives a Grafana Alertmanager state larger than the configured size limit.
The limit protects the system’s stability from potential abuse or mistakes. To configure the limit on a per-tenant basis, use the `alertmanager.max-grafana-state-size-bytes` option.
## Mimir routes by path
**Write path**:
41 changes: 38 additions & 3 deletions pkg/alertmanager/api_grafana.go
@@ -17,7 +17,9 @@ import (

"github.com/grafana/mimir/pkg/alertmanager/alertspb"
"github.com/grafana/mimir/pkg/util"
"github.com/grafana/mimir/pkg/util/globalerror"
util_log "github.com/grafana/mimir/pkg/util/log"
"github.com/grafana/mimir/pkg/util/validation"
)

const (
@@ -38,6 +40,17 @@ const (
statusError = "error"
)

var (
maxGrafanaConfigSizeMsgFormat = globalerror.AlertmanagerMaxGrafanaConfigSize.MessageWithPerTenantLimitConfig(
"Alertmanager configuration is too big, limit: %d bytes",
validation.AlertmanagerMaxGrafanaConfigSizeFlag,
)
maxGrafanaStateSizeMsgFormat = globalerror.AlertmanagerMaxGrafanaStateSize.MessageWithPerTenantLimitConfig(
"Alertmanager state is too big, limit: %d bytes",
validation.AlertmanagerMaxGrafanaStateSizeFlag,
)
)

type GrafanaAlertmanagerConfig struct {
Templates map[string]string `json:"template_files"`
AlertmanagerConfig definition.PostableApiAlertingConfig `json:"alertmanager_config"`
@@ -169,8 +182,27 @@ func (am *MultitenantAlertmanager) SetUserGrafanaState(w http.ResponseWriter, r
return
}

payload, err := io.ReadAll(r.Body)
var input io.Reader
maxStateSize := am.limits.AlertmanagerMaxGrafanaStateSize(userID)
if maxStateSize > 0 {
input = http.MaxBytesReader(w, r.Body, int64(maxStateSize))
} else {
input = r.Body
}

payload, err := io.ReadAll(input)
if err != nil {
if maxBytesErr := (&http.MaxBytesError{}); errors.As(err, &maxBytesErr) {
msg := fmt.Sprintf(maxGrafanaStateSizeMsgFormat, maxStateSize)
level.Warn(logger).Log("msg", msg)
w.WriteHeader(http.StatusBadRequest)
util.WriteJSONResponse(w, errorResult{
Status: statusError,
Error: msg,
})
return
}

level.Error(logger).Log("msg", errReadingState, "err", err.Error())
w.WriteHeader(http.StatusBadRequest)
util.WriteJSONResponse(w, errorResult{
@@ -320,10 +352,13 @@ func (am *MultitenantAlertmanager) SetUserGrafanaConfig(w http.ResponseWriter, r
payload, err := io.ReadAll(input)
if err != nil {
if maxBytesErr := (&http.MaxBytesError{}); errors.As(err, &maxBytesErr) {
msg := fmt.Sprintf(errConfigurationTooBig, maxConfigSize)
msg := fmt.Sprintf(maxGrafanaConfigSizeMsgFormat, maxConfigSize)
level.Warn(logger).Log("msg", msg)
w.WriteHeader(http.StatusBadRequest)
util.WriteJSONResponse(w, errorResult{Status: statusError, Error: msg})
util.WriteJSONResponse(w, errorResult{
Status: statusError,
Error: msg,
})
return
}

140 changes: 89 additions & 51 deletions pkg/alertmanager/api_grafana_test.go
@@ -342,7 +342,7 @@ func TestMultitenantAlertmanager_SetUserGrafanaConfig(t *testing.T) {
expStatusCode: http.StatusBadRequest,
expResponseBody: `
{
"error": "Alertmanager configuration is too big, limit: 10 bytes",
"error": "Alertmanager configuration is too big, limit: 10 bytes (err-mimir-alertmanager-max-grafana-config-size). To adjust the related per-tenant limit, configure -alertmanager.max-grafana-config-size-bytes, or contact your service administrator.",
"status": "error"
}
`,
@@ -434,62 +434,100 @@ func TestMultitenantAlertmanager_SetUserGrafanaState(t *testing.T) {
storage := objstore.NewInMemBucket()
alertstore := bucketclient.NewBucketAlertStore(bucketclient.BucketAlertStoreConfig{}, storage, nil, log.NewNopLogger())

am := &MultitenantAlertmanager{
store: alertstore,
logger: test.NewTestingLogger(t),
cases := []struct {
name string
maxStateSize int
orgID string
body string
expStatusCode int
expResponseBody string
expStorageKey string
}{
{
name: "missing org id",
expStatusCode: http.StatusUnauthorized,
},
{
name: "state size > max size",
body: `
{
"state": "ChEKBW5mbG9nEghzb21lZGF0YQ=="
}
`,
orgID: "test_user",
maxStateSize: 10,
expStatusCode: http.StatusBadRequest,
expResponseBody: `
{
"error": "Alertmanager state is too big, limit: 10 bytes (err-mimir-alertmanager-max-grafana-state-size). To adjust the related per-tenant limit, configure -alertmanager.max-grafana-state-size-bytes, or contact your service administrator.",
"status": "error"
}
`,
},
{
name: "invalid config",
body: `{}`,
orgID: "test_user",
expStatusCode: http.StatusBadRequest,
expResponseBody: `
{
"error": "error marshalling JSON Grafana Alertmanager state: no state specified",
"status": "error"
}
`,
},
{
name: "with valid state",
body: `
{
"state": "ChEKBW5mbG9nEghzb21lZGF0YQ=="
}
`,
orgID: "test_user",
expStatusCode: http.StatusCreated,
expResponseBody: successJSON,
expStorageKey: "grafana_alertmanager/test_user/grafana_fullstate",
},
}

require.Len(t, storage.Objects(), 0)
req := httptest.NewRequest(http.MethodPost, "/api/v1/grafana/state", nil)
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
am := &MultitenantAlertmanager{
store: alertstore,
logger: test.NewTestingLogger(t),
limits: &mockAlertManagerLimits{
maxGrafanaStateSize: tc.maxStateSize,
},
}
rec := httptest.NewRecorder()
ctx := context.Background()
if tc.orgID != "" {
ctx = user.InjectOrgID(ctx, "test_user")
}

{
rec := httptest.NewRecorder()
am.SetUserGrafanaState(rec, req)
require.Equal(t, http.StatusUnauthorized, rec.Code)
require.Len(t, storage.Objects(), 0)
}
req := httptest.NewRequest(
http.MethodPost,
"/api/v1/grafana/state",
io.NopCloser(strings.NewReader(tc.body)),
).WithContext(ctx)

ctx := user.InjectOrgID(context.Background(), "test_user")
req = req.WithContext(ctx)
{
// First, try with invalid state payload.
rec := httptest.NewRecorder()
json := `
{
}
`
req.Body = io.NopCloser(strings.NewReader(json))
am.SetUserGrafanaState(rec, req)
am.SetUserGrafanaState(rec, req)
require.Equal(t, tc.expStatusCode, rec.Code)

require.Equal(t, http.StatusBadRequest, rec.Code)
body, err := io.ReadAll(rec.Body)
require.NoError(t, err)
failureJSON := `
{
"error": "error marshalling JSON Grafana Alertmanager state: no state specified",
"status": "error"
}
`
require.JSONEq(t, failureJSON, string(body))
require.Equal(t, "application/json", rec.Header().Get("Content-Type"))
// Now, with a valid one.
rec = httptest.NewRecorder()
json = `
{
"state": "ChEKBW5mbG9nEghzb21lZGF0YQ=="
}
`
req.Body = io.NopCloser(strings.NewReader(json))
am.SetUserGrafanaState(rec, req)
if tc.expResponseBody != "" {
body, err := io.ReadAll(rec.Body)
require.NoError(t, err)

require.Equal(t, http.StatusCreated, rec.Code)
body, err = io.ReadAll(rec.Body)
require.NoError(t, err)
require.JSONEq(t, successJSON, string(body))
require.Equal(t, "application/json", rec.Header().Get("Content-Type"))
require.JSONEq(t, tc.expResponseBody, string(body))
}

require.Len(t, storage.Objects(), 1)
_, ok := storage.Objects()["grafana_alertmanager/test_user/grafana_fullstate"]
require.True(t, ok)
if tc.expStorageKey == "" {
require.Len(t, storage.Objects(), 0)
} else {
require.Len(t, storage.Objects(), 1)
_, ok := storage.Objects()[tc.expStorageKey]
require.True(t, ok)
}
})
}
}
3 changes: 3 additions & 0 deletions pkg/alertmanager/multitenant.go
@@ -240,6 +240,9 @@ type Limits interface {
// AlertmanagerMaxConfigSize returns max size of configuration file that user is allowed to upload. If 0, there is no limit.
AlertmanagerMaxConfigSize(tenant string) int

// AlertmanagerMaxGrafanaStateSize returns the max size of the grafana state in bytes. If 0, there is no limit.
AlertmanagerMaxGrafanaStateSize(tenant string) int

// AlertmanagerMaxSilencesCount returns the max number of silences, including expired silences. If negative or 0, there is no limit.
AlertmanagerMaxSilencesCount(tenant string) int

5 changes: 5 additions & 0 deletions pkg/alertmanager/multitenant_test.go
@@ -3294,6 +3294,7 @@ type mockAlertManagerLimits struct {
emailNotificationBurst int
maxConfigSize int
maxGrafanaConfigSize int
maxGrafanaStateSize int
maxSilencesCount int
maxSilenceSizeBytes int
maxTemplatesCount int
@@ -3311,6 +3312,10 @@ func (m *mockAlertManagerLimits) AlertmanagerMaxGrafanaConfigSize(string) int {
return m.maxGrafanaConfigSize
}

func (m *mockAlertManagerLimits) AlertmanagerMaxGrafanaStateSize(string) int {
return m.maxGrafanaStateSize
}

func (m *mockAlertManagerLimits) AlertmanagerMaxSilencesCount(string) int { return m.maxSilencesCount }

func (m *mockAlertManagerLimits) AlertmanagerMaxSilenceSizeBytes(string) int {
4 changes: 4 additions & 0 deletions pkg/util/globalerror/user.go
@@ -85,6 +85,10 @@ const (
NativeHistogramNegativeBucketCount ID = "native-histogram-negative-bucket-count"
NativeHistogramSpanNegativeOffset ID = "native-histogram-span-negative-offset"
NativeHistogramSpansBucketsMismatch ID = "native-histogram-spans-buckets-mismatch"

// Alertmanager errors
AlertmanagerMaxGrafanaConfigSize ID = "alertmanager-max-grafana-config-size"
AlertmanagerMaxGrafanaStateSize ID = "alertmanager-max-grafana-state-size"
)

// Message returns the provided msg, appending the error id.