422 Unprocessable Entity on random queries #687

Open
paulojmdias opened this issue Sep 25, 2024 · 4 comments

paulojmdias commented Sep 25, 2024

Hi,

I'm seeing intermittent HTTP 422 responses on random queries coming from Grafana. I ran 20+ simultaneous curl requests, and some of them randomly return HTTP 422. Even with debug logs enabled, I don't see anything wrong.

What can cause this? Is there anything that could be improved in the configuration?

[screenshot: HTTP 422 responses]

promxy:
  server_groups:
  - http_client:
      dial_timeout: 1s
      tls_config:
        insecure_skip_verify: true
    http_headers:
      X-Scope-OrgID: org1|org2
    path_prefix: /prometheus
    remote_read: false
    scheme: https
    static_configs:
    - targets:
      - dns.domain.com:8080

On the Grafana side I see HTTP 500, which is weird since I don't see it in the Promxy logs.

[screenshot: Grafana HTTP 500 error, "error in servergroup ord=0"]

@jacksontj (Owner)

Hmm, that is curious -- haven't seen that error before. Would it be possible to get debug or trace logs? Alternatively a tcpdump would be great (but a bit hard to anonymize).

The Grafana-side error says "error in servergroup ord=0", which really sounds like promxy returned an error (422 or 500 -- unclear, since the two disagree). If we could get some logs or tcpdumps from promxy we could get a better idea of what happened there.

A 422 code means the request is bad -- so maybe there is some encoding or escaping issue with the subqueries being sent to the downstreams (maybe promxy is getting 422s from downstream and returning a 500)?
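
To help narrow it down, a quick way to reproduce outside Grafana is to fire a batch of identical queries concurrently and tally the status codes. A minimal sketch in Go (the promxy address and query below are placeholders -- substitute your own):

package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"sync"
)

func main() {
	const n = 25
	endpoint := "http://localhost:8082/api/v1/query" // hypothetical promxy address
	form := url.Values{"query": {"up"}}.Encode()     // hypothetical query

	var (
		mu     sync.Mutex
		counts = map[int]int{}
		wg     sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Post(endpoint, "application/x-www-form-urlencoded",
				strings.NewReader(form))
			if err != nil {
				fmt.Println("request error:", err)
				return
			}
			resp.Body.Close()
			mu.Lock()
			counts[resp.StatusCode]++ // e.g. 200, 422, 500
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println("status code counts:", counts)
}

If the 422s show up here too, a tcpdump of just this traffic would be much easier to anonymize than the full Grafana traffic.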

paulojmdias (Author) commented Sep 27, 2024

@jacksontj I will enable trace logs next week and share them here.

@paulojmdias (Author)

@jacksontj here is a small portion of the logs from Promxy and also from Mimir, captured for a query that returned an error.

We are using the following versions:

  • Grafana: 11.2.0
  • Promxy: 0.0.90
  • Mimir: 2.13.0

I'm also trying to debug the issue, but if you need anything else that could help, please let me know.

Promxy logs:

{"level":"debug","msg":"BinaryExpr: count by(provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) \u003e 2","time":"2024-10-04T10:36:26Z"}
{"level":"debug","msg":"BinaryExpr: min by(provider, region, role, instance) (cortex_ruler_managers_total{role=\"rulers\",stack=~\"a|b\"}) == 0","time":"2024-10-04T10:36:26Z"}
{"level":"debug","msg":"BinaryExpr (AggregateExpr + Literal): min by(provider, region, role, instance) (cortex_ruler_managers_total{role=\"rulers\",stack=~\"a|b\"}) == 0","time":"2024-10-04T10:36:26Z"}
{"level":"debug","msg":"AggregateExpr count by(provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) count","time":"2024-10-04T10:36:26Z"}
{"api":"Query","level":"debug","msg":"https://mimir.dns.com:8080/prometheus","query":"count by(provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"})","time":"2024-10-04T10:36:26Z","ts":"2024-10-04T10:36:20Z"}
{"level":"trace","msg":"Set ServerGroup custom header X-Scope-OrgID: tenant_id","time":"2024-10-04T10:36:26Z"}
{"api":"Query","level":"debug","msg":"https://mimir.dns.com:8080/prometheus","query":"min by(provider, region, role, instance) (cortex_ruler_managers_total{role=\"rulers\",stack=~\"a|b\"}) == 0","time":"2024-10-04T10:36:26Z","ts":"2024-10-04T10:36:20Z"}
{"level":"trace","msg":"Set ServerGroup custom header X-Scope-OrgID: tenant_id","time":"2024-10-04T10:36:26Z"}
{"level":"debug","msg":"BinaryExpr: max by(provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) \u003e 0","time":"2024-10-04T10:36:26Z"}
{"level":"debug","msg":"BinaryExpr (AggregateExpr + Literal): max by(provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) \u003e 0","time":"2024-10-04T10:36:26Z"}
{"api":"Query","level":"debug","msg":"https://mimir.dns.com:8080/prometheus","query":"max by(provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) \u003e 0","time":"2024-10-04T10:36:26Z","ts":"2024-10-04T10:36:20Z"}
{"level":"trace","msg":"Set ServerGroup custom header X-Scope-OrgID: tenant_id","time":"2024-10-04T10:36:26Z"}
{"api":"Query","error":"server_error: server error: 500","level":"trace","msg":"https://mimir.dns.com:8080/prometheus","query":"min by(provider, region, role, instance) (cortex_ruler_managers_total{role=\"rulers\",stack=~\"a|b\"}) == 0","time":"2024-10-04T10:36:26Z","took":1002389,"ts":"2024-10-04T10:36:20Z","value":null,"warnings":null}
{"api":"Query","error":"server_error: server error: 500","level":"trace","msg":"https://mimir.dns.com:8080/prometheus","query":"max by(provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) \u003e 0","time":"2024-10-04T10:36:26Z","took":1043217,"ts":"2024-10-04T10:36:20Z","value":null,"warnings":null}
{"api":"Query","error":null,"level":"trace","msg":"https://mimir.dns.com:8080/prometheus","query":"count by(provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"})","time":"2024-10-04T10:36:26Z","took":2478969,"ts":"2024-10-04T10:36:20Z","value":[{"metric":{"provider":"openstack","region":"scl","role":"rulers"},"value":[1728038180,"2"]}],"warnings":null}
{"remoteAddr":"10.26.50.231","time":"2024-10-04T10:36:26.284953771Z","method":"POST","path":"/api/v1/query","protocol":"HTTP/1.1","status":422,"responseBytes":191,"duration":0.004576245,"query":"query=%23+Alert+on+ruler+instances+in+microservices+mode+that+have+no+rule+groups+assigned%2C%0Amin+by%28provider%2C+region%2C+role%2C+instance%29+%28cortex_ruler_managers_total%7Bstack%3D~%22a%7Cb%22%2Crole%3D%22rulers%22%7D%29+%3D%3D+0%0A"}

Mimir logs:

{"caller":"handler.go:382","component":"query-frontend","err":"rpc error: code = Unavailable desc = transport is closing","estimated_series_count":0,"fetched_chunk_bytes":0,"fetched_chunks_count":0,"fetched_index_bytes":0,"fetched_series_count":0,"file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-002","jurisdiction":"us","length":"1h5m0s","level":"info","message":"query stats","message_key":null,"method":"POST","offset":40475661,"param_end":"2024-10-04T10:41:00Z","param_query":"min by(provider, region, role, instance) (cortex_ruler_managers_total{role=\"rulers\",stack=~\"a|b\"}) == 0","param_start":"2024-10-04T09:41:00Z","param_step":"30000","partition":0,"path":"/prometheus/api/v1/query_range","provider":"cloud","query_wall_time_seconds":0,"queue_time_seconds":0,"region":"region","response_size_bytes":0,"response_time":"909.115µs","results_cache_hit_bytes":1070,"results_cache_miss_bytes":0,"role":"queriers","route_name":"prometheus_api_v1_query_range","service":"mimir","sharded_queries":0,"source_type":"kafka","split_queries":1,"stack":"stack","status":"failed","status_code":0,"time_since_max_time":"11.06988766s","time_since_min_time":"1h5m11.06988766s","topic":"logs--generic","user":"tenantd_id","user_agent":"Go-http-client/1.1"}
{"caller":"handler.go:382","component":"query-frontend","err":"rpc error: code = Unavailable desc = transport is closing","estimated_series_count":0,"fetched_chunk_bytes":0,"fetched_chunks_count":0,"fetched_index_bytes":0,"fetched_series_count":0,"file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","length":"5m0s","level":"info","message":"query stats","message_key":null,"method":"POST","offset":40467566,"param_query":"max by(provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","param_time":"2024-10-04T10:36:20Z","partition":1,"path":"/prometheus/api/v1/query","provider":"cloud","query_wall_time_seconds":0,"queue_time_seconds":0,"region":"region","response_size_bytes":0,"response_time":"408.708µs","results_cache_hit_bytes":0,"results_cache_miss_bytes":0,"role":"queriers","route_name":"prometheus_api_v1_query","service":"mimir","sharded_queries":0,"source_type":"kafka","split_queries":0,"stack":"stack","status":"failed","status_code":0,"time_since_max_time":"6.282686245s","time_since_min_time":"5m6.282686245s","topic":"logs--generic","user":"tenantd_id","user_agent":"Go-http-client/1.1"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467549,"partition":1,"provider":"cloud","query":"min by (provider, region, role, instance) (cortex_ruler_managers_total{role=\"rulers\",stack=~\"a|b\"}) == 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":2,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467565,"partition":1,"provider":"cloud","query":"max by (provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":4,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467547,"partition":1,"provider":"cloud","query":"min by (provider, region, role, instance) (cortex_ruler_managers_total{role=\"rulers\",stack=~\"a|b\"}) == 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":1,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467553,"partition":1,"provider":"cloud","query":"min by (provider, region, role, instance) (cortex_ruler_managers_total{role=\"rulers\",stack=~\"a|b\"}) == 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":4,"user":"tenantd_id"}
{"caller":"handler.go:382","component":"query-frontend","err":"rpc error: code = Unavailable desc = transport is closing","estimated_series_count":0,"fetched_chunk_bytes":0,"fetched_chunks_count":0,"fetched_index_bytes":0,"fetched_series_count":0,"file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","length":"5m0s","level":"info","message":"query stats","message_key":null,"method":"POST","offset":40467554,"param_query":"min by(provider, region, role, instance) (cortex_ruler_managers_total{role=\"rulers\",stack=~\"a|b\"}) == 0","param_time":"2024-10-04T10:36:20Z","partition":1,"path":"/prometheus/api/v1/query","provider":"cloud","query_wall_time_seconds":0,"queue_time_seconds":0,"region":"region","response_size_bytes":0,"response_time":"484.378µs","results_cache_hit_bytes":0,"results_cache_miss_bytes":0,"role":"queriers","route_name":"prometheus_api_v1_query","service":"mimir","sharded_queries":0,"source_type":"kafka","split_queries":0,"stack":"stack","status":"failed","status_code":0,"time_since_max_time":"6.28200176s","time_since_min_time":"5m6.28200176s","topic":"logs--generic","user":"tenantd_id","user_agent":"Go-http-client/1.1"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467563,"partition":1,"provider":"cloud","query":"max by (provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":3,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467559,"partition":1,"provider":"cloud","query":"max by (provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":1,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467545,"partition":1,"provider":"cloud","query":"min by (provider, region, role, instance) (cortex_ruler_managers_total{role=\"rulers\",stack=~\"a|b\"}) == 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":0,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467561,"partition":1,"provider":"cloud","query":"max by (provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":2,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467557,"partition":1,"provider":"cloud","query":"max by (provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":0,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467551,"partition":1,"provider":"cloud","query":"min by (provider, region, role, instance) (cortex_ruler_managers_total{role=\"rulers\",stack=~\"a|b\"}) == 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":3,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467539,"partition":1,"provider":"cloud","query":"max by (provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":3,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467533,"partition":1,"provider":"cloud","query":"max by (provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":0,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467537,"partition":1,"provider":"cloud","query":"max by (provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":2,"user":"tenantd_id"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467541,"partition":1,"provider":"cloud","query":"max by (provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":4,"user":"tenantd_id"}
{"caller":"handler.go:382","component":"query-frontend","err":"rpc error: code = Unavailable desc = transport is closing","estimated_series_count":0,"fetched_chunk_bytes":0,"fetched_chunks_count":0,"fetched_index_bytes":0,"fetched_series_count":0,"file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","length":"5m0s","level":"info","message":"query stats","message_key":null,"method":"POST","offset":40467542,"param_query":"max by(provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","param_time":"2024-10-04T10:36:20Z","partition":1,"path":"/prometheus/api/v1/query","provider":"cloud","query_wall_time_seconds":0,"queue_time_seconds":0,"region":"region","response_size_bytes":0,"response_time":"534.363µs","results_cache_hit_bytes":0,"results_cache_miss_bytes":0,"role":"queriers","route_name":"prometheus_api_v1_query","service":"mimir","sharded_queries":0,"source_type":"kafka","split_queries":0,"stack":"stack","status":"failed","status_code":0,"time_since_max_time":"6.277494383s","time_since_min_time":"5m6.277494383s","topic":"logs--generic","user":"tenantd_id","user_agent":"Go-http-client/1.1"}
{"caller":"spanlogger.go:109","err":"rpc error: code = Unavailable desc = transport is closing","file":"/var/log/mimir/mimir.log","headers":{},"instance":"mimir-queriers-001","jurisdiction":"us","level":"error","message":"error processing request","message_key":null,"offset":40467535,"partition":1,"provider":"cloud","query":"max by (provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) > 0","query_timestamp":1728038180000,"region":"region","role":"queriers","service":"mimir","source_type":"kafka","stack":"stack","topic":"logs--generic","try":1,"user":"tenantd_id"}

jacksontj (Owner) commented Oct 4, 2024

Thanks for providing the logs! That actually helps a lot.

There are a few things that look off here:

  1. promxy returning 422

From the logs we can see that promxy got a 500 from the downstream Mimir for a query:

{"api":"Query","error":"server_error: server error: 500","level":"trace","msg":"https://mimir.dns.com:8080/prometheus","query":"max by(provider, region, role) (cortex_ruler_managers_total{stack=~\"a|b\"}) \u003e 0","time":"2024-10-04T10:36:26Z","took":1043217,"ts":"2024-10-04T10:36:20Z","value":null,"warnings":null}

Which is at first confusing, since we got a 422 -- not a 500. But after looking into the Prometheus API handler code, we can see that the handler marks any execution failure as Unprocessable Entity (422) -- which is annoying, as it causes exactly this confusion.

So for #1 -- this is a design choice in Prometheus, where any processing error becomes a 422 (even though it could be transient... so IMO it's not correct -- but that's what it is doing). A sketch of the mapping is below.
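
The relevant mapping in the Prometheus v1 API handler looks roughly like this (a paraphrased, runnable sketch of the switch in web/api/v1/api.go, not the verbatim source):

package main

import (
	"fmt"
	"net/http"
)

// errType mirrors the error categories used by Prometheus's v1 API handler
// (paraphrased; names approximate the upstream ones).
type errType string

const (
	errorBadData  errType = "bad_data"  // unparsable/invalid request
	errorExec     errType = "execution" // query execution failed
	errorTimeout  errType = "timeout"   // query timed out
	errorInternal errType = "internal"
)

// statusFor shows the status-code choice: note that ANY execution failure,
// even a transient downstream 500, is reported as 422.
func statusFor(typ errType) int {
	switch typ {
	case errorBadData:
		return http.StatusBadRequest // 400
	case errorExec:
		return http.StatusUnprocessableEntity // 422
	case errorTimeout:
		return http.StatusServiceUnavailable // 503
	default:
		return http.StatusInternalServerError // 500
	}
}

func main() {
	fmt.Println(statusFor(errorExec)) // prints 422
}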

  2. a query in the logs appears to be invalid?

The last line in the promxy log is:

{"remoteAddr":"10.26.50.231","time":"2024-10-04T10:36:26.284953771Z","method":"POST","path":"/api/v1/query","protocol":"HTTP/1.1","status":422,"responseBytes":191,"duration":0.004576245,"query":"query=%23+Alert+on+ruler+instances+in+microservices+mode+that+have+no+rule+groups+assigned%2C%0Amin+by%28provider%2C+region%2C+role%2C+instance%29+%28cortex_ruler_managers_total%7Bstack%3D~%22a%7Cb%22%2Crole%3D%22rulers%22%7D%29+%3D%3D+0%0A"}

Which contains what seems to be an attempted comment (a line prefixed with #) -- which is an invalid query in PromQL. This seems unrelated to the reported issue, but FYI as that is odd :D The decoded form is shown below.
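
For readability, URL-decoding the query parameter from that log line gives:

# Alert on ruler instances in microservices mode that have no rule groups assigned,
min by(provider, region, role, instance) (cortex_ruler_managers_total{stack=~"a|b",role="rulers"}) == 0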

  3. Mimir errors

I'm not familiar enough with Mimir to speculate on why those errors occur ("transport is closing" between the query-frontend and the queriers) -- but those 500s are why promxy is returning 422s, since they produce an error while promxy processes the query, which the API handler then reports as a 422.
