
v0.0.71: unexpected error: runtime error: invalid memory address or nil pointer dereference #451

Closed
sagor999 opened this issue Jul 23, 2021 · 11 comments

@sagor999

Hello @jacksontj
Recently upgraded from 0.0.70 to 0.0.71 and noticed this error returned by promxy about once or twice every 24 hours:

 {\"status\":\"error\",\"errorType\":\"execution\",\"error\":\"unexpected error: runtime error: invalid memory address or nil pointer dereference\"}"

This is coming from vmalert when it connects to promxy to run alert queries. Simply re-running the query doesn't reproduce the error.
Here is a full output:

{"ts":"2021-07-23T09:02:18.789Z","level":"error","caller":"VictoriaMetrics/app/vmalert/group.go:256","msg":"group \"/rules/servers.rules\": rule \"UbuntuServerDown\": failed to execute: failed to execute query \"up{osfamily=\\\"debian\\\"} == 0 unless ((up{osfamily=\\\"debian\\\", cluster=\\\"ceph\\\"} == 0) or (up{osfamily=\\\"debian\\\", instance=~\\\"^btea.*\\\"} == 0) or (up{osfamily=\\\"debian\\\", snmp_module=\\\"idrac\\\"} == 0) or (up{osfamily=\\\"debian\\\", snmp_module=\\\"ibm-imm\\\"} == 0))\": unexpected response code 422 for http://promxy-ingress-service:8082/api/v1/query?query=up%7Bosfamily%3D%22debian%22%7D+%3D%3D+0+unless+%28%28up%7Bosfamily%3D%22debian%22%2C+cluster%3D%22ceph%22%7D+%3D%3D+0%29+or+%28up%7Bosfamily%3D%22debian%22%2C+instance%3D~%22%5Ebtea.%2A%22%7D+%3D%3D+0%29+or+%28up%7Bosfamily%3D%22debian%22%2C+snmp_module%3D%22idrac%22%7D+%3D%3D+0%29+or+%28up%7Bosfamily%3D%22debian%22%2C+snmp_module%3D%22ibm-imm%22%7D+%3D%3D+0%29%29&step=1m0s&time=1627030920. Response body {\"status\":\"error\",\"errorType\":\"execution\",\"error\":\"unexpected error: runtime error: invalid memory address or nil pointer dereference\"}"}                                

I looked at logs in promxy, but there were no errors reported by promxy itself.

Could it be due to the parallel execution that was added in 0.0.71?

And thank you for maintaining and continuing to update promxy!!

@jacksontj
Owner

Do you happen to have a repro case (e.g. query and config)? A panic like this is definitely unexpected and is a bug.

jacksontj added the bug label Jul 26, 2021
@sagor999
Author

It seems to happen on these two rules:

up{osfamily="debian"} == 0 unless ((up{osfamily="debian", cluster="ceph"} == 0) or (up{osfamily="debian", instance=~"^btea.*"} == 0) or (up{osfamily="debian", snmp_module="idrac"} == 0) or (up{osfamily="debian", snmp_module="ibm-imm"} == 0))
p4_pull_master_journal_sequence{server_services="edge-server"} - p4_pull_replica_journal_sequence_applied{server_services="edge-server"} > 2.5e+10
  or p4_pull_master_journal_number{server_services="edge-server"} - p4_pull_replica_journal_number_applied{server_services="edge-server"} > 0
  or rate(p4_pull_replica_time{server_services="edge-server"}[15m]) == 0

So these are queries with lots of "or" in them, which means they hit that new parallel execution path.

Though running that query 1,000 times would maybe get you one repro. :(
For example, it happened only 6 times since I reported this issue. (and we have a lot of alert rules that constantly bombard promxy).
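
For context, here is a minimal sketch of what such a parallel execution path generally looks like (my own illustration with made-up names, not promxy's actual code): each independent operand is evaluated in its own goroutine and the results are gathered at the end.

// Illustration only: names and structure are assumptions, not the real promxy code.
// Each "or" operand is evaluated concurrently; results go to distinct slice indexes.
package main

import (
    "context"
    "fmt"

    "golang.org/x/sync/errgroup"
)

// evalBranch stands in for evaluating a single operand of the expression.
func evalBranch(ctx context.Context, expr string) (string, error) {
    return "result of " + expr, nil
}

func evalParallel(ctx context.Context, branches []string) ([]string, error) {
    g, ctx := errgroup.WithContext(ctx)
    results := make([]string, len(branches))
    for i, b := range branches {
        i, b := i, b // capture loop variables (needed before Go 1.22)
        g.Go(func() error {
            r, err := evalBranch(ctx, b)
            if err != nil {
                return err
            }
            results[i] = r // one index per goroutine, so no data race here
            return nil
        })
    }
    if err := g.Wait(); err != nil {
        return nil, err
    }
    return results, nil
}

func main() {
    out, _ := evalParallel(context.Background(), []string{
        `up{osfamily="debian", cluster="ceph"} == 0`,
        `up{osfamily="debian", snmp_module="idrac"} == 0`,
    })
    fmt.Println(out)
}

The point of the sketch is where the risk sits: if a child ever returns without filling in its result, or if the collection step touches shared state that isn't index-separated like this, the downstream evaluation can end up holding a nil value.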

As for config, here is what we use for each server in server_groups:

    - static_configs:
        - targets:
            - victoria-metrics.example.com
      scheme: https
      labels:
        promxy: prometheus.example.com
      anti_affinity: 10s
      remote_read: false
      remote_read_path: api/v1/read
      path_prefix: /select/0/prometheus
      query_params:
        nocache: 1
      http_client:
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
      ignore_error: true

And we have 7 servers like this that promxy checks.
Besides that, here is the global portion:

global:
  evaluation_interval: 5s
  external_labels:
    source: promxy

Another thing: promxy is running as 3 pods, and traffic is load balanced between them. That shouldn't matter, I think, since the whole query is executed on a single pod.

@jacksontj
Owner

I have spent a bit of time attempting to reproduce this and have been unsuccessful so far. I looked and don't see an obvious nil dereference in the parallel child execution, but again I can't repro the issue.

Maybe you can reproduce it with some version against the following config (since it's easy to replicate):

promxy:
  server_groups:
    - static_configs:
      - targets:
        - demo.robustperception.io:9090
        - demo.robustperception.io:9090
      http_client:
        dial_timeout: 1s

@sagor999
Author

sagor999 commented Aug 2, 2021

I tried turning on debug log output, but that did not produce any valuable information.
Is it possible to enable some sort of verbose logging to get a call stack of where the exception happens? I can't reproduce it easily either, but I do see it quite a bit under production load.
I see a 503 error returned by promxy:

GET /api/v1/query_range HTTP/1.1 503 107" 0.040726 query=p4_pull_replica_journal_sequence_persisted%7Binstance%3D~%22192%5C%5C.168%5C%5C.46%5C%5C.150%3A9666%22%2C+server_id%3D~%22x-p4-edge01%22%2C+server_services%3D%22edge-server%22%7D+-+p4_pull_replica_journal_sequence_applied&start=1627931490&end=162793

But I cannot find anything else in the log. :(

@jacksontj
Owner

The issue here is that the promql engine is recovering from the panic to return an error, but it is not printing out the whole error message:

https://github.com/jacksontj/prometheus/blob/master/promql/engine.go#L849

If you are up for a custom build, you can run with this commit (https://github.com/jacksontj/promxy/compare/no_recover?expand=1), which will allow it to panic (so we can capture that backtrace).
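
For anyone following along, a minimal sketch of why the backtrace gets lost in the normal build (my own illustration of the recover pattern, not the engine's actual code): recover() only yields the panic value, so unless runtime/debug.Stack() is captured inside the deferred handler, the location of the nil dereference is gone by the time the error string reaches the API response.

package main

import (
    "fmt"
    "runtime/debug"
)

func runQuery() (err error) {
    defer func() {
        if r := recover(); r != nil {
            // Capturing the stack here is the only way to keep the panic location;
            // returning just "unexpected error: %v" drops it, which is what the API response shows.
            fmt.Printf("panic: %v\n%s", r, debug.Stack())
            err = fmt.Errorf("unexpected error: %v", r)
        }
    }()
    var p *int
    _ = *p // simulated nil pointer dereference
    return nil
}

func main() {
    fmt.Println(runQuery())
}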

@sagor999
Author

sagor999 commented Aug 5, 2021

Ran a custom build and finally caught it:

Response body Error running handler: goroutine 4104383 [running]:
runtime/debug.Stack(0xc0026f1000, 0x24bbd40, 0x4098710)
  /usr/local/go/src/runtime/debug/stack.go:24 +0x9f
github.com/jacksontj/promxy/pkg/logging.(*ApacheLoggingHandler).runHandler.func1(0xc0026f3ad0)
  /go/src/github.com/jacksontj/promxy/pkg/logging/logging.go:139 +0x46
panic(0x24bbd40, 0x4098710)
  /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/prometheus/prometheus/promql.expandSeriesSet(0x304cf98, 0xc000eea510, 0x0, 0x0, 0xc000eea510, 0x0, 0x0, 0x40b619, 0x18, 0x23e56c0, ...)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:737 +0x57
github.com/prometheus/prometheus/promql.checkAndExpandSeriesSet(0x304cf98, 0xc000eea510, 0x304d778, 0xc0005fce70, 0x0, 0x0, 0x1, 0x1, 0x0)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:729 +0xa6
github.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0026f2a30, 0x304d778, 0xc0005fce70, 0xc000eea5a0, 0x0, 0x0, 0x1, 0x0)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:1291 +0x4fa
github.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0xc0026f2a30, 0xc0026f2400, 0xc0026f24c8, 0x2, 0x2, 0x0, 0xc001a123e0, 0xc000a986a8, 0xc000a98708, 0xc000a98690, ...)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:887 +0x1518
github.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0026f2a30, 0x304d5b8, 0xc001220380, 0x304cf98, 0xc000eea510, 0xf, 0x4, 0x290634c)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:1275 +0x3d65
github.com/prometheus/prometheus/promql.(*evaluator).Eval(0xc0026f2a30, 0x304d5b8, 0xc001220380, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:809 +0xb2
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc00040dc70, 0x304cf98, 0xc002174a50, 0xc0007ae960, 0xc000e8e2d0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:510 +0x4ed
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc00040dc70, 0x304cf98, 0xc002174a50, 0xc0007ae960, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:462 +0x4fe
github.com/prometheus/prometheus/promql.(*query).Exec(0xc0007ae960, 0x304cf98, 0xc002174930, 0x304cf98)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:173 +0x9d
github.com/prometheus/prometheus/web/api/v1.(*API).query(0xc000290c00, 0xc000291100, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/web/api/v1/api.go:376 +0x547
github.com/prometheus/prometheus/web/api/v1.(*API).Register.func1.1(0x303c070, 0xc001a12220, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/web/api/v1/api.go:265 +0xa5
net/http.HandlerFunc.ServeHTTP(0xc000303248, 0x303c070, 0xc001a12220, 0xc000291100)
  /usr/local/go/src/net/http/server.go:2049 +0x44
github.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP(0x300ad60, 0xc000303248, 0x7f6f675c61a0, 0xc002174750, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/util/httputil/compression.go:90 +0x7e
github.com/prometheus/prometheus/web.(*Handler).testReady.func1(0x7f6f675c61a0, 0xc002174750, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/web/web.go:518 +0x55
net/http.HandlerFunc.ServeHTTP(0xc000303278, 0x7f6f675c61a0, 0xc002174750, 0xc000291100)
  /usr/local/go/src/net/http/server.go:2049 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1(0x7f6f675c61a0, 0xc002174720, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:198 +0xee
net/http.HandlerFunc.ServeHTTP(0xc000915bf0, 0x7f6f675c61a0, 0xc002174720, 0xc000291100)
  /usr/local/go/src/net/http/server.go:2049 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2(0x7f6f675c61a0, 0xc002174720, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:76 +0xb2
net/http.HandlerFunc.ServeHTTP(0xc000915d70, 0x7f6f675c61a0, 0xc002174720, 0xc000291100)
  /usr/local/go/src/net/http/server.go:2049 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1(0x303b980, 0xc0013e4000, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:101 +0xdf
github.com/prometheus/prometheus/web.setPathWithPrefix.func1.1(0x303b980, 0xc0013e4000, 0xc000291000)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/web/web.go:1140 +0x1dc
github.com/prometheus/common/route.(*Router).handle.func1(0x303b980, 0xc0013e4000, 0xc000290f00, 0x0, 0x0, 0x0)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/common/route/route.go:83 +0x27f
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0005ad740, 0x303b980, 0xc0013e4000, 0xc000290f00)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/julienschmidt/httprouter/router.go:387 +0xc7e
github.com/prometheus/common/route.(*Router).ServeHTTP(...)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/common/route/route.go:121
main.main.func5(0x303b980, 0xc0013e4000, 0xc000290f00)
  /go/src/github.com/jacksontj/promxy/cmd/promxy/main.go:411 +0x1ce
net/http.HandlerFunc.ServeHTTP(0xc0006122a0, 0x303b980, 0xc0013e4000, 0xc000290f00)
  /usr/local/go/src/net/http/server.go:2049 +0x44
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0005539e0, 0x303b980, 0xc0013e4000, 0xc000290f00)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/julienschmidt/httprouter/router.go:448 +0x1d2
github.com/jacksontj/promxy/pkg/logging.(*ApacheLoggingHandler).runHandler(0xc000568180, 0x303b980, 0xc0013e4000, 0xc000290f00, 0x0, 0x0)
  /go/src/github.com/jacksontj/promxy/pkg/logging/logging.go:142 +0x85
github.com/jacksontj/promxy/pkg/logging.(*ApacheLoggingHandler).ServeHTTP(0xc000568180, 0x3045a30, 0xc000e9a1c0, 0xc000290f00)
  /go/src/github.com/jacksontj/promxy/pkg/logging/logging.go:165 +0x225
net/http.serverHandler.ServeHTTP(0xc00094c000, 0x3045a30, 0xc000e9a1c0, 0xc000290f00)
  /usr/local/go/src/net/http/server.go:2867 +0xa3
net/http.(*conn).serve(0xc0019f4140, 0x304cf98, 0xc0026aa1c0)
  /usr/local/go/src/net/http/server.go:1932 +0x8cd
created by net/http.(*Server).Serve
  /usr/local/go/src/net/http/server.go:2993 +0x39b"}

Seems like it is coming from somewhere deep inside the Prometheus engine though?

@jacksontj
Owner

That it is; it looks like something didn't get populated with series, which is odd. From your report this is new with the latest promxy, which makes it even more confusing since how we expand series hasn't changed at all. So just to confirm: is this new only on this release? If it is, maybe I can add some more specific debug messages to see what in the world is going on :/
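
To make that failure mode concrete, here is a hedged illustration (not the Prometheus source; the types are simplified stand-ins) of the pattern at engine.go:737 in the backtrace: expanding a series set into a slice panics on the very first Next() call if the set was never populated.

package main

import "fmt"

// seriesSet mirrors the shape of Prometheus' storage.SeriesSet for this sketch.
type seriesSet interface {
    Next() bool
    At() string // stands in for storage.Series
    Err() error
}

// expandSeries drains the iterator into a slice, the way the engine expands
// a lazily-populated series set before evaluating the expression.
func expandSeries(it seriesSet) ([]string, error) {
    var out []string
    for it.Next() { // it == nil => "invalid memory address or nil pointer dereference"
        out = append(out, it.At())
    }
    return out, it.Err()
}

func main() {
    defer func() {
        if r := recover(); r != nil {
            fmt.Println("recovered:", r) // same runtime error text seen in the API response
        }
    }()
    var it seriesSet // nil: simulates a child evaluation that never filled in its series
    expandSeries(it)
}

Whether the nil is the set itself or something inside it, the symptom would match: a child of the parallel walk ends up with an unpopulated series set.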

@sagor999
Author

Yes, this is new with 0.0.71. It was not happening with 0.0.70 and started happening immediately after upgrading promxy to 0.0.71.

@jacksontj
Owner

I did some more digging and found a couple of race conditions (pushed tmp fixes to that no_recover branch), but I don't think those are the cause of this panic. I'm still looking for repro cases but so far no success. Maybe give those a spin to see if they somehow help; otherwise we're still looking for some repro mechanism :/
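
As a side note on the race-condition angle (purely illustrative, not the actual fixes on that branch): the classic race in a parallel tree walk is several goroutines writing to shared state. Builds made with the -race flag (e.g. go run -race .) report such races at runtime, and the usual interim fix is to serialize the write.

package main

import (
    "fmt"
    "sync"
)

// collect runs n workers that all append to the same slice. Without the mutex,
// the race detector flags a data race on out; with it, the writes are serialized.
func collect(n int) []int {
    var (
        mu  sync.Mutex
        out []int
        wg  sync.WaitGroup
    )
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func(v int) {
            defer wg.Done()
            mu.Lock()
            out = append(out, v)
            mu.Unlock()
        }(i)
    }
    wg.Wait()
    return out
}

func main() {
    fmt.Println(len(collect(8))) // always 8 once the writes are guarded
}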

@sagor999
Author

I tried the new changes, but unfortunately got the same panic. :(

@jacksontj
Owner

I was able to reproduce another failure in the Walk (presumably the same issue) after the prom dep upgrade (#460) -- so I have reverted the parallel tree walking and will work to re-enable it in #461. Because of that, I'll close this one out, as this panic no longer exists on master as of now.
