
v0.0.71: unexpected error: runtime error: invalid memory address or nil pointer dereference #451

Closed
sagor999 opened this issue Jul 23, 2021 · 11 comments

@sagor999

Hello @jacksontj
Recently upgraded from 0.0.70 to 0.0.71 and noticed this error returned by promxy about once or twice every 24 hours:

 {\"status\":\"error\",\"errorType\":\"execution\",\"error\":\"unexpected error: runtime error: invalid memory address or nil pointer dereference\"}"

This is coming from vmalert when it connects to promxy to run alert queries. Simply re-running the query doesn't reproduce the error.
Here is a full output:

{"ts":"2021-07-23T09:02:18.789Z","level":"error","caller":"VictoriaMetrics/app/vmalert/group.go:256","msg":"group \"/rules/servers.rules\": rule \"UbuntuServerDown\": failed to execute: failed to execute query \"up{osfamily=\\\"debian\\\"} == 0 unless ((up{osfamily=\\\"debian\\\", cluster=\\\"ceph\\\"} == 0) or (up{osfamily=\\\"debian\\\", instance=~\\\"^btea.*\\\"} == 0) or (up{osfamily=\\\"debian\\\", snmp_module=\\\"idrac\\\"} == 0) or (up{osfamily=\\\"debian\\\", snmp_module=\\\"ibm-imm\\\"} == 0))\": unexpected response code 422 for http://promxy-ingress-service:8082/api/v1/query?query=up%7Bosfamily%3D%22debian%22%7D+%3D%3D+0+unless+%28%28up%7Bosfamily%3D%22debian%22%2C+cluster%3D%22ceph%22%7D+%3D%3D+0%29+or+%28up%7Bosfamily%3D%22debian%22%2C+instance%3D~%22%5Ebtea.%2A%22%7D+%3D%3D+0%29+or+%28up%7Bosfamily%3D%22debian%22%2C+snmp_module%3D%22idrac%22%7D+%3D%3D+0%29+or+%28up%7Bosfamily%3D%22debian%22%2C+snmp_module%3D%22ibm-imm%22%7D+%3D%3D+0%29%29&step=1m0s&time=1627030920. Response body {\"status\":\"error\",\"errorType\":\"execution\",\"error\":\"unexpected error: runtime error: invalid memory address or nil pointer dereference\"}"}                                

I looked at logs in promxy, but there were no errors reported by promxy itself.

Could it be due to the parallel execution that was added in 0.0.71?

And thank you for maintaining and continuing to update promxy!!

@jacksontj
Owner

Do you happen to have a repro case (e.g. query and config)? A panic like this is definitely unexpected and is a bug.

jacksontj added the bug label Jul 26, 2021
@sagor999
Author

It seems to happen on these two rules:

up{osfamily="debian"} == 0 unless ((up{osfamily="debian", cluster="ceph"} == 0) or (up{osfamily="debian", instance=~"^btea.*"} == 0) or (up{osfamily="debian", snmp_module="idrac"} == 0) or (up{osfamily="debian", snmp_module="ibm-imm"} == 0))
p4_pull_master_journal_sequence{server_services="edge-server"} - p4_pull_replica_journal_sequence_applied{server_services="edge-server"} > 2.5e+10
  or p4_pull_master_journal_number{server_services="edge-server"} - p4_pull_replica_journal_number_applied{server_services="edge-server"} > 0
  or rate(p4_pull_replica_time{server_services="edge-server"}[15m]) == 0

So these are queries with lots of "or" in them, which means they hit that new parallel execution path.

Though running that query 1,000 times would maybe get you one repro. :(
For example, it happened only 6 times since I reported this issue. (and we have a lot of alert rules that constantly bombard promxy).
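
For context, here is a minimal sketch of what such a parallel execution path generally looks like (my own illustration with made-up names, not promxy's actual code): each independent operand is evaluated in its own goroutine and the results are gathered at the end.

// Illustration only: names and structure are assumptions, not the real promxy code.
// Each "or" operand is evaluated concurrently; results go to distinct slice indexes.
package main

import (
    "context"
    "fmt"

    "golang.org/x/sync/errgroup"
)

// evalBranch stands in for evaluating a single operand of the expression.
func evalBranch(ctx context.Context, expr string) (string, error) {
    return "result of " + expr, nil
}

func evalParallel(ctx context.Context, branches []string) ([]string, error) {
    g, ctx := errgroup.WithContext(ctx)
    results := make([]string, len(branches))
    for i, b := range branches {
        i, b := i, b // capture loop variables (needed before Go 1.22)
        g.Go(func() error {
            r, err := evalBranch(ctx, b)
            if err != nil {
                return err
            }
            results[i] = r // one index per goroutine, so no data race here
            return nil
        })
    }
    if err := g.Wait(); err != nil {
        return nil, err
    }
    return results, nil
}

func main() {
    out, _ := evalParallel(context.Background(), []string{
        `up{osfamily="debian", cluster="ceph"} == 0`,
        `up{osfamily="debian", snmp_module="idrac"} == 0`,
    })
    fmt.Println(out)
}

The point of the sketch is where the risk sits: if a child ever returns without filling in its result, or if the collection step touches shared state that isn't index-separated like this, the downstream evaluation can end up holding a nil value.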

As for config, here is what we use for each server in server_groups:

    - static_configs:
        - targets:
            - victoria-metrics.example.com
      scheme: https
      labels:
        promxy: prometheus.example.com
      anti_affinity: 10s
      remote_read: false
      remote_read_path: api/v1/read
      path_prefix: /select/0/prometheus
      query_params:
        nocache: 1
      http_client:
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
      ignore_error: true

And we have 7 servers like this that promxy checks.
Besides that, here is the global portion:

global:
  evaluation_interval: 5s
  external_labels:
    source: promxy

Another thing: promxy is running as 3 pods, and traffic is load balanced between them. That shouldn't matter, I think, since the whole query is executed on a single pod.

@jacksontj
Owner

I have spent a bit of time attempting to reproduce this and have been unsuccessful so far. I looked and don't see an obvious nil dereference in the parallel child execution, but again I can't repro the issue.

Maybe you can reproduce it with some version against the following config (since it's easy to replicate):

promxy:
  server_groups:
    - static_configs:
      - targets:
        - demo.robustperception.io:9090
        - demo.robustperception.io:9090
      http_client:
        dial_timeout: 1s

@sagor999
Author

sagor999 commented Aug 2, 2021

I tried turning on debug log output, but that did not produce any valuable information.
Is it possible to enable some sort of verbose logging to get a call stack of where the exception happens? I can't reproduce it easily either, but I do see it quite a bit under production load.
I see a 503 error returned by promxy:

GET /api/v1/query_range HTTP/1.1 503 107" 0.040726 query=p4_pull_replica_journal_sequence_persisted%7Binstance%3D~%22192%5C%5C.168%5C%5C.46%5C%5C.150%3A9666%22%2C+server_id%3D~%22x-p4-edge01%22%2C+server_services%3D%22edge-server%22%7D+-+p4_pull_replica_journal_sequence_applied&start=1627931490&end=162793

But I cannot find anything else in the log. :(

@jacksontj
Owner

The issue here is that the promql engine is recovering from the panic to return an error, but it is not printing out the whole error message:

https://github.com/jacksontj/prometheus/blob/master/promql/engine.go#L849

If you are up for a custom build, you can run with this commit (https://github.com/jacksontj/promxy/compare/no_recover?expand=1), which will allow it to panic (so we can capture that backtrace).
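
For anyone following along, a minimal sketch of why the backtrace gets lost in the normal build (my own illustration of the recover pattern, not the engine's actual code): recover() only yields the panic value, so unless runtime/debug.Stack() is captured inside the deferred handler, the location of the nil dereference is gone by the time the error string reaches the API response.

package main

import (
    "fmt"
    "runtime/debug"
)

func runQuery() (err error) {
    defer func() {
        if r := recover(); r != nil {
            // Capturing the stack here is the only way to keep the panic location;
            // returning just "unexpected error: %v" drops it, which is what the API response shows.
            fmt.Printf("panic: %v\n%s", r, debug.Stack())
            err = fmt.Errorf("unexpected error: %v", r)
        }
    }()
    var p *int
    _ = *p // simulated nil pointer dereference
    return nil
}

func main() {
    fmt.Println(runQuery())
}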

@sagor999
Author

sagor999 commented Aug 5, 2021

Ran a custom build and finally caught it:

Response body Error running handler: goroutine 4104383 [running]:
runtime/debug.Stack(0xc0026f1000, 0x24bbd40, 0x4098710)
  /usr/local/go/src/runtime/debug/stack.go:24 +0x9f
github.com/jacksontj/promxy/pkg/logging.(*ApacheLoggingHandler).runHandler.func1(0xc0026f3ad0)
  /go/src/github.com/jacksontj/promxy/pkg/logging/logging.go:139 +0x46
panic(0x24bbd40, 0x4098710)
  /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/prometheus/prometheus/promql.expandSeriesSet(0x304cf98, 0xc000eea510, 0x0, 0x0, 0xc000eea510, 0x0, 0x0, 0x40b619, 0x18, 0x23e56c0, ...)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:737 +0x57
github.com/prometheus/prometheus/promql.checkAndExpandSeriesSet(0x304cf98, 0xc000eea510, 0x304d778, 0xc0005fce70, 0x0, 0x0, 0x1, 0x1, 0x0)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:729 +0xa6
github.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0026f2a30, 0x304d778, 0xc0005fce70, 0xc000eea5a0, 0x0, 0x0, 0x1, 0x0)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:1291 +0x4fa
github.com/prometheus/prometheus/promql.(*evaluator).rangeEval(0xc0026f2a30, 0xc0026f2400, 0xc0026f24c8, 0x2, 0x2, 0x0, 0xc001a123e0, 0xc000a986a8, 0xc000a98708, 0xc000a98690, ...)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:887 +0x1518
github.com/prometheus/prometheus/promql.(*evaluator).eval(0xc0026f2a30, 0x304d5b8, 0xc001220380, 0x304cf98, 0xc000eea510, 0xf, 0x4, 0x290634c)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:1275 +0x3d65
github.com/prometheus/prometheus/promql.(*evaluator).Eval(0xc0026f2a30, 0x304d5b8, 0xc001220380, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:809 +0xb2
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc00040dc70, 0x304cf98, 0xc002174a50, 0xc0007ae960, 0xc000e8e2d0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:510 +0x4ed
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc00040dc70, 0x304cf98, 0xc002174a50, 0xc0007ae960, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:462 +0x4fe
github.com/prometheus/prometheus/promql.(*query).Exec(0xc0007ae960, 0x304cf98, 0xc002174930, 0x304cf98)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/engine.go:173 +0x9d
github.com/prometheus/prometheus/web/api/v1.(*API).query(0xc000290c00, 0xc000291100, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/web/api/v1/api.go:376 +0x547
github.com/prometheus/prometheus/web/api/v1.(*API).Register.func1.1(0x303c070, 0xc001a12220, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/web/api/v1/api.go:265 +0xa5
net/http.HandlerFunc.ServeHTTP(0xc000303248, 0x303c070, 0xc001a12220, 0xc000291100)
  /usr/local/go/src/net/http/server.go:2049 +0x44
github.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP(0x300ad60, 0xc000303248, 0x7f6f675c61a0, 0xc002174750, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/util/httputil/compression.go:90 +0x7e
github.com/prometheus/prometheus/web.(*Handler).testReady.func1(0x7f6f675c61a0, 0xc002174750, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/web/web.go:518 +0x55
net/http.HandlerFunc.ServeHTTP(0xc000303278, 0x7f6f675c61a0, 0xc002174750, 0xc000291100)
  /usr/local/go/src/net/http/server.go:2049 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1(0x7f6f675c61a0, 0xc002174720, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:198 +0xee
net/http.HandlerFunc.ServeHTTP(0xc000915bf0, 0x7f6f675c61a0, 0xc002174720, 0xc000291100)
  /usr/local/go/src/net/http/server.go:2049 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2(0x7f6f675c61a0, 0xc002174720, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:76 +0xb2
net/http.HandlerFunc.ServeHTTP(0xc000915d70, 0x7f6f675c61a0, 0xc002174720, 0xc000291100)
  /usr/local/go/src/net/http/server.go:2049 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1(0x303b980, 0xc0013e4000, 0xc000291100)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:101 +0xdf
github.com/prometheus/prometheus/web.setPathWithPrefix.func1.1(0x303b980, 0xc0013e4000, 0xc000291000)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/web/web.go:1140 +0x1dc
github.com/prometheus/common/route.(*Router).handle.func1(0x303b980, 0xc0013e4000, 0xc000290f00, 0x0, 0x0, 0x0)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/common/route/route.go:83 +0x27f
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0005ad740, 0x303b980, 0xc0013e4000, 0xc000290f00)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/julienschmidt/httprouter/router.go:387 +0xc7e
github.com/prometheus/common/route.(*Router).ServeHTTP(...)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/common/route/route.go:121
main.main.func5(0x303b980, 0xc0013e4000, 0xc000290f00)
  /go/src/github.com/jacksontj/promxy/cmd/promxy/main.go:411 +0x1ce
net/http.HandlerFunc.ServeHTTP(0xc0006122a0, 0x303b980, 0xc0013e4000, 0xc000290f00)
  /usr/local/go/src/net/http/server.go:2049 +0x44
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0005539e0, 0x303b980, 0xc0013e4000, 0xc000290f00)
  /go/src/github.com/jacksontj/promxy/vendor/github.com/julienschmidt/httprouter/router.go:448 +0x1d2
github.com/jacksontj/promxy/pkg/logging.(*ApacheLoggingHandler).runHandler(0xc000568180, 0x303b980, 0xc0013e4000, 0xc000290f00, 0x0, 0x0)
  /go/src/github.com/jacksontj/promxy/pkg/logging/logging.go:142 +0x85
github.com/jacksontj/promxy/pkg/logging.(*ApacheLoggingHandler).ServeHTTP(0xc000568180, 0x3045a30, 0xc000e9a1c0, 0xc000290f00)
  /go/src/github.com/jacksontj/promxy/pkg/logging/logging.go:165 +0x225
net/http.serverHandler.ServeHTTP(0xc00094c000, 0x3045a30, 0xc000e9a1c0, 0xc000290f00)
  /usr/local/go/src/net/http/server.go:2867 +0xa3
net/http.(*conn).serve(0xc0019f4140, 0x304cf98, 0xc0026aa1c0)
  /usr/local/go/src/net/http/server.go:1932 +0x8cd
created by net/http.(*Server).Serve
  /usr/local/go/src/net/http/server.go:2993 +0x39b"}

Seems like it is coming from somewhere deep inside the Prometheus engine though?

@jacksontj
Owner

That it is; it looks like something didn't get populated with series, which is odd. From your report this is new with the latest promxy, which makes it even more confusing since how we expand series hasn't changed at all. So just to confirm: is this new only on this release? If it is, maybe I can add some more specific debug messages to see what in the world is going on :/
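
To make that failure mode concrete, here is a hedged illustration (not the Prometheus source; the types are simplified stand-ins) of the pattern at engine.go:737 in the backtrace: expanding a series set into a slice panics on the very first Next() call if the set was never populated.

package main

import "fmt"

// seriesSet mirrors the shape of Prometheus' storage.SeriesSet for this sketch.
type seriesSet interface {
    Next() bool
    At() string // stands in for storage.Series
    Err() error
}

// expandSeries drains the iterator into a slice, the way the engine expands
// a lazily-populated series set before evaluating the expression.
func expandSeries(it seriesSet) ([]string, error) {
    var out []string
    for it.Next() { // it == nil => "invalid memory address or nil pointer dereference"
        out = append(out, it.At())
    }
    return out, it.Err()
}

func main() {
    defer func() {
        if r := recover(); r != nil {
            fmt.Println("recovered:", r) // same runtime error text seen in the API response
        }
    }()
    var it seriesSet // nil: simulates a child evaluation that never filled in its series
    expandSeries(it)
}

Whether the nil is the set itself or something inside it, the symptom would match: a child of the parallel walk ends up with an unpopulated series set.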

@sagor999
Author

Yes, this is new with 0.0.71. It was not happening with 0.0.70 and started happening immediately after upgrading promxy to 0.0.71.

@jacksontj
Owner

I did some more digging and found a couple of race conditions (pushed tmp fixes to that no_recover branch), but I don't think those are the cause of this panic. I'm still looking for repro cases but so far no success. Maybe give those a spin to see if they somehow help; otherwise we're still looking for some repro mechanism :/
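
As a side note on the race-condition angle (purely illustrative, not the actual fixes on that branch): the classic race in a parallel tree walk is several goroutines writing to shared state. Builds made with the -race flag (e.g. go run -race .) report such races at runtime, and the usual interim fix is to serialize the write.

package main

import (
    "fmt"
    "sync"
)

// collect runs n workers that all append to the same slice. Without the mutex,
// the race detector flags a data race on out; with it, the writes are serialized.
func collect(n int) []int {
    var (
        mu  sync.Mutex
        out []int
        wg  sync.WaitGroup
    )
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func(v int) {
            defer wg.Done()
            mu.Lock()
            out = append(out, v)
            mu.Unlock()
        }(i)
    }
    wg.Wait()
    return out
}

func main() {
    fmt.Println(len(collect(8))) // always 8 once the writes are guarded
}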

@sagor999
Author

I tried the new changes, but unfortunately got the same panic. :(

@jacksontj
Owner

I was able to reproduce another failure in the Walk (presumably the same issue) after the prom dep upgrade (#460) -- so I have reverted the parallel tree walking and will work to re-enable it in #461. Because of that, I'll close this one out, as this panic no longer exists on master as of now.
