
query: v0.29.0 OpenTracing config creates a Go panic (SIGSEGV) #5872

Closed
ahurtaud opened this issue Nov 7, 2022 · 7 comments · Fixed by #5887
@ahurtaud
Contributor

ahurtaud commented Nov 7, 2022

Thanos, Prometheus and Golang version used:
thanos v0.29.0

What happened:
While upgrading Thanos from v0.28.1 to v0.29.0, all our Thanos queriers went into a crash loop because of a segmentation violation (see logs below).

What you expected to happen:
As #5411 states, there should be no breaking change.
Removing the tracing config from the spec fixes the pod (but removes tracing :) )

            - |
              --tracing.config=type: JAEGER
              config:
                service_name: "thanos-query-main"
                disabled: false
                rpc_metrics: false
                tags: ""
                sampler_type: ""
                sampler_param: 1
                sampler_manager_host_port: ""
                sampler_max_operations: 0
                sampler_refresh_interval: 0s
                reporter_max_queue_size: 0
                reporter_flush_interval: 0s
                reporter_log_spans: false
                endpoint: ""
                user: ""
                password: ""
                agent_host: "jaeger-main-agent.argos.svc"
                agent_port: 6831

How to reproduce it (as minimally and precisely as possible):
Hard to tell; on our setup with the working OpenTracing config above, we just upgraded the Docker image version... :/
I am not sure about the healthiness of my Jaeger installation, though; maybe it only panics when the OTel target is down. Still, this should not break the Thanos process if that happens :)

Full logs to relevant components:

Logs

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x143d0e8]

goroutine 456 [running]:
go.opentelemetry.io/otel/sdk/trace.parentBased.ShouldSample({{0x0, 0x0}, {{0x270e3f0, 0x3a18158}, {0x270e3c8, 0x3a18158}, {0x270e3f0, 0x3a18158}, {0x270e3c8, 0x3a18158}}}, ...)
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.10.0/trace/sampling.go:281 +0x1c8
github.com/thanos-io/thanos/pkg/tracing/migration.samplerWithOverride.ShouldSample(...)
	/app/pkg/tracing/migration/sampler.go:42
go.opentelemetry.io/otel/sdk/trace.(*tracer).newSpan(0xc0005b0640, {0x2719670, 0xc000e97980}, {0xc0001a0d50, 0x16}, 0xc000ec3480)
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.10.0/trace/tracer.go:95 +0x456
go.opentelemetry.io/otel/sdk/trace.(*tracer).Start(0xc0005b0640, {0x2719670?, 0xc000e97980?}, {0xc0001a0d50, 0x16}, {0xc000e9af40?, 0x4?, 0xc000ec35b0?})
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.10.0/trace/tracer.go:52 +0x153
go.opentelemetry.io/otel/bridge/opentracing.(*WrapperTracer).Start(0xc0006b6228, {0x2719670?, 0xc000e97980?}, {0xc0001a0d50?, 0x1eae540?}, {0xc000e9af40?, 0x1?, 0xc000ea2f70?})
	/go/pkg/mod/go.opentelemetry.io/otel/bridge/opentracing@v1.10.0/wrapper.go:79 +0x4b
go.opentelemetry.io/otel/bridge/opentracing.(*BridgeTracer).StartSpan(0xc0006ce3c0, {0xc0001a0d50, 0x16}, {0xc000e97920, 0x3, 0x3a18158?})
	/go/pkg/mod/go.opentelemetry.io/otel/bridge/opentracing@v1.10.0/bridge.go:430 +0x3f5
github.com/thanos-io/thanos/pkg/tracing/migration.(*bridgeTracerWrapper).StartSpan(0x2009940?, {0xc0001a0d50?, 0x3a18158?}, {0xc000e97920?, 0x39acd40?, 0xc000ec3828?})
	/app/pkg/tracing/migration/bridge.go:89 +0x26
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/tracing.newClientSpanFromContext({0x2719638, 0xc000ed88a0}, {0x2714460, 0xc00030a080}, {0xc0001a0d50, 0x16})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/tracing/client.go:92 +0x245
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/tracing.(*opentracingClientReportable).ClientReporter(0xc0006b64e0, {0x2719638, 0xc000ed88a0}, {0x0?, 0x0?}, {0x220b8e2, 0x5}, {0x22309c5, 0x10}, {0x22309d6, ...})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/tracing/client.go:51 +0x127
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors.UnaryClientInterceptor.func1({0x2719638, 0xc000ed88a0}, {0x22309c4, 0x16}, {0x21122a0, 0x3a18158}, {0x21123e0, 0xc000ed89c0}, 0xc000ea0c30?, 0x22f4818, ...)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/client.go:19 +0x195
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1.1.1({0x2719638?, 0xc000ed88a0?}, {0x22309c4?, 0x38?}, {0x21122a0?, 0x3a18158?}, {0x21123e0?, 0xc000ed89c0?}, 0x0?, {0xc000eb47c0, ...})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:74 +0x86
github.com/grpc-ecosystem/go-grpc-prometheus.(*ClientMetrics).UnaryClientInterceptor.func1({0x2719638, 0xc000ed88a0}, {0x22309c4, 0x16}, {0x21122a0, 0x3a18158}, {0x21123e0, 0xc000ed89c0}, 0x8?, 0xc000e98a68, ...)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/client_metrics.go:112 +0x117
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1.1.1({0x2719638?, 0xc000ed88a0?}, {0x22309c4?, 0x203000?}, {0x21122a0?, 0x3a18158?}, {0x21123e0?, 0xc000ed89c0?}, 0x1?, {0xc000eb47c0, ...})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:74 +0x86
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1({0x2719638, 0xc000ed88a0}, {0x22309c4, 0x16}, {0x21122a0, 0x3a18158}, {0x21123e0, 0xc000ed89c0}, 0x0?, 0x22f4818, ...)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:83 +0x157
google.golang.org/grpc.(*ClientConn).Invoke(0xc00085ea00?, {0x2719638?, 0xc000ed88a0?}, {0x22309c4?, 0x16?}, {0x21122a0?, 0x3a18158?}, {0x21123e0?, 0xc000ed89c0?}, {0xc000e94a10, ...})
	/go/pkg/mod/google.golang.org/grpc@v1.45.0/call.go:35 +0x223
github.com/thanos-io/thanos/pkg/info/infopb.(*infoClient).Info(0xc00030a2c8, {0x2719638, 0xc000ed88a0}, 0x8?, {0xc000e94a10, 0x1, 0x1})
	/app/pkg/info/infopb/rpc.pb.go:422 +0xc9
github.com/thanos-io/thanos/pkg/query.(*endpointRef).Metadata(0xc0006fca80, {0x2719638, 0xc000ed88a0}, {0x2700800, 0xc00030a2c8}, {0x271a8d0, 0xc00030a2d0})
	/app/pkg/query/endpointset.go:61 +0xe3
github.com/thanos-io/thanos/pkg/query.(*EndpointSet).updateEndpoint(0xc000a48300, {0x2719638, 0xc000ed88a0}, 0xc000e987c8, 0xc0006fca80)
	/app/pkg/query/endpointset.go:409 +0x105
github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update.func2(0xc000e987c8)
	/app/pkg/query/endpointset.go:349 +0x2cb
created by github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update
	/app/pkg/query/endpointset.go:338 +0x60a

@GiedriusS
Member

cc @metonymic-smokey

@yeya24
Contributor

yeya24 commented Nov 8, 2022

If sampler_type is set to an empty string, it seems we will pass an empty root sampler...
https://github.com/thanos-io/thanos/blob/main/pkg/tracing/jaeger/config_yaml.go#L141

I don't remember the default behavior of the Jaeger Go SDK; if sampler_type is not set, does it fall back to a default sampler type?
Anyway, I think the root cause of the panic is clear. Help wanted.
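
A minimal sketch of the suspected failure mode (hypothetical code, not the actual Thanos implementation): if the config translation returns a nil root sampler for an empty sampler_type, the OTel SDK's parent-based sampler calls a method on that nil value on the first sampling decision, which matches the panic in parentBased.ShouldSample in the logs above.

    package main

    import (
    	"context"

    	sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
    	// Stand-in for an unhandled sampler_type: "" that leaves the root sampler unset.
    	var root sdktrace.Sampler // nil

    	sampler := sdktrace.ParentBased(root)

    	// With no parent span in the context, ParentBased delegates to its nil root:
    	// panic: runtime error: invalid memory address or nil pointer dereference.
    	sampler.ShouldSample(sdktrace.SamplingParameters{
    		ParentContext: context.Background(),
    		Name:          "info-client-call",
    	})
    }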

yeya24 added the bug label Nov 8, 2022
@metonymic-smokey
Contributor

I'll look into this over the weekend; apologies for the crash.

@ahurtaud
Contributor Author

Thank you! Also, if we need to set a sampler type, it would be great to update the Jaeger/Thanos doc we followed :)
https://thanos.io/tip/thanos/tracing.md/#jaeger

@matej-g
Collaborator

matej-g commented Nov 10, 2022

Hey @ahurtaud, sorry about that. I thought I had flagged this during review, but it looks like I did not. If remote was the default before, it should be the default now as well.
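
A rough sketch of what defaulting an empty sampler_type to remote could look like (assumptions only, not necessarily what #5887 implements), using the OTel contrib Jaeger remote sampler so the parent-based root is never nil; the function name, mapping, and sampling URL are illustrative:

    package main

    import (
    	"fmt"

    	"go.opentelemetry.io/contrib/samplers/jaegerremote"
    	sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    // rootSampler maps a Jaeger-style sampler_type to an OTel root sampler.
    func rootSampler(samplerType, serviceName, samplingServerURL string, param float64) sdktrace.Sampler {
    	switch samplerType {
    	case "const":
    		if param >= 1 {
    			return sdktrace.AlwaysSample()
    		}
    		return sdktrace.NeverSample()
    	case "probabilistic":
    		return sdktrace.TraceIDRatioBased(param)
    	case "remote", "": // treat an empty sampler_type as "remote", the previous default
    		return jaegerremote.New(
    			serviceName,
    			jaegerremote.WithSamplingServerURL(samplingServerURL),
    			jaegerremote.WithInitialSampler(sdktrace.TraceIDRatioBased(param)),
    		)
    	default:
    		return sdktrace.AlwaysSample()
    	}
    }

    func main() {
    	root := rootSampler("", "thanos-query-main", "http://localhost:5778/sampling", 1)
    	fmt.Println(root.Description())
    	_ = sdktrace.ParentBased(root) // root is never nil, so ShouldSample cannot dereference nil
    }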

@matej-g
Collaborator

matej-g commented Nov 10, 2022

I created #5887 to correct this. I think we could even go with a patch release, to make sure that if more folks hit this they can just use a patch version.

@ahurtaud
Contributor Author

Regarding a patch release, I would love it; however, we are also affected by #5874 and #5862, so I would like to wait for those. (objstore is already fixed; the Go modules just need updating.) The Store API noStore match error seems harder, though.
