[bitnami/thanos] thanos helm chart renders strange hostname for sidecarsService dnsDiscovery #24527

Open
danfinn opened this issue Mar 18, 2024 · 7 comments
Labels
on-hold Issues or Pull Requests with this label will never be considered stale tech-issues The user has a technical issue about an application thanos

Comments


danfinn commented Mar 18, 2024

Name and Version

bitnami/thanos 12.23.0

What architecture are you using?

amd64

What steps will reproduce the bug?

I'm installing the helm chart like so:

helm upgrade --install thanos bitnami/thanos --values ~/git/thanos/thanos_values.yml

with the values below. The DNS entry for my service is being rendered incorrectly by the helm chart (as far as I can tell); according to helm template, it comes out looking like this:

          args:
            - query
            - --log.level=info
            - --log.format=logfmt
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --query.replica-label=replica
            - --endpoint=dnssrv+_grpc._tcp.prometheus-thanos-sidecar-server.prometheus.svc.cluster.local
            - --endpoint=dnssrv+_grpc._tcp.thanos-storegateway.prometheus.svc.cluster.local

I don't know where that _grpc._tcp. prefix is coming from or why, but it breaks DNS resolution and I get the following errors from the query pod:

ts=2024-03-18T20:45:54.559924978Z caller=resolver.go:99 level=error msg="failed to lookup SRV records" host=_grpc._tcp.prometheus-thanos-sidecar-server.prometheus.svc.cluster.local err="no such host"

Without the _grpc._tcp. prefix, DNS resolution works as expected:

nslookup prometheus-thanos-sidecar-server.prometheus.svc.cluster.local
Server:		10.0.0.10
Address:	10.0.0.10:53


Name:	prometheus-thanos-sidecar-server.prometheus.svc.cluster.local
Address: 10.0.75.135

Once the prefix is added, though:

nslookup _grpc._tcp.prometheus-thanos-sidecar-server.prometheus.svc.cluster.local
Server:		10.0.0.10
Address:	10.0.0.10:53

** server can't find _grpc._tcp.prometheus-thanos-sidecar-server.prometheus.svc.cluster.local: NXDOMAIN

** server can't find _grpc._tcp.prometheus-thanos-sidecar-server.prometheus.svc.cluster.local: NXDOMAIN

Are you using any custom parameters or values?

query:
  nodeSelector:
    kubernetes.io/os: linux
  dnsDiscovery:
    sidecarsService: "prometheus-thanos-sidecar-server"
    sidecarsNamespace: "prometheus"

queryFrontend:
  nodeSelector:
    kubernetes.io/os: linux

bucketweb:
  nodeSelector:
    kubernetes.io/os: linux

compactor:
  nodeSelector:
    kubernetes.io/os: linux
  enabled: true

storegateway:
  nodeSelector:
    kubernetes.io/os: linux
  enabled: true

ruler:
  nodeSelector:
    kubernetes.io/os: linux

receive:
  nodeSelector:
    kubernetes.io/os: linux

receiveDistributor:
  nodeSelector:
    kubernetes.io/os: linux

metrics:
  enabled: true
  serviceMonitor:
    enabled: true

objstoreConfig: |-
  type: AZURE
  config:
      storage_account: "storage_account_name"
      storage_account_key: "storage_account_key"
      container: "thanos"

What is the expected behavior?

I'm not sure why it's adding that strange-looking prefix to the DNS entry for the service.

What do you see instead?

see above

Additional information

No response

@danfinn danfinn added the tech-issues The user has a technical issue about an application label Mar 18, 2024
@github-actions github-actions bot added the triage Triage is needed label Mar 18, 2024

danfinn commented Mar 18, 2024

This looks like it might be related to thanos-io/thanos#5366, however there is no info on what the fix was, and I'm not sure which pod labels they are talking about.


danfinn commented Mar 18, 2024

You can see here where the prefix is added by the helm chart:

- --endpoint=dnssrv+_grpc._tcp.{{- include "common.tplvalues.render" ( dict "value" .Values.query.dnsDiscovery.sidecarsService "context" $) -}}.{{- include "common.tplvalues.render" ( dict "value" .Values.query.dnsDiscovery.sidecarsNamespace "context" $) -}}.svc.{{ .Values.clusterDomain }}

@github-actions github-actions bot removed the triage Triage is needed label Mar 19, 2024
@github-actions github-actions bot assigned FraPazGal and unassigned javsalgar Mar 19, 2024
@javsalgar javsalgar changed the title thanos helm chart renders strange hostname for sidecarsService dnsDiscovery [bitnami/thanos] thanos helm chart renders strange hostname for sidecarsService dnsDiscovery Mar 19, 2024

illyul commented Mar 21, 2024

I got the same issue with Thanos and CoreDNS. Our k8s cluster is using both kube-dns and CoreDNS.
With kube-dns, everything is okay, but CoreDNS can't resolve the A record.

Workaround: follow this doc:

https://github.com/thanos-io/thanos/blob/main/docs/service-discovery.md#dns-service-discovery

I have changed dnssrv+_grpc._tcp to dns+_grpc._tcp:port.
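
For reference, the dns+ form from that doc resolves plain A/AAAA records, so the endpoint needs an explicit port appended instead of a named-port SRV lookup. A minimal sketch of the resulting query arguments, assuming the sidecar serves gRPC on the Thanos default port 10901 (adjust to your service):

          args:
            - query
            - --endpoint=dns+prometheus-thanos-sidecar-server.prometheus.svc.cluster.local:10901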

Postscript:
This bug was reported and fixed back in 2021:
thanos-io/thanos#3672

FraPazGal (Contributor) commented

Hello @danfinn, if I'm understanding your issue correctly, the problem comes from the endpoint set for your external Prometheus service, right? Looking at the SRV records, dnssrv+_grpc._tcp.service_url will look for the service's port named grpc. Could it be that the Prometheus service port you are connecting to is named differently?
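
For illustration, Kubernetes only publishes an SRV record of the form _grpc._tcp.<service>.<namespace>.svc.<cluster-domain> when the Service defines a port named grpc. A minimal sketch of a sidecar Service that the dnssrv+ lookup could resolve against (the name, selector, and port are illustrative, not taken from this issue):

apiVersion: v1
kind: Service
metadata:
  name: prometheus-thanos-sidecar-server
  namespace: prometheus
spec:
  clusterIP: None              # headless is typical so each sidecar pod is discovered individually
  selector:
    app.kubernetes.io/name: prometheus
  ports:
    - name: grpc               # this port name is what _grpc._tcp resolves against
      port: 10901
      targetPort: 10901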

Besides that, could you also try @illyul's workaround? In that case you'll be directly setting the port number instead of the port name from the service.

It seems to me we should evaluate having an additional parameter to define the sidecar's portName or portNumber depending on whether we end up using dnssrv+ or dns+.
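
As a rough sketch of what such a parameter could look like in the values file (the port keys below are hypothetical, not existing chart values):

query:
  dnsDiscovery:
    sidecarsService: "prometheus-thanos-sidecar-server"
    sidecarsNamespace: "prometheus"
    sidecarsPortName: "grpc"      # hypothetical: would build a dnssrv+ endpoint
    # sidecarsPortNumber: 10901   # hypothetical: would build a dns+ endpoint instead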


illyul commented Mar 24, 2024

Besides that, could you also try @illyul's workaround? In that case you'll be directly setting the port number instead of the port name from the service.

I configured dns+_grpc._tcp:port as a workaround and it worked for me.


github-actions bot commented Apr 9, 2024

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the stale 15 days without activity label Apr 9, 2024
FraPazGal (Contributor) commented

Hello @illyul, @danfinn, I have created an internal task for our dev team to look into this and provide a permanent solution. I'll put this issue on-hold and we'll update it as soon as there is any news.

@FraPazGal FraPazGal added on-hold Issues or Pull Requests with this label will never be considered stale and removed stale 15 days without activity in-progress labels Apr 11, 2024