Update thanos 15.7.15 to 15.7.16, sidecars no longer show up on thanos query stores #29310

Open
Bah27 opened this issue Sep 9, 2024 · 7 comments
Assignees: juan131
Labels: in-progress, tech-issues, thanos

Comments

Bah27 commented Sep 9, 2024

Name and Version

thanos/15.7.16

What architecture are you using?

None

What steps will reproduce the bug?

Update the thanos chart from 15.7.15 to 15.7.16

Are you using any custom parameters or values?

existingObjstoreSecret: thanos-objstore
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    namespace: "monitoring"   
query:
  enable: true
  dnsDiscovery:
    enable: false
    sidecarsService: prometheus-operated
    sidecarsNamespace: monitoring
  grpc:
    client:
      tls:
        enabled: true
        existingSecret:
          name: thanos-cert
          keyMapping:
            ca-cert: ca.crt
            tls-cert: tls.crt
            tls-key: tls.key
            clientAuthEnabled: true
  stores: 
    - "@domain1:443"
    - "@domain2:443"
    - "@domain2:443"
  extraFlags:
     - --grpc-client-tls-skip-verify
     - --store.response-timeout=0  
  replicatLabel: prometheus_replica   
  resources:
    requests:
      cpu: 150m 
      memory: 150Mi 
    limits:
      #cpu: 50m
      #memory: 200Mi 

  nodeSelector:
    k8s.scaleway.com/app: monitoring

  tolerations:
  - key: "k8s.scaleway.com/nodepool"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
                   
queryFrontend:
  enabled: true
  config: |-
    type: IN-MEMORY
    config:
      max_size: 1GB
      max_size_items: 0
      validity: 0s

  resources:
    requests:
      cpu: 10m
      memory: 100Mi
    limits:
      #cpu: 100m
      memory: 100Mi

  nodeSelector:
    k8s.scaleway.com/app: monitoring

  tolerations:
  - key: "k8s.scaleway.com/nodepool"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
    
  ingress:
    enabled: false

compactor:
  enabled: true
  retentionResolutionRaw: 14d
  retentionResolution5m: 14d
  retentionResolution1h: 20d
  consistencyDelay: 30m
  extraFlags:
  - --delete-delay=2h

  persistence:
    enabled: false
 
  resources:
    requests:
      cpu: 200m
      memory: 200Mi
    limits:
      #cpu: 100m
      #memory: 200Mi
      
  nodeSelector:
    k8s.scaleway.com/app: monitoring

  tolerations:
  - key: "k8s.scaleway.com/nodepool"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"

receive:
  enabled: false

bucketweb:
  enabled: false

storegateway:
  enabled: true
  grpc:
    server:
      tls:
        enabled: true
        existingSecret:
          name: thanos-cert
          keyMapping:
            ca-cert: ca.crt
            tls-cert: tls.crt
            tls-key: tls.key
            clientAuthEnabled: true 

  persistence:
    enabled: false

  resources:
    requests:
      cpu: 100m
      memory: 100Mi 
    limits:
      #cpu: 100m
      #memory: 100Mi
  nodeSelector:
    k8s.scaleway.com/app: monitoring

  tolerations:
  - key: "k8s.scaleway.com/nodepool"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"

NB: @domain1 and @domain2 are placeholders for the domain names of each sidecar.

What do you see instead?

ts=2024-09-09T12:28:26.896367373Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=@domain1:443
ts=2024-09-09T12:28:26.896680946Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=@domain2:443
ts=2024-09-09T12:28:26.896714506Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing: dial tcp 10.38.88.81:10901: connect: connection refused\"" address=10.38.88.81:10901
Bah27 added the tech-issues label Sep 9, 2024
github-actions bot added the triage label Sep 9, 2024
github-actions bot removed the triage label Sep 11, 2024
github-actions bot assigned juan131 and unassigned carrodher Sep 11, 2024
juan131 (Contributor) commented Sep 12, 2024

Hi @Bah27

We updated the Thanos version to 0.36.0 in that release, see #28607

It seems that, in version 0.36.1, a fix for a regression in the TLS config was included in Query.

Could you try with that version? It's already available in the latest version of the Bitnami chart.

github-actions bot commented Sep 28, 2024

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions bot added the stale label Sep 28, 2024
Bah27 (Author) commented Sep 30, 2024

Hello @juan131

I apologize for the lack of follow-up; I was on vacation. I will test version 0.36.1, as recommended, to check if the issue related to the TLS configuration is resolved with the fix mentioned in this pull request.

In the meantime, I have observed several errors in the logs, including:

ts=2024-09-30T08:30:06.159166083Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=@domain1:443
ts=2024-09-30T08:30:06.159645638Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=@domain2:443
ts=2024-09-30T08:30:11.162302928Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=@domain3:443
ts=2024-09-30T08:30:11.162285816Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: authentication handshake failed: tls: first record does not look like a TLS handshake\"" address=100.64.41.170:10901

These logs show errors related to timeouts and TLS authentication issues on those endpoints.

Thank you for your patience!

github-actions bot removed the stale label Oct 1, 2024
juan131 (Contributor) commented Oct 2, 2024

Thanks @Bah27! Please let us know your findings once you try it with the latest chart version.

Bah27 (Author) commented Oct 2, 2024

Thank you @juan131!
I proceeded with the tests using the latest chart version, but unfortunately, I am still encountering the same errors.

juan131 (Contributor) commented Oct 8, 2024

Hi @Bah27

Sorry for the delay in my response. I've been reviewing the values you shared, paying special attention to the block below:

query:
  (...)
  grpc:
    client:
      tls:
        enabled: true
        existingSecret:
          name: thanos-cert
          keyMapping:
            ca-cert: ca.crt
            tls-cert: tls.crt
            tls-key: tls.key
            clientAuthEnabled: true
  stores: 
    - "@domain1:443"
    - "@domain2:443"
    - "@domain2:443"
  extraFlags:
     - --grpc-client-tls-skip-verify
     - --store.response-timeout=0 

It seems you enabled TLS for gRPC on the client side, but you didn't do the same on the server side (query.grpc.server.tls.enabled is false by default and you didn't modify it). Also, you're setting the property query.grpc.client.tls.clientAuthEnabled, which doesn't exist; I guess you meant query.grpc.server.tls.clientAuthEnabled, right?
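
For illustration, a minimal sketch of what the missing server-side block could look like, assuming query.grpc.server.tls accepts the same existingSecret/keyMapping layout you already use under storegateway.grpc.server.tls (secret name and key names taken from your values):

query:
  grpc:
    server:
      tls:
        enabled: true              # false by default, so the server was serving plain gRPC
        clientAuthEnabled: true    # belongs here, not under keyMapping
        existingSecret:
          name: thanos-cert
          keyMapping:
            ca-cert: ca.crt
            tls-cert: tls.crt
            tls-key: tls.key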

Also, regarding this block:

query:
  dnsDiscovery:
    enable: false
    sidecarsService: prometheus-operated
    sidecarsNamespace: monitoring

Please note query.dnsDiscovery.sidecarsService and query.dnsDiscovery.sidecarsNamespace will be ignored if query.dnsDiscovery.enabled is false.
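
For reference, a rough sketch of the two alternatives (parameter names as above; which one fits depends on whether you want Query to discover the sidecars through the headless service or keep the explicit stores list):

# Option A: rely on DNS discovery of the sidecars
query:
  dnsDiscovery:
    enabled: true
    sidecarsService: prometheus-operated
    sidecarsNamespace: monitoring

# Option B: keep the explicit stores list and drop the unused keys
query:
  dnsDiscovery:
    enabled: false
  stores:
    - "@domain1:443"
    - "@domain2:443"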

Bah27 (Author) commented Oct 9, 2024

Hi @juan131,

Thanks for your reply and for taking the time to carefully review the configuration details.

TLS for gRPC server
You're absolutely right. I had enabled TLS on the client side but missed doing so on the server side. I'll correct this by adding query.grpc.server.tls.enabled: true. And yes, I mistakenly used clientAuthEnabled in the wrong place. What I meant to use was query.grpc.server.tls.clientAuthEnabled.

Thanks for pointing that out—it really helped me understand the mistake. I’ll adjust the configuration as you suggested.

dnsDiscovery
Regarding DNS discovery, good catch! I didn't realize that query.dnsDiscovery.sidecarsService and sidecarsNamespace would be ignored when enabled is set to false. I'll either set query.dnsDiscovery.enabled: true or remove those parameters if they're not needed.

Thanks again for the clarifications and for linking the documentation—this was super helpful!

I’ll update everything and run some tests.
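
For completeness, this is roughly what I expect the adjusted query block to look like (just a sketch combining your two suggestions; the server-side secret layout mirrors what I already use for storegateway):

query:
  (...)
  dnsDiscovery:
    enabled: false               # keeping the explicit stores list, so sidecarsService/sidecarsNamespace are removed
  grpc:
    client:
      tls:
        enabled: true
        existingSecret:
          (...)                  # unchanged, minus the misplaced clientAuthEnabled
    server:
      tls:
        enabled: true            # newly enabled
        clientAuthEnabled: true
        existingSecret:
          name: thanos-cert
          keyMapping:
            ca-cert: ca.crt
            tls-cert: tls.crt
            tls-key: tls.key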
