Retry not Retrying #51762

Closed · 2 tasks done
Stono opened this issue Jun 27, 2024 · 10 comments

Comments

@Stono
Contributor

Stono commented Jun 27, 2024

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

We have the following VirtualService, which configures retryOn to include 503:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  annotations:
    atcloud.io/fault-injection: "true"
    meta.helm.sh/release-name: unified-registration-system
    meta.helm.sh/release-namespace: unified-registration-system
  name: iovox
  namespace: unified-registration-system
spec:
  exportTo:
  - .
  hosts:
  - api.voxanalytics.com
  http:
  - match:
    - port: 80
    name: httpPort
    retries:
      attempts: 3
      perTryTimeout: 10s
      retryOn: connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,502,503
    route:
    - destination:
        host: api.voxanalytics.com
        port:
          number: 80
      headers:
        request:
          remove:
          - x-envoy-attempt-count
          - x-envoy-decorator-operation
          - x-envoy-expected-rq-timeout-ms
          - x-envoy-hedge-on-per-try-timeout
          - x-envoy-is-timeout-retry
          - x-envoy-max-retries
          - x-envoy-original-path
          - x-envoy-peer-metadata
          - x-envoy-peer-metadata-id
          - x-envoy-retriable-header-names
          - x-envoy-retriable-status-codes
          - x-envoy-retry-grpc-on
          - x-envoy-retry-on
          - x-envoy-upstream-alt-stat-name
          - x-envoy-upstream-rq-per-try-timeout-ms
          - x-envoy-upstream-rq-timeout-alt-response
          - x-envoy-upstream-rq-timeout-ms
          - x-envoy-upstream-service-time
          - x-envoy-upstream-stream-duration-ms
          set:
            host: api.voxanalytics.com
      weight: 100
    timeout: 10s
  - match:
    - port: 444
    name: httpsPort
    retries:
      attempts: 3
      perTryTimeout: 10s
      retryOn: connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,502,503
    route:
    - destination:
        host: api.voxanalytics.com
        port:
          number: 444
      headers:
        request:
          remove:
          - x-envoy-attempt-count
          - x-envoy-decorator-operation
          - x-envoy-expected-rq-timeout-ms
          - x-envoy-hedge-on-per-try-timeout
          - x-envoy-is-timeout-retry
          - x-envoy-max-retries
          - x-envoy-original-path
          - x-envoy-peer-metadata
          - x-envoy-peer-metadata-id
          - x-envoy-retriable-header-names
          - x-envoy-retriable-status-codes
          - x-envoy-retry-grpc-on
          - x-envoy-retry-on
          - x-envoy-upstream-alt-stat-name
          - x-envoy-upstream-rq-per-try-timeout-ms
          - x-envoy-upstream-rq-timeout-alt-response
          - x-envoy-upstream-rq-timeout-ms
          - x-envoy-upstream-service-time
          - x-envoy-upstream-stream-duration-ms
          set:
            host: api.voxanalytics.com
      weight: 100
    timeout: 10s
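
For reference, the Envoy retry policy Istio renders for this route can be inspected with istioctl. A rough sketch of what I'd expect for the retryOn above, where the numeric codes get split out into retriable-status-codes (the pod name is a placeholder and the exact field set varies by Istio version):

istioctl proxy-config routes <app-pod> --name 80 -o json

"retryPolicy": {
  "retryOn": "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes",
  "numRetries": 3,
  "perTryTimeout": "10s",
  "retriableStatusCodes": [502, 503]
}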

However, we see requests in Jaeger that aren't being retried:

[Screenshot 2024-06-27 at 11:16:30: Jaeger trace of a request that was not retried]

The response to the app from envoy was: upstream connect error or disconnect/reset before headers. reset reason: connection termination.

I wasn't sure if this was just because retries aren't recorded as spans, so I enabled the Envoy retry metrics, which confirmed that no retry was attempted:

# TYPE envoy_cluster_voxanalytics_com_upstream_rq_retry counter
envoy_cluster_voxanalytics_com_upstream_rq_retry{cluster_name="outbound|444||api"} 0
envoy_cluster_voxanalytics_com_upstream_rq_retry{cluster_name="outbound|80||api"} 0
# TYPE envoy_cluster_voxanalytics_com_upstream_rq_retry_backoff_exponential counter
envoy_cluster_voxanalytics_com_upstream_rq_retry_backoff_exponential{cluster_name="outbound|444||api"} 0
envoy_cluster_voxanalytics_com_upstream_rq_retry_backoff_exponential{cluster_name="outbound|80||api"} 0
# TYPE envoy_cluster_voxanalytics_com_upstream_rq_retry_backoff_ratelimited counter
envoy_cluster_voxanalytics_com_upstream_rq_retry_backoff_ratelimited{cluster_name="outbound|444||api"} 0
envoy_cluster_voxanalytics_com_upstream_rq_retry_backoff_ratelimited{cluster_name="outbound|80||api"} 0
# TYPE envoy_cluster_voxanalytics_com_upstream_rq_retry_limit_exceeded counter
envoy_cluster_voxanalytics_com_upstream_rq_retry_limit_exceeded{cluster_name="outbound|444||api"} 0
envoy_cluster_voxanalytics_com_upstream_rq_retry_limit_exceeded{cluster_name="outbound|80||api"} 0
# TYPE envoy_cluster_voxanalytics_com_upstream_rq_retry_overflow counter
envoy_cluster_voxanalytics_com_upstream_rq_retry_overflow{cluster_name="outbound|444||api"} 0
envoy_cluster_voxanalytics_com_upstream_rq_retry_overflow{cluster_name="outbound|80||api"} 0
# TYPE envoy_cluster_voxanalytics_com_upstream_rq_retry_success counter
envoy_cluster_voxanalytics_com_upstream_rq_retry_success{cluster_name="outbound|444||api"} 0
envoy_cluster_voxanalytics_com_upstream_rq_retry_success{cluster_name="outbound|80||api"} 0
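
(For anyone wanting to see these: the per-cluster retry stats aren't in the default Istio stat set, so they have to be included explicitly. One way, a sketch assuming the proxy.istio.io/config pod annotation is available in your setup:

metadata:
  annotations:
    proxy.istio.io/config: |
      proxyStatsMatcher:
        inclusionRegexps:
        - ".*upstream_rq_retry.*"
)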

(Note: I've raised a separate issue because the cluster_name here is not being built correctly, in my opinion. I'm linking it from here because I'm wondering if it's another symptom of the TLS origination setup.)

Version

1.21.2

Additional Information

No response

@Stono
Contributor Author

Stono commented Jun 27, 2024

I wonder if this is because we're removing the x-envoy-* headers from the outbound request; are they used internally by the retry logic?

I'll try removing that and see if it helps.

@Stono
Contributor Author

Stono commented Jun 27, 2024

[Screenshot 2024-06-27 at 13:52:41: Jaeger trace showing the request still wasn't retried]

No, didn't help.

@Stono
Contributor Author

Stono commented Jun 27, 2024

I'm wondering if it's the type of error. The error here is a 503 UC, where the upstream connection was terminated: a 503 wasn't actually returned from the remote host, it's just what Istio/Envoy does for a reset (maps it to a 503)?

I'll try retryOn: reset to see what happens...

@Stono
Contributor Author

Stono commented Jun 27, 2024

OK, I can confirm that changing it to retryOn: reset correctly retries:

envoy_cluster_voxanalytics_com_upstream_rq_retry_success{cluster_name="outbound|80||api"} 1
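
(That test was the same retries block as above with only the retryOn value swapped, roughly:

retries:
  attempts: 3
  perTryTimeout: 10s
  retryOn: reset
)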

So that means either:

  • the HTTP status code retry config isn't working at all (a bug)
  • the 503 isn't a real 503; it's mapped from a reset, so it isn't considered as part of the retry (arguably expected behaviour, albeit bad UX)

@keithmattix
Contributor

keithmattix commented Jun 27, 2024

From the envoy docs for retryOn (for 5xx):

Envoy will attempt a retry if the upstream server responds with any 5xx response code, or does not respond at all (disconnect/reset/read timeout). (Includes connect-failure and refused-stream)

So these are connection resets that result in 503s being sent to your application, but the upstream server didn't actually return a 503, so they aren't matched by the configured retriable status codes and therefore aren't retried.
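
In practice, if the intent is to retry both genuine 502/503 responses and these locally generated ones, one option (a sketch, not something verified in this thread beyond the reset case above) is to list reset alongside the status codes:

retries:
  attempts: 3
  perTryTimeout: 10s
  retryOn: connect-failure,refused-stream,reset,unavailable,cancelled,resource-exhausted,502,503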

@Stono
Contributor Author

Stono commented Jun 27, 2024

@keithmattix yeah, I was just having the same conversation with @howardjohn on Slack, and I agree: it's #2.

Bit of a funky one from a user's perspective. It ultimately boils down to Envoy mapping a "fake" status code (503) onto a response that never got a status code (a reset), and me as a user then acting on that. Most users aren't going to be deep enough into Envoy to know that it's making up status codes you can't retry on! However, as John pointed out, it has to do that because it can't transmit a 0 to the downstream.

Can't think of a way forward here, to be honest, as all of the options are complex or have tradeoffs! Happy to close it if you agree?

@keithmattix
Contributor

Yeah, this is exactly one of those "least bad option" scenarios. Thanks for the feedback and the debugging info! Closing.

@keithmattix closed this as not planned on Jun 27, 2024
@howardjohn
Member

@Stono
Contributor Author

Stono commented Jun 27, 2024

^ Think that's a good shout; no harm in calling it out.

@howardjohn
Member

sent out istio/api#3247
