Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a simple network prober to the activator. #3257

Merged
merged 1 commit into from
Feb 22, 2019

Conversation

mattmoor
Copy link
Member

@mattmoor mattmoor commented Feb 17, 2019

When the flag -enable-network-probing is passed the activator will replace its retring transport logic with a simple network probe based on #3256 with a similar number of retries to what the retrying transport was previously configured to use. Enabling this allows the GRPC test with streaming and cold-start fairly reliably on my cluster (and also with the GRPC ping sample in knative/docs, with my fixes).

This change also refactors the GRPC test into 4 tests, for each of the logical things tested, which will hopefully reduce the amount of time this adds to e2e dramatically when we switch to use t.Parallel() since it will parallelize the two times this waits for a scale-to-zero.

Fixes: #3239
Fixes: #2856

@knative-prow-robot knative-prow-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 17, 2019
@mattmoor
Copy link
Member Author

/test pull-knative-serving-unit-tests

@tcnghia
Copy link
Contributor

tcnghia commented Feb 18, 2019

/test pull-knative-serving-integration-tests

cmd/activator/main.go Outdated Show resolved Hide resolved
pkg/activator/handler/handler.go Show resolved Hide resolved
pkg/activator/handler/handler.go Outdated Show resolved Hide resolved
pkg/activator/handler/handler.go Outdated Show resolved Hide resolved
pkg/activator/handler/handler_test.go Outdated Show resolved Hide resolved
test/e2e/grpc_test.go Outdated Show resolved Hide resolved
test/e2e/grpc_test.go Show resolved Hide resolved
test/e2e/grpc_test.go Outdated Show resolved Hide resolved
test/e2e/grpc_test.go Outdated Show resolved Hide resolved
test/e2e/grpc_test.go Show resolved Hide resolved
@mattmoor mattmoor force-pushed the probe-based-activation branch 2 times, most recently from 3dec44f to d73c81f Compare February 19, 2019 00:29
@vagababov
Copy link
Contributor

/lgtm
/hold
in case you want someone else to look as well.

@knative-prow-robot knative-prow-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Feb 19, 2019
@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2019
@tcnghia
Copy link
Contributor

tcnghia commented Feb 19, 2019

If this works, let's check it in because I feel like this is part of #813 which is work we signed up for in 0.4.0.

Thanks for the fix!

@tcnghia
Copy link
Contributor

tcnghia commented Feb 19, 2019

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2019
@mattmoor
Copy link
Member Author

@tcnghia I'm worried that this is potentially disruptive to a key path. After 0.4.0 cuts, I'll check this in enabled by default and we can discuss cherry picking it into 0.4.x at the WG meeting after it's baked at least a little in master.

@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2019
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mattmoor mattmoor changed the title [WIP] Add a simple network prober to the activator. Add a simple network prober to the activator. Feb 19, 2019
@knative-prow-robot knative-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 19, 2019
@mattmoor
Copy link
Member Author

@markusthoemmes The probe endpoints are in 0.4 and unused. However, this would affect a conversation about whether to cherry-pick this (I added this to the #networking agenda on Thursday).

Copy link
Member Author

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, RFAL.

pkg/activator/handler/handler.go Outdated Show resolved Hide resolved
pkg/activator/handler/handler.go Outdated Show resolved Hide resolved
pkg/activator/handler/handler.go Outdated Show resolved Hide resolved
@mattmoor
Copy link
Member Author

/hold cancel

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 19, 2019
pkg/activator/handler/handler.go Outdated Show resolved Hide resolved
http.CanonicalHeaderKey(network.ProbeHeaderName): []string{"true"},
},
}
settings := wait.Backoff{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this invalidates the comment above where maxRetries says that the sum will evaluate to 1 minute. My math shows you'll be getting 37.151 seconds (or 48397, depending on how wait internals are done), but still not a minute.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That comment was written for the other code path. We can clean that up when we remove the other path?

@mattmoor
Copy link
Member Author

/test pull-knative-serving-integration-tests

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@mattmoor
Copy link
Member Author

/test pull-knative-serving-integration-tests

@vagababov
Copy link
Contributor

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2019
@mattmoor
Copy link
Member Author

/test pull-knative-serving-integration-tests

a.Logger.Errorw("Pod probe failed", zap.Error(err))
return false, nil
}
if probeResp.StatusCode != http.StatusOK {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do similar to the retry roundtripper where we only continue on 503? I worry that retrying on any !ok response will surface e.g. 500 with a 1min delay on activation rather than instant fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to only continue on 503, and below to only declare success on http.StatusOK.

@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2019
@mattmoor
Copy link
Member Author

/test pull-knative-serving-upgrade-tests

pkg/activator/handler/handler.go Show resolved Hide resolved
pkg/activator/handler/handler.go Outdated Show resolved Hide resolved
When the flag `-enable-network-probing` is passed (on by default) the activator will replace its retring transport logic with a simple network probe based on knative#3256 with a similar number of retries to what the retrying transport was previously configured to use.  Enabling this allows the GRPC test with streaming and cold-start fairly reliably on my cluster (and also with the GRPC ping sample in knative/docs, with my fixes).

This change also refactors the GRPC test into 4 tests, for each of the logical things tested, which will hopefully reduce the amount of time this adds to e2e dramatically when we switch to use `t.Parallel()` since it will parallelize the two times this waits for a scale-to-zero.

Fixes: knative#3239
Fixes: knative#2856
@k4leung4
Copy link
Contributor

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2019
Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

pkg/activator/handler/handler.go Show resolved Hide resolved
@mattmoor
Copy link
Member Author

/retest

1 similar comment
@jonjohnsonjr
Copy link
Contributor

/retest

@knative-prow-robot knative-prow-robot merged commit 7fc1438 into knative:master Feb 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix gRPC Streaming after Cold Start Activator to perform health checks before forwarding real requests
9 participants