Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce concurrency limit for GET requests and a general timeout for HTTP #1743

Merged
merged 4 commits into from
Feb 15, 2019

Conversation

beorn7
Copy link
Member

@beorn7 beorn7 commented Feb 6, 2019

Fixes #1723 .

@beorn7 beorn7 requested a review from mxinden February 6, 2019 17:43
@beorn7
Copy link
Member Author

beorn7 commented Feb 6, 2019

@mxinden I'm banging my head against a wall when investigating the Travis errors: cannot find package "github.com/prometheus/alertmanager/ui/app" Hmm, yes, that package does not exist. But why does it even want it? It's not referred anywhere I can see. Also, it works locally.

@beorn7
Copy link
Member Author

beorn7 commented Feb 6, 2019

I have a lead: the go generate command does not use Go Modules in my local run because apparently the env variables don't make it through to it. However, in Travis, presumably GOPATH is not set, so Go Modules are activated automatically.

@beorn7
Copy link
Member Author

beorn7 commented Feb 6, 2019

That's definitely the reason. Now trying to find a solution. (BTW: This would fail in master in the same way.)

@beorn7
Copy link
Member Author

beorn7 commented Feb 6, 2019

Still banging my head against a wall. The way to reproduce: set GOPATH to something else, and then run make assets. I tried to make the assets package a go module on its own, but then we get double-imports. @simonpasquier as our local go module expert: Do you know a way out of this?

api/api.go Outdated Show resolved Hide resolved
api/api.go Outdated Show resolved Hide resolved
api/api.go Outdated Show resolved Hide resolved
@beorn7
Copy link
Member Author

beorn7 commented Feb 6, 2019

Hmm, I just pushed a whitespace change to a branch freshly branched off master, and there the go generate works, but then errcheck finds a lot of problems all of a sudden.

@mxinden
Copy link
Member

mxinden commented Feb 6, 2019

Interesting. Ignoring err on Log should be excluded. Will look further into that.

Travis build link for reference: https://travis-ci.org/prometheus/alertmanager/builds/489679923

@beorn7
Copy link
Member Author

beorn7 commented Feb 6, 2019

I think a bunch of things in our build are pulled in with current versions. That might explain changing behavior (and is at odds with the go-modules approach of reproducible builds).

@beorn7
Copy link
Member Author

beorn7 commented Feb 6, 2019

I'll get to the bottom of this tomorrow with a clear head… Very mysterious. 🤔

@beorn7 beorn7 force-pushed the beorn7/api branch 2 times, most recently from 0318737 to 5374774 Compare February 7, 2019 15:21
@simonpasquier
Copy link
Member

Regarding the errcheck failures, I suspect that Travis CI messes up with GOPATH. I'm trying https://github.com/prometheus/alertmanager/pull/1748to investigate/fix it.

@beorn7
Copy link
Member Author

beorn7 commented Feb 7, 2019

Rebased on top of master with @simonpasquier 's and my fixes. Let's see what's happening now.

@beorn7
Copy link
Member Author

beorn7 commented Feb 7, 2019

Tadaaa! 🎉
It worked.
@mxinden now we can get to the normal review. Waiting for your further comments.

api/api.go Outdated
)

func init() {
prometheus.MustRegister(requestsInFlight)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a big fan of the global metric registry. It might be less verbose but also implicit and more difficult when it comes to vendoring. I am trying the steer this project away from it. What do you think in this regard? Would passing a Registerer to API.New() also be an option?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of the global registry, either. I just wanted to stay consistent with the way it is and then have a separate refactoring to make it local.

But now I have seen what you did in #1741, which is already merged. Thus, a incremental migration to local Registerers is already ongoing. Happy to follow that path here.

api/api.go Outdated Show resolved Hide resolved
api/api.go Outdated Show resolved Hide resolved
api/v1/api.go Show resolved Hide resolved
@simonpasquier simonpasquier self-requested a review February 8, 2019 13:47
@simonpasquier
Copy link
Member

Nothing much to add to @mxinden's comments from my side.

@beorn7
Copy link
Member Author

beorn7 commented Feb 8, 2019

Responded to the comments, but haven't yet pushed the commit to actually address them.
Give me some time...

beorn7 added 3 commits February 11, 2019 19:34
The default concurrency limit is max(GOMAXPROCS, 8). That should not
imply that each GET requests eats a whole CPU. It's more to get some
reasonable heuristics for the processing power of the hosting machine
(while allowing at least 8 concurrent requests even on the smallest
machines). As GET requests can easily overload the Alertmanager,
rendering it incapable of doing its main task, namely sending alert
notifications, we need to limit GET requests by default.

In contrast, no timeout is set by default. The http.TimeoutHandler
inovkes quite a bit of machinery behind the scenes, in particular an
additional layer of buffering. Thus, we should first get a bit of
experience with it before we consider enforcing a timeout by default,
even if setting a timeout is in general the safer setting for
resiliency.

Signed-off-by: beorn7 <beorn@soundcloud.com>
The context is created by the http.TimeoutHandler we use to set the
timeout.

I believe this is the only endpoint where propagating the timeout is
feasible and needed.

Signed-off-by: beorn7 <beorn@soundcloud.com>
While the newly added in-flight instrumentation works for all GET
requests, the existing HTTP instrumentation omits api/v2 calls. This
commit adds a TODO note about that.

Signed-off-by: beorn7 <beorn@soundcloud.com>
@beorn7
Copy link
Member Author

beorn7 commented Feb 12, 2019

Sorry for the delay. Got sidetracked all the time. Finally pushed that new commit. @stuartnelson3 is gone for vacation again. @mxinden please have another look.

Most importantly, `api.New` now takes an `Options` struct as an
argument, which allows some other things done here as well:

- Timout and concurrency limit are now in the options, streamlining
  the registration and the implementation of the limiting middleware.

- A local registry is used for metrics, and the metrics used so far
  inside any of the api packages are using it now.

The 'in flight' metric now contains the 'get' as a method label. I
have also added a TODO to instrument other methods in the same way
(otherwise, the label doesn't reall make sense, semantically). I have
also added an explicit error counter for requests rejected because of
the concurrency limit. (They also show up as 503s in the generic HTTP
instrumentation (or they would, if v2 were instrumented, too), but
those 503s might have a number of reasons, while users might want to
alert on concurrency limit problems explicitly).

Signed-off-by: beorn7 <beorn@soundcloud.com>
@beorn7
Copy link
Member Author

beorn7 commented Feb 13, 2019

@mxinden Do you think you could have a look at this any time soon?

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one small thing. Otherwise this looks good to me.

requestsInFlight prometheus.Gauge
concurrencyLimitExceeded prometheus.Counter
timeout time.Duration
inFlightSem chan struct{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used in the handler, right? How about initializing it in the handler constructor closure?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that we call limitHandler twice, once to handle the router handler, once to handle the apiv2 handler. If we moved the inFlightSem into limitHandler, we would limit the GET requests separately for each one. But we want one limit for all of the HTTP handling. (That was the reason why I had a function that returns a function (the middleware) that returns a function (the HandlerFunc) in the previous version. I could avoid that confusion by having inFlightSem as a member of API and limitHandler a method of API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxinden did my comment clarify the issue? Or do you think this should be changed in another way?

inFlightSem chan struct{}
}

// Options for the creation of an API object. Alerts, Silences, and StatusFunc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding detailed comments here.

@beorn7
Copy link
Member Author

beorn7 commented Feb 15, 2019

Thanks @simonpasquier . @mxinden I assume things are clarified now. We can still rework this if needed. I just would to like to get this merged as #1733 needs to address some merge conflicts with it and I have yet another change on top of all that in my pipeline. Thus, I'm merging this now to get all those ends tied up. It also unblocks your work on #1744.

@beorn7 beorn7 merged commit da6e2a8 into master Feb 15, 2019
@beorn7 beorn7 deleted the beorn7/api branch February 15, 2019 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants