Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding circuit breakers on ingester server side for write path #8180

Merged
merged 16 commits into from
Jun 4, 2024

Conversation

duricanikolic
Copy link
Contributor

@duricanikolic duricanikolic commented May 24, 2024

What this PR does

Similarly to what was done in #5650, when circuit breakers were added to ingester client side, this PR adds circuit breakers to the server side.

Which issue(s) this PR fixes or relates to

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@duricanikolic duricanikolic self-assigned this May 24, 2024
@duricanikolic duricanikolic force-pushed the yuri/ingester-server-cb branch 15 times, most recently from 8666953 to 26fd19e Compare May 25, 2024 07:47
@duricanikolic duricanikolic marked this pull request as ready for review May 27, 2024 04:59
@duricanikolic duricanikolic requested review from jdbaldry and a team as code owners May 27, 2024 04:59
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Yuri. I haven't reviewed tests yet, but I have enough comments to send this first batch now.

pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/metrics.go Outdated Show resolved Hide resolved
@pstibrany
Copy link
Member

pstibrany commented May 28, 2024

In internal discussion with Yuri, I proposed to change the direction of this PR, and use "Standalone Usage" of circuit breakers: try to acquire permit from circuit breaker in StartPushRequest, and then report success/failure after push request. To detect push request running for too long, we don't need context, just take timing of the push request, and report failure to CB if it's over specified timeout. I think by not using executor-pattern we can simplify the code quite a bit.

@duricanikolic duricanikolic force-pushed the yuri/ingester-server-cb branch 2 times, most recently from 134a503 to cd7e066 Compare May 28, 2024 15:23
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't finished my review yet, because I've run out of time. Will continue tomorrow.

pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Show resolved Hide resolved
pkg/ingester/circuitbreaker_test.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker_test.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker_test.go Outdated Show resolved Hide resolved
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next batch of comments.

pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
}
if shouldFinish {
defer i.finishPushRequest(reqSize)
defer func() {
i.finishPushRequest(ctx, reqSize, time.Since(start), acquiredCircuitBreakerPermit, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be fine to just call i.FinishPushRequest. For that we need to preserve ctx returned by startPushRequest, but we need to do that anyway if we want to update pushRequestState in the defer statement below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have applied your suggestion. The only thing to be aware of was if a call to this PushWithCleanup() actually acquired a permit (during a call to i.startPushRequest()). In that case, the value of pushRequestState.acquiredCircuitBreakerPermit also has to be updated in the defer statement below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, since in the previous version of this implementation was:

_, shouldFinish, err := i.startPushRequest(ctx, reqSize)
if err != nil {
	return middleware.DoNotLogError{Err: err}
}
if shouldFinish {
	defer i.finishPushRequest(reqSize)
}

does it mean that we had a bug?

pkg/ingester/ingester.go Outdated Show resolved Hide resolved
return cb.active.Load()
}

func (cb *circuitBreaker) setActive() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename this to:

Suggested change
func (cb *circuitBreaker) setActive() {
func (cb *circuitBreaker) activate() {

I would expect setActive to receive an active bool arg to set active property to a desired value, but this method only activates the circuit breaker.

Comment on lines 172 to 175
if cb == nil {
return false
}
return cb.active.Load()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make this shorter:

Suggested change
if cb == nil {
return false
}
return cb.active.Load()
return cb != nil && cb.active.Load()

f.DurationVar(&cfg.PushTimeout, prefix+"push-timeout", defaultPushTimeout, "How long is execution of ingester's Push supposed to last before it is reported as timeout in a circuit breaker. This configuration is used for circuit breakers only, and timeout expirations are not reported as errors")
}

type circuitBreaker struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
type circuitBreaker struct {
// circuitBreaker abstracts the ingester's server-side circuit breaker functionality.
// A nil *circuitBreaker is a valid noop implementation.
type circuitBreaker struct {

pkg/ingester/circuitbreaker.go Outdated Show resolved Hide resolved
Comment on lines 190 to 206
func (cb *circuitBreaker) tryAcquirePermit() (bool, error) {
if !cb.isActive() {
return false, nil
}
if !cb.cb.TryAcquirePermit() {
cb.metrics.circuitBreakerResults.WithLabelValues(resultOpen).Inc()
return false, newCircuitBreakerOpenError(cb.cb.RemainingDelay())
}
return true, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect that if tryAcquirePermit returns false, nil, then I don't have permit to proceed. However, it's not like that.

I would try to refactor the method to either avoid having to read the comment to understand the return values (what's the first return value? prod code calls it "acquired" while tests call it "status") or at least to force the user to think about it.

The comment above says:

If it was not possible to acquire a permit, success flag false is returned. In this case no call to finishPushRequest is needed.

Which puts logic in the caller of this method who shouldn't care whether it's active or not, and I think we can do that better by just returning a callback:

Suggested change
func (cb *circuitBreaker) tryAcquirePermit() (bool, error) {
if !cb.isActive() {
return false, nil
}
if !cb.cb.TryAcquirePermit() {
cb.metrics.circuitBreakerResults.WithLabelValues(resultOpen).Inc()
return false, newCircuitBreakerOpenError(cb.cb.RemainingDelay())
}
return true, nil
}
func (cb *circuitBreaker) tryAcquirePermit() (finish func(duration time.Duration, pushErr error), err error)
if !cb.isActive() {
return func(_ time.Duration, _ error,) {}, nil
}
if !cb.cb.TryAcquirePermit() {
cb.metrics.circuitBreakerResults.WithLabelValues(resultOpen).Inc()
return nil, newCircuitBreakerOpenError(cb.cb.RemainingDelay())
}
return cb.finishPushRequest, nil
}

Then the caller doesn't have more logic than just:

finish, err := cb.tryAquirePermit()
if err != nil {
    return ..., err
}
// ...
finish(time.Since(t0), err)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I understand your rationale (client just calls returned function), it looks like slightly over-engineered solution to a trivial problem, where a simple boolean works just fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also keep the current implementation as it is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like slightly over-engineered solution

Closures are first class citizens just like bools :)

to a trivial problem

The difference is that it removes the problem of having to decide things from the caller.

return true, nil
}

func (cb *circuitBreaker) recordResult(err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can you move this method below finishPushRequest()?

Comment on lines 401 to 404
if cfg.CircuitBreakerConfig.Enabled {
// We create an inactive circuit breaker, which will be activated on a successful completion of starting.
i.circuitBreaker = newCircuitBreaker(cfg.CircuitBreakerConfig, false, logger, registerer)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given you're passing the entire cfg.CircuitBreakerConfig to newCircuitBreaker, and nil instance of *circuitBreaker is valid, can we move this logic to newCircuitBreaker?

Otherwise this method is taking decisions based on that entity's config and if it's false, it relies on that entity's nil behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have moved the logic in newCircuitBreaker().

Comment on lines 647 to 653
if i.cfg.CircuitBreakerConfig.InitialDelay == 0 {
i.circuitBreaker.setActive()
} else {
time.AfterFunc(i.cfg.CircuitBreakerConfig.InitialDelay, func() {
i.circuitBreaker.setActive()
})
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, IMO, this logic should be inside of newCircuitBreaker.
This method shouldn't read circuit breaker's config and take decisions on it, it should just call i.circuitBreaker.onStarting() or something similar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should be inside of newCircuitBreaker specifically (how would we know when ingester finished starting?), but moving it to circuit breaker method sounds fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, sorry, I wrote half comment and then realised that this wasn't part of newIngester anymore. I agree, it shouldn't be in newCircuitBreaker, but should be in a method of circuit breaker (as I stated in the second part of my comment 😂 )

Copy link
Contributor Author

@duricanikolic duricanikolic Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have moved the logic in the circuitBreaker.activate() method: if cfg.InitialDelay is 0, the circuit breaker gets activated immediately. Otherwise it gets activated after the initial delay.

Then, ingester.startig() will just call i.circuitbreaker.activate(), which will activate the circuit breaker in the right moment. WDYT @colega?

pkg/ingester/ingester.go Show resolved Hide resolved
pkg/ingester/ingester.go Show resolved Hide resolved
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! Thank you very much for addressing all my feedback!

@duricanikolic duricanikolic merged commit 011c02f into main Jun 4, 2024
29 checks passed
@duricanikolic duricanikolic deleted the yuri/ingester-server-cb branch June 4, 2024 17:37
duricanikolic added a commit that referenced this pull request Jun 4, 2024
* Adding circuit breakers on ingester server side for write path

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findigs

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Implementing the gauge as NewGaugeFunc

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing lint issues

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Adding test for hitting deadline when ingester.Push is used

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fix additional review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Get rid of finishPushRequest

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Add unit test for startPushRequest

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Activate circuit breaker on a successful completion of ingester.starting

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Rename the output error of ingester.PushWithCleanup()

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Get rid of test-delay key from in context

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

---------

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
narqo pushed a commit to narqo/grafana-mimir that referenced this pull request Jun 6, 2024
…na#8180)

* Adding circuit breakers on ingester server side for write path

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findigs

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Implementing the gauge as NewGaugeFunc

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing lint issues

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Adding test for hitting deadline when ingester.Push is used

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fix additional review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Get rid of finishPushRequest

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Add unit test for startPushRequest

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Activate circuit breaker on a successful completion of ingester.starting

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Rename the output error of ingester.PushWithCleanup()

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Get rid of test-delay key from in context

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

---------

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants