
KEDA operator works slowly (or doesn't work) when some ScaledObject reconciliation produces timeouts #5083

Closed
Tracked by #5207
JorTurFer opened this issue Oct 15, 2023 · 1 comment · Fixed by #5084

JorTurFer (Member) commented on Oct 15, 2023:

Report

Recently we faced unexpected behavior when we accidentally deployed a ScaledObject targeting a server that produces timeouts (in our case, due to a network firewall). When that happened, KEDA stopped working or started working incorrectly.

I have replicated the scenario by deploying a ScaledObject with a cron trigger that adds/removes 1 instance every minute. Then I deployed a Kafka scaler targeting a Kafka cluster blocked by the network (and therefore producing timeouts during the initial setup in the reconciliation loop):
[screenshot: operator logs showing the Kafka scaler timing out during reconciliation]
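For reference, the repro setup looked roughly like the following sketch (the names, target Deployments, broker address, and exact cron schedule are made up for illustration):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cron-flapper
spec:
  scaleTargetRef:
    name: my-app              # hypothetical Deployment
  minReplicaCount: 1
  triggers:
    - type: cron
      metadata:
        timezone: Etc/UTC
        start: "*/2 * * * *"    # even minutes -> 2 replicas
        end: "1-59/2 * * * *"   # odd minutes -> back to minReplicaCount
        desiredReplicas: "2"
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-blocked
spec:
  scaleTargetRef:
    name: my-other-app        # hypothetical Deployment
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: blocked.kafka.example:9092  # unreachable -> timeouts
        consumerGroup: my-group
        topic: my-topic
        lagThreshold: "10"
```

With the broker unreachable, the Kafka scaler's initial setup blocks until the connection times out on every reconciliation attempt.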

After this, KEDA stopped working because the operator was stuck in the reconciliation loop, even though I had set GOMAXPROCS: 8, so it's not a matter of goroutines; it looks as if we have a deadlock at some point.
The only workaround in this case has been removing the misconfigured ScaledObject.

In our internal case we are using v2.11.2, so the problem is not specific to the latest version.

Expected Behavior

KEDA must prevent service disruptions caused by a single misconfigured ScaledObject.

Actual Behavior

KEDA works slowly or stops working entirely, affecting all ScaledObjects.

JorTurFer added the bug (Something isn't working) label on Oct 15, 2023
JorTurFer (Member, Author) commented:

I think the problem is related to this code:

```go
func (h *scaleHandler) performGetScalersCache(ctx context.Context, key string, scalableObject interface{}, scalableObjectGeneration *int64, scalableObjectKind, scalableObjectNamespace, scalableObjectName string) (*cache.ScalersCache, error) {
	h.scalerCachesLock.RLock()
	if cache, ok := h.scalerCaches[key]; ok {
		// generation was specified -> let's include it in the check as well
		if scalableObjectGeneration != nil {
			if cache.ScalableObjectGeneration == *scalableObjectGeneration {
				h.scalerCachesLock.RUnlock()
				return cache, nil
			}
		} else {
			h.scalerCachesLock.RUnlock()
			return cache, nil
		}
	}
	h.scalerCachesLock.RUnlock()
	h.scalerCachesLock.Lock()
	defer h.scalerCachesLock.Unlock()
	if cache, ok := h.scalerCaches[key]; ok {
		// ... (snippet truncated; the scalers are generated further down, while the write lock is held)
```

After the Lock() we generate the scalers, and that is where the timeouts happen, so we hold the cache lock for the whole duration of the timeout. I have to dig deeper in this direction, but I reckon we only need to lock the cache to set the value, not during the whole generation of the cache entry (so that a timeout doesn't block the whole cache).
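As an illustration of that direction, here is a minimal sketch, assuming a simplified handler and a hypothetical build callback (this is not the actual KEDA implementation, and the generation check is omitted for brevity): the cache entry is generated outside the write lock, which is taken only to publish the result.

```go
package scaling

import (
	"context"
	"sync"
)

// scalersCache stands in for KEDA's cache.ScalersCache (hypothetical stub).
type scalersCache struct{}

type scaleHandler struct {
	scalerCachesLock sync.RWMutex
	scalerCaches     map[string]*scalersCache
}

// getScalersCache builds the entry OUTSIDE the write lock, so a scaler that
// times out during creation cannot block every other reconciliation.
func (h *scaleHandler) getScalersCache(ctx context.Context, key string,
	build func(context.Context) (*scalersCache, error)) (*scalersCache, error) {
	// Fast path: read lock only.
	h.scalerCachesLock.RLock()
	if c, ok := h.scalerCaches[key]; ok {
		h.scalerCachesLock.RUnlock()
		return c, nil
	}
	h.scalerCachesLock.RUnlock()

	// Slow path: generate the scalers without holding any lock. This is
	// the step that can take as long as the connection timeout.
	c, err := build(ctx)
	if err != nil {
		return nil, err
	}

	// Lock only to publish the value; re-check in case another goroutine
	// stored an entry meanwhile (keep theirs, drop ours).
	h.scalerCachesLock.Lock()
	defer h.scalerCachesLock.Unlock()
	if existing, ok := h.scalerCaches[key]; ok {
		return existing, nil
	}
	h.scalerCaches[key] = c
	return c, nil
}
```

The trade-off of this pattern is that two goroutines may occasionally build the same entry concurrently and one result gets discarded, which seems far cheaper than serializing every reconciliation behind a scaler that is waiting on a connection timeout.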
