
KEDA operator works slowly (or doesn't work) when some ScaledObject reconciliation produces timeouts #5083

Closed
Tracked by #5207
JorTurFer opened this issue Oct 15, 2023 · 1 comment · Fixed by #5084

JorTurFer (Member) commented on Oct 15, 2023:

Report

Recently we faced unexpected behavior when we accidentally deployed a ScaledObject targeting a server that produces timeouts (in our case, due to a network firewall). When that happened, KEDA stopped working or started working incorrectly.

I have replicated the scenario by deploying a ScaledObject with a cron trigger that adds/removes 1 instance every minute. Then I deployed a Kafka scaler targeting a Kafka cluster blocked by the network (and therefore producing timeouts during the initial setup in the reconciliation loop):
[screenshot: operator logs showing the Kafka scaler timing out during reconciliation]
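For reference, the repro setup looked roughly like the following sketch (the names, target Deployments, broker address, and exact cron schedule are made up for illustration):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cron-flapper
spec:
  scaleTargetRef:
    name: my-app              # hypothetical Deployment
  minReplicaCount: 1
  triggers:
    - type: cron
      metadata:
        timezone: Etc/UTC
        start: "*/2 * * * *"    # even minutes -> 2 replicas
        end: "1-59/2 * * * *"   # odd minutes -> back to minReplicaCount
        desiredReplicas: "2"
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-blocked
spec:
  scaleTargetRef:
    name: my-other-app        # hypothetical Deployment
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: blocked.kafka.example:9092  # unreachable -> timeouts
        consumerGroup: my-group
        topic: my-topic
        lagThreshold: "10"
```

With the broker unreachable, the Kafka scaler's initial setup blocks until the connection times out on every reconciliation attempt.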

After this, KEDA stopped working because the operator was stuck in the reconciliation loop, even though I had set GOMAXPROCS: 8, so it's not a matter of goroutines; it looks as if we have a deadlock at some point.
The only workaround in this case has been removing the misconfigured ScaledObject.

In our internal case we are using v2.11.2, so the problem is not specific to the latest version.

Expected Behavior

KEDA must prevent service disruptions caused by a single misconfigured ScaledObject.

Actual Behavior

KEDA works slowly or stops working entirely, affecting all ScaledObjects.

JorTurFer added the bug (Something isn't working) label on Oct 15, 2023
JorTurFer (Member, Author) commented:

I think the problem is related to this code:

```go
func (h *scaleHandler) performGetScalersCache(ctx context.Context, key string, scalableObject interface{}, scalableObjectGeneration *int64, scalableObjectKind, scalableObjectNamespace, scalableObjectName string) (*cache.ScalersCache, error) {
	h.scalerCachesLock.RLock()
	if cache, ok := h.scalerCaches[key]; ok {
		// generation was specified -> let's include it in the check as well
		if scalableObjectGeneration != nil {
			if cache.ScalableObjectGeneration == *scalableObjectGeneration {
				h.scalerCachesLock.RUnlock()
				return cache, nil
			}
		} else {
			h.scalerCachesLock.RUnlock()
			return cache, nil
		}
	}
	h.scalerCachesLock.RUnlock()
	h.scalerCachesLock.Lock()
	defer h.scalerCachesLock.Unlock()
	if cache, ok := h.scalerCaches[key]; ok {
		// ... (snippet truncated; the scalers are generated further down, while the write lock is held)
```

After the Lock() we generate the scalers, and that is where the timeouts happen, so we hold the cache lock for the whole duration of the timeout. I have to dig deeper in this direction, but I reckon we only need to lock the cache to set the value, not during the whole generation of the cache entry (so that a timeout doesn't block the whole cache).
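As an illustration of that direction, here is a minimal sketch, assuming a simplified handler and a hypothetical build callback (this is not the actual KEDA implementation, and the generation check is omitted for brevity): the cache entry is generated outside the write lock, which is taken only to publish the result.

```go
package scaling

import (
	"context"
	"sync"
)

// scalersCache stands in for KEDA's cache.ScalersCache (hypothetical stub).
type scalersCache struct{}

type scaleHandler struct {
	scalerCachesLock sync.RWMutex
	scalerCaches     map[string]*scalersCache
}

// getScalersCache builds the entry OUTSIDE the write lock, so a scaler that
// times out during creation cannot block every other reconciliation.
func (h *scaleHandler) getScalersCache(ctx context.Context, key string,
	build func(context.Context) (*scalersCache, error)) (*scalersCache, error) {
	// Fast path: read lock only.
	h.scalerCachesLock.RLock()
	if c, ok := h.scalerCaches[key]; ok {
		h.scalerCachesLock.RUnlock()
		return c, nil
	}
	h.scalerCachesLock.RUnlock()

	// Slow path: generate the scalers without holding any lock. This is
	// the step that can take as long as the connection timeout.
	c, err := build(ctx)
	if err != nil {
		return nil, err
	}

	// Lock only to publish the value; re-check in case another goroutine
	// stored an entry meanwhile (keep theirs, drop ours).
	h.scalerCachesLock.Lock()
	defer h.scalerCachesLock.Unlock()
	if existing, ok := h.scalerCaches[key]; ok {
		return existing, nil
	}
	h.scalerCaches[key] = c
	return c, nil
}
```

The trade-off of this pattern is that two goroutines may occasionally build the same entry concurrently and one result gets discarded, which seems far cheaper than serializing every reconciliation behind a scaler that is waiting on a connection timeout.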
