
Set a default for consumer_timeout to 15 minutes #2990

Merged
merged 1 commit into master from consumer-timeout-default on Apr 21, 2021

Conversation

kjnilsson
Contributor

@kjnilsson commented on Apr 20, 2021

So that faulty consumers that never ack a pending message have their channels closed after 15 minutes.

Without this change, quorum queue compaction can be severely delayed by a consumer that never acknowledges a delivery, running the node out of disk space much faster than anticipated.

@michaelklishin
Member

We agreed to document this in the advanced.config example as part of this PR.
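A minimal sketch of what such an advanced.config example could look like, assuming the 15-minute default introduced by this PR (the value is in milliseconds):

```erlang
%% advanced.config (sketch)
[
  {rabbit, [
      %% delivery acknowledgement timeout: 15 minutes, in milliseconds
      {consumer_timeout, 900000}
  ]}
].
```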

So that faulty consumers that never ack a pending message have their channels closed after 15 minutes.
@michaelklishin merged commit 80e3992 into master on Apr 21, 2021
@michaelklishin deleted the consumer-timeout-default branch on April 21, 2021 20:39
michaelklishin added a commit that referenced this pull request Apr 21, 2021
Set a default for consumer_timeout

(cherry picked from commit 80e3992)
@michaelklishin
Member

Backported to v3.8.x.

@michaelklishin added this to the 3.8.15 milestone on Apr 21, 2021
@someniatko

someniatko commented May 7, 2021

Guys, I'm not sure it was worth making such a change in a minor release. There are legitimate cases in which you would not ACK a message for a long period of time.

For me, the change of the default setting caused a BC break that led to application failures (it processed the same message again and again).

And it was really hard to find any information about this behavior by googling or even by searching the docs on the website. I came across this PR by chance. I'm sure there are a lot of other apps relying on the lack of an ACK timeout by default.

@someniatko

@kjnilsson

@jdavidheiser

jdavidheiser commented May 9, 2021

I've seen a few reports of this causing breaking changes in the Airflow community, where some of the example Airflow implementations that folks start with are pinned to version 3.8 but not a minor version. I think this was a pretty unfortunate choice to release in a minor version update.

@michaelklishin
Member

michaelklishin commented May 11, 2021

There is no timeout value that would work for everyone. Not having a timeout is not an option: without it, a buggy consumer will eventually prevent its quorum queue from performing on-disk data compaction. We waited for over a year before picking a default, with real-world evidence of exactly that kind of buggy consumer. Only then did we introduce a timeout that is objectively a lot higher than the average delivery processing time in well over 90% of systems (in most apps it takes fractions of a second or seconds).

The value is configurable. You can set it to several hours if you want, or even disable it. This is how it works with defaults: no value works for everyone, but some values work for the 90% of people who would never notice the change. Time to move on.
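For illustration, a sketch of raising the timeout via rabbitmq.conf, using the consumer_timeout key later documented in the Consumers guide (value in milliseconds):

```ini
# rabbitmq.conf (sketch)
# raise the delivery acknowledgement timeout to 4 hours
consumer_timeout = 14400000
```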

@michaelklishin
Member

michaelklishin commented May 11, 2021

@someniatko I'm going to do the unspeakable and argue that most apps don't process deliveries for over 15 minutes. Some do but not a lot. Your case is an outlier. Sorry to be so blunt but we talk to RabbitMQ users every single day.

What can be done to make this easier to understand is an improved error message. It can mention that the timeout can be controlled by the operator.

@jdavidheiser

A better error message is definitely needed if you insist on backporting the change in a minor version update.

@michaelklishin changed the title from "Set a default for consumer_timeout" to "Set a default for consumer_timeout to 15 minutes" on May 11, 2021
@michaelklishin
Member

michaelklishin commented May 11, 2021

We will add a dedicated doc section to the Consumers guide, change the message to mention the configurability, and bump the timeout 2x, to 30 minutes. 1 hour sounds too long to several members of our team. We believe a very high percentage of consumers do not process deliveries for anywhere near 30 minutes, but maybe a 30m default would cover a few more percent compared to 15m.

michaelklishin added a commit to rabbitmq/rabbitmq-website that referenced this pull request May 11, 2021
michaelklishin added a commit that referenced this pull request May 11, 2021
Apparently 15 minutes is not enough for some. 1 hour seems to be
unreasonably long to our team, though.

References #2990, #3032
michaelklishin added a commit that referenced this pull request May 11, 2021
in the logs and consumer channel error message.

References #2990, #3032.
@samjsharpe

@michaelklishin how would one disable it completely?

Sorry to ask, but it's not clear from the docs and I'm unfortunately not great at Erlang or with rabbitmq-server internals.

I think I see a validator in the schema for non_zero_positive_integer, and also that get_consumer_timeout() returns undefined if the value is not an integer, but I don't know whether that means I can set it to 0 or -1, or what the maximum value is if I have to set it to a big number.

@chadknutson

Setting consumer_timeout to false appears to disable this feature. I agree with others above that this needs to be documented.
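For example, a sketch of what that could look like in advanced.config; given the non_zero_positive_integer validator mentioned above, false presumably cannot be set through rabbitmq.conf and has to go into advanced.config:

```erlang
%% advanced.config (sketch)
[
  {rabbit, [
      %% disable the delivery acknowledgement timeout entirely
      {consumer_timeout, false}
  ]}
].
```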

@c2nes

c2nes commented May 24, 2021

The release notes for 3.8.15 suggest this change only applied to quorum queues, but based on the documentation added here it seems like this change applies to all consumers. Is that correct? If so, could the release notes for 3.8.15 be amended?

@nahum-litvin-hs

Hi, this just broke one of our environments. Why is this part of a minor update?
How can it be disabled using the management app?

@lhoguin
Contributor

lhoguin commented Aug 27, 2021

@nahum-litvin-hs You need to configure consumer_timeout to false: https://www.rabbitmq.com/consumers.html#acknowledgement-timeout

@nahum-litvin-hs

@nahum-litvin-hs You need to configure consumer_timeout to false: https://www.rabbitmq.com/consumers.html#acknowledgement-timeout

Do you have any idea how this can be done using the API?

@chrisjsewell

chrisjsewell commented Aug 30, 2021

Do you have any idea how this can be done using the API?

Yep, similar question: does this have to be set from the server side, or can it be set from the client side (i.e. via the connection URL)? (Edit: I guess not currently: https://www.rabbitmq.com/uri-query-parameters.html#tls.)
And if not, could you make this the case 😬?

I think this would be a good compromise for use cases like mine, where we don't have "control" of the server side; like the heartbeat, it is something the client should have a say in. (Opened #3344.)

@joekohlsdorf

@someniatko I'm going to do the unspeakable and argue that most apps don't process deliveries for over 15 minutes. Some do but not a lot. Your case is an outlier. Sorry to be so blunt but we talk to RabbitMQ users every single day.

An outlier??
Celery is a widely used framework for asynchronous task processing in Python.
It has a feature called ETA tasks, used for retrying failed tasks and for scheduling tasks in the future. It works by reading the task from RabbitMQ and putting it in a worker-internal queue until the ETA is reached. To avoid losing the task in case of a worker failure, it is not ACKed immediately; the task is only ACKed after it has been processed. With retryable tasks this can be hours in the future, depending on your backoff strategy.
Every Celery installation is now broken because retryable tasks are not working properly anymore with default broker settings.
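To illustrate the pattern being described, a hypothetical Python sketch (task names, arguments and retry settings are made up; the point is that ETA and retry deliveries sit unacked on a worker, possibly for hours):

```python
from datetime import datetime, timedelta, timezone
from celery import Celery

app = Celery("tasks", broker="amqp://guest:guest@localhost:5672//")

@app.task(bind=True, acks_late=True, max_retries=10)
def send_invoice(self, invoice_id):
    try:
        print(f"sending invoice {invoice_id}")  # placeholder for the real work
    except Exception as exc:
        # self.retry() re-publishes the task with an ETA in the future; the
        # redelivered message again sits unacked on a worker until that ETA.
        raise self.retry(exc=exc, countdown=2 * 60 * 60)

# An ETA task: a worker prefetches the message right away, but only acks it
# once the ETA has passed and the task has run, which can be hours later,
# well past a 15- or 30-minute consumer_timeout.
send_invoice.apply_async(
    args=[42],
    eta=datetime.now(timezone.utc) + timedelta(hours=6),
)
```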

@sojournerc

Our use case is similar to Celery's: scheduling notifications to be sent in the future, sometimes days ahead. We use the manual ack to ensure the notification is sent at least once and only once. Just a +1 for delayed acks.

@langemeijer

I am baffled by this conversation.

Let me state first that I very much respect the work that has been done by you and your fellow developers. RabbitMQ is a very stable and robust piece of software. I very much appreciate the time that you have put into it.

But: you cannot change the fundamental workings of something that is such an important piece of infrastructure for projects run by so many people and businesses.

I stumbled on this PR by accident, because we have been debugging a very long-running background process on a development machine. Even in a closely monitored situation with only a few jobs in the queue, this still cost me a few hours before I figured it out. We were very lucky. We are running 3.8.14 in production, the last minor release that would not kill my software and data. By chance, the development machine had a 3.9.x version that was affected by this feature.

I run a few job queues on a RabbitMQ cluster. I have designed a distributed system where the jobs submitted to these queues should only ever run once per queue message. This has now changed, in a minor version no less, because a new feature had to be added and it was decided it should be enabled by default too.

most apps don't process deliveries for over 15 minutes.

This will affect many more people running RabbitMQ, and it is certainly no edge-case. Sure, RabbitMQ might process millions of messages for every message that runs over 15 (or now 30) minutes, but the problem is that most users will not notice any problems this will cause until it actually happens. And when it does the effects can be huge.

I seriously don't understand why this feature has been implemented as default-on. I use RabbitMQ as a queue server. RabbitMQ does not manage my worker processes. This is excellent: I can write my own code around this, scale workers up and down as needed, and start specific workers only when the need arises. Around these workers I have monitoring set up that assures me that my workers are happy little ants. If my ants fall asleep doing their business, it should not be of any concern to the stack of work that is waiting for them.

In my opinion RabbitMQ has no business meddling with my workers' happiness. It is advertised as a message broker.

Not having a timeout is not an option: without it, a buggy consumer will eventually prevent its quorum queue from performing on disk data compaction.

I don't think I have ever been in this situation; if I have, I didn't notice. Still, I can imagine lots of better ways to handle this. For example:

  • Allow for better monitoring, for example by providing an API call to inspect current queue consumers, the queue messages they have received, and the time they have been chewing on them
  • Implement a high-watermark-like system for this situation, not allowing any new messages into queues, similar to the memory usage watermark.
  • Just do not try to fix buggy consumers.

I can blow up my RabbitMQ servers, my Redis instances, my database servers, or overload my load balancers in many ways by doing stupid things. Some days I don't. Why do I need protection from my consumers blocking disk data compaction?

I will have to disable the timeout anyway, because it is punching a hole in the reliability of my systems. So I need to know what I have to do to avoid the situation your fix is for, or at least monitor for it.

It is quite possible that I am wrong and not seeing things correctly, maybe because I am missing the bigger picture here.
Please explain to me, and to other users why this change is made.

@michaelklishin
Member

michaelklishin commented Apr 1, 2022

@langemeijer well, we have seen nodes running out of disk because of stuck consumers. No one changes things just for kicks. The modern default is 30m, and very few consumers take longer than that. It's very convenient to assume that every user runs into what you run into and does things the same way. The core team sees all kinds of people, because they eventually come to the mailing list, Slack and so on with questions. And of course they do not observe or foresee nodes running out of disk space because of their apps until they hit it. This kind of thinking is extremely common in this field and others.

Implementing a high watermark that blocks queues is extremely complex and very likely to backfire. A timeout is
very easy to reason about, and works the same way for everyone. It is also extremely easy to explain and disable.

Why do some need protection from stuck consumers? Because many do not run with sufficient monitoring on their apps. And yet when they run out of disk space, it can be very time-consuming to determine what happened. We can't afford to spend several engineer-days on that for free every single time, so we introduced a very effective solution that is easy to reason about. It may be blunt, but as with many engineering and product decisions, sometimes you have to pick the least of N evils.

@michaelklishin
Member

michaelklishin commented Apr 1, 2022

As for what can be monitored, tracking how long quorum queue consumers take to process a delivery (95th and 99th percentiles) should be enough. But it can be too late to save a node, or multiple nodes, from exhausting all disk space, because consumer behaviour prevents Raft log truncation from happening. So monitoring is great but not enough. A hard limit that allows nodes to "protect" themselves from such abuse would still be necessary.

@michaelklishin
Member

michaelklishin commented Apr 1, 2022

What baffles me in this discussion is how the reasoning behind this change is ignored or devalued. The maintainers are blamed for changing things to address a real problem that can lead to catastrophic cluster availability events. I'm afraid changing software is the only way to improve it, and RabbitMQ rarely introduces client-facing changes. You can take a RabbitMQ client from 2010 and it will still successfully connect and perform all key operations against a release produced in 2022.

In my opinion RabbitMQ has no business meddling with my workers' happiness

But you are fine with consumers doing things that prevent RabbitMQ from reclaiming disk space. Again, the concerns of infrastructure developers are ignored and devalued. Changing your app or one line of configuration is too much to ask for. Software that you get for free, including some degree of free support, apparently cannot have an opinion or guard rails in place, even if that objectively improves resiliency in certain areas and frees up the core team's time.

Just set the timeout higher than 30m if you are willing to take the risk, and move on.

@rabbitmq locked and limited conversation to collaborators on Apr 1, 2022
The mergify bot added the make label on Apr 1, 2022