
Set a default for consumer_timeout to 15 minutes #2990

Merged
merged 1 commit into master from consumer-timeout-default on Apr 21, 2021

Conversation

kjnilsson
Contributor

@kjnilsson commented on Apr 20, 2021

So that faulty consumers that never ack a pending message have their channels closed after 15 minutes.

Without this change, quorum queue compaction can be severely delayed by a consumer that never acknowledges a delivery, running the node out of disk space much faster than anticipated.

@michaelklishin
Member

We agreed to document this in the advanced.config example as part of this PR.
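A minimal sketch of what such an advanced.config example could look like, assuming the 15-minute default introduced by this PR (the value is in milliseconds):

```erlang
%% advanced.config (sketch)
[
  {rabbit, [
      %% delivery acknowledgement timeout: 15 minutes, in milliseconds
      {consumer_timeout, 900000}
  ]}
].
```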

So that faulty consumers that never ack a pending message have their channels closed after 15 minutes.
@michaelklishin merged commit 80e3992 into master on Apr 21, 2021
@michaelklishin deleted the consumer-timeout-default branch on April 21, 2021 20:39
michaelklishin added a commit that referenced this pull request Apr 21, 2021
Set a default for consumer_timeout

(cherry picked from commit 80e3992)
@michaelklishin
Member

Backported to v3.8.x.

@michaelklishin added this to the 3.8.15 milestone on Apr 21, 2021
@someniatko

someniatko commented May 7, 2021

Guys, I'm not sure it was worth making such a change in a minor release. There are legitimate cases in which you would not ACK a message for a long period of time.

For me, the change of the default setting caused a BC break that led to application failures (it processed the same message again and again).

And it was really hard to find any information about this behavior by googling or even by searching the docs on the website. I came across this PR by chance. I'm sure there are a lot of other apps relying on the lack of an ACK timeout by default.

@someniatko

@kjnilsson

@jdavidheiser

jdavidheiser commented May 9, 2021

I've seen a few reports of this causing breaking changes in the Airflow community, where some of the example Airflow implementations that folks start with are pinned to version 3.8 but not a minor version. I think this was a pretty unfortunate choice to release in a minor version update.

@michaelklishin
Member

michaelklishin commented May 11, 2021

There is no timeout value that would work for everyone. Not having a timeout is not an option: without it, a buggy consumer will eventually prevent its quorum queue from performing on-disk data compaction. We waited for over a year before picking a default, with real-world evidence of exactly that kind of buggy consumer. Only then did we introduce a timeout that is objectively a lot higher than the average delivery processing time in well over 90% of systems (in most apps it takes fractions of a second or seconds).

The value is configurable. You can set it to several hours if you want, or even disable it. This is how it works with defaults: no value works for everyone, but some values work for the 90% of people who would never notice the change. Time to move on.
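For illustration, a sketch of raising the timeout via rabbitmq.conf, using the consumer_timeout key later documented in the Consumers guide (value in milliseconds):

```ini
# rabbitmq.conf (sketch)
# raise the delivery acknowledgement timeout to 4 hours
consumer_timeout = 14400000
```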

@michaelklishin
Member

michaelklishin commented May 11, 2021

@someniatko I'm going to do the unspeakable and argue that most apps don't process deliveries for over 15 minutes. Some do but not a lot. Your case is an outlier. Sorry to be so blunt but we talk to RabbitMQ users every single day.

What can be done to make this easier to understand is an improved error message. It can mention that the timeout can be controlled by the operator.

@jdavidheiser

A better error message is definitely needed if you insist on backporting the change in a minor version update.

@michaelklishin changed the title from "Set a default for consumer_timeout" to "Set a default for consumer_timeout to 15 minutes" on May 11, 2021
@michaelklishin
Member

michaelklishin commented May 11, 2021

We will add a dedicated doc section to the Consumers guide, change the message to mention the configurability, and bump the timeout 2x, to 30 minutes. 1 hour sounds too long to several members of our team. We believe a very high percentage of consumers do not process deliveries for anywhere near 30 minutes, but maybe a 30m default would cover a few more percent compared to 15m.

michaelklishin added a commit to rabbitmq/rabbitmq-website that referenced this pull request May 11, 2021
michaelklishin added a commit that referenced this pull request May 11, 2021
Apparently 15 minutes is not enough for some. 1 hour seems to be
unreasonably long to our team, though.

References #2990, #3032
michaelklishin added a commit that referenced this pull request May 11, 2021
in the logs and consumer channel error message.

References #2990, #3032.
@samjsharpe

@michaelklishin how would one disable it completely?

Sorry to ask, but it's not clear from the docs and I'm unfortunately not great at Erlang or with rabbitmq-server internals.

I think I see a validator in the schema for non_zero_positive_integer, and also that get_consumer_timeout() returns undefined if the value is not an integer, but I don't know whether that means I can set it to 0 or -1, or what the maximum value is if I have to set it to a big number.

@chadknutson

Setting consumer_timeout to false appears to disable this feature. I agree with others above that this needs to be documented.
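For example, a sketch of what that could look like in advanced.config; given the non_zero_positive_integer validator mentioned above, false presumably cannot be set through rabbitmq.conf and has to go into advanced.config:

```erlang
%% advanced.config (sketch)
[
  {rabbit, [
      %% disable the delivery acknowledgement timeout entirely
      {consumer_timeout, false}
  ]}
].
```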

@c2nes

c2nes commented May 24, 2021

The release notes for 3.8.15 suggest this change only applied to quorum queues, but based on the documentation added here it seems like this change applies to all consumers. Is that correct? If so, could the release notes for 3.8.15 be amended?

@nahum-litvin-hs

Hi, this just broke one of our environments. Why is this part of a minor update?
How can it be disabled using the management app?

@lhoguin
Contributor

lhoguin commented Aug 27, 2021

@nahum-litvin-hs You need to configure consumer_timeout to false: https://www.rabbitmq.com/consumers.html#acknowledgement-timeout

@nahum-litvin-hs

@nahum-litvin-hs You need to configure consumer_timeout to false: https://www.rabbitmq.com/consumers.html#acknowledgement-timeout

Do you have any idea how this can be done using the API?

@chrisjsewell

chrisjsewell commented Aug 30, 2021

Do you have any idea how this can be done using the API?

Yep, similar question: does this have to be set from the server side, or can it be set from the client side (i.e. via the connection URL)? (Edit: I guess not currently: https://www.rabbitmq.com/uri-query-parameters.html#tls.)
And if not, could you make this the case 😬?

I think this would be a good compromise for use cases like mine, where we don't have "control" of the server side; like the heartbeat, it is something the client should have a say in. (Opened #3344.)

@joekohlsdorf

@someniatko I'm going to do the unspeakable and argue that most apps don't process deliveries for over 15 minutes. Some do but not a lot. Your case is an outlier. Sorry to be so blunt but we talk to RabbitMQ users every single day.

An outlier??
Celery is a widely used framework for asynchronous task processing in Python.
It has a feature called ETA tasks, used for retrying failed tasks and for scheduling tasks in the future. It works by reading the task from RabbitMQ and putting it in a worker-internal queue until the ETA is reached. To avoid losing the task in case of a worker failure, it is not ACKed immediately; the task is only ACKed after it has been processed. With retryable tasks this can be hours in the future, depending on your backoff strategy.
Every Celery installation is now broken because retryable tasks are not working properly anymore with default broker settings.
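To illustrate the pattern being described, a hypothetical Python sketch (task names, arguments and retry settings are made up; the point is that ETA and retry deliveries sit unacked on a worker, possibly for hours):

```python
from datetime import datetime, timedelta, timezone
from celery import Celery

app = Celery("tasks", broker="amqp://guest:guest@localhost:5672//")

@app.task(bind=True, acks_late=True, max_retries=10)
def send_invoice(self, invoice_id):
    try:
        print(f"sending invoice {invoice_id}")  # placeholder for the real work
    except Exception as exc:
        # self.retry() re-publishes the task with an ETA in the future; the
        # redelivered message again sits unacked on a worker until that ETA.
        raise self.retry(exc=exc, countdown=2 * 60 * 60)

# An ETA task: a worker prefetches the message right away, but only acks it
# once the ETA has passed and the task has run, which can be hours later,
# well past a 15- or 30-minute consumer_timeout.
send_invoice.apply_async(
    args=[42],
    eta=datetime.now(timezone.utc) + timedelta(hours=6),
)
```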

@sojournerc

Our use case is similar to Celery's: scheduling notifications to be sent in the future, sometimes days ahead. We use the manual ack to ensure the notification is sent at least once and only once. Just a +1 for delayed acks.

@langemeijer

I am baffled by this conversation.

Let me state first that I very much respect the work that has been done by you and your fellow developers. RabbitMQ is a very stable and robust piece of software. I very much appreciate the time that you have put into it.

But: you cannot change the fundamental workings of something that is such an important piece of infrastructure for projects run by so many people and businesses.

I stumbled on this PR by accident, because we have been debugging a very long-running background process on a development machine. Even in a closely monitored situation with only a few jobs in the queue, this still cost me a few hours before I figured it out. We were very lucky. We are running 3.8.14 in production, the last minor release that would not kill my software and data. By chance, the development machine had a 3.9.x version that was affected by this feature.

I run a few job queues on a RabbitMQ cluster. I have designed a distributed system where the jobs submitted to these queues should only ever run once per queue message. This has now changed, in a minor version no less, because a new feature had to be added and it was decided it should be enabled by default too.

most apps don't process deliveries for over 15 minutes.

This will affect many more people running RabbitMQ, and it is certainly no edge-case. Sure, RabbitMQ might process millions of messages for every message that runs over 15 (or now 30) minutes, but the problem is that most users will not notice any problems this will cause until it actually happens. And when it does the effects can be huge.

I seriously don't understand why this feature has been implemented as default-on. I use RabbitMQ as a queue server. RabbitMQ does not manage my worker processes. This is excellent: I can write my own code around this, scale workers up and down as needed, and start specific workers only when the need arises. Around these workers I have monitoring set up that assures me that my workers are happy little ants. If my ants fall asleep doing their business, it should not be of any concern to the stack of work that is waiting for them.

In my opinion RabbitMQ has no business meddling with my workers' happiness. It is advertised as a message broker.

Not having a timeout is not an option: without it, a buggy consumer will eventually prevent its quorum queue from performing on disk data compaction.

I don't think I have ever been in this situation; if I have, I didn't notice. Still, I can imagine lots of better ways to handle this. For example:

  • Allow for better monitoring, for example by providing an API call to inspect current queue consumers, the queue messages they have received, and the time they have been chewing on them
  • Implement a high-watermark-like system for this situation, not allowing any new messages into queues, similar to the memory usage watermark.
  • Just do not try to fix buggy consumers.

I can blow up my RabbitMQ servers, my Redis instances, my database servers, or overload my load balancers in many ways by doing stupid things. Some days I don't. Why do I need protection from my consumers blocking disk data compaction?

I will have to disable the timeout anyway, because it is punching a hole in the reliability of my systems. So I need to know what I have to do to avoid the situation your fix is for, or at least monitor for it.

It is quite possible that I am wrong and not seeing things correctly, maybe because I am missing the bigger picture here.
Please explain to me, and to other users why this change is made.

@michaelklishin
Member

michaelklishin commented Apr 1, 2022

@langemeijer well, we have seen nodes running out of disk because of stuck consumers. No one changes things just for kicks. The modern default is 30m, and very few consumers take longer than that. It's very convenient to assume that every user runs into what you run into and does things the same way. The core team sees all kinds of people, because they eventually come to the mailing list, Slack and so on with questions. And of course they do not observe or foresee nodes running out of disk space because of their apps until they hit it. This kind of thinking is extremely common in this field and others.

Implementing a high watermark that blocks queues is extremely complex and very likely to backfire. A timeout is
very easy to reason about, and works the same way for everyone. It is also extremely easy to explain and disable.

Why do some need protection from stuck consumers? Because many do not run with sufficient monitoring on their apps. And yet when they run out of disk space, it can be very time-consuming to determine what happened. We can't afford to spend several engineer-days on that for free every single time, so we introduced a very effective solution that is easy to reason about. It may be blunt, but as with many engineering and product decisions, sometimes you have to pick the least of N evils.

@michaelklishin
Member

michaelklishin commented Apr 1, 2022

As for what can be monitored, tracking how long quorum queue consumers take to process a delivery (95th and 99th percentiles) should be enough. But it can be too late to save a node, or multiple nodes, from exhausting all disk space, because consumer behaviour prevents Raft log truncation from happening. So monitoring is great but not enough. A hard limit that allows nodes to "protect" themselves from such abuse would still be necessary.

@michaelklishin
Member

michaelklishin commented Apr 1, 2022

What baffles me in this discussion is how the reasoning behind this change is ignored or devalued. The maintainers are blamed for changing things to address a real problem that can lead to catastrophic cluster availability events. I'm afraid changing software is the only way to improve it, and RabbitMQ rarely introduces client-facing changes. You can take a RabbitMQ client from 2010 and it will still successfully connect and perform all key operations against a release produced in 2022.

In my opinion RabbitMQ has no business meddling with my workers' happiness

But you are fine with consumers doing things that prevent RabbitMQ from reclaiming disk space. Again, the concerns of infrastructure developers are ignored and devalued. Changing your app or one line of configuration is too much to ask for. Software that you get for free, including some degree of free support, apparently cannot have an opinion or guard rails in place, even if that objectively improves resiliency in certain areas and frees up the core team's time.

Just set the timeout higher than 30m if you are willing to take the risk, and move on.

@rabbitmq locked and limited conversation to collaborators on Apr 1, 2022
The mergify bot added the make label on Apr 1, 2022