Set a default for consumer_timeout to 15 minutes #2990
We agreed to document this in the |
So that faulty consumers that will never ack a pending message have their channels closed after 15 minutes.
Set a default for consumer_timeout (cherry picked from commit 80e3992)
Backported to |
Guys, I'm not sure it was worth it to make such a change in a minor release. There are legitimate cases when you would not ACK a message for a long period of time. For me, the change of the default setting caused a BC break leading to failures of an application (it processed the same message again and again). And it was really hard to find any information about this behavior by just googling it, or even by searching the docs on the website. I came to this PR by chance. I'm sure there are a lot of other apps relying on the lack of ACK timeouts by default.
I've seen a few reports of this causing breaking changes in the Airflow community, where some of the example Airflow implementations that folks start with are pinned to version 3.8 but not a minor version. I think this was a pretty unfortunate choice to release in a minor version update. |
There is no timeout value that would work for everyone. Not having a timeout is not an option: without it, a buggy consumer will eventually prevent its quorum queue from performing on-disk data compaction. We waited for over a year before The value is configurable. You can set it to several hours if you want or even disable it. This is how it works with defaults: no value works for everyone but some values work for 90% of people who would never notice the change. Time to move on.
@someniatko I'm going to do the unspeakable and argue that most apps don't process deliveries for over 15 minutes. Some do but not a lot. Your case is an outlier. Sorry to be so blunt but we talk to RabbitMQ users every single day. What can be done to make it easier to understand is an error message improvement. It can mention that the timeout |
A better error message is definitely needed if you insist on backporting the change in a minor version update. |
We will add a dedicated doc section to the Consumers guide, change the message to mention the configurability, and bump the timeout x2 to 30 minutes. 1 hour sounds too long to several members of our team. We believe a very high % of consumers do not process deliveries for anywhere near 30 minutes but maybe a 30m default would cover a few more % compared to 15m. |
now that 3.8.15 has finally introduced a default. References rabbitmq/rabbitmq-server#2990, rabbitmq/rabbitmq-server#3032
@michaelklishin how would one disable it completely? Sorry to ask, but it's not clear from the docs and I'm unfortunately not great at Erlang or with rabbitmq-server internals. I think I see a validator in the schema for |
Setting consumer_timeout to false appears to disable this feature. I agree with others above that this needs to be documented. |
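For anyone wondering what the configured value looks like: as far as I understand, `consumer_timeout` in `rabbitmq.conf` is expressed in milliseconds. Below is a minimal sketch that converts a timeout in minutes into the corresponding config line. The helper name `consumer_timeout_line` is hypothetical and not part of any RabbitMQ tooling; it only does the unit conversion.

```python
def consumer_timeout_line(minutes: float) -> str:
    """Render a rabbitmq.conf line for a consumer timeout given in minutes.

    Assumption: the setting takes a value in milliseconds.
    This helper is illustrative only, not a RabbitMQ API.
    """
    if minutes <= 0:
        raise ValueError("timeout must be positive")
    millis = int(round(minutes * 60 * 1000))
    return f"consumer_timeout = {millis}"


# The 30 minute default discussed in this thread, and a 4 hour override.
print(consumer_timeout_line(30))
print(consumer_timeout_line(4 * 60))
```

The resulting line would go into `rabbitmq.conf` on the server side; the thread does not describe a client-side equivalent.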
The release notes for 3.8.15 suggest this change only applied to quorum queues, but based on the documentation added here it seems like this change applies to all consumers. Is that correct? If so, could the release notes for 3.8.15 be amended? |
hi this just broke one of our envs. why is this a part of a minor update? |
@nahum-litvin-hs You need to configure |
Do you have any idea how this can be done using the API?
yep similar question: does this have to be set from the server side, or can it be set from the client side (i.e. via the connection URL) (edit: I guess not currently https://www.rabbitmq.com/uri-query-parameters.html#tls) I think this would be a good compromise, for use cases like mine, where we don't have "control" of the server-side and, like |
It seems that no consumer timeout was the default a while ago. Relates rabbitmq/rabbitmq-server#2990
An outlier?? |
Our use case is similar to that of Celery: scheduling notifications to be sent in the future, sometimes days ahead. We use the manual ack to ensure the notification is sent at least once and only once. Just a +1 for delayed acks.
I am baffled by this conversation. Let me state first that I very much respect the work that has been done by you and your fellow developers. RabbitMQ is a very stable and robust piece of software. I very much appreciate the time that you have put into it. But: you cannot change the fundamental workings of something which is such an important piece of infrastructure to projects run by so many people and businesses. I stumbled on this PR by accident, because we have been debugging a very long-running background process on a development machine. In a closely monitored situation with only a few jobs in the queue, this still cost me a few hours before I figured it out. We were very lucky. We are running 3.8.14 in production, the last minor release that would not kill my software and data. By chance the development machine had a 3.9.x version that was affected by this feature. I run a few job queues on a RabbitMQ cluster. I have designed a distributed system where the jobs submitted to these queues should only ever run once per queue message. This has now changed, in a minor version no less, because a new feature had to be added and it was decided it should be enabled by default too.
This will affect many more people running RabbitMQ, and it is certainly no edge case. Sure, RabbitMQ might process millions of messages for every message that runs over 15 (or now 30) minutes, but the problem is that most users will not notice any problems this will cause until it actually happens. And when it does, the effects can be huge. I seriously don't understand why this feature has been implemented to be default-on. I use RabbitMQ as a queue server. RabbitMQ does not manage my worker processes. This is excellent: I can write my own code around this, scale workers up and down as needed, and start specific workers only when the need arises. Around these workers I have monitoring set up that assures me that my workers are happy little ants. If my ants fall asleep doing their business, it should not be of any concern to the stack of work that is waiting for them. In my opinion RabbitMQ has no business meddling with my workers' happiness. It is advertised as a message broker.
I don't think I have ever been in this situation. If I have, I didn't notice. Still, I can imagine lots of better ways to handle this. For example:
I can blow up my rabbitmq servers, my redis instances, my database servers or overload my loadbalancers in many ways by doing stupid things. Sometimes there are days that I don't. Why do I need protection from my consumers blocking disk data compaction? I will have to disable the timeout anyway, because it is punching a hole in the reliability of my systems. So I need to know what I have to do to avoid the situation your fix is for, or at least monitor for it. It is well possible that I am wrong, not seeing things correctly, maybe because I am missing the bigger picture here. |
@langemeijer well we have seen nodes running out of disk because of stuck consumers. No one changes things for the kicks of it. The modern default is 30m, and very few consumers take longer than that. It's very convenient to assume that every user runs into what you run into, and does things the same way. The core team sees all kinds of people because they eventually come to the mailing list, Slack and so on with questions. And of course they do not observe or foresee nodes running out of disk. Implementing a high watermark that blocks queues is extremely complex and very likely to backfire. A timeout is Why do some need protection from stuck consumers? Because many do not run with sufficient monitoring on their apps.
As for what can be monitored, monitoring for how long quorum queue consumers take to process a delivery with 95th and 99th percentile should be enough. But it can be too late to save a node or multiple nodes from exhausting all disk space.
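The percentile monitoring suggested above could be sketched on the application side roughly as follows. This is only an illustration under stated assumptions: the function names, the nearest-rank percentile method, and the 80% warning threshold are all hypothetical choices, and the processing durations are assumed to be collected by the consumer application itself rather than reported by RabbitMQ.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of samples, with pct an integer in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def timeout_headroom(durations_s: Sequence[float],
                     timeout_s: float = 1800.0,
                     warn_fraction: float = 0.8) -> Dict[str, object]:
    """Summarise processing times against a consumer timeout.

    durations_s: per-delivery processing times in seconds (app-collected).
    timeout_s: configured consumer timeout in seconds (30 minutes here).
    warn_fraction: hypothetical threshold; flag when p99 gets this close.
    """
    p95 = percentile(durations_s, 95)
    p99 = percentile(durations_s, 99)
    return {"p95": p95, "p99": p99, "at_risk": p99 >= warn_fraction * timeout_s}


# 98 quick deliveries and two that run for close to half an hour.
report = timeout_headroom([10.0] * 98 + [1700.0, 1750.0])
print(report)
```

A report with `at_risk` set would be a hint to either shorten the work done per delivery or raise the timeout before channels start getting closed.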
What baffles me in this discussion is how the reasoning behind this change is ignored or devalued. The maintainers are
but you are fine with consumers doing things that prevent RabbitMQ from reclaiming disk space. Again, the concerns Just set the timeout higher than 30m if you are willing to take the risk, and move on. |
So that faulty consumers that will never ack a pending message have
their channels closed after 15 minutes.
Without this change quorum queue compaction can be severely delayed by a consumer that never acknowledges a delivery,
running the node out of disk space much faster than anticipated.
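To make the described behavior concrete, here is a toy model of the check: any channel that holds an unacknowledged delivery older than the timeout is selected for closing. This is not RabbitMQ's implementation; the `Delivery` type, the function name, and the use of plain second-based timestamps are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Delivery:
    channel: str          # name of the channel the delivery went to
    delivered_at: float   # time of delivery, in seconds
    acked: bool = False   # whether the consumer has acknowledged it


def channels_to_close(deliveries: Iterable[Delivery],
                      now: float,
                      timeout_s: float = 900.0) -> List[str]:
    """Return channels holding an unacked delivery older than timeout_s.

    Toy model only: the default of 900 seconds mirrors the 15 minutes
    mentioned in the commit message.
    """
    return sorted({
        d.channel
        for d in deliveries
        if not d.acked and now - d.delivered_at > timeout_s
    })


pending = [
    Delivery("ch1", delivered_at=0.0),               # never acked
    Delivery("ch2", delivered_at=0.0, acked=True),   # acked in time
    Delivery("ch3", delivered_at=500.0),             # still within timeout
]
print(channels_to_close(pending, now=1000.0))
```

In this model a slow but healthy consumer and a stuck one look identical once the timeout passes, which is the trade-off debated in the comments above.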