
Channel exceptions and consumer cancellations cause the transport to stop processing messages #843

Closed
chrisvanderpennen opened this issue Jul 14, 2021 · 14 comments · Fixed by #894

@chrisvanderpennen

chrisvanderpennen commented Jul 14, 2021

Symptoms

The endpoint stops processing messages with a single error message logged and doesn't resume processing until it is restarted.

Who's affected

All users of the transport are affected by the issue; however, the consumer acknowledgement timeout introduced in RabbitMQ 3.8.15 has increased the likelihood of experiencing the problem. Any handler running longer than the default timeout value (15 minutes in 3.8.15, changed to 30 minutes in 3.8.17) will cause the problem to occur.

Root cause

The message pump was handling only connection shutdown events and didn't detect channel or consumer failures, causing the consumer to stop receiving new messages from the broker.


Original issue content

From the RabbitMQ docs:

In modern RabbitMQ versions, a timeout is enforced on consumer delivery acknowledgement. This helps detect buggy (stuck) consumers that never acknowledge deliveries. Such consumers can affect node's on disk data compaction and potentially drive nodes out of disk space.
If a consumer does not ack its delivery for more than the timeout value (30 minutes by default), its channel will be closed with a PRECONDITION_FAILED channel exception. The error will be logged by the node that the consumer was connected to.
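For reference, the timeout in question is the consumer_timeout setting in rabbitmq.conf, expressed in milliseconds; raising the limit to, say, one hour would look roughly like this:

# rabbitmq.conf sketch: raise the delivery acknowledgement timeout to one hour
# (value is in milliseconds)
consumer_timeout = 3600000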

While it is unexpected for a handler to take longer than 30 minutes to complete, a service instance that triggers this condition will stop receiving messages from RabbitMQ yet keep running with no indication that anything is wrong until it is manually restarted.

Consumer timeouts are signalled on AsyncEventingBasicConsumer via the ConsumerCancelled event, which is currently not subscribed in MessagePump.
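To illustrate where these signals surface in the client library (a minimal standalone sketch, not the transport's actual MessagePump code, assuming RabbitMQ.Client 6.x and a queue that already exists):

using System;
using System.Threading.Tasks;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

class ConsumerCancellationDemo
{
    static void Main()
    {
        // DispatchConsumersAsync is required when using AsyncEventingBasicConsumer.
        var factory = new ConnectionFactory { HostName = "localhost", DispatchConsumersAsync = true };
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();

        var consumer = new AsyncEventingBasicConsumer(channel);

        consumer.ConsumerCancelled += (sender, args) =>
        {
            // Raised when the broker cancels the consumer, e.g. after the
            // acknowledgement timeout closes the channel.
            Console.WriteLine($"Consumer cancelled: {string.Join(", ", args.ConsumerTags)}");
            return Task.CompletedTask;
        };

        channel.ModelShutdown += (sender, args) =>
        {
            // Channel-level exceptions such as PRECONDITION_FAILED surface here.
            Console.WriteLine($"Channel shut down: {args.ReplyCode} {args.ReplyText}");
        };

        // Assumes the timeoutRepro queue was already declared (e.g. by the repro endpoint).
        channel.BasicConsume(queue: "timeoutRepro", autoAck: false, consumer: consumer);
        Console.ReadLine();
    }
}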

I've posted a gist with a reproduction. To run, assuming Docker for Windows:

docker run --name rabbitmq -d -p 15672:15672 -p 5672:5672 --mount "type=bind,source=$(resolve-path timeout.conf),target=/etc/rabbitmq/conf.d/timeout.conf" rabbitmq:management
dotnet run
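The mounted timeout.conf just shrinks the broker's delivery acknowledgement timeout so the failure shows up within seconds rather than minutes; its contents are along these lines:

# timeout.conf: shrink the delivery acknowledgement timeout (milliseconds)
# so the repro triggers after 5 seconds instead of the default 30 minutes.
consumer_timeout = 5000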

Running docker logs -f rabbitmq in a second terminal will eventually print:

2021-07-14 04:08:19.868 [warning] <0.1206.0> Consumer [hostname] - timeoutRepro on channel 1 has timed out waiting for delivery acknowledgement. Timeout used: 5000 ms. This timeout value can be configured, see consumers doc guide to learn more
2021-07-14 04:08:19.869 [error] <0.1206.0> Channel error on connection <0.1197.0> (172.17.0.1:37270 -> 172.17.0.2:5672, vhost: '/', user: 'guest'), channel 1:
operation none caused a channel exception precondition_failed: delivery acknowledgement on channel 1 timed out. Timeout value used: 5000 ms. This timeout value can be configured, see consumers doc guide to learn more

The management UI will show 0 consumers for the service queue, while the service continues to run as if nothing is wrong.

@mikeminutillo
Member

Thanks @chrisvanderpennen. We'll take a look and see what we can do about this. Did you encounter this issue in a running system?

@chrisvanderpennen
Author

We encountered it in a production system yesterday. We were fortunate enough to have a Kubernetes autoscaler watching queue depth, so we only had a mildly degraded system, but the only thing that tipped us off to there being a problem at all was that our telemetry alarm for queues with no consumers kept tripping when the autoscaler terminated the extra instances, despite all replicas showing as healthy.

I have no doubt there's a bug in one of our handlers that triggered it which we'll investigate and resolve separately, but because we don't have access to the consumer from our code we can't build an interim detection or recovery without forking the entire library.

@lukasweber

Thanks a lot for bringing this up!
We were running into the same issue on our production system recently, but were unable to reproduce where it was coming from. There is probably also a bug on our side that triggers this, and we'll keep investigating. It would be nice if you could also have a look at this on your side, since it's really hard to detect the issue with no access to the consumer.

@andreasohlund
Member

We're actively working on a fix for this #894

Our research suggests that the problematic timeout was introduced in 3.8.15 with a default value of 15 minutes, and that the default was changed to 30 minutes in 3.8.17.

Can you confirm that you are on 3.8.15 or higher of RabbitMQ?

@chrisvanderpennen
Author

Thanks for the update! We're running 3.8.16 in production. I should probably log a job to update...

@andreasohlund andreasohlund changed the title Message pump consumer cancellations due to acknowledgement timeout are never recovered Channel exceptions or consumer cancellation causes the transport to stop processing messages Oct 25, 2021
@andreasohlund
Member

@chrisvanderpennen we've tweaked the issue description to add some more details

@bording bording changed the title Channel exceptions or consumer cancellation causes the transport to stop processing messages Channel exceptions and consumer cancellations cause the transport to stop processing messages Oct 29, 2021
@bording
Member

bording commented Nov 1, 2021

@chrisvanderpennen @lukasweber We've released NServiceBus.RabbitMQ 6.1.1, so give it a try and let us know if you're still running into problems.

@lukasweber

lukasweber commented Nov 15, 2021

We were able to push the fix to our production environment last week, and it works exactly as expected! This buys us some time to figure out the root cause of the issue on our side. Thanks a lot :)

@NArnott

NArnott commented Jan 17, 2022

I know this issue is resolved, but it seems that while NSB will now attempt to reconnect, the message Recoverability mechanism doesn't get applied. As a short example, if I have a handler that does an infinite loop or sleep (basically never returns), can I short-circuit that message in the pipeline somehow so it doesn't keep getting rehandled on reconnection?

@bording
Member

bording commented Jan 19, 2022

Hi @NArnott,

For the scenario you are describing, where the handlers are truly in an infinite loop or otherwise never complete, that's going to be a problem regardless of what transport you're using. Recoverability would never be able to save you from that. Any endpoint with that problem would consume messages up to the concurrency limit, and then stop attempting to process any additional messages.

However, if you have handlers that are consistently taking longer to execute than the configured acknowledgement timeout, but do eventually complete, then that is a scenario that the current transport design is not handling properly. I've opened #927 to track this problem.

@NArnott

NArnott commented Jan 25, 2022

@bording,

Maybe this is something that can be addressed in the next NSB major version with Cancellation Tokens? Since the handler currently has no way of knowing what that timeout would be, it would be nice to have a token to monitor so we can cancel any long-running process once we know the transport has dropped us.

@andreasohlund
Member

@NArnott you are correct, and this is something that is already included in the upcoming v8; see

https://docs.particular.net/nservicebus/hosting/cooperative-cancellation?version=core_8#inside-the-message-processing-pipeline

is that what you are looking for?
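For illustration, a v8 handler can flow that token into long-running work roughly like this (a minimal sketch; the message, handler, and work method names are hypothetical):

using System;
using System.Threading;
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical message type, for illustration only.
public class LongRunningMessage : IMessage
{
}

public class LongRunningHandler : IHandleMessages<LongRunningMessage>
{
    public async Task Handle(LongRunningMessage message, IMessageHandlerContext context)
    {
        // In v8 the handler context exposes a CancellationToken that can be
        // passed to long-running work so it can be cancelled cooperatively.
        await SomeLongRunningWork(context.CancellationToken);
    }

    static Task SomeLongRunningWork(CancellationToken cancellationToken) =>
        Task.Delay(TimeSpan.FromMinutes(45), cancellationToken);
}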

@NArnott

NArnott commented Jan 25, 2022

@andreasohlund Yes, I am aware CTs are coming in the next major release. I just wanted to see if they'd also trigger for this RabbitMQ use case, when it drops the connection due to taking too long.

@andreasohlund
Member

Sorry for the delay @NArnott. What you are asking about is very reasonable, but it's not currently supported. I have raised #931 to track adding support for it, which means it will be considered for our next enhancement release, but I have no ETA yet.

Hope this makes sense!
