Fix issue where query-schedulers accumulate stale querier connections #6100
Conversation
lgtm
Thanks @charleskorn for working on this! To my understanding, the problem is that we never remove connections from waitingQuerierConnections when the querier disconnects from the query-scheduler.
If my understanding is correct, I have a couple of comments:
- I don't think it's just about "GetNextRequestForQuerier calls for this querier that nothing is coming" (all in all, GetNextRequestForQuerier() should get its context cancelled as soon as the querier terminates, because of the gRPC context); I think the problem is that we never clean up waitingQuerierConnections.
- I think we should enforce the cleanup even on unregisterConnection (which is more important than notifyShutdown, because the latter is just best effort).
- Why did the issue exist even before? I think we didn't have this issue before. What am I missing?
@@ -151,6 +152,9 @@ func (q *RequestQueue) dispatcherLoop() {
	case notifyShutdown:
		queues.notifyQuerierShutdown(qe.querierID)
		needToDispatchQueries = true

		// Tell any waiting GetNextRequestForQuerier calls for this querier that nothing is coming.
		q.cancelWaitingConnectionsForQuerier(qe.querierID, waitingQuerierConnections)
The shutdown notification is a best effort. It's not guaranteed to be received (e.g. if the querier OOMs or the K8S node where it was running gets abruptly terminated).
If the shutdown notification is not received, we'll try to send a request to the querier (in Scheduler.QuerierLoop()), get an error, call unregisterConnection and not call GetNextRequestForQuerier again - so this is OK.
Just to be on the same page: what you're describing is the logic in this PR, not main, right? Because I can see what you describe happens with the change proposed in this PR.
> Just to be on the same page: what you're describing is the logic in this PR, not main, right?

The behaviour in the unclean shutdown case should be the same in both.
I find it a bit difficult to follow this logic. Wouldn't it be safer and clearer if we removed a querier from waitingQuerierConnections even when handling unregisterConnection?
unregisterConnection is only called by Scheduler.QuerierLoop(), and so isn't called concurrently with GetNextRequestForQuerier, so removing the pending connection from waitingQuerierConnections is not necessary, because it won't be there.
I would improve the comment to say that this is just a nice-to-have: it's OK if we don't receive the shutdown notification, because waitingQuerierConnections will be cleaned up when we next try to dispatch a request to the querier.
I agree with @pracucci's comment. I think the user queues already manage queriers that have disconnected all their connections well. It would be nice to have a single management structure for querier connections. Right now they are managed separately in the linked list and in the
These are the steps that get us into the bad state:

Once we're in this state, when a query arrives:

My understanding is that the context won't be cancelled until we try to use the stream to send or receive a message and therefore realise the stream is broken, and that won't happen until the call to

I think the previous code assumed the context would be cancelled if the connection was broken, but that's not the case.

I agree - the problem in this case is that

I agree we could simplify this a little. However, they represent slightly different things (and may need better names) - the
Ok, I think I now better understand the fix. I propose to improve a couple of comments. Thanks!
if err != nil {
	querierConn.sendError(err)
	return true
}
To my understanding, this is the real fix. cancelWaitingConnectionsForQuerier() called by notifyShutdown is just a "nice to have" to gracefully handle shutting-down queriers, but what really fixes the issue in every case (that is, even when the shutdown notification has not been received) is this one. I suggest adding a comment here to explain it, because I think it will not be obvious when we read this code again in the future.
Does this mean that this context cancellation tracking has no effect - here and here? Should we remove it?
…#6100)

* Fix issue where query-scheduler accumulates connections from queriers that have shut down.
* Add test to ensure GetNextRequestForQuerier still returns if racing with NotifyQuerierShutdown and NotifyQuerierShutdown executes first.
* Add changelog entry.
* Address PR feedback: add comments
* Clarify comment

(cherry picked from commit a6d8fea)
I don't know what I did when I was testing this last week (maybe I was checking the wrong context?), but what I said earlier is incorrect: the context is cancelled when the querier either gracefully ends the stream (eg. due to a graceful shutdown) or when it is terminated abruptly (eg. due to the querier crashing).

So my earlier statement that #6090 occurred before and after #5880 was also wrong - the implementation before #5880 was not susceptible to the same issue because it would always return from

The issue in #6090 only occurs with the changes from #5880: we didn't remove pending

We could be more proactive about detecting context cancellation earlier, but I don't think this is worthwhile as it would only complicate things further - we'd now have two places checking for context cancellation. Open to feedback on this if you disagree though @dimitarvdimitrov @pracucci.

The fix in this PR still resolves the issue, so I don't believe any further work is required.
Thanks @charleskorn for the detailed explanation. I'm happy to hear the old version didn't suffer from this issue. I'm fine with keeping the current fix as is, without complicating the logic further to handle the cancellation earlier.
I think it's OK to only detect a cancelled querier connection when we attempt to send to it, since the overhead of retrying the query on another connection shouldn't be high, and because we should fairly quickly discover that the querier connection is cancelled (within seconds with reasonable load?).

But your response prompted another question. Since the context is cancelled promptly, do we need the extra cleanup in
Agree as well. I'm wondering if we really need that extra logic - which was the point of a comment I made on this PR, which was rejected because at that point we thought context cancellation wasn't propagated.
Good point, I've removed that in #6145.
What this PR does
This PR fixes an issue where query-schedulers could accumulate stale querier connections.
getNextQueueForQuerier would return nil both when there were no queries for the querier right now and when the querier was shutting down and so no queries should be sent.

This meant that the dispatcher loop in RequestQueue wouldn't remove connections for queriers that are shutting down, and so would accumulate more and more querier connections. Every time a new request arrived, it would try to find a request to send to the stale queriers (all of which would be at the beginning of waitingQuerierConnections, as the oldest connections), and then finally send the request to an active connection. Over time, as more and more queriers stopped, more and more stale querier connections would accumulate and so enqueue latency would increase.

I believe this issue was present before #5880, but #5880 made it visible in the enqueue latency:
Which issue(s) this PR fixes or relates to
Fixes #6090
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]