MSQ: Cancel workers more quickly. #16158

Merged 1 commit on Mar 19, 2024

@gianm gianm commented Mar 19, 2024

Prior to this patch, when canceled, workers would keep trying to contact the controller: they would attempt to report an error, and if they were in the midst of some other call (like a counters push) they would keep trying it.

This can cause cancellation to be delayed, because the controller shuts down its HTTP server before it cancels workers. Workers are then stuck retrying calls to the controller that will never succeed. The retry loops are broken when the controller gives up on them (one minute later) and exits for real. Then, the controller failure detection logic on the worker detects that the controller has failed, and the worker finally shuts down.

This patch speeds up worker cancellation by bypassing communication with the controller, which is not needed in either cancellation scenario. If the controller canceled the workers, it isn't interested in further communications from them. If the workers were canceled out-of-band, the controller can detect this through worker monitoring and report it as a WorkerFailed error.
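
To illustrate the idea (not the actual change in this patch), here is a minimal sketch of a worker that stops retrying controller calls and skips error reporting once it has been canceled. All names here (`WorkerCancelSketch`, `callControllerWithRetry`, `reportError`) are hypothetical and do not correspond to Druid classes or methods; the failing controller call is simulated with a lambda.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; names are illustrative and are not Druid's worker code.
class WorkerCancelSketch {
    private final AtomicBoolean canceled = new AtomicBoolean(false);

    /** Marks the worker as canceled, whether by the controller or out-of-band. */
    void cancel() {
        canceled.set(true);
    }

    /**
     * Retries a controller call up to maxAttempts times, but stops as soon as
     * the worker is canceled. Returns the number of attempts actually made.
     */
    int callControllerWithRetry(BooleanSupplier call, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts && !canceled.get()) {
            attempts++;
            if (call.getAsBoolean()) {
                break; // the call succeeded
            }
        }
        return attempts;
    }

    /**
     * Sends an error report unless the worker is canceled.
     * Returns true if a report was attempted.
     */
    boolean reportError(Runnable sendReport) {
        if (canceled.get()) {
            return false; // canceled: skip contacting the controller entirely
        }
        sendReport.run();
        return true;
    }

    public static void main(String[] args) {
        WorkerCancelSketch worker = new WorkerCancelSketch();
        int[] calls = {0};
        // Simulated controller whose HTTP server is already down: every call
        // fails. Cancellation arrives during the third attempt.
        int attempts = worker.callControllerWithRetry(() -> {
            if (++calls[0] == 3) {
                worker.cancel();
            }
            return false;
        }, 100);
        System.out.println("attempts before stopping: " + attempts);
        System.out.println("error report attempted: " + worker.reportError(() -> {}));
    }
}
```

In this sketch the retry loop exits after the attempt during which cancellation arrives instead of running through its full retry budget, and the error report is never sent.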

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Mar 19, 2024
@cryptoe cryptoe left a comment

LGTM!!

@cryptoe cryptoe merged commit 8ee324c into apache:master Mar 19, 2024
53 checks passed
@gianm gianm deleted the msq-cancel-faster branch March 19, 2024 09:06
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024