Instability when replicating/rebalancing on an active cluster #3024
Can you inspect the WorkerState to see if the scheduler and workers agree on what is currently being run? For the workers: `client = Client()` and then `client.run(lambda dask_worker: dask_worker.tasks)`. And you can get the scheduler's state from the scheduler side (e.g. via `client.run_on_scheduler`).
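A minimal sketch of comparing the two views side by side, assuming a scheduler address and the scheduler-side TaskState attributes (`state`, `processing_on`); both can differ across distributed versions:

```python
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # hypothetical address

# Worker-side view: the task keys each worker currently knows about
worker_view = client.run(lambda dask_worker: list(dask_worker.tasks))

# Scheduler-side view: tasks the scheduler believes are "processing", and where.
# Returning plain strings keeps the result serializable.
def processing_on_scheduler(dask_scheduler):
    return {
        key: str(ts.processing_on)
        for key, ts in dask_scheduler.tasks.items()
        if ts.state == "processing"
    }

scheduler_view = client.run_on_scheduler(processing_on_scheduler)

print(worker_view)
print(scheduler_view)
```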
It looks like they don't agree: [worker output] vs [scheduler output]
I am calling `replicate` ...
Which task is the one that's stuck in processing?
That's useful info, thanks.
I'm not sure, but this looks like a bug. I don't think it should be possible to get the cluster into this state (cc @mrocklin, in case you know offhand).
Inside, I'm cheering that you two are having this conversation without my involvement :) If the scheduler's and workers' perspectives of tasks remain different for a non-trivial amount of time, then yes, that seems like a bug to me.
As a warning, replicate isn't super robust, especially when things are actively running. It could use being improved/replaced.
I've played around a bit more and it does seem the problem is related to `replicate`. I'm not sure how common the replicate-to-all-workers pattern is in practice; I would actually be equally happy to re-load my data on each worker. But if you think there's any low-hanging fruit in the current `replicate` implementation, let me know.
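A rough sketch of the re-load-on-each-worker alternative mentioned above, using `client.run` so every worker loads the data itself instead of having futures replicated to it; the path, the cache attribute name, and the loading logic are illustrative assumptions:

```python
from distributed import Client, get_worker

client = Client("tcp://scheduler-address:8786")  # hypothetical address

def load_reference_data(path):
    # Cache the data on the worker under a custom attribute (hypothetical name),
    # so later tasks on that worker can reuse it without any transfer.
    worker = get_worker()
    if not hasattr(worker, "_reference_data"):
        with open(path, "rb") as f:
            worker._reference_data = f.read()

# Runs once on every worker; the data never has to move between workers
client.run(load_reference_data, "/shared/reference.bin")  # hypothetical path

def task(x):
    data = get_worker()._reference_data
    return x + len(data)

futures = client.map(task, range(10))
```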
I think that the current replicate and rebalance implementations only really work robustly if there isn't any work going on in the cluster. They're not resilient to anything that could happen. Instead, I think that we need to design something from the ground up. It's not a trivial task, but probably has more to do with general design than the internals of Dask.
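One conservative workaround consistent with the comment above is to let in-flight work drain before replicating; a minimal sketch with a made-up workload:

```python
from distributed import Client, wait

client = Client("tcp://scheduler-address:8786")  # hypothetical address

futures = client.map(lambda x: x ** 2, range(1000))  # hypothetical workload

# Let in-flight work finish before replicating, since replicate/rebalance
# are only robust when nothing else is running on the cluster
wait(futures)
client.replicate(futures)
```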
Maybe tangentially related to #391: I have noticed with long-running jobs that sometimes one worker will stop making progress and those tasks will remain in "Processing" state indefinitely. I've tried to figure out what might be going on but am at a loss (the worker process looks idle in `top`). I assumed this was due to some peculiarity of my task function, but the fact that the call stack is completely empty (in this case with 5 of the same task "Processing") makes me think it might be dask-related instead. Any suggestions on how to go about debugging why a worker might be stuck the next time this happens?
EDIT: renaming since we decided the culprit was `replicate`.
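For the debugging question in the issue description, a minimal sketch of client-side calls that can surface a stuck worker the next time it happens (the scheduler address is an assumption):

```python
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # hypothetical address

# What the scheduler believes each worker is currently processing
print(client.processing())

# Call stacks of currently running tasks; an empty stack for a worker that
# still shows tasks as "Processing" is the symptom described above
print(client.call_stack())
```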