[core] autoscaler occasionally goes into exception loop when using preemptible GCP instances #29698
Labels: bug, core, core-autoscaler, core-clusters, P1, Ray 2.4
What happened + What you expected to happen
I use a Ray cluster on Google Cloud Platform for my tasks. One thing to note is that I use preemptible instances for the workers (thus, Google may stop them at any time).
After a while (about 30-40 minutes of active use), scaling stops working: no new workers come up, and idle workers are no longer destroyed after the idle timeout (moreover, some workers are up but never initialized). I have debugged the issue down to what looks like an infinite exception-restart loop in
/tmp/ray/session_latest/logs/monitor.log
on the head node; the relevant part of the log is:

This exception repeats again and again with the same worker id, ray-research-worker-cbcbb628-compute. The ray-research-worker-cbcbb628-compute instance does seem to have existed, but it no longer exists at the moment of the exception (thus, a 404 response from GCP is justified). I believe (though I am not sure) that the situation is something like this:
The expected behavior is that the autoscaler should handle this case and continue to set up other workers, shut down idle ones, etc.
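What I would expect is for the provider's status polling to treat a 404 for an instance that GCP has already reclaimed as "node gone" rather than letting the exception crash the monitor loop. Below is a minimal sketch of that pattern; the `fetch_instance` stub, the `HttpError` stand-in, and the node names are all hypothetical and do not reflect Ray's actual GCP node provider internals:

```python
# Sketch: tolerate a 404 for a preempted instance instead of letting the
# exception kill the autoscaler's polling loop. All names here are
# illustrative stand-ins, not Ray internals.

class HttpError(Exception):
    """Stand-in for googleapiclient.errors.HttpError."""
    def __init__(self, status):
        self.status = status
        super().__init__(f"HTTP {status}")

# Pretend inventory: GCP has already reclaimed the preemptible worker.
INSTANCES = {"ray-research-head": "RUNNING"}

def fetch_instance(name):
    """Simulates a GCP instances.get() call for a named instance."""
    if name not in INSTANCES:
        raise HttpError(404)
    return {"name": name, "status": INSTANCES[name]}

def node_status(name):
    """Return the instance status, or 'TERMINATED' if GCP says it is gone."""
    try:
        return fetch_instance(name)["status"]
    except HttpError as e:
        if e.status == 404:
            # The preemptible instance was reclaimed; report it as terminated
            # so the caller can clean it up and keep scaling other nodes.
            return "TERMINATED"
        raise  # any other HTTP error is still a real failure

if __name__ == "__main__":
    print(node_status("ray-research-head"))
    print(node_status("ray-research-worker-cbcbb628-compute"))
```

With this pattern, a worker that disappears between two polls is simply reported as terminated, and the loop continues with the remaining nodes instead of restarting on the same exception forever.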
Versions / Dependencies
Google Cloud Platform is used, and preemptible instances are used for the workers (see config).
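For context, the relevant part of a GCP cluster config is the `scheduling` block on the worker node type; this fragment is illustrative (machine type, project id, and worker counts are placeholders, not my actual config):

```yaml
# Illustrative fragment of a Ray GCP cluster config.
# The scheduling.preemptible flag is what allows GCP to reclaim workers.
provider:
  type: gcp
  region: us-west1
  availability_zone: us-west1-a
  project_id: my-project          # placeholder

available_node_types:
  ray_worker:
    min_workers: 0
    max_workers: 8
    node_config:
      machineType: n1-standard-4  # placeholder
      scheduling:
        - preemptible: true       # workers may be stopped by GCP at any time
```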
Reproduction script
Config:
Script:
In order to reproduce the issue, you may have to submit the script to the cluster several times so that an instance shutdown is caught in the right state.
Issue Severity
High: It blocks me from completing my task.