[core] autoscaler occasionally goes into exception loop when using preemptible GCP instances #29698

Open
neex opened this issue Oct 26, 2022 · 2 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), core-clusters (For launching and managing Ray clusters/jobs/kubernetes), P1 (Issue that should be fixed within a few weeks), Ray 2.4

neex commented Oct 26, 2022

What happened + What you expected to happen

I use a Ray cluster on Google Cloud Platform for my tasks. One thing to note is that I use preemptible instances for the workers (so Google may stop them at any time).

After a while (about 30-40 minutes of active usage), scaling stops working: no new workers come up, and old workers are not destroyed after the idle timeout (moreover, some workers are up but not initialized). I've debugged the issue down to what looks like an infinite exception-restart loop in /tmp/ray/session_latest/logs/monitor.log on the head node; the relevant part of the log is:

2022-10-26 13:26:33,018 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances?filter=%28%28status+%3D+PROVISIONING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+RUNNING%29%29+AND+%28labels.ray-cluster-name+%3D+research%29&alt=json
2022-10-26 13:26:33,136 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute?alt=json
2022-10-26 13:26:33,195 ERROR autoscaler.py:341 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 338, in update
    self._update()
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 397, in _update
    self.process_completed_updates()
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 732, in process_completed_updates
    self.load_metrics.mark_active(self.provider.internal_ip(node_id))
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 155, in internal_ip
    node = self._get_cached_node(node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 217, in _get_cached_node
    return self._get_node(node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 45, in method_with_retries
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 209, in _get_node
    instance = resource.get_instance(node_id=node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node.py", line 407, in get_instance
    .execute()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/googleapiclient/http.py", line 851, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute?alt=json returned "The resource 'projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute' was not found">
2022-10-26 13:26:33,196 CRITICAL autoscaler.py:350 -- StandardAutoscaler: Too many errors, abort.

This exception repeats again and again with the same worker id ray-research-worker-cbcbb628-compute.

The ray-research-worker-cbcbb628-compute instance does seem to have existed at some point, but it no longer exists at the moment of the exception (so the 404 response from GCP is justified).

I believe (though I am not sure) that the situation is something like this:

  1. Ray started setting up the instance for a worker and added it to some internal data structures.
  2. At some point (probably during setup), the instance was shut down, since I use preemptible instances.
  3. Google Cloud Platform immediately forgot about it and started returning 404 for all requests related to the instance.
  4. The autoscaler did not handle this corner case correctly and did not remove the node from its internal lists.

The expected behavior is that the autoscaler should handle this case and continue to set up other workers, shut down idle ones, etc.
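
To illustrate the kind of handling I would expect, here is a minimal sketch of a defensive wrapper around the call that fails in the traceback above. It is only an illustration, not Ray's actual code: the wrapper name and the "drop the node" step are hypothetical, and I am only assuming that the 404 surfaces as googleapiclient.errors.HttpError, as the log shows.

from googleapiclient.errors import HttpError

def mark_active_safely(autoscaler, node_id):
    # Hypothetical sketch: resolve the node's internal IP, but tolerate the
    # node having been preempted and already deleted on the GCP side.
    try:
        ip = autoscaler.provider.internal_ip(node_id)
    except HttpError as e:
        if e.resp.status == 404:
            # The instance is gone; skip it (and ideally also remove node_id
            # from the autoscaler's bookkeeping) instead of crashing the
            # whole update loop.
            return
        raise
    autoscaler.load_metrics.mark_active(ip)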

Versions / Dependencies

$ ray --version
ray, version 2.0.1
$ python --version
Python 3.10.6
$ uname -a
Linux ray-research-head-3c5e32a6-compute 5.15.0-1021-gcp #28-Ubuntu SMP Fri Oct 14 15:46:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Ubuntu 22.04.1 LTS \n \l

Google Cloud Platform is used, and preemptible instances are used for the workers (see config).

Reproduction script

Config:

cluster_name: ray-debug
max_workers: 30

provider:
  type: gcp
  region: europe-west1
  availability_zone: europe-west1-c
  project_id: wunderfund-research


available_node_types:
    head:
        resources: {"CPU": 0}
        node_config:
            machineType: n2-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
                  sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts

                  # ubuntu-2204-jammy-v20220712a
    worker:
        # memory 640 GB =  640*1024*1024*1024 = 687194767360
        resources: {"CPU": 1, "memory": 687194767360}
        node_config:
            machineType: n2-standard-2

            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
                  sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
            scheduling:
              - preemptible: true
            serviceAccounts:
            - email: "ray-worker@wunderfund-research.iam.gserviceaccount.com"
              scopes:
              - https://www.googleapis.com/auth/cloud-platform


head_node_type: head
idle_timeout_minutes: 1
upscaling_speed: 2


auth:
   ssh_user: ubuntu


setup_commands:
  - sudo apt update
  - sudo DEBIAN_FRONTEND=noninteractive apt install python3-pip python-is-python3 -y
  - sudo pip install -U pip
  - sudo pip install ray[all]


# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Script:

import time
import ray

def test_job(delay):
    time.sleep(delay)
    return f"Waited for {delay} secs"


def run_jobs():
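    # 29 sleep tasks, each requesting a full CPU, so the autoscaler has to
    # bring up preemptible worker nodes to run them.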
    delays = [i * 10 for i in range(1, 30)]
    jobs = [ray.remote(test_job).options(num_cpus=1).remote(d) for d in delays]

    while jobs:
        done_ids, jobs = ray.wait(jobs)
        for ref in done_ids:
            result = ray.get(ref)
            print(ref, result)


if __name__ == "__main__":
    run_jobs()

In order to reproduce the issue, you may have to submit the script to the cluster several times so that an instance preemption is caught in the right state.
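
For reference, the repeated submissions can be scripted, for example via the Ray Jobs SDK. This is just a sketch of one way to drive the repro, assuming the dashboard is reachable on 127.0.0.1:8265 (e.g. through ray dashboard <config>.yaml) and the script above is saved as repro.py; the address, file name, and timings are my own choices, not part of the cluster config.

import time
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
for attempt in range(5):
    # Submit the repro script as a Ray job; working_dir ships repro.py to the cluster.
    job_id = client.submit_job(
        entrypoint="python repro.py",
        runtime_env={"working_dir": "."},
    )
    print(f"attempt {attempt}: submitted job {job_id}")
    # Leave time between submissions so preemptions can hit workers that are
    # still being set up.
    time.sleep(600)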

Issue Severity

High: It blocks me from completing my task.

@neex neex added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 26, 2022
@hora-anyscale hora-anyscale added core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 28, 2022

cadedaniel (Member) commented:

@wuisawesome could you help triage this?

cheremovsky commented:

Same story as OP 😢

@wuisawesome wuisawesome added core-autoscaler autoscaler related issues P1 Issue that should be fixed within a few weeks labels Nov 1, 2022
@cadedaniel cadedaniel removed their assignment Nov 1, 2022
@wuisawesome wuisawesome removed their assignment Nov 8, 2022
@richardliaw richardliaw added core-autoscaler autoscaler related issues infra autoscaler, ray client, kuberay, related issues and removed core-autoscaler autoscaler related issues labels Nov 21, 2022
@architkulkarni architkulkarni self-assigned this Jan 17, 2023
@scv119 scv119 added core-autoscaler autoscaler related issues and removed core Issues that should be addressed in Ray Core labels Feb 16, 2023
@richardliaw richardliaw added core-clusters For launching and managing Ray clusters/jobs/kubernetes and removed infra autoscaler, ray client, kuberay, related issues labels Mar 20, 2023
@jjyao jjyao added the core Issues that should be addressed in Ray Core label Feb 6, 2024