[core] autoscaler occasionally goes into exception loop when using preemptible GCP instances #29698

Open
neex opened this issue Oct 26, 2022 · 2 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), core-clusters (For launching and managing Ray clusters/jobs/kubernetes), P1 (Issue that should be fixed within a few weeks), Ray 2.4

neex commented Oct 26, 2022

What happened + What you expected to happen

I use a Ray cluster on Google Cloud Platform for my tasks. One thing to note is that I use preemptible instances for the workers (so Google may stop them at any time).

After a while (about 30-40 minutes of active usage), scaling stops working: no new workers come up, and old workers are not destroyed after the idle timeout (moreover, some workers are up but not initialized). I've debugged the issue down to what looks like an infinite exception-restart loop in /tmp/ray/session_latest/logs/monitor.log on the head node; the relevant part of the log is:

2022-10-26 13:26:33,018 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances?filter=%28%28status+%3D+PROVISIONING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+RUNNING%29%29+AND+%28labels.ray-cluster-name+%3D+research%29&alt=json
2022-10-26 13:26:33,136 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute?alt=json
2022-10-26 13:26:33,195 ERROR autoscaler.py:341 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 338, in update
    self._update()
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 397, in _update
    self.process_completed_updates()
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 732, in process_completed_updates
    self.load_metrics.mark_active(self.provider.internal_ip(node_id))
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 155, in internal_ip
    node = self._get_cached_node(node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 217, in _get_cached_node
    return self._get_node(node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 45, in method_with_retries
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 209, in _get_node
    instance = resource.get_instance(node_id=node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node.py", line 407, in get_instance
    .execute()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/googleapiclient/http.py", line 851, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute?alt=json returned "The resource 'projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute' was not found">
2022-10-26 13:26:33,196 CRITICAL autoscaler.py:350 -- StandardAutoscaler: Too many errors, abort.

This exception repeats again and again with the same worker id ray-research-worker-cbcbb628-compute.

The ray-research-worker-cbcbb628-compute instance does seem to have existed at some point, but it no longer exists at the moment of the exception (so the 404 response from GCP is justified).

I believe (though I am not sure) that the situation is something like this:

  1. Ray started setting up the instance for a worker and added it to some internal data structures.
  2. At some point (probably during setup), the instance was shut down, since I use preemptible instances.
  3. Google Cloud Platform immediately forgot about it and started returning 404 for all requests related to the instance.
  4. The autoscaler did not handle this corner case correctly and did not remove the node from its internal lists.

The expected behavior is that the autoscaler should handle this case and continue to set up other workers, shut down idle ones, etc.
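
To illustrate the kind of handling I would expect, here is a minimal sketch of a defensive wrapper around the call that fails in the traceback above. It is only an illustration, not Ray's actual code: the wrapper name and the "drop the node" step are hypothetical, and I am only assuming that the 404 surfaces as googleapiclient.errors.HttpError, as the log shows.

from googleapiclient.errors import HttpError

def mark_active_safely(autoscaler, node_id):
    # Hypothetical sketch: resolve the node's internal IP, but tolerate the
    # node having been preempted and already deleted on the GCP side.
    try:
        ip = autoscaler.provider.internal_ip(node_id)
    except HttpError as e:
        if e.resp.status == 404:
            # The instance is gone; skip it (and ideally also remove node_id
            # from the autoscaler's bookkeeping) instead of crashing the
            # whole update loop.
            return
        raise
    autoscaler.load_metrics.mark_active(ip)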

Versions / Dependencies

$ ray --version
ray, version 2.0.1
$ python --version
Python 3.10.6
$ uname -a
Linux ray-research-head-3c5e32a6-compute 5.15.0-1021-gcp #28-Ubuntu SMP Fri Oct 14 15:46:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Ubuntu 22.04.1 LTS \n \l

Google Cloud Platform is used, and preemptible instances are used for the workers (see config).

Reproduction script

Config:

cluster_name: ray-debug
max_workers: 30

provider:
  type: gcp
  region: europe-west1
  availability_zone: europe-west1-c
  project_id: wunderfund-research


available_node_types:
    head:
        resources: {"CPU": 0}
        node_config:
            machineType: n2-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
                  sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts

                  # ubuntu-2204-jammy-v20220712a
    worker:
        # memory 640 GB =  640*1024*1024*1024 = 687194767360
        resources: {"CPU": 1, "memory": 687194767360}
        node_config:
            machineType: n2-standard-2

            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
                  sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
            scheduling:
              - preemptible: true
            serviceAccounts:
            - email: "ray-worker@wunderfund-research.iam.gserviceaccount.com"
              scopes:
              - https://www.googleapis.com/auth/cloud-platform


head_node_type: head
idle_timeout_minutes: 1
upscaling_speed: 2


auth:
   ssh_user: ubuntu


setup_commands:
  - sudo apt update
  - sudo DEBIAN_FRONTEND=noninteractive apt install python3-pip python-is-python3 -y
  - sudo pip install -U pip
  - sudo pip install ray[all]


# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Script:

import time
import ray

def test_job(delay):
    time.sleep(delay)
    return f"Waited for {delay} secs"


def run_jobs():
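    # 29 sleep tasks, each requesting a full CPU, so the autoscaler has to
    # bring up preemptible worker nodes to run them.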
    delays = [i * 10 for i in range(1, 30)]
    jobs = [ray.remote(test_job).options(num_cpus=1).remote(d) for d in delays]

    while jobs:
        done_ids, jobs = ray.wait(jobs)
        for ref in done_ids:
            result = ray.get(ref)
            print(ref, result)


if __name__ == "__main__":
    run_jobs()

In order to reproduce the issue, you may have to submit the script to the cluster several times so that an instance preemption is caught in the right state.
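
For reference, the repeated submissions can be scripted, for example via the Ray Jobs SDK. This is just a sketch of one way to drive the repro, assuming the dashboard is reachable on 127.0.0.1:8265 (e.g. through ray dashboard <config>.yaml) and the script above is saved as repro.py; the address, file name, and timings are my own choices, not part of the cluster config.

import time
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
for attempt in range(5):
    # Submit the repro script as a Ray job; working_dir ships repro.py to the cluster.
    job_id = client.submit_job(
        entrypoint="python repro.py",
        runtime_env={"working_dir": "."},
    )
    print(f"attempt {attempt}: submitted job {job_id}")
    # Leave time between submissions so preemptions can hit workers that are
    # still being set up.
    time.sleep(600)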

Issue Severity

High: It blocks me from completing my task.

@neex neex added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 26, 2022
@hora-anyscale hora-anyscale added core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 28, 2022

cadedaniel (Member) commented:

@wuisawesome could you help triage this?

cheremovsky commented:

Same story as OP 😢

@wuisawesome wuisawesome added core-autoscaler autoscaler related issues P1 Issue that should be fixed within a few weeks labels Nov 1, 2022
@cadedaniel cadedaniel removed their assignment Nov 1, 2022
@wuisawesome wuisawesome removed their assignment Nov 8, 2022
@richardliaw richardliaw added core-autoscaler autoscaler related issues infra autoscaler, ray client, kuberay, related issues and removed core-autoscaler autoscaler related issues labels Nov 21, 2022
@architkulkarni architkulkarni self-assigned this Jan 17, 2023
@scv119 scv119 added core-autoscaler autoscaler related issues and removed core Issues that should be addressed in Ray Core labels Feb 16, 2023
@richardliaw richardliaw added core-clusters For launching and managing Ray clusters/jobs/kubernetes and removed infra autoscaler, ray client, kuberay, related issues labels Mar 20, 2023
@jjyao jjyao added the core Issues that should be addressed in Ray Core label Feb 6, 2024