[Bug][Autoscaler] Ray V2 Autoscaler stalls when scaling multiple worker Pods with KubeRay #2759

Closed
ryanaoleary opened this issue Jan 16, 2025 · 3 comments · Fixed by #2814
Labels: bug (Something isn't working), triage

@ryanaoleary (Contributor) commented on Jan 16, 2025

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Environment:

  • Ray 2.40.0
  • KubeRay operator version 1.2.2

Currently, the V2 Autoscaler test case of TestRayClusterAutoscalerMinReplicasUpdate in /kuberay/ray-operator/test/e2eautoscaler/raycluster_autoscaler_test.go stalls when attempting to scale up 5 worker Pods using detached actors:

    raycluster_autoscaler_test.go:351: 
        Timed out after 300.001s.
        Expected
            <int32>: 4
        to equal
            <int32>: 5
=== NAME  TestRayClusterAutoscalerMinReplicasUpdate/Create_a_RayCluster_with_autoscaler_v2_enabled
    testing.go:1651: test executed panic(nil) or runtime.Goexit: subtest may have called FailNow on a parent test
=== NAME  TestRayClusterAutoscalerMinReplicasUpdate
    test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-head-hccws/ray-head logs
    test.go:98: Creating ephemeral output directory as KUBERAY_TEST_OUTPUT_DIR env variable is unset
    test.go:101: Output directory has been created at: /tmp/TestRayClusterAutoscalerMinReplicasUpdate682423055/001
    test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-head-hccws/autoscaler logs
    test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-test-group-worker-78ckm/ray-worker logs
    test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-test-group-worker-dbgr6/ray-worker logs
    test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-test-group-worker-fsxx9/ray-worker logs
    test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-test-group-worker-g9t8b/ray-worker logs
    test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-head-g2vxh/ray-head logs
    test.go:98: Creating ephemeral output directory as KUBERAY_TEST_OUTPUT_DIR env variable is unset
    test.go:101: Output directory has been created at: /tmp/TestRayClusterAutoscalerMinReplicasUpdate2655252486/002
    test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-head-g2vxh/autoscaler logs
    test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-test-group-worker-4qc8d/ray-worker logs
    test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-test-group-worker-69xkh/ray-worker logs
    test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-test-group-worker-ml4wg/ray-worker logs
    test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-test-group-worker-wvwvl/ray-worker logs
    test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-test-group-worker-wx4fp/ray-worker logs
--- FAIL: TestRayClusterAutoscalerMinReplicasUpdate (367.76s)
    --- PASS: TestRayClusterAutoscalerMinReplicasUpdate/Create_a_RayCluster_with_autoscaling_enabled (33.68s)
    --- FAIL: TestRayClusterAutoscalerMinReplicasUpdate/Create_a_RayCluster_with_autoscaler_v2_enabled (331.37s)
FAIL
FAIL    github.com/ray-project/kuberay/ray-operator/test/e2eautoscaler  367.821s
FAIL
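
The timed-out assertion above appears to be a gomega Eventually-style check on the worker replica count. Below is a minimal illustrative sketch only, not the actual e2e test code: the polling helper is a hypothetical stub, and the real suite uses its own TestTimeoutLong constant and the KubeRay test support helpers rather than the literal duration shown here.

package e2eautoscaler

import (
	"testing"
	"time"

	"github.com/onsi/gomega"
)

// TestWorkerScaleUpSketch only illustrates the shape of an extended-timeout
// Eventually assertion on the worker replica count; it is not the real test.
func TestWorkerScaleUpSketch(t *testing.T) {
	g := gomega.NewWithT(t)

	// Hypothetical polling stub standing in for reading the RayCluster's
	// available worker replica count from the Kubernetes API.
	workerReplicas := func() int32 { return 5 }

	// Poll for up to 10 minutes, every 5 seconds; TestTimeoutLong plays the
	// role of this extended deadline in the real test.
	g.Eventually(workerReplicas, 10*time.Minute, 5*time.Second).
		Should(gomega.Equal(int32(5)))
}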

I edited the above test to use TestTimeoutLong to check whether it was simply a latency issue. The same test passes for the V1 Autoscaler test case. I'm able to recreate this bug by creating a RayCluster with the following manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-autoscaler
spec:
  rayVersion: '2.40.0'
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        containers:
        # The Ray head container
        - name: ray-head
          image: rayproject/ray:2.40.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "1"
              memory: "2G"
            requests:
              cpu: "1"
              memory: "2G"
          env:
            - name: RAY_enable_autoscaler_v2 # Pass env var for the autoscaler v2.
              value: "1"
          volumeMounts:
            - mountPath: /home/ray/samples
              name: ray-example-configmap
        volumes:
          - name: ray-example-configmap
            configMap:
              name: ray-example
              defaultMode: 0777
              items:
                - key: detached_actor.py
                  path: detached_actor.py
                - key: terminate_detached_actor.py
                  path: terminate_detached_actor.py
        restartPolicy: Never # No restart to avoid reuse of pod for different ray nodes.
  workerGroupSpecs:
  - replicas: 0
    minReplicas: 0
    maxReplicas: 10
    groupName: small-group
    rayStartParams: {}
    # Pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.40.0
          resources:
            limits:
              cpu: "1"
              memory: "1G"
            requests:
              cpu: "1"
              memory: "1G"
        restartPolicy: Never # Never restart a pod to avoid pod reuse
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  detached_actor.py: |
    import ray
    import sys

    @ray.remote(num_cpus=1)
    class Actor:
      pass

    ray.init(namespace="default_namespace")
    Actor.options(name=sys.argv[1], lifetime="detached").remote()

  terminate_detached_actor.py: |
    import ray
    import sys

    ray.init(namespace="default_namespace")
    detached_actor = ray.get_actor(sys.argv[1])
    ray.kill(detached_actor)

and scaling up multiple worker Pods using detached actors as follows:

export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)

for i in `seq 0 10`; do kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor_$i; done
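
(To clean up between attempts, not part of the reproduction itself: the matching terminate script mounted from the same ConfigMap removes the detached actors again, after which the idle workers scale back down once idleTimeoutSeconds elapses.)

for i in `seq 0 10`; do kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/terminate_detached_actor.py actor_$i; done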

Fewer worker Pods than expected are scaled up; only 8 are created, even though the 11 detached actors (1 CPU each) should drive the group to its maxReplicas of 10:

kubectl get pods
NAME                                             READY   STATUS    RESTARTS   AGE
kuberay-operator-5c4f69b57-8vzt6                 1/1     Running   0          6h39m
raycluster-autoscaler-head-m9cbt                 2/2     Running   0          74s
raycluster-autoscaler-small-group-worker-2kt9v   0/1     Running   0          14s
raycluster-autoscaler-small-group-worker-bg2ht   0/1     Running   0          20s
raycluster-autoscaler-small-group-worker-bxqb8   1/1     Running   0          20s
raycluster-autoscaler-small-group-worker-k4jrn   0/1     Running   0          20s
raycluster-autoscaler-small-group-worker-khzl4   0/1     Running   0          20s
raycluster-autoscaler-small-group-worker-m9p6t   0/1     Running   0          14s
raycluster-autoscaler-small-group-worker-svhpc   0/1     Running   0          14s
raycluster-autoscaler-small-group-worker-wmcp6   1/1     Running   0          26s

The list of created actors shows that 10 worker Pods should be scaled up:

kubectl exec -it $HEAD_POD -- ray list actors

Defaulted container "ray-head" out of: ray-head, autoscaler

======== List: 2025-01-16 03:51:36.628032 ========
Stats:
------------------------------
Total: 11

Table:
------------------------------
    ACTOR_ID                          CLASS_NAME    STATE             JOB_ID    NAME      NODE_ID                                                     PID  RAY_NAMESPACE
 0  1db88b5eff174346c94f304408000000  Actor         PENDING_CREATION  08000000  actor_7                                                                 0  default_namespace
 1  41b8f0832180ec95ba5be9350a000000  Actor         PENDING_CREATION  0a000000  actor_9                                                                 0  default_namespace
 2  5850a4a349801b9cd994255b01000000  Actor         PENDING_CREATION  01000000  actor_0                                                                 0  default_namespace
 3  7068aa7e8cc399030c341f420b000000  Actor         ALIVE             0b000000  actor_10  dffcc7dc9d323aaa6d14fac0b40dc8e4007cb95b1d0b95f7998f3b33     79  default_namespace
 4  73954bd88c4cfbc18301d36a03000000  Actor         ALIVE             03000000  actor_2   b4d9cb9779486f33e7040dfaa736798dcabcf407a2d287223289bddd     79  default_namespace
 5  959664a41d2197ddef7c34a302000000  Actor         ALIVE             02000000  actor_1   133f16940fe5e8299c6a46948011e4df444167b6b89d4e5f5c2f3c40     79  default_namespace
 6  9aaa13c7cd632c65e69a032f09000000  Actor         ALIVE             09000000  actor_8   8892c00aac6fd01c414ed57105b490428d8e6a6d780d2b75c0439298     79  default_namespace
 7  9c50b72699b516710960bae405000000  Actor         ALIVE             05000000  actor_4   43098baaab0d3c702ffec751aae3d8af423152644f5dc8a7d5a16ff1     79  default_namespace
 8  d57ef699174cfab8bcff1a2606000000  Actor         ALIVE             06000000  actor_5   a80c53244cb31c736a1c25852043db8d1d9203d52e21e69db373fdf7     79  default_namespace
 9  e9322eb810e7e2adf742a14307000000  Actor         ALIVE             07000000  actor_6   63ec8e964d21c375e8d13bd3a63f846e666da81f77caa8d452d69ac6     79  default_namespace
10  fb4c5974fb71dac58f2176f504000000  Actor         ALIVE             04000000  actor_3   af91b1b67eccea0937216e89443cc1d5aa584b85e071684b6bd1069c     81  default_namespace

I've attached the KubeRay operator logs and the V2 Autoscaler logs from the head Pod container below:

operator-1.2.2-logs.txt
autoscaler-v2-logs.txt
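
For anyone gathering the same diagnostics while reproducing this, the autoscaler's view of pending resource demands and the V2 autoscaler sidecar logs can be pulled from the head Pod with standard commands (their output is not included here):

kubectl exec -it $HEAD_POD -- ray status   # resource demands as seen by the autoscaler
kubectl logs $HEAD_POD -c autoscaler       # logs from the autoscaler sidecar container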

Reproduction script

From /kuberay/ray-operator run:

go test -timeout 30m -v ./test/e2eautoscaler

or follow the above instructions to validate this bug manually.
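
To run only the affected test rather than the entire suite, the standard go test -run flag (a generic Go tooling option, not specific to this repo) can filter by test name:

go test -timeout 30m -v ./test/e2eautoscaler -run TestRayClusterAutoscalerMinReplicasUpdate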

Anything else

I've found that this bug occurs reliably, and it is more common when creating a larger number of tasks/actors. For the manual reproduction I increased the number of detached actors to 11, because 5 Pods were scaling up correctly by hand (even though the automated test consistently fails with that same number).

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@ryanaoleary added the bug (Something isn't working) and triage labels on Jan 16, 2025

@ryanaoleary (Contributor, Author) commented:

Related Issues:
#2600
#2561

@ryanaoleary (Contributor, Author) commented:

Update: upgrading to the nightly image fixes the issue in my manual tests, so the automated test failures can be ignored until the RayVersion returned by GetRayVersion is updated to the next Ray release.

@ryanaoleary closed this as not planned (won't fix, can't repro, duplicate, stale) on Jan 16, 2025
@kevin85421 reopened this on Jan 22, 2025
@kevin85421 (Member) commented:

We should try to use Ray 2.41 in CI when it is out.
