Search before asking
I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
Environment:
Ray 2.40.0
KubeRay operator version 1.2.2
Currently the V2 Autoscaler test case for TestRayClusterAutoscalerMinReplicasUpdate in /kuberay/ray-operator/test/e2eautoscaler/raycluster_autoscaler_test.go is stalling when attempting to scale up 5 worker Pods using detached actors:
raycluster_autoscaler_test.go:351:
Timed out after 300.001s.
Expected
<int32>: 4
to equal
<int32>: 5
=== NAME TestRayClusterAutoscalerMinReplicasUpdate/Create_a_RayCluster_with_autoscaler_v2_enabled
testing.go:1651: test executed panic(nil) or runtime.Goexit: subtest may have called FailNow on a parent test
=== NAME TestRayClusterAutoscalerMinReplicasUpdate
test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-head-hccws/ray-head logs
test.go:98: Creating ephemeral output directory as KUBERAY_TEST_OUTPUT_DIR env variable is unset
test.go:101: Output directory has been created at: /tmp/TestRayClusterAutoscalerMinReplicasUpdate682423055/001
test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-head-hccws/autoscaler logs
test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-test-group-worker-78ckm/ray-worker logs
test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-test-group-worker-dbgr6/ray-worker logs
test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-test-group-worker-fsxx9/ray-worker logs
test.go:110: Retrieving Pod Container test-ns-nnpgt/ray-cluster-test-group-worker-g9t8b/ray-worker logs
test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-head-g2vxh/ray-head logs
test.go:98: Creating ephemeral output directory as KUBERAY_TEST_OUTPUT_DIR env variable is unset
test.go:101: Output directory has been created at: /tmp/TestRayClusterAutoscalerMinReplicasUpdate2655252486/002
test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-head-g2vxh/autoscaler logs
test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-test-group-worker-4qc8d/ray-worker logs
test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-test-group-worker-69xkh/ray-worker logs
test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-test-group-worker-ml4wg/ray-worker logs
test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-test-group-worker-wvwvl/ray-worker logs
test.go:110: Retrieving Pod Container test-ns-rwmwl/ray-cluster-test-group-worker-wx4fp/ray-worker logs
--- FAIL: TestRayClusterAutoscalerMinReplicasUpdate (367.76s)
--- PASS: TestRayClusterAutoscalerMinReplicasUpdate/Create_a_RayCluster_with_autoscaling_enabled (33.68s)
--- FAIL: TestRayClusterAutoscalerMinReplicasUpdate/Create_a_RayCluster_with_autoscaler_v2_enabled (331.37s)
FAIL
FAIL github.com/ray-project/kuberay/ray-operator/test/e2eautoscaler 367.821s
FAIL
I edited the above test to use TestTimeoutLong to check whether it was simply a latency issue. The same test passes for the V1 Autoscaler test case. I'm able to recreate this bug by creating a RayCluster with the following manifest:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-autoscaler
spec:
  rayVersion: '2.40.0'
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        containers:
        # The Ray head container
        - name: ray-head
          image: rayproject/ray:2.40.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "1"
              memory: "2G"
            requests:
              cpu: "1"
              memory: "2G"
          env:
          - name: RAY_enable_autoscaler_v2 # Pass env var for the autoscaler v2.
            value: "1"
          volumeMounts:
          - mountPath: /home/ray/samples
            name: ray-example-configmap
        volumes:
        - name: ray-example-configmap
          configMap:
            name: ray-example
            defaultMode: 0777
            items:
            - key: detached_actor.py
              path: detached_actor.py
            - key: terminate_detached_actor.py
              path: terminate_detached_actor.py
        restartPolicy: Never # No restart to avoid reuse of pod for different ray nodes.
  workerGroupSpecs:
  - replicas: 0
    minReplicas: 0
    maxReplicas: 10
    groupName: small-group
    rayStartParams: {}
    # Pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.40.0
          resources:
            limits:
              cpu: "1"
              memory: "1G"
            requests:
              cpu: "1"
              memory: "1G"
        restartPolicy: Never # Never restart a pod to avoid pod reuse
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  detached_actor.py: |
    import ray
    import sys
    @ray.remote(num_cpus=1)
    class Actor:
      pass
    ray.init(namespace="default_namespace")
    Actor.options(name=sys.argv[1], lifetime="detached").remote()
  terminate_detached_actor.py: |
    import ray
    import sys
    ray.init(namespace="default_namespace")
    detached_actor = ray.get_actor(sys.argv[1])
    ray.kill(detached_actor)
and scaling up multiple worker Pods using detached actors as follows:
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
for i in `seq 0 10`; do kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor_$i; done
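To observe the resulting scale-up, here is a minimal sketch of the checks to run (the ray.io/node-type=worker label selector and the ray list actors subcommand are assumptions based on KubeRay and Ray defaults, not something the test itself uses):
# Count the worker Pods the autoscaler has created (KubeRay labels worker Pods with ray.io/node-type=worker).
kubectl get pods --selector=ray.io/node-type=worker --no-headers | wc -l
# List the detached actors registered in the cluster via the Ray state CLI on the head Pod.
kubectl exec -it $HEAD_POD -- ray list actors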
Fewer than the expected number of Pods are scaled up (8 rather than 11), while the Actors created show that there should be 10 workers scaled up. I attached below the KubeRay operator logs and the V2 Autoscaler logs from the Head Pod container:
operator-1.2.2-logs.txt
autoscaler-v2-logs.txt
Reproduction script
From /kuberay/ray-operator, run the e2e autoscaler test for TestRayClusterAutoscalerMinReplicasUpdate, or follow the above instructions to validate this bug manually.
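A minimal sketch of that invocation, assuming only the standard go test tooling (the exact timeout and flags may differ):
# From /kuberay/ray-operator: run only the failing test with a generous timeout (standard go test flags).
go test -timeout 30m -v ./test/e2eautoscaler/ -run TestRayClusterAutoscalerMinReplicasUpdate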
Anything else
I've found that this bug occurs reliably, but it is more common when creating a larger number of tasks/actors. For the manual reproduction, I increased the number of detached actors to 11 because the cluster was correctly scaling up 5 Pods (despite the automated test consistently failing for the same number).
Are you willing to submit a PR?
Yes I am willing to submit a PR!
Update: upgrading to the nightly image fixes the issue in my manual tests, so the automated test failures can be ignored until the RayVersion returned by GetRayVersion is updated to the next Ray release.
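For reference, one way to switch the manifest above to the nightly image for such a manual test (a sketch; the file name is hypothetical and the rayVersion field may also need bumping):
# Hypothetical file name for the manifest above; swap both container image tags to the nightly build.
sed -i 's|rayproject/ray:2.40.0|rayproject/ray:nightly|g' ray-cluster.autoscaler-v2.yaml
kubectl apply -f ray-cluster.autoscaler-v2.yaml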