[Doc] Replace Head Node ServiceMonitor with PodMonitor (#49476)
Since KubeRay has changed the collection of head node metrics from
`ServiceMonitor` to `PodMonitor`, this PR updates the Ray docs to
reflect the current usage.

Ref: ray-project/kuberay#2689

---------

Signed-off-by: win5923 <[email protected]>
Signed-off-by: Blocka <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: angelinalg <[email protected]>
3 people authored Jan 7, 2025
1 parent abb11c3 commit af0e2df
Showing 2 changed files with 56 additions and 36 deletions.
Binary file modified doc/source/cluster/kubernetes/images/prometheus_web_ui.png
92 changes: 56 additions & 36 deletions doc/source/cluster/kubernetes/k8s-ecosystem/prometheus-grafana.md
@@ -33,7 +33,7 @@ kubectl get all -n prometheus-system
# deployment.apps/prometheus-kube-state-metrics   1/1     1            1           46s
```

* KubeRay provides an [install.sh script](https://github.com/ray-project/kuberay/blob/master/install/prometheus/install.sh) to install the [kube-prometheus-stack v48.2.1](https://github.com/prometheus-community/helm-charts/tree/kube-prometheus-stack-48.2.1/charts/kube-prometheus-stack) chart and related custom resources, including **ServiceMonitor**, **PodMonitor** and **PrometheusRule**, in the namespace `prometheus-system` automatically.
* KubeRay provides an [install.sh script](https://github.com/ray-project/kuberay/blob/master/install/prometheus/install.sh) to install the [kube-prometheus-stack v48.2.1](https://github.com/prometheus-community/helm-charts/tree/kube-prometheus-stack-48.2.1/charts/kube-prometheus-stack) chart and related custom resources, including **PodMonitor** and **PrometheusRule**, in the namespace `prometheus-system` automatically.

* We made some modifications to the original `values.yaml` in the kube-prometheus-stack chart to allow embedding Grafana panels in Ray Dashboard. See [overrides.yaml](https://github.com/ray-project/kuberay/tree/master/install/prometheus/overrides.yaml) for more details.
```yaml
@@ -62,8 +62,8 @@ kubectl apply -f ray-cluster.embed-grafana.yaml
kubectl get pod -l ray.io/node-type=head

# Example output:
# NAME                                  READY   STATUS    RESTARTS   AGE
# raycluster-kuberay-head-btwc2         1/1     Running   0          63s
# NAME                                  READY   STATUS    RESTARTS   AGE
# raycluster-embed-grafana-head-98fqt   1/1     Running   0          11m

# Wait until all Ray Pods are running and forward the port of the Prometheus metrics endpoint in a new terminal.
kubectl port-forward ${RAYCLUSTER_HEAD_POD} 8080:8080
@@ -72,13 +72,15 @@ curl localhost:8080
# Example output (Prometheus metrics format):
# # HELP ray_spill_manager_request_total Number of {spill, restore} requests.
# # TYPE ray_spill_manager_request_total gauge
# ray_spill_manager_request_total{Component="raylet",NodeAddress="10.244.0.13",Type="Restored",Version="2.0.0"} 0.0
# ray_spill_manager_request_total{Component="raylet", NodeAddress="10.244.0.13", SessionName="session_2025-01-02_07-58-21_419367_11", Type="FailedDeletion", Version="2.9.0", container="ray-head", endpoint="metrics", instance="10.244.0.13:8080", job="prometheus-system/ray-head-monitor", namespace="default", pod="raycluster-embed-grafana-head-98fqt", ray_io_cluster="raycluster-embed-grafana"} 0

# Ensure that the port (8080) for the metrics endpoint is also defined in the head's Kubernetes service.
kubectl get service

# NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                                     AGE
# raycluster-kuberay-head-svc         ClusterIP   10.96.201.142   <none>        6379/TCP,8265/TCP,8080/TCP,8000/TCP,10001/TCP               106m
# NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                                     AGE
# kuberay-operator                    ClusterIP   10.96.137.190   <none>        8080/TCP                                                    13m
# kubernetes                          ClusterIP   10.96.0.1       <none>        443/TCP                                                     14m
# raycluster-embed-grafana-head-svc   ClusterIP   None            <none>        44217/TCP,10001/TCP,44227/TCP,8265/TCP,6379/TCP,8080/TCP    13m
```

* KubeRay exposes a Prometheus metrics endpoint on port **8080** via a built-in exporter by default. Hence, we do not need to install any external exporter.
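
As a rough sketch (not the full manifest from this tutorial; the image tag is an assumption), the RayCluster head group exposes the metrics endpoint as a named container port, and the PodMonitor in Step 5 scrapes that port by its `metrics` name:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-embed-grafana
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0  # assumption: any recent Ray image works
          ports:
          - containerPort: 8080
            name: metrics  # the PodMonitor in Step 5 selects this named port
```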
@@ -103,37 +105,54 @@ kubectl get service
Because we forward the port of Grafana to `127.0.0.1:3000` in this example, we set `RAY_GRAFANA_IFRAME_HOST` to `http://127.0.0.1:3000`.
* `http://` is required.
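
As a sketch only (the Grafana and Prometheus service addresses below are assumptions for a default kube-prometheus-stack release named `prometheus` in the `prometheus-system` namespace, not necessarily the exact values in `ray-cluster.embed-grafana.yaml`), these environment variables typically sit on the head container of the RayCluster sketched above:

```yaml
headGroupSpec:
  template:
    spec:
      containers:
      - name: ray-head
        env:
        # Address the browser uses to load embedded Grafana panels.
        - name: RAY_GRAFANA_IFRAME_HOST
          value: http://127.0.0.1:3000
        # Address the head Pod uses to reach Grafana inside the cluster (assumed service name).
        - name: RAY_GRAFANA_HOST
          value: http://prometheus-grafana.prometheus-system.svc:80
        # Address the head Pod uses to reach Prometheus inside the cluster (assumed service name).
        - name: RAY_PROMETHEUS_HOST
          value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
```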

## Step 5: Collect Head Node metrics with a ServiceMonitor
## Step 5: Collect Head Node metrics with a PodMonitor

RayService creates two Kubernetes services for the head Pod: one managed by the RayService and the other by the underlying RayCluster. Therefore, it's recommended to use a PodMonitor to monitor head Pod metrics, to avoid misconfigurations that could double-count the same metrics when using a ServiceMonitor.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
kind: PodMonitor
metadata:
  name: ray-head-monitor
  namespace: prometheus-system
  labels:
    # `release: $HELM_RELEASE`: Prometheus can only detect ServiceMonitor with this label.
    # `release: $HELM_RELEASE`: Prometheus can only detect PodMonitor with this label.
    release: prometheus
  name: ray-head-monitor
  namespace: prometheus-system
spec:
  jobLabel: ray-head
  # Only select Kubernetes Services in the "default" namespace.
  # Only select Kubernetes Pods in the "default" namespace.
  namespaceSelector:
    matchNames:
    - default
  # Only select Kubernetes Services with "matchLabels".
  # Only select Kubernetes Pods with "matchLabels".
  selector:
    matchLabels:
      ray.io/node-type: head
  # A list of endpoints allowed as part of this ServiceMonitor.
  endpoints:
  # A list of endpoints allowed as part of this PodMonitor.
  podMetricsEndpoints:
  - port: metrics
  targetLabels:
  - ray.io/cluster
    relabelings:
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_label_ray_io_cluster
      targetLabel: ray_io_cluster
  - port: as-metrics # autoscaler metrics
    relabelings:
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_label_ray_io_cluster
      targetLabel: ray_io_cluster
  - port: dash-metrics # dashboard metrics
    relabelings:
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_label_ray_io_cluster
      targetLabel: ray_io_cluster
```
* The YAML example above is [serviceMonitor.yaml](https://github.com/ray-project/kuberay/blob/master/config/prometheus/serviceMonitor.yaml), and it is created by **install.sh**. Hence, no need to create anything here.
* See [ServiceMonitor official document](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#servicemonitor) for more details about the configurations.
* `release: $HELM_RELEASE`: Prometheus can only detect ServiceMonitor with this label.
* The **install.sh** script creates the above YAML example, [podMonitor.yaml](https://github.com/ray-project/kuberay/blob/master/config/prometheus/podMonitor.yaml#L26-L63), so you don't need to create anything yourself.
* See the [PodMonitor official documentation](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#podmonitor) for more details about the configuration.
* `release: $HELM_RELEASE`: Prometheus can only detect PodMonitor with this label. See [here](#prometheus-can-only-detect-this-label) for more details.

(prometheus-can-only-detect-this-label)=
```sh
@@ -143,9 +162,6 @@ spec:
# prometheus prometheus-system 1 2023-02-06 06:27:05.530950815 +0000 UTC deployed kube-prometheus-stack-44.3.1 v0.62.0
kubectl get prometheuses.monitoring.coreos.com -n prometheus-system -oyaml
#   serviceMonitorSelector:
#     matchLabels:
#       release: prometheus
#   podMonitorSelector:
#     matchLabels:
#       release: prometheus
@@ -154,20 +170,26 @@
#       release: prometheus
```

* `namespaceSelector` and `seletor` are used to select exporter's Kubernetes service. Because Ray uses a built-in exporter, the **ServiceMonitor** selects Ray's head service which exposes the metrics endpoint (i.e. port 8080 here).
* Prometheus uses `namespaceSelector` and `selector` to select Kubernetes Pods.
```sh
kubectl get service -n default -l ray.io/node-type=head
# NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                         AGE
# raycluster-kuberay-head-svc   ClusterIP   10.96.201.142   <none>        6379/TCP,8265/TCP,8080/TCP,8000/TCP,10001/TCP   153m
kubectl get pod -n default -l ray.io/node-type=head
# NAME                                  READY   STATUS    RESTARTS   AGE
# raycluster-embed-grafana-head-khfs4   1/1     Running   0          4m38s
```

* `relabelings`: This configuration renames the label `__meta_kubernetes_pod_label_ray_io_cluster` to `ray_io_cluster` in the scraped metrics. It ensures that each metric includes the name of the RayCluster to which the Pod belongs. This configuration is especially useful for distinguishing metrics when deploying multiple RayClusters. For example, a metric with the `ray_io_cluster` label might look like this:

```
ray_node_cpu_count{SessionName="session_2025-01-02_07-58-21_419367_11", container="ray-head", endpoint="metrics", instance="10.244.0.13:8080", ip="10.244.0.13", job="raycluster-embed-grafana-head-svc", namespace="default", pod="raycluster-embed-grafana-head-98fqt", ray_io_cluster="raycluster-embed-grafana", service="raycluster-embed-grafana-head-svc"}
```
* `targetLabels`: We added `spec.targetLabels[0].ray.io/cluster` because we want to include the name of the RayCluster in the metrics that will be generated by this ServiceMonitor. The `ray.io/cluster` label is part of the Ray head node service and it will be transformed into a `ray_io_cluster` metric label. That is, any metric that will be imported, will also contain the following label `ray_io_cluster=<ray-cluster-name>`. This may seem optional but it becomes mandatory if you deploy multiple RayClusters.
In this example, `raycluster-embed-grafana` is the name of the RayCluster.
## Step 6: Collect Worker Node metrics with PodMonitors
KubeRay operator does not create a Kubernetes service for the Ray worker Pods, therefore we cannot use a Prometheus ServiceMonitor to scrape the metrics from the worker Pods. To collect worker metrics, we can use `Prometheus PodMonitors CRD` instead.
Similar to the head Pod, this tutorial also uses a PodMonitor to collect metrics from worker Pods. The reason for using separate PodMonitors for head Pods and worker Pods is that the head Pod exposes multiple metric endpoints, whereas a worker Pod exposes only one.
**Note**: We could create a Kubernetes service with selectors a common label subset from our worker pods, however, this is not ideal because our workers are independent from each other, that is, they are not a collection of replicas spawned by replicaset controller. Due to that, we should avoid using a Kubernetes service for grouping them together.
**Note**: You could create a Kubernetes service with selectors that match a common label subset of your worker Pods; however, this configuration isn't ideal because the workers are independent of each other, that is, they aren't a collection of replicas spawned by a ReplicaSet controller. Due to this behavior, avoid using a Kubernetes service to group them together.
```yaml
apiVersion: monitoring.coreos.com/v1
@@ -178,7 +200,6 @@ metadata:
  labels:
    # `release: $HELM_RELEASE`: Prometheus can only detect PodMonitor with this label.
    release: prometheus
    ray.io/cluster: raycluster-kuberay # $RAY_CLUSTER_NAME: "kubectl get rayclusters.ray.io"
spec:
  jobLabel: ray-workers
  # Only select Kubernetes Pods in the "default" namespace.
@@ -192,22 +213,21 @@ spec:
  # A list of endpoints allowed as part of this PodMonitor.
  podMetricsEndpoints:
  - port: metrics
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_label_ray_io_cluster]
      targetLabel: ray_io_cluster
```

* `release: $HELM_RELEASE`: Prometheus can only detect PodMonitor with this label. See [here](#prometheus-can-only-detect-this-label) for more details.

* The **PodMonitor** uses `namespaceSelector` and `selector` to select Kubernetes Pods.
```sh
kubectl get pod -n default -l ray.io/node-type=worker
# NAME                                          READY   STATUS    RESTARTS   AGE
# raycluster-kuberay-worker-workergroup-5stpm   1/1     Running   0          3h16m
```

* `ray.io/cluster: $RAY_CLUSTER_NAME`: We also define `metadata.labels` by manually adding `ray.io/cluster: <ray-cluster-name>` and then instructing the PodMonitors resource to add that label in the scraped metrics via `spec.podTargetLabels[0].ray.io/cluster`.

## Step 7: Collect custom metrics with Recording Rules

[Recording Rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) allow us to precompute frequently needed or computationally expensive [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) expressions and save their result as custom metrics. Note this is different from [Custom Application-level Metrics](application-level-metrics) which aim for the visibility of ray applications.
[Recording Rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) allow Prometheus to precompute frequently needed or computationally expensive [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) expressions and save their results as custom metrics. Note that this behavior is different from [Custom application-level metrics](application-level-metrics), which provide visibility into Ray applications.

```yaml
apiVersion: monitoring.coreos.com/v1
```
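
As an illustrative sketch only (the group name, rule name, and PromQL expression below are assumptions, not the tutorial's actual rule), a PrometheusRule that precomputes the average CPU utilization per RayCluster might look like this:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ray-cluster-recording-rules  # assumption: any name works
  namespace: prometheus-system
  labels:
    # `release: $HELM_RELEASE`: Prometheus can only detect PrometheusRule with this label.
    release: prometheus
spec:
  groups:
  - name: ray.rules  # assumed group name
    rules:
    # Precompute the mean CPU utilization across the nodes of each RayCluster.
    - record: ray_cluster:node_cpu_utilization:avg
      expr: avg(ray_node_cpu_utilization) by (ray_io_cluster)
```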
