[Jobs] Revisit Ray Job execution and monitoring #45120

Open · wants to merge 123 commits into base: master
Conversation

@alexeykudinkin (Contributor) commented May 3, 2024

Why are these changes needed?

Context

Motivations for this refactoring are multiple:

  • Consolidating all of the job management in one place (JobSupervisor; previously spread between JobManager and JobSupervisor)

  • Decoupling management and monitoring of the job execution from the execution of the job driver itself (previously coupled inside JobSupervisor)

These steps are necessary to be able to:

  • Perform job management and monitoring exclusively from the head node
  • Run actual job drivers on any node (worker or head)

Changes

With the stated goals in mind, the following are the primary changes (with the rest just facilitating this migration):

  1. All of the job management and monitoring is consolidated inside JobSupervisor (always running on the head node)
  2. Actual job driver execution is performed by a (descendant) JobExecutor actor, which can run on any node (see the sketch below)
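A minimal sketch of this split, using simplified stand-ins for the real classes (the actor names, signatures, and the `run`/`run_job` methods here are illustrative assumptions, not the PR's actual code):

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy


@ray.remote(num_cpus=0)
class JobExecutor:
    """Runs the job driver (an entrypoint subprocess) on whichever node it lands on."""

    def run(self, entrypoint: str) -> int:
        import subprocess

        # Run the entrypoint and return its exit code.
        return subprocess.run(entrypoint, shell=True).returncode


@ray.remote(num_cpus=0)
class JobSupervisor:
    """Owns job management and monitoring; intended to always live on the head node."""

    def run_job(self, entrypoint: str) -> int:
        # The executor uses Ray's default scheduling, so it may land on any node.
        executor = JobExecutor.remote()
        return ray.get(executor.run.remote(entrypoint))


if __name__ == "__main__":
    ray.init()
    # Soft-pin the supervisor to the current (head) node, mirroring the PR's goal.
    supervisor = JobSupervisor.options(
        scheduling_strategy=NodeAffinitySchedulingStrategy(
            node_id=ray.get_runtime_context().get_node_id(), soft=True
        )
    ).remote()
    print(ray.get(supervisor.run_job.remote("echo hello")))
```

The key point is placement: the supervisor is soft-pinned to the head node via NodeAffinitySchedulingStrategy, while the executor falls back to Ray's default scheduling and can land on any node.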

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

```python
            finally:
                self.monitored_jobs.remove(job_id)

    async def _monitor_job_internal(
```
This moved to JobSupervisor._monitor_job_internal

```python
        self._logger = logging.getLogger(f"{__name__}.supervisor-{job_id}")
        self._configure_logger()

    def _configure_logger(self) -> None:
```
Has been promoted to a module-level function so it can be reused
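For illustration, a module-level helper along these lines could be shared by both actors (the name `_configure_supervisor_logger` and the handler details are assumptions, not the PR's implementation):

```python
import logging


def _configure_supervisor_logger(job_id: str, log_file: str) -> logging.Logger:
    """Create a per-job logger that supervisor and executor code can both reuse."""
    logger = logging.getLogger(f"{__name__}.supervisor-{job_id}")
    handler = logging.FileHandler(log_file)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s -- %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```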

Comment on lines -352 to -390
```python
            polling_task.cancel()
            if sys.platform == "win32" and self._win32_job_object:
                win32job.TerminateJobObject(self._win32_job_object, -1)
            elif sys.platform != "win32":
                stop_signal = os.environ.get("RAY_JOB_STOP_SIGNAL", "SIGTERM")
                if stop_signal not in self.VALID_STOP_SIGNALS:
                    self._logger.warning(
                        f"{stop_signal} not a valid stop signal. Terminating "
                        "job with SIGTERM."
                    )
                    stop_signal = "SIGTERM"

                job_process = psutil.Process(child_pid)
                proc_to_kill = [job_process] + job_process.children(recursive=True)

                # Send stop signal and wait for job to terminate gracefully,
                # otherwise SIGKILL job forcefully after timeout.
                self._kill_processes(proc_to_kill, getattr(signal, stop_signal))
                try:
                    stop_job_wait_time = int(
                        os.environ.get(
                            "RAY_JOB_STOP_WAIT_TIME_S",
                            self.DEFAULT_RAY_JOB_STOP_WAIT_TIME_S,
                        )
                    )
                    poll_job_stop_task = create_task(self._poll_all(proc_to_kill))
                    await asyncio.wait_for(poll_job_stop_task, stop_job_wait_time)
                    self._logger.info(
                        f"Job {self._job_id} has been terminated gracefully "
                        f"with {stop_signal}."
                    )
                except asyncio.TimeoutError:
                    self._logger.warning(
                        f"Attempt to gracefully terminate job {self._job_id} "
                        f"through {stop_signal} has timed out after "
                        f"{stop_job_wait_time} seconds. Job is now being "
                        "force-killed with SIGKILL."
                    )
                    self._kill_processes(proc_to_kill, signal.SIGKILL)
```
Extracted as stop_process
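A rough sketch of what such an extracted helper could look like, distilled from the removed block above (the real `stop_process` signature and defaults in the PR may differ):

```python
import asyncio
import os
import signal

import psutil


async def stop_process(pid: int, timeout_s: int = 3) -> None:
    """Send the stop signal to the process tree, escalating to SIGKILL after timeout_s."""
    parent = psutil.Process(pid)
    procs = [parent] + parent.children(recursive=True)

    # Respect the configured stop signal, defaulting to SIGTERM.
    stop_signal = getattr(signal, os.environ.get("RAY_JOB_STOP_SIGNAL", "SIGTERM"))
    for proc in procs:
        try:
            proc.send_signal(stop_signal)
        except psutil.NoSuchProcess:
            pass

    # Wait for graceful exit, then force-kill any survivors.
    _, alive = await asyncio.get_running_loop().run_in_executor(
        None, psutil.wait_procs, procs, timeout_s
    )
    for proc in alive:
        try:
            proc.kill()
        except psutil.NoSuchProcess:
            pass
```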

```python
    # Timeout to finalize job status after job driver exiting
    JOB_STATUS_FINALIZATION_TIMEOUT_S = 60

    def __init__(
```
Most important changes in JobSupervisor:

  • JobSupervisor took over the whole job management lifecycle from JobManager
  • Actual driver execution has been moved to JobRunner

Comment on lines -101 to -125
```python
    def _get_actor_for_job(self, job_id: str) -> Optional[ActorHandle]:
        try:
            return ray.get_actor(
                JOB_ACTOR_NAME_TEMPLATE.format(job_id=job_id),
                namespace=SUPERVISOR_ACTOR_RAY_NAMESPACE,
            )
        except ValueError:  # Ray returns ValueError for nonexistent actor.
            return None
```
Extracted to utils (renamed to _get_supervisor_actor_for_job)

Comment on lines 73 to 82
```python
        self._log_client = JobLogStorageClient()
        self._supervisor_actor_cls = ray.remote(JobSupervisor)
        self.monitored_jobs = set()
        try:
            self.event_logger = get_event_logger(Event.SourceType.JOBS, logs_dir)
        except Exception:
            self.event_logger = None

        self._recover_running_jobs_event = asyncio.Event()
        run_background_task(self._recover_running_jobs())
```
Monitoring moved to JobSupervisor

Comment on lines -84 to -116
```python
    async def _recover_running_jobs(self):
        """Recovers all running jobs from the status client.

        For each job, we will spawn a coroutine to monitor it.
        Each will be added to self._running_jobs and reconciled.
        """
        try:
            all_jobs = await self._job_info_client.get_all_jobs()
            for job_id, job_info in all_jobs.items():
                if not job_info.status.is_terminal():
                    run_background_task(self._monitor_job(job_id))
        finally:
            # This event is awaited in `submit_job` to avoid race conditions between
            # recovery and new job submission, so it must always get set even if there
            # are exceptions.
            self._recover_running_jobs_event.set()
```
Deleted

Comment on lines +420 to +450
```python
    def _get_driver_scheduling_strategy(
        self, resources_specified: bool
    ) -> SchedulingStrategyT:
        """Get the scheduling strategy for the job.

        If resources_specified is true, or if the environment variable is set to
        allow the job's driver (entrypoint) to run on worker nodes, we will use Ray's
        default actor placement strategy. Otherwise, we will force the job to use the
        head node.

        Args:
            resources_specified: Whether the job specified any resources
                (CPUs, GPUs, or custom resources).

        Returns:
            The scheduling strategy to use for the job.
        """
        if resources_specified:
            return "DEFAULT"
        elif os.environ.get(RAY_JOB_ALLOW_DRIVER_ON_WORKER_NODES_ENV_VAR, "0") == "1":
            self._logger.info(
                f"({self._job_id}) {RAY_JOB_ALLOW_DRIVER_ON_WORKER_NODES_ENV_VAR} was set to 1. "
                "Using Ray's default actor scheduling strategy for the job "
                "driver instead of running it on the head node."
            )
            return "DEFAULT"

        # If the user did not specify any resources or set the driver on worker nodes
        # env var, we will run the driver on the head node.
        #
        # NOTE: This is preserved for compatibility reasons
        return NodeAffinitySchedulingStrategy(
            node_id=ray.worker.global_worker.current_node_id.hex(), soft=True
        )
```
Moved from JobManager (no logic changes)
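For context, this is how the two "DEFAULT" branches get triggered from the submission side (a hedged example using the public JobSubmissionClient API; the dashboard address and script name are placeholders):

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

# Branch 1: entrypoint resources specified -> resources_specified is True, so the
# driver is scheduled with Ray's default strategy and may run on any node.
client.submit_job(
    entrypoint="python my_script.py",
    entrypoint_num_cpus=1,
)

# Branch 2: no resources specified. If RAY_JOB_ALLOW_DRIVER_ON_WORKER_NODES=1 is set
# on the cluster, the default strategy is used as well; otherwise the driver is
# soft-pinned to the head node via NodeAffinitySchedulingStrategy.
client.submit_job(entrypoint="python my_script.py")
```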

Comment on lines +455 to +494
```python
    def _get_runner_runtime_env(
        self,
        *,
        user_runtime_env: Dict[str, Any],
        submission_id: str,
        entrypoint_resources_specified: bool,
    ) -> Dict[str, Any]:
        """Configure and return the runtime_env for the supervisor actor.

        Args:
            user_runtime_env: The runtime_env specified by the user.
            entrypoint_resources_specified: Whether the user specified resources in the
                submit_job() call. If so, we will skip the workaround introduced
                in #24546 for GPU detection and just use the user's resource
                requests, so that the behavior matches that of the user specifying
                resources for any other actor.

        Returns:
            The runtime_env for the supervisor actor.
        """
        # Make a copy to avoid mutating passed runtime_env.
        runtime_env = (
            copy.deepcopy(user_runtime_env) if user_runtime_env is not None else {}
        )

        # NOTE(edoakes): Can't use .get(, {}) here because we need to handle the case
        # where env_vars is explicitly set to `None`.
        env_vars = runtime_env.get("env_vars")
        if env_vars is None:
            env_vars = {}

        env_vars[ray_constants.RAY_WORKER_NICENESS] = "0"

        if not entrypoint_resources_specified:
            # Don't set CUDA_VISIBLE_DEVICES for the supervisor actor so the
            # driver can use GPUs if it wants to. This will be removed from
            # the driver's runtime_env so it isn't inherited by tasks & actors.
            env_vars[ray_constants.NOSET_CUDA_VISIBLE_DEVICES_ENV_VAR] = "1"
        runtime_env["env_vars"] = env_vars

        if os.getenv(RAY_STREAM_RUNTIME_ENV_LOG_TO_JOB_DRIVER_LOG_ENV_VAR, "0") == "1":
            config = runtime_env.get("config")
            # Empty fields may be set to None, so we need to check for None explicitly.
```
Moved from JobManager (_get_supervisor_runtime_env previously), no logic changes
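A small worked example of the transformation, based on the code above (illustrative only, not a test from the PR; constant names come from the quoted snippet):

```python
from ray._private import ray_constants

user_runtime_env = {"pip": ["requests"], "env_vars": {"FOO": "1"}}

# With entrypoint_resources_specified=False, the resulting runner runtime_env keeps
# the user's fields and adds two env vars on top of them:
expected_env_vars = {
    "FOO": "1",
    ray_constants.RAY_WORKER_NICENESS: "0",
    # Keeps CUDA_VISIBLE_DEVICES unset for the supervisor so the driver can see GPUs.
    ray_constants.NOSET_CUDA_VISIBLE_DEVICES_ENV_VAR: "1",
}
```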

Comment on lines -296 to -330
```python
        curr_info = await self._job_info_client.get_info(self._job_id)
        if curr_info is None:
            raise RuntimeError(f"Status could not be retrieved for job {self._job_id}.")
        curr_status = curr_info.status
        curr_message = curr_info.message
        if curr_status == JobStatus.RUNNING:
            raise RuntimeError(
                f"Job {self._job_id} is already in RUNNING state. "
                f"JobSupervisor.run() should only be called once. "
            )
        if curr_status != JobStatus.PENDING:
            raise RuntimeError(
                f"Job {self._job_id} is not in PENDING state. "
                f"Current status is {curr_status} with message {curr_message}."
            )

        if _start_signal_actor:
            # Block in PENDING state until start signal received.
            await _start_signal_actor.wait.remote()

        driver_agent_http_address = (
            "http://"
            f"{ray.worker.global_worker.node.node_ip_address}:"
            f"{ray.worker.global_worker.node.dashboard_agent_listen_port}"
        )
        driver_node_id = ray.worker.global_worker.current_node_id.hex()

        await self._job_info_client.put_status(
            self._job_id,
            JobStatus.RUNNING,
            jobinfo_replace_kwargs={
                "driver_agent_http_address": driver_agent_http_address,
                "driver_node_id": driver_node_id,
            },
        )
```
All state management has been lifted up to JobSupervisor (to consolidate ownership scope and avoid race conditions)

Comment on lines 958 to 990
```python
            os.environ.update(self._get_driver_env_vars())

            self._logger.info(f"({self._job_id}) Executing job driver's entrypoint")

            log_path = self._log_client.get_log_file_path(self._job_id)
            child_process = self._exec_entrypoint(log_path)
            child_pid = child_process.pid

            # Execute job's entrypoint in the subprocess
            self._driver_process = self._exec_entrypoint(log_path)

        except Exception as e:
            self._logger.error(
                f"({self._job_id}) Got unexpected exception while executing job's entrypoint: {repr(e)}",
                exc_info=e,
            )

    async def join(self) -> JobExecutionResult:
        """
        Joins job execution blocking until either of the following conditions is true

        1. Job driver has completed (subprocess exited)
        2. Job has been interrupted (due to Ray job being stopped)

        NOTE: This method should only be called after `JobExecutor.start`
        """

        try:
            assert self._driver_process, "Job's driver is not running"

            child_process = self._driver_process

            # Block until either of the following occurs:
            # - Process executing job's entrypoint completes (exits, returning specific exit-code)
```
Changes

  • Splitting the job running sequence into 2 methods, so that the job is annotated as RUNNING only if we're able to successfully start the entrypoint (see the sketch below)
  • No logic changes otherwise
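A minimal sketch of the resulting start/join split, with simplified types (this is an assumption-laden illustration, not the PR's actual JobExecutor):

```python
import asyncio
import subprocess
from dataclasses import dataclass


@dataclass
class JobExecutionResult:
    exit_code: int


class ExecutorSketch:
    def __init__(self, entrypoint: str):
        self._entrypoint = entrypoint
        self._driver_process = None

    def start(self) -> None:
        # Launch the driver subprocess; the job is reported RUNNING only once this succeeds.
        self._driver_process = subprocess.Popen(self._entrypoint, shell=True)

    async def join(self) -> JobExecutionResult:
        # Block until the driver exits (or the job is interrupted).
        assert self._driver_process, "Job's driver is not running"
        loop = asyncio.get_running_loop()
        exit_code = await loop.run_in_executor(None, self._driver_process.wait)
        return JobExecutionResult(exit_code=exit_code)
```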

@alexeykudinkin changed the title from "[WIP] Revisit Ray Job management" to "[Jobs] Revisit Ray Job execution and monitoring" on May 18, 2024
@alexeykudinkin requested a review from edoakes on May 18, 2024 01:48
@edoakes (Contributor) commented May 20, 2024

My understanding is that there are two primary issues we are trying to solve here:

  1. Always run the control plane/supervising logic on the head node to avoid issues if worker nodes are terminated (which should be assumed to happen frequently).
  2. Enable running job drivers on worker nodes if requested.

The minimal change we could make to address the above is to simply run the JobManager only on the head node (in the API server) but allow the spawned JobSupervisor actors to optionally run on worker nodes. This would require very little code change.

What is the reasoning for adding an additional component in the middle here (what you have renamed to the JobSupervisor)? The PR description says that this is necessary to satisfy (1) and (2) above but I don't follow. I also don't see how this is "consolidating all of the job management in one place" -- if anything, it is less consolidated now as there is a 3rd component in the mix which complicates edge case handling (e.g., if the JobSupervisor and/or the JobExecutor fail to start or exit unexpectedly).

Independent of what we decide as the end state, this is a huge refactor on code that has had a lot of subtle issues ironed out over time (and is currently very stable), so I'm not comfortable with merging a complete overhaul in one shot (the risk/reward ratio is out of whack). So let's discuss how to make the changes more incrementally.

@alexeykudinkin

@edoakes made the changes we've discussed regarding running JobSupervisor in-process in a separate PR to make review easier:

#45664

We can review these independently, then I'll squash them into one.

Commits:

  • Appropriately report job-run status from the runner; handle exceptions in the JobSupervisor
  • Properly handle job being stopped
  • …e_info` method; wire in job driver's Dashboard Agent details into `JobInfo`
  • Tidying up
  • …fy whether job driver is running
  • …as job-status in GCS as RUNNING