[jobs] set a global limit of 2000 parallel jobs
Based on testing, we start to hit other significant issues past this
point, such as the global_user_state SQLite database breaking.
cg505 committed Feb 4, 2025
1 parent 25ae5f2 commit 4d9badb
Showing 1 changed file with 10 additions and 1 deletion.
11 changes: 10 additions & 1 deletion sky/jobs/scheduler.py

```diff
@@ -250,7 +250,16 @@ def _get_job_parallelism() -> int:
     # Assume a running job uses 350MB memory.
     # We observe 230-300 in practice.
     job_memory = 350 * 1024 * 1024
-    return max(psutil.virtual_memory().total // job_memory, 1)
+
+    # Past 2000 simultaneous jobs, we become unstable due to other scale issues.
+    max_job_limit = 2000
+
+    job_limit = min(psutil.virtual_memory().total // job_memory, max_job_limit)
+
+    if job_limit < 1:
+        return 1
+
+    return job_limit


 def _get_launch_parallelism() -> int:
```
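The clamp in the diff can be sketched as a pure function (a simplified, hypothetical rewrite: the function name differs from the source's `_get_job_parallelism`, and total memory is passed in as a parameter rather than read via `psutil.virtual_memory()`):

```python
def get_job_parallelism(total_memory_bytes: int,
                        max_job_limit: int = 2000) -> int:
    """Memory-derived job parallelism, capped at max_job_limit, floored at 1."""
    # Assume each running job uses ~350MB of memory (230-300MB observed).
    job_memory = 350 * 1024 * 1024
    job_limit = min(total_memory_bytes // job_memory, max_job_limit)
    # A machine too small to fit even one job still runs one.
    return max(job_limit, 1)


# 16 GiB host: limited by memory, well under the global cap.
print(get_job_parallelism(16 * 1024**3))    # 46
# 1 TiB host: memory would allow ~2995 jobs, so the 2000 cap kicks in.
print(get_job_parallelism(1024 * 1024**3))  # 2000
# Tiny host: floored at 1.
print(get_job_parallelism(100 * 1024**2))   # 1
```

Note the refactor preserves the old `max(..., 1)` floor while adding the new ceiling, so both boundary behaviors survive.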
