[do not merge] Ensure client and scheduler are resilient to server autoscaling #2277

Closed
wants to merge 6 commits into from

Conversation

trxcllnt
Contributor

@trxcllnt trxcllnt commented Oct 25, 2024

While profiling distributed build cluster performance, I found that forcing the client to fall back to local compilation is the largest contributor to overall build time. Presently this happens because of at least one bug, as well as sub-optimal error handling in the client and scheduler.

These issues are amplified when autoscaling sccache-dist servers: the errors happen more frequently and can lead to sub-optimal autoscaling behavior, which in turn leads to more errors, and so on.

So this PR is a collection of fixes for the sccache client, scheduler, and server to better support dist-server autoscaling, as well as general improvements for tracing and debugging distributed compilation across clients, schedulers, and workers.

  • 3c4547b Adds the output file, job_id, and server_id to client logs, and adds job_id to scheduler and server logs. This makes it significantly easier to trace build cluster failures across client and server logs.
    Nit: Searching through unstructured/ad-hoc log lines is difficult. Using something like structured_logger instead of env_logger would also improve this experience.
  • df2e4a1 Ensures the client and scheduler use the latest certificates for each server. This is necessary for resiliency when the servers scale in or out (more on this below).
  • fe83892 Ensures the scheduler is resilient to server errors, and attempts to allocate jobs to the next-best server candidate (more on this below).
  • 4657454 Adds the ability for clients to retry distributed compilations. In conjunction with the two prior commits, this ensures clients with jobs assigned to servers that are scaled in can ask the scheduler to allocate the job to a new server (more on this below).
  • 6cd9ff3 Adds envvar-configurable connection and request timeouts. Fixes #2276 (make "REQUEST_TIMEOUT_SECS" configurable).
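As a rough illustration of the last item, a timeout that can be overridden from the environment might be read as in the sketch below. The helper names and the fallback behavior are assumptions made for this example, and the environment variable name is left to the caller because the exact names used by the commit are not reproduced here.

```rust
use std::time::Duration;

/// Parse a timeout in whole seconds from an optional raw string, falling back
/// to `default_secs` when the value is absent or not a valid integer.
/// (Hypothetical helper, not the function used in the PR.)
fn timeout_from(raw: Option<&str>, default_secs: u64) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

/// Read a timeout from the environment variable `var`.
/// The variable name is supplied by the caller; none is hard-coded here.
fn timeout_from_env(var: &str, default_secs: u64) -> Duration {
    timeout_from(std::env::var(var).ok().as_deref(), default_secs)
}
```

Falling back to the default on a malformed value, rather than failing, is a design choice of this sketch; a real client might prefer to log a warning.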

Build cluster configuration

Before diving into these changes, I should describe the architecture of the cluster for which these changes are necessary.

  1. A Traefik API Gateway to terminate SSL and expose a single endpoint for clients. This could be any API Gateway/LB/router; I just like Traefik.
  2. An sccache-dist scheduler instance, which receives forwarded connections from Traefik.
  3. An autoscaling group of sccache-dist servers, which scale in and out based on load, and are associated with one of a fixed pool of ports on the Traefik instance. For example, if the ASG includes up to 10 instances, Traefik will open 10 ports (e.g. 10500-10509) and associate each port with a worker.

When a worker starts up, it is associated with one of Traefik's open ports and receives the traffic forwarded from that port; when it shuts down, it is un-associated from that port. A new worker could be associated with any free port, even a port previously associated with a different worker.
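The port association described above can be modeled with a small sketch. `PortPool` and its methods are hypothetical names for illustration only, not part of Traefik or sccache, and picking the lowest free port is an assumption made so the example is deterministic (the real gateway may pick any free port).

```rust
use std::collections::BTreeMap;
use std::ops::RangeInclusive;

/// Hypothetical model of the gateway's fixed port pool.
struct PortPool {
    /// port -> worker currently associated with it (`None` means free)
    slots: BTreeMap<u16, Option<String>>,
}

impl PortPool {
    fn new(ports: RangeInclusive<u16>) -> Self {
        Self { slots: ports.map(|p| (p, None)).collect() }
    }

    /// Associate a starting worker with the lowest free port, if any is left.
    fn attach(&mut self, worker: &str) -> Option<u16> {
        let (port, slot) = self.slots.iter_mut().find(|(_, w)| w.is_none())?;
        *slot = Some(worker.to_string());
        Some(*port)
    }

    /// Un-associate a worker that is shutting down, freeing its port for reuse.
    fn detach(&mut self, worker: &str) {
        for slot in self.slots.values_mut() {
            if slot.as_deref() == Some(worker) {
                *slot = None;
            }
        }
    }
}
```

With a pool of `10500..=10509`, attaching workers A and B, detaching A, and then attaching C hands port 10500 to C, which is the address reuse that matters for certificate handling.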

Note: While this PR isn't related, this description assumes sccache has been compiled with the changes in #1922, as that's necessary for the workers to report the public_url of the API Gateway instead of their private VPC address.

Certificate handling for server scale in and out

When the server cluster goes through a cycle of scaling out, in, then out again, the new servers may be available at addresses that were previously associated with an old server. This presents a challenge for certificate handling, because the client and scheduler may have cached certificates for the initial instance, and those certs are not valid for communicating with the new instance:

# scale out to two instances:
127.0.0.1:10500 - server A
127.0.0.1:10501 - server B

# scale in to one instance:
127.0.0.1:10501 - server B

# scale out to three instances:
127.0.0.1:10500 - server C
127.0.0.1:10501 - server B
127.0.0.1:10502 - server D

In the initial state, the client and scheduler cached certificates for servers A and B. After scaling in and out again, the client and scheduler attempt and fail to use the certificates generated by server A to communicate with server C. I believe this is because the certificates for A and C both embed 127.0.0.1:10500 as their SubjectAlternativeName, and this confuses reqwest.

df2e4a1 updates the client to track certificates by server_id like the server does, and updates both the client and scheduler to remove the old certificate from the certs map before adding the existing certs to the reqwest client builder.
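A minimal sketch of that idea, assuming certificates are kept as raw bytes in a map keyed by a server id string: `CertStore`, `update`, and `trusted` are illustrative names, not the types used in the commit.

```rust
use std::collections::HashMap;

/// Hypothetical certificate cache keyed by server id, so a new server that
/// reuses an old address replaces the stale entry instead of sitting beside it.
#[derive(Default)]
struct CertStore {
    certs: HashMap<String, Vec<u8>>,
}

impl CertStore {
    /// Record the latest certificate for `server_id`.
    /// Returns `true` when the HTTP client must be rebuilt, i.e. the
    /// certificate is new or differs from the cached one.
    fn update(&mut self, server_id: &str, cert: &[u8]) -> bool {
        if self.certs.get(server_id).map(|c| c.as_slice()) == Some(cert) {
            return false; // unchanged, keep the existing client
        }
        // Drop the stale certificate first so it is never trusted again.
        self.certs.remove(server_id);
        self.certs.insert(server_id.to_string(), cert.to_vec());
        true
    }

    /// The certificates a freshly built client should trust.
    fn trusted(&self) -> Vec<&[u8]> {
        self.certs.values().map(|c| c.as_slice()).collect()
    }
}
```

In a real client, each entry returned by `trusted` would be added to a new reqwest client builder as a root certificate whenever `update` returns `true`.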

Scheduler job allocation resiliency

There's a delay between when servers scale in and when the scheduler prunes them from the list of active servers. During this time, the scheduler may attempt to allocate jobs to these servers. When this fails, the current behavior is to return an error to the client, which then runs a local compile.

This is sub-optimal for an autoscaling strategy: when jobs are rejected, the additional work sent back to the client isn't captured by the autoscaler.

For example, if the autoscaler scales in from 64 to 32 CPUs, and in the meantime the scheduler rejects the next 32 jobs so that they compile locally, the autoscaler believes it is in a steady state rather than recognizing there are 64 units of work to handle.

At best, this leads to delays in scaling up, and at worst it can cause the autoscaler to believe it can continue to scale down.

The best solution is for the scheduler to handle the alloc_job failure and attempt to allocate to the next-best server candidate, until either the job is allocated or the candidate list is exhausted. This ensures the autoscaler will see the existing instances get busier, and stop scaling in/start scaling out again.
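The fallback loop described above could look roughly like the following sketch. `alloc_with_fallback` and the `try_assign` callback are hypothetical stand-ins for the scheduler's candidate ranking and its assign_job request, not the code in fe83892.

```rust
/// Try each candidate server in order (best first) and return the first one
/// that accepts the job. If every candidate fails, return all collected errors
/// so the caller can report them to the client.
fn alloc_with_fallback<F>(candidates: &[&str], mut try_assign: F) -> Result<String, Vec<String>>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut errors = Vec::new();
    for &server in candidates {
        match try_assign(server) {
            // The job was allocated: stop here.
            Ok(()) => return Ok(server.to_string()),
            // This server is probably gone: remember why and try the next one.
            Err(e) => errors.push(format!("{server}: {e}")),
        }
    }
    // Candidate list exhausted without a successful allocation.
    Err(errors)
}
```

Only when the whole candidate list is exhausted does the client see an error, so work keeps landing on the surviving servers where the autoscaler can observe it.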

Example of starting a cluster with 3 initial workers, scaling down to 1, then running a distributed compile before the scheduler has pruned the dead servers:
$ docker compose up -d --scale worker=3
# ... wait till cluster is up
$ docker compose up -d --scale worker=1
$ sccache ...
[INFO  sccache::dist::http::server] Scheduler listening for clients on 0.0.0.0:80
[INFO  sccache::dist::http::server] Adding new certificate for 172.18.0.2:10500 to scheduler
[INFO  sccache_dist] Registered new server ServerId(172.18.0.2:10500)
[INFO  sccache::dist::http::server] Adding new certificate for 172.18.0.2:10501 to scheduler
[INFO  sccache_dist] Registered new server ServerId(172.18.0.2:10501)
[INFO  sccache::dist::http::server] Adding new certificate for 172.18.0.2:10502 to scheduler
[INFO  sccache_dist] Registered new server ServerId(172.18.0.2:10502)
[WARN  sccache_dist] [alloc_job(0)]: POST to scheduler assign_job failed, caused by: error sending request for url (https://172.18.0.2:10502/api/v1/distserver/assign_job/0), caused by: client error (Connect), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091: (self-signed certificate), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091:
[INFO  sccache_dist] [alloc_job(0)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[WARN  sccache_dist] [alloc_job(1)]: POST to scheduler assign_job failed, caused by: error sending request for url (https://172.18.0.2:10500/api/v1/distserver/assign_job/1), caused by: client error (Connect), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091: (self-signed certificate), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091:
[INFO  sccache_dist] [alloc_job(2)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[INFO  sccache_dist] [alloc_job(1)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[INFO  sccache_dist] [alloc_job(3)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[INFO  sccache_dist] [update_job_state(0, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(2, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(1, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(3, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(0, 172.18.0.2:10501)]: Job state updated from Started to Complete
[INFO  sccache_dist] [update_job_state(2, 172.18.0.2:10501)]: Job state updated from Started to Complete
[INFO  sccache_dist] [update_job_state(1, 172.18.0.2:10501)]: Job state updated from Started to Complete
[INFO  sccache_dist] [update_job_state(3, 172.18.0.2:10501)]: Job state updated from Started to Complete
[WARN  sccache_dist] Server 172.18.0.2:10500 appears to be dead, pruning it in the scheduler
[WARN  sccache_dist] Server 172.18.0.2:10502 appears to be dead, pruning it in the scheduler

Client job execution resiliency

It's also possible for a server to be taken offline while it's running jobs for clients. In this scenario the scheduler alloc_job succeeds while the worker is still alive, but the worker is destroyed while the client is waiting on the run_job response.

To avoid the expensive local compilation, the client should handle the failure and retry the job on a new server assigned by the scheduler. When combined with the feature described in the previous section, the scheduler should reallocate the job to a live server.

Example client logs when worker shuts down during run_job, and client retries:
$ docker compose up -d --scale worker=10
# ... wait till cluster is up
$ SCCACHE_DIST_RETRY_LIMIT=5 sccache ...
$ docker compose up -d --scale worker=1
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Run distributed compilation (attempt 1 of 6)
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Creating distributed compile request
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Identifying dist toolchain for "/usr/local/cuda/bin/../nvvm/bin/cicc"
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Requesting allocation
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Successfully allocated job 2
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Running job 2 on server 172.18.0.2:10504
[WARN  sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Error running distributed compilation (attempt 1 of 6), retrying. Could not run distributed compilation job on 172.18.0.2:10504: error sending request for url (https://172.18.0.2:10504/api/v1/distserver/run_job/2): client error (SendRequest): connection closed before message completed
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Run distributed compilation (attempt 2 of 6)
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Creating distributed compile request
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Identifying dist toolchain for "/usr/local/cuda/bin/../nvvm/bin/cicc"
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Requesting allocation
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Successfully allocated job 21
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Running job 21 on server 172.18.0.2:10500
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Fetched [("/tmp/sccache_nvcc5K0cEY/0.cudafe1.c", "Size: 209->174"), ("/tmp/sccache_nvcc5K0cEY/1.cudafe1.stub.c", "Size: 1429->635"), ("/tmp/sccache_nvcc5K0cEY/simpleP2P.compute_50.cudafe1.gpu", "Size: 25849->3584"), ("/tmp/sccache_nvcc5K0cEY/simpleP2P.compute_50.ptx", "Size: 874->445")]
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Compiled in 3.963 s, storing in cache
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Created cache artifact in 0.006 s
[DEBUG sccache::server] [simpleP2P.compute_50.ptx]: compile result: cache miss
[DEBUG sccache::server] [simpleP2P.compute_50.ptx]: CompileFinished retcode: exit status: 0
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Stored in cache successfully!
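The retry behavior visible in these logs can be sketched as a bounded loop. `run_with_retries` is a hypothetical helper, not the client's actual code; the only detail taken from the logs is that a retry limit of 5 yields up to 6 attempts.

```rust
/// Call `attempt` up to `retry_limit + 1` times, passing the 1-based attempt
/// number. Returns the first success, or the last error once the attempts are
/// exhausted, at which point the caller would fall back to local compilation.
fn run_with_retries<T, E, F>(retry_limit: u32, mut attempt: F) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let max_attempts = retry_limit.saturating_add(1);
    let mut n = 1;
    loop {
        match attempt(n) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the final error to the caller.
            Err(e) if n >= max_attempts => return Err(e),
            // Otherwise request a fresh allocation and try again.
            Err(_) => n += 1,
        }
    }
}
```

Each attempt is expected to go back through allocation, which is why the retried job in the logs has a new job id and a different server.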

@codecov-commenter

codecov-commenter commented Oct 25, 2024

Codecov Report

Attention: Patch coverage is 32.96089% with 120 lines in your changes missing coverage. Please review.

Project coverage is 40.78%. Comparing base (0cc0c62) to head (5f1d50e).
Report is 122 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/compiler/compiler.rs | 38.51% | 28 Missing and 55 partials ⚠️ |
| src/dist/http.rs | 0.00% | 21 Missing ⚠️ |
| src/server.rs | 0.00% | 4 Missing and 7 partials ⚠️ |
| src/compiler/rust.rs | 0.00% | 3 Missing ⚠️ |
| src/util.rs | 50.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2277      +/-   ##
==========================================
+ Coverage   30.91%   40.78%   +9.87%     
==========================================
  Files          53       55       +2     
  Lines       20112    20978     +866     
  Branches     9755     9677      -78     
==========================================
+ Hits         6217     8556    +2339     
- Misses       7922     8247     +325     
+ Partials     5973     4175    -1798     


…HE_DIST_REQUEST_TIMEOUT` seconds are stale and should be removed
@sylvestre sylvestre requested a review from glandium November 8, 2024 10:50
@trxcllnt trxcllnt changed the title Ensure client and scheduler are resilient to server autoscaling [do not merge] Ensure client and scheduler are resilient to server autoscaling Nov 8, 2024
@trxcllnt
Contributor Author

Closing this, as the root issue is much larger in scope than this PR originally intended.

@trxcllnt trxcllnt closed this Dec 21, 2024
@trxcllnt trxcllnt deleted the fea/autoscaling branch December 21, 2024 19:04