Releases: pytorch/test-infra
Runner lambdas v20221124-080939
Fixes to correctly assign the security group to the EC2 runner (#1144). AWS security groups need to be assigned to the correct VPC. With the new multi-VPC design, multiple security groups have to be managed, following the hierarchical relationship Region -> VPCs -> (subnet, sg).
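The Region -> VPCs -> (subnet, sg) hierarchy described above can be illustrated with a small lookup. This is only a sketch under assumptions: the `VpcConfig` and `RegionConfig` shapes and the `securityGroupsForSubnet` helper are hypothetical names invented for illustration, not the repository's actual configuration.

```typescript
// Hypothetical configuration shape: each region holds several VPCs, and each
// VPC owns its own subnets and security groups.
interface VpcConfig {
  vpcId: string;
  subnetIds: string[];
  securityGroupIds: string[];
}

// Region name -> VPCs available in that region.
type RegionConfig = Record<string, VpcConfig[]>;

// Return the security groups of the VPC that owns the given subnet, so that a
// runner launched in that subnet never receives a group from another VPC.
function securityGroupsForSubnet(
  config: RegionConfig,
  region: string,
  subnetId: string,
): string[] {
  const vpcs = config[region] ?? [];
  const vpc = vpcs.find((v) => v.subnetIds.includes(subnetId));
  if (vpc === undefined) {
    throw new Error(`No VPC in ${region} contains subnet ${subnetId}`);
  }
  return vpc.securityGroupIds;
}
```

Resolving the security group through the owning VPC, rather than from a single global list, is one way to keep the subnet and the group consistent when a region has more than one VPC.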
Runner lambdas v20221118-104528
Jeanschmidt - Add partially multi-region logic && separate VPCs for c…
Runner lambdas v20221102-115928
rewrite metrics CW to leverage dimensions and be compatible with meta…
Runner lambdas v20221101-172204
reference version over lambda alias (#995)
Runner lambdas v20221031-105322
limit cloudwatch metrics for linux disk to /, other mount points are …
Runner lambdas v20221025-105328
GHA runners - Separate AMI owner filters for linux and windows instan…
Runner lambdas v20221021-231347
Fix runaway runner deletion on scale-down when API quota is hit (#938). Rethrow the octokit 'API rate limit exceeded' errors when fetching runner info on `scale-down` instead of swallowing them. Please see SEV pytorch/pytorch#87500 for details. Note: I've decided on the least invasive implementation (re-throwing only a very specific class of exceptions) to avoid unintended side effects. Need to discuss with @jeanschmidt whether all exceptions could be safely rethrown. Testing: unit tests.
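The selective rethrow described above might look roughly like the following. This is a minimal sketch, not the code from #938: `isRateLimitError` and `getRunnerInfoOrUndefined` are hypothetical names, and matching on the error message text is an assumption about how such errors could be recognized.

```typescript
// Assumption: a rate-limit failure can be recognized by its message text.
function isRateLimitError(err: unknown): boolean {
  return err instanceof Error && err.message.includes("API rate limit exceeded");
}

// Hypothetical wrapper around a runner-info fetch. Rate-limit errors propagate
// so the caller can abort the scale-down pass; any other failure is treated as
// "no info available", which is the least invasive behavior change.
async function getRunnerInfoOrUndefined<T>(
  fetchInfo: () => Promise<T>,
): Promise<T | undefined> {
  try {
    return await fetchInfo();
  } catch (err) {
    if (isRateLimitError(err)) {
      throw err; // do not mistake "quota exhausted" for "runner not found"
    }
    return undefined;
  }
}
```

The point of the narrow predicate is that a missing answer caused by quota exhaustion is no longer indistinguishable from a runner that really is absent.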
Runner lambdas v20221019-100329
FIX: Don't remove EC2 instance when fails to remove githubRunner (#904)

`removeGithubRunner[Org || Repo]` used to remove the EC2 instance itself, so there was no need to call `terminateRunner` again. This could cause runners that failed to be unregistered from GHA to be terminated on EC2.

As a fix, `removeGithubRunner` no longer terminates the instance or generates logs. This lets `scaleDown` control when to call `terminateRunner` and generate the proper logs and metrics, which avoids this issue in the future.

This bug also explains why we previously saw more EC2 instances being kept for their minimum time: instances running for less than the minimum time were unregistered and terminated without being tracked in the main application metric. This is evident when comparing the number of terminate API calls with the count of application-level terminations.

![Screenshot 2022-10-18 at 09 21 47](https://user-images.githubusercontent.com/4520845/196364535-5aaab331-2080-44be-b6af-0702f99d50d9.png) ![Screenshot 2022-10-18 at 09 26 19](https://user-images.githubusercontent.com/4520845/196364542-376ff99f-617e-4e82-b459-dfc8364219ad.png)

Bug initially flagged on [87134](https://github.com/pytorch/pytorch/issues/87134)
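The split of responsibilities described in this fix can be sketched as follows. The `RunnerOps` interface and the `scaleDownRunner` function are hypothetical and synchronous for brevity (the real operations are presumably asynchronous); only the names `removeGithubRunner` and `terminateRunner` come from the release note, and their signatures here are assumptions.

```typescript
// Hypothetical, simplified operations. removeGithubRunner only unregisters the
// runner from GHA and reports whether that succeeded; it never touches EC2.
interface RunnerOps {
  removeGithubRunner(runnerId: string): boolean;
  terminateRunner(instanceId: string): void;
}

type ScaleDownOutcome = "terminated" | "kept";

// The scale-down step decides whether to terminate: the EC2 instance is only
// terminated after the runner was successfully unregistered from GHA.
function scaleDownRunner(
  runnerId: string,
  instanceId: string,
  ops: RunnerOps,
): ScaleDownOutcome {
  const unregistered = ops.removeGithubRunner(runnerId);
  if (!unregistered) {
    return "kept"; // leave the instance running; nothing is terminated
  }
  ops.terminateRunner(instanceId);
  return "terminated";
}
```

Keeping termination in a single place also gives one natural spot to emit the logs and metrics the note mentions.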
Runner lambdas v20221017-084425
FIX: add back metrics runnerLessMinimumTime and runnerFound to scaleD…
Runner lambdas v20221012-113302
Rewrite scaleDown to fix a series of bugs (#864)

On scaleDown:
- [FIX] Runners are now guaranteed to be stopped from the oldest to the newest, which avoids keeping runners for too long;
- [FIX] Fixed a bug where a runner could be removed from GHA but kept running on AWS;
- [IMPROVED] Try to always keep a minimum of `minAvailableRunners` runners free;
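The oldest-first ordering and the `minAvailableRunners` floor can be sketched as a pure selection function. This is an illustration under assumptions: the `Runner` shape, the `launchTime` and `busy` fields, and `pickRunnersToRemove` are hypothetical and do not reflect the actual scaleDown implementation.

```typescript
// Hypothetical runner record; launchTime is a timestamp in milliseconds.
interface Runner {
  id: string;
  launchTime: number;
  busy: boolean;
}

// Choose which idle runners to remove: oldest first, while always leaving at
// least minAvailableRunners idle runners untouched.
function pickRunnersToRemove(runners: Runner[], minAvailableRunners: number): Runner[] {
  const idleOldestFirst = runners
    .filter((r) => !r.busy) // filter() copies, so the input is not reordered
    .sort((a, b) => a.launchTime - b.launchTime);
  const removableCount = Math.max(0, idleOldestFirst.length - minAvailableRunners);
  return idleOldestFirst.slice(0, removableCount);
}
```

For example, with three idle runners launched at times 10, 20 and 30 and `minAvailableRunners` set to 1, the two oldest are selected and the newest stays free.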