-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRA: add canary jobs and sync mechanism #33993
Open
pohly
wants to merge
7
commits into
kubernetes:master
Choose a base branch
from
pohly:dra-job-canary
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+751
−383
Open
Changes from all commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
ee39d1d
dra: move all jobs into a single file
pohly 449711e
dra: add canary jobs
pohly b4f79a9
dra canary: fix comment
pohly 056f34d
dra: add shell script for applying canary changes
pohly c82feb7
dra: apply changes from canary jobs
k8s-ci-robot 71e5959
node: add YAML test
pohly 3942639
dra: use YAML anchors and aliases to avoid duplication
pohly File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
# Last commit which was synced into dynamic-resource-allocation.yaml. | ||
last_sync=056f34d0fffe6816d4f4ef7acd2e14811ff5c67a |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
#!/usr/bin/env bash | ||
# Copyright 2024 The Kubernetes Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# Running this script will automatically take changes made to | ||
# dynamic-resource-allocation-canary.yaml since the last sync | ||
# (tracked in .dra-sync-settings) and create a commit which | ||
# applies those changes to dynamic-resource-allocation.yaml. | ||
|
||
set -o errexit | ||
set -o nounset | ||
set -o pipefail | ||
|
||
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../../.." && pwd -P)" | ||
cd "${REPO_ROOT}" | ||
|
||
# get "last_sync" | ||
source "config/jobs/kubernetes/sig-node/.dra-sync-settings" | ||
|
||
if [ -n "$(git diff --cached 2>&1)" ]; then | ||
echo >&2 "ERROR: The git staging area must be clean." | ||
exit 1 | ||
fi | ||
|
||
new_sync=$(git rev-parse HEAD) | ||
|
||
diff=$(git diff ${last_sync}..${new_sync} config/jobs/kubernetes/sig-node/dynamic-resource-allocation-canary.yaml | sed -e 's/-canary//g') | ||
|
||
if [ -z "${diff}" ]; then | ||
echo "No changes since last sync, nothing to do." | ||
exit 0 | ||
fi | ||
|
||
# Generate a "git format-patch" alike patch and apply it. | ||
git am <<EOF | ||
From ${new_sync} | ||
From: dra-sync.sh helper script <[email protected]> | ||
Date: $(date --rfc-email) | ||
Subject: [PATCH 1/1] dra: apply changes from canary jobs | ||
|
||
--- | ||
${diff} | ||
$(diff -u config/jobs/kubernetes/sig-node/.dra-sync-settings <(sed -e "s/last_sync=.*/last_sync=${new_sync}/" config/jobs/kubernetes/sig-node/.dra-sync-settings) | sed -e 's;^--- .*;--- a/config/jobs/kubernetes/sig-node/.dra-sync-settings;' -e 's;+++ .*;+++ b/config/jobs/kubernetes/sig-node/.dra-sync-settings;') | ||
EOF | ||
|
||
git log -p -n1 | ||
|
||
cat <<EOF | ||
|
||
Commit created successfully (see above). Review and submit as a pull request. | ||
EOF |
320 changes: 320 additions & 0 deletions
320
config/jobs/kubernetes/sig-node/dynamic-resource-allocation-canary.yaml
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,320 @@ | ||
# This file contains canary periodic and presubmit jobs which run tests covering | ||
# Dynamic Resource Allocation (DRA). | ||
# | ||
# The intent is to make all changes to DRA jobs first for these canary jobs, | ||
# then copy the changes into dynamic-resource-allocation.yaml. Unless some | ||
# experimental changes are being tested, a diff between the two files should | ||
# be limited to: | ||
# - this comment | ||
# - job names | ||
# - interval for the periodic jobs | ||
# | ||
# This command can be used to check this: | ||
# diff dynamic-resource-allocation.yaml <(sed -e 's/-canary//' dynamic-resource-allocation-canary.yaml) | ||
|
||
|
||
# `templates` has no special meaning. It just holds YAML anchors (= re-usable content) | ||
# that get referenced below via YAML aliases: `<<: *job` includes the content | ||
# and then allows adding or overwriting fields. This is done at the root. If a field | ||
# contains lists or objects, that content gets replaced instead of merged. | ||
# | ||
# Lists cannot be extended the same way (https://github.com/yaml/yaml/issues/35). | ||
# | ||
# If unsure what the expanded jobs look like or to test parsing, run `go test -v .` | ||
templates: | ||
- &job | ||
cluster: eks-prow-build-cluster | ||
annotations: &annotations | ||
testgrid-dashboards: sig-node-dynamic-resource-allocation | ||
# Alerting is enabled also for PRs. If someone repeatedly tests a broken | ||
# PR where the change introduced by the PR breaks tests, an alert will | ||
# eventually be triggered. This is a good thing because then we can | ||
# show the PR author how to do local testing... | ||
# | ||
# Disabled for canary jobs, enabled for real jobs. | ||
testgrid-alert-email: # [email protected],[email protected] # #wg-device-management on Slack | ||
fork-per-release: "false" # Only for canary jobs, must be true for real jobs. | ||
decorate: true | ||
|
||
- &kubernetes-master | ||
org: kubernetes | ||
repo: kubernetes | ||
base_ref: master | ||
path_alias: k8s.io/kubernetes | ||
|
||
- &test-infra-master | ||
org: kubernetes | ||
repo: test-infra | ||
base_ref: master | ||
path_alias: k8s.io/test-infra | ||
|
||
- &periodic-job | ||
interval: 1000000h # Run only once on creation and when manually triggered. | ||
extra_refs: | ||
- *kubernetes-master | ||
|
||
- &periodic-node-job | ||
<<: *periodic-job | ||
extra_refs: | ||
- *kubernetes-master | ||
- *test-infra-master # For test-infra/jobs/e2e_node files. | ||
|
||
- &presubmit-job | ||
skip_branches: | ||
- release-\d+\.\d+ # per-release image | ||
always_run: false | ||
optional: true | ||
skip_report: false | ||
# run_if_changed is set judiciously on some jobs. | ||
|
||
- &e2e-kind-job | ||
<<: *job | ||
decoration_config: | ||
timeout: 3h | ||
labels: | ||
preset-service-account: "true" | ||
preset-dind-enabled: "true" | ||
preset-kind-volume-mounts: "true" | ||
decoration_config: | ||
timeout: 3h | ||
spec: | ||
containers: | ||
- &e2e-kind-container | ||
image: gcr.io/k8s-staging-test-infra/kubekins-e2e:v20241128-8df65c072f-master | ||
command: | ||
- runner.sh | ||
- /bin/sh | ||
- -xce | ||
- | | ||
make WHAT="github.com/onsi/ginkgo/v2/ginkgo k8s.io/kubernetes/test/e2e/e2e.test" | ||
curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz - -C "${PATH%%:*}/" kind | ||
kind build node-image --image=dra/node:latest . | ||
trap 'kind export logs "${ARTIFACTS}/kind"; kind delete cluster' EXIT | ||
if ${with_all_features:-false}; then | ||
# Which DRA features exist can change over time. | ||
features=( $(grep '"DRA' pkg/features/kube_features.go | sed 's/.*"\(.*\)"/\1/') ) | ||
echo "Enabling DRA feature(s): ${features[*]}." | ||
# Those additional features are not in kind.yaml, but they can be added at the end. | ||
kind create cluster --retain --config <(cat test/e2e/dra/kind.yaml; for feature in ${features}; do echo " ${feature}: true"; done) --image dra/node:latest | ||
KUBERNETES_PROVIDER=local KUBECONFIG=${HOME}/.kube/config GINKGO_PARALLEL_NODES=8 E2E_REPORT_DIR=${ARTIFACTS} GINKGO_TIMEOUT=1h hack/ginkgo-e2e.sh -ginkgo.label-filter="Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Alpha, Beta, DynamicResourceAllocation$(for feature in ${features}; do echo , ${feature}; done)} && !Flaky && !Slow" | ||
else | ||
kind create cluster --retain --config test/e2e/dra/kind.yaml --image dra/node:latest | ||
KUBERNETES_PROVIDER=local KUBECONFIG=${HOME}/.kube/config GINKGO_PARALLEL_NODES=8 E2E_REPORT_DIR=${ARTIFACTS} GINKGO_TIMEOUT=2h30m hack/ginkgo-e2e.sh -ginkgo.label-filter='Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Beta, DynamicResourceAllocation } && !Flaky' | ||
fi | ||
|
||
# docker-in-docker needs privileged mode | ||
securityContext: | ||
privileged: true | ||
resources: | ||
limits: | ||
cpu: 2 | ||
memory: 9Gi | ||
requests: | ||
cpu: 2 | ||
memory: 9Gi | ||
|
||
- &e2e-node-job | ||
<<: *job | ||
labels: | ||
preset-service-account: "true" | ||
preset-k8s-ssh: "true" | ||
decoration_config: | ||
timeout: 90m | ||
spec: | ||
containers: | ||
- &e2e-node-container | ||
image: gcr.io/k8s-staging-test-infra/kubekins-e2e:v20241128-8df65c072f-master | ||
command: | ||
- /bin/bash | ||
- -ce | ||
- | | ||
export IGNITION_INJECT_GCE_SSH_PUBLIC_KEY_FILE=1 | ||
export GOPATH=/go | ||
|
||
# Compose the command depending on the `runtime` variable set for the job. | ||
cmd=( | ||
runner.sh | ||
/workspace/scenarios/kubernetes_e2e.py | ||
--deployment=node | ||
--env=KUBE_SSH_USER=core | ||
--gcp-zone=us-west1-b | ||
--node-tests=true | ||
--provider=gce | ||
'--test_args=--timeout=1h --label-filter="Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Beta, DynamicResourceAllocation } && !Flaky"' | ||
--timeout=65m | ||
) | ||
case ${runtime:-containerd} in | ||
containerd) | ||
cmd+=( | ||
'--node-test-args=--feature-gates=DynamicResourceAllocation=true --service-feature-gates=DynamicResourceAllocation=true --runtime-config=api/beta=true --container-runtime-endpoint=unix:///run/containerd/containerd.sock --container-runtime-process-name=/usr/bin/containerd --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/containerd.service" --extra-log="{\"name\": \"containerd.log\", \"journalctl\": [\"-u\", \"containerd\"]}"' | ||
--node-args=--image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/dra/image-config-containerd-1.7.yaml | ||
) | ||
;; | ||
crio-cgroupv1) | ||
cmd+=( | ||
'--node-test-args=--feature-gates=DynamicResourceAllocation=true --service-feature-gates=DynamicResourceAllocation=true --runtime-config=api/beta=true --container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}"' | ||
--node-args=--image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/crio/latest/image-config-cgroupv1-serial.yaml | ||
) | ||
;; | ||
crio-cgroupv2) | ||
cmd+=( | ||
'--node-test-args=--feature-gates=DynamicResourceAllocation=true --service-feature-gates=DynamicResourceAllocation=true --runtime-config=api/beta=true --container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}"' | ||
--node-args=--image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/crio/latest/image-config-cgroupv2-serial.yaml | ||
) | ||
;; | ||
esac | ||
|
||
# Run it. | ||
set -x | ||
${cmd[@]} | ||
|
||
resources: | ||
limits: | ||
cpu: 2 | ||
memory: 9Gi | ||
requests: | ||
cpu: 2 | ||
memory: 9Gi | ||
|
||
periodics: | ||
|
||
- <<: *e2e-kind-job | ||
<<: *periodic-job | ||
name: ci-kind-dra-canary | ||
annotations: | ||
<<: *annotations | ||
testgrid-tab-name: ci-kind-dra-canary | ||
description: Runs E2E tests for Dynamic Resource Allocation beta features against a Kubernetes master cluster created with sigs.k8s.io/kind | ||
|
||
- <<: *e2e-kind-job | ||
<<: *periodic-job | ||
name: ci-kind-dra-all-canary | ||
annotations: | ||
<<: *annotations | ||
testgrid-tab-name: ci-kind-dra-all-canary | ||
description: Runs E2E tests for Dynamic Resource Allocation alpha and beta features against a Kubernetes master cluster created with sigs.k8s.io/kind | ||
spec: | ||
containers: | ||
- <<: *e2e-kind-container | ||
env: | ||
- name: with_all_features | ||
value: "true" | ||
|
||
- <<: *e2e-node-job | ||
<<: *periodic-node-job | ||
name: ci-node-e2e-cgrpv1-crio-dra-canary | ||
annotations: | ||
<<: *annotations | ||
testgrid-dashboards: sig-node-cri-o, sig-node-dynamic-resource-allocation | ||
testgrid-tab-name: ci-node-e2e-cgrpv1-crio-dra-canary | ||
description: Runs E2E node tests for Dynamic Resource Allocation beta features with CRI-O using cgroup v1 | ||
spec: | ||
containers: | ||
- <<: *e2e-node-container | ||
env: | ||
- name: runtime | ||
value: crio-cgroupv1 | ||
|
||
- <<: *e2e-node-job | ||
<<: *periodic-node-job | ||
name: ci-node-e2e-cgrpv2-crio-dra-canary | ||
annotations: | ||
<<: *annotations | ||
testgrid-dashboards: sig-node-cri-o, sig-node-dynamic-resource-allocation | ||
description: Runs E2E node tests for Dynamic Resource Allocation beta features with CRI-O using cgroup v2 | ||
testgrid-tab-name: ci-node-e2e-cgrpv2-crio-dra-canary | ||
spec: | ||
containers: | ||
- <<: *e2e-node-container | ||
env: | ||
- name: runtime | ||
value: crio-cgroupv1 | ||
|
||
- <<: *e2e-node-job | ||
<<: *periodic-node-job | ||
name: ci-node-e2e-containerd-dra-canary | ||
annotations: | ||
<<: *annotations | ||
description: Runs E2E node tests for Dynamic Resource Allocation beta features with CRI-O using cgroup v1 | ||
testgrid-tab-name: ci-node-e2e-containerd-dra-canary | ||
spec: | ||
containers: | ||
- <<: *e2e-node-container | ||
env: | ||
- name: runtime | ||
value: containerd | ||
|
||
presubmits: | ||
kubernetes/kubernetes: | ||
|
||
- <<: *e2e-kind-job | ||
<<: *presubmit-job | ||
name: pull-kubernetes-kind-dra-canary | ||
annotations: | ||
<<: *annotations | ||
testgrid-tab-name: pull-kubernetes-kind-dra-canary | ||
description: Runs E2E tests for Dynamic Resource Allocation beta features against a Kubernetes master cluster created with sigs.k8s.io/kind | ||
run_if_changed: &e2e-run-if-changed-e2e /(dra|dynamicresources|resourceclaim|deviceclass|resourceslice|resourceclaimtemplate|dynamic-resource-allocation|pkg/apis/resource|api/resource)/.*.go | ||
|
||
- <<: *e2e-kind-job | ||
<<: *presubmit-job | ||
name: pull-kubernetes-kind-dra-all-canary | ||
annotations: | ||
<<: *annotations | ||
testgrid-tab-name: pull-kubernetes-kind-dra-all-canary | ||
description: Runs E2E tests for Dynamic Resource Allocation alpha and beta features against a Kubernetes master cluster created with sigs.k8s.io/kind | ||
run_if_changed: *e2e-run-if-changed-e2e | ||
spec: | ||
containers: | ||
- <<: *e2e-kind-container | ||
env: | ||
- name: with_all_features | ||
value: "true" | ||
|
||
- <<: *e2e-node-job | ||
<<: *presubmit-job | ||
name: pull-kubernetes-node-e2e-cgrpv1-crio-dra-canary | ||
annotations: | ||
<<: *annotations | ||
testgrid-dashboards: sig-node-cri-o, sig-node-dynamic-resource-allocation | ||
testgrid-tab-name: pull-kubernetes-node-e2e-cgrpv1-crio-dra-canary | ||
description: Runs E2E node tests for Dynamic Resource Allocation beta features with CRI-O using cgroup v1 | ||
# Automatically testing with one container runtime in one configuration is sufficient to detect basic problems in kubelet early. | ||
# CRI-O was picked because it was solid for testing so far. | ||
# Periodic variant: ci-node-e2e-crio-cgrpv1-dra | ||
run_if_changed: (/dra/|/dynamicresources/|/resourceclaim/|/deviceclass/|/resourceslice/|/resourceclaimtemplate/|/dynamic-resource-allocation/|/pkg/apis/resource/|/api/resource/|/test/e2e_node/dra_).*\.(go|yaml) | ||
spec: | ||
containers: | ||
- <<: *e2e-node-container | ||
env: | ||
- name: runtime | ||
value: crio-cgroupv1 | ||
|
||
- <<: *e2e-node-job | ||
<<: *presubmit-job | ||
name: pull-kubernetes-node-e2e-cgrpv2-crio-dra-canary | ||
annotations: | ||
<<: *annotations | ||
testgrid-dashboards: sig-node-cri-o, sig-node-dynamic-resource-allocation | ||
description: Runs E2E node tests for Dynamic Resource Allocation beta features with CRI-O using cgroup v2 | ||
testgrid-tab-name: pull-kubernetes-node-e2e-cgrpv2-crio-dra-canary | ||
spec: | ||
containers: | ||
- <<: *e2e-node-container | ||
env: | ||
- name: runtime | ||
value: crio-cgroupv2 | ||
|
||
- <<: *e2e-node-job | ||
<<: *presubmit-job | ||
name: pull-kubernetes-node-e2e-containerd-dra-canary | ||
annotations: | ||
<<: *annotations | ||
description: Runs E2E node tests for Dynamic Resource Allocation beta features with containerd | ||
testgrid-tab-name: pull-kubernetes-node-e2e-containerd-dra-canary | ||
spec: | ||
containers: | ||
- <<: *e2e-node-container | ||
env: | ||
- name: runtime | ||
value: containerd |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kannon92 @bart0sh: I went ahead and rewrote this file so that it avoids duplication between the different jobs. From 3942639:
What do you think?
I know @bart0sh pointed out the difficulty of keeping periodics and presubmits in sync. This approach solves that, but is it readable and usable enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some test failures that would need further work. But let's first hear whether this is worth it at all.
Perhaps there is some other templating mechanism that can be used instead to generate the YAML files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks better than what we currently have.
However, I'd prefer to use existing solution suggested by @dims on the slack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pohly It would be even better to avoid having a lot of configuration details in the generation script (
https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/kops/build_jobs.py
) and have a yaml file with a set of values, templates for various job types and a generic build_jobs script that generates result by rendering templates based on values.yaml. I can play with this if you don't have time.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this PR is going down the wrong track. Instead of two files which we need to keep in sync, we should have a single
dynamic-resource-allocation.yaml
which is fully generated. That file should also include canary presubmits where we can try out experimental changes before changing the generator to apply them also to the normal jobs.We don't need canary periodic jobs, for two reasons:
interval
vs.branch
settings, then testing the job spec with a canary presubmit is sufficient.Thanks, please do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pohly this is what I came up with so far, PTAL: #34010