Add DRA driver for IMEX #1143

Draft: wants to merge 3 commits into base: main
26 changes: 13 additions & 13 deletions api/nvidia/v1/clusterpolicy_types.go
Original file line number Diff line number Diff line change
@@ -54,7 +54,7 @@ type ClusterPolicySpec struct {
 	// DevicePlugin component spec
 	DevicePlugin DevicePluginSpec `json:"devicePlugin"`
 	// DRADriver component spec
-	DRADriver DRADriverSpec `json:"draDriver"`
+	IMEXDRADriver IMEXDRADriverSpec `json:"imexDRADriver"`
Contributor

This approach seems better.

 	// DCGMExporter spec
 	DCGMExporter DCGMExporterSpec `json:"dcgmExporter"`
 	// DCGM component spec
@@ -843,24 +843,24 @@ type SandboxDevicePluginSpec struct {
 	Env []EnvVar `json:"env,omitempty"`
 }

-// DRADriverSpec defines the properties for the NVIDIA DRA Driver deployment
+// IMEXDRADriverSpec defines the properties for the NVIDIA IMEX DRA Driver deployment
 // TODO: add 'controller' and 'kubeletPlugin' structs to allow for per-component configuration
Member

One question: Should we expose controller and kubeletPlugin as concepts to the user? These seem like internal details and including them here couples the operator and the DRA driver implementation more tightly.

Contributor Author @cdesiniotis, Dec 3, 2024

I don't believe we should at this point. That said, I can see how controlling the controller and kubeletPlugin configuration independently could be useful. For example, one may need to raise the CPU / memory resources for the controller (and not the plugin) to account for larger clusters. Open to continuing the discussion and enumerating the fields we want to expose in ClusterPolicy.
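To make the trade-off in this thread concrete, here is a minimal sketch of what per-component overrides could look like. Nothing here is part of the PR: `ResourceSketch`, `ComponentSketch`, `IMEXDRADriverSketch`, the shared `Resources` default, and `effectiveResources` are all hypothetical names chosen for illustration.

```go
package main

import "fmt"

// ResourceSketch is a hypothetical stand-in for CPU / memory requests.
type ResourceSketch struct {
	CPU    string
	Memory string
}

// ComponentSketch holds hypothetical per-component overrides.
type ComponentSketch struct {
	Resources *ResourceSketch
}

// IMEXDRADriverSketch is NOT the spec from this PR. It only illustrates the
// shape the TODO hints at: a shared default plus optional overrides.
type IMEXDRADriverSketch struct {
	Resources     *ResourceSketch  // assumed shared default
	Controller    *ComponentSketch // optional override for the controller
	KubeletPlugin *ComponentSketch // optional override for the kubelet plugin
}

// effectiveResources returns the component override when one is set and
// falls back to the shared default otherwise.
func effectiveResources(shared *ResourceSketch, c *ComponentSketch) *ResourceSketch {
	if c != nil && c.Resources != nil {
		return c.Resources
	}
	return shared
}

func main() {
	spec := IMEXDRADriverSketch{
		Resources: &ResourceSketch{CPU: "100m", Memory: "128Mi"},
		Controller: &ComponentSketch{
			Resources: &ResourceSketch{CPU: "500m", Memory: "512Mi"},
		},
	}
	// The controller picks up its override; the plugin uses the default.
	fmt.Println(*effectiveResources(spec.Resources, spec.Controller))
	fmt.Println(*effectiveResources(spec.Resources, spec.KubeletPlugin))
}
```

A shape like this would let larger clusters raise only the controller's resources, at the cost of exposing the controller / kubelet plugin split in the ClusterPolicy API, which is the coupling concern raised above.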

-type DRADriverSpec struct {
-	// Enabled indicates if the deployment of NVIDIA DRA Driver through the operator is enabled
+type IMEXDRADriverSpec struct {
+	// Enabled indicates if the deployment of NVIDIA IMEX DRA Driver through the operator is enabled
 	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors=true
-	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.displayName="Enable NVIDIA DRA Driver deployment through GPU Operator"
+	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.displayName="Enable NVIDIA IMEX DRA Driver deployment through GPU Operator"
 	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.x-descriptors="urn:alm:descriptor:com.tectonic.ui:booleanSwitch"
 	Enabled *bool `json:"enabled,omitempty"`

-	// NVIDIA DRA Driver image repository
+	// NVIDIA IMEX DRA Driver image repository
 	// +kubebuilder:validation:Optional
 	Repository string `json:"repository,omitempty"`

-	// NVIDIA DRA Driver image name
+	// NVIDIA IMEX DRA Driver image name
 	// +kubebuilder:validation:Pattern=[a-zA-Z0-9\-]+
 	Image string `json:"image,omitempty"`

-	// NVIDIA DRA Driver image tag
+	// NVIDIA IMEX DRA Driver image tag
 	// +kubebuilder:validation:Optional
 	Version string `json:"version,omitempty"`

@@ -1820,9 +1820,9 @@ func ImagePath(spec interface{}) (string, error) {
 	case *SandboxDevicePluginSpec:
 		config := spec.(*SandboxDevicePluginSpec)
 		return imagePath(config.Repository, config.Image, config.Version, "SANDBOX_DEVICE_PLUGIN_IMAGE")
-	case *DRADriverSpec:
-		config := spec.(*DRADriverSpec)
-		return imagePath(config.Repository, config.Image, config.Version, "DRA_DRIVER_IMAGE")
+	case *IMEXDRADriverSpec:
+		config := spec.(*IMEXDRADriverSpec)
+		return imagePath(config.Repository, config.Image, config.Version, "IMEX_DRA_DRIVER_IMAGE")
 	case *DCGMExporterSpec:
 		config := spec.(*DCGMExporterSpec)
 		return imagePath(config.Repository, config.Image, config.Version, "DCGM_EXPORTER_IMAGE")
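The body of `imagePath` is not part of this diff. As a rough sketch only, assuming it prefers the explicit spec fields and falls back to the named environment variable, the resolution could look like the following. `resolveImage` and the registry names are hypothetical stand-ins, not the operator's actual helper.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// resolveImage is a hypothetical stand-in for imagePath. Assumption: fields
// set in the spec take priority, and the environment variable named by
// envName is used only when the spec is incomplete.
func resolveImage(repository, image, version, envName string) (string, error) {
	if repository != "" && image != "" && version != "" {
		return repository + "/" + image + ":" + version, nil
	}
	if fromEnv := os.Getenv(envName); fromEnv != "" {
		return fromEnv, nil
	}
	return "", errors.New("image not set in spec or in " + envName)
}

func main() {
	os.Setenv("IMEX_DRA_DRIVER_IMAGE", "registry.example/dra:from-env")

	// All three spec fields are set, so the env var is ignored.
	p, _ := resolveImage("registry.example", "imex-dra-driver", "v0.0.0", "IMEX_DRA_DRIVER_IMAGE")
	fmt.Println(p)

	// Spec fields are empty, so the env var supplies the image.
	p, _ = resolveImage("", "", "", "IMEX_DRA_DRIVER_IMAGE")
	fmt.Println(p)
}
```

Under that assumption, renaming the variable from `DRA_DRIVER_IMAGE` to `IMEX_DRA_DRIVER_IMAGE` means any deployment that sets the old name would also need updating.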
@@ -1931,8 +1931,8 @@ func (p *DevicePluginSpec) IsEnabled() bool {
 	return *p.Enabled
 }

-// IsEnabled returns true if draDriver is enabled through gpu-operator
-func (d *DRADriverSpec) IsEnabled() bool {
+// IsEnabled returns true if IMEX DRA Driver is enabled through gpu-operator
+func (d *IMEXDRADriverSpec) IsEnabled() bool {
 	if d.Enabled == nil {
 		// default is true if not specified by user
 		return true
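The hunk above is cut off after the nil check. A minimal sketch of the full pattern, assuming the remainder mirrors the `return *p.Enabled` visible for `DevicePluginSpec`, shows why `Enabled` is a `*bool`: a nil pointer means "not specified", which is distinct from an explicit `false`. `specSketch` is a hypothetical type, not the PR's struct.

```go
package main

import "fmt"

// specSketch is a hypothetical stand-in for a component spec with an
// optional Enabled flag.
type specSketch struct {
	Enabled *bool
}

// IsEnabled treats an unset flag as enabled and otherwise returns the
// value the user provided.
func (s *specSketch) IsEnabled() bool {
	if s.Enabled == nil {
		// default is true if not specified by user
		return true
	}
	return *s.Enabled
}

func main() {
	off := false
	fmt.Println((&specSketch{}).IsEnabled())              // unset flag
	fmt.Println((&specSketch{Enabled: &off}).IsEnabled()) // explicit false
}
```

With a plain `bool` field, an omitted `enabled` key would be indistinguishable from `enabled: false`, so a default of true could not be expressed this way.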
82 changes: 41 additions & 41 deletions api/nvidia/v1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

@@ -1,5 +1,5 @@
 apiVersion: v1
 kind: ServiceAccount
 metadata:
-  name: nvidia-dra-driver
+  name: nvidia-imex-dra-driver
   namespace: "FILLED BY THE OPERATOR"

@@ -1,7 +1,7 @@
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
 metadata:
-  name: nvidia-dra-driver
+  name: nvidia-imex-dra-driver
 rules:
 # TODO: restrict RBAC for DRA driver
 - apiGroups:

@@ -1,12 +1,12 @@
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
-  name: nvidia-dra-driver
+  name: nvidia-imex-dra-driver
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: ClusterRole
-  name: nvidia-dra-driver
+  name: nvidia-imex-dra-driver
 subjects:
 - kind: ServiceAccount
-  name: nvidia-dra-driver
+  name: nvidia-imex-dra-driver
   namespace: "FILLED BY THE OPERATOR"

@@ -2,21 +2,21 @@ apiVersion: apps/v1
 kind: Deployment
 metadata:
   labels:
-    app: nvidia-dra-driver-controller
-  name: nvidia-dra-driver-controller
+    app: nvidia-imex-dra-driver-controller
+  name: nvidia-imex-dra-driver-controller
   namespace: "FILLED BY THE OPERATOR"
 spec:
   replicas: 1
   selector:
     matchLabels:
-      app: nvidia-dra-driver-controller
+      app: nvidia-imex-dra-driver-controller
   template:
     metadata:
       labels:
-        app: nvidia-dra-driver-controller
+        app: nvidia-imex-dra-driver-controller
     spec:
       priorityClassName: system-node-critical
-      serviceAccountName: nvidia-dra-driver
+      serviceAccountName: nvidia-imex-dra-driver
       tolerations:
       - effect: NoSchedule
         key: node-role.kubernetes.io/master

@@ -1,10 +1,10 @@
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: nvidia-dra-driver-kubelet-plugin-entrypoint
+  name: nvidia-imex-dra-driver-kubelet-plugin-entrypoint
   namespace: "FILLED BY THE OPERATOR"
   labels:
-    app: nvidia-dra-driver-kubelet-plugin
+    app: nvidia-imex-dra-driver-kubelet-plugin
 data:
   entrypoint.sh: |-
     #!/bin/bash

@@ -2,13 +2,13 @@ apiVersion: apps/v1
 kind: DaemonSet
 metadata:
   labels:
-    app: nvidia-dra-driver-kubelet-plugin
-  name: nvidia-dra-driver-kubelet-plugin
+    app: nvidia-imex-dra-driver-kubelet-plugin
+  name: nvidia-imex-dra-driver-kubelet-plugin
   namespace: "FILLED BY THE OPERATOR"
 spec:
   selector:
     matchLabels:
-      app: nvidia-dra-driver-kubelet-plugin
+      app: nvidia-imex-dra-driver-kubelet-plugin
   updateStrategy:
     rollingUpdate:
       maxSurge: 0
@@ -17,30 +17,17 @@ spec:
   template:
     metadata:
       labels:
-        app: nvidia-dra-driver-kubelet-plugin
+        app: nvidia-imex-dra-driver-kubelet-plugin
     spec:
       priorityClassName: system-node-critical
-      serviceAccountName: nvidia-dra-driver
-      # TODO: revisit the affinity / nodeSelector for this daemonset
+      serviceAccountName: nvidia-imex-dra-driver
       affinity:
         nodeAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
             nodeSelectorTerms:
             - matchExpressions:
-              - key: feature.node.kubernetes.io/pci-10de.present
-                operator: In
-                values:
-                - "true"
-            - matchExpressions:
-              - key: feature.node.kubernetes.io/cpu-model.vendor_id
-                operator: In
-                values:
-                - NVIDIA
-            - matchExpressions:
-              - key: nvidia.com/gpu.present
-                operator: In
-                values:
-                - "true"
+              - key: nvidia.com/gpu.imex-domain
+                operator: Exists
       initContainers:
       - image: "FILLED BY THE OPERATOR"
         name: driver-validation
@@ -83,7 +70,7 @@ spec:
         securityContext:
           privileged: true
         volumeMounts:
-        - name: nvidia-dra-driver-kubelet-plugin-entrypoint
+        - name: nvidia-imex-dra-driver-kubelet-plugin-entrypoint
           readOnly: true
           mountPath: /bin/entrypoint.sh
           subPath: entrypoint.sh
@@ -104,9 +91,9 @@ spec:
         - mountPath: /var/run/cdi
           name: cdi
       volumes:
-      - name: nvidia-dra-driver-kubelet-plugin-entrypoint
+      - name: nvidia-imex-dra-driver-kubelet-plugin-entrypoint
         configMap:
-          name: nvidia-dra-driver-kubelet-plugin-entrypoint
+          name: nvidia-imex-dra-driver-kubelet-plugin-entrypoint
           defaultMode: 448
       - name: run-nvidia-validations
         hostPath:
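The DaemonSet affinity change above narrows scheduling from "any node matching one of three GPU-related terms" to "only nodes carrying the `nvidia.com/gpu.imex-domain` label". In Kubernetes node affinity, `nodeSelectorTerms` are ORed and the `matchExpressions` inside one term are ANDed. The sketch below models just that logic with simplified, hypothetical types (`expr`, `term`, `schedulable`); it is not the scheduler's implementation, and the label value `example-domain` is made up.

```go
package main

import "fmt"

// expr is a simplified match expression supporting only "In" and "Exists".
type expr struct {
	Key      string
	Operator string
	Values   []string
}

// term is one nodeSelectorTerm: all of its expressions must match (AND).
type term []expr

// matches evaluates a single expression against a node's labels.
func (e expr) matches(labels map[string]string) bool {
	v, ok := labels[e.Key]
	switch e.Operator {
	case "Exists":
		return ok
	case "In":
		for _, want := range e.Values {
			if ok && v == want {
				return true
			}
		}
	}
	return false
}

// schedulable reports whether any term matches (terms are ORed).
// An empty term matches nothing.
func schedulable(terms []term, labels map[string]string) bool {
	for _, t := range terms {
		if len(t) == 0 {
			continue
		}
		all := true
		for _, e := range t {
			if !e.matches(labels) {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

// oldTerms mirrors the removed affinity; newTerms mirrors the added one.
var oldTerms = []term{
	{{Key: "feature.node.kubernetes.io/pci-10de.present", Operator: "In", Values: []string{"true"}}},
	{{Key: "feature.node.kubernetes.io/cpu-model.vendor_id", Operator: "In", Values: []string{"NVIDIA"}}},
	{{Key: "nvidia.com/gpu.present", Operator: "In", Values: []string{"true"}}},
}

var newTerms = []term{
	{{Key: "nvidia.com/gpu.imex-domain", Operator: "Exists"}},
}

func main() {
	gpuOnly := map[string]string{"nvidia.com/gpu.present": "true"}
	imexNode := map[string]string{
		"nvidia.com/gpu.present":     "true",
		"nvidia.com/gpu.imex-domain": "example-domain", // hypothetical value
	}
	fmt.Println(schedulable(oldTerms, gpuOnly))  // old affinity, plain GPU node
	fmt.Println(schedulable(newTerms, gpuOnly))  // new affinity, plain GPU node
	fmt.Println(schedulable(newTerms, imexNode)) // new affinity, IMEX-labelled node
}
```

Because `Exists` ignores the label's value, any node with the label set is selected regardless of which IMEX domain it names.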