feat: implement preflight #502

noryev · 2025-01-29T00:34:16Z

Summary

This PR implements a new pre-flight check system for Lilypad resource providers. The pre-flight check verifies essential requirements including GPU availability, Docker runtime, and system resources before a provider joins the network.

This pull request makes the following changes:

Implemented GPU detection and validation using nvidia-smi
Added Docker runtime verification
Introduced system requirements check (1GB RAM minimum)
Added graceful handling for non-GPU environments
Enhanced logging system for pre-flight status and success messages

These changes ensure that a Resource Provider is prepared to run a Lilypad Module.

Task/Issue reference

First iteration of preflight!

Test plan

You can test this locally within the dev environment.

Details (optional)

Add any additional details that will help to review this pull request.

Related issues or PRs (optional)

A previous PR that served as inspiration: #433

narbs91

Good stuff @noryev! Just a couple of small comments/suggestions.

narbs91 · 2025-01-29T01:10:16Z

pkg/resourceprovider/preflight/docker.go

+			Message: "NVIDIA runtime test failed",
+			Error:   err,


[nit] The ordering of these fields differs from the lines above. Also, maybe worth adding a formatted error like the others?

Fixed the ordering and, per Brian's comments, I also moved these structs to preflight.go to get rid of types.go.

narbs91 · 2025-01-29T01:30:14Z

pkg/resourceprovider/preflight/gpu.go

+		fields := strings.Split(record, ", ")
+		if len(fields) != 4 {
+			continue
+		}
+
+		memoryStr := strings.Split(fields[2], " ")[0]
+		memoryMiB, _ := strconv.ParseInt(memoryStr, 10, 64)
+
+		gpu := GPUInfo{
+			UUID:          strings.TrimSpace(fields[0]),
+			Name:          strings.TrimSpace(fields[1]),
+			MemoryTotal:   memoryMiB,
+			DriverVersion: strings.TrimSpace(fields[3]),
+		}


Maybe would be cleaner if you had a mapper function that took in a record and spit out a gpu object? Also, do we know for certain that the fields won't be blank?

TY! Totally agree, I have moved that parsing code into its own parseGPURecord func that is cleaner. WDYT? d1b7d78

I also added better error handling if the fields are blank too
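For readers without the commit at hand, a mapper of this kind might look like the sketch below. It is not the code from d1b7d78: the error messages and the exact validation rules are assumptions, and the record shape ("uuid, name, memory MiB, driver version") is inferred from the diff above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// GPUInfo mirrors the fields used in the diff above.
type GPUInfo struct {
	UUID          string
	Name          string
	MemoryTotal   int64 // MiB
	DriverVersion string
}

// parseGPURecord converts one comma-separated nvidia-smi record into a GPUInfo.
// It rejects records with the wrong field count, blank fields, or a
// non-numeric memory value instead of silently skipping them.
func parseGPURecord(record string) (GPUInfo, error) {
	fields := strings.Split(record, ",")
	if len(fields) != 4 {
		return GPUInfo{}, fmt.Errorf("expected 4 fields, got %d in %q", len(fields), record)
	}
	for i := range fields {
		fields[i] = strings.TrimSpace(fields[i])
		if fields[i] == "" {
			return GPUInfo{}, fmt.Errorf("field %d is blank in %q", i, record)
		}
	}

	// The memory field carries a unit suffix (e.g. "<number> MiB"); keep the number.
	memoryStr := strings.Fields(fields[2])[0]
	memoryMiB, err := strconv.ParseInt(memoryStr, 10, 64)
	if err != nil {
		return GPUInfo{}, fmt.Errorf("invalid memory value %q: %w", fields[2], err)
	}

	return GPUInfo{
		UUID:          fields[0],
		Name:          fields[1],
		MemoryTotal:   memoryMiB,
		DriverVersion: fields[3],
	}, nil
}
```

Returning an error for each failure mode, rather than discarding the `ParseInt` error as the original hunk did, lets the caller decide whether to skip the record or fail the preflight check.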

narbs91 · 2025-01-29T01:33:13Z

pkg/resourceprovider/preflight/gpu.go

+type preflightChecker struct {
+	gpuInfo []GPUInfo
+}
+
+type GPUCheckConfig struct {
+	Required     bool
+	MinGPUs      int
+	MinMemory    int64
+	Capabilities []string
+}
+
+func checkNvidiaSMI() error {
+	_, err := exec.LookPath("nvidia-smi")
+	return err
+}
+
+type nvidiaSmiResponse struct {
+	UUID          string
+	Name          string
+	MemoryTotal   string
+	DriverVersion string
+}


Any reason for leaving these in this file vs putting them in preflight/types?

That would have made more sense! For cleaner code, I moved the contents of types.go into preflight.go per Brian's instructions, which gets rid of types.go.

bgins · 2025-01-29T19:30:16Z

pkg/resourceprovider/preflight/types.go

Could we move these types into preflight.go? Having the types alongside the code that uses them is a better code organization pattern.

Yep! Thank you for the recommendation!

bgins · 2025-01-29T19:35:58Z

pkg/options/resource-provider.go

+		Preflight: preflight.PreflightConfig{
+			GPU: struct {
+				Required     bool
+				Enabled      bool
+				MinMemoryGB  int64
+				Capabilities []string
+			}{
+				Required:    false, // Enable to require GPU
+				Enabled:     true,  // Enable checks to detect if GPU exists
+				MinMemoryGB: 1,     // Minimum memory required for GPU (we can match this with the resourceOffer)
+			},
+		},


We discussed these configs. It looks like this may run fine on CPU-only and GPU machines, and doesn't significantly slow down start up times. For now, let's leave the preflight check enabled in all cases, and we can see if there is a motivation to make it configurable.

The Required config can also be removed now that the preflight check works fine on any machine.

The MinMemoryGB can be moved to the code that uses it and hardcoded. In the future, there may be a reason a resource provider would want to configure this, but for a first pass we can hard code.

I removed the toggle(s), so preflight now always runs, with or without a GPU. See commit d14be6b.
Secondly, I assigned the MinMemoryGB value to a variable that's hardcoded in the code: 7b42625
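A hardcoded minimum of this sort could be expressed roughly as follows. This is a sketch only: the `requiredGPUMemoryGB` and `checkGPUMemory` names are hypothetical, the 1 GB value comes from the config hunk above, and treating 1 GB as 1024 MiB is an assumption about how the comparison against nvidia-smi's MiB figure is done.

```go
package main

import "fmt"

// requiredGPUMemoryGB is the hardcoded minimum VRAM per GPU, in GB.
// The value 1 matches the MinMemoryGB default discussed in this thread.
const requiredGPUMemoryGB int64 = 1

// checkGPUMemory returns an error when a GPU reports less VRAM than the
// hardcoded minimum. memoryMiB is the total memory as reported in MiB.
func checkGPUMemory(name string, memoryMiB int64) error {
	requiredMiB := requiredGPUMemoryGB * 1024 // assumption: 1 GB == 1024 MiB
	if memoryMiB < requiredMiB {
		return fmt.Errorf("GPU %q has %d MiB of memory, need at least %d MiB",
			name, memoryMiB, requiredMiB)
	}
	return nil
}
```

Keeping the constant next to the check means a future move back to user-facing configuration only has to replace one value.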

bgins · 2025-02-03T22:58:41Z

pkg/options/resource-provider.go

+		Preflight: preflight.PreflightConfig{
+			GPU: struct {
+				MinMemoryGB int64
+			}{
+				MinMemoryGB: preflight.RequiredGPUMemoryGB,
+			},
+		},


Move this to the preflight package. We add items to the options package when they are configurable by users.

Co-authored-by: logan <[email protected]>

narbs91

LGTM!

bgins · 2025-02-04T17:29:54Z

pkg/resourceprovider/resourceprovider.go

@@ -80,6 +80,7 @@ type ResourceProvider struct {
 	web3SDK    *web3.Web3SDK
 	options    ResourceProviderOptions
 	controller *ResourceProviderController
+	gpuInfo    []preflight.GPUInfo


Are we using this?

Co-authored-by: logan <[email protected]>

bgins

Looks good! Great work on this. 🎉

noryev · 2025-02-04T21:12:44Z

TYSM Brian and Narb! I have made sure preflight is working across multiple Resource Provider environments:
Hardware tested:

Multiple Apple Silicon RPs
Two different NVIDIA GPU-enabled RPs

Here is an example of the preflight logs on a GPU-enabled RP

noryev requested a review from a team as a code owner January 29, 2025 00:34

cla-bot bot added the cla-signed label Jan 29, 2025

noryev changed the title ~~Feat/preflight logan~~ feat/preflight logan Jan 29, 2025

noryev changed the title ~~feat/preflight logan~~ noryev/feat-preflight-checks Jan 29, 2025

noryev changed the title ~~noryev/feat-preflight-checks~~ feat: implement preflight Jan 29, 2025

github-actions bot added the feature label Jan 29, 2025

narbs91 reviewed Jan 29, 2025

bgins reviewed Jan 29, 2025

noryev added 18 commits February 3, 2025 12:28

feat: barebones

a334ecd

feat: Implement preflight checks for GPU and Docker runtime

ced7238

feat: Preflight logging and success messages

a7ee2ae

chore: restore comments in [pkg/resourceprovider/resourceprovider.go]

6f6ed68

feat: gpu check + cleanup

f201b4d

fix: gpu nvidia-smi check

f436084

fix: handle no-GPU case gracefully

d6099e0

fix: update start method in RP

834cd16

chore: remove unused dockerfiles

95c6ca0

chore: restore comments in [pkg/resourceprovider/resourceprovider.go]

7db0a3f

chore: remove comments within [pkg/resourceprovider/resourceprovider.go]

4db0862

refactor: removed minimum GPU parameter

8e1e504

feat: 1gb ram requirement

06329f0

refactor: simplify GPU configuration by removing unnecessary parameters

a77b80d

refactor: enhance GPU info logging and remove types file

952898e

refactor: replace hardcoded GPU memory with default constant

00e7467

refactor: improve GPU info parsing and validation in GetGPUInfo

7e746d5

chore: comments for required GPU VRAM

2e56993

noryev force-pushed the feat/preflight-logan branch from 9cbff18 to 2e56993 Compare February 3, 2025 18:29

bgins reviewed Feb 3, 2025

refactor: Move preflight checker from interface to struct

c216504

Co-authored-by: logan <[email protected]>

bgins and others added 4 commits February 3, 2025 15:30

chore: Remove preflight check from start function

b6342c6

Co-authored-by: logan <[email protected]>

refactor: Move RunPreflightChecks function to preflight package

4582da5

Co-authored-by: logan <[email protected]>

refactor: Move preflight config to preflight package

f01160d

Co-authored-by: logan <[email protected]>

chore: refactor context within resource provider

65330ad

narbs91 approved these changes Feb 4, 2025

bgins reviewed Feb 4, 2025

noryev and others added 4 commits February 4, 2025 11:35

chore: remove unused gpuInfo field

12d4238

refactor: Make functions and structs private where possible

f8c43ac

Co-authored-by: logan <[email protected]>

chore: Exit early when no GPU detected

5ac7ee5

Co-authored-by: logan <[email protected]>

chore: Improve failed to parse GPU string error

0150090

Co-authored-by: logan <[email protected]>

bgins approved these changes Feb 4, 2025

noryev merged commit 9c2ae59 into main Feb 4, 2025
5 checks passed

noryev deleted the feat/preflight-logan branch February 4, 2025 21:15

lilypad-releases bot mentioned this pull request Feb 4, 2025

chore(main): release 2.12.0 #490

Open
