Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement preflight #502

Merged
merged 27 commits into from
Feb 4, 2025
Merged

feat: implement preflight #502

merged 27 commits into from
Feb 4, 2025

Conversation

noryev
Copy link
Contributor

@noryev noryev commented Jan 29, 2025

Summary

This PR implements a new pre-flight check system for Lilypad resource providers. The pre-flight check verifies essential requirements including GPU availability, Docker runtime, and system resources before a provider joins the network.

This pull request makes the following changes:

  • Implemented GPU detection and validation using nvidia-smi
  • Added Docker runtime verification
  • Introduced system requirements check (1GB RAM minimum
  • Added graceful handling for non-GPU environments
  • Enhanced logging system for pre-flight status and success messages

These changes are made to ensure a Resource Provider(s) is prepared to run a Lilypad Module

Task/Issue reference

First iteration of preflight!

Test plan

You can test this locally within the dev environment.

Details (optional)

Add any additional details that will help to review this pull request.

Related issues or PRs (optional)

A previous PR that served as inspiration: #433

@noryev noryev requested a review from a team as a code owner January 29, 2025 00:34
@cla-bot cla-bot bot added the cla-signed label Jan 29, 2025
@noryev noryev changed the title Feat/preflight logan feat/preflight logan Jan 29, 2025
@noryev noryev changed the title feat/preflight logan noryev/feat-preflight-checks Jan 29, 2025
@noryev noryev changed the title noryev/feat-preflight-checks feat: implement preflight Jan 29, 2025
Copy link
Collaborator

@narbs91 narbs91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff @noryev ! Just a couple small comments/suggestions

Comment on lines 48 to 49
Message: "NVIDIA runtime test failed",
Error: err,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Ordering of these is different in comparison to the lines above. Also maybe a worth adding a formatted error like the others?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the ordering and per Brians comments, I also moved these structs to preflight.go to get rid of the types.toml

Comment on lines 58 to 71
fields := strings.Split(record, ", ")
if len(fields) != 4 {
continue
}

memoryStr := strings.Split(fields[2], " ")[0]
memoryMiB, _ := strconv.ParseInt(memoryStr, 10, 64)

gpu := GPUInfo{
UUID: strings.TrimSpace(fields[0]),
Name: strings.TrimSpace(fields[1]),
MemoryTotal: memoryMiB,
DriverVersion: strings.TrimSpace(fields[3]),
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe would be cleaner if you had a mapper function that took in a record and spit out a gpu object? Also, do we know for certain that the fields won't be blank?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TY! Totally agree, I have moved that parsing code into its own parseGPURecord func that is cleaner. WDYT? d1b7d78

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added better error handling if the fields are blank too

Comment on lines 13 to 34
type preflightChecker struct {
gpuInfo []GPUInfo
}

type GPUCheckConfig struct {
Required bool
MinGPUs int
MinMemory int64
Capabilities []string
}

func checkNvidiaSMI() error {
_, err := exec.LookPath("nvidia-smi")
return err
}

type nvidiaSmiResponse struct {
UUID string
Name string
MemoryTotal string
DriverVersion string
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for leaving these in this file vs putting them in preflight/types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would have made more sense! For cleaner code, I moved types.go into preflight.go per Brians instructions getting rid of the types.go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move these types into preflight.go? Having the types alongside the code that uses them is a better code organization pattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Thank you for the recommendation!

Comment on lines 22 to 28
Preflight: preflight.PreflightConfig{
GPU: struct {
Required bool
Enabled bool
MinMemoryGB int64
Capabilities []string
}{
Required: false, // Enable to require GPU
Enabled: true, // Enable checks to detect if GPU exists
MinMemoryGB: 1, // Minimum memory required for GPU (we can match this with the resourceOffer)
},
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed these configs. It looks like this may run fine on CPU-only and GPU machines, and doesn't significantly slow down start up times. For now, let's leave the preflight check enabled in all cases, and we can see if there is a motivation to make it configurable.

The Required config can also be removed now that the preflight check works fine on any machine.

The MinMemoryGB can be moved to the code that uses it and hardcoded. In the future, there may be a reason a resource provider would want to configure this, but for a first pass we can hard code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the toggle(s) so now preflight is always running with or without a GPU within the commits here: d14be6b
Secondly, I assigned the MinMemoryGB value to a variable thats hardcoded in the code. 7b42625

@noryev noryev force-pushed the feat/preflight-logan branch from 9cbff18 to 2e56993 Compare February 3, 2025 18:29
Comment on lines 22 to 28
Preflight: preflight.PreflightConfig{
GPU: struct {
MinMemoryGB int64
}{
MinMemoryGB: preflight.RequiredGPUMemoryGB,
},
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to the preflight package. We add items to the options package when they are configurable by users.

Copy link
Collaborator

@narbs91 narbs91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -80,6 +80,7 @@ type ResourceProvider struct {
web3SDK *web3.Web3SDK
options ResourceProviderOptions
controller *ResourceProviderController
gpuInfo []preflight.GPUInfo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using this?

Copy link
Contributor

@bgins bgins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Great work on this. 🎉

@noryev
Copy link
Contributor Author

noryev commented Feb 4, 2025

TYSM Brian and Narb! I have made sure preflight is working across multiple Resource Provider environments:
Hardware tested:

  • Multiple Apple Silicon RP's
  • Two different NVIDIA GPU enabled RP's

Here is an example preflight logs on a GPU enabled RP
image

@noryev noryev merged commit 9c2ae59 into main Feb 4, 2025
5 checks passed
@noryev noryev deleted the feat/preflight-logan branch February 4, 2025 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants