fix: bacalhau dind w/ nvidia #463

Merged on Dec 6, 2024 (4 commits)


@walkah (Collaborator) commented Dec 4, 2024

Summary

Our current bacalhau container did not properly mount the NVIDIA drivers, so they were never made available to job containers either. This PR makes the following changes:

  • Updates the bacalhau container to use cuda-runtime as the base
  • Sets up "Docker in Docker" (dind) and nvidia container toolkit
  • Changes docker-compose files to run bacalhau as privileged (rather than mounting /var/run/docker.sock as a volume)

Task/Issue reference

Closes: #440

Test plan

On a GPU-enabled machine, build and run the current container version:

  • Build: docker compose -f ./docker/docker-compose.yml build
  • Run: docker compose -f ./docker/docker-compose.yml up
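
As a rough illustration of how the test plan could be scripted, here is a minimal sketch. The service name bacalhau is inferred from the log prefix in this thread, and the presence of nvidia-smi inside the container is an assumption, as are the helper names compose and gpu_smoke_test and the DOCKER override used for dry runs.

```shell
# Hypothetical smoke test for the test plan above.
# Assumptions: the compose service is named "bacalhau" and nvidia-smi
# is available inside that container once the drivers are mounted.
COMPOSE_FILE="${COMPOSE_FILE:-./docker/docker-compose.yml}"
DOCKER="${DOCKER:-docker}"   # set DOCKER=echo to print commands instead of running them

# Run a docker compose subcommand against the project's compose file.
compose() {
    "$DOCKER" compose -f "$COMPOSE_FILE" "$@"
}

# Build, start in the background, then ask the container to list its GPUs.
gpu_smoke_test() {
    compose build &&
    compose up -d &&
    compose exec bacalhau nvidia-smi
}
```

Calling gpu_smoke_test on a GPU host should list the visible devices if the drivers are mounted correctly; with DOCKER=echo it only prints the three commands it would run.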

@walkah walkah requested a review from a team as a code owner December 4, 2024 15:54
@cla-bot cla-bot bot added the cla-signed label Dec 4, 2024
@github-actions github-actions bot added the fix label Dec 4, 2024
@bgins (Contributor) left a comment

Works on a CPU node after changing COMPUTE_MODE to cpu and sending a cowsay job.

Noticed that cancelling out of docker compose -f ./docker/docker-compose.yml up left Bacalhau in a bad state where it would not restart:

bacalhau           | time="2024-12-06T19:39:38.542704888Z" level=info msg="Starting up"
bacalhau           | failed to start daemon, ensure docker is not running or delete /var/run/docker.pid: process with PID 7 is still running

Stopping with docker compose -f ./docker/docker-compose.yml down did not have this issue. It may be worth investigating whether this would also surface during restarts.

@walkah (Collaborator, Author) commented Dec 6, 2024

> Noticed that cancelling out of docker compose -f ./docker/docker-compose.yml up left Bacalhau in a bad state where it would not restart:

I added a find /run /var/run -iname 'docker*.pid' -delete before the dind dockerd & line so that any stale PID files get cleared out. Thanks!
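
For readers following along, the cleanup step described above might look roughly like this in an entrypoint script. This is a sketch, not the merged code: the function name clear_stale_docker_pids is hypothetical, and only the find command and the dind dockerd & invocation are taken from the comment.

```shell
# Sketch of the PID cleanup step; the function name is hypothetical.
# Removes leftover Docker PID files (e.g. docker.pid) that an unclean
# shutdown can leave behind and that stop dockerd from starting again.
clear_stale_docker_pids() {
    # Do nothing without explicit directories, so find never falls back
    # to scanning the current directory.
    [ "$#" -gt 0 ] || return 0
    # Missing directories are ignored rather than treated as fatal.
    find "$@" -iname 'docker*.pid' -delete 2>/dev/null || true
}

# Intended startup order inside the container (not executed here):
#   clear_stale_docker_pids /run /var/run
#   dind dockerd &
```

Taking the directories as arguments keeps the step easy to exercise against a scratch directory before wiring it into the container startup.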

@walkah walkah merged commit 2767106 into main Dec 6, 2024
4 checks passed
@walkah walkah deleted the walkah/fix-bacalhau-nvidia-dind branch December 6, 2024 22:31
walkah added a commit that referenced this pull request Dec 10, 2024
* fix: bacalhau dind w/ nvidia

* fix: smaller RP image

* fix: PR feedback
Development

Successfully merging this pull request may close these issues.

Bacalhau not identifying Docker container for targeted job