Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/cg/sd/v2: rely on systemd to set device access rules #3847

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Apr 26, 2023

It seems that the code added by commit b810da1 had cgroup v1
in mind, where runc overwrites the rules set by systemd. It is all
different in v2, because both ebpf programs (systemd's and runc's) have
to say "allow" for the device to get access.

So, when using cgroup v2 with systemd cgroup driver, access to devices
rules for that can't be translated to systemd properties is not possible
at all, and it makes sense to error out (rather than warn) in such case,
as the container won't work as intended.

With this change in mind, provided that runc correctly translates all
the device access rule, and systemd correctly applies those, we no
longer have to create and apply a second eBPF program with own rules.
Let's stop doing that, instead relying on systemd only.

Having two sets of rules (two ebpf programs) for cgroupv2/ebpf is
problematic for two reasons:

  1. Both sets should say "ok" for access to be granted.

  2. After systemd daemon-reload (which happens during routine systemd
    upgrade) the program runc adds is removed, so it's a time-bomb.

Note 1: by the way, this difference in cgroup v1 vs v2 behavior explains failures seen in #3620, #3708, #3671.

Note 2: since this may be a breaking change (container won't run vs device won't be accessible), let's not backport this one, but make it part of runc 1.2.

@kolyshkin kolyshkin requested review from cyphar and AkihiroSuda April 26, 2023 21:20
@kolyshkin kolyshkin added this to the 1.2.0 milestone Apr 26, 2023
@kolyshkin kolyshkin mentioned this pull request Apr 26, 2023
@kolyshkin kolyshkin force-pushed the systemd-dev-error branch from 27ecf36 to 021cc72 Compare May 3, 2023 01:10
@kolyshkin kolyshkin changed the title libct/cg/sd: error on untranslatable dev rules in v2 libct/cg/sd/v2: rely on systemd to set device access rules May 3, 2023
@kolyshkin kolyshkin force-pushed the systemd-dev-error branch from 021cc72 to 1a7c643 Compare May 3, 2023 01:28
@kolyshkin

This comment was marked as resolved.

@kolyshkin kolyshkin marked this pull request as draft May 3, 2023 20:30
@kolyshkin
Copy link
Contributor Author

kolyshkin commented May 3, 2023

As for the systemd, cgroup v2 is the default since systemd v243 (and most distros reverted that). Meaning, if we're running on cgroup v2 + systemd, we can assume systemd is at least at v243.

Also, here's the list of distros having cgroup v2 by default, and their respective systemd versions.

Distro oldest systemd version source of info
Fedora (since 31) 243 http://mirror.math.princeton.edu/pub/fedora-archive/fedora/linux/releases/31/Everything/source/tree/Packages/s/
Arch Linux (since April 2021) 248 https://wiki.archlinux.org/index.php?title=Cgroups&diff=next&oldid=653999
openSUSE Tumbleweed (since c. 2021) ?
Debian GNU/Linux (since 11) 247 https://packages.debian.org/bullseye/systemd
Ubuntu (since 21.10) 248 https://launchpad.net/ubuntu/+source/systemd/248.3-1ubuntu4
RHEL and RHEL-like distributions (since 9) 249 https://git.centos.org/rpms/systemd/history/SPECS/systemd.spec?identifier=c9-beta

@kolyshkin
Copy link
Contributor Author

kolyshkin commented May 3, 2023

Based on the info from the previous comment, it does not make sense to check for minimal systemd version in systemd+cgroup v2 driver, as it will be > v240 in any case.

@kolyshkin kolyshkin marked this pull request as ready for review May 3, 2023 22:40
@kolyshkin
Copy link
Contributor Author

@opencontainers/runc-maintainers @cyphar PTAL

@kolyshkin kolyshkin force-pushed the systemd-dev-error branch 2 times, most recently from 7ffb6e3 to 31ff5a6 Compare August 3, 2023 05:56
@kolyshkin
Copy link
Contributor Author

@cyphar what do you think about this? In my view, this fixes the mess of duplicate bpf programs.

The alternative is to switch to using BPFPprogram= property, but it requires systemd >= v249 which is somewhat new, meaning we'll have to keep all this device rule conversion code, too, for quite some time. Meaning, it's needed anyway.

@kolyshkin kolyshkin force-pushed the systemd-dev-error branch from 31ff5a6 to c335671 Compare July 3, 2024 16:12
@kolyshkin
Copy link
Contributor Author

Rebased, ptal @cyphar @AkihiroSuda

@thaJeztah
Copy link
Member

Ubuntu (since 21.10)

Worth noting that Ubuntu 20.04 is still quite popular, although I don't know cgroups v2 on that one (not near a computer to check right now)

@cyphar cyphar force-pushed the systemd-dev-error branch from c335671 to 4b358f4 Compare October 21, 2024 07:13
Comment on lines 143 to 148
err = fmt.Errorf("systemd older than v240 does not support wildcard-minor rules for devices not listed in /proc/devices: +%v", *rule)
if cgroupVer == 2 {
return nil, err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely we don't want to always return an error in this case, even if systemd supports it? I guess it's more consistent but now we're going to start returning errors...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This here is a combination of these four factors:

  1. The rule is minor-wildcard (i.e. $MAJOR:*).
  2. Systemd is older than v240 (so we can't use {block,char}-NNN syntax, where NNN == $MAJOR).
  3. The MAJOR specified can't be found in /proc/devices (so we can't use {block,char}-NAME syntax, where NAME is an entry from /proc/devices corresponding to specified major number).
  4. Cgroup is v2, meaning we solely rely on systemd for rules management (see the change to libcontainer/cgroups/fs2/fs2.go), meaning we can't enable access to these devices.

So, the two options we have are ignore this device rule (with a warning) or return an error. I'd rather do the latter since otherwise it's harder to debug cases when we can't access a device.

NOTE we do not return errors if we think systemd supports it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar It just occurred to me you meant we should still generate our own EBPF (in this case these errors might be downgraded to warnings).

Or do something else -- say, if there are no warnings when generating systemd device rules, skip adding our own EBPF, otherwise add one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played with this a bit, checking a case when a rule is not added to systemd DeviceAccess, but only to the EBPF we generate. In this case kernel denies access to the device.

This essentially means:

  • if we can't translate a device rule to systemd lingo, having our own EBPF won't help to allow access;
  • if we can translate all the rules, and systemd is bug free™, having our own EBPF is redundant.

This is the reason why in this PR I chose to

  • return an error if a rule can't be applied;
  • not generate a second EBPF.

Now, we can certainly downgrade the errors to warnings, let me know if this is how you prefer it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar OK I've downgraded errors to warnings. PTAL.

@cyphar
Copy link
Member

cyphar commented Oct 21, 2024

I'm going to move the milestone to 1.3.0. I'm not sure this is a critically needed fix for 1.2.0, and the fact (AFAICS) we could return errors for configurations that were previously allowed we probably should let the code stew in a 1.3.0-rc1 first.

@cyphar cyphar modified the milestones: 1.2.0, 1.3.0 Oct 21, 2024
It seems that the code added by commit b810da1 had cgroup v1
in mind, where runc overwrites the rules set by systemd.

In cgroup v2, things are different, since both ebpf programs (the one
added by systemd and the one added by runc) have to say "allow" so that
a program can access a device. So, adding a second EBPF is not really
helpful (provided that systemd works as it should).

This commit does two things:

1. Removes adding a second EBPF when systemd is used.

2. Promotes debug-warnings to real warnings for cgroup v2 case,
   since for v2 case they are real problematic.

Initial version of this commit returned an error for cgroup v2 case, but
we've decided to demote it to a warning, otherwise we risk breaking
backward compatibility.

Signed-off-by: Kir Kolyshkin <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants