FEXInterpreter: Punch through a /sys/fex/rootfs node #4228

Closed
Sonicadvance1 wants to merge 1 commit

Sonicadvance1
Member

Currently pressure-vessel does a bit of a bodge to get the x86 rootfs path when running inside of FEX. It does this by opening /. which works around our pseudo-overlayfs tracking. While this worked, it wasn't guaranteed to work forever. With #4225 working to fix more issues around how rootfs is laid out, it had to break this path (while adding a workaround for it to keep working).

Give pressure-vessel a blessed path from the EmulatedFiles code paths to get a real fd back to the x86 rootfs, that way if we break this code path it is entirely our problem to fix.

Still need to have a conversation with upstream pressure-vessel to see if they'll accept this path or if we need to do something different.

We can also use this same mechanism in the future if we want to expose more FEX specific data to the application through this.

@asahilina
Contributor

asahilina commented Dec 20, 2024

So the problem here is... my optimization in #4225 of overlay file lookup breaks this, since it only works with files that do already exist on the host FS.

We could split the difference though, and do the lookup once on the raw path (symlinks / cwd-relative paths / funny mount business won't work), and once after opening the backing file normally. That means direct /proc/foo or /sys/foo lookups including this one would work (and not cause any extra opens/syscalls), and opens of overlaid files through weird mounts/symlinks/relative paths/funny paths with ../whatever would only work for existing files. So as long as we declare that /sys/fex/* can only be accessed as an absolute path directly opened, we're good.

Does that sound OK? I'll update #4225 if so.
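
To make the "raw path only" restriction concrete, here is a minimal sketch of an exact-match lookup on the path string as passed to open(). The table `emulated_nodes` and the function `is_emulated_node` are hypothetical names for illustration, not FEX's actual EmulatedFiles code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table of emulated nodes; only the path proposed in this PR. */
static const char *const emulated_nodes[] = { "/sys/fex/rootfs" };

/* Returns true only when raw_path is byte-for-byte equal to a known node.
 * No normalization is done, so symlinks, cwd-relative paths, "..", and
 * trailing slashes all miss. */
bool is_emulated_node(const char *raw_path) {
    for (size_t i = 0; i < sizeof emulated_nodes / sizeof emulated_nodes[0]; i++) {
        if (strcmp(raw_path, emulated_nodes[i]) == 0) {
            return true;
        }
    }
    return false;
}
```

Because the comparison happens before any file is opened, a hit costs no extra syscalls, which is the property described above.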

@Sonicadvance1
Member Author

> So the problem here is... my optimization in #4225 of overlay file lookup breaks this, since it only works with files that do already exist on the host FS.
>
> We could split the difference though, and do the lookup once on the raw path (symlinks / cwd-relative paths / funny mount business won't work), and once after opening the backing file normally. That means direct /proc/foo or /sys/foo lookups including this one would work (and not cause any extra opens/syscalls), and opens of overlaid files through weird mounts/symlinks/whatever would only work for existing files. So as long as we declare that /sys/fex/* can only be accessed as an absolute path directly opened, we're good.
>
> Does that sound OK? I'll update #4225 if so.

The current expectation is that they will open the node by its absolute path, like openat(AT_FDCWD, "/sys/fex/rootfs", O_RDONLY|O_DIRECTORY) or the openat2 equivalent. Even the current system will break if you put a trailing / onto that, since it's an exact-match check.

If we add more files in here, we expect the same behaviour, except maybe without O_DIRECTORY when the node is a regular file rather than the directory redirect we're doing in this case.

@asahilina
Contributor

Ah wait, this isn't going to work. You can't just let them open the RootFS, that's what opening / would do anyway in merged mode. What gets cleaned up is the readlink on the proc file, so it still looks like "/". So what you want here is to just return the RootFS path as a string or something.

But I wonder... what's the point of this? The way pressure-vessel broke for me is just that without a RootFS detected, it assumed the host wasn't FEX. But I don't get what they need the RootFS path for, it's not like they can access the RootFS in merged-RootFS mode through open() family calls and it still worked?

@Sonicadvance1
Member Author

> Ah wait, this isn't going to work. You can't just let them open the RootFS, that's what opening / would do anyway in merged mode. What gets cleaned up is the readlink on the proc file, so it still looks like /. So what you want here is to just return the RootFS path as a string or something.
>
> But I wonder... what's the point of this? The way pressure-vessel broke for me is just that without a RootFS detected, it assumed the host wasn't FEX. But I don't get what they need the RootFS path for, it's not like they can access the RootFS in merged-RootFS mode through open() family calls and it still worked?

I haven't looked too closely at how your merged rootfs mode works, but pressure-vessel bind-mounts a few paths. In FEX's case it needs the rootfs (even if that is just /) to know where to bind mount the x86 graphics libraries and glibc inside of its container. Since it pivots most everything over to the steam-linux-runtime, the only thing coming from the host environment is glibc and graphics drivers afaik.

@asahilina
Contributor

I need to look more closely into what they're doing and whether there's a cleaner solution here. I know about the bind mounts but I'm not sure exactly how they behave with the rootfs. I do know that without the workaround, they just disable FEX mode entirely when they can't find the rootfs, and then end up trying to do a traditional pivot_root, and that just breaks everything since FEX can't run with all of its dependencies missing.

@smcv

smcv commented Jan 2, 2025

To recap, the purpose of pressure-vessel is that it sets up a container runtime in a new filesystem namespace, with application-level libraries like SDL and libjpeg taken directly from the Steam Runtime (for long-term ABI compatibility), but a small subset of libraries taken from the host system (for ability to use host x86 graphics drivers).

The problem with FEX + pressure-vessel is that FEX implements a mockup of a new filesystem namespace in user-space for x86 code, with a pseudo-overlayfs as its root (when x86 code opens a path, FEX redirects it to either the real root or FEX's "rootfs"); but pressure-vessel is genuinely entering a different filesystem namespace at kernel level ("below" FEX), which affects both ARM and x86 code. Their stacking order could be considered to be inconsistent: in many ways FEX is lower-level than pressure-vessel, in the sense that it's emulating a different CPU whereas pressure-vessel operates entirely in x86 world; but pressure-vessel is changing the filesystem layout at kernel level, which is lower-level/more fundamental than what FEX is doing in user-space. Unfortunately, I don't see a way for both modules to do their jobs without them being stacked in this order, so at least one of the two needs to be aware of implementation details of the other.

pressure-vessel certainly needs to know that FEX is there, so that it can provide the real ARM root filesystem in the root of its new filesystem namespace; otherwise, FEX will stop working inside our container, because its ARM shared library dependencies are missing (unless FEX was statically linked like qemu-x86_64-static, but I think that would cause as many problems as it would solve). If this detection was enough, we'd be able to do it without needing filesystem lookups, by querying the hypervisor ID from CPUID, the same way we detect e.g. qemu or Xen for diagnostic purposes (see steam-runtime-tools/virtualization.c).

But, we do need to be aware of FEX's "rootfs" as well: inside pressure-vessel we refer to it as the "interpreter root", to make it clearer whether we're talking about the real ARM root filesystem or the emulated x86 root filesystem. Normally we would make our x86 runtime be the root directory, but because we need the root filesystem of our container to be the real ARM system to keep FEX working, instead we have to set up our x86 runtime in a subdirectory, and instruct FEX to use that subdirectory as its new rootfs, with FEX's x86 graphics drivers (which might either be real x86 graphics drivers, or x86 thunks/shims that call into ARM graphics drivers) taking the role of the host graphics drivers for the purposes of how we populate that directory. FEX intercepts open() but does not intercept mount() - which is just as well, because if it did intercept mount(), we would need to bypass that somehow in order to set up the ARM root filesystem the way we are required to do to keep FEX working! So when we want to bind-mount FEX's rootfs's /usr to a location inside our replacement rootfs, we can't just set the mount source to "/usr", because that's the ARM /usr. Instead, we need to point to the x86 /usr.

To be able to set up our x86 directory the same way we would on real x86, we need to be able to inspect and enumerate the libraries that it contains, so that we can make decisions like "is the FEX rootfs libxcb older or newer than the Steam Runtime libxcb?" - because for most libraries, we must use whichever one is newer (has more ABI) and if we selected the older one then games would crash at runtime with unresolved symbols. We do this using fd-relative I/O, so the only thing FEX needs to be able to provide us with is an open fd pointing into the rootfs. Until now, the trick we used to achieve this was to open "/.", but it seems FEX is now going to break our ability to do that.

We unfortunately can't pass an open file descriptor for the x86 /usr to mount(), because of kernel limitations (there is no mountat()), so we must use an absolute path and let the kernel resolve it - although in fact we prefer to do as much of our internal processing as possible with fd-relative I/O (openat()), and we call realpath() on /proc/self/fd/whatever to get the absolute path at the last possible moment.

@smcv

smcv commented Jan 2, 2025

> the only thing FEX needs to be able to provide us with is an open fd pointing into the rootfs. Until now, the trick we used to achieve this was to open "/.", but it seems FEX is now going to break our ability to do that

A different approach that I considered in the past was to open "/usr/..", but if FEX is doing path normalization then that will no longer work.

Or, if opening "/usr" will still give us a directory fd pointing to the FEX rootfs's /usr, perhaps we could do usr = openat(AT_FDCWD, "/usr", ...) followed by rootfs = openat(usr, "..", ...)?

Or does opening "/usr" now give us a directory fd pointing into some sort of overlayfs or similar construct at kernel level?

@smcv

smcv commented Jan 2, 2025

> it's not like they can access the RootFS in merged-RootFS mode through open() family calls and it still worked?

As much as possible, we use fd-relative I/O to access everything relative to some sort of sysroot - which might be the real root, or the FEX rootfs, or Flatpak's /run/host, or even some totally unrelated tree from which we have been told to collect graphics drivers (although we don't do that last one in production).

What is "merged-RootFS mode"? Is that a mode where FEX's rootfs is set to / (the real root), and the real root combines x86 and ARM libraries using Debian-style multiarch or something?

The problem I see with that is that on real x86, we're intentionally building a container that is ~ 90% Steam Runtime and only ~ 10% host system, so that games are insulated from host system library stack changes, and don't accidentally add dependencies that happen to work in 2025 but are likely to stop working by 2030. For example, we don't want games to be able to see the specific version of libtiff that happens to be shipped on the Steam Deck, because the Steam Deck's rolling-release operating system is subject to rapid change and will not be the same in 5 years' time: we only want them to see the long-term-stable version of libtiff that is part of the Steam Runtime.

Under FEX, we can do that equally well for the x86 world, but we don't know what libraries (and other things like ELF interpreters or Mesa drivers) are in FEX's ARM-world dependencies - and indeed those libraries are implementation details of FEX, so conceptually we shouldn't know. As a result, we don't know how to populate a merged container with FEX's ARM dependencies in addition to our carefully-curated x86 stack.

@asahilina
Contributor

asahilina commented Jan 2, 2025 via email

@smcv

smcv commented Jan 2, 2025

> how is [what pressure-vessel does] different to simply not pivoting root at all, and just switching the FEX RootFS to another directory of your choosing with the right mix of x86_64 libraries?

If we didn't actually change the root then we would have no control over /run, therefore APIs like /run/host/os-release wouldn't/couldn't work. There are probably others but /run is the main one.

And, we want pressure-vessel under FEX to be as similar as possible to pressure-vessel on real x86, partly because we need to be quite ruthless about minimizing the number of code paths (it's high-complexity code maintained by a small team and we want to minimize regressions), and partly so that we don't get games that somehow work on one but fail on the other (either way round would be undesirable, the goal is predictability).

> How does PV build the ARM chroot that it then stacks its own interpreter root within?

It assumes that the host system is something reasonably FHS-shaped, and bind-mounts "most" subdirectories of the real root - including at least /usr and related filesystems (/lib, /bin, etc. as compatibility symlinks or separate directories), and excluding the ones that it needs control over (again, notably /run).
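
A very rough sketch of that selection logic: given the top-level directory names, emit a bwrap `--bind SRC DEST` pair for each one except those the container must control. The exclusion list contains only /run because that is the only directory named above. The function names are hypothetical, and the real pressure-vessel logic (symlink handling, read-only binds, and so on) is certainly more involved.

```c
#include <stdio.h>
#include <string.h>

/* Directories the container needs to control itself (assumed list). */
static const char *const excluded[] = { "run" };

static int is_excluded(const char *name) {
    for (size_t i = 0; i < sizeof excluded / sizeof excluded[0]; i++) {
        if (strcmp(name, excluded[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Writes "--bind /NAME /NAME" for each non-excluded entry of the
 * NULL-terminated array toplevel into out. Returns the number of
 * directories shared, or -1 if out is too small. */
int build_bind_args(const char *const *toplevel, char *out, size_t out_len) {
    size_t used = 0;
    int shared = 0;
    if (out_len == 0) {
        return -1;
    }
    out[0] = '\0';
    for (size_t i = 0; toplevel[i] != NULL; i++) {
        if (is_excluded(toplevel[i])) {
            continue;
        }
        int n = snprintf(out + used, out_len - used, "%s--bind /%s /%s",
                         shared ? " " : "", toplevel[i], toplevel[i]);
        if (n < 0 || (size_t)n >= out_len - used) {
            return -1;
        }
        used += (size_t)n;
        shared++;
    }
    return shared;
}
```

For the input { "usr", "lib", "run", "bin" } this yields binds for /usr, /lib and /bin and skips /run.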

@asahilina
Contributor

asahilina commented Jan 3, 2025 via email

@smcv

smcv commented Jan 3, 2025

> But you don't need to change the root to mostly control /run. You can already mount whatever you want under .../rootfs/run and it should already take priority for most operations (at least reads).

That would result in the real /run leaking through from the host (any file or socket that exists on the real host but not in the rootfs we have prepared would still be openable in the container), which is an observable behaviour change between FEX and real x86. That isn't necessarily a showstopper in the longer term, but as I said, we want pressure-vessel under FEX to be as similar as possible to pressure-vessel on real x86.

I'd be particularly reluctant to be using bwrap (which is what actually does the pivot_root) for pressure-vessel on real x86, while no longer using bwrap for pressure-vessel on FEX, because that would mean going onto a different code path with significantly different behaviour. The more divergent the code paths are, the higher the risk of a FEX-specific regression, which is unlikely to be caught immediately by testing because our non-x86 testing bandwidth is very limited.

We could potentially build the filesystem that is passed to bwrap differently (for example passing through the whole root directory as-is, and building a new rootfs as you suggest below some tmpfs) but, again, different code paths. Re-architecting how pressure-vessel interacts with FEX is not something that we can do instantaneously, particularly if we're still required to continue to maintain compatibility with current/older FEX which does not make use of overlayfs in this way.

> although we'd have to catch a lot more syscalls including mount() to make sure it works properly

If FEX starts intercepting mount(), then pressure-vessel is going to need to be able to detect whether it is dealing with "old" or "new" FEX, to be able to construct the correct bwrap arguments for each.

> for non-security-critical use cases like PV

pressure-vessel is specifically not a security boundary: it makes no attempt to prevent games inside its container from executing arbitrary code outside (for example it doesn't filter the D-Bus session bus, or the IPC between games and the Steam client). What it aims to do is limited to minimizing the extent to which games accidentally rely on implementation details of the host system that are not long-term-compatible (for example, we don't want games to accidentally rely on the specific libtiff.so.* that the Steam Deck has in its host OS).

If Valve wants to run games in a meaningful sandbox at some future date, that would most likely have to be done on an opt-in basis, for only the subset of games that have been verified to work as intended when all the arbitrary-code-execution routes have been cut off.

> I'm not sure how stuff like Flatpak would interact with all this, but that's a whole different issue we also need to sort out (running x86_64 Flatpaks)

In general, Flatpak is a security boundary. It's somewhat simpler than pressure-vessel because it doesn't make any attempt to use the host's graphics drivers as-is inside the Flatpak sandbox, but fundamentally it works by making a purely x86 sysroot and asking the kernel to pivot into it; so if FEX relies on the sysroot containing ARM libraries, then FEX cannot work with x86 Flatpak apps.

Conversely, if FEX somehow gains the ability to "remember" an open fd pointing to the real root, and use that fd to access the FEX interpreter and its required ld.so and shared libraries, for the benefit of being able to run x86 Flatpak apps, then that mechanism should work equally well for pressure-vessel. If that's the case, then pressure-vessel running on sufficiently new FEX would be able to stop separating the interpreter root (FEX rootfs) from the ARM root (real root), and re-converge many of its code paths with pressure-vessel on real x86.

@Sonicadvance1
Member Author

Since this simple change got derailed into a more in-depth discussion, I'm closing this.
