FEXInterpreter: Punch through a /sys/fex/rootfs node #4228

Sonicadvance1 · 2024-12-20T19:57:09Z

Currently pressure-vessel does a bit of a bodge to get the x86 rootfs path when running inside of FEX. It does this by opening /. which works around our pseudo-overlayfs tracking. While this worked, it wasn't guaranteed to work forever. With #4225 working to fix more issues around how rootfs is laid out, it had to break this path (while adding a workaround for it to keep working).

Give pressure-vessel a blessed path from the EmulatedFiles code paths to get a real fd back to the x86 rootfs, that way if we break this code path it is entirely our problem to fix.

Still need to have a conversation with upstream pressure-vessel to see if they'll accept this path or if we need to do something different.

We can also use this same mechanism in the future if we want to expose more FEX specific data to the application through this.

asahilina · 2024-12-20T20:47:55Z

So the problem here is... my optimization in #4225 of overlay file lookup breaks this, since it only works with files that do already exist on the host FS.

We could split the difference though, and do the lookup once on the raw path (symlinks / cwd-relative paths / funny mount business won't work), and once after opening the backing file normally. That means direct /proc/foo or /sys/foo lookups including this one would work (and not cause any extra opens/syscalls), and opens of overlaid files through weird mounts/symlinks/relative paths/funny paths with ../whatever would only work for existing files. So as long as we declare that /sys/fex/* can only be accessed as an absolute path directly opened, we're good.

Does that sound OK? I'll update #4225 if so.
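
A minimal sketch of the first, raw-path stage of that lookup, assuming a hypothetical table of emulated nodes (`emulated_paths` and `is_emulated_raw_path` are illustrative names, not FEX's actual code). It shows why only a directly opened absolute path would hit: the comparison is byte-for-byte, with no normalization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table; /sys/fex/rootfs is the only node named in this PR. */
static const char *const emulated_paths[] = {
    "/sys/fex/rootfs",
};

/* Stage 1: exact match on the raw guest path. No symlink, cwd-relative
 * or ".." handling happens here, so those forms fall through to stage 2. */
bool is_emulated_raw_path(const char *guest_path)
{
    assert(guest_path != NULL);
    for (size_t i = 0; i < sizeof(emulated_paths) / sizeof(emulated_paths[0]); i++) {
        if (strcmp(guest_path, emulated_paths[i]) == 0)
            return true;
    }
    return false;
}
```

With this shape, `/sys/fex/rootfs/` and `/sys/../sys/fex/rootfs` both miss the table and would only be caught by the second lookup, which needs an existing backing file.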

Sonicadvance1 · 2024-12-20T20:52:55Z

> So the problem here is... my optimization in #4225 of overlay file lookup breaks this, since it only works with files that do already exist on the host FS.
>
> We could split the difference though, and do the lookup once on the raw path (symlinks / cwd-relative paths / funny mount business won't work), and once after opening the backing file normally. That means direct /proc/foo or /sys/foo lookups including this one would work (and not cause any extra opens/syscalls), and opens of overlaid files through weird mounts/symlinks/whatever would only work for existing files. So as long as we declare that /sys/fex/* can only be accessed as an absolute path directly opened, we're good.
>
> Does that sound OK? I'll update #4225 if so.

The current expectation is that they will open the file with the absolute path, like openat(AT_FDCWD, "/sys/fex/rootfs", O_RDONLY|O_DIRECTORY) or openat2. Even the current system will break if you put a trailing / onto that, since it's an exact-match check.

If we add more files in here, we expect the same behaviour, except maybe missing O_DIRECTORY in the case of a file instead of this directory redirect we're doing for this case.
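
For reference, a rough sketch of what the caller side could look like. The `/sys/fex/rootfs` node is only what this PR proposes, the helper names are hypothetical, and `O_CLOEXEC` is an addition of mine rather than part of the expected call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

/* Open a directory by absolute path, passing the string through untouched
 * (no trailing slash, no normalization). Returns an fd, or -1 on error. */
int open_dir_node(const char *abs_path)
{
    assert(abs_path != NULL && abs_path[0] == '/');
    return openat(AT_FDCWD, abs_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Hypothetical caller: -1 means the node is absent, for example when not
 * running under FEX or running under a FEX without this change. */
int open_fex_rootfs(void)
{
    return open_dir_node("/sys/fex/rootfs");
}
```

A caller would treat -1 as "no blessed node here" and fall back to whatever detection it used before.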

asahilina · 2024-12-20T20:53:17Z

Ah wait, this isn't going to work. You can't just let them open the RootFS, that's what opening / would do anyway in merged mode. What gets cleaned up is the readlink on the proc file, so it still looks like /. So what you want here is to just return the RootFS path as a string or something.

But I wonder... what's the point of this? The way pressure-vessel broke for me is just that without a RootFS detected, it assumed the host wasn't FEX. But I don't get what they need the RootFS path for, it's not like they can access the RootFS in merged-RootFS mode through open() family calls and it still worked?

Sonicadvance1 · 2024-12-20T20:57:14Z

> Ah wait, this isn't going to work. You can't just let them open the RootFS, that's what opening / would do anyway in merged mode. What gets cleaned up is the readlink on the proc file, so it still looks like /. So what you want here is to just return the RootFS path as a string or something.
>
> But I wonder... what's the point of this? The way pressure-vessel broke for me is just that without a RootFS detected, it assumed the host wasn't FEX. But I don't get what they need the RootFS path for, it's not like they can access the RootFS in merged-RootFS mode through open() family calls and it still worked?

I haven't looked too closely at how your merged rootfs mode works, but pressure-vessel bind-mounts a few paths. In FEX's case it needs the rootfs (even if that is just /) to know where to bind mount the x86 graphics libraries and glibc inside of its container. Since it pivots most everything over to the steam-linux-runtime, the only thing coming from the host environment is glibc and graphics drivers afaik.

asahilina · 2024-12-20T21:11:17Z

I need to look more closely into what they're doing and whether there's a cleaner solution here. I know about the bind mounts but I'm not sure exactly how they behave with the rootfs. I do know that without the workaround, they just disable FEX mode entirely when they can't find the rootfs, and then end up trying to do a traditional pivot_root and that just breaks everything since FEX can't run with all of its dependencies missing.

smcv · 2025-01-02T13:04:15Z

To recap, the purpose of pressure-vessel is that it sets up a container runtime in a new filesystem namespace, with application-level libraries like SDL and libjpeg taken directly from the Steam Runtime (for long-term ABI compatibility), but a small subset of libraries taken from the host system (for ability to use host x86 graphics drivers).

The problem with FEX + pressure-vessel is that FEX implements a mockup of a new filesystem namespace in user-space for x86 code, with a pseudo-overlayfs as its root (when x86 code opens a path, FEX redirects it to either the real root or FEX's "rootfs"); but pressure-vessel is genuinely entering a different filesystem namespace at kernel level ("below" FEX), which affects both ARM and x86 code. Their stacking order could be considered to be inconsistent: in many ways FEX is lower-level than pressure-vessel, in the sense that it's emulating a different CPU whereas pressure-vessel operates entirely in x86 world; but pressure-vessel is changing the filesystem layout at kernel level, which is lower-level/more fundamental than what FEX is doing in user-space. Unfortunately, I don't see a way for both modules to do their jobs without them being stacked in this order, so at least one of the two needs to be aware of implementation details of the other.

pressure-vessel certainly needs to know that FEX is there, so that it can provide the real ARM root filesystem in the root of its new filesystem namespace; otherwise, FEX will stop working inside our container, because its ARM shared library dependencies are missing (unless FEX was statically linked like qemu-x86_64-static, but I think that would cause as many problems as it would solve). If this detection was enough, we'd be able to do it without needing filesystem lookups, by querying the hypervisor ID from CPUID, the same way we detect e.g. qemu or Xen for diagnostic purposes (see steam-runtime-tools/virtualization.c).
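
As an illustration of the CPUID route mentioned above, here is a small sketch that reads the hypervisor vendor signature. It is not the code in steam-runtime-tools; `hypervisor_signature` is a hypothetical helper, and which signature any given emulator reports is not something this sketch assumes.

```c
#include <assert.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Copy the 12-byte hypervisor vendor signature into sig (NUL-terminated).
 * Returns 1 if CPUID reports a hypervisor, 0 otherwise. */
int hypervisor_signature(char sig[13])
{
    assert(sig != NULL);
    memset(sig, 0, 13);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 1, ECX bit 31 is the "hypervisor present" bit. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 31)))
        return 0;

    /* Leaf 0x40000000 returns the vendor signature in EBX, ECX, EDX. */
    __cpuid(0x40000000, eax, ebx, ecx, edx);
    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    return 1;
#else
    return 0; /* CPUID is x86-only, so a native ARM build reports nothing. */
#endif
}
```

This answers "is an emulator or hypervisor present?" without touching the filesystem, but it cannot say where the rootfs lives.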

But, we do need to be aware of FEX's "rootfs" as well: inside pressure-vessel we refer to it as the "interpreter root", to make it clearer whether we're talking about the real ARM root filesystem or the emulated x86 root filesystem. Normally we would make our x86 runtime be the root directory, but because we need the root filesystem of our container to be the real ARM system to keep FEX working, instead we have to set up our x86 runtime in a subdirectory, and instruct FEX to use that subdirectory as its new rootfs, with FEX's x86 graphics drivers (which might either be real x86 graphics drivers, or x86 thunks/shims that call into ARM graphics drivers) taking the role of the host graphics drivers for the purposes of how we populate that directory. FEX intercepts open() but does not intercept mount() - which is just as well, because if it did intercept mount(), we would need to bypass that somehow in order to set up the ARM root filesystem the way we are required to do to keep FEX working! So when we want to bind-mount FEX's rootfs's /usr to a location inside our replacement rootfs, we can't just set the mount source to "/usr", because that's the ARM /usr. Instead, we need to point to the x86 /usr.

To be able to set up our x86 directory the same way we would on real x86, we need to be able to inspect and enumerate the libraries that it contains, so that we can make decisions like "is the FEX rootfs libxcb older or newer than the Steam Runtime libxcb?" - because for most libraries, we must use whichever one is newer (has more ABI) and if we selected the older one then games would crash at runtime with unresolved symbols. We do this using fd-relative I/O, so the only thing FEX needs to be able to provide us with is an open fd pointing into the rootfs. Until now, the trick we used to achieve this was to open "/.", but it seems FEX is now going to break our ability to do that.
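
To make "fd-relative I/O" concrete, here is a sketch of enumerating entries below a sysroot fd. `count_matching` is a hypothetical helper, not pressure-vessel code, and it only counts names with a given prefix; the real version comparison is not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Count entries in rel_dir (relative to sysroot_fd) whose names start with
 * prefix, for example "libxcb.so.". Returns -1 if the directory can't be read. */
int count_matching(int sysroot_fd, const char *rel_dir, const char *prefix)
{
    assert(rel_dir != NULL && prefix != NULL && rel_dir[0] != '/');

    int dfd = openat(sysroot_fd, rel_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dfd < 0)
        return -1;

    DIR *dir = fdopendir(dfd); /* takes ownership of dfd on success */
    if (dir == NULL) {
        close(dfd);
        return -1;
    }

    int count = 0;
    size_t plen = strlen(prefix);
    const struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strncmp(ent->d_name, prefix, plen) == 0)
            count++;
    }
    closedir(dir);
    return count;
}
```

The same call works unchanged whether `sysroot_fd` points at the real root, the FEX rootfs, or some other tree, which is why a single open fd is all that is needed from FEX.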

We unfortunately can't pass an open file descriptor for the x86 /usr to mount(), because of kernel limitations (there is no mountat()), so we must use an absolute path and let the kernel resolve it - although in fact we prefer to do as much of our internal processing as possible with fd-relative I/O (openat()), and we call realpath() on /proc/self/fd/whatever to get the absolute path at the last possible moment.

smcv · 2025-01-02T13:10:21Z

> the only thing FEX needs to be able to provide us with is an open fd pointing into the rootfs. Until now, the trick we used to achieve this was to open "/.", but it seems FEX is now going to break our ability to do that

A different approach that I considered in the past was to open "/usr/..", but if FEX is doing path normalization then that will no longer work.

Or, if opening "/usr" will still give us a directory fd pointing to the FEX rootfs's /usr, perhaps we could do usr = openat(AT_FDCWD, "/usr", ...) followed by rootfs = openat(usr, "..", ...)?
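
Spelled out, that two-step open would look roughly like the sketch below. `open_parent_of` is a hypothetical helper; whether the resulting fd lands in the FEX rootfs depends on how FEX handles "/usr", which is exactly the open question here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Open dir_path, then open ".." relative to that fd. Returns the parent
 * directory fd, or -1 with errno set. */
int open_parent_of(const char *dir_path)
{
    assert(dir_path != NULL);

    int dir_fd = openat(AT_FDCWD, dir_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dir_fd < 0)
        return -1;

    int parent_fd = openat(dir_fd, "..", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    int saved_errno = errno;

    close(dir_fd);
    errno = saved_errno; /* keep the openat() error visible to the caller */
    return parent_fd;
}
```

Calling `open_parent_of("/usr")` on a native system simply yields the real root. Note that if `/usr` is a symlink, ".." is the parent of the link target.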

Or does opening "/usr" now give us a directory fd pointing into some sort of overlayfs or similar construct at kernel level?

smcv · 2025-01-02T13:23:04Z

> it's not like they can access the RootFS in merged-RootFS mode through open() family calls and it still worked?

As much as possible, we use fd-relative I/O to access everything relative to some sort of sysroot - which might be the real root, or the FEX rootfs, or Flatpak's /run/host, or even some totally unrelated tree from which we have been told to collect graphics drivers (although we don't do that last one in production).

What is "merged-RootFS mode"? Is that a mode where FEX's rootfs is set to / (the real root), and the real root combines x86 and ARM libraries using Debian-style multiarch or something?

The problem I see with that is that on real x86, we're intentionally building a container that is ~ 90% Steam Runtime and only ~ 10% host system, so that games are insulated from host system library stack changes, and don't accidentally add dependencies that happen to work in 2025 but are likely to stop working by 2030. For example, we don't want games to be able to see the specific version of libtiff that happens to be shipped on the Steam Deck, because the Steam Deck's rolling-release operating system is subject to rapid change and will not be the same in 5 years' time: we only want them to see the long-term-stable version of libtiff that is part of the Steam Runtime.

Under FEX, we can do that equally well for the x86 world, but we don't know what libraries (and other things like ELF interpreters or Mesa drivers) are in FEX's ARM-world dependencies - and indeed those libraries are implementation details of FEX, so conceptually we shouldn't know. As a result, we don't know how to populate a merged container with FEX's ARM dependencies in addition to our carefully-curated x86 stack.

asahilina · 2025-01-02T16:29:47Z

I have to say I've only recently started to be involved with the FEX project so I don't have all the history, only how we're using it in Fedora Asahi Remix to run Steam (and other stuff). With that caveat, taking a step back for a second, this is how I see things:

FEX is an interpreter that intends to run x86(_64) binaries as if they ran directly on the host system (not containerized into an x86_64 environment container). To that end, it uses a RootFS and internally implements filesystem overlay logic to present the contents of that RootFS "on top" of the real root filesystem, from the point of view of x86_64 apps. This RootFS is not intended to be a self-contained x86_64 environment, but rather just an overlay. Its main purpose is to provide x86_64 versions of libraries in place of arm64 ones. This is useful on distros *without* Debian-style multiarch. If you have multiarch then you can just install all three architectures in parallel and you don't need any RootFS at all. But if you don't have multiarch, your only options are a full-blown container, or this.

On Fedora Asahi Remix, our RootFS essentially only contains library files, a few hand-picked binaries (so launcher shell scripts mostly behave as intended in the x86_64 environment), and a couple things in /etc (ld.so stuff and alternatives for Wine).

Now, the FEX userspace overlay logic is relatively buggy and incomplete. It's good enough to load dynamic libraries, but does not know how to merge directory listings, nor can it handle paths with symlinks that cross between the real root filesystem and the RootFS. This is what currently breaks Wine on Fedora. The overlay logic could be incrementally improved to do complex path resolution (I tried that) but it increases the syscall count significantly, and it seems like a never ending chase. So I figured that instead of doing that, we could let the kernel help.

"Merged RootFS" means we use kernel overlayfs and bind mounts to build a *complete* view of the filesystem for the guest, like a chroot, with the x86_64 libraries overlaid on the arm64 root filesystem. Then FEX can just direct *all* guest accesses to this directory, and using `openat2()` with `RESOLVE_IN_ROOT` it makes root-relative symlinks work properly. Getting to the point where this is as watertight as a real `pivot_root()` from the guest POV is still a lot of work, but it seems more achievable than trying to do that on top of userspace overlay logic.

So far, what I've done is improve the FEX filesystem logic to better handle a merged RootFS. It doesn't actually turn off the overlay logic yet, but the idea is that if "everything" is accessible through the RootFS path, then it simply should never fall back to the real root at all. As part of that, I also had to fix several RootFS path leaks because the RootFS itself isn't accessible recursively, and having raw RootFS-prefixed paths leak into the guest would cause lookups using them to fail (since they are double prefixed). That's when I ran into the pressure_vessel thing.

For what it's worth, pressure_vessel *does* work today with "merged RootFS", as long as we don't break its ability to open the FEX RootFS path (if we do it just thinks it's not FEX at all and breaks). In this scenario, what PV sees as the interpreter root is a tmpfs directory with a bunch of submounts, some of which are overlayfs (usr and etc), and in fact it would be equivalent to what it sees at /. I don't know how the fact that suddenly the entire arm64 world is also visible under there affects its logic (if it does at all), but it does work in practice.

I didn't know what PV was doing about FEX. I had kind of assumed that it just sets up its x86 mount tree, and then instead of pivot_rooting (as it does natively), it would just point FEX to the new location as the RootFS so x86_64 apps see its new environment.

But from what you say, PV *does* pivot_root() with FEX, it just also sets up the tree so arm64 apps work. I'm kind of worried about that, since as you say, that's an implementation detail. How does PV build the ARM chroot that it then stacks its own interpreter root within? And, how is this different to simply not pivoting root at all, and just switching the FEX RootFS to another directory of your choosing with the right mix of x86_64 libraries?

I think I still need more info to understand what we should do, but so far I suspect the right direction is one of these:

* Improve FEX filesystem sandboxing behavior to the point you can just point the FEX RootFS to a new "x86_64 root" built however you want (the mountpoint you'd `pivot_root()` into for the native case). We could turn off the FEX filesystem merge logic to guarantee that all accesses happen relative to that root (although we'd have to catch a lot more syscalls including mount() to make sure it works properly).
* Actually support real `pivot_root()` in FEX somehow, such as by keeping around a fd to the ARM64 RootFS that FEX can use to resolve ARM64 library loads even after the root filesystem has changed. There are subtleties here and open questions around what happens when graphics drivers themselves need to open files, but I get the feeling it can be made to work. I already added logic to support hiding FEX-internal fds from guests and reject closing them, so this seems viable.

In both cases you wouldn't need to know what the original interpreter root is, since assuming FEX is doing its job properly you should just "see" the x86_64 world available at /.

Both of those have issues if the goal is security, since it's going to be very hard to prove that the guest can't escape from the container. But at least for non-security-critical use cases like PV, that's not an issue.

I'm not sure how stuff like Flatpak would interact with all this, but that's a whole different issue we also need to sort out (running x86_64 Flatpaks)...

smcv · 2025-01-02T17:30:07Z

> how is [what pressure-vessel does] different to simply not pivoting root at all, and just switching the FEX RootFS to another directory of your choosing with the right mix of x86_64 libraries?

If we didn't actually change the root then we would have no control over /run, therefore APIs like /run/host/os-release wouldn't/couldn't work. There are probably others but /run is the main one.

And, we want pressure-vessel under FEX to be as similar as possible to pressure-vessel on real x86, partly because we need to be quite ruthless about minimizing the number of code paths (it's high-complexity code maintained by a small team and we want to minimize regressions), and partly so that we don't get games that somehow work on one but fail on the other (either way round would be undesirable, the goal is predictability).

> How does PV build the ARM chroot that it then stacks its own interpreter root within?

It assumes that the host system is something reasonably FHS-shaped, and bind-mounts "most" subdirectories of the real root - including at least /usr and related filesystems (/lib, /bin, etc. as compatibility symlinks or separate directories), and excluding the ones that it needs control over (again, notably /run).

asahilina · 2025-01-03T00:41:15Z

But you don't need to change the root to mostly control /run. You can already mount whatever you want under .../rootfs/run and it should already take priority for most operations (at least reads). In the endgame for merged RootFS mode, FEX would direct all accesses including writes to the RootFS only. Would that work?

That said, I have some evil ideas to make FEX work with real pivot_root() and no special handling for the ARM side, though it will involve a kernel patch. Maybe if that works out and upstream likes it, that's the best way to go...

smcv · 2025-01-03T12:24:10Z

> But you don't need to change the root to mostly control /run. You can already mount whatever you want under .../rootfs/run and it should already take priority for most operations (at least reads).

That would result in the real /run leaking through from the host (any file or socket that exists on the real host but not in the rootfs we have prepared would still be openable in the container), which is an observable behaviour change between FEX and real x86. That isn't necessarily a showstopper in the longer term, but as I said, we want pressure-vessel under FEX to be as similar as possible to pressure-vessel on real x86.

I'd be particularly reluctant to be using bwrap (which is what actually does the pivot_root) for pressure-vessel on real x86, while no longer using bwrap for pressure-vessel on FEX, because that would mean going onto a different code path with significantly different behaviour. The more divergent the code paths are, the higher the risk of a FEX-specific regression, which is unlikely to be caught immediately by testing because our non-x86 testing bandwidth is very limited.

We could potentially build the filesystem that is passed to bwrap differently (for example passing through the whole root directory as-is, and building a new rootfs as you suggest below some tmpfs) but, again, different code paths. Re-architecting how pressure-vessel interacts with FEX is not something that we can do instantaneously, particularly if we're still required to continue to maintain compatibility with current/older FEX which does not make use of overlayfs in this way.

> although we'd have to catch a lot more syscalls including mount() to make sure it works properly

If FEX starts intercepting mount(), then pressure-vessel is going to need to be able to detect whether it is dealing with "old" or "new" FEX, to be able to construct the correct bwrap arguments for each.

> for non-security-critical use cases like PV

pressure-vessel is specifically not a security boundary: it makes no attempt to prevent games inside its container from executing arbitrary code outside (for example it doesn't filter the D-Bus session bus, or the IPC between games and the Steam client). What it aims to do is limited to minimizing the extent to which games accidentally rely on implementation details of the host system that are not long-term-compatible (for example, we don't want games to accidentally rely on the specific libtiff.so.* that the Steam Deck has in its host OS).

If Valve wants to run games in a meaningful sandbox at some future date, that would most likely have to be done on an opt-in basis, for only the subset of games that have been verified to work as intended when all the arbitrary-code-execution routes have been cut off.

> I'm not sure how stuff like Flatpak would interact with all this, but that's a whole different issue we also need to sort out (running x86_64 Flatpaks)

In general, Flatpak is a security boundary. It's somewhat simpler than pressure-vessel because it doesn't make any attempt to use the host's graphics drivers as-is inside the Flatpak sandbox, but fundamentally it works by making a purely x86 sysroot and asking the kernel to pivot into it; so if FEX relies on the sysroot containing ARM libraries, then FEX cannot work with x86 Flatpak apps.

Conversely, if FEX somehow gains the ability to "remember" an open fd pointing to the real root, and use that fd to access the FEX interpreter and its required ld.so and shared libraries, for the benefit of being able to run x86 Flatpak apps, then that mechanism should work equally well for pressure-vessel. If that's the case, then pressure-vessel running on sufficiently new FEX would be able to stop separating the interpreter root (FEX rootfs) from the ARM root (real root), and re-converge many of its code paths with pressure-vessel on real x86.

Sonicadvance1 · 2025-01-22T18:36:08Z

Since this simple change got derailed by a more in-depth discussion, I'm closing this.

Sonicadvance1 force-pushed the sysfs_fex_node branch from 67dbe4b to e93641e · January 1, 2025 16:21

Sonicadvance1 closed this Jan 22, 2025
