FEXInterpreter: Punch through a /sys/fex/rootfs node #4228
So the problem here is... my optimization in #4225 of overlay file lookup breaks this, since it only works with files that do already exist on the host FS. We could split the difference though, and do the lookup once on the raw path (symlinks / cwd-relative paths / funny mount business won't work), and once after opening the backing file normally. That means direct Does that sound OK? I'll update #4225 if so. |
The current expectation is that they will open the file with the absolute path like If we add more files in here, we expect the same behaviour, except maybe missing |
Ah wait, this isn't going to work. You can't just let them open the RootFS, that's what opening / would do anyway in merged mode. What gets cleaned up is the But I wonder... what's the point of this? The way pressure-vessel broke for me is just that without a RootFS detected, it assumed the host wasn't FEX. But I don't get what they need the RootFS path for, it's not like they can access the RootFS in merged-RootFS mode through |
I haven't looked too closely at how your merged rootfs mode works, but pressure-vessel bind-mounts a few paths. In FEX's case it needs the rootfs (even if that is just /) to know where to bind mount the x86 graphics libraries and glibc inside of its container. Since it pivots most everything over to the steam-linux-runtime, the only thing coming from the host environment is glibc and graphics drivers afaik. |
I need to look more closely into what they're doing and whether there's a cleaner solution here. I know about the bind mounts but I'm not sure exactly how they behave with the rootfs. I do know that without the workaround, they just disable FEX mode entirely when they can't find the rootfs, and then end up trying to do a traditional pivot_root and that just breaks everything since FEX can't run with all of its dependencies missing. |
Currently pressure-vessel does a bit of a bodge to get the x86 rootfs path when running inside of FEX. It does this by opening `/.` which works around our pseudo-overlayfs tracking. While this worked, it wasn't guaranteed to work forever. With FEX-Emu#4225 working to fix more issues around how rootfs is laid out, it had to break this path (while adding a workaround for it to keep working). Give pressure-vessel a blessed path from the EmulatedFiles code paths to get a real fd back to the x86 rootfs, that way if we break this code path it is entirely our problem to fix. Still need to have a conversation with upstream pressure-vessel to see if they'll accept this path or if we need to do something different. We can also use this same mechanism in the future if we want to expose more FEX specific data to the application through this.
To recap, the purpose of pressure-vessel is that it sets up a container runtime in a new filesystem namespace, with application-level libraries like SDL and libjpeg taken directly from the Steam Runtime (for long-term ABI compatibility), but a small subset of libraries taken from the host system (for ability to use host x86 graphics drivers). The problem with FEX + pressure-vessel is that FEX implements a mockup of a new filesystem namespace in user-space for x86 code, with a pseudo-overlayfs as its root (when x86 code opens a path, FEX redirects it to either the real root or FEX's "rootfs"); but pressure-vessel is genuinely entering a different filesystem namespace at kernel level ("below" FEX), which affects both ARM and x86 code. Their stacking order could be considered to be inconsistent: in many ways FEX is lower-level than pressure-vessel, in the sense that it's emulating a different CPU whereas pressure-vessel operates entirely in x86 world; but pressure-vessel is changing the filesystem layout at kernel level, which is lower-level/more fundamental than what FEX is doing in user-space. Unfortunately, I don't see a way for both modules to do their jobs without them being stacked in this order, so at least one of the two needs to be aware of implementation details of the other. pressure-vessel certainly needs to know that FEX is there, so that it can provide the real ARM root filesystem in the root of its new filesystem namespace; otherwise, FEX will stop working inside our container, because its ARM shared library dependencies are missing (unless FEX was statically linked like But, we do need to be aware of FEX's "rootfs" as well: inside pressure-vessel we refer to it as the "interpreter root", to make it clearer whether we're talking about the real ARM root filesystem or the emulated x86 root filesystem. 
Normally we would make our x86 runtime be the root directory, but because we need the root filesystem of our container to be the real ARM system to keep FEX working, instead we have to set up our x86 runtime in a subdirectory, and instruct FEX to use that subdirectory as its new rootfs, with FEX's x86 graphics drivers (which might either be real x86 graphics drivers, or x86 thunks/shims that call into ARM graphics drivers) taking the role of the host graphics drivers for the purposes of how we populate that directory. FEX intercepts To be able to set up our x86 directory the same way we would on real x86, we need to be able to inspect and enumerate the libraries that it contains, so that we can make decisions like "is the FEX rootfs libxcb older or newer than the Steam Runtime libxcb?" - because for most libraries, we must use whichever one is newer (has more ABI) and if we selected the older one then games would crash at runtime with unresolved symbols. We do this using fd-relative I/O, so the only thing FEX needs to be able to provide us with is an open fd pointing into the rootfs. Until now, the trick we used to achieve this was to open We unfortunately can't pass an open file descriptor for the x86 |
A different approach that I considered in the past was to open Or, if opening Or does opening |
As much as possible, we use fd-relative I/O to access everything relative to some sort of sysroot - which might be the real root, or the FEX rootfs, or Flatpak's What is "merged-RootFS mode"? Is that a mode where FEX's rootfs is set to The problem I see with that is that on real x86, we're intentionally building a container that is ~ 90% Steam Runtime and only ~ 10% host system, so that games are insulated from host system library stack changes, and don't accidentally add dependencies that happen to work in 2025 but are likely to stop working by 2030. For example, we don't want games to be able to see the specific version of Under FEX, we can do that equally well for the x86 world, but we don't know what libraries (and other things like ELF interpreters or Mesa drivers) are in FEX's ARM-world dependencies - and indeed those libraries are implementation details of FEX, so conceptually we shouldn't know. As a result, we don't know how to populate a merged container with FEX's ARM dependencies in addition to our carefully-curated x86 stack. |
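To make the fd-relative I/O idea above concrete, here is a minimal Python sketch. It is not pressure-vessel's code: the helper names `open_sysroot` and `list_libraries`, the `usr/lib` default, and the ".so in the name" filter are all assumptions made for illustration. The only point it shows is that once a directory fd for a sysroot is held, everything else can be looked up relative to that fd without ever building an absolute path.

```python
import os
import tempfile


def open_sysroot(path):
    """Open a directory fd that later lookups are resolved against."""
    return os.open(path, os.O_RDONLY | os.O_DIRECTORY | os.O_CLOEXEC)


def list_libraries(sysroot_fd, libdir="usr/lib"):
    """List shared-library names under libdir, relative to sysroot_fd.

    libdir must be a relative path: an absolute path would make the
    kernel ignore dir_fd and resolve against the real root instead.
    """
    dir_fd = os.open(libdir, os.O_RDONLY | os.O_DIRECTORY | os.O_CLOEXEC,
                     dir_fd=sysroot_fd)
    try:
        # Hypothetical filter: treat any name containing ".so" as a library.
        return sorted(name for name in os.listdir(dir_fd) if ".so" in name)
    finally:
        os.close(dir_fd)


def demo():
    """Build a throwaway sysroot and enumerate it through an fd."""
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "usr", "lib"))
        for name in ("libxcb.so.1.1.0", "README"):
            open(os.path.join(tmp, "usr", "lib", name), "w").close()
        sysroot_fd = open_sysroot(tmp)
        try:
            return list_libraries(sysroot_fd)
        finally:
            os.close(sysroot_fd)
```

With this shape, the same `list_libraries` call works whether the fd refers to the real root, a FEX rootfs, or any other sysroot; only the code that obtains the fd needs to know which one it is.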
I have to say I've only recently started to be involved with the FEX project so I don't have all the history, only how we're using it in Fedora Asahi Remix to run Steam (and other stuff). With that caveat, taking a step back for a second, this is how I see things:
FEX is an interpreter that intends to run x86(_64) binaries as if they ran directly on the host system (not containerized into an x86_64 environment container). To that end, it uses a RootFS and internally implements filesystem overlay logic to present the contents of that RootFS "on top" of the real root filesystem, from the point of view of x86_64 apps.
This RootFS is not intended to be a self-contained x86_64 environment, but rather just an overlay. Its main purpose is to provide x86_64 versions of libraries in place of arm64 ones. This is useful on distros *without* Debian-style multiarch. If you have multiarch then you can just install all three architectures in parallel and you don't need any RootFS at all. But if you don't have multiarch, your only options are a full-blown container, or this.
On Fedora Asahi Remix, our RootFS essentially only contains library files, a few hand-picked binaries (so launcher shell scripts mostly behave as intended in the x86_64 environment), and a couple things in /etc (ld.so stuff and alternatives for Wine).
Now, the FEX userspace overlay logic is relatively buggy and incomplete. It's good enough to load dynamic libraries, but does not know how to merge directory listings, nor can it handle paths with symlinks that cross between the real root filesystem and the RootFS. This is what currently breaks Wine on Fedora.
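As a rough illustration of the userspace overlay idea described above, the following Python sketch redirects a guest path into the RootFS when the file exists there and otherwise falls back to the real root. This is a deliberately naive model, not FEX's implementation: the function name `overlay_lookup` is hypothetical, and the sketch shares the limitations mentioned above (it does not merge directory listings, and it does not follow symlinks that cross between the two trees).

```python
import os
import tempfile


def overlay_lookup(rootfs, guest_path):
    """Return the host path a guest path would be redirected to.

    Simplified model: prefer the copy inside rootfs if one exists,
    otherwise fall through to the same path on the real root.
    """
    candidate = os.path.join(rootfs, guest_path.lstrip("/"))
    if os.path.lexists(candidate):
        return candidate
    return guest_path


def demo():
    """Show one path served from the rootfs and one that falls through."""
    with tempfile.TemporaryDirectory() as rootfs:
        libdir = os.path.join(rootfs, "usr", "lib")
        os.makedirs(libdir)
        open(os.path.join(libdir, "libfoo.so"), "w").close()
        hit = overlay_lookup(rootfs, "/usr/lib/libfoo.so")
        miss = overlay_lookup(rootfs, "/no-such-dir/no-such-file")
        return hit == os.path.join(libdir, "libfoo.so"), miss
```

Because the decision is made per path string, a symlink inside the RootFS that points at an absolute path is later resolved by the kernel against the real root, which is one way the cross-tree symlink problem shows up.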
The overlay logic could be incrementally improved to do complex path resolution (I tried that) but it increases the syscall count significantly, and it seems like a never ending chase. So I figured that instead of doing that, we could let the kernel help. "Merged RootFS" means we use kernel overlayfs and bind mounts to build a *complete* view of the filesystem for the guest, like a chroot, with the x86_64 libraries overlaid on the arm64 root filesystem. Then FEX can just direct *all* guest accesses to this directory, and using `openat2()` with `RESOLVE_IN_ROOT` it makes root-relative symlinks work properly. Getting to the point where this is as watertight as a real `pivot_root()` from the guest POV is still a lot of work, but it seems more achievable than trying to do that on top of userspace overlay logic.
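For readers unfamiliar with `openat2()` and `RESOLVE_IN_ROOT`, here is a small Linux-only Python sketch of the behaviour being relied on. Python's `os` module has no `openat2` wrapper, so this calls the raw syscall through `ctypes`; the syscall number 437 and the flag value 0x10 are taken from the kernel headers, while the helper name `open_in_root` and the demo layout are assumptions for illustration and not FEX code. It needs Linux 5.6 or newer and may fail with `ENOSYS` or `EPERM` on older kernels or under restrictive seccomp filters.

```python
import ctypes
import errno
import os
import tempfile

SYS_OPENAT2 = 437       # openat2 syscall number (same on x86_64 and aarch64)
RESOLVE_IN_ROOT = 0x10  # from <linux/openat2.h>


class OpenHow(ctypes.Structure):
    """Mirror of the kernel's struct open_how."""
    _fields_ = [("flags", ctypes.c_uint64),
                ("mode", ctypes.c_uint64),
                ("resolve", ctypes.c_uint64)]


_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def open_in_root(root_fd, path, flags=os.O_RDONLY):
    """Open path as if root_fd were "/", including for absolute symlinks."""
    how = OpenHow(flags=flags | os.O_CLOEXEC, mode=0, resolve=RESOLVE_IN_ROOT)
    fd = _libc.syscall(ctypes.c_long(SYS_OPENAT2), ctypes.c_int(root_fd),
                       ctypes.c_char_p(os.fsencode(path)), ctypes.byref(how),
                       ctypes.c_size_t(ctypes.sizeof(how)))
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    return fd


def demo():
    """An absolute symlink inside the root resolves inside the root."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "real"))
        with open(os.path.join(root, "real", "target.txt"), "w") as f:
            f.write("guest view")
        # Absolute target: a plain open() would look at the host's /real.
        os.symlink("/real/target.txt", os.path.join(root, "link"))
        root_fd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)
        try:
            with os.fdopen(open_in_root(root_fd, "/link")) as f:
                return f.read()
        finally:
            os.close(root_fd)
```

The symlink target `/real/target.txt` is interpreted relative to `root_fd` rather than the host root, which is the property that lets root-relative symlinks in a merged RootFS behave as the guest expects.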
So far, what I've done is improve the FEX filesystem logic to better handle a merged RootFS. It doesn't actually turn off the overlay logic yet, but the idea is that if "everything" is accessible through the RootFS path, then it simply should never fall back to the real root at all. As part of that, I also had to fix several RootFS path leaks because the RootFS itself isn't accessible recursively, and having raw RootFS-prefixed paths leak into the guest would cause lookups using them to fail (since they are double prefixed). That's when I ran into the pressure-vessel thing.
For what it's worth, pressure-vessel *does* work today with "merged RootFS", as long as we don't break its ability to open the FEX RootFS path (if we do, it just thinks it's not FEX at all and breaks). In this scenario, what PV sees as the interpreter root is a tmpfs directory with a bunch of submounts, some of which are overlayfs (usr and etc), and in fact it would be equivalent to what it sees at /. I don't know how the fact that suddenly the entire arm64 world is also visible under there affects its logic (if it does at all), but it does work in practice.
I didn't know what PV was doing about FEX. I had kind of assumed that it just sets up its x86 mount tree, and then instead of pivot_rooting (as it does natively), it would just point FEX to the new location as the RootFS so x86_64 apps see its new environment. But from what you say, PV *does* pivot_root() with FEX, it just also sets up the tree so arm64 apps work.
I'm kind of worried about that, since as you say, that's an implementation detail. How does PV build the ARM chroot that it then stacks its own interpreter root within? And, how is this different to simply not pivoting root at all, and just switching the FEX RootFS to another directory of your choosing with the right mix of x86_64 libraries?
I think I still need more info to understand what we should do, but so far I suspect the right direction is one of these:
* Improve FEX filesystem sandboxing behavior to the point you can just point the FEX RootFS to a new "x86_64 root" built however you want (the mountpoint you'd `pivot_root()` into for the native case). We could turn off the FEX filesystem merge logic to guarantee that all accesses happen relative to that root (although we'd have to catch a lot more syscalls including mount() to make sure it works properly).
* Actually support real `pivot_root()` in FEX somehow, such as by keeping around a fd to the ARM64 RootFS that FEX can use to resolve ARM64 library loads even after the root filesystem has changed. There are subtleties here and open questions around what happens when graphics drivers themselves need to open files, but I get the feeling it can be made to work. I already added logic to support hiding FEX-internal fds from guests and reject closing them, so this seems viable.
In both cases you wouldn't need to know what the original interpreter root is, since assuming FEX is doing its job properly you should just "see" the x86_64 world available at /.
Both of those have issues if the goal is security, since it's going to be very hard to prove that the guest can't escape from the container. But at least for non-security-critical use cases like PV, that's not an issue. I'm not sure how stuff like Flatpak would interact with all this, but that's a whole different issue we also need to sort out (running x86_64 Flatpaks)...
If we didn't actually change the root then we would have no control over And, we want pressure-vessel under FEX to be as similar as possible to pressure-vessel on real x86, partly because we need to be quite ruthless about minimizing the number of code paths (it's high-complexity code maintained by a small team and we want to minimize regressions), and partly so that we don't get games that somehow work on one but fail on the other (either way round would be undesirable, the goal is predictability).
It assumes that the host system is something reasonably FHS-shaped, and bind-mounts "most" subdirectories of the real root - including at least |
But you don't need to change the root to mostly control /run. You can already mount whatever you want under .../rootfs/run and it should already take priority for most operations (at least reads). In the endgame for merged RootFS mode, FEX would direct all accesses including writes to the RootFS only. Would that work?
That said, I have some evil ideas to make FEX work with real pivot_root() and no special handling for the ARM side, though it will involve a kernel patch. Maybe if that works out and upstream likes it, that's the best way to go...
That would result in the real I'd be particularly reluctant to be using We could potentially build the filesystem that is passed to
If FEX starts intercepting
pressure-vessel is specifically not a security boundary: it makes no attempt to prevent games inside its container from executing arbitrary code outside (for example it doesn't filter the D-Bus session bus, or the IPC between games and the Steam client). What it aims to do is limited to minimizing the extent to which games accidentally rely on implementation details of the host system that are not long-term-compatible (for example, we don't want games to accidentally rely on the specific If Valve wants to run games in a meaningful sandbox at some future date, that would most likely have to be done on an opt-in basis, for only the subset of games that have been verified to work as intended when all the arbitrary-code-execution routes have been cut off.
In general, Flatpak is a security boundary. It's somewhat simpler than pressure-vessel because it doesn't make any attempt to use the host's graphics drivers as-is inside the Flatpak sandbox, but fundamentally it works by making a purely x86 sysroot and asking the kernel to pivot into it; so if FEX relies on the sysroot containing ARM libraries, then FEX cannot work with x86 Flatpak apps. Conversely, if FEX somehow gains the ability to "remember" an open fd pointing to the real root, and use that fd to access the FEX interpreter and its required ld.so and shared libraries, for the benefit of being able to run x86 Flatpak apps, then that mechanism should work equally well for pressure-vessel. If that's the case, then pressure-vessel running on sufficiently new FEX would be able to stop separating the interpreter root (FEX rootfs) from the ARM root (real root), and re-converge many of its code paths with pressure-vessel on real x86. |
Since this simple change got derailed by a more in-depth discussion, I'm closing this. |