-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support a merged RootFS (and a bunch of related fixes) #4225
Conversation
To locate whether a path is in the emulated list, EmulatedFDManager::OpenAt() attemps to resolve the path. realpath() ends up calling readlinkat() on every path component, which is a lot of syscalls for every open() variant syscall. It also makes interaction with the rootfs complex and error-prone. There's a much easier way to do this: We just open the file without emulation and check its real path via get_fdpath(). This is just one readlink() syscall per open, instead of one per path component. If the file turns out to be emulated (uncommon case), we swap out the fds. This also decouples EmulatedFDManager from guest path resolution entirely, so it will never fall out of sync with the RootFS logic.
If the guest reads a RootFS path from /proc/self/fd/*, we should return it with the RootFS prefix stripped.
If there's a symlink to / within the RootFS, don't attempt to follow it, since that will end up trying to look up the empty string within the RootFS (which is not legal). Just return the symlink.
If a RootFS symlink links to an absolute path within the RootFS, we need to strip the RootFS prefix. This would not normally happen with a plain RootFS, but it can happen if /proc is mounted within the RootFS.
With a merged RootFS, all binaries are executed through the RootFS. When executing a binary that is actually a native binary, we want to do so outside the RootFS. Handle this by stripping the RootFS prefix in that case.
- Parse the shebang line properly (use FHU::ParseArgumentsFromString which is the same code the loader uses) - Make native-interpreter shebang files work by deferring to the kernel in that case (previously, they'd get executed through the loader and it would choke on the architecture of the interpreter) - Do not use the RootFS-prepended path when executing shebang files. The loader will prepend that anyway when looking it up, but it needs the bare guest path so it can pass it as an argument to the interpreter, which (since it's emulated) will do the lookup through the RootFS.
The wrappers handle errno, we just need to return -1 on errors.
0a3228a
to
6862e72
Compare
Awesome, thanks for having continued to look into this!
Just curious, can an equivalent RootFS setup be achieved with vanilla OverlayFS or are there muvm-specific bits to this approach? My understanding was that normal OverlayFS has restrictions that make it infeasible for FEX RootFS management in general, though I don't remember if this was just an issue of lack of multiarch/lib in most Linux distributions or something more fundamental. |
There's nothing special about muvm, the same setup can be achieved on any system. The main limitation is that overlayfs is fundamentally read-only the way we intend to use it here, so you have to partition the file hierarchy into subtrees that are overlaid (and therefore cannot be written through) and subtrees that are bind-mounted straight through. overlayfs also can't handle sub-mounts in its layers.
In addition, overlayfs nominally requires that the layers not mutate out from under it, so technically making any changes to the overlaid subtrees of the real host filesystem is also "UB" (constrained, won't crash your kernel). Realistically I don't think this will cause any major issues other than "maybe if you add files to the host root you won't see them in FEX until a remount", which is probably fine. In our setup, as long as /etc and /usr don't have any submounts (they don't in any default Fedora installs), and as long as they don't mutate while For reference, this is the mount tree that
All the sub-mounts (e.g. /dev/shm) are automatically handled since we bind-mount with recursion. |
Interesting that all those tests failed... they pass on my machine. Any idea how I can replicate the testing environment? Different RootFS, I guess? |
The rootfs that the builders use is an Ubuntu 22.04 image that has been used for about two years. |
I mean, can I grab it somewhere? |
Wait, this isn't just an ancient kernel issue right? openat2() requires 5.6. |
Some of the devices are running ancient kernels. 5.0 is our minspec. |
Oof... then I need to gate all the changes on kernel version and revert to the old more broken behavior if it's too old... ;; |
https://drive.google.com/file/d/18vz4L2yMkGrI6ACiui-K2-MKHxYstCjs/view?usp=sharing For the rootfs image that I built for the builders. |
Sounds like now's a good time to bump our minspec then, 5.0 predates panfrost.ko getting merged! |
I added a quick gate (turned out to be not too horrible). That does also mean that the CI isn't testing the new approach, though... |
One thing worth considering: Now that we have a merged RootFS, would it actually be practical to pivot_root into it? Obviously FEX still needs to be able to access files outside the RootFS, but that could be done by keeping an fd around (and pivoting back before execing a native binary or FEXInterpreter again). With some care FEX might even be able to claim actual robust isolation, as long as it can prevent that fd from leaking into the guest. Perhaps this is the right direction to go in for stuff like Flatpak to work? |
Okay, that makes more sense. So the ARMv8.0 runner was on an old kernel, and it looks like I broke thunks on the new ones with the new code. Now I need to figure out how to build & test with thunks... ^^;; |
I'll see what we can do about raising our kernel version minspec. Just need to check our CI system to see what needs updated, or how hard it will be to update them. |
Took a look at our CI systems with their kernel use.
So we could update our minspec to 6.0 /relatively/ easily, I just need to do it. I could fit that in to this weekend or next week. |
Here it mentions 4.17 though: FEX/Source/Tools/FEXLoader/FEXLoader.cpp Line 444 in e44d1f1
|
Yea, 4.17 was where we exactly knew problems were going to happen because of old Jetsons, so se decided to put a spotlight on it. If anything when we raise the minspec for real, we can just change 4.17 to 6.0 or whatever. |
This avoids having to do the symlink chasing in GetEmulatedFDPath, since the kernel does it for us. On top of that, with a merged RootFS setup, this will correctly handle symlinks from user directories into the RootFS, fixing wine on Fedora.
01d56e0
to
3fe2650
Compare
OK, I think I fixed thunks now. It was painful to get enough of the FEXLinuxTests test suite and thunk stuff built, without a working x86_64 compiler on Fedora... |
Currently pressure-vessel does a bit of a bodge to get the x86 rootfs path when running inside of FEX. It does this by opening `/.` which works around our pseudo-overlayfs tracking. While this worked, it wasn't guaranteed to work forever. With FEX-Emu#4225 working to fix more issues around how rootfs is laid out, it had to break this path (while adding a workaround for it to keep working). Give pressure-vessel a blessed path from the EmulatedFiles code paths to get a real fd back to the x86 rootfs, that way if we break this code path it is entirely our problem to fix. Still need to have a conversation with upstream pressure-vessel to see if they'll accept this path or if we need to do something different.
Currently pressure-vessel does a bit of a bodge to get the x86 rootfs path when running inside of FEX. It does this by opening `/.` which works around our pseudo-overlayfs tracking. While this worked, it wasn't guaranteed to work forever. With FEX-Emu#4225 working to fix more issues around how rootfs is laid out, it had to break this path (while adding a workaround for it to keep working). Give pressure-vessel a blessed path from the EmulatedFiles code paths to get a real fd back to the x86 rootfs, that way if we break this code path it is entirely our problem to fix. Still need to have a conversation with upstream pressure-vessel to see if they'll accept this path or if we need to do something different. We can also use this same mechanism in the future if we want to expose more FEX specific data to the application through this.
Currently pressure-vessel does a bit of a bodge to get the x86 rootfs path when running inside of FEX. It does this by opening `/.` which works around our pseudo-overlayfs tracking. While this worked, it wasn't guaranteed to work forever. With FEX-Emu#4225 working to fix more issues around how rootfs is laid out, it had to break this path (while adding a workaround for it to keep working). Give pressure-vessel a blessed path from the EmulatedFiles code paths to get a real fd back to the x86 rootfs, that way if we break this code path it is entirely our problem to fix. Still need to have a conversation with upstream pressure-vessel to see if they'll accept this path or if we need to do something different. We can also use this same mechanism in the future if we want to expose more FEX specific data to the application through this.
Brought up in FEX-Emu#4225 where it had issues with Openat2 which was added in 5.8. The main driving force around minimum kernel version requirement is that the lowest kernel version in our CI is 5.15. A benefit to this choice is that this is an LTS release, which is also what Ubuntu 22.04 is shipping. Once the single CI machine is fixed to ship something newer then the next logical choice would be kernel 6.1 which is also LTS, but until then just lift it to 5.15. This version was released in October 2021, and is supported by the kernel developers until 2026. Our previous minimum of 5.0 was released in March 2019, so a two year leap here. This removes the openat2 workaround that was necessary to pass our CI since it is no longer necessary.
Brought up in FEX-Emu#4225 where it had issues with Openat2 which was added in 5.8. The main driving force around minimum kernel version requirement is that the lowest kernel version in our CI is 5.15. A benefit to this choice is that this is an LTS release, which is also what Ubuntu 22.04 is shipping. Once the single CI machine is fixed to ship something newer then the next logical choice would be kernel 6.1 which is also LTS, but until then just lift it to 5.15. This version was released in October 2021, and is supported by the kernel developers until 2026. Our previous minimum of 5.0 was released in March 2019, so a two year leap here. This removes the openat2 workaround that was necessary to pass our CI since it is no longer necessary.
Brought up in FEX-Emu#4225 where it had issues with Openat2 which was added in 5.8. The main driving force around minimum kernel version requirement is that the lowest kernel version in our CI is 5.15. A benefit to this choice is that this is an LTS release, which is also what Ubuntu 22.04 is shipping. Once the single CI machine is fixed to ship something newer then the next logical choice would be kernel 6.1 which is also LTS, but until then just lift it to 5.15. This version was released in October 2021, and is supported by the kernel developers until 2026. Our previous minimum of 5.0 was released in March 2019, so a two year leap here. This removes the openat2 workaround that was necessary to pass our CI since it is no longer necessary.
Since it seems likely it would be affected: Has library forwarding functionality been verified to still work after this change? |
Yes, that's what I fixed after CI caught it (it was a missed case, look for the `AT_FDCWD` tests in the last commit).
|
Great, thanks! Wasn't sure from memory if this had just been a build failure or not, but good to know it's taken care of. Also, happy new year of course :) |
This is a new RootFS approach where the entire filesystem hierarchy is bind-mounted under the RootFS directory, with the RootFS itself overlaid. This, in principle, lets FEX avoid all the RootFS merge logic and lets the kernel do it, which should be both faster and more correct. In particular, symlinks between user directories and the RootFS work, and listing a directory correctly returns the merged entries from the RootFS and the host FS, instead of only the RootFS ones.
This PR includes fixes to make a merged-RootFS configuration work (as well as some misc adjacent issues), but does not add an explicit configuration option to disable the filesystem overlay logic in this scenario. That may come later, but it should only be an optimization at that point and may or may not be worth it.
The merged RootFS scenario stress-tests the FEX path handling, since in principle everything is opened through the RootFS and therefore there are more opportunities for the RootFS path to leak to the guest or end up in the wrong place.
Unlike #3831, this PR should improve performance, since it removes system calls in a hot path.
open()
/openat()
/openat2()
go from (in the common case):readlink()
* O(directory components) (fromrealpath()
in emufd)fstatat()
(fromGetEmulatedFDPath()
, and possibly more if it's a symlink)openat2()
itselfTo:
openat2()
for the RootFSopenat2()
for the hostreadlink()
to check for emu filesTested with Steam and some games under Proton and Portal 2, both with and without the merged-RootFS setup.
When used with a merged RootFS, this finally fixes Wine. This supersedes #3831 only in the merged-rootfs case, so the question remains whether the non-merged resolution should be fixed to handle the problem or not...
Note that the fixes here are not comprehensive (proper symlink resolution behavior is limited to
open()
family syscalls, not any others), but it's enough to fix Wine and shouldn't regress any cases.