Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support a merged RootFS (and a bunch of related fixes) #4225

Merged
merged 9 commits into from
Jan 1, 2025

Conversation

asahilina
Copy link
Contributor

@asahilina asahilina commented Dec 19, 2024

This is a new RootFS approach where the entire filesystem hierarchy is bind-mounted under the RootFS directory, with the RootFS itself overlaid. This, in principle, lets FEX avoid all the RootFS merge logic and lets the kernel do it, which should be both faster and more correct. In particular, symlinks between user directories and the RootFS work, and listing a directory correctly returns the merged entries from the RootFS and the host FS, instead of only the RootFS ones.

This PR includes fixes to make a merged-RootFS configuration work (as well as some misc adjacent issues), but does not add an explicit configuration option to disable the filesystem overlay logic in this scenario. That may come later, but it should only be an optimization at that point and may or may not be worth it.

The merged RootFS scenario stress-tests the FEX path handling, since in principle everything is opened through the RootFS and therefore there are more opportunities for the RootFS path to leak to the guest or end up in the wrong place.

Unlike #3831, this PR should improve performance, since it removes system calls in a hot path. open()/openat()/openat2() go from (in the common case):

  • readlink() * O(directory components) (from realpath() in emufd)
  • fstatat() (from GetEmulatedFDPath(), and possibly more if it's a symlink)
  • The openat2() itself

To:

  • One openat2() for the RootFS
  • If not found in RootFS, one openat2() for the host
  • One single readlink() to check for emu files

Tested with Steam and some games under Proton and Portal 2, both with and without the merged-RootFS setup.

When used with a merged RootFS, this finally fixes Wine. This supersedes #3831 only in the merged-rootfs case, so the question remains whether the non-merged resolution should be fixed to handle the problem or not...

Note that the fixes here are not comprehensive (proper symlink resolution behavior is limited to open() family syscalls, not any others), but it's enough to fix Wine and shouldn't regress any cases.

To locate whether a path is in the emulated list, EmulatedFDManager::OpenAt()
attemps to resolve the path. realpath() ends up calling readlinkat() on
every path component, which is a lot of syscalls for every open()
variant syscall. It also makes interaction with the rootfs complex and
error-prone.

There's a much easier way to do this: We just open the file without
emulation and check its real path via get_fdpath(). This is just one
readlink() syscall per open, instead of one per path component. If the
file turns out to be emulated (uncommon case), we swap out the fds.

This also decouples EmulatedFDManager from guest path resolution
entirely, so it will never fall out of sync with the RootFS logic.
If the guest reads a RootFS path from /proc/self/fd/*, we should return
it with the RootFS prefix stripped.
If there's a symlink to / within the RootFS, don't attempt to follow it,
since that will end up trying to look up the empty string within the
RootFS (which is not legal). Just return the symlink.
If a RootFS symlink links to an absolute path within the RootFS, we need
to strip the RootFS prefix. This would not normally happen with a plain
RootFS, but it can happen if /proc is mounted within the RootFS.
With a merged RootFS, all binaries are executed through the RootFS. When
executing a binary that is actually a native binary, we want to do so
outside the RootFS. Handle this by stripping the RootFS prefix in that
case.
@asahilina asahilina changed the title Support a merged RootFS Support a merged RootFS (and a bunch of related fixes) Dec 19, 2024
- Parse the shebang line properly (use FHU::ParseArgumentsFromString
  which is the same code the loader uses)
- Make native-interpreter shebang files work by deferring to the kernel
  in that case (previously, they'd get executed through the loader and
  it would choke on the architecture of the interpreter)
- Do not use the RootFS-prepended path when executing shebang files. The
  loader will prepend that anyway when looking it up, but it needs the
  bare guest path so it can pass it as an argument to the interpreter,
  which (since it's emulated) will do the lookup through the RootFS.
The wrappers handle errno, we just need to return -1 on errors.
@asahilina asahilina force-pushed the merged-rootfs branch 3 times, most recently from 0a3228a to 6862e72 Compare December 19, 2024 18:04
@neobrain
Copy link
Member

Awesome, thanks for having continued to look into this!

When used with a AsahiLinux/muvm#130, this finally fixes Wine

Just curious, can an equivalent RootFS setup be achieved with vanilla OverlayFS or are there muvm-specific bits to this approach? My understanding was that normal OverlayFS has restrictions that make it infeasible for FEX RootFS management in general, though I don't remember if this was just an issue of lack of multiarch/lib in most Linux distributions or something more fundamental.

@asahilina
Copy link
Contributor Author

asahilina commented Dec 19, 2024

There's nothing special about muvm, the same setup can be achieved on any system. The main limitation is that overlayfs is fundamentally read-only the way we intend to use it here, so you have to partition the file hierarchy into subtrees that are overlaid (and therefore cannot be written through) and subtrees that are bind-mounted straight through. overlayfs also can't handle sub-mounts in its layers.

muvm achieves this by populating the RootFS directory on a top-level-directory basis. In the end, since our RootFS only has /etc and /usr, those two end up as OverlayFS mounts and everything else is a bind mount.

In addition, overlayfs nominally requires that the layers not mutate out from under it, so technically making any changes to the overlaid subtrees of the real host filesystem is also "UB" (constrained, won't crash your kernel). Realistically I don't think this will cause any major issues other than "maybe if you add files to the host root you won't see them in FEX until a remount", which is probably fine.

In our setup, as long as /etc and /usr don't have any submounts (they don't in any default Fedora installs), and as long as they don't mutate while muvm is running (they generally don't, certainly nothing that FEX cares about unless you're running system updates), things should work. If anything I would expect /etc to be the first place to become problematic, but that could probably be fixed by moving the overlay behavior to one level deeper there.

For reference, this is the mount tree that muvm ends up with:

tmpfs on /run/fex-emu type tmpfs (rw,nosuid,noexec,relatime)
/dev/vda on /run/fex-emu/vda type erofs (ro,relatime,user_xattr,acl,cache_strategy=readaround)
/dev/vdb on /run/fex-emu/vdb type erofs (ro,relatime,user_xattr,acl,cache_strategy=readaround)
/dev/vdc on /run/fex-emu/vdc type erofs (ro,relatime,user_xattr,acl,cache_strategy=readaround)
/dev/vda on /run/fex-emu/rootfs/.profile type erofs (ro,relatime,user_xattr,acl,cache_strategy=readaround)
/dev/root on /run/fex-emu/rootfs/.viminfo type virtiofs (rw,relatime)
/dev/root on /run/fex-emu/rootfs/media type virtiofs (rw,relatime)
/dev/root on /run/fex-emu/rootfs/mnt type virtiofs (rw,relatime)
/dev/root on /run/fex-emu/rootfs/opt type virtiofs (rw,relatime)
overlay on /run/fex-emu/rootfs/etc type overlay (ro,relatime,lowerdir=/run/fex-emu/vdc/etc:/run/fex-emu/vdb/etc:/run/fex-emu/vda/etc:/run/muvm-host/etc,redirect_dir=nofollow,userxattr)
none on /run/fex-emu/rootfs/tmp type virtiofs (rw,relatime)
tmpfs on /run/fex-emu/rootfs/tmp/.X11-unix type tmpfs (rw,nosuid,noexec,relatime)
overlay on /run/fex-emu/rootfs/usr type overlay (ro,relatime,lowerdir=/run/fex-emu/vdc/usr:/run/fex-emu/vdb/usr:/run/fex-emu/vda/usr:/usr,redirect_dir=nofollow,userxattr)
none on /boot type virtiofs (rw,relatime)
none on /run/fex-emu/rootfs/boot type virtiofs (rw,relatime)
sysfs on /run/fex-emu/rootfs/sys type sysfs (rw,nosuid,nodev,noexec,relatime)
cgroup2 on /run/fex-emu/rootfs/sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
/dev/root on /run/fex-emu/rootfs/afs type virtiofs (rw,relatime)
/dev/root on /run/fex-emu/rootfs/root type virtiofs (rw,relatime)
/dev/root on /run/fex-emu/rootfs/srv type virtiofs (rw,relatime)
none on /run/fex-emu/rootfs/home type virtiofs (rw,relatime)
devtmpfs on /run/fex-emu/rootfs/dev type devtmpfs (rw,relatime,size=25559024k,nr_inodes=6389756,mode=755)
devpts on /run/fex-emu/rootfs/dev/pts type devpts (rw,nosuid,noexec,relatime,mode=600,ptmxmode=000)
devshm on /run/fex-emu/rootfs/dev/shm type virtiofs (rw,nosuid,noexec,relatime,dax=always)
tmpfs on /run/fex-emu/rootfs/run type tmpfs (rw,nosuid,noexec,relatime)
/dev/root on /run/fex-emu/rootfs/run/muvm-host type virtiofs (rw,relatime)
proc on /run/fex-emu/rootfs/proc type proc (rw,nosuid,nodev,noexec,relatime)
binfmt_misc on /run/fex-emu/rootfs/proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,noexec,relatime)
/dev/root on /run/fex-emu/rootfs/var type virtiofs (rw,relatime)
none on /run/fex-emu/rootfs/etc/resolv.conf type virtiofs (rw,relatime)

All the sub-mounts (e.g. /dev/shm) are automatically handled since we bind-mount with recursion. /etc/resolv.conf is a special case since muvm overlays that on the real / filesystem too. /.profile is an artifact of our RootFS build and /.viminfo is junk on my system (I could delete those but I left them around as test cases for file, not dir, mounts).

@asahilina
Copy link
Contributor Author

Interesting that all those tests failed... they pass on my machine. Any idea how I can replicate the testing environment? Different RootFS, I guess?

@Sonicadvance1
Copy link
Member

Interesting that all those tests failed... they pass on my machine. Any idea how I can replicate the testing environment? Different RootFS, I guess?

The rootfs that the builders use is an Ubuntu 22.04 image that has been used for about two years.

@asahilina
Copy link
Contributor Author

I mean, can I grab it somewhere?

@asahilina
Copy link
Contributor Author

Wait, this isn't just an ancient kernel issue right? openat2() requires 5.6.

@Sonicadvance1
Copy link
Member

Wait, this isn't just an ancient kernel issue right? openat2() requires 5.6.

Some of the devices are running ancient kernels. 5.0 is our minspec.

@asahilina
Copy link
Contributor Author

Oof... then I need to gate all the changes on kernel version and revert to the old more broken behavior if it's too old... ;;

@Sonicadvance1
Copy link
Member

I mean, can I grab it somewhere?

https://drive.google.com/file/d/18vz4L2yMkGrI6ACiui-K2-MKHxYstCjs/view?usp=sharing For the rootfs image that I built for the builders.

@alyssarosenzweig
Copy link
Collaborator

Wait, this isn't just an ancient kernel issue right? openat2() requires 5.6.

Some of the devices are running ancient kernels. 5.0 is our minspec.

Sounds like now's a good time to bump our minspec then, 5.0 predates panfrost.ko getting merged!

@asahilina
Copy link
Contributor Author

I added a quick gate (turned out to be not too horrible). That does also mean that the CI isn't testing the new approach, though...

@asahilina
Copy link
Contributor Author

One thing worth considering: Now that we have a merged RootFS, would it actually be practical to pivot_root into it? Obviously FEX still needs to be able to access files outside the RootFS, but that could be done by keeping an fd around (and pivoting back before execing a native binary or FEXInterpreter again). With some care FEX might even be able to claim actual robust isolation, as long as it can prevent that fd from leaking into the guest. Perhaps this is the right direction to go in for stuff like Flatpak to work?

@asahilina
Copy link
Contributor Author

asahilina commented Dec 19, 2024

Okay, that makes more sense. So the ARMv8.0 runner was on an old kernel, and it looks like I broke thunks on the new ones with the new code. Now I need to figure out how to build & test with thunks... ^^;;

@Sonicadvance1
Copy link
Member

Okay, that makes more sense. So the ARMv8.0 runner was on an old kernel, and it looks like I broke thunks on the new ones with the new code. Now I need to figure out how to build & test with thunks... ^^;;

I'll see what we can do about raising our kernel version minspec. Just need to check our CI system to see what needs updated, or how hard it will be to update them.

@Sonicadvance1
Copy link
Member

Sonicadvance1 commented Dec 20, 2024

Took a look at our CI systems with their kernel use.

  • 3 units at 6.x (various)
  • 2 units at > 5.6, can easily be updated to 6.x (armv8.4)
    • Upgraded to 6.x
  • 1 unit at > 5.6, can be updated to 6.x fairly easily but means I need to take it offline for a bit (armv8.2)
    • Upgraded to 6.x
  • 1 unit at < 5.6, will be kind of a pain to update to 6.x, but is possible (armv8.0)
    • Upgraded to 5.15

So we could update our minspec to 6.0 /relatively/ easily, I just need to do it. I could fit that in to this weekend or next week.

@pmatos
Copy link
Collaborator

pmatos commented Dec 20, 2024

Wait, this isn't just an ancient kernel issue right? openat2() requires 5.6.

Some of the devices are running ancient kernels. 5.0 is our minspec.

Here it mentions 4.17 though:

LogMan::Msg::EFmt("FEXLoader requires kernel 4.17 minimum. Expect problems.");

@Sonicadvance1
Copy link
Member

Wait, this isn't just an ancient kernel issue right? openat2() requires 5.6.

Some of the devices are running ancient kernels. 5.0 is our minspec.

Here it mentions 4.17 though:

LogMan::Msg::EFmt("FEXLoader requires kernel 4.17 minimum. Expect problems.");

Yea, 4.17 was where we exactly knew problems were going to happen because of old Jetsons, so se decided to put a spotlight on it. If anything when we raise the minspec for real, we can just change 4.17 to 6.0 or whatever.

This avoids having to do the symlink chasing in GetEmulatedFDPath, since
the kernel does it for us. On top of that, with a merged RootFS
setup, this will correctly handle symlinks from user directories into
the RootFS, fixing wine on Fedora.
@asahilina
Copy link
Contributor Author

OK, I think I fixed thunks now. It was painful to get enough of the FEXLinuxTests test suite and thunk stuff built, without a working x86_64 compiler on Fedora...

Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Dec 20, 2024
Currently pressure-vessel does a bit of a bodge to get the x86 rootfs
path when running inside of FEX. It does this by opening `/.` which
works around our pseudo-overlayfs tracking. While this worked, it wasn't
guaranteed to work forever. With FEX-Emu#4225 working to fix more issues around
how rootfs is laid out, it had to break this path (while adding a
workaround for it to keep working).

Give pressure-vessel a blessed path from the EmulatedFiles code paths to
get a real fd back to the x86 rootfs, that way if we break this code
path it is entirely our problem to fix.

Still need to have a conversation with upstream pressure-vessel to see
if they'll accept this path or if we need to do something different.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Dec 20, 2024
Currently pressure-vessel does a bit of a bodge to get the x86 rootfs
path when running inside of FEX. It does this by opening `/.` which
works around our pseudo-overlayfs tracking. While this worked, it wasn't
guaranteed to work forever. With FEX-Emu#4225 working to fix more issues around
how rootfs is laid out, it had to break this path (while adding a
workaround for it to keep working).

Give pressure-vessel a blessed path from the EmulatedFiles code paths to
get a real fd back to the x86 rootfs, that way if we break this code
path it is entirely our problem to fix.

Still need to have a conversation with upstream pressure-vessel to see
if they'll accept this path or if we need to do something different.

We can also use this same mechanism in the future if we want to expose
more FEX specific data to the application through this.
@Sonicadvance1 Sonicadvance1 merged commit 2d66bc2 into FEX-Emu:main Jan 1, 2025
12 checks passed
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Jan 1, 2025
Currently pressure-vessel does a bit of a bodge to get the x86 rootfs
path when running inside of FEX. It does this by opening `/.` which
works around our pseudo-overlayfs tracking. While this worked, it wasn't
guaranteed to work forever. With FEX-Emu#4225 working to fix more issues around
how rootfs is laid out, it had to break this path (while adding a
workaround for it to keep working).

Give pressure-vessel a blessed path from the EmulatedFiles code paths to
get a real fd back to the x86 rootfs, that way if we break this code
path it is entirely our problem to fix.

Still need to have a conversation with upstream pressure-vessel to see
if they'll accept this path or if we need to do something different.

We can also use this same mechanism in the future if we want to expose
more FEX specific data to the application through this.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Jan 1, 2025
Brought up in FEX-Emu#4225 where it had issues with Openat2 which was added in
5.8.

The main driving force around minimum kernel version requirement is that
the lowest kernel version in our CI is 5.15. A benefit to this choice is
that this is an LTS release, which is also what Ubuntu 22.04 is
shipping.

Once the single CI machine is fixed to ship something newer then the
next logical choice would be kernel 6.1 which is also LTS, but until
then just lift it to 5.15. This version was released in October 2021,
and is supported by the kernel developers until 2026. Our previous
minimum of 5.0 was released in March 2019, so a two year leap here.

This removes the openat2 workaround that was necessary to pass our CI
since it is no longer necessary.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Jan 1, 2025
Brought up in FEX-Emu#4225 where it had issues with Openat2 which was added in
5.8.

The main driving force around minimum kernel version requirement is that
the lowest kernel version in our CI is 5.15. A benefit to this choice is
that this is an LTS release, which is also what Ubuntu 22.04 is
shipping.

Once the single CI machine is fixed to ship something newer then the
next logical choice would be kernel 6.1 which is also LTS, but until
then just lift it to 5.15. This version was released in October 2021,
and is supported by the kernel developers until 2026. Our previous
minimum of 5.0 was released in March 2019, so a two year leap here.

This removes the openat2 workaround that was necessary to pass our CI
since it is no longer necessary.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Jan 1, 2025
Brought up in FEX-Emu#4225 where it had issues with Openat2 which was added in
5.8.

The main driving force around minimum kernel version requirement is that
the lowest kernel version in our CI is 5.15. A benefit to this choice is
that this is an LTS release, which is also what Ubuntu 22.04 is
shipping.

Once the single CI machine is fixed to ship something newer then the
next logical choice would be kernel 6.1 which is also LTS, but until
then just lift it to 5.15. This version was released in October 2021,
and is supported by the kernel developers until 2026. Our previous
minimum of 5.0 was released in March 2019, so a two year leap here.

This removes the openat2 workaround that was necessary to pass our CI
since it is no longer necessary.
@neobrain
Copy link
Member

neobrain commented Jan 2, 2025

Since it seems likely it would be affected: Has library forwarding functionality been verified to still work after this change?

@asahilina
Copy link
Contributor Author

asahilina commented Jan 2, 2025 via email

@neobrain
Copy link
Member

neobrain commented Jan 2, 2025

Yes, that's what I fixed after CI caught it (it was a missed case, look for the AT_FDCWD tests in the last commit).

Great, thanks! Wasn't sure from memory if this had just been a build failure or not, but good to know it's taken care of.

Also, happy new year of course :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants