Remote caching unstable after upgrading to Bazel 8.0.0 #24867

Closed
armandomontanez opened this issue Jan 8, 2025 · 9 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@armandomontanez

Description of the bug:

When moving to Bazel 8.0.0, remote caching for Pigweed became unreliable, frequently emitting errors like the following:

ERROR: /usr/local/google/home/amontanez/development/projects/pigweed/pigweed/targets/rp2040/BUILD.bazel:247:14: Creating runfiles tree bazel-out/k8-fastbuild/bin/targets/rp2040/rp2350_system_async_example.runfiles failed: java.io.FileNotFoundException: /usr/local/google/home/amontanez/.cache/bazel/_bazel_amontanez/06cb1b6ef37c7adbbc0068a64c52919c/execroot/_main/bazel-out/k8-fastbuild/bin/targets/rp2040/rp2350_system_async_example.runfiles/_main/external/pico-sdk+/src/rp2350/boot_stage2 (No such file or directory)

Turning off --experimental_inprocess_symlink_creation works around this.

This seems to occur most frequently around the boundaries of transitions that propagate runfiles, though that may be anecdotal. I was able to prevent some occurrences by excluding runfiles from being passed across the transition boundary.

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I can reproduce locally by:

  1. Remove common:remote_cache --experimental_inprocess_symlink_creation=false from Pigweed's .bazelrc.
  2. Run bazel test --config=remote_cache //...
  3. Run bazel test --config=remote_cache --config=googletest //...

Note that the first bazel test invocation does not reproduce the failure on its own; it only appears on the second invocation, after the configuration is switched slightly. I don't know why.

Which operating system are you running Bazel on?

linux

What is the output of bazel info release?

release 8.0.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@github-actions github-actions bot added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Jan 8, 2025
@fmeum
Collaborator

fmeum commented Jan 8, 2025

@bazel-io flag 8.0.1

@fmeum
Collaborator

fmeum commented Jan 8, 2025

@tjgq

@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Jan 8, 2025
@iancha1992
Member

@bazel-io fork 8.0.1

@bazel-io bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Jan 8, 2025
@UebelAndre
Contributor

I believe this is the root cause for this failed test in rules_rust when migrating to Bazel 8.0.0 (bazelbuild/rules_rust#3077)
https://buildkite.com/bazel/rules-rust-rustlang/builds/13675#0194496d-7d1d-4ac5-8559-5eb8b1038aca

(04:59:40) ERROR: /workdir/test/unpretty/BUILD.bazel:45:10: Creating runfiles tree bazel-out/k8-fastbuild/bin/test/unpretty/proc_macro_test_unpretty_diff_test-test.sh.runfiles failed: java.io.FileNotFoundException: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/execroot/_main/bazel-out/k8-fastbuild/bin/test/unpretty/proc_macro_test_unpretty_diff_test-test.sh.runfiles/_main/test/unpretty (No such file or directory)
(04:59:40) ERROR: /workdir/test/unpretty/BUILD.bazel:18:10: Creating runfiles tree bazel-out/k8-fastbuild/bin/test/unpretty/proc_macro_unpretty_diff_test-test.sh.runfiles failed: java.io.FileNotFoundException: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/execroot/_main/bazel-out/k8-fastbuild/bin/test/unpretty/proc_macro_unpretty_diff_test-test.sh.runfiles/_main/test/unpretty (No such file or directory)

@fmeum
Collaborator

fmeum commented Jan 9, 2025

Could you rerun with --verbose_failures, which should give us a stack trace?

Nvm, that also doesn't show the extension. I can reproduce on pigweed though.

@fmeum
Collaborator

fmeum commented Jan 9, 2025

Stack trace for pigweed:

java.io.FileNotFoundException: /private/var/tmp/_bazel_fmeum/c9f560a675e76a01d47db885533bcde8/execroot/_main/bazel-out/darwin_arm64-fastbuild/bin/pw_system/rp2350_system_example.runfiles/_main/external/pico-sdk+/src/rp2350/boot_stage2 (No such file or directory)
	at com.google.devtools.build.lib.vfs.inmemoryfs.InMemoryFileSystem$Errno.exception(InMemoryFileSystem.java:145)
	at com.google.devtools.build.lib.vfs.inmemoryfs.InMemoryFileSystem$Errno.inodeOrThrow(InMemoryFileSystem.java:126)
	at com.google.devtools.build.lib.vfs.inmemoryfs.InMemoryFileSystem.getDirectory(InMemoryFileSystem.java:340)
	at com.google.devtools.build.lib.vfs.inmemoryfs.InMemoryFileSystem.createSymbolicLink(InMemoryFileSystem.java:531)
	at com.google.devtools.build.lib.vfs.Path.createSymbolicLink(Path.java:481)
	at com.google.devtools.build.lib.remote.RemoteActionFileSystem.createSymbolicLink(RemoteActionFileSystem.java:559)
	at com.google.devtools.build.lib.vfs.Path.createSymbolicLink(Path.java:481)
	at com.google.devtools.build.lib.vfs.FileSystemUtils.ensureSymbolicLink(FileSystemUtils.java:345)
	at com.google.devtools.build.lib.exec.SymlinkTreeHelper$Directory.syncTreeRecursively(SymlinkTreeHelper.java:310)
	at com.google.devtools.build.lib.exec.SymlinkTreeHelper$Directory.syncTreeRecursively(SymlinkTreeHelper.java:317)
	at com.google.devtools.build.lib.exec.SymlinkTreeHelper$Directory.syncTreeRecursively(SymlinkTreeHelper.java:317)
	at com.google.devtools.build.lib.exec.SymlinkTreeHelper$Directory.syncTreeRecursively(SymlinkTreeHelper.java:317)
	at com.google.devtools.build.lib.exec.SymlinkTreeHelper$Directory.syncTreeRecursively(SymlinkTreeHelper.java:317)
	at com.google.devtools.build.lib.exec.SymlinkTreeHelper$Directory.syncTreeRecursively(SymlinkTreeHelper.java:317)
	at com.google.devtools.build.lib.exec.SymlinkTreeHelper$Directory.syncTreeRecursively(SymlinkTreeHelper.java:317)
	at com.google.devtools.build.lib.exec.SymlinkTreeHelper.createSymlinksDirectly(SymlinkTreeHelper.java:139)
	at com.google.devtools.build.lib.exec.SymlinkTreeStrategy.createSymlinks(SymlinkTreeStrategy.java:114)
	at com.google.devtools.build.lib.analysis.actions.SymlinkTreeAction.execute(SymlinkTreeAction.java:230)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1170)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1075)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:166)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:95)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:559)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:928)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:375)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:216)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:467)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:435)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)

I wasn't expecting to see an in-memory FS in there. Does it make sense to run SymlinkTreeAction against the RemoteActionFileSystem given that the symlinks it creates aren't even tracked as outputs?

Edit: To reproduce, I only had to build //pw_system:rp2350_system_example with and without the config.

@fmeum
Collaborator

fmeum commented Jan 9, 2025

I suspect that this is caused by a fundamental limitation of the RemoteActionFileSystem, which makes it unfit for this purpose: When the runfiles tree logic traverses the runfiles directory, which only exists locally and not in the in-memory file system, it assumes that the parent directories already exist, because they do exist locally. When it then attempts to recreate a symlink, the operation is mirrored to the in-memory file system, where it fails because the parent directory doesn't exist there.
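
To make the suspected failure mode concrete, here is a minimal toy model in Java. It is not Bazel's code: ToyFs, MirroredFs and every method name below are hypothetical stand-ins, and the order in which the two layers are written is an arbitrary choice for the sketch. It only assumes what is described above: a directory that exists in the local layer but not in the in-memory layer, and a symlink creation that is applied to both layers.

```java
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Set;

public class MirroredSymlinkDemo {
    /** Hypothetical toy file system: tracks known directories and symlinks by absolute path. */
    static class ToyFs {
        final Set<String> dirs = new HashSet<>();
        final Set<String> links = new HashSet<>();

        ToyFs() {
            dirs.add("/");
        }

        /** Registers the given directory and all of its ancestors. */
        void mkdirs(String path) {
            String p = path;
            while (!p.equals("/")) {
                dirs.add(p);
                p = parent(p);
            }
        }

        /** Creates a symlink entry, failing if the parent directory is unknown to this layer. */
        void createSymlink(String path) throws FileNotFoundException {
            if (!dirs.contains(parent(path))) {
                throw new FileNotFoundException(path + " (No such file or directory)");
            }
            links.add(path);
        }

        static String parent(String path) {
            int i = path.lastIndexOf('/');
            return i <= 0 ? "/" : path.substring(0, i);
        }
    }

    /** Hypothetical stand-in for a file system that mirrors writes to an in-memory layer. */
    static class MirroredFs {
        final ToyFs local = new ToyFs();
        final ToyFs inMemory = new ToyFs();

        void createSymlink(String path) throws FileNotFoundException {
            local.createSymlink(path);    // succeeds: the parent exists on local disk
            inMemory.createSymlink(path); // throws: the parent was never created in this layer
        }
    }

    /** Returns true if creating the link through the mirrored file system fails. */
    static boolean mirroredCreateFails(String link) {
        MirroredFs fs = new MirroredFs();
        fs.local.mkdirs(ToyFs.parent(link)); // the runfiles directory exists locally only
        try {
            fs.createSymlink(link);
            return false;
        } catch (FileNotFoundException e) {
            return true;
        }
    }

    /** Returns true if creating the same link against the local layer alone succeeds. */
    static boolean localOnlyCreateSucceeds(String link) {
        ToyFs local = new ToyFs();
        local.mkdirs(ToyFs.parent(link));
        try {
            local.createSymlink(link);
            return local.links.contains(link);
        } catch (FileNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String link = "/execroot/demo.runfiles/_main/external/repo/file";
        System.out.println("mirrored create fails: " + mirroredCreateFails(link));
        System.out.println("local-only create succeeds: " + localOnlyCreateSucceeds(link));
    }
}
```

In this model the caller never creates the parent directory through the mirrored file system, because it already sees it on local disk, so the in-memory layer rejects the symlink. The same operation against the local layer alone goes through.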

@tjgq
Contributor

tjgq commented Jan 9, 2025

Yeah, I think it's a mistake to use the action filesystem for this purpose. I will write a fix.

@meteorcloudy meteorcloudy added P1 I'll work on this now. (Assignee required) and removed untriaged labels Jan 13, 2025
@tjgq tjgq self-assigned this Jan 13, 2025
@tjgq tjgq modified the milestone: 8.0.1 release blockers Jan 13, 2025
tjgq added a commit to tjgq/bazel that referenced this issue Jan 14, 2025
…ction filesystem.

The comment added in SymlinkTreeStrategy explains why this is required.

Fixes bazelbuild#24867.

PiperOrigin-RevId: 715305548
Change-Id: I376d360a0d072c0d5912e14e3115a7fb3b5f2281
github-merge-queue bot pushed a commit that referenced this issue Jan 14, 2025
…ction filesystem. (#24924)

The comment added in SymlinkTreeStrategy explains why this is required.

Fixes #24867.

PiperOrigin-RevId: 715305548
Change-Id: I376d360a0d072c0d5912e14e3115a7fb3b5f2281
@iancha1992
Member

A fix for this issue has been included in Bazel 8.0.1 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=8.0.1rc1. Thanks!

9 participants