Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop agents before exiting test cases, and reliably close all_*.log when agents exit #624

Merged
merged 17 commits into from
Feb 19, 2025

Conversation

jglick
Copy link
Member

@jglick jglick commented Feb 14, 2025

As noted in #617 (comment) there are mostly consistent test failures on Windows. Not sure why these are just appearing now (and I cannot reproduce locally on Windows 10), but at any rate SupportLogHandler opens a file handle on e.g. agent-work-dirs\agent0\support\all_2025-02-14_13.33.29.log which will not be closed unless the agent process exits.

(CloudBees internal issue)

@jglick jglick added the chore label Feb 14, 2025
@jglick jglick requested a review from a team as a code owner February 14, 2025 20:13
@jglick jglick requested a review from Dohbedoh February 14, 2025 20:13
@jglick jglick marked this pull request as draft February 17, 2025 12:24
@jglick
Copy link
Member Author

jglick commented Feb 17, 2025

Apparently did not suffice. Maybe this file is not actually closed when the agent terminates? Or the current attempt to terminate the agent is not effective because it just closes the connection and does not exit the JVM?

@jglick
Copy link
Member Author

jglick commented Feb 18, 2025

ROOT_LOGGER.addHandler(LogHolder.AGENT_LOG_HANDLER);
initializes a log handler which opens this file handle in and AFAICT an agent would not normally close this handle until the JVM exits. https://github.com/jenkinsci/jenkins-test-harness/blob/052a595c77a54099ceeac0002c0d4f816838092e/src/main/java/org/jvnet/hudson/test/SimpleCommandLauncher.java#L82-L91 ought to be terminating that JVM, but I guess this just initiates termination of the JVM, it does not wait for it.

@jglick
Copy link
Member Author

jglick commented Feb 18, 2025

Incorrect. SimpleCommandLauncher seems to be waiting for the process to exit, but its listener is not being called synchronously by this thread:

java.lang.Throwable
	at org.jvnet.hudson.test.SimpleCommandLauncher$1.onClosed(SimpleCommandLauncher.java:85)
	at hudson.remoting.Channel.terminate(Channel.java:1219)
	at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1438)
	at hudson.remoting.Channel$1.handle(Channel.java:664)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:86)

@jglick
Copy link
Member Author

jglick commented Feb 18, 2025

Almost, but not quite. Weird because the log does show the agent process having been killed before the test cleanup begins.

This reverts commit 06190e8.
It seemed to mostly work, but not in one case.
@jglick jglick marked this pull request as ready for review February 18, 2025 23:12
@jglick jglick enabled auto-merge February 18, 2025 23:12
@jglick jglick added this pull request to the merge queue Feb 18, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 18, 2025
MarkEWaite
MarkEWaite previously approved these changes Feb 19, 2025
Copy link
Contributor

@MarkEWaite MarkEWaite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for the investigation!

@jglick
Copy link
Member Author

jglick commented Feb 19, 2025

So this failure is puzzling me.

11.263 [id=227]	INFO	c.c.j.support.SupportTestUtils#stopAgents: successfully deleted C:\Jenkins\agent\workspace\57ae19e04dee06d2eba8c5f1128e603e\target\tmp\j h17291846234312879372\agent-work-dirs
…
11.374 [id=227]	INFO	o.j.h.t.TemporaryDirectoryAllocator#dispose: deleting C:\Jenkins\agent\workspace\57ae19e04dee06d2eba8c5f1128e603e\target\tmp\j h17291846234312879372
…
java.nio.file.FileSystemException: C:\Jenkins\agent\workspace\57ae19e04dee06d2eba8c5f1128e603e\target\tmp\j h17291846234312879372\agent-work-dirs\slave0\support\all_2025-02-18_23.39.21.log: The process cannot access the file because it is being used by another process

would seem to indicate that something recreated this file after its containing directory had been deleted.

@jglick jglick marked this pull request as draft February 19, 2025 01:12
@jglick
Copy link
Member Author

jglick commented Feb 19, 2025

Two tracks now: jenkinsci/jenkins-test-harness#922; and trying to see if I can get SupportLogHandler.close to be called when the connection is closed.

@jglick
Copy link
Member Author

jglick commented Feb 19, 2025

No luck so far on the latter approach:

diff --git src/main/java/com/cloudbees/jenkins/support/SupportPlugin.java src/main/java/com/cloudbees/jenkins/support/SupportPlugin.java
index 3e57d01..c09f9d6 100644
--- src/main/java/com/cloudbees/jenkins/support/SupportPlugin.java
+++ src/main/java/com/cloudbees/jenkins/support/SupportPlugin.java
@@ -57,6 +57,7 @@ import hudson.model.Descriptor;
 import hudson.model.Node;
 import hudson.model.PeriodicWork;
 import hudson.model.TaskListener;
+import hudson.remoting.Channel;
 import hudson.remoting.ChannelClosedException;
 import hudson.remoting.Future;
 import hudson.remoting.VirtualChannel;
@@ -877,6 +878,12 @@ public class SupportPlugin extends Plugin {
             LogHolder.AGENT_LOG_HANDLER.setLevel(level);
             LogHolder.AGENT_LOG_HANDLER.setDirectory(new File(rootPath.getRemote(), SUPPORT_DIRECTORY_NAME), "all");
             ROOT_LOGGER.addHandler(LogHolder.AGENT_LOG_HANDLER);
+            Channel.currentOrFail().addListener(new Channel.Listener() {
+                @Override
+                public void onClosed(Channel channel, IOException cause) {
+                    LogHolder.AGENT_LOG_HANDLER.close();
+                }
+            });
             return null;
         }
     }

So far as I can tell, onClosed does not get called.

@jglick
Copy link
Member Author

jglick commented Feb 19, 2025

Still failing so I am running out of hypotheses.

@jglick
Copy link
Member Author

jglick commented Feb 19, 2025

With better logs we can see that the agent process is terminated (exit code zero), and yet 59ms later the attempt to delete its working directory fails. This remark

If a process terminates with a portion of a file locked or closes a file that has outstanding locks, the locks are unlocked by the operating system. However, the time it takes for the operating system to unlock these locks depends upon available system resources. Therefore, it is recommended that your process explicitly unlock all files it has locked when it terminates. If this is not done, access to these files may be denied if the operating system has not yet unlocked them.

suggests that jenkinsci/jenkins-test-harness#922 (while probably a step in the right direction) cannot suffice. Either I figure out how to make the agent release its locks before it exits, or the deletion code needs to just retry for a while.

@jglick jglick changed the title Stop agents before exiting test cases Stop agents before exiting test cases, and reliably close all_*.log when agents exit Feb 19, 2025
@jglick jglick added bug and removed chore labels Feb 19, 2025
@jglick jglick marked this pull request as ready for review February 19, 2025 19:25
@@ -70,6 +70,8 @@
<!-- https://www.jenkins.io/doc/developer/plugin-development/choosing-jenkins-baseline/ -->
<jenkins.baseline>2.479</jenkins.baseline>
<jenkins.version>${jenkins.baseline}.1</jenkins.version>
<!-- TODO until in parent -->
<jenkins-test-harness.version>2403.v256947ecb_c8a_</jenkins-test-harness.version>
Copy link
Member Author

@jglick jglick Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jenkinsci/jenkins-test-harness#922

Previously, the agent JVM launched by SimpleCommandLauncher (JenkinsRule.createSlave / .createOnlineSlave) would be terminated when the Computer was disconnected as part of Jenkins.cleanUp, but asynchronously and possible slightly later, when TemporaryDirectoryAllocator was already trying to clean up.

@@ -234,6 +234,7 @@ private void setFile(File file) throws FileNotFoundException {
parentFile.mkdirs();
}

StreamUtils.closeQuietly(null); // ensure class is loaded so close() can succeed
Copy link
Member Author

@jglick jglick Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does nothing but loads this class. Otherwise there can occasionally be a NoClassDefFoundError from close during agent shutdown, at which time it is too late to load new classes from RemoteClassLoader because the connection has already been closed.

@@ -862,6 +862,15 @@ public Void call() {
// avoid double installation of the handler. JNLP agents can reconnect to the controller multiple times
// and each connection gets a different RemoteClassLoader, so we need to evict them by class name,
// not by their identity.
closeAll();
Runtime.getRuntime().addShutdownHook(new Thread(LogInitializer::closeAll, "close log handlers"));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensures that the file handle is closed before the JVM exits. Likely irrelevant on Linux, but seems to matter on Windows, where the OS will release mandatory file locks “sometime” after the process exits but not necessarily quickly enough for a test.

@jglick jglick enabled auto-merge February 19, 2025 19:33
Copy link
Contributor

@MarkEWaite MarkEWaite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very, very much!

@jglick jglick added this pull request to the merge queue Feb 19, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 19, 2025
@jglick
Copy link
Member Author

jglick commented Feb 19, 2025

Yet another flake 👀

@jglick jglick enabled auto-merge February 19, 2025 20:13
@jglick jglick added this pull request to the merge queue Feb 19, 2025
Merged via the queue into jenkinsci:master with commit 49b01a4 Feb 19, 2025
17 checks passed
@jglick jglick deleted the SupportLogHandler branch February 19, 2025 20:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants