
Implement afterDisconnect in SimpleCommandLauncher #922

Merged 4 commits into jenkinsci:master on Feb 19, 2025

Conversation

@jglick (Member) commented Feb 19, 2025

jglick added a commit to jglick/support-core-plugin that referenced this pull request Feb 19, 2025
jglick added a commit to jglick/workflow-cps-plugin that referenced this pull request Feb 19, 2025
@jglick jglick marked this pull request as ready for review February 19, 2025 18:38
LOGGER.log(Level.INFO, "agent launched for {0}", computer.getName());
} catch (Exception x) {
LOGGER.log(Level.WARNING, null, x);
}
}

@SuppressFBWarnings(value = "IS2_INCONSISTENT_SYNC", justification = "test code, close enough")
@jglick (Member Author) commented on the diff:

command-launcher plugin could be updated to match, I guess, though it mainly matters here because of Windows tests.
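
For readers not following the diff, a minimal sketch of the kind of change this PR makes, using purely illustrative names rather than the actual SimpleCommandLauncher code: a command-based launcher remembers the Process it started in launch() and terminates it in afterDisconnect(), so the agent JVM does not linger after the channel closes.

```java
import hudson.model.TaskListener;
import hudson.slaves.ComputerLauncher;
import hudson.slaves.SlaveComputer;
import java.io.IOException;

// Illustrative sketch only, not the actual patch: the class name, the "proc" field,
// and the naive command splitting are assumptions made for a self-contained example.
public class ExampleCommandLauncher extends ComputerLauncher {

    private final String cmd;
    private volatile Process proc; // set in launch(), cleared in afterDisconnect()

    public ExampleCommandLauncher(String cmd) {
        this.cmd = cmd;
    }

    @Override
    public void launch(SlaveComputer computer, TaskListener listener) throws IOException, InterruptedException {
        proc = new ProcessBuilder(cmd.split(" ")).start();
        // Hand the process's stdio to the computer so remoting can negotiate a channel
        // (the same setChannel call that shows up in the stack trace quoted later in this thread).
        computer.setChannel(proc.getInputStream(), proc.getOutputStream(), listener.getLogger(), null);
    }

    @Override
    public void afterDisconnect(SlaveComputer computer, TaskListener listener) {
        Process p = proc;
        if (p != null) {
            p.destroy(); // kill the now-orphaned agent process
            proc = null;
        }
    }
}
```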

@jglick jglick merged commit 256947e into jenkinsci:master Feb 19, 2025
15 checks passed
@jglick jglick deleted the SimpleCommandLauncher branch February 19, 2025 18:43
jglick added a commit to jglick/support-core-plugin that referenced this pull request Feb 19, 2025
@basil (Member) commented Feb 20, 2025

I'm seeing frequent flakes in Jenkins core in hudson.cli.DisconnectNodeCommandTest#disconnectNodeManyShouldSucceedWithCause, as in https://ci.jenkins.io/job/Core/job/jenkins/job/PR-10305/7/testReport/junit/hudson.cli/DisconnectNodeCommandTest/linux_jdk17___Linux___JDK_17___Build___Test___disconnectNodeManyShouldSucceedWithCause/, though I am not sure whether they are caused by this PR. But I never saw them prior to rolling out this PR in Jenkins core.

@jglick (Member Author) commented Feb 20, 2025

Also https://ci.jenkins.io/job/Core/job/jenkins/job/PR-10305/9/testReport/junit/hudson.cli/ConnectNodeCommandTest/linux_jdk17___Linux___JDK_17___Build___Test___connectNodeManyShouldSucceed/, I suppose. I tried running both locally, and they passed 350 times in a row. I also tried inserting delays into SlaveComputer.disconnect, but they still pass. Where else are you seeing these flakes? Just random core PR builds? I did find https://ci.jenkins.io/job/Core/job/jenkins/view/change-requests/job/PR-10252/lastSuccessfulBuild/testReport/hudson.cli/DisconnectNodeCommandTest/linux_jdk21___Linux___JDK_21___Build___Test___disconnectNodeManyShouldSucceed/

The disconnectNodeManyShouldSucceedWithCause flake is not particularly concerning. Locally, each agent process terminates quietly with exit code 0. In the flake, they have exit code 143 (SIGTERM) and print messy stack traces. I think this just reflects differences in timing between the process being terminated vs. CloseCommand being sent. So the test should tolerate an alternate cause of disconnection—there is no real guarantee that ByCLI would be used.
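
A hedged sketch of what tolerating an alternate cause could look like in a test, not the actual DisconnectNodeCommandTest code; the class name and the direct disconnect() call are illustrative, and the point is only the final assertion accepting more than OfflineCause.ByCLI:

```java
import hudson.model.Computer;
import hudson.slaves.OfflineCause;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;
import static org.junit.Assert.assertTrue;

// Illustrative only: the real test drives disconnection through the disconnect-node CLI
// command; here the node is disconnected directly for brevity.
public class TolerantDisconnectCauseTest {

    @Rule public JenkinsRule j = new JenkinsRule();

    @Test
    public void disconnectRecordsSomeCause() throws Exception {
        Computer c = j.createOnlineSlave().toComputer();
        c.disconnect(new OfflineCause.ByCLI("test cause")).get();
        assertTrue(c.isOffline());
        OfflineCause cause = c.getOfflineCause();
        // Tolerate either the CLI-supplied cause or a channel-termination cause, since the
        // agent process may die (SIGTERM) before the CloseCommand round trip completes.
        assertTrue(cause instanceof OfflineCause.ByCLI || cause instanceof OfflineCause.ChannelTermination);
    }
}
```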

The connectNodeManyShouldSucceed flake is more puzzling, because the logs suggest something is disconnecting the agent before Jenkins.cleanUp runs, but the test does not involve this.

@basil (Member) commented Feb 20, 2025

I've been running core builds and watching for flakes for the past week as part of #913. There is some sort of issue in HtmlUnit which affects HtmlUnit-based tests seemingly at random, and that remains a problem. Other than that, the flake addressed by #913 was the top one I observed up until the release of 2403.v256947ecb_c8a_, which gained the fix for #913, possibly (seemingly?) at the cost of destabilizing DisconnectNodeCommandTest. DisconnectNodeCommandTest failed twice and then passed once in jenkinsci/jenkins#10304, at which point I merged it. But after rebasing today, DisconnectNodeCommandTest has failed three times in jenkinsci/jenkins#10305 and hasn't passed yet. I suppose this is bad for cost savings, because I have now run 3 builds but still don't have the incremental I need for testing.

@jglick (Member Author) commented Feb 21, 2025

> something is disconnecting the agent

https://github.com/jenkinsci/jenkins/pull/10240/checks?check_run_id=37600387598 (jenkinsci/jenkins#10240) now shows some flakes on Windows. One failure in JenkinsTest.getComputers shows a Native memory allocation (mmap) failed message, suggesting an infra problem (jenkins-infra/helpdesk#4554 etc.). Two others in ComputerTest do not mention such an error. In all these cases there is

WARNING	o.j.h.test.SimpleCommandLauncher#launch
java.io.EOFException: unexpected stream termination
	at hudson.remoting.ChannelBuilder.negotiate(ChannelBuilder.java:478)
	at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:422)
	at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:440)
	at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:407)
	at org.jvnet.hudson.test.SimpleCommandLauncher.launch(SimpleCommandLauncher.java:83)
	at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)

which obviously involves code touched in this PR, though it should not be changing the behavior of launch in any material way (the failure is just during JenkinsRule.createOnlineSlave() with no further behaviors). Hard to say whether the flakes are related to this PR, or to the recent infra changes which might have caused problems forking subprocesses in limited-memory conditions. The ComputerTest failures show the Java command being launched, but then no stderr from the agent, and the negotiate failure suggests that there is no stdout from the agent either, consistent with the agent JVM being launched but then crashing before it gets to application code.
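
If one wanted to confirm that theory, a hypothetical standalone probe (nothing in this PR; the command line and class name are made up) could check whether the launched JVM exits without ever writing to stdout, which is exactly the condition ChannelBuilder.negotiate reports as an unexpected stream termination:

```java
import java.io.InputStream;
import java.util.concurrent.TimeUnit;

// Hypothetical diagnostic, not part of the PR: launch an agent-like command and see
// whether it produces any stdout at all before exiting.
public class AgentCrashProbe {
    public static void main(String[] args) throws Exception {
        // Stand-in for the real agent command line; "agent.jar" here is an assumption.
        Process agent = new ProcessBuilder("java", "-cp", "agent.jar", "hudson.remoting.Launcher").start();
        InputStream stdout = agent.getInputStream();
        int first = stdout.read(); // remoting would read its handshake preamble here
        if (first == -1) {
            // EOF before a single byte: the JVM died before reaching application code.
            boolean exited = agent.waitFor(10, TimeUnit.SECONDS);
            System.err.println("Agent wrote no stdout; exited=" + exited
                    + (exited ? ", exit code=" + agent.exitValue() : ""));
        } else {
            System.out.println("Agent started its handshake normally.");
        }
    }
}
```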
