Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes #18129
Seen in a chat. I think this was the same problem:
|
There's two bugs here:
    # Kill test process if still running
    if (kill(0, $pid_to_kill)) {
        _diag($timeout_msg);
        kill('KILL', $pid_to_kill);
        if ($is_cygwin) {
            # sometimes the above isn't enough on cygwin
            sleep 1; # wait a little, it might have worked after all
            system("/bin/kill -f $pid_to_kill");
        }
    } |
I spent some time a while back trying to work around this buggy behavior. I believe the introduction of ConPTY support in Cygwin made it unreliable. It seems to disappear so long as the cygwin server is enabled and the cygwin terminal process is started with
There's also a seemingly related problem in the way S_exit_warning() vetoes the Perl interpreter cleanup and emits warnings when a child thread calls exit(). In a nutshell, there's no reason to veto the cleanup if all of the threads are also in a joinable state. This seems to happen frequently on cygwin when a child thread calls exit().
The top three commits in this branch have the code I was experimenting with to fix this problem, but I didn't figure out a way to get the cygwin shell to start properly with disable_pcon set. |
The relevant attempt seems to have been 26eacad |
I feel like we should put a skip in this unless someone wants to come up with a fix. IMO it's not cool when all our commits are decorated red because of this failure. Opinions? |
You could try increasing the timeout first to see if that helps. Also please do fix the secondary bug regardless. I thought my explanation was fairly thorough? |
It helps but only changes the rate of failure. It does not fix anything. See: #18129 (comment)
Can you produce a pull request? |
I saw the comment. It refers to their test VM. We have no data on what, e.g. a 30 second watchdog would do on the github VM.
I'd be happier to do it if I had fewer rocks in my way, but sure, I'll produce initial PRs. |
…ocess Under Cygwin a process can sometimes take a little while to spool down after being killed. There already is code to wait a second and retry. However if the process has already disappeared in the wait second, then the retry is engaged anyhow and will then complain it can't find the process. This change makes it so test.pl only truly attempts to kill a cygwin process if it actually is still around. This resolves the secondary bug in Perl#18129.
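The guarded retry described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual test.pl change: `process_alive` and `kill_with_retry` are hypothetical names, and `$force_kill` is an injected callback standing in for `system("/bin/kill -f $pid")` so the logic can be exercised away from Cygwin.

```perl
use strict;
use warnings;

# Hypothetical helper: signal 0 only probes whether the PID can be signalled,
# it does not deliver anything to the process.
sub process_alive {
    my ($pid) = @_;
    return kill(0, $pid) ? 1 : 0;
}

# Sketch of the guarded retry. Returns 0 if the process was already gone,
# 1 if a kill was attempted.
sub kill_with_retry {
    my ($pid, $is_cygwin, $force_kill) = @_;

    # Nothing to do if the process has already exited.
    return 0 unless process_alive($pid);

    kill('KILL', $pid);
    if ($is_cygwin) {
        # Give the process a moment to spool down after the first kill.
        sleep 1;
        # Only escalate if the process is actually still around; this is
        # the check the original code was missing.
        $force_kill->($pid) if process_alive($pid);
    }
    return 1;
}
```

On Cygwin the callback would presumably be something like `sub { system("/bin/kill -f $_[0]") }`. Passing it in as a parameter is purely a choice made for this sketch.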
This is an attempt to see if the primary issue of Perl#18129 can be fixed with an increased watchdog timeout.
A simple way of reproducing the critical part of the bug is this oneliner:
On Linux that will reliably show an exit code of 99. The same logic implemented in C with pthreads works reliably on Cygwin and Linux. Using the version of Perl 5.30.3 that comes with cygwin also works reliably, but building Perl 5.30.3 from source on Cygwin shows this buggy behavior. I'm at a loss to explain why Perl misbehaves in this scenario, but it seems to be caused by a fairly tight race condition in the way the exit codes are handled when a child thread exits. It also seems like either something in the cygwin patches for Perl or the cygwin package build toolchain fixes it. The timeouts and oddities of how the watchdog() code works are red herrings from what I can tell. |
It's also worth noting that from what I've seen, starting the cygwin SSH daemon and connecting over SSH seems to fix the behavior, and starting the Cygwin shell with
It could be that some of these variations just make the Perl exit logic run slightly faster or slower and avoid whatever race condition is the root cause of this bug. |
I had sent you some stuff on IRC as per toddr. Anyhow, if you're on it and not letting this be closed with a simple skip, yay. Edit:
Yes, but they can work around the issue so it remains usable until such time as a fix is found. They are not now, and were never intended to be, the fix. Plus the watchdog oddity with kill was a legit bug. |
My testing showed that increasing the timeout didn't eliminate the flapping behavior. I bet the simplest way to avoid the flapping tests would be one of these options...
I'm at a loss for how to fix the thread bug itself. I asked Achim Gratz (the cygwin perl package maintainer) to take a look at this issue. |
@lightsey I have not been able to reproduce with the snippet you gave above. https://gist.github.com/wchristian/ca33e766c73b82f9fd3026abad1e7634 |
We tried reproducing it on my system, with the check PASSing every time: https://gist.github.com/wchristian/ad4971132939899e4c52f310d830cfc1 Trying to see with watchdog changes now: https://github.com/wchristian/perl5/actions/runs/260042599 Also gonna add a test that runs the snippet a thousand times to get a better idea of what actually happens. |
So in this branch adding a longer timeout seems to have improved things. |
FWIW, a recent github action run on cygwin in a "Perl 7" strict-by-default environment came up with this error:
|
@jkeenan that link's dead edit: also you wanna cherry-pick wchristian@2015363 |
|
Very odd, on one of my smoke runs a github vm saw this failure:
|
And @jkeenan, yeah, that's the exact same error we're seeing here. Seems to be caused by github VMs, on account of not occurring with cygwin on iron. Also you wanna cherry-pick wchristian/perl5@2015363 |
I've implemented a test that should fail every time under cygwin on github vms, and not take too long on other systems: If that is at 100% fail we can mark it as TODO. |
…ocess Under Cygwin a process can sometimes take a little while to spool down after being killed. There already is code to wait a second and retry. However if the process has already disappeared in the wait second, then the retry is engaged anyhow and will then complain it can't find the process. This change makes it so test.pl only truly attempts to kill a cygwin process if it actually is still around. This resolves the secondary bug in #18129.
(cherry picked from commit 57b919f26b5911913c97a93dd5238f7c8c2a6e5f)
Signed-off-by: Nicolas R <[email protected]>
Committer: Add contributor to AUTHORS
As per github Perl#18129, github test VMs occasionally fail this alarm test. This commit implements a loop that forces those systems to always fail the test. On cygwin running directly on iron this doesn't fail even after 1000 iterations. However in order to make github smokes a little more useful for now, this also marks it TODO.
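The idea of looping the alarm check so that an occasional failure becomes a near-certain one can be sketched as below. This is only an illustration under stated assumptions, not the code from the commit: `alarm_fires` and `count_alarm_failures` are hypothetical names, a busy loop stands in for the long-running regex in re/pat.t, and `Time::HiRes` is used so that sub-second alarms keep the loop short.

```perl
use strict;
use warnings;
use Time::HiRes qw(alarm time);

# Returns 1 if SIGALRM interrupted the busy work, 0 if the work ran to its
# deadline without the alarm ever firing.
sub alarm_fires {
    my ($alarm_after, $give_up_after) = @_;
    my $fired = 0;
    local $SIG{ALRM} = sub { $fired = 1; die "alarm\n" };
    my $deadline = time() + $give_up_after;
    eval {
        alarm($alarm_after);
        # Busy loop standing in for a long-running regex.
        1 while time() < $deadline;
        alarm(0);
    };
    alarm(0);    # make sure no alarm is left pending
    return $fired;
}

# Repeats the check and counts the iterations in which the alarm never fired.
sub count_alarm_failures {
    my ($iterations, $alarm_after, $give_up_after) = @_;
    my $failures = 0;
    for (1 .. $iterations) {
        $failures++ unless alarm_fires($alarm_after, $give_up_after);
    }
    return $failures;
}
```

A call such as `count_alarm_failures(1000, 0.05, 1)` would then report how many of 1000 iterations missed the alarm. On a system where alarm() behaves, each iteration ends after roughly 0.05 seconds, so the loop stays cheap.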
PR made: #18149 |
The earlier pull request seems to have successfully stopped re/pat.t from flapping; the test now fails consistently as a TODO instead. This means the specific bug is now: on Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes |
This one also needs the mswin32 label replaced with the distro-cygwin one. |
I've noticed this failure on several CI tests which were completely unrelated to this test.
Re-running the entire workflow usually succeeds.