Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes #18129

toddr · 2020-09-13T20:39:46Z

I've noticed this failure on several CI tests which were completely unrelated to this test.

2020-09-13T19:19:46.2191182Z sh: line 1:  6654 Killed                  /home/runneradmin/work/perl.exe "-I../lib" "-I." "tmp6570A" 2>&1
2020-09-13T19:19:47.1516660Z # Failed test 955 - Test Perl 73464 at ./test.pl line 1065
2020-09-13T19:19:47.1517926Z #      got "# Test process timed out - terminating\nkill: 6654: No such process"
2020-09-13T19:19:47.1518573Z # expected "Timeout"
2020-09-13T19:19:47.1518994Z # PROG: 
2020-09-13T19:19:47.1519346Z # 
2020-09-13T19:19:47.1519921Z #                 BEGIN{require q(test.pl);}
2020-09-13T19:19:47.1520517Z #                 watchdog(3);
2020-09-13T19:19:47.1521025Z #                 $SIG{ALRM} = sub {print "Timeout\n"; exit(1)};
2020-09-13T19:19:47.1521437Z #                 alarm 1;
2020-09-13T19:19:47.1521838Z #                 $_ = "a" x 1000 . "b" x 1000 . "c" x 1000;
2020-09-13T19:19:47.1522235Z #                 /.*a.*b.*c.*[de]/;
2020-09-13T19:19:47.1522593Z #             
2020-09-13T19:19:47.1522984Z # STATUS: 35072
2020-09-13T19:19:58.3677295Z t/re/pat ....................................................... FAILED at test 955

Re-running the entire workflow usually does not fail.

The text was updated successfully, but these errors were encountered:

toddr · 2020-09-13T20:42:57Z

Seen in a chat. I think this was the same problem:

I have a test VM reproducing the cygwin failures that keep showing up on github. It’s the test on line 1905 of t/re/pat.t. I believe it’s either the signals from the alarms showing up out of order, or the kill of the watcher process just unblocking it from its sleep(), or the print statements showing up out of order. Increasing the timeout of the watchdog process to 10 seconds decreases the liklihood significantly, but doesn’t eliminate it entirely. Need to dig into it more when I have free time.

It might fix the problem if the watchdog process used a select() loop for it’s timeout rather than sleep() + alarm(). I’d bet it’s just an oddity of the way signals are implemented on cygwin.

wchristian · 2020-09-13T22:35:51Z

There's two bugs here:

For some reason sometimes alarm doesn't fire in time, and the watchdog fires instead. This is the actual bug. I'm presuming performance reasons right now.
For some reason the cygwin kill logic is silly. As you can see, the system kill is called unconditionally on cygwin, without checking if the previous kill worked or not. This results in the watchdog termination sometimes ALSO having a spurious error message which should however either: Be ignored, or the logic be augmented to check if the process still exists before trying to kill it. NOTE: This doesn't address why the process is being eaten by the watchdog, it just addresses the kill error, which is really ... irrelevant.

            # Kill test process if still running
            if (kill(0, $pid_to_kill)) {
                _diag($timeout_msg);
                kill('KILL', $pid_to_kill);
                if ($is_cygwin) {
                    # sometimes the above isn't enough on cygwin
                    sleep 1; # wait a little, it might have worked after all
                    system("/bin/kill -f $pid_to_kill");
                }
            }

lightsey · 2020-09-14T16:18:39Z

I spent some time a while back trying to work around this buggy behavior. I believe the introduction of ConPTY support in Cygwin made it unreliable. It seems to disappear so long as the cygwin server is enabled and the cygwin terminal process is started with CYGWIN=disable_pcon set.

There's also a seemingly related problem in the way S_exit_warning() vetos the Perl interpreter cleanup and emits warnings when a child thread calls exit(). In a nutshell, there's no reason to veto the cleanup if all of the threads are also in a joinable state. This seems to happen frequently on cygwin when a child thread calls exit().

The top three commits in this branch have the code I was experimenting with to fix this problem, but I didn't figure out a way to get the cygwin shell to start properly with disable_pcon set.

https://github.com/lightsey/perl5/tree/cygwin-threads

toddr · 2020-09-15T18:09:39Z

The relevant attempt seems to have been 26eacad

toddr · 2020-09-15T19:34:55Z

I feel like we should put a skip in this unless someone wants to come up with a fix. IMO it's not cool when all our commits are decorated red because of this failure. Opinions?

wchristian · 2020-09-16T01:00:40Z

You could try increasing the timeout first to see if that helps.

Also please do fix the secondary bug regardless. I thought my explanation was fairly thorough?

toddr · 2020-09-16T16:39:14Z

You could try increasing the timeout first to see if that helps.

It helps but only changes the rate of failure. It does not fix anything. See: #18129 (comment)

Also please do fix the secondary bug regardless. I thought my explanation was fairly thorough?

Can you produce a pull request?

wchristian · 2020-09-16T18:46:13Z

I saw the comment. It refers to their test VM. We have no data on what, e.g. a 30 second watchdog would do on the github VM.

Can you produce a pull request?

I'd be happier to do it if i had less rocks in my way, but sure, i'll produce initial PRs.

…ocess Under Cygwin a process can sometimes take a little while to spool down after being killed. There already is code to wait a second and retry. However if the process has already disappeared in the wait second, then the retry is engaged anyhow and will then complain it can't find the process. This change makes it so test.pl only truly attempts to kill a cygwin process if it actually is still around. This resolves the secondary bug in Perl#18129.

…ocess Under Cygwin a process can sometimes take a little while to spool down after being killed. There already is code to wait a second and retry. However if the process has already disappeared in the wait second, then the retry is engaged anyhow and will then complain it can't find the process. This change makes it so test.pl only truly attempts to kill a cygwin process if it actually is still around. This resolves the secondary bug in #18129.

This is an attempt to see if the primary issue of Perl#18129 can be fixed with an increased watchdog timeout.

lightsey · 2020-09-17T13:34:58Z

A simple way of reproducing the critical part of the bug is this oneliner:

PERL5LIB=`pwd`/lib ./perl.exe -e 'no warnings; use threads; threads->create(sub {exit(99)});sleep(5);exit(98);' 2>/dev/null; echo $?

On Linux that will reliably show an exit code of 99.
On Cygwin is will frequently show an exit code of 0.
If you strace it on Cygwin it will start reliably showing an exit code of 99.
If you remove the 2>/dev/null and allow the warnings to reach the console on Cygwin it will generally show an exit code of 99.

The same logic implemented in C with pthreads works reliably on Cygwin and Linux.

Using the version of Perl 5.30.3 that comes with cygwin also works reliably, but building Perl 5.30.3 from source on Cygwin shows this buggy behavior.

I'm at a loss to explain why Perl misbehaves in this scenario, but it seems to be caused by a fairly tight race condition in the way the exit codes are handled when a child thread exits. It also seems like either something in the cygwin patches for Perl or the cygwin package build toolchain fixes it.

The timeouts and oddities of how the watchdog() code works are red herrings from what I can tell.

lightsey · 2020-09-17T13:48:27Z

It's also worth noting that from what I've seen, starting the cygwin SSH daemon and connecting over SSH seems to fix the behavior, and starting the Cygwin shell with CYGWIN=disable_pcon set also seems to fix it. Unfortunately, that doesn't seem to work in the github workflow file and doesn't explain why the version of Perl packaged with Cygwin doesn't exhibit the same issue.

It could be that some of these variations just make the Perl exit logic run slightly faster or slower and avoid whatever race condition is the root cause of this bug.

wchristian · 2020-09-17T14:28:09Z

I had sent you some stuff on IRC as per toddr.

Anyhow, if you're on it and not letting this be closed with a simple skip, yay.

Edit:

The timeouts and oddities of how the watchdog() code works are red herrings from what I can tell.

Yes, but they can work around the issue so it remains usable until such time as a fix is found. They are not not now and were never intended to be the fix. Plus the watchdog oddity with kill was a legit bug.

lightsey · 2020-09-17T16:59:18Z

My testing showed that increasing the timeout didn't eliminate the flapping behavior. I bet the simplest way to avoid the flapping tests would be one of these options...

disable cygwin in the smoke tests
compile without threads on cygwin for the smoke tests
rewrite the watchdog() code to use forks on cygwin
rewrite the failing tests to look for a zero exit code rather than a non-zero exit code

I'm at a loss for how to fix the thread bug itself. I asked Achim Gratz (the cygwin perl package maintainer) to take a look at this issue.

wchristian · 2020-09-17T19:58:51Z

@lightsey I have not been able to reproduce with the snippet you gave above.

https://gist.github.com/wchristian/ca33e766c73b82f9fd3026abad1e7634

This is an attempt to see if the primary issue of Perl#18129 can be fixed with an increased watchdog timeout.

wchristian · 2020-09-17T21:23:13Z

We tried reproducing it on my system, with the check PASSing every time:

https://gist.github.com/wchristian/ad4971132939899e4c52f310d830cfc1

Trying to see with watchdog changes now:

https://github.com/wchristian/perl5/actions/runs/260042599
https://github.com/wchristian/perl5/actions/runs/260048164

Also gonna add a test that runs the snippet a thousand times to get a better idea of what actually happens.

…ocess Under Cygwin a process can sometimes take a little while to spool down after being killed. There already is code to wait a second and retry. However if the process has already disappeared in the wait second, then the retry is engaged anyhow and will then complain it can't find the process. This change makes it so test.pl only truly attempts to kill a cygwin process if it actually is still around. This resolves the secondary bug in Perl#18129.

wchristian · 2020-09-18T00:04:58Z

So in this branch adding a longer timeout seems to have improved things.

https://github.com/wchristian/perl5/actions/runs/260165383

jkeenan · 2020-09-18T00:12:52Z

FWIW, a recent github action run on cygwin in a "Perl 7" strict-by-default environment came up with this error:

2020-09-17T22:49:50.6092178Z sh: line 1:  6834 Killed                  /home/runneradmin/work/perl.exe "-I../lib" "-I." "tmp6750A" 2>&1
2020-09-17T22:49:51.5535604Z # Failed test 955 - Test Perl 73464 at ./test.pl line 1069
2020-09-17T22:49:51.5539809Z #      got "# Test process timed out - terminating\nkill: 6834: No such process"
2020-09-17T22:49:51.5541105Z # expected "Timeout"
2020-09-17T22:49:51.5541705Z # PROG: 
2020-09-17T22:49:51.5542065Z # 
2020-09-17T22:49:51.5542457Z #                 BEGIN{require q(test.pl);}
2020-09-17T22:49:51.5542907Z #                 watchdog(3);
2020-09-17T22:49:51.5543358Z #                 $SIG{ALRM} = sub {print "Timeout\n"; exit(1)};
2020-09-17T22:49:51.5543776Z #                 alarm 1;
2020-09-17T22:49:51.5544368Z #                 $_ = "a" x 1000 . "b" x 1000 . "c" x 1000;
2020-09-17T22:49:51.5544882Z #                 /.*a.*b.*c.*[de]/;
2020-09-17T22:49:51.5545233Z #             
2020-09-17T22:49:51.5545563Z # STATUS: 35072
2020-09-17T22:50:02.1976770Z t/re/pat ....................................................... FAILED at test 955

From: https://pipelines.actions.githubusercontent.com/StwKAELrIIfJZjsjvFeJdim2u3AOp8ez6qvldOETh5vCDDWoz5/_apis/pipelines/1/runs/780/signedlogcontent/32?urlExpires=2020-09-18T00%3A11%3A44.6877392Z&urlSigningMethod=HMACV1&urlSignature=H0qYt4FXy2QKhfJVmRkN2gAngBWtZR82d4Wc5lBv4H4%3D

wchristian · 2020-09-18T00:31:49Z

@jkeenan that link's dead

edit: also you wanna cherry-pick wchristian@2015363

jkeenan · 2020-09-18T00:42:21Z

@jkeenan that link's dead

2020-09-17T22:49:50.6092178Z sh: line 1:  6834 Killed                  /home/runneradmin/work/perl.exe "-I../lib" "-I." "tmp6750A" 2>&1
2020-09-17T22:49:51.5535604Z # Failed test 955 - Test Perl 73464 at ./test.pl line 1069
2020-09-17T22:49:51.5539809Z #      got "# Test process timed out - terminating\nkill: 6834: No such process"
2020-09-17T22:49:51.5541105Z # expected "Timeout"
2020-09-17T22:49:51.5541705Z # PROG: 
2020-09-17T22:49:51.5542065Z # 
2020-09-17T22:49:51.5542457Z #                 BEGIN{require q(test.pl);}
2020-09-17T22:49:51.5542907Z #                 watchdog(3);
2020-09-17T22:49:51.5543358Z #                 $SIG{ALRM} = sub {print "Timeout\n"; exit(1)};
2020-09-17T22:49:51.5543776Z #                 alarm 1;
2020-09-17T22:49:51.5544368Z #                 $_ = "a" x 1000 . "b" x 1000 . "c" x 1000;
2020-09-17T22:49:51.5544882Z #                 /.*a.*b.*c.*[de]/;
2020-09-17T22:49:51.5545233Z #             
2020-09-17T22:49:51.5545563Z # STATUS: 35072
2020-09-17T22:50:02.1976770Z t/re/pat ....................................................... FAILED at test 955

wchristian · 2020-09-18T12:51:01Z

Very odd, on one of my smoke runs a github vm saw this failure:

t/op/alarm ..................................................... FAILED--expected 5 tests, saw 0

wchristian · 2020-09-18T12:52:43Z

And @jkeenan, yeah, that's the exact same error we're seeing here. Seems to be caused by github VMs, on account of not occurring with cygwin on iron. Also you wanna cherry-pick wchristian/perl5@2015363

…ocess Under Cygwin a process can sometimes take a little while to spool down after being killed. There already is code to wait a second and retry. However if the process has already disappeared in the wait second, then the retry is engaged anyhow and will then complain it can't find the process. This change makes it so test.pl only truly attempts to kill a cygwin process if it actually is still around. This resolves the secondary bug in Perl#18129.

wchristian · 2020-09-18T17:29:42Z

I've implemented a test that should fail every time under cygwin on github vms, and not take too long on other systems:

wchristian@e8eafd9

If that is at 100% fail we can mark it as TODO.

…ocess Under Cygwin a process can sometimes take a little while to spool down after being killed. There already is code to wait a second and retry. However if the process has already disappeared in the wait second, then the retry is engaged anyhow and will then complain it can't find the process. This change makes it so test.pl only truly attempts to kill a cygwin process if it actually is still around. This resolves the secondary bug in #18129. (cherry picked from commit 57b919f26b5911913c97a93dd5238f7c8c2a6e5f) Signed-off-by: Nicolas R <[email protected]> Committer: Add contributor to AUTHORS

As per github Perl#18129, github test VMs occasionally fail this alarm test. This commit implements a loop that forces those systems to always fail the test. On cygwin running directly on iron this doesn't fail even after 1000 iterations. However in order to make github smokes a little more useful for now, this also marks it TODO.

wchristian · 2020-09-18T20:18:12Z

PR made: #18149

As per github Perl#18129, github test VMs occasionally fail this alarm test. This commit implements a loop that forces those systems to always fail the test. On cygwin running directly on iron this doesn't fail even after 1000 iterations. However in order to make github smokes a little more useful for now, this also marks it TODO.

As per github #18129, github test VMs occasionally fail this alarm test. This commit implements a loop that forces those systems to always fail the test. On cygwin running directly on iron this doesn't fail even after 1000 iterations. However in order to make github smokes a little more useful for now, this also marks it TODO.

wchristian · 2020-10-05T09:18:08Z

The pull request earlier seems to have successfully stopped re/pat.t from flapping, but instead fails consistently as a TODO.

This means the specific bug is now:

on Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes

wchristian · 2020-10-05T09:31:44Z

This one also needs the mswin32 label replaced with the distro-cygwin one.

toddr added Needs Triage distro-mswin32 Win32 This is a meta-ticket to tag issues in the perl core which need attention on Win32. See #11925 labels Sep 13, 2020

toddr changed the title ~~re/pat.t appears to be unstable on windows CI tests~~ re/pat.t appears to be unstable on cygwin CI tests Sep 13, 2020

wchristian mentioned this issue Sep 16, 2020

prevent a spurious error output from test.pl when killing a cygwin process #18146

Merged

wchristian added a commit to wchristian/perl5 that referenced this issue Sep 16, 2020

prevent regex/alarm test from timing out on cygwin

d1d7730

This is an attempt to see if the primary issue of Perl#18129 can be fixed with an increased watchdog timeout.

wchristian mentioned this issue Sep 16, 2020

prevent regex/alarm test from timing out on cygwin #18147

Closed

wchristian added a commit to wchristian/perl5 that referenced this issue Sep 17, 2020

prevent regex/alarm test from timing out on cygwin

9fff5e0

This is an attempt to see if the primary issue of Perl#18129 can be fixed with an increased watchdog timeout.

wchristian mentioned this issue Sep 18, 2020

forcibly check the timing of the regex/alarm test in bulk #18149

Merged

wchristian mentioned this issue Oct 5, 2020

cygwin failures for op/stat.t and op/threads.t #18193

Closed

This was referenced Oct 5, 2020

on Github Cygwin VMs op/threads rarely flaps #18206

Open

on Github Cygwin VMs op/alarm rarely flaps #18207

Open

toddr added distro-cygwin and removed distro-mswin32 labels Oct 5, 2020

jkeenan removed the Needs Triage label Oct 12, 2020

khwilliamson changed the title ~~re/pat.t appears to be unstable on cygwin CI tests~~ Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes Mar 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes #18129

Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes #18129

toddr commented Sep 13, 2020 •

edited

Loading

toddr commented Sep 13, 2020 •

edited

Loading

wchristian commented Sep 13, 2020

lightsey commented Sep 14, 2020

toddr commented Sep 15, 2020

toddr commented Sep 15, 2020

wchristian commented Sep 16, 2020

toddr commented Sep 16, 2020 •

edited

Loading

wchristian commented Sep 16, 2020

lightsey commented Sep 17, 2020

lightsey commented Sep 17, 2020

wchristian commented Sep 17, 2020 •

edited

Loading

lightsey commented Sep 17, 2020

wchristian commented Sep 17, 2020

wchristian commented Sep 17, 2020

wchristian commented Sep 18, 2020

jkeenan commented Sep 18, 2020

wchristian commented Sep 18, 2020 •

edited

Loading

jkeenan commented Sep 18, 2020

wchristian commented Sep 18, 2020

wchristian commented Sep 18, 2020

wchristian commented Sep 18, 2020 •

edited

Loading

wchristian commented Sep 18, 2020

wchristian commented Oct 5, 2020

wchristian commented Oct 5, 2020

Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes #18129

Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes #18129

Comments

toddr commented Sep 13, 2020 • edited Loading

toddr commented Sep 13, 2020 • edited Loading

wchristian commented Sep 13, 2020

lightsey commented Sep 14, 2020

toddr commented Sep 15, 2020

toddr commented Sep 15, 2020

wchristian commented Sep 16, 2020

toddr commented Sep 16, 2020 • edited Loading

wchristian commented Sep 16, 2020

lightsey commented Sep 17, 2020

lightsey commented Sep 17, 2020

wchristian commented Sep 17, 2020 • edited Loading

lightsey commented Sep 17, 2020

wchristian commented Sep 17, 2020

wchristian commented Sep 17, 2020

wchristian commented Sep 18, 2020

jkeenan commented Sep 18, 2020

wchristian commented Sep 18, 2020 • edited Loading

jkeenan commented Sep 18, 2020

wchristian commented Sep 18, 2020

wchristian commented Sep 18, 2020

wchristian commented Sep 18, 2020 • edited Loading

wchristian commented Sep 18, 2020

wchristian commented Oct 5, 2020

wchristian commented Oct 5, 2020

toddr commented Sep 13, 2020 •

edited

Loading

toddr commented Sep 13, 2020 •

edited

Loading

toddr commented Sep 16, 2020 •

edited

Loading

wchristian commented Sep 17, 2020 •

edited

Loading

wchristian commented Sep 18, 2020 •

edited

Loading

wchristian commented Sep 18, 2020 •

edited

Loading