Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes #18129

Open
toddr opened this issue Sep 13, 2020 · 24 comments
Labels
distro-cygwin Win32 This is a meta-ticket to tag issues in the perl core which need attention on Win32. See #11925

Comments

@toddr
Copy link
Member

toddr commented Sep 13, 2020

I've noticed this failure on several CI tests which were completely unrelated to this test.

2020-09-13T19:19:46.2191182Z sh: line 1:  6654 Killed                  /home/runneradmin/work/perl.exe "-I../lib" "-I." "tmp6570A" 2>&1
2020-09-13T19:19:47.1516660Z # Failed test 955 - Test Perl 73464 at ./test.pl line 1065
2020-09-13T19:19:47.1517926Z #      got "# Test process timed out - terminating\nkill: 6654: No such process"
2020-09-13T19:19:47.1518573Z # expected "Timeout"
2020-09-13T19:19:47.1518994Z # PROG: 
2020-09-13T19:19:47.1519346Z # 
2020-09-13T19:19:47.1519921Z #                 BEGIN{require q(test.pl);}
2020-09-13T19:19:47.1520517Z #                 watchdog(3);
2020-09-13T19:19:47.1521025Z #                 $SIG{ALRM} = sub {print "Timeout\n"; exit(1)};
2020-09-13T19:19:47.1521437Z #                 alarm 1;
2020-09-13T19:19:47.1521838Z #                 $_ = "a" x 1000 . "b" x 1000 . "c" x 1000;
2020-09-13T19:19:47.1522235Z #                 /.*a.*b.*c.*[de]/;
2020-09-13T19:19:47.1522593Z #             
2020-09-13T19:19:47.1522984Z # STATUS: 35072
2020-09-13T19:19:58.3677295Z t/re/pat ....................................................... FAILED at test 955

Re-running the entire workflow usually does not fail.

@toddr toddr added Needs Triage distro-mswin32 Win32 This is a meta-ticket to tag issues in the perl core which need attention on Win32. See #11925 labels Sep 13, 2020
@toddr
Copy link
Member Author

toddr commented Sep 13, 2020

Seen in a chat. I think this was the same problem:

I have a test VM reproducing the cygwin failures that keep showing up on github. It’s the test on line 1905 of t/re/pat.t. I believe it’s either the signals from the alarms showing up out of order, or the kill of the watcher process just unblocking it from its sleep(), or the print statements showing up out of order. Increasing the timeout of the watchdog process to 10 seconds decreases the liklihood significantly, but doesn’t eliminate it entirely. Need to dig into it more when I have free time.

It might fix the problem if the watchdog process used a select() loop for it’s timeout rather than sleep() + alarm(). I’d bet it’s just an oddity of the way signals are implemented on cygwin.

@toddr toddr changed the title re/pat.t appears to be unstable on windows CI tests re/pat.t appears to be unstable on cygwin CI tests Sep 13, 2020
@wchristian
Copy link
Contributor

There's two bugs here:

  1. For some reason sometimes alarm doesn't fire in time, and the watchdog fires instead. This is the actual bug. I'm presuming performance reasons right now.

  2. For some reason the cygwin kill logic is silly. As you can see, the system kill is called unconditionally on cygwin, without checking if the previous kill worked or not. This results in the watchdog termination sometimes ALSO having a spurious error message which should however either: Be ignored, or the logic be augmented to check if the process still exists before trying to kill it. NOTE: This doesn't address why the process is being eaten by the watchdog, it just addresses the kill error, which is really ... irrelevant.

            # Kill test process if still running
            if (kill(0, $pid_to_kill)) {
                _diag($timeout_msg);
                kill('KILL', $pid_to_kill);
                if ($is_cygwin) {
                    # sometimes the above isn't enough on cygwin
                    sleep 1; # wait a little, it might have worked after all
                    system("/bin/kill -f $pid_to_kill");
                }
            }

@lightsey
Copy link
Contributor

I spent some time a while back trying to work around this buggy behavior. I believe the introduction of ConPTY support in Cygwin made it unreliable. It seems to disappear so long as the cygwin server is enabled and the cygwin terminal process is started with CYGWIN=disable_pcon set.

There's also a seemingly related problem in the way S_exit_warning() vetos the Perl interpreter cleanup and emits warnings when a child thread calls exit(). In a nutshell, there's no reason to veto the cleanup if all of the threads are also in a joinable state. This seems to happen frequently on cygwin when a child thread calls exit().

The top three commits in this branch have the code I was experimenting with to fix this problem, but I didn't figure out a way to get the cygwin shell to start properly with disable_pcon set.

https://github.com/lightsey/perl5/tree/cygwin-threads

@toddr
Copy link
Member Author

toddr commented Sep 15, 2020

The relevant attempt seems to have been 26eacad

@toddr
Copy link
Member Author

toddr commented Sep 15, 2020

I feel like we should put a skip in this unless someone wants to come up with a fix. IMO it's not cool when all our commits are decorated red because of this failure. Opinions?

@wchristian
Copy link
Contributor

You could try increasing the timeout first to see if that helps.

Also please do fix the secondary bug regardless. I thought my explanation was fairly thorough?

@toddr
Copy link
Member Author

toddr commented Sep 16, 2020

You could try increasing the timeout first to see if that helps.

It helps but only changes the rate of failure. It does not fix anything. See: #18129 (comment)

Also please do fix the secondary bug regardless. I thought my explanation was fairly thorough?

Can you produce a pull request?

@wchristian
Copy link
Contributor

I saw the comment. It refers to their test VM. We have no data on what, e.g. a 30 second watchdog would do on the github VM.

Can you produce a pull request?

I'd be happier to do it if i had less rocks in my way, but sure, i'll produce initial PRs.

wchristian added a commit to wchristian/perl5 that referenced this issue Sep 16, 2020
…ocess

Under Cygwin a process can sometimes take a little while to spool down
after being killed. There already is code to wait a second and retry.
However if the process has already disappeared in the wait second, then the
retry is engaged anyhow and will then complain it can't find the process.

This change makes it so test.pl only truly attempts to kill a cygwin
process if it actually is still around.

This resolves the secondary bug in Perl#18129.
khwilliamson pushed a commit that referenced this issue Sep 16, 2020
…ocess

Under Cygwin a process can sometimes take a little while to spool down
after being killed. There already is code to wait a second and retry.
However if the process has already disappeared in the wait second, then the
retry is engaged anyhow and will then complain it can't find the process.

This change makes it so test.pl only truly attempts to kill a cygwin
process if it actually is still around.

This resolves the secondary bug in #18129.
wchristian added a commit to wchristian/perl5 that referenced this issue Sep 16, 2020
This is an attempt to see if the primary issue of Perl#18129 can be fixed with
an increased watchdog timeout.
@lightsey
Copy link
Contributor

A simple way of reproducing the critical part of the bug is this oneliner:

PERL5LIB=`pwd`/lib ./perl.exe -e 'no warnings; use threads; threads->create(sub {exit(99)});sleep(5);exit(98);' 2>/dev/null; echo $?

On Linux that will reliably show an exit code of 99.
On Cygwin is will frequently show an exit code of 0.
If you strace it on Cygwin it will start reliably showing an exit code of 99.
If you remove the 2>/dev/null and allow the warnings to reach the console on Cygwin it will generally show an exit code of 99.

The same logic implemented in C with pthreads works reliably on Cygwin and Linux.

Using the version of Perl 5.30.3 that comes with cygwin also works reliably, but building Perl 5.30.3 from source on Cygwin shows this buggy behavior.

I'm at a loss to explain why Perl misbehaves in this scenario, but it seems to be caused by a fairly tight race condition in the way the exit codes are handled when a child thread exits. It also seems like either something in the cygwin patches for Perl or the cygwin package build toolchain fixes it.

The timeouts and oddities of how the watchdog() code works are red herrings from what I can tell.

@lightsey
Copy link
Contributor

It's also worth noting that from what I've seen, starting the cygwin SSH daemon and connecting over SSH seems to fix the behavior, and starting the Cygwin shell with CYGWIN=disable_pcon set also seems to fix it. Unfortunately, that doesn't seem to work in the github workflow file and doesn't explain why the version of Perl packaged with Cygwin doesn't exhibit the same issue.

It could be that some of these variations just make the Perl exit logic run slightly faster or slower and avoid whatever race condition is the root cause of this bug.

@wchristian
Copy link
Contributor

wchristian commented Sep 17, 2020

I had sent you some stuff on IRC as per toddr.

Anyhow, if you're on it and not letting this be closed with a simple skip, yay.

Edit:

The timeouts and oddities of how the watchdog() code works are red herrings from what I can tell.

Yes, but they can work around the issue so it remains usable until such time as a fix is found. They are not not now and were never intended to be the fix. Plus the watchdog oddity with kill was a legit bug.

@lightsey
Copy link
Contributor

My testing showed that increasing the timeout didn't eliminate the flapping behavior. I bet the simplest way to avoid the flapping tests would be one of these options...

  1. disable cygwin in the smoke tests
  2. compile without threads on cygwin for the smoke tests
  3. rewrite the watchdog() code to use forks on cygwin
  4. rewrite the failing tests to look for a zero exit code rather than a non-zero exit code

I'm at a loss for how to fix the thread bug itself. I asked Achim Gratz (the cygwin perl package maintainer) to take a look at this issue.

@wchristian
Copy link
Contributor

@lightsey I have not been able to reproduce with the snippet you gave above.

https://gist.github.com/wchristian/ca33e766c73b82f9fd3026abad1e7634

wchristian added a commit to wchristian/perl5 that referenced this issue Sep 17, 2020
This is an attempt to see if the primary issue of Perl#18129 can be fixed with
an increased watchdog timeout.
@wchristian
Copy link
Contributor

We tried reproducing it on my system, with the check PASSing every time:

https://gist.github.com/wchristian/ad4971132939899e4c52f310d830cfc1

Trying to see with watchdog changes now:

https://github.com/wchristian/perl5/actions/runs/260042599
https://github.com/wchristian/perl5/actions/runs/260048164

Also gonna add a test that runs the snippet a thousand times to get a better idea of what actually happens.

wchristian added a commit to wchristian/perl5 that referenced this issue Sep 17, 2020
…ocess

Under Cygwin a process can sometimes take a little while to spool down
after being killed. There already is code to wait a second and retry.
However if the process has already disappeared in the wait second, then the
retry is engaged anyhow and will then complain it can't find the process.

This change makes it so test.pl only truly attempts to kill a cygwin
process if it actually is still around.

This resolves the secondary bug in Perl#18129.
@wchristian
Copy link
Contributor

So in this branch adding a longer timeout seems to have improved things.

https://github.com/wchristian/perl5/actions/runs/260165383

@jkeenan
Copy link
Contributor

jkeenan commented Sep 18, 2020

FWIW, a recent github action run on cygwin in a "Perl 7" strict-by-default environment came up with this error:

2020-09-17T22:49:50.6092178Z sh: line 1:  6834 Killed                  /home/runneradmin/work/perl.exe "-I../lib" "-I." "tmp6750A" 2>&1
2020-09-17T22:49:51.5535604Z # Failed test 955 - Test Perl 73464 at ./test.pl line 1069
2020-09-17T22:49:51.5539809Z #      got "# Test process timed out - terminating\nkill: 6834: No such process"
2020-09-17T22:49:51.5541105Z # expected "Timeout"
2020-09-17T22:49:51.5541705Z # PROG: 
2020-09-17T22:49:51.5542065Z # 
2020-09-17T22:49:51.5542457Z #                 BEGIN{require q(test.pl);}
2020-09-17T22:49:51.5542907Z #                 watchdog(3);
2020-09-17T22:49:51.5543358Z #                 $SIG{ALRM} = sub {print "Timeout\n"; exit(1)};
2020-09-17T22:49:51.5543776Z #                 alarm 1;
2020-09-17T22:49:51.5544368Z #                 $_ = "a" x 1000 . "b" x 1000 . "c" x 1000;
2020-09-17T22:49:51.5544882Z #                 /.*a.*b.*c.*[de]/;
2020-09-17T22:49:51.5545233Z #             
2020-09-17T22:49:51.5545563Z # STATUS: 35072
2020-09-17T22:50:02.1976770Z t/re/pat ....................................................... FAILED at test 955

From: https://pipelines.actions.githubusercontent.com/StwKAELrIIfJZjsjvFeJdim2u3AOp8ez6qvldOETh5vCDDWoz5/_apis/pipelines/1/runs/780/signedlogcontent/32?urlExpires=2020-09-18T00%3A11%3A44.6877392Z&urlSigningMethod=HMACV1&urlSignature=H0qYt4FXy2QKhfJVmRkN2gAngBWtZR82d4Wc5lBv4H4%3D

@wchristian
Copy link
Contributor

wchristian commented Sep 18, 2020

@jkeenan that link's dead

edit: also you wanna cherry-pick wchristian@2015363

@jkeenan
Copy link
Contributor

jkeenan commented Sep 18, 2020

@jkeenan that link's dead

2020-09-17T22:49:50.6092178Z sh: line 1:  6834 Killed                  /home/runneradmin/work/perl.exe "-I../lib" "-I." "tmp6750A" 2>&1
2020-09-17T22:49:51.5535604Z # Failed test 955 - Test Perl 73464 at ./test.pl line 1069
2020-09-17T22:49:51.5539809Z #      got "# Test process timed out - terminating\nkill: 6834: No such process"
2020-09-17T22:49:51.5541105Z # expected "Timeout"
2020-09-17T22:49:51.5541705Z # PROG: 
2020-09-17T22:49:51.5542065Z # 
2020-09-17T22:49:51.5542457Z #                 BEGIN{require q(test.pl);}
2020-09-17T22:49:51.5542907Z #                 watchdog(3);
2020-09-17T22:49:51.5543358Z #                 $SIG{ALRM} = sub {print "Timeout\n"; exit(1)};
2020-09-17T22:49:51.5543776Z #                 alarm 1;
2020-09-17T22:49:51.5544368Z #                 $_ = "a" x 1000 . "b" x 1000 . "c" x 1000;
2020-09-17T22:49:51.5544882Z #                 /.*a.*b.*c.*[de]/;
2020-09-17T22:49:51.5545233Z #             
2020-09-17T22:49:51.5545563Z # STATUS: 35072
2020-09-17T22:50:02.1976770Z t/re/pat ....................................................... FAILED at test 955

@wchristian
Copy link
Contributor

Very odd, on one of my smoke runs a github vm saw this failure:

t/op/alarm ..................................................... FAILED--expected 5 tests, saw 0

@wchristian
Copy link
Contributor

And @jkeenan, yeah, that's the exact same error we're seeing here. Seems to be caused by github VMs, on account of not occurring with cygwin on iron. Also you wanna cherry-pick wchristian/perl5@2015363

wchristian added a commit to wchristian/perl5 that referenced this issue Sep 18, 2020
…ocess

Under Cygwin a process can sometimes take a little while to spool down
after being killed. There already is code to wait a second and retry.
However if the process has already disappeared in the wait second, then the
retry is engaged anyhow and will then complain it can't find the process.

This change makes it so test.pl only truly attempts to kill a cygwin
process if it actually is still around.

This resolves the secondary bug in Perl#18129.
wchristian added a commit to wchristian/perl5 that referenced this issue Sep 18, 2020
…ocess

Under Cygwin a process can sometimes take a little while to spool down
after being killed. There already is code to wait a second and retry.
However if the process has already disappeared in the wait second, then the
retry is engaged anyhow and will then complain it can't find the process.

This change makes it so test.pl only truly attempts to kill a cygwin
process if it actually is still around.

This resolves the secondary bug in Perl#18129.
@wchristian
Copy link
Contributor

wchristian commented Sep 18, 2020

I've implemented a test that should fail every time under cygwin on github vms, and not take too long on other systems:

wchristian@e8eafd9

If that is at 100% fail we can mark it as TODO.

jkeenan pushed a commit that referenced this issue Sep 18, 2020
…ocess

Under Cygwin a process can sometimes take a little while to spool down
after being killed. There already is code to wait a second and retry.
However if the process has already disappeared in the wait second, then the
retry is engaged anyhow and will then complain it can't find the process.

This change makes it so test.pl only truly attempts to kill a cygwin
process if it actually is still around.

This resolves the secondary bug in #18129.

(cherry picked from commit 57b919f26b5911913c97a93dd5238f7c8c2a6e5f)
Signed-off-by: Nicolas R <[email protected]>

Committer: Add contributor to AUTHORS
wchristian added a commit to wchristian/perl5 that referenced this issue Sep 18, 2020
As per github Perl#18129, github test VMs occasionally fail this alarm test.

This commit implements a loop that forces those systems to always fail the
test.

On cygwin running directly on iron this doesn't fail even after 1000
iterations.

However in order to make github smokes a little more useful for now, this
also marks it TODO.
@wchristian
Copy link
Contributor

PR made: #18149

wchristian added a commit to wchristian/perl5 that referenced this issue Sep 30, 2020
As per github Perl#18129, github test VMs occasionally fail this alarm test.

This commit implements a loop that forces those systems to always fail the
test.

On cygwin running directly on iron this doesn't fail even after 1000
iterations.

However in order to make github smokes a little more useful for now, this
also marks it TODO.
wchristian added a commit to wchristian/perl5 that referenced this issue Oct 3, 2020
As per github Perl#18129, github test VMs occasionally fail this alarm test.

This commit implements a loop that forces those systems to always fail the
test.

On cygwin running directly on iron this doesn't fail even after 1000
iterations.

However in order to make github smokes a little more useful for now, this
also marks it TODO.
wchristian added a commit to wchristian/perl5 that referenced this issue Oct 3, 2020
As per github Perl#18129, github test VMs occasionally fail this alarm test.

This commit implements a loop that forces those systems to always fail the
test.

On cygwin running directly on iron this doesn't fail even after 1000
iterations.

However in order to make github smokes a little more useful for now, this
also marks it TODO.
khwilliamson pushed a commit that referenced this issue Oct 3, 2020
As per github #18129, github test VMs occasionally fail this alarm test.

This commit implements a loop that forces those systems to always fail the
test.

On cygwin running directly on iron this doesn't fail even after 1000
iterations.

However in order to make github smokes a little more useful for now, this
also marks it TODO.
@wchristian
Copy link
Contributor

The pull request earlier seems to have successfully stopped re/pat.t from flapping, but instead fails consistently as a TODO.

This means the specific bug is now:

on Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes

@wchristian
Copy link
Contributor

This one also needs the mswin32 label replaced with the distro-cygwin one.

@khwilliamson khwilliamson changed the title re/pat.t appears to be unstable on cygwin CI tests Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes Mar 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
distro-cygwin Win32 This is a meta-ticket to tag issues in the perl core which need attention on Win32. See #11925
Projects
None yet
Development

No branches or pull requests

4 participants