
Spurious failures in CI #3793

Closed
tripleee opened this issue Apr 3, 2020 · 12 comments · Fixed by #3806


tripleee commented Apr 3, 2020

What problem has occurred? What issues has it caused?

Two of my recent PRs had Circle CI failures in code I had not touched. One failed, then succeeded after I rebased my commits; the other exhibited the opposite behavior: the tests passed, then I rebased the commits and force-pushed, and the same code failed.

For the record: #3789 #3790

What would you like to happen/not happen?

The test should not fail spuriously.

I think I had this on my laptop occasionally too, so I don't think it's specific to Circle CI.

The failing code has some comments which vaguely hint at what might be wrong, but then why would they fail only some of the time?


tripleee commented Apr 3, 2020

In #3789, simply rerunning the Circle CI test succeeded.


makyen commented Apr 3, 2020

The reported error is a KeyError: 'items', which implies that the items property didn't exist in the response SD received from the SE API when accessing the /answers/{} endpoint.

My guess at what is happening is that the SE API is rate-limiting the IP address. This may or may not have been preceded by the SE API sending a backoff in a response. The tests which are failing do basically the same thing multiple times in rapid succession, which tends to load the SE API and makes rate limiting more likely. It's also possible the external Circle CI IP address is shared with other testing, which could impact SE API use. Overall, such issues will be intermittent, because when the SE API applies rate limiting depends on the overall load on the servers SE uses for the API.

There was a bug in the !!/allspam code which caused any backoff sent by the SE API, when the !!/allspam code accessed the /answers/{} endpoint, to be ignored. I've fixed that. It's unclear whether that fix will actually resolve this problem, but it should improve the situation.
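
For illustration, a minimal sketch of honoring a backoff field when hitting the /answers/{} endpoint (this is not SmokeDetector's actual code; the function name, filter value, and parameters are placeholders):

```python
import time
import requests

API_BASE = "https://api.stackexchange.com/2.2"  # assumed API version

def fetch_answer(answer_id, site="stackoverflow"):
    response = requests.get(
        "{}/answers/{}".format(API_BASE, answer_id),
        params={"site": site},
    ).json()

    # The API may omit 'items' entirely when throttling, hence the KeyError.
    items = response.get("items", [])

    # If the API asks us to back off, sleep before the caller issues another request.
    backoff = response.get("backoff")
    if backoff:
        time.sleep(backoff)

    return items
```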


makyen commented Apr 3, 2020

The code in that section of the !!/allspam command really should be restructured. Currently, it sends a separate request to the /answers/{} endpoint for each of the user's answers. As long as the user has fewer than 100 answers, this could be handled with a single request to the SE API.
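
A minimal sketch of that restructuring, relying on the SE API accepting up to 100 semicolon-separated IDs per /answers/{ids} request (names and parameters here are illustrative, not SmokeDetector's actual code):

```python
import requests

API_BASE = "https://api.stackexchange.com/2.2"  # assumed API version

def fetch_answers(answer_ids, site="stackoverflow"):
    # One request for up to 100 answers instead of one request per answer.
    ids = ";".join(str(i) for i in answer_ids[:100])
    response = requests.get(
        "{}/answers/{}".format(API_BASE, ids),
        params={"site": site},
    ).json()
    return response.get("items", [])
```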

@tripleee tripleee changed the title Spurious failures in chatcommands test Spurious failures in CI Apr 9, 2020

tripleee commented Apr 9, 2020

Retitled with a larger scope because the tests introduced in #3790 also suffer from this problem, in spades. Both Travis and Circle CI appear to have unreliable DNS. There has been discussion in chat among @makyen, @teward, and myself regarding how exactly to tackle this.

tripleee added a commit that referenced this issue Apr 9, 2020
Spurious failures in CI because of flaky DNS

tripleee commented Apr 9, 2020

(Giftedly copy/pasted a spelling error. Let's not make more changes on mobile.)


teward commented Apr 9, 2020

So let me break this down a little.

CI failures have happened with select domains. These domains are getting what we call SERVFAIL responses in DNS, which means that whatever authoritative DNS server serves records for the domain could not be reached or is misconfigured.

This isn't necessarily a DNS issue in Circle CI or Travis, but rather an issue with the nameservers of the specific domains being tested.

We need to adjust our tests to catch the SERVFAIL cases. Let me poke around the tests myself and see whether we can catch these SERVFAILs and just pass over those tests without hard failing...

@ArtOfCode-
Member

A DNS SERVFAIL should result in a warning in the test environment. It's not something critical that will stop Smokey running, but should be logged so that someone can come around and remove the failing domains.


teward commented Apr 9, 2020

> A DNS SERVFAIL should result in a warning in the test environment. It's not something critical that will stop Smokey running, but should be logged so that someone can come around and remove the failing domains.

Agreed. However, what we've got in CI is that it's hard-failing because an uncaught exception is raising dns.resolver.NoNameservers, which isn't handled in the existing tests.


teward commented Apr 9, 2020

Keep an eye on https://github.com/Charcoal-SE/SmokeDetector/tree/dns-tests

This adds an except handler for NoNameservers errors. This was previously an uncaught error in the tests; now it will be caught and debug-logged just like the other errors we catch, but without resolving any details, because there are no DNS nameservers available for the request. This will, however, catch the error.
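
For reference, a minimal sketch of that kind of handler using dnspython (illustrative only, not the exact code on the dns-tests branch):

```python
import logging
import dns.resolver

def resolve_a_records(domain):
    try:
        return [rdata.to_text() for rdata in dns.resolver.query(domain, "A")]
    except dns.resolver.NoNameservers:
        # SERVFAIL from all nameservers: log it and move on rather than hard-failing.
        logging.debug("No nameservers could resolve %s (SERVFAIL); skipping", domain)
        return None
```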

teward added a commit that referenced this issue Apr 9, 2020
…3806)

* Enable DNS tests, catch on dns.resolver.NoNameservers for SERVFAILs
* Fix log level
* FLAKE fixes.

makyen commented Apr 11, 2020

Unfortunately, we're still getting spurious CI failures, at least on Travis CI: 1, 2. Both of those are:

dns.exception.Timeout: The DNS operation timed out after 30.00…

@makyen makyen reopened this Apr 11, 2020

teward commented Apr 12, 2020

@makyen yet another uncaught exception. We can fix that for the tests too. I'll write up a bit for those failures shortly and get that pushed in.

@teward teward self-assigned this Apr 12, 2020
@teward teward closed this as completed in 185e514 Apr 12, 2020

teward commented Apr 12, 2020

I pushed an additional handler to capture DNS timeouts - a DNS timeout now logs a warning instead of erroring out with an unhandled exception.

It's probable that Travis or Circle CI have janky DNS capabilities, so we'll have to handle DNS lookup errors gracefully instead of letting them go uncaught, which is what caused the spurious CI errors.
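
A minimal sketch of the additional handling, again using dnspython with illustrative names (not the exact code from 185e514):

```python
import logging
import dns.exception
import dns.resolver

def resolve_a_records(domain, lifetime=30.0):
    try:
        return [rdata.to_text() for rdata in dns.resolver.query(domain, "A", lifetime=lifetime)]
    except dns.exception.Timeout:
        # The DNS operation timed out: warn and continue instead of raising.
        logging.warning("DNS lookup for %s timed out after %.2f seconds; skipping", domain, lifetime)
        return None
```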
