
Spurious failures in CI #3793

Closed
tripleee opened this issue Apr 3, 2020 · 12 comments · Fixed by #3806


tripleee commented Apr 3, 2020

What problem has occurred? What issues has it caused?

Two of my recent PRs had Circle CI failures in code I had not touched. One failed, then succeeded after I rebased my commits; the other exhibited the opposite behavior: the tests passed, then I rebased the commits and force-pushed, and the same code failed.

For the record: #3789 #3790

What would you like to happen/not happen?

The test should not fail spuriously.

I think I had this on my laptop occasionally too, so I don't think it's specific to Circle CI.

The failing code has some comments which vaguely hint at what might be wrong, but then why would they fail only some of the time?


tripleee commented Apr 3, 2020

In #3789, simply rerunning the Circle CI test succeeded.


makyen commented Apr 3, 2020

The reported error is a KeyError: 'items', which implies that the items property didn't exist in the response SD received from the SE API when accessing the /answers/{} endpoint.

My guess at what is happening is that the SE API is rate-limiting the IP address. This may or may not have been preceded by the SE API sending a backoff in a response. The tests which are failing do basically the same thing multiple times in rapid succession, which tends to load the SE API and makes rate limiting more likely. It's also possible the external Circle CI IP address is shared with other testing, which could impact SE API use. Overall, such issues will be intermittent, because when the SE API applies rate limiting depends on the overall load on the servers SE uses for the API.

There was a bug in the !!/allspam code which caused any backoff sent by the SE API, when the !!/allspam code accessed the /answers/{} endpoint, to be ignored. I've fixed that. It's unclear whether that fix will actually resolve this problem, but it should improve the situation.
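
For illustration, a minimal sketch of honoring a backoff field when hitting the /answers/{} endpoint (this is not SmokeDetector's actual code; the function name, filter value, and parameters are placeholders):

```python
import time
import requests

API_BASE = "https://api.stackexchange.com/2.2"  # assumed API version

def fetch_answer(answer_id, site="stackoverflow"):
    response = requests.get(
        "{}/answers/{}".format(API_BASE, answer_id),
        params={"site": site},
    ).json()

    # The API may omit 'items' entirely when throttling, hence the KeyError.
    items = response.get("items", [])

    # If the API asks us to back off, sleep before the caller issues another request.
    backoff = response.get("backoff")
    if backoff:
        time.sleep(backoff)

    return items
```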


makyen commented Apr 3, 2020

The code in that section of the !!/allspam command really should be restructured. Currently, it sends a separate request to the /answers/{} endpoint for each of the user's answers. As long as the user has fewer than 100 answers, this could be handled with a single request to the SE API.
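
A minimal sketch of that restructuring, relying on the SE API accepting up to 100 semicolon-separated IDs per /answers/{ids} request (names and parameters here are illustrative, not SmokeDetector's actual code):

```python
import requests

API_BASE = "https://api.stackexchange.com/2.2"  # assumed API version

def fetch_answers(answer_ids, site="stackoverflow"):
    # One request for up to 100 answers instead of one request per answer.
    ids = ";".join(str(i) for i in answer_ids[:100])
    response = requests.get(
        "{}/answers/{}".format(API_BASE, ids),
        params={"site": site},
    ).json()
    return response.get("items", [])
```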

@tripleee tripleee changed the title Spurious failures in chatcommands test Spurious failures in CI Apr 9, 2020

tripleee commented Apr 9, 2020

Retitled with a larger scope because the tests introduced in #3790 also suffer from this problem, in spades. Both Travis and Circle CI appear to have unreliable DNS. There has been discussion in chat among @makyen, @teward, and myself regarding how exactly to tackle this.

tripleee added a commit that referenced this issue Apr 9, 2020
Spurious failures in CI because of flaky DNS

tripleee commented Apr 9, 2020

(Giftedly copy/pasted a spelling error. Let's not make more changes on mobile.)


teward commented Apr 9, 2020

So let me break this down a little.

CI failures have happened with select domains. These domains are getting what we call SERVFAIL responses in DNS, which means that whatever authoritative DNS server serves records for the domain could not be reached or is misconfigured.

This isn't necessarily a DNS issue in Circle CI or Travis, but rather an issue with the nameservers of the specific domains being tested.

We need to adjust our tests to catch the SERVFAIL cases. Let me poke around the tests myself and see whether we can catch these SERVFAILs and just pass over those tests without hard failing...

@ArtOfCode-
Member

A DNS SERVFAIL should result in a warning in the test environment. It's not something critical that will stop Smokey running, but should be logged so that someone can come around and remove the failing domains.


teward commented Apr 9, 2020

> A DNS SERVFAIL should result in a warning in the test environment. It's not something critical that will stop Smokey running, but should be logged so that someone can come around and remove the failing domains.

Agreed. However, what we've got in CI is that it's hard-failing because an uncaught exception is raising dns.resolver.NoNameservers, which isn't handled in the existing tests.


teward commented Apr 9, 2020

Keep an eye on https://github.com/Charcoal-SE/SmokeDetector/tree/dns-tests

This adds an except handler for NoNameservers errors. This was previously an uncaught error in the tests; now it will be caught and debug-logged just like the other errors we catch, but without resolving any details, because there are no DNS nameservers available for the request. This will, however, catch the error.
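
For reference, a minimal sketch of that kind of handler using dnspython (illustrative only, not the exact code on the dns-tests branch):

```python
import logging
import dns.resolver

def resolve_a_records(domain):
    try:
        return [rdata.to_text() for rdata in dns.resolver.query(domain, "A")]
    except dns.resolver.NoNameservers:
        # SERVFAIL from all nameservers: log it and move on rather than hard-failing.
        logging.debug("No nameservers could resolve %s (SERVFAIL); skipping", domain)
        return None
```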

teward added a commit that referenced this issue Apr 9, 2020
…3806)

* Enable DNS tests, catch on dns.resolver.NoNameservers for SERVFAILs
* Fix log level
* FLAKE fixes.

makyen commented Apr 11, 2020

Unfortunately, we're still getting spurious CI failures, at least on Travis CI: 1, 2. Both of those are:

dns.exception.Timeout: The DNS operation timed out after 30.00…

@makyen makyen reopened this Apr 11, 2020

teward commented Apr 12, 2020

@makyen yet another uncaught exception. We can fix that for the tests too. I'll write up a bit for those failures shortly and get that pushed in.

@teward teward self-assigned this Apr 12, 2020
@teward teward closed this as completed in 185e514 Apr 12, 2020

teward commented Apr 12, 2020

I pushed an additional handler to capture DNS timeouts - a DNS timeout now logs a warning instead of erroring out with an unhandled exception.

It's probable that Travis or Circle CI have janky DNS capabilities, so we'll have to handle DNS lookup errors gracefully instead of letting them go uncaught, which is what caused the spurious CI errors.
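
A minimal sketch of the additional handling, again using dnspython with illustrative names (not the exact code from 185e514):

```python
import logging
import dns.exception
import dns.resolver

def resolve_a_records(domain, lifetime=30.0):
    try:
        return [rdata.to_text() for rdata in dns.resolver.query(domain, "A", lifetime=lifetime)]
    except dns.exception.Timeout:
        # The DNS operation timed out: warn and continue instead of raising.
        logging.warning("DNS lookup for %s timed out after %.2f seconds; skipping", domain, lifetime)
        return None
```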
