-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net/http: slight scheduling changes makes TestServerWriteTimeout/h2 flaky (primarily wasm) #65410
Comments
Oh, I forgot to mention, @bcmills managed to reproduce this repeated internal timeout in the test on linux-amd64 by tuning just the right timeouts and adding some strategic sleeps. |
I've added a comment with a bit more detail about this race in the test, in https://go-review.git.corp.google.com/c/go/+/559835/4/src/net/http/serve_test.go. |
Change https://go.dev/cl/559835 mentions this issue: |
|
I believe what's happening is:
We've got a mechanism that to avoid an infinite retry loop when HTTP/1 connections constantly fail (if we make a request using a new connection, and that request fails, then don't retry). We don't have one for HTTP/2, and this is complicated by the fact that The right long-term fix is to unify the connection pools and simplify connection management, but that's a big project. (Which I'm in the middle of, but it's slow going.) There's probably a lower impact fix we can do here, though. Perhaps we can apply the "don't retry if this connection has never had a successful round trip" logic to HTTP/2 connections as well? |
https://go.dev/issue/16582 is going to be relevant here as a case where a HTTP/2 |
@bcmills and I have been working on addressing some scheduling concerns with net on wasm (#65178, #64317), notably via https://go.dev/cl/557437 and https://go.dev/cl/558915.
Each of these CLs works fine in isolation, but when applied together,
TestServerWriteTimeout/h2
started timing out: https://ci.chromium.org/ui/p/golang/builders/ci/gotip-wasip1-wasm_wasmtime/b8757522083839877441/overview (possibly others as well, masked behind the timeout).This failure can be reproduced at tip by cherry-picking https://go.dev/cl/557437 and running
GOARCH=wasm GOOS=wasip1 go test -v -run TestServerWriteTimeout/h2
.I dug into this failure a bit and believe it is due to bad retries within net/http and/or bad expectations of the test itself. My analysis below. The instrumentation I added can be found in https://go.dev/cl/559797.
First of all, without these CLs, this test passes on both linux-amd64 and on wasm, but the latter is more exciting:
linux-amd64:
wasipi1-wasm:
The linux-amd64 version simply passed, but the wasm version encountered 2 timeouts before passing. This test contains a top-level timeout that increases on retry. The initial 5ms timeout was apparently insufficient and it had to increase twice before the test succeeded.
If the fact that the wasm version of this test consistently has a few timeouts here before passing is fundamentally problematic, you can stop reading here and the focus should be to understand and fix that.
Assuming otherwise, here's what happens with both CLs:
We're getting seemingly similar timeouts, but now they are occurring over and over again. Based on logging from my instrumentation CL https://go.dev/cl/559797, what is happening here is:
In the passing case, the timeout error ends up coming out of
Transport.getConn
toTransport.roundTrip
here. There is no retry logic here, so the error ultimately bubbles all the way up to the test, which increases the timeout and retries.In the failing case, the timeout errors ends up coming out of the
pconn
RoundTrip call just a few lines later. This error has a conditional internal retry mechanism (L615-643), which decides that this error warrants a retry and internally loops and retries withinroundTrip
. Because it never went back to the test, the timeout is the same, and the same problem occurs over and over again.It is unclear to me what the "correct" behavior is. Perhaps timeouts should not be retryable? Or perhaps it is unreasonable for the test to expect a request to complete with the same initial timeout? Either way, it seems strange to me that the same timeout error from
getConn
vsRoundTrip
has different behavior, given that it seems to be just luck/timing which call gets the error.cc @neild @bcmills
The text was updated successfully, but these errors were encountered: