runtime: scheduler sometimes starves a runnable goroutine on wasm platforms #65178

Closed
bcmills opened this issue Jan 19, 2024 · 33 comments
Labels: arch-wasm (WebAssembly issues) · compiler/runtime (Issues related to the Go compiler and/or runtime) · NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one)

@bcmills (Contributor) commented Jan 19, 2024

#!watchflakes
post <- goarch == "wasm" && pkg == "golang.org/x/net/http2" && test == "TestServer_Push_RejectAfterGoAway" && `timeout`

Go version

go version devel go1.23-1653833811 Fri Jan 19 19:23:40 2024 +0000 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/usr/local/google/home/bcmills/.cache/go-build'
GOENV='/usr/local/google/home/bcmills/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/usr/local/google/home/bcmills/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/usr/local/google/home/bcmills'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/google/home/bcmills/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/local/google/home/bcmills/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='devel go1.23-1653833811 Fri Jan 19 19:23:40 2024 +0000'
GCCGO='/usr/bin/gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/usr/local/google/home/bcmills/go/src/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1312466808=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Add the following test to the time package:

func TestAfterFuncStarvation(t *testing.T) {
	var (
		wg sync.WaitGroup
		stop atomic.Bool
		c = make(chan bool, 1)
	)

	wg.Add(2)
	go func() {
		for !stop.Load() {
			c <- true
		}
		close(c)
		wg.Done()
	}()
	go func() {
		for range c {}
		wg.Done()
	}()

	AfterFunc(1 * Microsecond, func() { stop.Store(true) })
	wg.Wait()
}

Then run it using:

GOARCH=wasm GOOS=wasip1 go test time -run=TestAfterFuncStarvation -count=10000

What did you see happen?

~/go/src$ GOARCH=wasm GOOS=wasip1 go test time -run=TestAfterFuncStarvation -count=10000
panic: test timed out after 10m0s
running tests:
        TestAfterFuncStarvation (9m56s)

goroutine 13835 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2366 +0x44
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5

goroutine 1 [chan receive]:
testing.(*T).Run(0x178b040, {0x6a007, 0x13}, 0xde460)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1750 +0x44
testing.runTests.func1(0x178b040)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2161 +0x3
testing.tRunner(0x178b040, 0x1499c80)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
testing.runTests(0x140a2a0, {0x37c140, 0x7f, 0x7f}, {0xc162d83908280698, 0x8bb3084da4, 0x37faa0})
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2159 +0x49
testing.(*M).Run(0x144e140)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2027 +0x6e
main.main()
        _testmain.go:445 +0x9

goroutine 13833 [runnable]:
time_test.TestAfterFuncStarvation.func2()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:102 +0x4
created by time_test.TestAfterFuncStarvation in goroutine 13831
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:101 +0xd

goroutine 13831 [semacquire, 9 minutes]:
sync.runtime_Semacquire(0x1790198)
        /usr/local/google/home/bcmills/go/src/runtime/sema.go:62 +0x2
sync.(*WaitGroup).Wait(0x1790190)
        /usr/local/google/home/bcmills/go/src/sync/waitgroup.go:116 +0xf
time_test.TestAfterFuncStarvation(0x178b1e0)
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:107 +0x12
testing.tRunner(0x178b1e0, 0xde460)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
created by testing.(*T).Run in goroutine 1
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1742 +0x43

goroutine 13832 [chan send]:
time_test.TestAfterFuncStarvation.func1()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:96 +0x3
created by time_test.TestAfterFuncStarvation in goroutine 13831
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:94 +0x9

goroutine 13834 [runnable]:
time_test.TestAfterFuncStarvation.func3()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:106
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5
FAIL    time    600.293s
FAIL

What did you expect to see?

The test completes 1000 iterations on wasip1/wasm in about 1.2 seconds on my machine, so 10000 iterations should only take ~12 seconds:

~/go/src$ GOARCH=wasm GOOS=wasip1 go test time -run=TestAfterFuncStarvation -count=1000
ok      time    1.228s

Running native code on linux/amd64 is much faster still, but that's to be expected:

go test time -run=TestAfterFuncStarvation -count=10000
ok      time    0.587s

(attn @golang/wasm)

gopherbot added the compiler/runtime (Issues related to the Go compiler and/or runtime) label Jan 19, 2024
bcmills added the arch-wasm (WebAssembly issues) label Jan 19, 2024
@bcmills (Contributor, Author) commented Jan 19, 2024

This also affects js/wasm: 100 iterations take <600ms, so 1000 iterations should take <6s (but times out after a full minute):

~/go/src$ GOARCH=wasm GOOS=js go test time -run=TestAfterFuncStarvation -count=100
ok      time    0.519s

~/go/src$ GOARCH=wasm GOOS=js go test time -run=TestAfterFuncStarvation -count=1000 -timeout=1m
panic: test timed out after 1m0s
running tests:
        TestAfterFuncStarvation (1m0s)

goroutine 2980 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2366 +0x44
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5

goroutine 1 [chan receive]:
testing.(*T).Run(0x151fa00, {0x6e19f, 0x17}, 0xe09b0)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1750 +0x44
testing.runTests.func1(0x151fa00)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2161 +0x3
testing.tRunner(0x151fa00, 0x1493c80)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
testing.runTests(0x140a360, {0x37fc60, 0x7f, 0x7f}, {0xc162d954b7abb580, 0xdf9e00901, 0x38b6e0})
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2159 +0x49
testing.(*M).Run(0x144c3c0)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2027 +0x6e
main.main()
        _testmain.go:445 +0x9

goroutine 6 [waiting]:
runtime.gopark(0x0, 0x0, 0x0, 0x0, 0x1)
        /usr/local/google/home/bcmills/go/src/runtime/proc.go:402 +0x25
runtime.handleEvent()
        /usr/local/google/home/bcmills/go/src/runtime/lock_js.go:296 +0x25
runtime.goexit({})
        /usr/local/google/home/bcmills/go/src/runtime/asm_wasm.s:434 +0x1

goroutine 2976 [semacquire]:
sync.runtime_Semacquire(0x140c188)
        /usr/local/google/home/bcmills/go/src/runtime/sema.go:62 +0x2
sync.(*WaitGroup).Wait(0x140c180)
        /usr/local/google/home/bcmills/go/src/sync/waitgroup.go:116 +0xf
time_test.TestAfterFuncStarvation(0x151fba0)
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:107 +0x12
testing.tRunner(0x151fba0, 0xe09b0)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
created by testing.(*T).Run in goroutine 1
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1742 +0x43

goroutine 2977 [chan send]:
time_test.TestAfterFuncStarvation.func1()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:96 +0x3
created by time_test.TestAfterFuncStarvation in goroutine 2976
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:94 +0x9

goroutine 2978 [runnable]:
time_test.TestAfterFuncStarvation.func2()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:102 +0x4
created by time_test.TestAfterFuncStarvation in goroutine 2976
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:101 +0xd

goroutine 2979 [runnable]:
time_test.TestAfterFuncStarvation.func3()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:106
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5
FAIL    time    60.358s
FAIL

@prattmic (Member):

Does this also affect linux-amd64 if GOMAXPROCS=1?

@bcmills (Contributor, Author) commented Jan 19, 2024

No, it does not: if I update the test to explicitly call runtime.GOMAXPROCS(1), it continues to pass even after 100000 runs.
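
For reference, a minimal sketch of that variant (where exactly the call goes in the test is an assumption; it just needs to run before the goroutines start):

func TestAfterFuncStarvation(t *testing.T) {
	// Force a single P, mimicking the single-threaded wasm configuration.
	// GOMAXPROCS returns the previous value, so the defer restores it.
	defer runtime.GOMAXPROCS(runtime.GOMAXPROCS(1))

	// ... rest of the test as in the original report ...
}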

@gopherbot (Contributor):

Change https://go.dev/cl/557037 mentions this issue: net: work around runtime scheduler starvation on js and wasip1

@gopherbot (Contributor):

Change https://go.dev/cl/557038 mentions this issue: time: add a regression test for time.AfterFunc goroutine starvation

@prattmic (Member) commented Jan 19, 2024

At a theoretical level, this doesn't surprise me, and I would expect it to impact linux-amd64 GOMAXPROCS=1.

This test always has at least one runnable goroutine (one of the channel send/recv goroutines). In fact, because the channel is buffered, in theory both goroutines may be runnable all of the time.

When the timer fires, a third goroutine will become runnable [1].

We have one thread on which to run three goroutines, so there must be some scheduling delay. Go does not have a real-time scheduler, or even a best-effort deadline-based scheduler, so it can theoretically have unbounded scheduling delay.

Readying a goroutine typically puts it on the head of the run queue to get it to run soon. But since that applies to almost every ready, the timer firing could put goroutine 3 on the head of the run queue, and then goroutine 1 or 2 could replace the head before the scheduler runs again.

If you dump the /sched/latencies:seconds histogram from runtime/metrics, I'd expect to see a long tail of long scheduling delays.
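
For example, a minimal sketch of reading that histogram with runtime/metrics (the bucket printing is illustrative):

import (
	"fmt"
	"runtime/metrics"
)

func dumpSchedLatencies() {
	s := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindFloat64Histogram {
		return // metric not supported by this toolchain
	}
	h := s[0].Value.Float64Histogram()
	// Buckets holds len(Counts)+1 boundaries; print only the non-empty buckets.
	for i, c := range h.Counts {
		if c == 0 {
			continue
		}
		fmt.Printf("[%g, %g) s: %d\n", h.Buckets[i], h.Buckets[i+1], c)
	}
}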

There are certainly many improvements that could go into the scheduler to try to make this situation better (#51071 comes to mind).

The more interesting case would be if this is actually completely stuck. i.e., the timer expiration or ready goroutine was somehow lost altogether. Figuring that out by running a test is a bit of a halting problem situation, but looking at runtime/metrics is probably a good start. Do we have many cases of long scheduling delay (but not enough to cause a timeout)? Or do things seem totally fine until one case gets stuck forever?

[1] We always run timers before looking for other work in the scheduler, so the timer should always fire. But an AfterFunc timer just starts a goroutine to run the func, which is the third goroutine.
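
For context, the timer callback used by AfterFunc (time.goFunc, visible in the traces above) does essentially nothing except start that new goroutine; roughly (paraphrasing time/sleep.go, exact signature varies across Go versions):

func goFunc(arg any, seq uintptr) {
	// AfterFunc's timer callback: run the user's func on a fresh goroutine,
	// which is the third goroutine described above.
	go arg.(func())()
}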

@bcmills (Contributor, Author) commented Jan 19, 2024

I agree that the runtime is theoretically allowed to produce this behavior according to the language spec.

However, the addition of non-cooperative preemption (#24543) set up a strong expectation that every runnable goroutine will eventually be scheduled, and resolved a large number of previous scheduler fairness issues (linked from that proposal).

We also explicitly made mutex locking starvation-free in #13086. To me, that suggests that this sort of goroutine starvation is not intended behavior.

This sort of starvation shows up in realistic programs that use time.AfterFunc to interrupt two or more running goroutines, as might be the case in a benchmark or stress-test. (See, for example, #65177.) It is sometimes possible to work around that starvation by making explicit calls to runtime.Gosched (as in https://go.dev/cl/557037), but arguably that's exactly the sort of explicit scheduler management that #24543 was intended to address.
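
Purely as an illustration of that workaround pattern (this is not the actual CL 557037 change, which is in the net fake-net stack), the sender loop from the test above could yield explicitly:

	go func() {
		for !stop.Load() {
			c <- true
			// Explicitly yield so the AfterFunc goroutine gets a chance to run.
			runtime.Gosched()
		}
		close(c)
		wg.Done()
	}()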

bcmills changed the title from "runtime: scheduler sometimes starves a runnable goroutine on wasip1/wasm" to "runtime: scheduler sometimes starves a runnable goroutine on wasm platforms" Jan 19, 2024
gopherbot pushed a commit that referenced this issue Jan 19, 2024
Fixes #65177.
Updates #65178.
Updates #64321.

Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32
Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/557037
Auto-Submit: Bryan Mills <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
@prattmic (Member):

I think I worded my previous message poorly and came across too strongly. I don't think this is good behavior; I meant that there are two broad classes this problem may fall into:

  1. There is nothing fundamentally wrong, but the existing design is producing poor behavior.
  2. Something is fundamentally broken, and this wakeup is lost/gone and no amount of waiting would resolve it.

@bcmills (Contributor, Author) commented Jan 20, 2024

Ah, yeah. I think the evidence so far supports (1) — the goroutine dumps from the timed-out tests list the AfterFunc goroutine as runnable, it just isn't actually being run.

(The fact that it is listed as runnable, and the fact that adding an explicit runtime.Gosched in https://go.dev/cl/557037 seems to fix the starvation there, suggest that the goroutine is successfully being marked schedulable, it just isn't getting scheduled.)

@Jorropo (Member) commented Jan 22, 2024

I thought this might be working on amd64 because the goroutines are being non-cooperatively preempted by the GC, which could restart execution at effectively random points. But trying with GOGC=off GOMAXPROCS=1 still doesn't let me reproduce it.

@Jorropo (Member) commented Jan 22, 2024

An even more minimal example reproduces for me on wasip1:

 func TestAfterFuncStarvation(t *testing.T) {
 	var (
 		wg sync.WaitGroup
 		stop atomic.Bool
-		c = make(chan bool, 1)
+		c = make(chan bool)
 	)
 
 	wg.Add(2)
 	go func() {
 		for !stop.Load() {
 			c <- true
 		}
 		close(c)
 		wg.Done()
 	}()
 	go func() {
 		for range c {}
 		wg.Done()
 	}()
 
 	AfterFunc(1 * Microsecond, func() { stop.Store(true) })
 	wg.Wait()
 }

It no longer fails reliably with -count=1000, but with -count=10000 it does (10/10 attempts). I'm not sure what to conclude from that, however.

@Jorropo (Member) commented Jan 22, 2024

Even more surprisingly, this works on GOARCH=amd64 GOMAXPROCS=1 (even with GOGC=off):

func TestAfterFuncStarvation2(t *testing.T) {
	var stop atomic.Bool
	AfterFunc(1 * Microsecond, func() { stop.Store(true) })
	for !stop.Load() {}
}

But, as expected, it never completes on wasip1.

@Jorropo (Member) commented Jan 22, 2024

I am able to reproduce on GOARCH=amd64 GOMAXPROCS=1 by adding && false to this if clause:

if GOARCH != "wasm" { // no threads on wasm yet, so no sysmon

-count=10000 passes in 0.173s, but -count=100000 times out with equivalent stack traces.

This explains why wasm is so special here: other arches have a watchdog.

gopherbot pushed a commit that referenced this issue Jan 22, 2024
The test is skipped on wasm platforms for now, because it
successfully detects a starvation bug on those platforms.

For #65178.

Change-Id: I05d28f1c7be99fcab67ec4dfaa38f412e11fd3cb
Reviewed-on: https://go-review.googlesource.com/c/go/+/557038
Auto-Submit: Bryan Mills <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
@prattmic (Member):

Thanks for digging @Jorropo! sysmon preempts long-running goroutines, so without that preemption, the tight stop.Load() loop won't ever yield. That said, even with sysmon, the right combinations of goroutines should still theoretically be able to trigger long wait times.

Still, the lack of sysmon on wasm makes me wonder if we should be more aggressive about self-preemption there.

@Jorropo (Member) commented Jan 22, 2024

I had the same question. I don't see how to solve the tight loop without the compiler aggressively injecting runtime.Gosched() into loops (maybe with a local counter so it only yields every X iterations, or based on monotonic time if that is cheap?).

I find it interesting that the original example reproduces on amd64 after I disable sysmon.
I haven't looked at the scheduler code in a while, but if I remember correctly there is a special slot in the work list for a hot goroutine. Could it be that the two goroutines keep swapping through that hot slot?
I guess that after every 4096 schedules (or whatever threshold; it should probably be based on monotonic time), we should pick a random goroutine from our work queue instead of the hot one (or whatever is first in the queue).

@aclements (Member):

> I guess that after every 4096 schedules (or whatever threshold; it should probably be based on monotonic time), we should pick a random goroutine from our work queue instead of the hot one (or whatever is first in the queue).

On non-wasm platforms, we already do this using the "inheritTime" result from runqget. What's supposed to be happening is that goroutines can switch off in runnext (the "hot slot") but they share a single time slice, and will get preempted for another goroutine if that time slice runs out. But sysmon is responsible for time slice preemption. You're right that without that, things can be arbitrarily unfair.

@gopherbot (Contributor):

Change https://go.dev/cl/557615 mentions this issue: http2: avoid goroutine starvation in TestServer_Push_RejectAfterGoAway

gopherbot pushed a commit to golang/net that referenced this issue Jan 22, 2024
CL 557037 added a runtime.Gosched to prevent goroutine starvation
in the wasm fake-net stack. Unfortunately, that Gosched causes the
scheduler to enter a very similar starvation loop in this test.

Add another runtime.Gosched to break this new loop.

For golang/go#65178.

Change-Id: I24b3f50dd728800462f71f27290b0d8f99d5ae5b
Cq-Include-Trybots: luci.golang.try:x_net-gotip-wasip1-wasm_wasmtime,x_net-gotip-wasip1-wasm_wazero,x_net-gotip-js-wasm
Reviewed-on: https://go-review.googlesource.com/c/net/+/557615
Auto-Submit: Bryan Mills <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
@gopherbot (Contributor):

Change https://go.dev/cl/557437 mentions this issue: runtime: disable use of runnext on wasm

@prattmic (Member):

With https://go.dev/cl/557437, the original repro no longer hangs. This does not do anything to impact goroutines that simply never yield at all, like #65178 (comment).

@bcmills (Contributor, Author) commented Jan 22, 2024

That sounds great! I think it's reasonable to fix only the goroutines that actually yield for now, since the remaining sort of live-lock is much easier to diagnose. (For the single-goroutine case, the live-locked goroutine will actually show as running in a goroutine dump, as opposed to “running some of the time”.)

cherrymui added the NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one) label Jan 23, 2024
cherrymui added this to the Backlog milestone Jan 23, 2024
@gopherbot (Contributor):

Change https://go.dev/cl/559835 mentions this issue: net/http: set a client timeout in TestServerWriteTimeout

@gopherbot (Contributor):

Change https://go.dev/cl/559798 mentions this issue: runtime: disable use of runnext on wasm

ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
Fixes golang#65177.
Updates golang#65178.
Updates golang#64321.

Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32
Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/557037
Auto-Submit: Bryan Mills <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
The test is skipped on wasm platforms for now, because it
successfully detects a starvation bug on those platforms.

For golang#65178.

Change-Id: I05d28f1c7be99fcab67ec4dfaa38f412e11fd3cb
Reviewed-on: https://go-review.googlesource.com/c/go/+/557038
Auto-Submit: Bryan Mills <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
When readying a goroutine, the scheduler typically places the readied
goroutine in pp.runnext, which will typically be the next goroutine to
run in the schedule.

In order to prevent a set of ping-pong goroutines from simply switching
back and forth via runnext and starving the rest of the run queue, a
goroutine scheduled via runnext shares a time slice (pp.schedtick) with
the previous goroutine.

sysmon detects "long-running goroutines", which really means Ps using
the same pp.schedtick for too long, and preempts them to allow the rest
of the run queue to run. Thus this avoids starvation via runnext.

However, wasm has no threads, and thus no sysmon. Without sysmon to
preempt, the possibility for starvation returns. Avoid this by disabling
runnext entirely on wasm. This means that readied goroutines always go
on the end of the run queue and thus cannot starve via runnext.

Note that this CL doesn't do anything about single long-running
goroutines. Without sysmon to preempt them, a single goroutine that
fails to yield will starve the run queue indefinitely.

For golang#65178.

Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero
Change-Id: I7dffe1e72c6586474186b72f8068cff77b661eae
Reviewed-on: https://go-review.googlesource.com/c/go/+/557437
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Mauri de Souza Meneguzzo <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Bryan Mills <[email protected]>
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
This reverts CL 557437.

Reason for revert: Appears to have broken wasip1 builders.

For golang#65178.

Change-Id: I59c1a310eb56589c768536fe444c1efaf862f8b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/559237
LUCI-TryBot-Result: Go LUCI <[email protected]>
Auto-Submit: Michael Pratt <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
@bcmills (Contributor, Author) commented Feb 22, 2024

@gopherbot, please backport to Go 1.22. The additional tests enabled in https://go.dev/cl/526117 made it more likely for this bug to be exposed in test runs, which may cause flakiness in pre-release (and backport CL) testing.

@gopherbot (Contributor):

Backport issue(s) opened: #65883 (for 1.22).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

gopherbot added the Testing (An issue that has been verified to require only test changes, not just a test failure) label Feb 22, 2024
@gopherbot (Contributor):

Change https://go.dev/cl/566175 mentions this issue: [release-branch.go1.22] net: work around runtime scheduler starvation on js and wasip1

bcmills removed the Testing (An issue that has been verified to require only test changes, not just a test failure) label Feb 22, 2024
gopherbot pushed a commit that referenced this issue Feb 26, 2024
This breaks an unbounded client-side retry loop if the server's
timeout happens to fire during its final read of the TLS handshake.

The retry loop was observed on wasm platforms at CL 557437.
I was also able to reproduce chains of dozens of retries on my
linux/amd64 workstation by adjusting some timeouts and adding a couple
of sleeps, as in this patch:
https://gist.github.com/bcmills/d0a0a57e5f64eebc24e8211d8ea502b3
However, on linux/amd64 on my workstation the test always eventually
breaks out of the retry loop due to timing jitter.

I couldn't find a retry-specific hook in the http.Client,
http.Transport, or tls.Config structs, so I have instead abused the
Transport.Proxy hook for this purpose. Separately, we may want to
consider adding a retry-specific hook, or changing the net/http
implementation to avoid transparently retrying in this case.

Fixes #65410.
Updates #65178.

Change-Id: I0e43c039615fe815f0a4ba99a8813c48b1fdc7e6
Reviewed-on: https://go-review.googlesource.com/c/go/+/559835
Reviewed-by: Damien Neil <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Auto-Submit: Bryan Mills <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
gopherbot pushed a commit that referenced this issue Feb 27, 2024
When readying a goroutine, the scheduler typically places the readied
goroutine in pp.runnext, which will typically be the next goroutine to
run in the schedule.

In order to prevent a set of ping-pong goroutines from simply switching
back and forth via runnext and starving the rest of the run queue, a
goroutine scheduled via runnext shares a time slice (pp.schedtick) with
the previous goroutine.

sysmon detects "long-running goroutines", which really means Ps using
the same pp.schedtick for too long, and preempts them to allow the rest
of the run queue to run. Thus this avoids starvation via runnext.

However, wasm has no threads, and thus no sysmon. Without sysmon to
preempt, the possibility for starvation returns. Avoid this by disabling
runnext entirely on wasm. This means that readied goroutines always go
on the end of the run queue and thus cannot starve via runnext.

Note that this CL doesn't do anything about single long-running
goroutines. Without sysmon to preempt them, a single goroutine that
fails to yield will starve the run queue indefinitely.

For #65178.

Change-Id: I10859d088776125a2af8c9cd862b6e071da628b5
Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/559798
Auto-Submit: Bryan Mills <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
gopherbot pushed a commit that referenced this issue Feb 28, 2024
… on js and wasip1

For #65883.
Updates #65177.
Updates #65178.
Updates #64321.

Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32
Cq-Include-Trybots: luci.golang.try:go1.22-js-wasm,go1.22-wasip1-wasm_wasmtime,go1.22-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/557037
Auto-Submit: Bryan Mills <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
(cherry picked from commit f19f31f)
Reviewed-on: https://go-review.googlesource.com/c/go/+/566175
Reviewed-by: Dmitri Shuralyov <[email protected]>
@bcmills (Contributor, Author) commented Mar 4, 2024

@prattmic, with https://go.dev/cl/559798 merged, should we close this issue? (Or is there more to do?)

@bcmills (Contributor, Author) commented Mar 4, 2024

I guess this doesn't address the case where a single long-running goroutine never yields. Not sure whether we want to consider that in-scope for this issue, or treat it as a separate feature request to support goroutine preemption on js/wasm. 🤔

bradfitz pushed a commit to tailscale/go that referenced this issue Mar 5, 2024
… on js and wasip1

For golang#65883.
Updates golang#65177.
Updates golang#65178.
Updates golang#64321.

Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32
Cq-Include-Trybots: luci.golang.try:go1.22-js-wasm,go1.22-wasip1-wasm_wasmtime,go1.22-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/557037
Auto-Submit: Bryan Mills <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
(cherry picked from commit f19f31f)
Reviewed-on: https://go-review.googlesource.com/c/go/+/566175
Reviewed-by: Dmitri Shuralyov <[email protected]>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Mar 6, 2024
… on js and wasip1

For golang#65883.
Updates golang#65177.
Updates golang#65178.
Updates golang#64321.

Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32
Cq-Include-Trybots: luci.golang.try:go1.22-js-wasm,go1.22-wasip1-wasm_wasmtime,go1.22-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/557037
Auto-Submit: Bryan Mills <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
(cherry picked from commit f19f31f)
Reviewed-on: https://go-review.googlesource.com/c/go/+/566175
Reviewed-by: Dmitri Shuralyov <[email protected]>
@prattmic (Member) commented Mar 7, 2024

I think that should be a separate issue, or at least we should rename this to be more clearly about async preemption.

@labulakalia commented Dec 30, 2024

This issue still exists on the master branch.
Here is a code example.

Build Wasm

$ gotip version
go version devel go1.24-2b794ed Fri Dec 27 17:23:24 2024 -0800 linux/amd64

Example code:

//net.go
package main

import (
	"fmt"
	"log"
	"log/slog"
	"net"
	"os"
	"time"
)

func main() {
	slog.SetLogLoggerLevel(slog.LevelDebug)
	f := os.NewFile(4, "")
	l, err := net.FileListener(f)
	defer f.Close()
	if err != nil {
		log.Fatalln("111", err)
	}
	fmt.Println("start listen")
	defer l.Close()
	for {
		fmt.Println("start accept")

		conn, err := l.Accept()
		if err != nil {
			log.Fatalln(err)
		}
		fmt.Println("start accept", conn.RemoteAddr())
		go func() {
			for {
				fmt.Println("start read")
				data := make([]byte, 1024)
				_, err = conn.Read(data)
				if err != nil {
					log.Println(err)
				}
			}
		}()
		conn.Write([]byte("test"))
		// time.Sleep(time.Second) // without this line, `fmt.Println("start read")` never runs
	}
}

build wasm

GOOS=wasip1 GOARCH=wasm gotip build -o net.wasm net.go

Run wasm using wazero

Run the wasm module on the wazero runtime:

//wazero.go
package main

import (
	"context"
	"crypto/rand"
	_ "embed"
	"log"
	"log/slog"
	"net/http"
	"os"

	"github.com/tetratelabs/wazero"
	experimentalsock "github.com/tetratelabs/wazero/experimental/sock"
	"github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"
)


//go:embed net.wasm
var netWasm []byte
func main() {
	slog.SetLogLoggerLevel(slog.LevelDebug)

	sockCfg := experimentalsock.NewConfig().WithTCPListener("127.0.0.1", 19971)
	ctx := experimentalsock.WithConfig(context.Background(), sockCfg)

	r := wazero.NewRuntime(ctx)
	wasi_snapshot_preview1.MustInstantiate(ctx, r)
	conf := wazero.NewModuleConfig().
		WithFS(os.DirFS(".")).
		WithStdout(os.Stdout).
		WithStderr(os.Stderr).
		WithStdin(os.Stdin).
		WithRandSource(rand.Reader).
		WithSysNanosleep().
		WithSysNanotime().
		WithSysWalltime().WithArgs("wasi", "sock")
	_, err := r.InstantiateWithConfig(ctx, netWasm, conf)
	if err != nil {
		log.Panicln(err)
	}
}

Send msg

nc 127.0.0.1 19971
test1
1
1
1
1
$ gotip run example.go 
start listen
start accept
start accept :0
start accept

Note: the fmt.Println("start read") line never runs; if the time.Sleep is added, "start read" is printed.
@prattmic

@labulakalia:

From https://go.dev/blog/wasi:

"Wasm is a single threaded architecture with no parallelism. The scheduler can still schedule goroutines to run concurrently, and standard in/out/error is non-blocking, so a goroutine can execute while another reads or writes, but any host function calls (such as requesting random data using the example above) will cause all goroutines to block until the host function call has returned."

So this behavior is as documented.

@prattmic (Member) commented Jan 6, 2025

@labulakalia If I understand correctly, the conn.Write blocks until there is a reader, but because we are blocked in write, the reader will never run. You seem to have found some documentation about this with #65178 (comment).

I don't think there is much we can do about this (FD 4 needs to be non-blocking, maybe we can make that easier somehow).

Regardless, this isn't the same issue that was being discussed here (though similar), so please open a new issue if you think we could do better here.

Per #65178 (comment), this issue should be closed in favor of #71134.

prattmic closed this as completed Jan 6, 2025
The github-project-automation bot moved this from In Progress to Done in Go Compiler / Runtime Jan 6, 2025