
Redispatch local tasks before thread parking in runBlocking #4312

Closed · wants to merge 1 commit

Conversation

0x746143

The original issue is in Ktor KTOR-7986. A coroutine deadlock occurs when runBlocking is used inside coroutines (unsurprisingly). While it is widely discouraged to call runBlocking from a suspend function, the deadlock in this case is not a typical thread deadlock (see technical details below).

Unlike most other deadlocks, there is a potential solution to prevent this type of deadlock. I propose redispatching coroutine tasks from the local queue of the current worker thread before it is parked. I think having an imperfect but functional solution (redispatching local tasks) is better than one that strictly adheres to best practices but fails in real practice (do not use runBlocking inside coroutines).

As described in the original issue, the function that uses runBlocking is not a suspend function in the Ktor project; a coroutine context only appears in the test environment. While this is not ideal, as noted in the Ktor issue, such a scenario can arise in large projects with complex interactions, where a non-suspend function containing runBlocking ends up being called from a coroutine context in some non-standard case.

Technical details from the Ktor issue:

The following is a simplified version of the Ktor code (used in ActorSelectorManager and SocketImpl):

import java.util.concurrent.CountDownLatch
import kotlinx.coroutines.*
import kotlin.coroutines.*
import kotlin.coroutines.intrinsics.*
import kotlin.test.Test

class DeadlockTest {

    @Test(timeout = 1000)
    fun testDeadlock() {
        runBlocking(Dispatchers.IO + CoroutineName("testApplication")) {
            val latch = CountDownLatch(1)
            lateinit var contSelector: Continuation<Unit>
            lateinit var contSocket: Continuation<Unit>
            CoroutineScope(Dispatchers.IO + CoroutineName("selector")).launch {
                // Suspend without intercepting: the continuation will resume
                // synchronously on whichever thread calls resume().
                suspendCoroutineUninterceptedOrReturn {
                    contSelector = it
                    latch.countDown()
                    COROUTINE_SUSPENDED
                }
                // yield() puts the rest of this coroutine into the local queue
                // of the current worker thread, because the selector coroutine
                // uses the same dispatcher (Dispatchers.IO).
                yield()
                contSocket.resume(Unit)
            }
            latch.await()
            // The nested runBlocking parks the very worker thread whose local
            // queue holds the yielded task, so contSocket is never resumed.
            runBlocking(CoroutineName("socket")) {
                suspendCancellableCoroutine {
                    contSocket = it
                    contSelector.resume(Unit)
                }
            }
        }
    }
}

Since an unintercepted Continuation is used, the selector coroutine resumes synchronously in the same stack frame and hence the same thread, behaving similarly to the Unconfined dispatcher. The yield() function dispatches a new coroutine task to the local queue of the current worker thread because it belongs to the same dispatcher specified for the selector coroutine (Dispatchers.IO). Next, the selector coroutine is suspended, and the thread is released. However, this thread is subsequently blocked by runBlocking(CoroutineName("socket")). As a result, the previously created coroutine task gets stuck in the local queue of the blocked thread.
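To see the Unconfined-like behavior in isolation, here is a minimal stdlib-only sketch (no kotlinx.coroutines involved; the variable names are illustrative): resuming an unintercepted continuation runs the remainder of the coroutine synchronously, in the same stack frame and on the same thread as the `resume()` call.

```kotlin
import kotlin.coroutines.*
import kotlin.coroutines.intrinsics.*

fun main() {
    var cont: Continuation<Unit>? = null

    val body: suspend () -> Unit = {
        // Suspend without an interceptor/dispatcher.
        suspendCoroutineUninterceptedOrReturn<Unit> {
            cont = it
            COROUTINE_SUSPENDED
        }
        // This line runs synchronously inside cont.resume(Unit) below,
        // on the resuming thread, much like Dispatchers.Unconfined.
        println("resumed on ${Thread.currentThread().name}")
    }
    body.startCoroutine(Continuation(EmptyCoroutineContext) { })

    println("before resume on ${Thread.currentThread().name}")
    cont!!.resume(Unit)
}
```

Both lines print the same thread name, confirming that the resumed coroutine did not hop to another thread.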

@0x746143 changed the base branch from master to develop on December 24, 2024
@dkhalanskyjb
Collaborator

You are right, this is indeed different from our usual deadlock. Thank you for the well-presented reproducer! Could you file an issue with this so that we can discuss the best way to solve this (if not, I'll do it)? Typically, it's better to start with a filed issue instead of jumping directly to an implementation.

The solution that comes to mind is that runBlocking should awaken other threads in the scheduler so that they could steal the job from the local queue. See https://pl.kotl.in/QBE8w3-2g : the existence of another I/O thread resolves the deadlock, but this should be ensured automatically somehow.
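For reference, one manual mitigation available in user code today (a sketch only, not the scheduler-level fix discussed in this thread) is to move the nested runBlocking onto a dedicated thread, so that no Dispatchers.IO worker with a populated local queue is ever parked. The names here (`blockingPool`, `safeBlockingCall`) are hypothetical:

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.Executors

// A dedicated single-thread dispatcher for code that must call runBlocking.
// Parking this thread strands no Dispatchers.IO local-queue tasks.
val blockingPool = Executors.newSingleThreadExecutor().asCoroutineDispatcher()

// A non-suspend function that (regrettably) uses runBlocking internally.
fun blockingCall(): String = runBlocking {
    delay(10)  // stand-in for the real blocking work
    "done"
}

// Hop off the shared IO worker before blocking it.
suspend fun safeBlockingCall(): String =
    withContext(blockingPool) { blockingCall() }
```

This trades a thread for safety, which is exactly the kind of manual bookkeeping an automatic scheduler-side solution would make unnecessary.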

> I think having an imperfect but functional solution (redispatching local tasks) is better than one that strictly adheres to best practices but fails in real practice (do not use runBlocking inside coroutines).

The compromise you are proposing is not as clear-cut as it may seem. This solution may lead to liveness issues:

  • Dispatchers.IO receives a runBlocking task.
  • Dispatchers.IO receives a task to download a video from the Internet.
  • The runBlocking block launches an await { } coroutine on another dispatcher and suspends waiting for its result.
  • runBlocking notices that its worker has a task to download a video and starts doing that.
  • The coroutine that should awaken the runBlocking block does exactly that.
  • But runBlocking is still busy downloading the video!

runBlocking could have continued actually doing its job while other Dispatchers.IO threads stole the video-downloading task; instead, runBlocking takes 15 seconds to complete when it could have finished in a couple of milliseconds.

This is extremely non-obvious and difficult to debug. Deadlocks are typically easier to detect: while they are irritating, they can be diagnosed more readily than this spooky action at a distance, because the threads exhibiting the problem are the ones actually involved in it.

I'm proposing to close this PR until it becomes obvious that a better solution is impossible for some reason.

@0x746143
Author

0x746143 commented Jan 8, 2025

I checked again, and it seems that the deadlock was fixed by PR 4255. It appears I initially tested in the master branch, which is why I encountered the deadlock. Later, I committed to the develop branch and forgot to test again :(

By the way, the liveness issue described in the example shouldn't occur, because in the deadlock case calling blockedThread.scheduler.dispatch() transfers the task from the local queue of the current worker to another worker, so it no longer runs on the runBlocking thread. After PR 4255, this is handled during yield() execution.

@0x746143 closed this on January 8, 2025
@dkhalanskyjb
Collaborator

Strange. That commit was already in master when you opened the PR. But ok, good to know, thanks!

@0x746143
Author

0x746143 commented Jan 8, 2025

I forked the repository on 2024-12-18 and tested version 1.9.0, where the last commit in master was from 2024-09-27. According to the commit history, develop was merged into master on 2024-12-19. When I committed to develop on 2024-12-24, I ran the tests and verified that there were no errors. However, I did not check if the deadlock was still reproducible without my changes. I apologize for any inconvenience.

@0x746143 deleted the develop branch on January 8, 2025, and then restored it