When a build process is started, there are several threads which communicate via Gradle's messaging infrastructure (MessagingServices):
- The worker thread(s) in the build process that start the worker, send some messages, wait for something to happen and wait for the worker process to finish.
- A couple of threads in the build process that send and receive on the connection to the worker.
- The main thread in the worker process, which runs the worker action that sets up the workers, waits for something to happen and exits the process.
- A couple of threads in the worker process that send and receive on the connection to the build process.
- The worker thread(s) in the worker process that do the work in response to messages, and may send some messages back.
Any or all of these can fail, and at the moment we don't deal with this particularly well. We want to add some robustness at this level first, before adding robustness at deeper levels.
Failure of a worker thread in the worker process is the most likely point of failure.
- When a worker fails, it should no longer be used.
- When all the workers fail, send a 'workers failed' message back to the build process, tear down the connection, unblock the main thread and exit the worker process. The worker threads in the build process would be notified of this failure and can clean up.
Move this responsibility to the infrastructure so that the worker action is not involved in cleanup. This way the robustness can live in a single place rather than being reimplemented in each of the worker actions.
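The policy above (stop using a failed worker, and react exactly once when the last worker fails) can be sketched as a small piece of infrastructure-level bookkeeping. This is only an illustration: `WorkerFailureTracker`, its method names and the string worker ids are hypothetical and are not part of Gradle's messaging API. The callback stands in for "send 'workers failed', tear down the connection, unblock the main thread".

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: tracks which workers are still usable and fires a
// single callback when the last one has failed. Not a Gradle API.
final class WorkerFailureTracker {
    private final Set<String> healthy = ConcurrentHashMap.newKeySet();
    private final AtomicBoolean allFailedReported = new AtomicBoolean();
    private final Runnable onAllWorkersFailed;

    WorkerFailureTracker(Iterable<String> workerIds, Runnable onAllWorkersFailed) {
        for (String id : workerIds) {
            healthy.add(id);
        }
        this.onAllWorkersFailed = onAllWorkersFailed;
    }

    // A failed worker is never handed out again.
    boolean isUsable(String workerId) {
        return healthy.contains(workerId);
    }

    // Called by the infrastructure when a worker throws or otherwise dies.
    void workerFailed(String workerId) {
        healthy.remove(workerId);
        // compareAndSet guarantees the callback runs once, even if several
        // workers fail concurrently or the same failure is reported twice.
        if (healthy.isEmpty() && allFailedReported.compareAndSet(false, true)) {
            onAllWorkersFailed.run();
        }
    }
}
```

Keeping this logic in one class is what lets the worker actions stay unaware of cleanup: they only see messages, while the tracker decides when the process as a whole has failed.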
Dealing with failures on the worker side of the connection: on failure to receive or send on the worker process side, we could just tear down the connection and the worker process, perhaps waiting for the worker threads to finish up first. To the build process, this would look similar to the previous failure: the worker threads in the build process would be notified that the worker is gone and can clean up. The same applies when the worker process crashes (e.g. kill -9, segfault, Runtime.halt(), System.exit()). This would also deal with the build process going missing.
Forcefully kill the worker process if it has not exited within some timeout on shutdown.
Notify the worker threads so that they can initiate a shutdown.
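The forced kill after a shutdown timeout could look roughly like the sketch below, written against the standard `java.lang.Process` API. It assumes a stop request has already been sent to the worker by some other means; the class and method names (`WorkerProcessStopper`, `awaitExitOrKill`) are hypothetical, and the timeout value is left to the caller.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not a Gradle API: waits for a worker process that has
// already been asked to stop, and kills it if it does not exit in time.
final class WorkerProcessStopper {
    private WorkerProcessStopper() {
    }

    // Returns true if the process exited on its own within the timeout,
    // false if it had to be killed.
    static boolean awaitExitOrKill(Process worker, long timeout, TimeUnit unit) {
        try {
            if (worker.waitFor(timeout, unit)) {
                return true; // clean exit within the grace period
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller, then fall through and
            // kill the worker rather than leave it running.
            Thread.currentThread().interrupt();
        }
        worker.destroyForcibly();
        return false;
    }
}
```

The boolean result lets the build process distinguish a clean shutdown from a forced one, for example to log a warning when a worker had to be killed.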