[draft] erts: kill spawned child processes on VM exit #9453

Open · wants to merge 4 commits into master from aw-orphans
Conversation

@adamwight (Contributor)
This is a very rough proof-of-concept for discussion, which ensures all children spawned with open_port are terminated along with the BEAM.

Will be discussed in https://erlangforums.com/t/open-port-and-zombie-processes

@CLAassistant commented Feb 17, 2025

CLA assistant check
All committers have signed the CLA.

github-actions bot commented Feb 17, 2025

CT Test Results

  2 files   15 suites   13m 31s ⏱️
115 tests 102 ✅ 3 💤 10 ❌
135 runs  116 ✅ 3 💤 16 ❌

For more details on these failures, see this check.

Results for commit f7c336f.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run the tests locally.

Artifacts

// Erlang/OTP Github Action Bot

@adamwight force-pushed the aw-orphans branch 5 times, most recently from 9f87bc1 to dba896f on February 20, 2025 07:45
@garazdawi self-assigned this Feb 20, 2025
@garazdawi added the team:VM (Assigned to OTP team VM) label Feb 20, 2025
@garazdawi (Contributor)

Hello!

I think that we can move forward with this. There is no need for an option to disable it for now (unless our existing tests show that it is needed...), but there need to be test cases verifying that it works as expected on both Unix and Windows.

I do wonder however if we should send some other signal than KILL? Should we allow the child to be able to catch it and deal with it if they want to?

@adamwight (Contributor, Author)

> I do wonder however if we should send some other signal than KILL? Should we allow the child to be able to catch it and deal with it if they want to?

Good point; TIL that SIGKILL is untrappable. Looking at erlexec for precedent, its default behavior is to send SIGTERM to the direct child process, wait a configurable 5 seconds, and then send SIGKILL.

I experimented a bit locally to see whether sh would react to a SIGTERM by stopping its own children, and it does not. So, in order for spawn commands to benefit as well as spawn_executable, I would stick with the choice of signalling the entire child process group, but send TERM to be more polite. I like that this also offers the descendant processes a second, more straightforward way to prevent the termination if needed.
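
For concreteness, here is a minimal C sketch of the erlexec-style escalation described above. It is not the actual patch: the function name and grace period are illustrative, and it assumes the child was started as a process-group leader (setpgid(0, 0) after fork), so the child's pid doubles as its process-group id.

    /* Illustrative only: escalate from SIGTERM to SIGKILL for a child's
     * process group.  Assumes the child called setpgid(0, 0) after fork,
     * so its pid doubles as its process-group id. */
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void terminate_child_group(pid_t child_pid, unsigned grace_seconds)
    {
        kill(-child_pid, SIGTERM);                /* negative pid = whole group */

        for (unsigned i = 0; i < grace_seconds; i++) {
            if (waitpid(child_pid, NULL, WNOHANG) == child_pid)
                return;                           /* direct child already exited */
            sleep(1);
        }

        kill(-child_pid, SIGKILL);                /* grace period expired */
        waitpid(child_pid, NULL, 0);
    }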

@adamwight force-pushed the aw-orphans branch 4 times, most recently from 8efd4f5 to 81abb88 on February 22, 2025 16:47
@adamwight (Contributor, Author)

Now includes a test for normal erl shutdown (halt() itself), and I'll try to also write one for abnormal shutdown (receiving SIGKILL). A small race condition remains in the test which I'd like to clean up, perhaps by direct communication from the grandchild process (a new test utility, file_when_alive) back to the test executor; I'm open to suggestions about how to do that.

There are a few other flapping tests; I don't think they are related to my patch, but I can't say for sure...

Running the tests on Windows is not going well for me, and it seems there's no CI for that yet? In theory my patch and the test will also run on win32 but I'd like to see that happen.

@garazdawi (Contributor)

> perhaps by direct communication from the grandchild process (a new test utility, file_when_alive) back to the test executor; I'm open to suggestions about how to do that.

If the grandchild is an Erlang node, it could communicate via Erlang distribution? Otherwise a file seems reasonable.

> Running the tests on Windows is not going well for me, and it seems there's no CI for that yet? In theory my patch and the test will also run on win32 but I'd like to see that happen.

No, there is no GitHub CI for that yet. I have a branch that I work on from time to time to try to bring it in, but the tests are not stable enough yet. Maybe you can temporarily use it as a base for your changes, and you should at least be able to see whether your tests pass or fail?

@garazdawi (Contributor)

A couple of other things that popped into my mind:

What do we do when someone does port_close/1? To me it seems reasonable that the behaviour should be the same as if the Erlang VM terminated?

> I like that this also offers the descendant processes a second, more straightforward way to prevent the termination if needed.

I know that there are users who rely on being able to spawn daemon processes through os:cmd("command &"). This is tested in os_SUITE:background_command/1, but the test will not catch what happens when the emulator dies. I'm unsure what the user wants to happen there; I can see them wanting it both to die and to survive... though since the current behaviour is that it survives, I think we need to keep that.

@adamwight (Contributor, Author)

> Maybe you can temporarily use it as a base for your changes, and you should at least be able to see whether your tests pass or fail?

Great! I'm working on that now, and learned that my proposed feature needs to be reimplemented separately for the win32 spawn driver.

> What do we do when someone does port_close/1? To me it seems reasonable that the behaviour should be the same as if the Erlang VM terminated?

That makes sense; the same principle applies IMHO, and it feels consistent to attempt a direct termination any time Erlang will lose its connection to the child. I'll add this.
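
To make the hook concrete: a port driver's stop callback runs when the port is closed, so conceptually that is where a termination request for the child belongs. The sketch below is hypothetical; the struct, field, and function names are not the actual erts spawn driver, which on Unix delegates process management to the forker.

    /* Hypothetical sketch of a driver stop callback signalling its child. */
    #include <signal.h>
    #include <sys/types.h>

    typedef struct {
        pid_t os_pid;              /* pid of the spawned child, 0 if already gone */
    } spawn_port_state;            /* hypothetical per-port driver data */

    static void spawn_stop(void *drv_data)
    {
        spawn_port_state *state = drv_data;

        if (state->os_pid > 0) {
            kill(state->os_pid, SIGTERM);  /* ask the child to go away with its port */
            state->os_pid = 0;
        }
        /* The real change would also tell the forker so it can clean up. */
    }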

> os:cmd("command &").

I see... Interestingly, the "&" in that test only serves to let os:cmd return immediately; the shell job control seems to be unimportant. In other words, the test is equivalent to calling open_port and not waiting for the process to finish, so this syntax is more a convenience than a special use case. But it would definitely indicate an intention to start a daemon with no direct link to Erlang, so +1 that we should respect this usage!

As a tourist to the BEAM, all I can do is describe the options; I don't have instincts for which is the best way to go. We could preserve this "&" usage by only killing the immediate child process, which would be the shell. This still offers some benefits, since the application developer may be able to call open_port with spawn_executable, making their process the immediate child and causing it to be cleaned up without needing a wrapper script. It also simplifies reasoning about the "process group": killing exactly one child process is much more predictable.

@michalmuskala (Contributor)

To some extent, I think there's a bigger problem here: we could have a whole new API for managing external processes, since the current port API is neither powerful nor ergonomic.
This change to kill things proactively is definitely a good one, but if we're looking further into that area, I think things could be improved dramatically.

As @garazdawi mentioned, today it's not even possible to easily kill the process once you spawn it, and port_close will just close the stdin.

@adamwight (Contributor, Author)

I found that shell "&" assigns the background job to a new process group, which IMHO means that killing children by process group is back on the table. For now however, my patch is rewritten to kill only the direct child process.

The latest branch also kills a port's child during port_close.

Splitting this responsibility between the main VM process and the forker is causing a memory leak (and a leaky abstraction). I imagine this can be resolved by sending another protocol message to the forker, allowing it to perform cleanup such as killing the process and then freeing the memory used to track the child. Introducing this new message has some small overhead, but I don't see any obvious existing means for the forker to detect that the port was closed by beam.
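
A sketch of what such a forker-side handler might look like (the message name, table layout, and function below are invented for illustration; they are not the existing forker protocol):

    /* Illustrative only: handle a hypothetical "port closed" message in the
     * forker by killing the associated child and dropping its tracking entry,
     * so memory is no longer leaked for ports that beam has already closed. */
    #include <signal.h>
    #include <stdlib.h>
    #include <sys/types.h>

    typedef struct child_entry {
        int port_id;                   /* beam-side port identifier */
        pid_t os_pid;                  /* OS pid of the spawned child */
        struct child_entry *next;
    } child_entry;

    static child_entry *children;      /* forker's table of live children */

    static void handle_port_closed(int port_id)
    {
        child_entry **prev = &children;

        for (child_entry *c = children; c != NULL; prev = &c->next, c = c->next) {
            if (c->port_id == port_id) {
                kill(c->os_pid, SIGTERM);  /* or kill(-c->os_pid, ...) for the group */
                *prev = c->next;           /* unlink and free the tracking entry */
                free(c);
                return;
            }
        }
        /* Unknown port id: the child already exited and was reaped earlier. */
    }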

> we could have a whole new API for managing external processes

+1 that direct OS process management could be a nice addition to the core libraries, but the current iteration can be done without larger changes to the opaque port concept.

@garazdawi (Contributor)

> Splitting this responsibility between the main VM process and the forker is causing a memory leak (and a leaky abstraction). I imagine this can be resolved by sending another protocol message to the forker, allowing it to perform cleanup such as killing the process and then freeing the memory used to track the child. Introducing this new message has some small overhead, but I don't see any obvious existing means for the forker to detect that the port was closed by beam.

Yes, this seems like a good approach.

Commits

Demonstrate failing examples of how a spawned child process will continue after its port is closed or the VM dies.

Currently a no-op patch. This message will allow the forker to clean up internal resources and kill the child process in a later patch.

Previously, the forker start command would use the presence of port_id to indicate whether the caller had specified exit_status to open_port. This no-op patch splits this out into an explicit second field `want_exit_status`, so that we can always track port_id. A later patch will use the os_pid->port_id mapping and needs all children tracked even when exit status was not requested.

If the forker's connection to the parent BEAM is broken or closed, react by killing all spawned children. When a spawned port is closed, kill the associated OS process.

A concise demonstration of the problem being solved is to run the
following command with and without the patch, then kill the BEAM.
Without the patch, the "sleep" process will continue:

    erl -noshell -eval 'os:cmd("sleep 60")'

TODO:
* Needs a decision between killing the process and the process group.
* Separate patch for win32
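
The last commit message above mentions killing all spawned children when the forker's connection to the parent BEAM is broken or closed. As a rough illustration of how a helper process can detect that condition (not the actual forker code; the function and its parameters are invented), it can wait on the descriptor connected to beam and treat POLLHUP, POLLERR, or a zero-length read as "the parent is gone":

    /* Illustrative only: block until the descriptor connected to beam reports
     * EOF or an error, then signal every tracked child. */
    #include <poll.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void kill_children_when_beam_exits(int beam_fd,
                                              const pid_t *children, int n_children,
                                              int signo)
    {
        struct pollfd pfd = { .fd = beam_fd, .events = POLLIN };

        for (;;) {
            if (poll(&pfd, 1, -1) < 0)
                continue;                      /* interrupted by a signal */
            if (pfd.revents & (POLLHUP | POLLERR))
                break;                         /* connection to beam is gone */
            if (pfd.revents & POLLIN) {
                char buf[256];
                if (read(beam_fd, buf, sizeof buf) == 0)
                    break;                     /* EOF: beam closed its end */
                /* The real forker would dispatch protocol messages here. */
            }
        }

        for (int i = 0; i < n_children; i++)
            kill(children[i], signo);
    }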