[FEATURE] Discussion about the use of multiprocessing #158
I really like the idea of making …
This is true for "running the dataflow", but it is not true for our test suite. Some of it uses … Perhaps something like …

In any case, this is something to think about in regards to this change. Although it is very finicky, I still aspire for us to have "tested documentation", and if there's a big divergence between how we make multi-process testing work and how you run a demo, it might be more work to bridge the gap back to allow a documentation example that starts up a cluster to be programmatically tested. Currently we are just using "capture and compare stdout", so that might not actually be that hard to bridge. But the command to spawn the cluster would need to work in the doctest and markdown-test environments. It might be that this is too much work and not worth it, but again, something to think about here.
Right, what I meant is that if we don't need it in the execution, we just need to find a different way to test the behavior.
You are right about doctests and markdown tests though; that could be trickier depending on how we implement …
Another issue we had with pickling for tests: https://bytewax.slack.com/archives/C028Q25AK3K/p1671138645483599
Turns out it's related to importing … Apparently … So we might continue to run into strange issues.
Is your feature request related to a problem? Please describe.
To spawn a cluster we use the `spawn_cluster` function, which is defined in Python: it spawns `proc_count` processes using `multiprocessing`.

This works fine, but it requires all the arguments of `apply_sync` to be picklable, which is limiting us in some cases (for example, we can't use decorators that return generators), and it requires some boilerplate on the Rust side (see the "Egregious hack").

The point is that we don't use any of the multiprocessing utilities except in a couple of tests, so if we can find a way to avoid using multiprocessing, we could remove a lot of the boilerplate, remove a dependency, and avoid the pitfalls of requiring everything to be picklable.
Describe the solution you'd like
Since we don't need to share state between the processes (not in Python, at least), we could make `spawn_cluster` spawn subprocesses at the system level, so that the Python file is evaluated anew for each process. Something like this:
Describe alternatives you've considered
I don't particularly like this solution, but it's one way to solve the problem.
I opened this issue to discuss possible alternatives.
Another thing we could do is make `spawn_cluster` an "external" script, not called from Python directly. We could remove the `spawn_cluster` function from the Python API and let users use just `cluster_main`. `cluster_main` could read the other parameters (`proc_id`, `worker_count_per_proc` and `addresses`) from args, or from env vars, and the `spawn_process` script (which could also be written in Rust, or a shell script, or another Python script, it doesn't really matter) would set the variables similarly to what I proposed above, and you'd then run it as a separate command.
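A sketch of what the `cluster_main` side of this could look like, assuming hypothetical `BYTEWAX_*` environment variables set by the external spawn script (these names are illustrative, not Bytewax's actual interface):

```python
import os


def read_cluster_params():
    # Illustrative helper: recover the parameters the external spawn
    # script exported as environment variables.
    proc_id = int(os.environ["BYTEWAX_PROC_ID"])
    worker_count_per_proc = int(os.environ["BYTEWAX_WORKERS_PER_PROC"])
    addresses = os.environ["BYTEWAX_ADDRESSES"].split(";")
    return proc_id, worker_count_per_proc, addresses
```

`cluster_main` would call something like this at startup instead of receiving the parameters as Python arguments, so again nothing has to cross a pickling boundary.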
But I still haven't properly explored this solution; it might hide some other problems.
Additional context
Let me know what you think; if you have other ideas, I can spend some time building a proof of concept to see if it could work.