# The `meta` passed to `map_partitions` is not used during graph optimization (#571)
When using `map_partitions` with a known `meta` provided, the function will still be evaluated with typetracers if `optimize_graph` is turned on. The example below will print "test_func called with typetracer" once. Is this behavior expected? Will it be possible to store a copy of `meta` somewhere and return it during the optimization? This is useful when some operations inside the function do not accept typetracers as arguments but the structure of the final returned array is determinate.
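The original snippet is not reproduced here; what follows is a hypothetical reconstruction that matches the description (the name `test_func` and the printed message come from the prose above, everything else is assumed):

```python
import awkward as ak
import dask_awkward as dak

def test_func(array: ak.Array) -> ak.Array:
    if ak.backend(array) == "typetracer":
        print("test_func called with typetracer")
    return array.x + 1

arr = ak.Array([{"x": 1}, {"x": 2}])
darr = dak.from_awkward(arr, npartitions=1)

# meta describing the known output structure, built once up front
meta = ak.Array(arr.layout.to_typetracer(forget_length=True)).x + 1

result = darr.map_partitions(test_func, meta=meta)

# even though meta was supplied, the column optimization traces test_func
# with a typetracer array, so this prints the message once
result.compute()
```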
---

@pfackeldey, please point to your declarative mapper function, perhaps with an example for this specific case. We need docs around this to be rock solid!

@chuyuanliu: the only way to automatically know what columns are used within a given function is to run it. Furthermore, if the output columns are not derived from the very same buffers coming in, we can no longer trace what is required. In practice, that would mean we cannot optimise away loading any column passed to an opaque function with `map_partitions`. A common pattern in the past has been to wrap a function and check for typetracers (as you hint), but @pfackeldey's contribution makes this much nicer.
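For reference, a minimal sketch of the "wrap a function and check for typetracers" pattern mentioned above; the column names and the kernel are hypothetical:

```python
import awkward as ak
import numpy as np

def opaque_kernel(array: ak.Array) -> ak.Array:
    # stand-in for an operation the typetracer cannot see through,
    # e.g. a call into compiled code
    return ak.Array(np.asarray(array.col1) + np.asarray(array.col2))

def wrapped(array: ak.Array) -> ak.Array:
    if ak.backend(array) == "typetracer":
        # declare the columns the kernel reads, so the optimizer keeps them
        ak.typetracer.touch_data(array.col1)
        ak.typetracer.touch_data(array.col2)
        # hand back a typetracer with the known output type (1D float64)
        return ak.Array(ak.Array([0.0]).layout.to_typetracer(forget_length=True))
    return opaque_kernel(array)
```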
---

@chuyuanliu if you want to touch all inputs to the function (i.e. mark all data buffers as required), then there is …
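Whatever mechanism is being referred to, one way to mark all input buffers as required by hand is awkward's `ak.typetracer.touch_data` (a hedged sketch, not necessarily the API the comment meant):

```python
import awkward as ak

arr = ak.Array([{"col1": 1.0, "col2": 2.0}])
tracer = ak.Array(arr.layout.to_typetracer(forget_length=True))

# touches every buffer of the array; when dask-awkward traces with a
# report attached, touched buffers are treated as required columns
ak.typetracer.touch_data(tracer)
```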
---

I think you're referring to #551, which was an attempt to make … In general, I'd rather have #565 instead of this declarative map (I'm considering closing #551). This new …
---

`optimize=False` turns off all optimization algorithms; you can't turn off only dask-awkward's contribution this way.
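As a hedged aside: recent dask-awkward versions also expose their own optimization switch through dask's configuration system, which disables only the dask-awkward passes while leaving dask's generic optimizations in place (the config key below reflects dask-awkward's config file at the time of writing; check your installed version):

```python
import awkward as ak
import dask
import dask_awkward as dak

darr = dak.from_awkward(ak.Array([{"x": 1}]), npartitions=1)

# disable only dask-awkward's optimizations (e.g. necessary-columns
# projection); dask's generic graph optimizations still run
with dask.config.set({"awkward.optimization.enabled": False}):
    (darr.x + 1).compute()
```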
---

Thanks for all the comments. For the old code migration, #551 sounds like exactly what I need, but there is a case that is probably not covered by the current static …

```python
def test_func(array: ak.Array, cond: str):
    match cond:
        case "cond1":
            # some untraceable operations with "col1" and "col2"
            ...
        case "cond2":
            # some untraceable operations with "col1" and "col3"
            ...
```

This may require the …

```python
def mock_func(array: ak.Array, cond: str):
    # array here can be a typetracer
    match cond:
        case "cond1":
            return {"array": ["col1", "col2"]}  # or a typetracer derived from array
        case "cond2":
            return {"array": ["col1", "col3"]}  # or a typetracer derived from array
```

For the new code written from scratch to work with dask-awkward, wrap the "atomic" untraceable functions with …
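No `mock_func` hook like this exists yet (it is what is being proposed here), but a hedged sketch of hand-wiring the same idea with today's primitives could look like this, with all names hypothetical:

```python
import awkward as ak

# declarative mapping from cond to the columns the real body will read,
# mirroring the proposed mock_func above
REQUIRED = {"cond1": ["col1", "col2"], "cond2": ["col1", "col3"]}

def real_func(array: ak.Array, cond: str) -> ak.Array:
    # placeholder for the untraceable per-cond operations
    a, b = (array[c] for c in REQUIRED[cond])
    return a + b

def traced_func(array: ak.Array, cond: str) -> ak.Array:
    if ak.backend(array) == "typetracer":
        for col in REQUIRED[cond]:
            ak.typetracer.touch_data(array[col])  # declare required columns
        # typetracer with the known output structure (1D float64 here)
        return ak.Array(ak.Array([0.0]).layout.to_typetracer(forget_length=True))
    return real_func(array, cond)
```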
---

Hi @chuyuanliu,

Yes, that is true. This mechanism allows you to run the typetracer to infer all needed columns, and then add by hand the missing ones for other if-branches. This may not be a perfect workflow, but it is the price we're paying for the 'sharp bits' of a tracing mechanism. Problematic if-branches are (afaik):

- ones that depend on the current partition number (which a user won't ever have access to during execution of a single partition, I think),
- ones that depend on global variables that differ between trace time and execution time, or
- ones that depend on the numeric values of columns.

We're currently in the process of adding lazy (or …

My current understanding for the output mocking is that if you do the "manual" column projection/optimization and then provide a …
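A hedged sketch of the "run the typetracer to infer all needed columns" step, using `dask_awkward.report_necessary_columns` (recent versions; older releases called it `necessary_columns`) and a hypothetical parquet file:

```python
import dask_awkward as dak

# hypothetical parquet input with fields col1, col2, col3
darr = dak.from_parquet("events.parquet")

result = darr.map_partitions(lambda a: a.col1 + a.col2)

# reports, per input layer, the columns the typetracer pass found necessary;
# columns needed only by untraced if-branches would then be added by hand
print(dak.report_necessary_columns(result))
```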