We need an ak.packed function #746

Closed
jrueb opened this issue Feb 17, 2021 · 1 comment · Fixed by #912
Labels
feature New feature or request

Comments

@jrueb
Contributor

jrueb commented Feb 17, 2021

The guide at https://awkward-array.org/how-to-convert-buffers.html instructs readers to use ak.to_buffers in order to write HDF5 files. However, the output files can very easily become unnecessarily large.
Please consider the following example:

import numpy as np
import awkward as ak
arr = ak.Array({"x": np.random.rand(1000)})
mask = [0, 2]
arr = arr[mask]
form, length, container = ak.to_buffers(arr)

container, which will get saved to the file, contains an array of 1000 numbers, even though we only want 2 of them. It doesn't have to be 1000; in fact, this number can be much larger.
What I think would be very nice here is an option to have the container be restricted to only the data that is necessary. This could even be an additional function, condensing an awkward array so that it is compact in memory.
I know that flattening can have a similar effect, but it doesn't work on arrays with records. Surprisingly, doing something like ak.from_arrow(ak.to_arrow(arr)) has the desired effect on the array. However, this seems to be a very crude workaround.

@jrueb jrueb added the feature New feature or request label Feb 17, 2021
@jpivarski
Member

This is #701, and I agree that we need a function to trim the unreachable elements from an array. When manipulating these in memory, we want to keep them to avoid duplicating data, but writing them to disk is a copy anyway and at that point, it's time to trim them down. Such an operation should be automatically applied in pickling, for instance, and it should be discussed in the tutorial you're referencing.

This was also an issue in Awkward 0: scikit-hep/awkward-0.x#246. As indicated in #701, the ak.packed function would be a good addition. You found the same `ak.from_arrow(ak.to_arrow(arr))` workaround we discussed there, which does some wasteful computations if there are option-types around (not so much otherwise, but it's still crude).
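
To make the idea of trimming unreachable elements concrete, here is a minimal sketch in plain NumPy. It is a toy model, not Awkward's actual layout classes or the eventual ak.packed implementation: the names `content`, `index`, and `pack` are hypothetical and only illustrate how a lazily sliced array keeps the full original buffer, and how copying just the reachable elements shrinks what would be written to disk.

```python
import numpy as np

# Toy model (an assumption for illustration, not Awkward's real layout):
# a lazily sliced array is an index buffer pointing into an untouched
# content buffer.
content = np.random.rand(1000)             # original data: 1000 float64 values
index = np.array([0, 2], dtype=np.int64)   # the two elements still reachable


def pack(index, content):
    """Copy only the reachable elements and renumber the index (hypothetical helper)."""
    packed_content = content[index]                        # fancy indexing makes a copy
    packed_index = np.arange(len(index), dtype=np.int64)   # index is now 0, 1, ...
    return packed_index, packed_content


packed_index, packed_content = pack(index, content)

# Bytes that would be saved to a file before and after packing.
bytes_before = index.nbytes + content.nbytes
bytes_after = packed_index.nbytes + packed_content.nbytes
print(bytes_before, bytes_after)
```

In this sketch the unpacked buffers hold all 1000 values even though only 2 are reachable, while the packed buffers hold just those 2. The copy is the price of packing, which is why it makes sense at write time rather than during in-memory manipulation.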

@jpivarski jpivarski changed the title from "Only output minimal container in ak.to_buffers" to "We need an ak.packed function" Feb 17, 2021
@jpivarski jpivarski mentioned this issue Jun 11, 2021
@jpivarski jpivarski linked a pull request Jun 14, 2021 that will close this issue