This repository has been archived by the owner on Mar 4, 2021. It is now read-only.

File too slow #41

Open
4 tasks
dneise opened this issue May 15, 2018 · 0 comments

dneise commented May 15, 2018

The protozfits.File can be used in two ways:

  • Slow but comfy
  • Faster but annoying

Slow but comfortable

import protozfits

with protozfits.File('some/path.fits.fz') as f:
    for event in f.Events:
        # now event.anything is a "useful" Python thing,
        # an np.array for example
        ...

Doing this gives you a very easy-to-use event.

Faster but annoying

If you need faster iteration, you can skip the conversion to "useful" Python things entirely by setting pure_protobuf=True:

import protozfits

with protozfits.File('some/path.fits.fz', pure_protobuf=True) as f:
    for event in f.Events:
        ...

Now event is a Python object, just like before, but it offers no tab completion, and its array-like members are not numpy arrays but instances of AnyArray that must be converted explicitly, for example with protozfits.any_array_to_numpy.

This means that users who know they only need access to one very specific member of the event, which also happens to be an integer, can set pure_protobuf=True and iterate over all events of the file much faster.
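For illustration, the AnyArray-to-numpy conversion is essentially a reinterpretation of raw bytes. The sketch below mimics what a helper such as protozfits.any_array_to_numpy has to do; the function name, the bytes-plus-dtype signature, and the payload are made up for this example and are not the library's actual code:

```python
import numpy as np

def any_array_to_numpy_sketch(data: bytes, dtype) -> np.ndarray:
    """Hypothetical helper: reinterpret the raw bytes of an
    AnyArray-like message as a numpy array of the advertised dtype."""
    return np.frombuffer(data, dtype=dtype)

# A fake payload standing in for the raw bytes of an array member:
payload = np.arange(4, dtype=np.int16).tobytes()
samples = any_array_to_numpy_sketch(payload, np.int16)
```

The conversion itself is cheap per array (np.frombuffer is zero-copy), but doing it eagerly for every array member of every event adds up, which is what the "comfy" mode pays for.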



There must be a better way!

Offering these two possibilities might sound nice at first, but it splits the users into two groups: those who need speed and those who want to keep it simple. I believe good software does not force users to choose between these two options, at least not at the very start of every project ... and this reader happens to be at the start ...

These are the features the "comfy" option has and the "fast" option is missing:

  • tab completion
  • correct enum representation (not mentioned above)
  • complete auto numpy conversion
  • shorter string representation (using the two above)

As I understand it, the performance-critical part is the automatic numpy conversion: all arrays are always converted from AnyArray to numpy, even when the user never accesses them.

So the solution to the problem seems to be lazy evaluation.

At the moment (in the "comfy" version) the whole event is converted into a collections.namedtuple and handed to the user. This conversion includes converting all AnyArrays into np.arrays.
Instead, we could dynamically generate "useful" classes from the Google protobuf descriptors at import time. At iteration time, the "pure protobuf" object would be hidden inside an instance of one of these new "useful" classes, which offers tab completion on its members. Only when a member is actually accessed would the instance look at its internal pure-protobuf object and perform whatever conversion is needed.
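A minimal sketch of such a lazy wrapper in plain Python. Everything here is hypothetical: LazyEvent, the fake raw message, and the list-to-array conversion are stand-ins; the real version would be generated from the protobuf descriptors and would call protozfits.any_array_to_numpy:

```python
import numpy as np

class LazyEvent:
    """Hypothetical wrapper: holds the raw pure-protobuf object and
    converts a member only when it is actually accessed."""

    def __init__(self, raw):
        self._raw = raw      # the "pure protobuf" message
        self._cache = {}     # members converted so far

    def __getattr__(self, name):
        # Called only for names not found the normal way,
        # i.e. on access of each lazily converted member.
        if name not in self._cache:
            value = getattr(self._raw, name)
            # Stand-in for the AnyArray -> numpy conversion:
            if isinstance(value, list):
                value = np.asarray(value)
            self._cache[name] = value
        return self._cache[name]

    def __dir__(self):
        # Expose the raw message's members for tab completion.
        return sorted(set(super().__dir__()) | set(dir(self._raw)))

# Example with a stand-in raw message:
class _FakeRaw:
    event_id = 7
    waveform = [1, 2, 3]

ev = LazyEvent(_FakeRaw())
```

With this, ev.waveform is converted to an np.ndarray on first access and cached afterwards, while members that are never touched are never converted; dir(ev) still lists waveform and event_id, so tab completion keeps working.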

The thing is ... I have no idea (yet) how to do this, or whether initializing these instances would in the end cost more than just leaving the situation as it is.

I believe this is worth a little study, at least if users already feel that the read performance in Python should be better.
