This repository has been archived by the owner on Mar 4, 2021. It is now read-only.

File too slow #41

Open
4 tasks
dneise opened this issue May 15, 2018 · 0 comments

dneise commented May 15, 2018

The protozfits.File can be used in two ways:

  • Slow but comfy
  • Faster but annoying

Slow but comfortable

import protozfits

with protozfits.File('some/path.fits.fz') as f:
    for event in f.Events:
        # now event.anything is a "useful" Python thing,
        # an np.array for example
        ...

Doing this gives you a very easy-to-use event.

Faster but annoying

If you need faster iteration, you can skip the conversion to "useful" Python things entirely by setting pure_protobuf=True:

import protozfits

with protozfits.File('some/path.fits.fz', pure_protobuf=True) as f:
    for event in f.Events:
        ...

Now event is a Python object, just like before, but it offers no tab completion, and its array-like members are not numpy arrays but instances of AnyArray that must be converted explicitly, for example with protozfits.any_array_to_numpy.

This means that users who know they only need access to one very specific member of the event, which also happens to be an integer, can set pure_protobuf=True and iterate over all events of the file much faster.
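For illustration, the AnyArray-to-numpy conversion is essentially a reinterpretation of raw bytes. The sketch below mimics what a helper such as protozfits.any_array_to_numpy has to do; the function name, the bytes-plus-dtype signature, and the payload are made up for this example and are not the library's actual code:

```python
import numpy as np

def any_array_to_numpy_sketch(data: bytes, dtype) -> np.ndarray:
    """Hypothetical helper: reinterpret the raw bytes of an
    AnyArray-like message as a numpy array of the advertised dtype."""
    return np.frombuffer(data, dtype=dtype)

# A fake payload standing in for the raw bytes of an array member:
payload = np.arange(4, dtype=np.int16).tobytes()
samples = any_array_to_numpy_sketch(payload, np.int16)
```

The conversion itself is cheap per array (np.frombuffer is zero-copy), but doing it eagerly for every array member of every event adds up, which is what the "comfy" mode pays for.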



There must be a better way!

Offering these two possibilities might sound nice at first, but it splits the users into two groups: those who need speed and those who want to keep it simple. I believe good software does not force users to choose between these two options, at least not at the very start of every project ... and this reader happens to be at the start ...

These are the features the "comfy" option has and the "fast" option is missing:

  • tab completion
  • correct enum representation (not mentioned above)
  • complete auto numpy conversion
  • shorter string representation (using the two above)

As I understand it, the performance-critical part is the automatic numpy conversion: all arrays are always converted from AnyArray to numpy, even when the user never accesses them.

So the solution to the problem seems to be lazy evaluation.

At the moment (in the "comfy" version) the whole event is converted into a collections.namedtuple and handed to the user. This conversion includes converting all AnyArrays into np.arrays.
Instead, we could dynamically generate "useful" classes from the Google protobuf descriptors at import time. At iteration time, the "pure protobuf" object would be hidden inside an instance of one of these new "useful" classes, which offers tab completion on its members. Only when a member is actually accessed would the instance look at its internal pure-protobuf object and perform whatever conversion is needed.
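A minimal sketch of such a lazy wrapper in plain Python. Everything here is hypothetical: LazyEvent, the fake raw message, and the list-to-array conversion are stand-ins; the real version would be generated from the protobuf descriptors and would call protozfits.any_array_to_numpy:

```python
import numpy as np

class LazyEvent:
    """Hypothetical wrapper: holds the raw pure-protobuf object and
    converts a member only when it is actually accessed."""

    def __init__(self, raw):
        self._raw = raw      # the "pure protobuf" message
        self._cache = {}     # members converted so far

    def __getattr__(self, name):
        # Called only for names not found the normal way,
        # i.e. on access of each lazily converted member.
        if name not in self._cache:
            value = getattr(self._raw, name)
            # Stand-in for the AnyArray -> numpy conversion:
            if isinstance(value, list):
                value = np.asarray(value)
            self._cache[name] = value
        return self._cache[name]

    def __dir__(self):
        # Expose the raw message's members for tab completion.
        return sorted(set(super().__dir__()) | set(dir(self._raw)))

# Example with a stand-in raw message:
class _FakeRaw:
    event_id = 7
    waveform = [1, 2, 3]

ev = LazyEvent(_FakeRaw())
```

With this, ev.waveform is converted to an np.ndarray on first access and cached afterwards, while members that are never touched are never converted; dir(ev) still lists waveform and event_id, so tab completion keeps working.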

The thing is ... I have no idea (yet) how to do this, or whether initializing these instances would in the end cost more than just leaving the situation as it is.

I believe this is worth a little study, at least if users already feel that the read performance in Python should be better.
