[Bug]: Reading units table is very slow when expanding electrodes table #237

Closed · alejoe91 opened this issue Nov 20, 2024 · 34 comments
Labels: category: bug (errors in the code or code behavior) · priority: medium (non-critical problem and/or affecting only a small set of users)
Milestone: 1.1.0

@alejoe91 (Collaborator) commented Nov 20, 2024

What happened?

We're mainly using NWB-Zarr and found that reading the units table and rendering it as a dataframe is prohibitively slow. This is mainly due to the electrodes column, which copies the entire dynamic table region into the dataframe.

To give some numbers, reading a dataset with ~758 units takes over ~10 minutes. When index=True, reading time goes down to ~6 s.
To investigate this performance issue, we also ran the same tests on the same file saved as HDF5; here are the results (see Steps to Reproduce).

In general, Zarr is slower, but this could be because everything is compressed by default in Zarr, while no compression is applied in HDF5.

This barplot shows the reading time for each column in the units table, obtained with:

import time
import numpy as np

# nwbfile_zarr / nwbfile_hdf5 are the files opened in "Steps to Reproduce" below
colnames = nwbfile_zarr.units.colnames
load_times_zarr = {}
load_times_hdf5 = {}

for col in colnames:
    # time a full read of the column from the Zarr-backed file
    t_start = time.perf_counter()
    nwbfile_zarr.units[col][:]
    t_stop = time.perf_counter()
    load_times_zarr[col] = np.round(t_stop - t_start, 2)

    # time a full read of the column from the HDF5-backed file
    t_start = time.perf_counter()
    nwbfile_hdf5.units[col][:]
    t_stop = time.perf_counter()
    load_times_hdf5[col] = np.round(t_stop - t_start, 2)

load_times_hdf5_zarr.pdf
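For reference, a barplot like the attached one can be produced from the timing dictionaries collected above. This is a minimal matplotlib sketch, not necessarily the exact script used for the attached PDF:

import matplotlib.pyplot as plt
import numpy as np

cols = list(load_times_zarr)                  # column names from the timing loop above
x = np.arange(len(cols))
width = 0.4

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(x - width / 2, [load_times_zarr[c] for c in cols], width, label="Zarr")
ax.bar(x + width / 2, [load_times_hdf5[c] for c in cols], width, label="HDF5")
ax.set_xticks(x)
ax.set_xticklabels(cols, rotation=45, ha="right")
ax.set_ylabel("load time (s)")
ax.legend()
fig.tight_layout()
fig.savefig("load_times_hdf5_zarr.pdf")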

Steps to Reproduce

import time

import numpy as np
from hdmf_zarr import NWBZarrIO
from pynwb import NWBHDF5IO

# nwb_path_zarr and nwb_paths_hdf5[file_index] point to the same dataset
# saved with the Zarr and HDF5 backends, respectively

### index = False
# zarr
t_start = time.perf_counter()
read_io_zarr = NWBZarrIO(nwb_path_zarr, mode='r')
nwbfile_zarr = read_io_zarr.read()
nwbfile_zarr.units.to_dataframe(index=False)
t_stop = time.perf_counter()
elapsed = np.round(t_stop - t_start, 2)
print(f"Time reading Zarr units with index=False: {elapsed}s")

# hdf5
t_start = time.perf_counter()
read_io_hdf5 = NWBHDF5IO(nwb_paths_hdf5[file_index], mode='r')
nwbfile_hdf5 = read_io_hdf5.read()
nwbfile_hdf5.units.to_dataframe(index=False)
t_stop = time.perf_counter()
elapsed = np.round(t_stop - t_start, 2)
print(f"Time reading HDF5 units with index=False: {elapsed}s")

### index = True
# zarr
t_start = time.perf_counter()
read_io_zarr = NWBZarrIO(nwb_path_zarr, mode='r')
nwbfile_zarr = read_io_zarr.read()
nwbfile_zarr.units.to_dataframe(index=True)
t_stop = time.perf_counter()
elapsed = np.round(t_stop - t_start, 2)
print(f"Time reading Zarr units with index=True: {elapsed}s")

# hdf5
t_start = time.perf_counter()
read_io_hdf5 = NWBHDF5IO(nwb_paths_hdf5[file_index], mode='r')
nwbfile_hdf5 = read_io_hdf5.read()
nwbfile_hdf5.units.to_dataframe(index=True)
t_stop = time.perf_counter()
elapsed = np.round(t_stop - t_start, 2)
print(f"Time reading HDF5 units with index=True: {elapsed}s")

Traceback

>>> Time reading Zarr units with index=False: 509.52s

>>> Time reading HDF5 units with index=False: 11.59s

>>> Time reading Zarr units with index=True: 5.48s

>>> Time reading HDF5 units with index=True: 0.45s

Operating System

Linux

Python Executable

Conda

Python Version

3.9

Package Versions

pynwb 2.8.2
hdmf 3.14.5
hdmf_zarr 0.9.0


@alejoe91 (Collaborator, Author)

@rly here's the issue we discussed!

@rly added the category: bug and priority: medium labels Nov 27, 2024
@rly added this to the 1.1.0 milestone Nov 27, 2024
@rly changed the title Jan 23, 2025
@mavaylon1 (Contributor)

@alejoe91 Is the file available to share via Dandi or elsewhere? If not, we can make a dummy file.

@mavaylon1 (Contributor)

@alejoe91 I was able to reproduce with my own file:
Time reading Zarr units with index=False: 12.82s
Time reading HDF5 units with index=False: 0.55s
Time reading Zarr units with index=True: 13.35s
Time reading HDF5 units with index=True: 0.57s

@mavaylon1 (Contributor)

@alejoe91 Found the source. Exploring ideas to make it quicker.

@mavaylon1 (Contributor)

@alejoe91 So I found an issue with my toy example. The time was being inflated by the spike times, which reveals a separate optimization issue to tackle around resolving vector index info. With that removed, the timings I get still show that Zarr takes twice as long as HDF5, but I don't see a 10-minute wait; it's more like 4 seconds (Zarr) vs 2 seconds (HDF5).

I assume this is just down to differences between our files. That said, Zarr shouldn't take twice as long. I can keep working on this, but we'd need to test against your data to make sure the fix is impactful.

@mavaylon1 (Contributor)

@alejoe91 You mention that it "copies the entire dynamic table region into the dataframe." That raises the question of what you would want to see in the to_dataframe() view of the units table. Would you want the electrodes column to just be a column of indices into the electrodes table? Would you want an argument in to_dataframe to toggle this?

@alejoe91 (Collaborator, Author) commented Feb 7, 2025

Hi @mavaylon1

Thank you so much for working on this! The new times are perfectly acceptable :)

There's already an option not to expand the electrodes, using units.to_dataframe(index=True), so no action needed there!
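For anyone else hitting this, a minimal sketch of the two modes (reusing the nwbfile_zarr handle from the snippets above):

# index=True keeps the electrodes column as raw row indices into the electrodes table (fast)
units_df_indices = nwbfile_zarr.units.to_dataframe(index=True)

# index=False expands each unit's electrodes into rows of the electrodes table,
# which is what triggers the slow per-reference resolution
units_df_expanded = nwbfile_zarr.units.to_dataframe(index=False)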

@mavaylon1 (Contributor) commented Feb 7, 2025

@alejoe91 Is it possible for me to get access to the NWB file? Those times aren't new times (sadly). I was saying that the example I created to mimic your issue was being inflated by other issues, which I will make a ticket for. I created a new example to focus on the electrodes, and those are the times I am getting (800 units and 400 electrodes).

@alejoe91 (Collaborator, Author) commented Feb 7, 2025

Yep, here's a link to a test file where the issue is quite relevant because it has over 1000 units!

It's still uploading, so give it a few minutes!

@mavaylon1 (Contributor)

@alejoe91 What is the file size? I tried downloading it to my large-memory workstation via gdown, but it has too many files. I can download it locally, but I need to know the size of the Zarr file. Drive doesn't show it; do you know the size?

@mavaylon1 (Contributor)

@alejoe91 Ryan and I talked on Friday. We have one idea to try. I am implementing that now and will test with your file once we figure out where to put it due to file size.

@alejoe91 (Collaborator, Author)

> @alejoe91 Ryan and I talked on Friday. We have one idea to try. I am implementing that now and will test with your file once we figure out where to put it due to file size.

It should be 248.5 MB

@mavaylon1 (Contributor) commented Feb 10, 2025

@alejoe91 Can't seem to read the file. Let me investigate; it has some kind of dateutil parser issue.

@alejoe91 (Collaborator, Author)

Maybe some Zarr version mismatch? Which zarr/hdmf-zarr versions are you using?

@mavaylon1 (Contributor)

@alejoe91 I see that your subject information is all empty. Somewhere this leads to those values being set to 0 (I looked at the builder), which causes issues in the object mapper because the type is wrong. As for versioning, we do have different versions, but I don't recall anything relevant on the release milestones. I will use your versions to see if I get the same issue.

@alejoe91 (Collaborator, Author)

Let me get back to you on this. I'll double check the file and that the upload was correct.

@alejoe91 (Collaborator, Author)

Ok, not 10 minutes, but it still takes a while:

I'm just running this:

import time
from hdmf_zarr import NWBZarrIO

io = NWBZarrIO("test_nwb/", mode="r")

nwbfile = io.read()


t_start = time.perf_counter()
units = nwbfile.units.to_dataframe()
t_stop = time.perf_counter()
print(f"Elapsed {t_stop - t_start} s")

and getting:

Elapsed 98.72519794500113 s

Note that there is subject info!

Image

@mavaylon1 (Contributor) commented Feb 10, 2025

Okay, so I tried with your env and it still doesn't work. I see in that image that you do actually have species info. I don't think the file is correct. @alejoe91

@alejoe91 (Collaborator, Author)

I think that the upload probably didn't finish. I'll try to reupload tomorrow and send you an updated link.

@mavaylon1 (Contributor)

Try putting it in a tar first. @alejoe91

@alejoe91 (Collaborator, Author)

@mavaylon1 (Contributor)

@alejoe91 It looks to be good now; I will continue. Can I also have the HDF5 version you were using as well (I want to avoid differences in exporting)? I want to compare chunks and compression. Yes, I can go ahead with our idea and it may help. However, it would be useful to see why there is a difference. From my digging, there are a lot of nested loops, which would cause things to slow down when index is False. However, the question remains why HDF5 is faster. As for the file you provided, I can reproduce.
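A rough sketch of that chunk/compression comparison (the dataset path and file names below are placeholders, assuming the zarr v2-style attributes used by hdmf-zarr at this point):

import zarr
import h5py

# inspect chunking/compression of one dataset in the Zarr store
zroot = zarr.open("test_nwb/", mode="r")
spike_times_z = zroot["units/spike_times"]     # placeholder dataset path
print(spike_times_z.chunks, spike_times_z.compressor)

# inspect the same dataset in the HDF5 file
with h5py.File("test_nwb.h5", "r") as f:       # placeholder file name
    spike_times_h = f["units/spike_times"]
    print(spike_times_h.chunks, spike_times_h.compression)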

@oruebel (Contributor) commented Feb 13, 2025

> However, the question remains why HDF5 is faster.

If this is related to references, then the speed difference is probably due to the fact that we need to resolve references manually in Zarr.
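One way to check that hypothesis is to time only the electrodes column, the one backed by a DynamicTableRegion, on both backends (a sketch reusing the handles from the snippets above):

import time

for name, nwbfile in [("zarr", nwbfile_zarr), ("hdf5", nwbfile_hdf5)]:
    t_start = time.perf_counter()
    nwbfile.units["electrodes"][:]   # resolves each region reference into electrode rows
    t_stop = time.perf_counter()
    print(f"{name}: electrodes column resolved in {t_stop - t_start:.2f} s")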

@mavaylon1 (Contributor)

@alejoe91 With my update it is 9.9 seconds. With the old method it is still running: 2+ minutes and counting. I do see some other areas to fix as well, but a fix seems to have been found. Will keep you updated.

@mavaylon1 (Contributor)

@alejoe91 Try this: hdmf-dev/hdmf#1248
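For testing the PR before it is released, one hypothetical way to install the branch (the opt_get branch name is taken from a later comment; adjust if the PR branch lives on a fork):

# hypothetical install of the PR branch from the main hdmf repo:
#   pip install "git+https://github.com/hdmf-dev/hdmf.git@opt_get"
import hdmf
print(hdmf.__version__)   # should report a dev version, e.g. 4.0.1.devN+...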

@mavaylon1 (Contributor)

@alejoe91 How is it on your end?

@alejoe91 (Collaborator, Author)

Testing now

@alejoe91 (Collaborator, Author)

It's actually worse :(

Image

@mavaylon1 (Contributor)

> It's actually worse :(
>
> Image

Well this just got even weirder. I will escalate this to a group discussion because I don't know how to approach this now that it is faster for me but slower for you. @rly

@alejoe91 (Collaborator, Author)

I'll do some more testing on my end and get back to you!

@alejoe91 (Collaborator, Author)

@mavaylon1 I think I was using the wrong branch, sorry for the confusion!

It seems that the fix speeds things up!!!!

hdmf opt_get branch

import hdmf

print(hdmf.__version__)
>>> '4.0.1.dev6+gb94d9b40'

import time
from hdmf_zarr import NWBZarrIO

io = NWBZarrIO("test_nwb/", mode="r")

nwbfile = io.read()


t_start = time.perf_counter()
units = nwbfile.units.to_dataframe()
t_stop = time.perf_counter()
print(f"Elapsed {t_stop - t_start} s")

>>> Elapsed 1.1025180209999235 s

hdmf 4.0.0

import hdmf

print(hdmf.__version__)
>>> '4.0.0'

import time
from hdmf_zarr import NWBZarrIO

io = NWBZarrIO("test_nwb/", mode="r")

nwbfile = io.read()


t_start = time.perf_counter()
units = nwbfile.units.to_dataframe()
t_stop = time.perf_counter()
print(f"Elapsed {t_stop - t_start} s")

>>> Elapsed 76.43624574499972 s

And here's the dataset I showed yesterday, which went from >400s to ~8s!

Image

@mavaylon1 (Contributor)

@alejoe91 Okay great!!! Let me finish the PR and we will get it merged asap

@mavaylon1 (Contributor)

@alejoe91 The fix has been merged into dev. The next bug release is first week of March. Are you okay with just using the dev branch for now?
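Until that release, a hypothetical way to pull the fix from the dev branch:

# hypothetical install of the hdmf development branch containing the fix:
#   pip install "git+https://github.com/hdmf-dev/hdmf.git@dev"
import hdmf
print(hdmf.__version__)   # confirm a post-4.0.0 dev version is picked up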

@alejoe91 (Collaborator, Author)

Sure, no problem at all! Thanks for the help on this @mavaylon1!!!
