Write references in compound datasets as an array #146
Conversation
Implemented writing as a compound dset; I used the information from … So just after the dset is written:

```
>>> (Pdb) dset[0]
(3798, 1, {'source': '.', 'path': '/processing/stimulus/timestamps', 'object_id': 'ff28f35d-9211-4bdb-b026-ad27d46367a3', 'source_object_id': 'a695c6dd-5c46-428d-a927-5c2027c4b653'})
>>> (Pdb) dset.dtype
dtype([('idx_start', '<i4'), ('count', '<i4'), ('timeseries', 'O')])
```

This version also doesn't have the same problem as #149: since the dataset itself is given the dtype, the returned type is a numpy void type rather than a tuple:

```
>>> (Pdb) type(dset[0])
<class 'numpy.void'>
>>> (Pdb) dset[0][1] = 5
>>> (Pdb) # (it works)
```

whereas if the zarr dataset is created with …:

```
>>> (Pdb) type(dset[0])
<class 'tuple'>
>>> (Pdb) dset[0][1] = 5
*** TypeError: 'tuple' object does not support item assignment
```

all tests pass. not sure what tests i would need to add here for this specific behavior, but lmk
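A minimal numpy-only sketch of the void vs. tuple distinction above (illustrative only, not the hdmf-zarr code; the field names are copied from the dtype printed earlier):

```python
import numpy as np

# Structured array with the same field layout as the compound dtype shown above.
arr = np.zeros(3, dtype=[('idx_start', '<i4'), ('count', '<i4'), ('timeseries', 'O')])

row = arr[0]
print(type(row))          # <class 'numpy.void'> -- a structured scalar
row[1] = 5                # item assignment works on a numpy.void

as_tuple = arr[0].item()  # .item() converts the structured scalar to a plain tuple
try:
    as_tuple[1] = 5
except TypeError as err:
    print(err)            # 'tuple' object does not support item assignment
```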
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##              dev     #146      +/-   ##
==========================================
+ Coverage   85.16%   85.59%   +0.42%
==========================================
  Files           5        5
  Lines        1146     1159      +13
  Branches      291      295       +4
==========================================
+ Hits          976      992      +16
+ Misses        114      110       -4
- Partials       56       57       +1
```

☔ View full report in Codecov by Sentry.
@sneakers-the-rat I'll take a look at the code coverage tonight, but from a quick glance it is passing.
@sneakers-the-rat Looks good to me. I can update the changelog for you if you'd like. After that, I'll do one last review and approve it.
@mavaylon1 why don't you handle the changelog, since there's probably a way y'all like to write it/organize it with whatever other PRs have been going on, etc. :) lemme try to write a test, even if it's just as simple as testing that the write ends up in the right format (idk, seems like a test for how long the operation took would be pretty flaky)
Looking at the coverage, no further test is needed.
Added a quick test that the array was written as a compound array - lmk if that's not the right way to test that. It hardcodes values linked to the test dataset values, but that seemed like the best way to differentiate that vs. tuple behavior given the way the rest of the test is written. It also tests both the special cases for storage of objects and strings.

edit: omg, reading the rest of the tests i think i wrote this in the wrong place, one sec
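For reference, a sketch of the kind of assertion such a test can make (a hypothetical helper, not the actual test code; the field names come from the dtype shown earlier in this thread):

```python
import numpy as np

def check_compound_storage(dset):
    # The written dataset should carry a named structured (compound) dtype...
    assert dset.dtype.names == ('idx_start', 'count', 'timeseries')
    # ...and rows should come back as structured scalars, not plain tuples.
    row = dset[0]
    assert isinstance(row, np.void)
    assert not isinstance(row, tuple)
```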
So this should already be tested with existing code - that's why it was passing coverage prior.
moving this out of edit so it stays in sequence:
@sneakers-the-rat Thanks for the updates and test!
@mavaylon1 Code coverage measures whether every line of the codebase has been run by the test suite. It doesn't reflect whether the codebase does what it is supposed to do or whether the new code correctly fixes the issue. In general, when an issue is found, a test is added to confirm the fix and keep the issue from recurring.
In this case the test also revealed a failure on windows, which is fun. Lemme debug
@rly I'm aware. My point is that there already should be a test that checks compound data is written as compound data. If not, then I understand adding one. The changes here, in practice, shouldn't break any tests because the results should be the same. Checking whether the data was written as compound data correctly should be the same as the tests that already check compound data (again, these should already exist). From my understanding, the issue isn't a bug but an optimization problem. The "fix" should already be covered by existing tests.
From what i could tell, the checks that exist do pass, but they don't check explicitly for storage format, e.g. the adjoining tests: https://github.com/sneakers-the-rat/hdmf-zarr/blob/63922aea2339f64b96803394217498b833afc709/tests/unit/base_tests_zarrio.py#L444

Obviously y'all have a better sense of the tests that exist than I do, so if there already is one then great! I don't really care about the inclusion of the one crappy lil test I added (or even really the storage format) - my main goal here is to just help out making writing happen at a reasonable speed, so lmk what's good <3

edit: sorry, since there is a failing test - i don't have a windows machine to debug, but i can just spam …
I'll look after the break. By all means, if there's a gap in the tests we should fill it. My point was based on the assumption that there are tests that already do what we want, because this isn't a format change - just an order of operations. But if this process "exposed" a gap, then let's fill it.
@sneakers-the-rat I updated the test to our usual use of TestCase `assertEqual`. This gave more insight into the issue on windows: the id types are not matching.
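If the mismatch is about integer width, one classic Windows-only culprit (a guess on my part, not confirmed in this thread) is that numpy < 2.0 uses a 32-bit default integer on Windows, so values written without an explicit dtype can come back as int32 there and int64 on Linux/macOS:

```python
import numpy as np

# numpy 1.x: int64 on Linux/macOS, int32 on Windows (where C long is 32-bit).
print(np.array([1, 2, 3]).dtype)
# The platform default integer type numpy falls back to when none is given.
print(np.dtype(np.int_))
```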
Expand check for string types
Make 'else' a failure condition in `__resolve_dtype_helper__` (rather than implicitly assuming list)
ope. so windows treats … Looking to see if there was an explicit type map, i can see that there is and that I missed it. That also shows that my check for `str in __dtypes` (src/hdmf_zarr/backend.py, line 1032 at 005bfbc) …

It seems like there should be a single string dtype -> numpy dtype map method, but … Feels like the dtype conversion logic is getting pretty dispersed at this point; it would be nice to consolidate that into one or a few maps (seems like it would be good to have separate ones for intended python type vs. storage type, e.g. …).

This might reveal another issue, but i'm not sure - just prior to the changes I made, a … (src/hdmf_zarr/backend.py, line 1010 at 9692a98). That gets saved as a dataset attribute (src/hdmf_zarr/backend.py, line 1020 at 9692a98). So if that is supposed to be used by the reader to re-hydrate the object and cast to correct types, it isn't being used. But i'm not sure that's what that is for, so idk.
The strings don't need to be length 25. I also found that using `<U` would return empty strings downstream. The `U25` was me testing to see if providing a cap would fix that, and it does. That's an open-ended discussion that I thought was fixed with your implementation, but I think there's something lingering that has yet to be solidified. I'm going to dive into this Friday and see what I can come up with.
Basically all that needs to happen is to decide on either a single length, or to calculate the max needed length from the data, ideally in a single mapping method that can get hooked into from the various places one would need to get a zarr storage dtype from a source dtype. Hopefully not too complicated :)
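A rough sketch of what such a single mapping helper could look like (the name, signature, and fallback cap are hypothetical, not the hdmf-zarr API); it sizes fixed-width strings from the data when the data is available and otherwise falls back to a cap:

```python
import numpy as np

# String-like source dtypes, mirroring the check in the diff below.
STRING_TYPES = ('str', 'text', 'utf', 'utf8', 'utf-8', 'isodatetime')

def resolve_storage_dtype(spec, data=None):
    """Map a source dtype spec to a zarr-storable numpy dtype (illustrative only)."""
    if spec is str or spec in STRING_TYPES:
        if data is not None:
            # Calculate the max needed length from the data.
            max_len = max((len(str(v)) for v in data), default=1)
            return np.dtype(f'U{max_len}')
        return np.dtype('U25')  # arbitrary cap when the data can't be inspected
    return np.dtype(spec)

# e.g. resolve_storage_dtype('text', ['a', 'longer string']) -> dtype('<U13')
```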
src/hdmf_zarr/backend.py (outdated)

```python
if field['dtype'] is str or field['dtype'] in (
    'str', 'text', 'utf', 'utf8', 'utf-8', 'isodatetime'
):
    new_dtype.append((field['name'], 'U25'))
```
@oruebel I feel that setting a cap on the length, though quick and working, is not general. I think to get this ticket wrapped up we should decide on a temporary cap and investigate why, on read, the field turns into an empty string if 'U25' is just 'U'. Thoughts?
I'm not sure if Zarr supports variable-length structured data types. That's something to look at. If not, then we may need to either: 1) introspect the data to determine the string length, 2) maybe use 'O' instead of 'U25' in the compound dtype, or 3) store variable-length compound types as objects. Did you try the second option, i.e., use 'O' or 'object' in the dtype instead of 'U'?
It doesn't support variable-length strings, no. Object would almost certainly work at the cost of needing to invoke a JSON serializer, but since we already have to do that for the reference object it would probs be fine. Introspecting the string would also probably be fine but gives me fragility vibes. Can try both - would the deciding factor be perf or what?
> Introspecting the string would also probably be fine but gives me fragility vibes
I agree, introspecting on write is not desirable both from a fragility and performance perspective.
Hey @sneakers-the-rat let me know if you want to finish this up or if you want me to.
Ah, I was not sure what approach we were taking - I can do some tests today to finish it out.
Alright, I didn't do seriously careful profiling here, but switching out the … On a single run (not robust) of the relevant test cases (cProfile's cumulative time): …

But on a "naturalistic" case where I just timed the whole write of the dataset i had referenced in the issue, oddly the object encoding was faster: …

So pretty much the same. I won't add a test case for very long strings, since if I did it would just be a hackjob attached to the presently modified test methods, and that seems like something better suited for a test fixture/parameterization.
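For what it's worth, the single-run numbers above can be reproduced with something like this (a sketch; `run_relevant_tests` is a placeholder for however the modified test cases get invoked, not a helper in this repo):

```python
import cProfile
import pstats

def run_relevant_tests():
    # Placeholder: invoke the modified compound-dataset write tests here.
    pass

profiler = cProfile.Profile()
profiler.enable()
run_relevant_tests()
profiler.disable()

# Sort by cumulative time, as quoted above, and print the top entries.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```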
Force-pushed from ee73396 to 9447447.
Sorry for the force push, just to amend an incorrect commit message.
I think this is good to go at this point :) lmk if changes still needed
Motivation
Fix #144
Writes references in compound datasets as a same-sized array, rather than iteratively as an array of lists.
How to test the behavior?
Convert a dataset with compound datasets and confirm that it doesn't take a really long time (see the sketch below).
Making a more detailed example is blocked for me by #145, and I'm not sure if I should actually work on this since it seems like devs are just doing their own version here (???): #149
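A sketch of what that check could look like, assuming an existing NWB HDF5 file that contains compound datasets (file paths are placeholders; the export call roughly follows the conversion pattern in the hdmf-zarr docs):

```python
import time
from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO

hdf5_path = "example.nwb"        # placeholder: file with compound datasets
zarr_path = "example.nwb.zarr"   # placeholder output path

start = time.perf_counter()
with NWBHDF5IO(hdf5_path, mode="r", load_namespaces=True) as read_io:
    with NWBZarrIO(zarr_path, mode="w") as export_io:
        export_io.export(src_io=read_io, write_args=dict(link_data=False))
print(f"conversion took {time.perf_counter() - start:.1f} s")
```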
Checklist
very tired and so leaving this as a draft for now