Write references in compound datasets as an array #146
Conversation
Implemented writing as a compound dset; I used the information from … So just after the dset is written:

```
>>> (Pdb) dset[0]
(3798, 1, {'source': '.', 'path': '/processing/stimulus/timestamps', 'object_id': 'ff28f35d-9211-4bdb-b026-ad27d46367a3', 'source_object_id': 'a695c6dd-5c46-428d-a927-5c2027c4b653'})
>>> (Pdb) dset.dtype
dtype([('idx_start', '<i4'), ('count', '<i4'), ('timeseries', 'O')])
```

This version also doesn't have the same problem as #149: since the dataset itself is given the dtype, the returned type is a numpy void type rather than a tuple:

```
>>> (Pdb) type(dset[0])
<class 'numpy.void'>
>>> (Pdb) dset[0][1] = 5
>>> (Pdb) # (it works)
```

whereas if the zarr dataset is created with …:

```
>>> (Pdb) type(dset[0])
<class 'tuple'>
>>> (Pdb) dset[0][1] = 5
*** TypeError: 'tuple' object does not support item assignment
```

all tests pass. not sure what tests i would need to add here for this specific behavior, but lmk
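A minimal numpy-only sketch of the void vs. tuple distinction above (illustrative only, not the hdmf-zarr code; the field names are copied from the dtype printed earlier):

```python
import numpy as np

# Structured array with the same field layout as the compound dtype shown above.
arr = np.zeros(3, dtype=[('idx_start', '<i4'), ('count', '<i4'), ('timeseries', 'O')])

row = arr[0]
print(type(row))          # <class 'numpy.void'> -- a structured scalar
row[1] = 5                # item assignment works on a numpy.void

as_tuple = arr[0].item()  # .item() converts the structured scalar to a plain tuple
try:
    as_tuple[1] = 5
except TypeError as err:
    print(err)            # 'tuple' object does not support item assignment
```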
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##              dev     #146      +/-   ##
==========================================
+ Coverage   85.16%   85.59%   +0.42%
==========================================
  Files           5        5
  Lines        1146     1159      +13
  Branches      291      295       +4
==========================================
+ Hits          976      992      +16
+ Misses        114      110       -4
- Partials       56       57       +1
```

☔ View full report in Codecov by Sentry.
@sneakers-the-rat I'll take a look at the code coverage tonight, but from a quick glance it is passing.
@sneakers-the-rat Looks good to me. I can update the changelog for you if you'd like. After that, I'll do one last review and approve it.
@mavaylon1 why don't you handle the changelog, since there's probably a way y'all like to write it/organize it with whatever other PRs have been going on, etc. :) lemme try to write a test, even if it's just as simple as testing that the write ends up in the right format (idk, seems like a test for how long the operation took would be pretty flaky)
Looking at the coverage, no further test is needed.
Added a quick test that the array was written as a compound array - lmk if that's not the right way to test that. It hardcodes values linked to the test dataset values, but that seemed like the best way to differentiate that vs. tuple behavior given the way the rest of the test is written. It also tests both the special cases for storage of objects and strings.

edit: omg, reading the rest of the tests i think i wrote this in the wrong place, one sec
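For reference, a sketch of the kind of assertion such a test can make (a hypothetical helper, not the actual test code; the field names come from the dtype shown earlier in this thread):

```python
import numpy as np

def check_compound_storage(dset):
    # The written dataset should carry a named structured (compound) dtype...
    assert dset.dtype.names == ('idx_start', 'count', 'timeseries')
    # ...and rows should come back as structured scalars, not plain tuples.
    row = dset[0]
    assert isinstance(row, np.void)
    assert not isinstance(row, tuple)
```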
So this should already be tested with existing code - that's why it was passing coverage prior.
moving this out of edit so it stays in sequence:
@sneakers-the-rat Thanks for the updates and test!
@mavaylon1 Code coverage measures whether every line of the codebase has been run by the test suite. It doesn't reflect whether the codebase does what it is supposed to do or whether the new code correctly fixes the issue. In general, when an issue is found, a test is added to confirm the fix and keep the issue from recurring.
In this case the test also revealed a failure on windows, which is fun. Lemme debug
@rly I'm aware. My point is that there already should be a test that checks compound data is written as compound data. If not, then I understand adding one. The changes here, in practice, shouldn't break any tests because the results should be the same. Checking whether the data was written as compound data correctly should be the same as the tests that already check compound data (again, these should already exist). From my understanding, the issue isn't a bug but an optimization problem. The "fix" should already be covered by existing tests.
From what i could tell, the checks that exist do pass, but they don't check explicitly for storage format, e.g. the adjoining tests: https://github.com/sneakers-the-rat/hdmf-zarr/blob/63922aea2339f64b96803394217498b833afc709/tests/unit/base_tests_zarrio.py#L444

Obviously y'all have a better sense of the tests that exist than I do, so if there already is one then great! I don't really care about the inclusion of the one crappy lil test I added (or even really the storage format) - my main goal here is to just help out making writing happen at a reasonable speed, so lmk what's good <3

edit: sorry, since there is a failing test - i don't have a windows machine to debug, but i can just spam …
I'll look after the break. By all means, if there's a gap in the tests we should fill it. My point was based on the assumption that there are tests that already do what we want, because this isn't a format change - just an order of operations. But if this process "exposed" a gap, then let's fill it.
@sneakers-the-rat I updated the test to our usual use of TestCase `assertEqual`. This gave more insight into the issue on windows: the id types are not matching.
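If the mismatch is about integer width, one classic Windows-only culprit (a guess on my part, not confirmed in this thread) is that numpy < 2.0 uses a 32-bit default integer on Windows, so values written without an explicit dtype can come back as int32 there and int64 on Linux/macOS:

```python
import numpy as np

# numpy 1.x: int64 on Linux/macOS, int32 on Windows (where C long is 32-bit).
print(np.array([1, 2, 3]).dtype)
# The platform default integer type numpy falls back to when none is given.
print(np.dtype(np.int_))
```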
Expand check for string types
Make 'else' a failure condition in `__resolve_dtype_helper__` (rather than implicitly assuming list)
ope. so windows treats … Looking to see if there was an explicit type map, i can see that there is and that I missed it. That also shows that my check for `str in __dtypes` (src/hdmf_zarr/backend.py, line 1032 at 005bfbc) …

It seems like there should be a single string dtype -> numpy dtype map method, but … Feels like the dtype conversion logic is getting pretty dispersed at this point; it would be nice to consolidate that into one or a few maps (seems like it would be good to have separate ones for intended python type vs. storage type, e.g. …).

This might reveal another issue, but i'm not sure - just prior to the changes I made, a … (src/hdmf_zarr/backend.py, line 1010 at 9692a98). That gets saved as a dataset attribute (src/hdmf_zarr/backend.py, line 1020 at 9692a98). So if that is supposed to be used by the reader to re-hydrate the object and cast to correct types, it isn't being used. But i'm not sure that's what that is for, so idk.
The strings don't need to be length 25. I also found that using `<U` would return empty strings downstream. The `U25` was me testing to see if providing a cap would fix that, and it does. That's an open-ended discussion that I thought was fixed with your implementation, but I think there's something lingering that has yet to be solidified. I'm going to dive into this Friday and see what I can come up with.
Basically all that needs to happen is to decide on either a single length, or to calculate the max needed length from the data, ideally in a single mapping method that can get hooked into from the various places one would need to get a zarr storage dtype from a source dtype. Hopefully not too complicated :)
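A rough sketch of what such a single mapping helper could look like (the name, signature, and fallback cap are hypothetical, not the hdmf-zarr API); it sizes fixed-width strings from the data when the data is available and otherwise falls back to a cap:

```python
import numpy as np

# String-like source dtypes, mirroring the check in the diff below.
STRING_TYPES = ('str', 'text', 'utf', 'utf8', 'utf-8', 'isodatetime')

def resolve_storage_dtype(spec, data=None):
    """Map a source dtype spec to a zarr-storable numpy dtype (illustrative only)."""
    if spec is str or spec in STRING_TYPES:
        if data is not None:
            # Calculate the max needed length from the data.
            max_len = max((len(str(v)) for v in data), default=1)
            return np.dtype(f'U{max_len}')
        return np.dtype('U25')  # arbitrary cap when the data can't be inspected
    return np.dtype(spec)

# e.g. resolve_storage_dtype('text', ['a', 'longer string']) -> dtype('<U13')
```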
src/hdmf_zarr/backend.py (outdated)

```python
if field['dtype'] is str or field['dtype'] in (
    'str', 'text', 'utf', 'utf8', 'utf-8', 'isodatetime'
):
    new_dtype.append((field['name'], 'U25'))
```
@oruebel I feel that setting a cap on the length, though quick and working, is not general. I think to get this ticket wrapped up we should decide on a temporary cap and investigate why, on read, the field turns into an empty string if 'U25' is just 'U'. Thoughts?
I'm not sure if Zarr supports variable-length structured data types. That's something to look at. If not, then we may need to either: 1) introspect the data to determine the string length, 2) maybe use 'O' instead of 'U25' in the compound dtype, or 3) store variable-length compound types as objects. Did you try the second option, i.e., use 'O' or 'object' in the dtype instead of 'U'?
It doesn't support variable-length strings, no. Object would almost certainly work at the cost of needing to invoke a JSON serializer, but since we already have to do that for the reference object it would probs be fine. Introspecting the string would also probably be fine but gives me fragility vibes. Can try both - would the deciding factor be perf or what?
> Introspecting the string would also probably be fine but gives me fragility vibes
I agree, introspecting on write is not desirable both from a fragility and performance perspective.
Hey @sneakers-the-rat let me know if you want to finish this up or if you want me to.
Ah, I was not sure what approach we were taking - I can do some tests today to finish it out.
Alright, I didn't do seriously careful profiling here, but switching out the … On a single run (not robust) of the relevant test cases (cProfile's cumulative time): …

But on a "naturalistic" case where I just timed the whole write of the dataset i had referenced in the issue, oddly the object encoding was faster: …

So pretty much the same. I won't add a test case for very long strings, since if I did it would just be a hackjob attached to the presently modified test methods, and that seems like something better suited for a test fixture/parameterization.
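For what it's worth, the single-run numbers above can be reproduced with something like this (a sketch; `run_relevant_tests` is a placeholder for however the modified test cases get invoked, not a helper in this repo):

```python
import cProfile
import pstats

def run_relevant_tests():
    # Placeholder: invoke the modified compound-dataset write tests here.
    pass

profiler = cProfile.Profile()
profiler.enable()
run_relevant_tests()
profiler.disable()

# Sort by cumulative time, as quoted above, and print the top entries.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```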
Force-pushed from ee73396 to 9447447.
Sorry for the force push, just to amend an incorrect commit message.
I think this is good to go at this point :) lmk if changes still needed
Motivation
Fix #144
Writes references in compound datasets as a same-sized array, rather than iteratively as an array of lists.
How to test the behavior?
Convert a dataset with compound datasets and confirm that it doesn't take a really long time (see the sketch below).
Making a more detailed example is blocked for me by #145, and I'm not sure if I should actually work on this since it seems like devs are just doing their own version here (???): #149
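A sketch of what that check could look like, assuming an existing NWB HDF5 file that contains compound datasets (file paths are placeholders; the export call roughly follows the conversion pattern in the hdmf-zarr docs):

```python
import time
from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO

hdf5_path = "example.nwb"        # placeholder: file with compound datasets
zarr_path = "example.nwb.zarr"   # placeholder output path

start = time.perf_counter()
with NWBHDF5IO(hdf5_path, mode="r", load_namespaces=True) as read_io:
    with NWBZarrIO(zarr_path, mode="w") as export_io:
        export_io.export(src_io=read_io, write_args=dict(link_data=False))
print(f"conversion took {time.perf_counter() - start:.1f} s")
```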
Checklist
very tired and so leaving this as a draft for now