Add cloud protocol support, starting with BossDB #41
Conversation
I think this is on a good track. Now you only need to implement the classes in intern_wrapper and add tests.
I have a cache implementation in elf: https://github.com/constantinpape/elf/blob/master/elf/wrapper/cached_volume.py
Yes, I think that should rather go into …
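For readers following along, the idea behind such a cache can be sketched roughly like this. This is a simplified stand-in, not elf's actual cached_volume.py; the class name, chunk policy, and cache size are all illustrative:

```python
import numpy as np
from functools import lru_cache

class CachedVolume:
    """Read-through chunk cache over any numpy-sliceable 3D volume.

    Simplified sketch (NOT elf's actual cached_volume.py): requests are
    served by loading whole chunks from the backend and assembling the
    requested window, so repeated nearby reads hit the cache.
    """

    def __init__(self, backend, chunks=(64, 64, 64), maxsize=128):
        self.backend = backend          # e.g. a remote dataset or np.ndarray
        self.chunks = chunks
        self.shape = backend.shape
        self.dtype = backend.dtype
        self._load = lru_cache(maxsize=maxsize)(self._load_chunk)

    def _load_chunk(self, idx):
        # fetch one chunk, clipped to the volume boundary
        sl = tuple(slice(i * c, min((i + 1) * c, s))
                   for i, c, s in zip(idx, self.chunks, self.shape))
        return self.backend[sl]

    def __getitem__(self, key):
        # key: tuple of three slices with step 1
        starts = [s.start or 0 for s in key]
        stops = [s.stop if s.stop is not None else sh
                 for s, sh in zip(key, self.shape)]
        out = np.empty([b - a for a, b in zip(starts, stops)], dtype=self.dtype)
        lo = [a // c for a, c in zip(starts, self.chunks)]
        hi = [(b - 1) // c for b, c in zip(stops, self.chunks)]
        for i in range(lo[0], hi[0] + 1):
            for j in range(lo[1], hi[1] + 1):
                for k in range(lo[2], hi[2] + 1):
                    chunk = self._load((i, j, k))
                    org = [n * c for n, c in zip((i, j, k), self.chunks)]
                    # overlap of chunk and request in global coordinates
                    g0 = [max(a, o) for a, o in zip(starts, org)]
                    g1 = [min(b, o + s)
                          for b, o, s in zip(stops, org, chunk.shape)]
                    src = tuple(slice(a - o, b - o)
                                for a, b, o in zip(g0, g1, org))
                    dst = tuple(slice(a - s, b - s)
                                for a, b, s in zip(g0, g1, starts))
                    out[dst] = chunk[src]
        return out
```

Because chunk loading goes through `lru_cache`, a second read of an overlapping region is served from memory instead of re-hitting the backend.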
@constantinpape — ready for your review and thoughts, I think! I added tests and as far as I can tell, things are playing nicely. Curious to see if the test workflow passes :)
Looks good overall. We could still add some chunking information to the wrapper if possible and slightly extend the tests.
    # TODO chunks are arbitrary, how do we handle this?
    @property
    def chunks(self):
Is there some chunking that should be observed? (Even if it's not exposed as chunks
by the intern API?)
We ran some benchmarks and found that there's not an enormous difference between "cuboid-aligned" and "non-aligned" reads with BossDB in terms of performance, because of the server-side cache. We can add the 512×512×64 chunks here, but it's luckily not a big contributor/detractor to performance! I figured it made more sense to remain "agnostic" here, in the same way MRC does.
And are parallel writes to data with overlapping chunks ok? If yes, and if performance is not an issue, we can indeed return None.
(Note that this will need some updates in cluster_tools then, but I think it's better to update it there so that it can deal with arbitrary chunk sizes rather than adding an artificial one here.)
test/io_tests/test_intern_wrapper.py
Outdated
    ds = InternDataset("bossdb://witvliet2020/Dataset_1/em")
    cutout = ds[210:211, 7000:7064, 7000:7064]
    self.assertEqual(cutout.shape, (1, 64, 64))
Can you also download directly via the intern API here and check for equality?
(I know that the wrapper is very thin, so this is very unlikely to fail, but I think it's better to be more careful in the tests ;))
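A sketch of the equality check being asked for, using an in-memory stand-in for the remote channel (the real test would download the same cutout twice, once through the wrapper and once directly via intern; `FakeRemote` and this `InternDataset` fragment are illustrative, not the PR's code):

```python
import numpy as np

class FakeRemote:
    """In-memory stand-in for a bossdb channel; hypothetical, since the
    real test would hit the network through the intern API."""
    def __init__(self, data):
        self.data = data
    def __getitem__(self, key):
        return self.data[key]

class InternDataset:
    """Thin wrapper fragment: just forwards slicing, like the PR's wrapper."""
    def __init__(self, remote):
        self._remote = remote
    def __getitem__(self, key):
        return self._remote[key]

volume = np.random.randint(0, 255, size=(4, 128, 128), dtype=np.uint8)
remote = FakeRemote(volume)
ds = InternDataset(remote)

# the equality check: wrapper cutout vs. the "direct" download path
cutout = ds[0:1, 0:64, 0:64]
direct = remote[0:1, 0:64, 0:64]
assert np.array_equal(cutout, direct)
assert cutout.shape == (1, 64, 64)
```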
Definitely! I meant to ask how you wanted me to handle this to keep our tests in this repo isolated from the "in-use" database. My original plan was to download from a dataset and check a handful of voxels for equality, but that feels sloppy. We could also include a small .npy array in the tests directory to check full-array equality of a few cutouts, but that feels like adding unnecessary clutter.
My instinct is to do the former:
data = InternDataset("bossdb:// ... ") # some known, public dataset
assert data[100, 200, 300] == 137 # magic number
assert data[10, 300, 200] == 42 # magic number
How does that sound?
> My instinct is to do the former:
> data = InternDataset("bossdb:// ... ") # some known, public dataset
> assert data[100, 200, 300] == 137 # magic number
> assert data[10, 300, 200] == 42 # magic number
> How does that sound?
Yes, I think that's a good solution!
(Just added this in the latest commit!)
I'm realizing the new tests are being skipped because intern isn't installed in the CI environment:

test_can_access_dataset (io_tests.test_intern_wrapper.TestInternWrapper) ... skipped 'Needs intern (pip install intern)'
test_can_download_dataset (io_tests.test_intern_wrapper.TestInternWrapper) ... skipped 'Needs intern (pip install intern)'
test_file (io_tests.test_intern_wrapper.TestInternWrapper) ... skipped 'Needs intern (pip install intern)'

Mind if I add it to the test suite? (I think the right way to do this is to add intern to the environment file: https://github.com/constantinpape/elf/blob/master/.github/workflows/environment.yaml#L5 — does that sound right?)
Yes, please go ahead and add it to the env file.
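The skip messages above come from the standard unittest pattern for optional dependencies; a sketch of how such a guard typically looks (the `elf.io.intern_wrapper` module path and the test body are assumptions about the PR's layout, not confirmed by the thread):

```python
import importlib.util
import unittest

# Skip the whole TestCase when the optional dependency is missing, so the
# suite still passes on machines without intern installed.
HAVE_INTERN = importlib.util.find_spec("intern") is not None

@unittest.skipUnless(HAVE_INTERN, "Needs intern (pip install intern)")
class TestInternWrapper(unittest.TestCase):
    def test_can_access_dataset(self):
        # module path and assertions are assumptions for illustration
        from elf.io.intern_wrapper import InternDataset
        ds = InternDataset("bossdb://witvliet2020/Dataset_1/em")
        self.assertEqual(len(ds.shape), 3)
```

With this guard in place, adding intern to the CI environment file is what actually activates the tests.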
Good that we activated the tests, looks like there is indeed something wrong ;):
Oops. The thing that's wrong is that I don't know how to copy and paste correctly! :) The test was wrong, not the code. A good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we only need to return the dtype as an np.dtype (I went ahead and did this directly).
@j6k4m8 looks like tests are passing now. This is good to be merged from my side. Anything you still want to add?
Awesome!! Looks like it's still broken on Windows? (idk if that's expected..?) I'm good to merge this once you're happy with it! My next step is to write a simple cloud segmentation example, "read from the cloud, segment, and write the completed seg back to the cloud"... I think in cluster_tools! :)
Sorry, I only waited till the Linux tests passed to write this...
Sounds good!
It looks like there is some issue with the conda …
This PR starts to add support for cloud datasets by passing a file path with one of the neuroglancer "protocol"-style prefixes (e.g. bossdb://). It works by shimming a File and Dataset class to wrap the intern library so that it behaves like the h5py.File API.

(For reference, all public BossDB datasets are listed here; Janelia DVID data (dvid://) are listed here. As far as I know, there isn't a central repository for CloudVolume-format (precomputed://) datasets.)

Some more discussion in constantinpape/cluster_tools#23
Still to-do:
- Add optional local cache (?) Is this necessary, or are slices stored temporarily in elf / cluster_tools workflows?
- Write some example workflows (maybe belongs in https://github.com/constantinpape/cluster_tools)

Feedback welcome, @constantinpape! :)