Add cloud protocol support, starting with BossDB #41
Conversation
I think this is on a good track. Now you only need to implement the classes in intern_wrapper and add tests.
I have a cache implementation in elf: https://github.com/constantinpape/elf/blob/master/elf/wrapper/cached_volume.py
Yes, I think that should rather go into …
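For readers following along, the idea behind such a cache can be sketched roughly like this. This is a simplified stand-in, not elf's actual cached_volume.py; the class name, chunk policy, and cache size are all illustrative:

```python
import numpy as np
from functools import lru_cache

class CachedVolume:
    """Read-through chunk cache over any numpy-sliceable 3D volume.

    Simplified sketch (NOT elf's actual cached_volume.py): requests are
    served by loading whole chunks from the backend and assembling the
    requested window, so repeated nearby reads hit the cache.
    """

    def __init__(self, backend, chunks=(64, 64, 64), maxsize=128):
        self.backend = backend          # e.g. a remote dataset or np.ndarray
        self.chunks = chunks
        self.shape = backend.shape
        self.dtype = backend.dtype
        self._load = lru_cache(maxsize=maxsize)(self._load_chunk)

    def _load_chunk(self, idx):
        # fetch one chunk, clipped to the volume boundary
        sl = tuple(slice(i * c, min((i + 1) * c, s))
                   for i, c, s in zip(idx, self.chunks, self.shape))
        return self.backend[sl]

    def __getitem__(self, key):
        # key: tuple of three slices with step 1
        starts = [s.start or 0 for s in key]
        stops = [s.stop if s.stop is not None else sh
                 for s, sh in zip(key, self.shape)]
        out = np.empty([b - a for a, b in zip(starts, stops)], dtype=self.dtype)
        lo = [a // c for a, c in zip(starts, self.chunks)]
        hi = [(b - 1) // c for b, c in zip(stops, self.chunks)]
        for i in range(lo[0], hi[0] + 1):
            for j in range(lo[1], hi[1] + 1):
                for k in range(lo[2], hi[2] + 1):
                    chunk = self._load((i, j, k))
                    org = [n * c for n, c in zip((i, j, k), self.chunks)]
                    # overlap of chunk and request in global coordinates
                    g0 = [max(a, o) for a, o in zip(starts, org)]
                    g1 = [min(b, o + s)
                          for b, o, s in zip(stops, org, chunk.shape)]
                    src = tuple(slice(a - o, b - o)
                                for a, b, o in zip(g0, g1, org))
                    dst = tuple(slice(a - s, b - s)
                                for a, b, s in zip(g0, g1, starts))
                    out[dst] = chunk[src]
        return out
```

Because chunk loading goes through `lru_cache`, a second read of an overlapping region is served from memory instead of re-hitting the backend.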
@constantinpape — ready for your review and thoughts, I think! I added tests and as far as I can tell, things are playing nicely. Curious to see if the test workflow passes :)
Looks good overall. We could still add some chunking information to the wrapper if possible and slightly extend the tests.
    # TODO chunks are arbitrary, how do we handle this?
    @property
    def chunks(self):
Is there some chunking that should be observed? (Even if it's not exposed as chunks
by the intern API?)
We ran some benchmarks and found that there's not an enormous difference between "cuboid-aligned" and "non-aligned" reads with BossDB in terms of performance, because of the server-side cache. We can add the 512×512×64 chunks here, but it's luckily not a big contributor/detractor to performance! I figured it made more sense to remain "agnostic" here, in the same way MRC does.
And are parallel writes to data with overlapping chunks ok? If yes, and if performance is not an issue, we can indeed return None.
(Note that this will need some updates in cluster_tools then, but I think it's better to update it there so that it can deal with arbitrary chunk sizes rather than adding an artificial one here.)
test/io_tests/test_intern_wrapper.py
Outdated
    ds = InternDataset("bossdb://witvliet2020/Dataset_1/em")
    cutout = ds[210:211, 7000:7064, 7000:7064]
    self.assertEqual(cutout.shape, (1, 64, 64))
Can you also download directly via the intern API here and check for equality?
(I know that the wrapper is very thin, so this is very unlikely to fail, but I think it's better to be more careful in the tests ;))
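A sketch of the equality check being asked for, using an in-memory stand-in for the remote channel (the real test would download the same cutout twice, once through the wrapper and once directly via intern; `FakeRemote` and this `InternDataset` fragment are illustrative, not the PR's code):

```python
import numpy as np

class FakeRemote:
    """In-memory stand-in for a bossdb channel; hypothetical, since the
    real test would hit the network through the intern API."""
    def __init__(self, data):
        self.data = data
    def __getitem__(self, key):
        return self.data[key]

class InternDataset:
    """Thin wrapper fragment: just forwards slicing, like the PR's wrapper."""
    def __init__(self, remote):
        self._remote = remote
    def __getitem__(self, key):
        return self._remote[key]

volume = np.random.randint(0, 255, size=(4, 128, 128), dtype=np.uint8)
remote = FakeRemote(volume)
ds = InternDataset(remote)

# the equality check: wrapper cutout vs. the "direct" download path
cutout = ds[0:1, 0:64, 0:64]
direct = remote[0:1, 0:64, 0:64]
assert np.array_equal(cutout, direct)
assert cutout.shape == (1, 64, 64)
```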
Definitely! I meant to ask how you wanted me to handle this to keep our tests in this repo isolated from the "in-use" database. My original plan was to download from a dataset and check a handful of voxels for equality, but that feels sloppy. We could also include a small .npy array in the tests directory to check full-array equality of a few cutouts, but that feels like adding unnecessary clutter.
My instinct is to do the former:
data = InternDataset("bossdb:// ... ") # some known, public dataset
assert data[100, 200, 300] == 137 # magic number
assert data[10, 300, 200] == 42 # magic number
How does that sound?
> My instinct is to do the former:
> data = InternDataset("bossdb:// ... ") # some known, public dataset
> assert data[100, 200, 300] == 137 # magic number
> assert data[10, 300, 200] == 42 # magic number
> How does that sound?
Yes, I think that's a good solution!
(Just added this in the latest commit!)
I'm realizing the new tests are being skipped because intern isn't installed in the CI environment:

test_can_access_dataset (io_tests.test_intern_wrapper.TestInternWrapper) ... skipped 'Needs intern (pip install intern)'
test_can_download_dataset (io_tests.test_intern_wrapper.TestInternWrapper) ... skipped 'Needs intern (pip install intern)'
test_file (io_tests.test_intern_wrapper.TestInternWrapper) ... skipped 'Needs intern (pip install intern)'

Mind if I add it to the test suite? (I think the right way to do this is to add intern to the environment file: https://github.com/constantinpape/elf/blob/master/.github/workflows/environment.yaml#L5 — does that sound right?)
Yes, please go ahead and add it to the env file.
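The skip messages above come from the standard unittest pattern for optional dependencies; a sketch of how such a guard typically looks (the `elf.io.intern_wrapper` module path and the test body are assumptions about the PR's layout, not confirmed by the thread):

```python
import importlib.util
import unittest

# Skip the whole TestCase when the optional dependency is missing, so the
# suite still passes on machines without intern installed.
HAVE_INTERN = importlib.util.find_spec("intern") is not None

@unittest.skipUnless(HAVE_INTERN, "Needs intern (pip install intern)")
class TestInternWrapper(unittest.TestCase):
    def test_can_access_dataset(self):
        # module path and assertions are assumptions for illustration
        from elf.io.intern_wrapper import InternDataset
        ds = InternDataset("bossdb://witvliet2020/Dataset_1/em")
        self.assertEqual(len(ds.shape), 3)
```

With this guard in place, adding intern to the CI environment file is what actually activates the tests.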
Good that we activated the tests, looks like there is indeed something wrong ;):
Oops. The thing that's wrong is that I don't know how to copy and paste correctly! :) The test was wrong, not the code. A good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we only need to return the dtype as an np.dtype (I went ahead and did this directly).
@j6k4m8 looks like tests are passing now. This is good to be merged from my side. Anything you still want to add?
Awesome!! Looks like it's still broken on Windows? (idk if that's expected..?) I'm good to merge this once you're happy with it! My next step is to write a simple cloud segmentation example, "read from the cloud, segment, and write the completed seg back to the cloud"... I think in cluster_tools! :)
Sorry, I only waited till the Linux tests passed to write this...
Sounds good!
It looks like there is some issue with the conda …
This PR starts to add support for cloud datasets by passing a file path with one of the neuroglancer "protocol"-style prefixes (e.g. bossdb://). It works by shimming a File and Dataset class to wrap the intern library so that it behaves like the h5py.File API.

(For reference, all public BossDB datasets are listed here; Janelia DVID data (dvid://) are listed here. As far as I know, there isn't a central repository for CloudVolume-format (precomputed://) datasets.)

Some more discussion in constantinpape/cluster_tools#23
Still to-do:
- Add optional local cache (?) Is this necessary, or are slices stored temporarily in elf / cluster_tools workflows?
- Write some example workflows (maybe belongs in https://github.com/constantinpape/cluster_tools)

Feedback welcome, @constantinpape! :)