Pipeline Dreams and Roadmap
Make all datasets in the lab accessible, visible, easily computable, hackable, and reconstructable by anyone in the lab.
The neuroglancer pipeline was born of necessity. We needed a way to ingest datasets in hdf5 and other formats lying around the lab. We first developed the scripts ingest.py and digest.py. The former would upload chunks to Google Cloud Storage and the latter would downsample them. We would run ingest.py from our laptop and many copies of digest.py on the titans, and it was good.
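To make the division of labor concrete, here is a minimal sketch of the kind of averaging downsample digest.py performed on image chunks. The function name and the 2x2x1 factor are assumptions for illustration, not the actual code:

```python
import numpy as np

def downsample_2x2x1(chunk):
    """Average pool an (X, Y, Z) image chunk by 2x in X and Y.

    Assumes X and Y are even. A segmentation layer would need mode
    pooling instead of averaging, which this sketch omits.
    """
    x, y, z = chunk.shape
    return chunk.reshape(x // 2, 2, y // 2, 2, z).mean(axis=(1, 3)).astype(chunk.dtype)
```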
Within a week or two, we wanted to add meshing to digest.py, but we needed a 1 pixel overlap for marching cubes. The scripts became really ugly as they juggled awkward rectangular chunks of the dataset. This necessitated the creation of GCloudVolume, and later Precomputed, to allow arbitrary cutouts of a dataset. Performing meshing using a cutout simplified the code significantly.
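A sketch of the cutout pattern the meshing code relies on, written against the modern CloudVolume API that grew out of GCloudVolume (the bucket path is a placeholder):

```python
from cloudvolume import CloudVolume

vol = CloudVolume('gs://bucket/dataset/segmentation')  # placeholder path

# Grab a task's chunk plus a 1 voxel overlap on each high edge so that
# marching cubes produces meshes that line up across chunk boundaries.
# The cutout can have arbitrary bounds; it need not align to the
# underlying storage chunks.
cutout = vol[0:513, 0:513, 0:65]
```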
[Figure: An early design of the pipeline drawn in the Secret Spot]
Eventually, we wanted to ingest s1 on the titans, but it took forever (our options were to process one 1TB hdf5 or many little hdf5s). We created a task system using the Google Cloud Task Pull Queue API, bundled digest.py into a Docker container, and used Kubernetes to run many copies on Google's infrastructure.
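The shape of the task system is simple. Here is a hypothetical sketch of the loop each digest.py container runs; `queue` and its `lease`/`delete` methods stand in for the pull queue client rather than reproducing its exact API:

```python
def worker_loop(queue, run_task):
    """Hypothetical sketch of a digest.py worker.

    A worker leases a task, runs it, and deletes it only on success,
    so a crashed pod's lease expires and the task reappears for
    another worker to pick up.
    """
    while True:
        task = queue.lease(lease_seconds=600)  # hypothetical wrapper method
        if task is None:
            break                              # queue drained; pod exits
        run_task(task)                         # e.g. downsample one chunk
        queue.delete(task)                     # acknowledge only on success
```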
After that, we were backstopping Watershed. We created Watershed tasks and added support for storing data and running computations on AWS/S3. We later learned that gsutil, our previously preferred way to download data from Google Storage, could also be used with S3, but it was either slow (-c mode) or didn't play nice with other Kubernetes pods (-m mode). So we created the multithreaded Storage class as an abstraction over Google Storage, Amazon S3, and local filesystems. It enables high speed uploads and downloads that play nice with other pods, is easy to write functional tests for, is easy to use from native Python code, and doesn't have to serialize to disk.
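Usage looks roughly like this; the put_file/get_file interface shown is how we'd sketch it from memory, so treat it as illustrative rather than exact documentation (the bucket path is a placeholder):

```python
from cloudvolume import Storage

# The same code works for gs://, s3://, and file:// paths.
with Storage('gs://bucket/dataset/layer') as stor:        # placeholder path
    stor.put_file('info', content=b'{"hello": "world"}')  # upload from memory
    info = stor.get_file('info')                          # download to memory
```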
We now have a dozen tasks, the most complex being Jonathan's Stage 3, which iteratively runs two convnets and acquires locks before modifying adjacent chunks' region graphs.
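The locking matters because two tasks may want to touch the same neighboring chunk at once. A hypothetical sketch of the idea; `lock_service` and its methods are stand-ins, not the actual Stage 3 implementation:

```python
def with_chunk_locks(lock_service, chunk_ids, fn):
    """Hypothetical sketch: lock every chunk whose region graph a task
    will touch, in a fixed global order so concurrent tasks can't deadlock."""
    acquired = []
    try:
        for cid in sorted(chunk_ids):   # fixed order prevents deadlock
            lock_service.acquire(cid)   # assumed to block until granted
            acquired.append(cid)
        return fn()                     # mutate the region graphs
    finally:
        for cid in reversed(acquired):
            lock_service.release(cid)
```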
- Any dataset in Google Storage or S3 can be visualized
- Can run arbitrary processes on arbitrary chunks of datasets
- Ingest: Import HDF5 into Neuroglancer format
- Downsample: Make large datasets easy to visualize zoomed out, and provide low resolution versions for computations that don't need full resolution
- Mesh: Create meshes for 3D visualization
- MeshManifest: Secondary processing of meshes to label which mesh fragments belong together
- Quantize: Visualize affinities by creating an 8-bit version of one channel (see the sketch after this list)
- Transfer: Convert a source dataset into a destination dataset, e.g. s3->gs, rechunking, changing encoding
- Watershed: Create agglomerations of affinities
- Remap: Remap watershed output using agglomeration results
- BigArray: Import BigArray data into Neuroglancer format
- HyperSquare: Import Hypersquare data into Neuroglancer format
- .. and more!
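To make the Quantize task concrete, here is a minimal sketch of the transformation it describes, assuming float affinities in [0, 1] (the function name is ours):

```python
import numpy as np

def quantize_affinity(aff):
    """Map one float affinity channel in [0, 1] onto uint8 for visualization."""
    return (np.clip(aff, 0.0, 1.0) * 255.0).astype(np.uint8)
```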
Splitting and merging of supervoxels works for small volumes. For large volumes we need to scale the graph server and the meshes. The graph server could be scaled by keeping a graph for every chunk, as was done for Jonathan's Stage 3 task.
Meshes might require the simultaneous use of two layers, one of the raw supervoxels and one of the merged ones. Tasks can be used for real time remeshing. Levels of detail for meshes, where the least detailed level is just a skeleton, are likely to be necessary as well.