Pipeline Dreams and Roadmap
Make all datasets in the lab accessible, visible, easily computable, hackable, and reconstructable by anyone in the lab.
The neuroglancer pipeline was born of necessity. We needed a way to ingest datasets in hdf5 and other formats lying around the lab. We first developed the scripts ingest.py and digest.py. The former would upload chunks to Google Cloud Storage and the latter would downsample them. We would run ingest.py from our laptop and many copies of digest.py on the titans, and it was good.
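To make the division of labor concrete, here is a minimal sketch of the kind of averaging downsample digest.py performed on image chunks. The function name and the 2x2x1 factor are assumptions for illustration, not the actual code:

```python
import numpy as np

def downsample_2x2x1(chunk):
    """Average pool an (X, Y, Z) image chunk by 2x in X and Y.

    Assumes X and Y are even. A segmentation layer would need mode
    pooling instead of averaging, which this sketch omits.
    """
    x, y, z = chunk.shape
    return chunk.reshape(x // 2, 2, y // 2, 2, z).mean(axis=(1, 3)).astype(chunk.dtype)
```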
Within a week or two, we wanted to add meshing to digest.py, but we needed a 1 pixel overlap for marching cubes. The scripts became really ugly as they juggled awkward rectangular chunks of the dataset. This necessitated the creation of GCloudVolume, and later Precomputed, to allow arbitrary cutouts of a dataset. Performing meshing using a cutout simplified the code significantly.
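A sketch of the cutout pattern the meshing code relies on, written against the modern CloudVolume API that grew out of GCloudVolume (the bucket path is a placeholder):

```python
from cloudvolume import CloudVolume

vol = CloudVolume('gs://bucket/dataset/segmentation')  # placeholder path

# Grab a task's chunk plus a 1 voxel overlap on each high edge so that
# marching cubes produces meshes that line up across chunk boundaries.
# The cutout can have arbitrary bounds; it need not align to the
# underlying storage chunks.
cutout = vol[0:513, 0:513, 0:65]
```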
[Figure: An early design of the pipeline drawn in the Secret Spot]
Eventually, we wanted to ingest s1 on the titans, but it took forever (our options were to process one 1TB hdf5 or many little hdf5s). We created a task system using the Google Cloud Task Pull Queue API, bundled digest.py into a Docker container, and used Kubernetes to run many copies on Google's infrastructure.
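The shape of the task system is simple. Here is a hypothetical sketch of the loop each digest.py container runs; `queue` and its `lease`/`delete` methods stand in for the pull queue client rather than reproducing its exact API:

```python
def worker_loop(queue, run_task):
    """Hypothetical sketch of a digest.py worker.

    A worker leases a task, runs it, and deletes it only on success,
    so a crashed pod's lease expires and the task reappears for
    another worker to pick up.
    """
    while True:
        task = queue.lease(lease_seconds=600)  # hypothetical wrapper method
        if task is None:
            break                              # queue drained; pod exits
        run_task(task)                         # e.g. downsample one chunk
        queue.delete(task)                     # acknowledge only on success
```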
After that, we were backstopping Watershed. We created Watershed tasks and added support for storing data and running computations on AWS/S3. We later learned that gsutil, our previously preferred way to download data from Google Storage, could also be used with S3, but it was either slow (-c mode) or didn't play nice with other Kubernetes pods (-m mode). So we created the multithreaded Storage class as an abstraction over Google Storage, Amazon S3, and local filesystems. It enables high speed uploads and downloads that play nice with other pods, is easy to write functional tests for, is easy to use from native Python code, and doesn't have to serialize to disk.
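Usage looks roughly like this; the put_file/get_file interface shown is how we'd sketch it from memory, so treat it as illustrative rather than exact documentation (the bucket path is a placeholder):

```python
from cloudvolume import Storage

# The same code works for gs://, s3://, and file:// paths.
with Storage('gs://bucket/dataset/layer') as stor:        # placeholder path
    stor.put_file('info', content=b'{"hello": "world"}')  # upload from memory
    info = stor.get_file('info')                          # download to memory
```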
We now have a dozen tasks, the most complex being Jonathan's Stage 3, which iteratively runs two convnets and acquires locks before modifying adjacent chunks' region graphs.
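The locking matters because two tasks may want to touch the same neighboring chunk at once. A hypothetical sketch of the idea; `lock_service` and its methods are stand-ins, not the actual Stage 3 implementation:

```python
def with_chunk_locks(lock_service, chunk_ids, fn):
    """Hypothetical sketch: lock every chunk whose region graph a task
    will touch, in a fixed global order so concurrent tasks can't deadlock."""
    acquired = []
    try:
        for cid in sorted(chunk_ids):   # fixed order prevents deadlock
            lock_service.acquire(cid)   # assumed to block until granted
            acquired.append(cid)
        return fn()                     # mutate the region graphs
    finally:
        for cid in reversed(acquired):
            lock_service.release(cid)
```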
- Any dataset in Google Storage or S3 can be visualized
- Can run arbitrary processes on arbitrary chunks of datasets
- Ingest: Import HDF5 into Neuroglancer format
- Downsample: Make large datasets easy to visualize zoomed out, and provide low resolution versions for computations that don't need full resolution
- Mesh: Create meshes for 3D visualization
- MeshManifest: Secondary processing of meshes to label which mesh fragments belong together
- Quantize: Visualize affinities by creating an 8-bit version of one channel (see the sketch after this list)
- Transfer: Convert a source dataset into a destination dataset, e.g. s3->gs, rechunking, changing encoding
- Watershed: Create agglomerations of affinities
- Remap: Remap watershed output using agglomeration results
- BigArray: Import BigArray data into Neuroglancer format
- HyperSquare: Import Hypersquare data into Neuroglancer format
- .. and more!
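To make the Quantize task concrete, here is a minimal sketch of the transformation it describes, assuming float affinities in [0, 1] (the function name is ours):

```python
import numpy as np

def quantize_affinity(aff):
    """Map one float affinity channel in [0, 1] onto uint8 for visualization."""
    return (np.clip(aff, 0.0, 1.0) * 255.0).astype(np.uint8)
```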
Splitting and merging of supervoxels works for small volumes. For large volumes we need to scale the graph server and the meshes. The graph server could be scaled by keeping a graph for every chunk, as was done for Jonathan's Stage 3 task.
Meshes might require the simultaneous use of two layers, one of the raw supervoxels and one of the merged ones. Tasks can be used for real time remeshing. Levels of detail for meshes, where the least detailed level is just a skeleton, are likely to be necessary as well.