
Creating a Sharded Image from Scratch


A sharded image creates multiple orders of magnitude fewer files at the expense of ease of writing and efficiency of random read access. This tradeoff becomes worthwhile as datasets grow: the number of files generated by unsharded chunks can rise into the billions for a petavoxel image. That is untenable for many filesystems, which strain under the burden (or at the very least for IT departments that rightly fear such images). Simply enumerating the files becomes a chore, let alone managing their metadata and replication.

The sharded format is much more difficult to write, but hopefully this tutorial will help you get started. If it were possible, we would provide an easy method for doing this, but dataset imports are so irregular that it's difficult to make assistive tools.

Picking a Shard Shape

The sharded format works by packing many smaller chunk files into one big file with a two-level index that enables random access via HTTP Range requests or file seeks. Shards are typically several gigabytes each and represent a large 3D region of your image. This means that if your data import consists of a stack of flat image files, you'll end up needing to read each image multiple times, probably once per intersecting shard. If you are importing a dataset that is in a pre-chunked format like Zarr, your task will be easier since there will be vastly less redundant IO (possibly none if the source chunks are sized in a multiple of the destination chunk size).
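
As a rough illustration of the read amplification (the numbers below are made up for the example), each flat z-slice must be read once per shard column it intersects in the XY plane:

import math

# Hypothetical example: a 100,000 x 100,000 pixel image stack
# imported into 2048 x 2048 x 2048 voxel shards.
slice_xy = (100_000, 100_000)
shard_shape = (2048, 2048, 2048)

# Each flat z-slice intersects every shard column in the XY plane,
# so it is read once per intersecting shard.
reads_per_slice = (
  math.ceil(slice_xy[0] / shard_shape[0])
  * math.ceil(slice_xy[1] / shard_shape[1])
)
print(reads_per_slice)  # 49 * 49 = 2401 reads of the same image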

In order to pick a shape, you'll need to specify three parameters in a sharded format info file. The shard_bits, minishard_bits, and preshift_bits parameters control how the shard container is constructed and how individual chunks are assigned to shards. Fortunately, you won't need to think too hard, as algorithms for picking these parameters and generating an info file are embedded in Igneous. At the time I wrote them, they made sense in Igneous, but now that I'm writing this article, they should probably be moved to CloudVolume to make this process easier.
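
For intuition, here is a minimal sketch (not Igneous code) of how the neuroglancer sharded format uses these three parameters to assign a chunk to a shard and minishard, assuming the 'identity' hash that is typical for images:

def assign_chunk(chunk_id, preshift_bits, minishard_bits, shard_bits):
  # chunk_id is the chunk's compressed morton code. With the
  # 'identity' hash, hashing is just the preshift.
  hashed = chunk_id >> preshift_bits
  # The low minishard_bits select the minishard (second level index);
  # the next shard_bits select which shard file the chunk lives in.
  minishard = hashed & ((1 << minishard_bits) - 1)
  shard = (hashed >> minishard_bits) & ((1 << shard_bits) - 1)
  return shard, minishard

Chunks that share a shard number are packed into the same shard file; the minishard number selects which second-level index inside that shard records the chunk's byte offsets.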

The code below should be filled out with your dataset's parameters. The biggest choice you'll likely have to make is uncompressed_shard_bytesize, which defaults to 3.5 GB. Reasonable guesses are provided for the other parameters. max_shard_index_bytes controls the size of the first level index, which is of fixed size for all shards. max_labels_per_minishard and max_minishard_index_bytes constrain the second level of the index. It's important to strike a good balance of index sizes so that the indices can be cached for performance while avoiding large initial downloads.

from cloudvolume import CloudVolume
import igneous.task_creation.image

cv = CloudVolume(...)
mip = 0 # which scale (mip level) to shard

cv.info["scales"][mip]["sharding"] = igneous.task_creation.image.create_sharded_image_info(
  dataset_size=cv.volume_size, # e.g. (1000, 1000, 100)
  chunk_size=cv.chunk_size, # e.g. (128, 128, 64)
  encoding=cv.scale["encoding"], # e.g. 'raw' or 'jpeg'
  dtype=cv.dtype, # e.g. np.uint8, 'uint8', or 8 (bits per voxel)
  uncompressed_shard_bytesize=int(3.5e9), # the default, 3.5 GB
  max_shard_index_bytes=8192, # 2^13
  max_minishard_index_bytes=40000,
  max_labels_per_minishard=4000,
)
cv.commit_info()
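
For reference, the generated sharding spec is an ordinary dict stored in the scale's metadata, following the neuroglancer uint64 sharded format. It will look roughly like this (the bit values are illustrative, not output from a real run):

{
  "@type": "neuroglancer_uint64_sharded_v1",
  "hash": "identity",
  "preshift_bits": 9,
  "minishard_bits": 6,
  "shard_bits": 9,
  "minishard_index_encoding": "gzip",
  "data_encoding": "gzip"
}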

You can then compute the corresponding shard shape:

from cloudvolume import CloudVolume
import igneous.shards

cv = CloudVolume(...)
spec_dict = cv.scale["sharding"] # the sharding spec for the current mip
shape = igneous.shards.image_shard_shape_from_spec(
  spec_dict, cv.volume_size, cv.chunk_size
)
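
From there, one way to proceed is to iterate over shard-aligned bounding boxes and upload each one. This is only a sketch: it assumes a CloudVolume version that supports shard-aligned writes, and load_region is a hypothetical function standing in for however you read your source data.

import numpy as np
from cloudvolume.lib import Bbox

# Walk the dataset in shard-shaped, shard-aligned blocks.
for x in range(0, cv.volume_size[0], shape[0]):
  for y in range(0, cv.volume_size[1], shape[1]):
    for z in range(0, cv.volume_size[2], shape[2]):
      bbox = Bbox(
        (x, y, z),
        np.minimum((x + shape[0], y + shape[1], z + shape[2]), cv.volume_size),
      )
      # load_region is hypothetical: read the source data for this region,
      # then write it so that each upload covers exactly one shard.
      cv[bbox.to_slices()] = load_region(bbox)

Writing one full shard per upload is what keeps the process tractable: each shard file is produced exactly once, and no partially written shards need to be revisited.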