-
Notifications
You must be signed in to change notification settings - Fork 48
Creating a Sharded Image from Scratch
A sharded image benefits from creating multiple orders of magnitude fewer files at the expense of ease of writing and efficiency of read access. Nonetheless, as datasets grow in size, the number of files generated by unsharded chunks can rise as high as billions of files for a petavoxel image. This becomes untenable for many filesystems that strain under the burden (or at the very least IT departments that rightly fear such images). Simply enumerating the files alone becomes a chore, let alone managing the metadata and replication.
The sharded format is much more difficult to write, but hopefully this tutorial will help you get started. If it were possible, we would provide an easy method for doing this, but dataset imports are so irregular that it's difficult to make assistive tools.
The sharded format works by packing many smaller chunk files into one big file that has a two level index to enable random access via HTTP Range requests or file seeks. The shards are typically several gigabytes and represent a large 3d region of your image. This means that if your data import consists of a stack of flat image files, you'll end up needing to read each image in multiple times, probably once per intersecting shard. If you are importing a dataset that is in a pre-chunked format like Zarr, then your task will be a bit easier since there will be vastly less redundant IO (possibly zero if the chunks are sized in a multiple of the destination chunk size).
In order to pick a size, you'll need to specify three parameters for a sharded format info
file. The shard_bits
, minishard_bits
, and preshift_bits
control how the shard container is constructed and how individual chunks are assigned to the shard. Fortunately, you won't need to think too hard as algorithms for picking these parameters and generating an info
file are embedded in igneous
. At the time I wrote them, they made sense in Igneous, but now that I'm writing this article, they should probably be moved to CloudVolume to make this process easier.
The below code should be filled out with your dataset's parameters. The biggest choice you likely have to make is the uncompressed_shard_bytesize
, which by default is set to 3.5 GB. Some reasonable guesses are provided for other aspects. The max_shard_index_bytes
controls the size of the first level index which is of fixed size for all shards. The max_labels_per_minishard
and max_minishard_index_bytes
provide constraints for the second level of the index. It's important to strike a good balance of index sizes so that the indices can be cached for performance while avoiding large initial downloads. The memory limits are soft limits, so the shard sizes on paper may be up to a factor of two larger than specified.
If compression is enabled, the shard should be much smaller than the uncompressed image in memory. Make sure to set the encoding to "raw" if your files are already fully compressed file such as "jpeg".
import igenous.task_creation.image
cv = CloudVolume(...)
cv.info[mip]["sharding"] = igneous.task_creation.image.create_sharded_image_info(
dataset_size: ShapeType, # e.g. (1000,1000,100)
chunk_size: ShapeType, # e.g. (128,128,64)
encoding: str, # 'raw' or 'gzip'
dtype: Any, # e.g. np.uint8, 'uint8', 8
uncompressed_shard_bytesize: int = MEMORY_TARGET,
max_shard_index_bytes: int = 8192, # 2^13
max_minishard_index_bytes: int = 40000,
max_labels_per_minishard: int = 4000
)
cv.commit_info()
You can then compute the corresponding shard shape:
import igneous.shards
cv = CloudVolume(...)
spec_dict = cv.scale["sharding"]
shape = igneous.shards.image_shard_shape_from_spec(spec_dict, cv.volume_size, cv.chunk_size)
Once you know what shape you need, you'll need to read in an image block the size of your shape. The RAM required should be within a factor of 2 of the size you specified for the uncompressed_shard_bytesize
. Create this image as a 3d numpy array (you might need to add a fourth channel of dimension 1).
Once the array is assembled, feed it into CloudVolume like so:
from cloudfiles import CloudFiles
img = ....
cv = CloudVolume(..., mip=mip) # presumably
(filename, shard) = dst_vol.image.make_shard(
img, dst_bbox, mip, progress=False
)
del img
basepath = cv.meta.join(cv.cloudpath, cv.key)
CloudFiles(basepath).put(filename, shard)
Now you should be able to view that chunk in Neuroglancer.