
Efficient whole slide imaging IO #856

Open
Mr-Milk opened this issue Jan 31, 2025 · 1 comment

Mr-Milk commented Jan 31, 2025

Whole slide image (WSI) data plays a significant role in the digital pathology field. However, integrating WSI into SpatialData is quite challenging.

What makes WSI different:

  1. Large file size: WSI data is typically large on disk, ranging from roughly 300 MB to 2 GB per slide even with JPEG compression.
  2. Proprietary formats: Most vendors use proprietary formats rather than plain TIFF, let alone OME-TIFF. Many of these formats require dedicated readers such as OpenSlide or Bio-Formats.
  3. Read-only: In 99% of cases, users only need to read the WSI data and never modify it.

So far, there have been a few attempts to integrate WSI into SpatialData:

  1. DVP image readers lucas-diedrich/spatialdata-io#1
  2. SOPA's reader: https://github.com/gustaveroussy/sopa

The idea is to wrap OpenSlide behind xarray or a Zarr store to mimic the image interface in SpatialData. The issue is that this approach creates an unnecessary copy of the WSI data when the SpatialData object is serialized to disk. Without proper compression, this can lead to substantial disk usage. While this is a feasible solution for small datasets, such as spatial transcriptomics (ST) experiments with a few slides, it becomes impractical in digital pathology, which often deals with thousands of slides.
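The lazy-wrapping pattern can be sketched with dask alone. In the sketch below, `read_region` is a stub standing in for OpenSlide's `read_region` (which would decode only the requested region of the slide); the tile size and level shape are illustrative assumptions, not the actual code of wsidata or SOPA:

```python
# Sketch: expose one pyramid level of a WSI as a lazy dask array.
# `read_region` is a stub for openslide.OpenSlide.read_region so the
# example runs without a slide file; only requested chunks are "decoded".
import numpy as np
import dask.array as da

TILE = 256                         # hypothetical tile/chunk size
LEVEL_SHAPE = (1024, 1024, 3)      # (y, x, rgb) of the chosen level


def read_region(y, x, h, w):
    """Stand-in for the slide driver: return RGB pixels for a region."""
    return np.zeros((h, w, 3), dtype=np.uint8)


def _load_tile(block_info=None):
    # block_info tells us which pixel range this chunk covers, so we
    # only read that region from the slide.
    (y0, y1), (x0, x1), _ = block_info[None]["array-location"]
    return read_region(y0, x0, y1 - y0, x1 - x0)


# Build the array purely from a function + chunk layout: nothing is
# read until a region is sliced and computed.
lazy_level = da.map_blocks(
    _load_tile,
    chunks=((TILE,) * 4, (TILE,) * 4, (3,)),
    dtype=np.uint8,
)

patch = lazy_level[:256, :256].compute()  # reads a single tile
```

The copy problem described above appears at serialization time: writing `lazy_level` into the object's Zarr store materializes every tile, duplicating the slide on disk.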

I currently have a solution, rendeirolab/wsidata, that extends SpatialData with WSI readers. The wsidata object holds a reader with extra APIs to access WSI images, but it does not mount the image into the images slot of SpatialData as the previous solutions do. This way we avoid unnecessary data copies during serialization. The main drawback is that the solution does not comply with the scverse ecosystem for anything related to images.

Another potential solution is to create soft links for the WSI image files on disk with SpatialData so that when a user saves a SpatialData object, we do not have to copy the WSI data.
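The soft-link idea can be illustrated with the standard library alone. The directory layout (`sdata.zarr/images/`) and file names below are hypothetical, chosen only to show that the link references the original file instead of copying it:

```python
# Sketch of the soft-link idea: place a symlink to the WSI next to the
# serialized object instead of copying the pixel data into the store.
import os
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)

    wsi = root / "slide.svs"                 # stands in for a large WSI
    wsi.write_bytes(b"fake slide bytes")

    images_dir = root / "sdata.zarr" / "images"   # hypothetical layout
    images_dir.mkdir(parents=True)

    link = images_dir / "slide.svs"
    link.symlink_to(wsi)                     # a reference, not a copy

    # The link resolves to the original file and occupies no extra space.
    linked_ok = os.path.islink(link) and link.resolve() == wsi.resolve()
    same_bytes = link.read_bytes() == wsi.read_bytes()
```

One caveat of this approach is portability: symlinks break when the object is moved to another machine, and they behave differently on some platforms (e.g. Windows requires elevated privileges).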

Hi @LucaMarconato, I discussed this with you a few months ago at the scverse conference. Hope we can find a graceful solution soon!


LucaMarconato commented Feb 3, 2025

Hi @Mr-Milk, thanks for sharing! I checked the code (https://github.com/rendeirolab/wsidata) and here are some initial thoughts. I'll start with a recap of my understanding of the issues you encountered and that your solution aims to address.

Your cases

You are interested in a way to deal with WSI images in the case in which you have:
1. many large images (thousands);
2. where read-only access is sufficient (in particular for visualization and for computing image tiles efficiently);
3. in a format that is not OME-Zarr, and where one wants to avoid saving the data to OME-Zarr.

SpatialData cases

The cases above differ slightly from what we encounter in our framework: while we also sometimes face case 1 (several hundreds of images as part of large atlases/collections of spatial omics datasets) and sometimes case 2 (for instance when an OME-Zarr image should be accessed from the web in a read-only way, #831), most of the time we do not fall into case 3. In fact, we assume that converting the data to OME-Zarr is almost always an option for the user.

I said that most of the time we do not fall into case 3 because we do have some cases in which we read small images into memory, such as small .jpg files, using spatialdata.models.Image2DModel.parse(dask_image.imread(...), ...), and we never save them to disk. This is fine as long as the images are small: in these cases, reading the image data from OME-Zarr versus a .jpg makes no noticeable performance difference, so all the APIs (napari-spatialdata, spatialdata-plot, ...) can be used without overhead. Note that this case differs from yours because your images are large.

Another case is when we rasterize large collections of vector locations arranged in a grid-like pattern into raster data. This is what we do for Visium HD data with the rasterize_bins() API (#578). Here, no image data is present on disk: we construct the image lazily in memory, and we never write it to disk because the resulting image would be too large.
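The grid-rasterization idea can be sketched in a few lines of numpy: because each bin occupies one cell of a regular grid, its (row, column) coordinates map directly to one pixel and no interpolation is needed. The toy indices and values below are illustrative, not the rasterize_bins() implementation:

```python
# Toy rasterization of grid-arranged bins into a dense image.
import numpy as np

# Each bin has grid coordinates and a measured value (e.g. a gene count).
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
values = np.array([5.0, 1.0, 3.0, 7.0])

raster = np.zeros((3, 3))
raster[rows, cols] = values  # each bin lands on exactly one pixel
```

In the real API this assignment happens lazily per gene/channel, which is why the full image never needs to exist on disk.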

What happens if we read large non-OME-Zarr images in SpatialData

In spatialdata we do not deal with the case in which large non-OME-Zarr images are read lazily without then being converted to OME-Zarr, because we want to build on top of the interoperability and performance of OME-Zarr. Having a SpatialData object built lazily from non-OME-Zarr files is still possible (this is what one gets by default, for instance, when running sdata = visium_hd('/path/to/data')), but our recommendation is to immediately save such a SpatialData object to Zarr and then read it again (see "Problem: I cannot visualize the data, everything is slow" in the spatialdata-io readme).

Your solution

Considering the above, I think it is great that your solution covers this missing case! It is a case that spatialdata does not intend to cover anytime soon (if ever?), and I personally believe that if you have a concrete solution to a concrete problem today, that is the best short-term outcome for the users.

Longer term

Regarding your points above:

The main drawback of this solution is that it does not comply with the scverse ecosystem when it encounters anything related to images.

I think one can implement a workaround to be compatible with the scverse in-memory representation of images (i.e., the output of Image2DModel.parse(), together with coordinate transformations, etc.). For instance, this could be achieved by implementing converters that adjust the metadata in memory while still pointing to the image data lazily.
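Such a converter might look like the following sketch. `to_scverse_image` and its metadata keys are made up for illustration and are not a real spatialdata or wsidata API; the point is that the converter only creates metadata, while the pixel data stays lazy and uncopied:

```python
# Hypothetical converter: wrap an existing lazy array in an in-memory
# structure carrying the axes and transformation metadata scverse tools
# expect, without touching the pixel data.
import numpy as np
import dask.array as da


def to_scverse_image(lazy_pixels, scale_factor):
    """Attach dims and a coordinate transformation; pixels stay lazy."""
    return {
        "data": lazy_pixels,  # the very same dask array: no copy made
        "dims": ("c", "y", "x"),
        "transform": {
            "type": "scale",
            "scale": [1.0, scale_factor, scale_factor],
        },
    }


# A lazy placeholder standing in for a WSI level held by a reader.
lazy = da.zeros((3, 4096, 4096), chunks=(3, 256, 256), dtype=np.uint8)
img = to_scverse_image(lazy, scale_factor=0.5)
```

A real implementation would return the model produced by Image2DModel.parse() instead of a dict, but the shape of the idea is the same: metadata in memory, data left in place.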

I am more concerned about compatibility with the OME-NGFF ecosystem. There is a growing community effort to build tooling in different programming languages around OME-NGFF, and in the long term I think that duplicating the data on disk by converting it to OME-Zarr would pay off compared to maintaining tools that operate on data in several different formats. For instance, this webpage shows that more than 500 TB of OME-Zarr v2 data have recently been converted to OME-Zarr v3. This was a huge effort, but adhering to an open standard like OME-NGFF on disk will very likely pay off in the long term.

Final considerations, and future developments in spatialdata

I therefore think it is great that you can deliver a solution to a pressing problem for WSI, and I am sure the community will appreciate it. I would keep the considerations above about on-disk storage in mind for the longer term, but in the short term I am happy to help if you have any questions about scverse in-memory compatibility.

In this regard, in spatialdata we are considering an approach similar to the one you outlined here:

Another potential solution is to create soft links for the WSI image files on disk with SpatialData so that when a user saves a SpatialData object, we do not have to copy the WSI data.

This would help for the use case in which one has a local object and some large remote images in a remote S3 bucket. If we implement something like this, it could be useful also for your case.
