Hi, very interesting work! Congrats!
I am particularly interested in the inpainting pipeline you report in this paper: MAM-E: Mammographic Synthetic Image Generation with Diffusion Models.
From what is reported, and from Figure 11, it seems that during training the example image containing the lesion is encoded with the VAE along with its masked counterpart, and the two latents are then stacked with a resized version of the mask to serve as input to the UNet. I have some doubts about the effect this design choice has at inference time.
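To make my reading concrete, here is a minimal sketch of how I understand the training input is assembled. All names, shapes, and the `vae_encode` stand-in are my own assumptions, not taken from your code; please correct me if I have misread the figure:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes and names -- not from the MAM-E code.
B, H, W = 1, 512, 512
lesion_image = torch.rand(B, 3, H, W)           # ground-truth mammogram with the lesion
mask = (torch.rand(B, 1, H, W) > 0.95).float()  # binary lesion mask (1 inside the lesion)

def vae_encode(img: torch.Tensor) -> torch.Tensor:
    """Stand-in for the frozen SD VAE encoder (4 latent channels, 8x downsampling)."""
    return torch.randn(img.shape[0], 4, img.shape[-2] // 8, img.shape[-1] // 8)

# In the actual training loop these first four channels would presumably be the
# noised version of this latent, but they are still derived from the lesion image.
image_latents = vae_encode(lesion_image)                 # (B, 4, H/8, W/8)
masked_latents = vae_encode(lesion_image * (1 - mask))   # lesion region blanked out
mask_latent = F.interpolate(mask, size=image_latents.shape[-2:])  # mask at latent resolution

# 4 + 1 + 4 = 9 input channels, as in the standard SD inpainting UNet
unet_input = torch.cat([image_latents, mask_latent, masked_latents], dim=1)
print(unet_input.shape)  # torch.Size([1, 9, 64, 64])
```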
If the actual ground-truth image we are aiming to generate is provided as an input to the UNet during training, then at inference time there will be a "domain shift" between what the UNet expects in its first channels and what it actually receives. The model has been trained to generate an image with a lesion from a lesion mask AND the image with the lesion itself. In contrast, at inference time you are providing an image that DOES NOT contain the lesion, plus a target lesion mask, and expecting to get back an image inpainted with a lesion.
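Continuing the hypothetical sketch above, the mismatch I am describing is that the first latent block is built from different distributions at train and test time:

```python
# Training: the first latent block is derived from the lesion image itself.
train_input = torch.cat(
    [vae_encode(lesion_image), mask_latent, vae_encode(lesion_image * (1 - mask))],
    dim=1,
)

# Inference: the first latent block now comes from a lesion-FREE image,
# a conditioning distribution the UNet never saw during training.
healthy_image = torch.rand(B, 3, H, W)  # hypothetical lesion-free input
infer_input = torch.cat(
    [vae_encode(healthy_image), mask_latent, vae_encode(healthy_image * (1 - mask))],
    dim=1,
)
```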
I wonder whether this is something you considered. I imagine that removing the actual lesion image from the first channels during training could improve performance. Did you experiment with this?
One additional question: Figure 11 shows that it is the SD UNet doing the inpainting. Is this UNet pretrained on plain text-to-mammography generation, or trained from scratch on the inpainting objective?
Best,
Pedro