Am I misunderstanding how masked training works? #347
---
Hi all, I'm tinkering with masked LoRA training, with 0 unmasked probability and 0 unmasked weight. In theory, I would expect it to ONLY learn the concepts within the mask (for example, a single masked person in a group shot). However, I'm seeing what looks like a lot of bleed-through.

My training data consists of images of a person, using a combination of solo and group shots. The individual person is masked each time, and the captions/tags are based purely on the concepts within the mask (I have a ChatGPT Python script that extracts the masked region, runs it against the model, and produces captions; might be something you want to add in future). Even though only the single person is masked, and all captions are solo-based, training still appears to pick up on multiple people very quickly and runs with that all the way through, showing up in every epoch sample (even when asking for "solo" in the positive prompt for samples, etc.).

I'm using AdamW_8Bit with a learning rate of 0.00001 and 119 training images, all captioned as a solo person and masked only on the individual. With that learning rate, from epoch 3 onwards I tend to get group-shot samples.

Am I doing this right? Is there something special that needs to be set, or is there an actual bleed-through effect happening here, where 0 outside the mask isn't truly 0? TIA!
Replies: 1 comment
---
Masked training doesn't work the way you think it does. There are two issues here; the first is likely less important than the second for your use case.

(1) Exact masks in pixel space aren't exact in latent space. Remember that both the image and the mask have to be transformed through the VAE.

(2) When you mask an area and set 0 unmasked weight, you aren't saying "only train on what's inside this mask"; you are telling the model "ONLY what appears in this masked section matters". That's a subtle but extremely important difference. It means the model can do ANYTHING outside of that section and will not be penalized for being wrong there. So if it learned from other images (with differently placed masks) that there are people out there, it might settle on generating people. It might also start dropping phantom limbs, bodies, or other things outside the mask. After all, it knows it has to generate limbs, and the overall loss is lower when it generates limbs, so it throws in a few extra for good measure, since there's no loss penalty to tell it not to.

You're treating masks like crops, but they aren't crops. If you want to train on only a specific part of the image, you need to crop.
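Both points can be demonstrated in a few lines. This is a minimal NumPy sketch, not the trainer's actual code: the 8x area-average downsample stands in for however the trainer resizes the mask to latent resolution, and `masked_mse` is a generic weighted MSE with the same "unmasked weight" semantics described above.

```python
import numpy as np

# (1) A pixel-exact mask stops being exact at latent resolution.
# The VAE downsamples 8x; area-averaging the mask by the same factor
# (a stand-in for the trainer's actual resize) leaves fractional
# weights along the mask edge instead of a hard 0/1 boundary.
pixel_mask = np.zeros((64, 64))
pixel_mask[12:36, 12:36] = 1.0                       # sharp square in pixel space
latent_mask = pixel_mask.reshape(8, 8, 8, 8).mean(axis=(1, 3))  # 8x8 "latent" mask
assert ((latent_mask > 0) & (latent_mask < 1)).any()  # edge cells are fractional

# (2) With unmasked weight 0, the loss ignores everything outside the
# mask entirely, so the model is never penalized for what it puts there.
def masked_mse(pred, target, mask, unmasked_weight=0.0):
    weight = mask + (1.0 - mask) * unmasked_weight
    return float((weight * (pred - target) ** 2).sum() / weight.sum())

rng = np.random.default_rng(0)
pred = rng.standard_normal((8, 8))
target = rng.standard_normal((8, 8))
mask = np.zeros((8, 8))
mask[2:5, 2:5] = 1.0

base = masked_mse(pred, target, mask)
pred_wild = pred.copy()
pred_wild[6:, 6:] += 100.0                            # garbage outside the mask
assert masked_mse(pred_wild, target, mask) == base    # loss doesn't change
```

The second assertion is the whole story: wildly wrong predictions outside the mask produce exactly the same loss, so "extra people" outside the masked region are never penalized.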