Description
To optimize zero-shot performance, we are taking our MLM models through LM adaptation (see #5). For now, we are considering doing this for ~10% of the pre-training steps (around ~3GT). This matches the T5 setup, but the choice is fairly arbitrary.

Ideally, we should explore what the optimal ratio of MLM to (C)LM training is: 5%, 10%, 20%, 40%? For a fixed number of tokens (~30GT), we should plot end-task performance at different ratios of MLM to CLM training. That will give us an idea of the optimum, if there is one.

Note that this is a nice-to-have that we should only pursue if we have enough compute budget/bandwidth.
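To make the proposed sweep concrete, here is a minimal sketch, assuming a fixed ~30GT total budget and hypothetical `train_mlm` / `adapt_clm` / `evaluate_zero_shot` helpers standing in for the actual training and eval harness (none of these are defined in this repo; only the budget-splitting arithmetic runs as-is):

```python
# Sketch of the MLM -> CLM ratio sweep: for a fixed token budget,
# vary how many tokens go to CLM adaptation vs. MLM pre-training.
TOTAL_TOKENS = 30e9                    # ~30GT fixed pre-training budget (assumption)
CLM_RATIOS = [0.05, 0.10, 0.20, 0.40]  # candidate CLM adaptation fractions

def token_split(total_tokens: float, clm_ratio: float) -> tuple[float, float]:
    """Split the fixed token budget between MLM pre-training and CLM adaptation."""
    clm_tokens = total_tokens * clm_ratio
    mlm_tokens = total_tokens - clm_tokens
    return mlm_tokens, clm_tokens

results = {}
for ratio in CLM_RATIOS:
    mlm_tokens, clm_tokens = token_split(TOTAL_TOKENS, ratio)
    print(f"CLM ratio {ratio:>4.0%}: MLM {mlm_tokens/1e9:.1f}GT, CLM {clm_tokens/1e9:.1f}GT")
    # Placeholder calls for the actual runs (hypothetical helpers, not in this repo):
    # model = train_mlm(mlm_tokens)
    # model = adapt_clm(model, clm_tokens)
    # results[ratio] = evaluate_zero_shot(model)

# Plot results[ratio] vs. ratio to see whether there is an optimum.
```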