Thank you for maintaining such an important repository. I really enjoyed and learned a lot from reading your DPO paper.
I have one question about the SFT loss implementation in this repository. It appears that the SFT loss sums the cross-entropy loss within each sequence. However, as I understand it, the language modeling loss is conventionally averaged over all tokens in the batch (Ref: GPT2 Loss). This means the SFT loss here is computed differently from the standard cross-entropy loss in TRL's SFTTrainer. Why is SFT implemented this way?
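To make the two reductions concrete, here is a minimal PyTorch sketch of the difference I mean (not the repository's actual code); the tensor shapes and the `-100` masking convention for prompt/padding tokens are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, seq_len, vocab) logits and (batch, seq_len) labels,
# with positions to ignore (prompt tokens / padding) marked by -100.
logits = torch.randn(2, 8, 32)
labels = torch.randint(0, 32, (2, 8))
labels[:, :3] = -100  # e.g. mask out the prompt

# Per-token negative log-likelihood, shape (batch, seq_len); ignored positions get 0.
per_token_nll = F.cross_entropy(
    logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
)
mask = (labels != -100).float()

# (a) Sum over tokens within each sequence, then average over the batch
#     (the behavior I see in this repo's SFT loss):
loss_sum_per_seq = (per_token_nll * mask).sum(-1).mean()

# (b) Average over all non-masked tokens in the batch
#     (the conventional LM loss, as in GPT-2 / TRL's SFTTrainer):
loss_mean_per_token = (per_token_nll * mask).sum() / mask.sum()

print(loss_sum_per_seq, loss_mean_per_token)
```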
YJWon99 changed the title from "Why does SFTTrainer sum the cross-entropy loss within each sequence?" to "Why does SFT sum the cross-entropy loss within each sequence?" on Feb 17, 2024.