Thank you for maintaining such an important repository. I really enjoyed and learned a lot from reading your DPO paper.
I have one question about the SFT loss implementation in this repository. It appears that the SFT loss sums the cross-entropy loss within each sequence. However, as I understand it, the language modeling loss is conventionally averaged over all tokens in the batch (Ref: GPT2 Loss). This means the SFT loss here is computed differently from the standard cross-entropy loss in TRL's SFTTrainer. Why is SFT implemented this way?
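To make the two reductions concrete, here is a minimal PyTorch sketch of the difference I mean (not the repository's actual code); the tensor shapes and the `-100` masking convention for prompt/padding tokens are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, seq_len, vocab) logits and (batch, seq_len) labels,
# with positions to ignore (prompt tokens / padding) marked by -100.
logits = torch.randn(2, 8, 32)
labels = torch.randint(0, 32, (2, 8))
labels[:, :3] = -100  # e.g. mask out the prompt

# Per-token negative log-likelihood, shape (batch, seq_len); ignored positions get 0.
per_token_nll = F.cross_entropy(
    logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
)
mask = (labels != -100).float()

# (a) Sum over tokens within each sequence, then average over the batch
#     (the behavior I see in this repo's SFT loss):
loss_sum_per_seq = (per_token_nll * mask).sum(-1).mean()

# (b) Average over all non-masked tokens in the batch
#     (the conventional LM loss, as in GPT-2 / TRL's SFTTrainer):
loss_mean_per_token = (per_token_nll * mask).sum() / mask.sum()

print(loss_sum_per_seq, loss_mean_per_token)
```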
YJWon99 changed the title from "Why does SFTTrainer sum the cross-entropy loss within each sequence?" to "Why does SFT sum the cross-entropy loss within each sequence?" on Feb 17, 2024.