`rouge_score` with `accumulate='best'` gives mixed results #2148

volksen · 2023-10-06T12:41:53Z

🐛 Bug

Hi,

when using `rouge_score` with `accumulate="best"`, the results depend on the order of the references. As I understand it, `accumulate="best"` should return the best F-score over all references, regardless of their order.

Minimal example:

from torchmetrics.functional.text import rouge_score

preds = "a b c"
references = ["a b c", "c b a"]
references_rev = ["c b a", "a b c"]

print(rouge_score(preds, references, accumulate='best'))
print(rouge_score(preds, references_rev, accumulate='best'))

gives different results:

{'rouge1_fmeasure': tensor(1.), 'rouge1_precision': tensor(1.), 'rouge1_recall': tensor(1.), 'rouge2_fmeasure': tensor(1.), 'rouge2_precision': tensor(1.), 'rouge2_recall': tensor(1.), 'rougeL_fmeasure': tensor(1.), 'rougeL_precision': tensor(1.), 'rougeL_recall': tensor(1.), 'rougeLsum_fmeasure': tensor(1.), 'rougeLsum_precision': tensor(1.), 'rougeLsum_recall': tensor(1.)}
{'rouge1_fmeasure': tensor(1.), 'rouge1_precision': tensor(1.), 'rouge1_recall': tensor(1.), 'rouge2_fmeasure': tensor(0.), 'rouge2_precision': tensor(0.), 'rouge2_recall': tensor(0.), 'rougeL_fmeasure': tensor(0.3333), 'rougeL_precision': tensor(0.3333), 'rougeL_recall': tensor(0.3333), 'rougeLsum_fmeasure': tensor(0.3333), 'rougeLsum_precision': tensor(0.3333), 'rougeLsum_recall': tensor(0.3333)}

Did I misread the documentation, or is this a bug? `accumulate='avg'` works as expected.
Maybe the bug is at https://github.com/Lightning-AI/torchmetrics/blob/v1.1.0/src/torchmetrics/functional/text/rouge.py#L378,
where there is a TODO comment.
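To make the expected behaviour concrete, here is a minimal pure-Python sketch. It is not the torchmetrics implementation: `toy_rouge_f`, `best_per_key`, and `best_by_first_key` are hypothetical names, and the scorer is a simplified whitespace-token ROUGE that reports only the F-measure. `best_per_key` takes the maximum for each metric independently, so it cannot depend on reference order. `best_by_first_key` shows one way order dependence could arise (an assumption, not something confirmed in this report): choosing a single "best" reference by the first metric only, which resolves ties by position.

```python
from collections import Counter
from typing import Dict, List, Sequence


def _f1(matches: int, pred_len: int, ref_len: int) -> float:
    """F-measure from a match count; 0.0 when nothing matches."""
    if matches == 0 or pred_len == 0 or ref_len == 0:
        return 0.0
    precision, recall = matches / pred_len, matches / ref_len
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _lcs_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def toy_rouge_f(pred: str, ref: str) -> Dict[str, float]:
    """Simplified ROUGE-1/2/L F-measures on whitespace tokens (illustration only)."""
    p, r = pred.split(), ref.split()
    scores = {}
    for n in (1, 2):
        pn, rn = _ngrams(p, n), _ngrams(r, n)
        overlap = sum((pn & rn).values())  # clipped n-gram matches
        scores[f"rouge{n}"] = _f1(overlap, sum(pn.values()), sum(rn.values()))
    scores["rougeL"] = _f1(_lcs_len(p, r), len(p), len(r))
    return scores


def best_per_key(pred: str, refs: List[str]) -> Dict[str, float]:
    """Order-independent: take the best score for every metric separately."""
    per_ref = [toy_rouge_f(pred, ref) for ref in refs]
    return {key: max(s[key] for s in per_ref) for key in per_ref[0]}


def best_by_first_key(pred: str, refs: List[str]) -> Dict[str, float]:
    """Order-dependent (hypothetical failure mode): pick one reference by the
    first metric only; ties go to whichever reference comes first."""
    per_ref = [toy_rouge_f(pred, ref) for ref in refs]
    first_key = next(iter(per_ref[0]))
    idx = max(range(len(per_ref)), key=lambda i: per_ref[i][first_key])
    return per_ref[idx]


if __name__ == "__main__":
    pred = "a b c"
    for refs in (["a b c", "c b a"], ["c b a", "a b c"]):
        print(best_per_key(pred, refs), best_by_first_key(pred, refs))
```

With the two reference orders from the minimal example, `best_per_key` returns the same scores both times, while `best_by_first_key` changes with the order because both references tie on `rouge1`.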

I compared the results to the rouge-score package:

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=False)
preds = "a b c"
references = ["a b c", "c b a"]
references_rev = ["c b a", "a b c"]
print(scorer.score_multi(references, preds))
print(scorer.score_multi(references_rev, preds))

which gives the same results in both cases:

{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rouge2': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}
{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rouge2': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}
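The comparison above is effectively an order-invariance check on two orderings. A small sketch of how it could be generalized to every permutation of the references follows. `is_order_invariant` is a hypothetical helper, and `max_overlap` and `first_ref_overlap` are stand-in scorers invented for illustration, not part of torchmetrics or rouge-score.

```python
from itertools import permutations
from typing import Callable, Dict, Sequence

# Any callable mapping (prediction, references) to a dict of float scores.
ScoreFn = Callable[[str, Sequence[str]], Dict[str, float]]


def is_order_invariant(
    score_fn: ScoreFn, pred: str, refs: Sequence[str], tol: float = 1e-6
) -> bool:
    """Return True if score_fn gives the same scores for every ordering of refs."""
    baseline = score_fn(pred, list(refs))
    for perm in permutations(refs):
        scores = score_fn(pred, list(perm))
        if scores.keys() != baseline.keys():
            return False
        if any(abs(scores[k] - baseline[k]) > tol for k in baseline):
            return False
    return True


def _overlap(pred: str, ref: str) -> float:
    """Fraction of prediction tokens that also occur in the reference."""
    ref_tokens = set(ref.split())
    pred_tokens = pred.split()
    return sum(t in ref_tokens for t in pred_tokens) / len(pred_tokens)


def max_overlap(pred: str, refs: Sequence[str]) -> Dict[str, float]:
    """Stand-in scorer: best overlap over all references (order-independent)."""
    return {"overlap": max(_overlap(pred, r) for r in refs)}


def first_ref_overlap(pred: str, refs: Sequence[str]) -> Dict[str, float]:
    """Stand-in scorer: only looks at the first reference (order-dependent)."""
    return {"overlap": _overlap(pred, refs[0])}


if __name__ == "__main__":
    refs = ["a b c", "a x y"]
    print(is_order_invariant(max_overlap, "a b c", refs))
    print(is_order_invariant(first_ref_overlap, "a b c", refs))
```

To apply the same check to `rouge_score`, one could pass a thin wrapper that calls `rouge_score(pred, refs, accumulate='best')` and converts each returned tensor to a float. The number of permutations grows factorially, so this is only practical for a handful of references.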

Environment

TorchMetrics Version: 1.1.2
Python 3.10.12
torch 2.0.1

github-actions · 2023-10-06T12:42:29Z

Hi! Thanks for your contribution! Great first issue!

stancld · 2023-10-13T14:20:40Z

Thanks for the report! Gonna check this weekend.

Borda · 2024-02-02T18:34:16Z

> Thanks for the report! Gonna check this weekend.

@stancld, did you have a chance to have a look at it? 🐰

rittik9 · 2024-09-10T06:10:05Z

@Borda pls assign it to me

volksen added the `bug / fix` and `help wanted` labels Oct 6, 2023

Borda changed the title ~~rouge_score with accumulate='best' gives mixed results~~ `rouge_score` with `accumulate='best'` gives mixed results Oct 6, 2023

Borda added the `v1.1.x` and `topic: Text` labels Oct 6, 2023

Borda assigned stancld Oct 6, 2023

Borda unassigned stancld Aug 29, 2024

Borda added the `good first issue` label Aug 29, 2024

Borda assigned rittik9 Sep 10, 2024

Borda removed the `help wanted` label Sep 10, 2024

rittik9 added a commit to rittik9/torchmetrics that referenced this issue Nov 7, 2024

fix: rouge_score with accumulate='best' gives mixed results Lightning…

21948b2

…-AI#2148

rittik9 mentioned this issue Nov 7, 2024

Fix mixed results of rouge_score with accumulate='best' #2830

Merged

4 tasks

Borda closed this as completed in #2830 Nov 11, 2024
