What is the expected output of your example of a2cu.score? #1

Open
jmconroy opened this issue Feb 16, 2024 · 2 comments

jmconroy commented Feb 16, 2024

Thank you for sharing your code.
I have two points: the first is a minor "issue," and the second is mostly a request and question.
The issue is a typo in the README code: some commas are missing in your examples for A2CU and A3CU. For example:

recall_scores, prec_scores, f1_scores = a2cu.score(
    references=references,
    candidates=candidates,
    generation_batch_size=2,  # the batch size for ACU generation
    matching_batch_size=16,   # the batch size for ACU matching ## COMMA was missing
    output_path=None,         # the path to save the evaluation results ## COMMA was missing
    recall_only=False,        # whether to only compute the recall score ## COMMA was missing
    acu_path=None             # the path to save the generated ACUs
)

My question is regarding the output of this example, which gives precision=recall=f1 of 1/8, i.e.,
....
Recall score: 0.1250
Precision score: 0.1250
F1 score: 0.1250
Recall: 0.1250, Precision 0.1250, F1: 0.1250

The input in your example is:
candidates, references = ["This is a test"], ["This is a test"]

This result was surprising to me. Would we not expect the answer to be 1, or fairly close to it? So, with a debugger, I extracted the ACUs generated for the reference and found they were:

[['This is a test.',
  'The narrator is talking about something.',
  'The narrator is talking about something.',
  'The narrator is talking about something.',
  'The narrator is talking about something.',
  'The narrator is talking about something.',
  'The narrator is talking about something.',
  'The narrator is talking about something.']]

Do I have an installation problem, or is this the expected answer?
Could you please add the expected output to both of your examples?

jmconroy commented Apr 8, 2024

I have seen further evidence of this issue on TAC data, where the summaries are about 100 words long. The problem is that ACU generation can produce duplicate content units: the paper treats the collection of content units as a "set," but the code effectively works with a "bag," AKA "multiset." As a result, the a2cu score for comparing a summary against itself can be less than 1. The short example above is an extreme case, but on TAC 2010 I saw the a2cu score drop as low as 0.84 because of this issue.
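
To make the arithmetic concrete, here is a minimal sketch of the effect, assuming a simplified binary matcher; toy_match is a hypothetical stand-in for the model-based ACU matching that A2CU actually uses.

def toy_match(acu: str, text: str) -> bool:
    # toy matcher: an ACU "matches" only if it literally appears in the text
    return acu.rstrip(".").lower() in text.lower()

reference_text = "This is a test"
# one correct ACU plus the same hallucinated ACU repeated seven times,
# as in the debugger output above
generated_acus = ["This is a test."] + ["The narrator is talking about something."] * 7

# treating the ACUs as a bag/multiset: 1 match out of 8 ACUs -> 0.125
recall = sum(toy_match(a, reference_text) for a in generated_acus) / len(generated_acus)
print(f"Recall: {recall:.4f}")  # 0.1250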

yixinL7 commented Jul 3, 2024

Hi John,

Thank you for raising this issue, and sorry for the late response.

The output you received is indeed expected, and your observation is correct: it is partly because the ACU generation model can generate duplicate ACUs. We have implemented a simple deduplication fix for this issue:

_acus = list(set(x.split("|||")))

Unfortunately, this won't completely fix the problem because the ACU generation model can still generate "hallucinated" ACUs, which won't be matched back to the original text. For example, when candidates, references = ["This is a test"], ["This is a test"], the ACU generator still generates two ACUs after deduplication: ['This is a test.', 'The narrator is talking about something.']. Only the first one will be matched with the original text, resulting in a final ACU score of 0.5.
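
For illustration, applying that deduplication line to a raw output shaped like this example (assuming the generator joins its ACUs with "|||", as the snippet above implies):

# hypothetical raw generator output for the "This is a test" example
raw_output = "|||".join(
    ["This is a test."] + ["The narrator is talking about something."] * 7
)

_acus = list(set(raw_output.split("|||")))
print(_acus)  # two unique ACUs remain after deduplication
# only 'This is a test.' matches the reference text, so the score becomes
# 1/2 = 0.5 instead of 1/8 = 0.125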

We have updated the README with a warning about this limitation. Thank you again for bringing this issue to our attention!
