Rouge metric #27

presto105 · 2021-12-06T17:29:23Z

Rouge

The Hugging Face rouge metric supports four ROUGE score types: rouge1, rouge2, rougeL, and rougeLsum.

ROUGE is a metric that measures the word overlap between the reference title and the sentence generated by the model;
here it refers to the F1-score of that overlap.

So what do rougeL and rougeLsum mean, beyond rouge-1/2?

rouge-N

Let's compute the ROUGE score on an example sentence pair.

prediction = '이것은 첫번째 예측 문장 이지요'
reference = '이것은 정답에 해당하는 문장 첫번째 입니다.'

For this example, the sentences are split on whitespace.

# reference_ngrams / prediction_ngrams: Counter-like mappings of n-gram -> count
intersection_ngrams_count = 0
for ngram in six.iterkeys(reference_ngrams):  # in Python 3: reference_ngrams.keys()
    intersection_ngrams_count += min(reference_ngrams[ngram], prediction_ngrams[ngram])  # count overlapping n-grams (true positives)
reference_ngrams_count = sum(reference_ngrams.values())  # number of n-grams in the reference
prediction_ngrams_count = sum(prediction_ngrams.values())  # number of n-grams in the prediction

precision = intersection_ngrams_count / max(prediction_ngrams_count, 1)
recall = intersection_ngrams_count / max(reference_ngrams_count, 1)
fmeasure = scoring.fmeasure(precision, recall)  # compute the F1 score
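The excerpt above assumes the two n-gram counters already exist. Here is a minimal, self-contained sketch of the same calculation on the example pair. The helper names `ngram_counts` and `rouge_n` are made up for illustration, tokenization is a plain whitespace split as in this example (not the library's tokenizer), and `Fraction` is used only so the results print as exact ratios.

```python
from collections import Counter
from fractions import Fraction


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, prediction, n):
    """Return (precision, recall, fmeasure) as exact fractions."""
    reference_ngrams = ngram_counts(reference.split(), n)
    prediction_ngrams = ngram_counts(prediction.split(), n)

    # Clipped overlap: an n-gram counts at most as often as it occurs on both sides.
    intersection = sum(
        min(count, prediction_ngrams[ngram])
        for ngram, count in reference_ngrams.items()
    )
    precision = Fraction(intersection, max(sum(prediction_ngrams.values()), 1))
    recall = Fraction(intersection, max(sum(reference_ngrams.values()), 1))
    if precision + recall == 0:
        return precision, recall, Fraction(0)
    return precision, recall, 2 * precision * recall / (precision + recall)


prediction = '이것은 첫번째 예측 문장 이지요'
reference = '이것은 정답에 해당하는 문장 첫번째 입니다.'

print(rouge_n(reference, prediction, 1))  # precision 3/5, recall 1/2, fmeasure 6/11
print(rouge_n(reference, prediction, 2))  # no shared bigram, so all three are 0
```

With n=2 the pair shares no bigram at all, which is why rouge2 can be much harsher than rouge1 on short titles.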

Expressed as a contingency table, the token counts look like this:

	Ref T	Ref F
Pred T	TP(3)	FP(2)
Pred F	FN(3)	TN(0)

TP: 첫번째, 이것은, 문장
FP: 예측, 이지요
FN: 정답에, 해당하는, 입니다
Precision: 3/5
Recall: 3/6
fmeasure: 6/11

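So in this example, ROUGE-1 is 6/11.
If the generated sentence is too long, precision drops (its denominator is the number of n-grams in the prediction), so the F-measure can be seen as penalizing length as well.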
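To make the TP/FP/FN split and the length effect concrete, here is a small sketch using `Counter` multiset operations. The helpers `contingency` and `precision_recall` are illustrative names, not part of any library, and the padded prediction is an artificial input made up for this demonstration.

```python
from collections import Counter

prediction = '이것은 첫번째 예측 문장 이지요'
reference = '이것은 정답에 해당하는 문장 첫번째 입니다.'


def contingency(reference, prediction):
    """Split unigram counts into TP / FP / FN multisets."""
    ref = Counter(reference.split())
    pred = Counter(prediction.split())
    tp = pred & ref  # tokens on both sides (minimum of the two counts)
    fp = pred - ref  # predicted tokens with no match left in the reference
    fn = ref - pred  # reference tokens the prediction missed
    return tp, fp, fn


def precision_recall(reference, prediction):
    """Precision and recall from the TP / FP / FN totals."""
    tp, fp, fn = contingency(reference, prediction)
    hits = sum(tp.values())
    return hits / (hits + sum(fp.values())), hits / (hits + sum(fn.values()))


print(precision_recall(reference, prediction))  # (0.6, 0.5)

# Pad the prediction with five unmatched tokens:
# recall stays the same, precision falls.
padded = prediction + ' 예측 예측 예측 예측 예측'
print(precision_recall(reference, padded))  # (0.3, 0.5)
```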

rouge-L

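The L in rougeL stands for Longest Common Subsequence (LCS). What is a longest common subsequence?
It is the longest sequence of words that appears in both sentences in the same order; the words do not have to be adjacent, and the sentences do not have to match exactly.
Looking at the earlier example again: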

prediction = '이것은 첫번째 예측 문장 이지요'
reference = '이것은 정답에 해당하는 문장 첫번째 입니다.'

Longest Common Subsequence: '이것은' + '문장'

The LCS is computed with a dynamic programming algorithm.

reference, prediction = reference.split(), prediction.split()  # token lists, split on whitespace
rows = len(reference)
cols = len(prediction)
lcs_table = [[0] * (cols + 1) for _ in range(rows + 1)]
for i in range(1, rows + 1):
    for j in range(1, cols + 1):
        if reference[i - 1] == prediction[j - 1]:
            lcs_table[i][j] = lcs_table[i - 1][j - 1] + 1
        else:
            lcs_table[i][j] = max(lcs_table[i - 1][j], lcs_table[i][j - 1])
lcs_table  # rows are reference tokens, columns are prediction tokens; the count goes up at the indices for '이것은' and '문장'

While walking through the rows and columns of the table, the count is incremented whenever a matching token is found.

[[0, 0, 0, 0, 0, 0],
[0, 1, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1],
[0, 1, 1, 1, 2, 2],
[0, 1, 2, 2, 2, 2],
[0, 1, 2, 2, 2, 2]]

That is, lcs_table[-1][-1] holds the LCS length.
Expressed as a contingency table again:

	Ref T	Ref F
Pred T	TP(2)	FP(3)
Pred F	FN(4)	TN(0)

TP: 이것은, 문장
FP: 첫번째, 예측, 이지요
FN: 정답에, 해당하는, 첫번째, 입니다
Precision: 2/5
Recall: 2/6
fmeasure: 4/11
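As a cross-check, here is a sketch that builds the LCS table, walks back through it to recover the matched tokens, and turns the LCS length into precision, recall, and F-measure. The function names are made up for this post. The tie-breaking rule in `backtrace` is just one reasonable choice; a different rule could return '이것은' + '첫번째', which has the same length.

```python
from fractions import Fraction


def lcs_table(ref, pred):
    """Dynamic-programming table of LCS lengths for two token lists."""
    table = [[0] * (len(pred) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(pred) + 1):
            if ref[i - 1] == pred[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def backtrace(table, ref, pred):
    """Walk back from the bottom-right cell and collect one LCS as tokens."""
    i, j, tokens = len(ref), len(pred), []
    while i > 0 and j > 0:
        if ref[i - 1] == pred[j - 1]:
            tokens.append(ref[i - 1])
            i, j = i - 1, j - 1
        elif table[i][j - 1] > table[i - 1][j]:
            j -= 1
        else:
            i -= 1
    return tokens[::-1]


def rouge_l(reference, prediction):
    """Return (lcs_tokens, precision, recall, fmeasure)."""
    ref, pred = reference.split(), prediction.split()
    table = lcs_table(ref, pred)
    lcs_length = table[-1][-1]
    precision = Fraction(lcs_length, max(len(pred), 1))
    recall = Fraction(lcs_length, max(len(ref), 1))
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else Fraction(0)
    return backtrace(table, ref, pred), precision, recall, fmeasure


prediction = '이것은 첫번째 예측 문장 이지요'
reference = '이것은 정답에 해당하는 문장 첫번째 입니다.'

print(rouge_l(reference, prediction))
# LCS tokens ['이것은', '문장'], precision 2/5, recall 1/3, fmeasure 4/11
```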

rouge-Lsum

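Finally, rougeLsum. It also uses the LCS, and it is the value to use when comparing texts that contain several sentences separated by '\n'.
The text is split on '\n', a nested loop over the reference and prediction sentences finds all overlapping words, and duplicate matches are removed.

Working through a new example: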

prediction = '이것은 첫번째 예측 문장 이지요.\n 이것은 두번째 예측'
reference = '이것은 정답에 해당하는 문장 첫번째 입니다.\n 이것은 두번째 정답'

Longest Common Subsequence: '이것은' + '문장' + '이것은' + '두번째'

	Ref T	Ref F
Pred T	TP(4)	FP(4)
Pred F	FN(5)	TN(0)

Precision: 4/8
Recall: 4/9
fmeasure: 8/17

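The function that computes this, _summary_level_lcs, chains several helper functions together, so the code looks complicated.
Following it step by step:

Nested for loop over the split reference/prediction sentences
Build the LCS table with dynamic programming
Back trace through the LCS table to build the index list of the words in the LCS
Remove LCS words that were matched more than once
Compute the RougeLCS score

rougeLsum is computed through the steps above.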
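The five steps can be sketched in a few dozen lines. This is my own simplified reading of the procedure, not the library's code: `lcs_indices` and `rouge_lsum` are hypothetical names, sentences are split on '\n' and tokens on whitespace, and the tie-breaking in the back trace is an assumption.

```python
from collections import Counter
from fractions import Fraction


def lcs_indices(ref, pred):
    """Indices into `ref` of one longest common subsequence with `pred`."""
    table = [[0] * (len(pred) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(pred) + 1):
            if ref[i - 1] == pred[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    i, j, indices = len(ref), len(pred), []
    while i > 0 and j > 0:  # back trace from the bottom-right cell
        if ref[i - 1] == pred[j - 1]:
            indices.append(i - 1)
            i, j = i - 1, j - 1
        elif table[i][j - 1] > table[i - 1][j]:
            j -= 1
        else:
            i -= 1
    return indices


def rouge_lsum(reference, prediction):
    """Return (precision, recall, fmeasure) for newline-separated texts."""
    ref_sents = [s.split() for s in reference.split('\n')]
    pred_sents = [s.split() for s in prediction.split('\n')]
    ref_left = Counter(t for s in ref_sents for t in s)
    pred_left = Counter(t for s in pred_sents for t in s)
    hits = 0
    for ref in ref_sents:
        # Union of the LCS positions against every prediction sentence.
        union = set()
        for pred in pred_sents:
            union.update(lcs_indices(ref, pred))
        for idx in sorted(union):
            token = ref[idx]
            # Count a token only while both sides still have unused copies,
            # so duplicate matches are not counted twice.
            if ref_left[token] > 0 and pred_left[token] > 0:
                hits += 1
                ref_left[token] -= 1
                pred_left[token] -= 1
    precision = Fraction(hits, max(sum(len(s) for s in pred_sents), 1))
    recall = Fraction(hits, max(sum(len(s) for s in ref_sents), 1))
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else Fraction(0)
    return precision, recall, fmeasure


prediction = '이것은 첫번째 예측 문장 이지요.\n 이것은 두번째 예측'
reference = '이것은 정답에 해당하는 문장 첫번째 입니다.\n 이것은 두번째 정답'

print(rouge_lsum(reference, prediction))  # precision 1/2, recall 4/9, fmeasure 8/17
```

The two remaining-count `Counter`s are what keep '이것은' from being credited more often than it actually appears on both sides.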

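Closing notes

I think Rouge-L, which takes the word order within the sentence into account, is a better fit for our metric this time than Rouge-N.
When a tokenizer is applied, ROUGE seems to have a weakness caused by particles, endings, and the like.
POS tagging might help with this issue (though it would probably make eval take much longer).
Building a stopword dictionary and applying it as a filter could be another option!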
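A rough sketch of what the stopword filter idea could look like. The stopword set below is hypothetical and picked only to fit the earlier example (two sentence endings); a real list would have to be built and validated separately. `rouge1_f` and `drop_stopwords` are illustrative names.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical stopword list, chosen only for this example (sentence endings).
STOPWORDS = {'이지요', '입니다.'}


def rouge1_f(reference_tokens, prediction_tokens):
    """Unigram F1, using F1 = 2 * overlap / (|reference| + |prediction|)."""
    ref, pred = Counter(reference_tokens), Counter(prediction_tokens)
    overlap = sum((ref & pred).values())
    total = sum(ref.values()) + sum(pred.values())
    return Fraction(2 * overlap, total) if total else Fraction(0)


def drop_stopwords(tokens):
    """Remove tokens that appear in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


prediction = '이것은 첫번째 예측 문장 이지요'.split()
reference = '이것은 정답에 해당하는 문장 첫번째 입니다.'.split()

print(rouge1_f(reference, prediction))                                  # 6/11
print(rouge1_f(drop_stopwords(reference), drop_stopwords(prediction)))  # 2/3
```

In this toy case, filtering the endings stops them from counting as mismatches, so the score rises from 6/11 to 2/3. Whether that helps on real data depends on the stopword list.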

presto105 self-assigned this Dec 6, 2021

presto105 added the report Sharing information or results of analysis label Dec 6, 2021
