Develop DeduplicateDocuments filter #25
Conversation
```scala
  val randomGenerator: DocumentRandomGeneratorBase = new DocumentRandomGenerator
) extends DocFilter {

  def computeNearDuplicateTextRatio(doc: Document): Float = {
```
I think the metric looks good!
```scala
  def shouldRemoveDocument(doc: Document) = {
    val nearDuplicateTextRatio = computeNearDuplicateTextRatio(doc)
    val thresholdProb = randomGenerator.randomDouble(doc.docId)
    nearDuplicateTextRatio >= thresholdProb
```
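Assuming `randomGenerator.randomDouble(doc.docId)` yields a uniform value in [0, 1) keyed on the document ID (an assumption; the diff alone does not confirm it), this condition removes each document with probability equal to its near-duplicate ratio. A minimal standalone sketch of that behavior:

```scala
import scala.util.Random

// Sketch only: with a uniform threshold in [0, 1), a document whose
// near-duplicate ratio is r is removed with probability r.
def shouldRemoveSketch(ratio: Float, docId: String): Boolean = {
  val rng = new Random(docId.hashCode) // per-document seed keeps the decision deterministic
  ratio >= rng.nextDouble()
}
```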
I don't think this condition is correct.
It isn't right to give the same removal probability to a document 90% of whose text appeared in the corpus at least once and to one 90% of whose text appeared 100 or more times.
@eiennohito I see. You're right that the current implementation treats any duplication count of two or more the same and just computes the ratio of that duplicated text.
So the suggestion is that it would be better to, for example, compute a weighted ratio using the nearFreq values, or adjust the threshold with something like the average nearFreq?
This is pseudocode, but how about an implementation along these lines?
The idea is that the larger nearFreq is, the larger the ratio (the duplication metric) becomes.
```
weights = ln(nearFreqList) / (ln(nearFreqList) + 1)
# or, capping nearFreq at 100:
# nearFreqListCut100 = [nearFreq if nearFreq < 100 else 100 for nearFreq in nearFreqList]
# weights = ln(nearFreqListCut100) / ln(100)
ratio = weightedSum(paragraphLengths, weights) / documentLength
```
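To make this concrete in the project's language, here is a minimal Scala sketch of the weighted ratio. The `Paragraph`/`Doc` types and the `nearFreq` field are hypothetical stand-ins for illustration, not the project's actual model:

```scala
// Hypothetical stand-ins for the project's document model.
case class Paragraph(text: String, nearFreq: Int)
case class Doc(paragraphs: Seq[Paragraph])

// Weighted near-duplicate ratio: a paragraph's weight grows with its
// nearFreq (0 when nearFreq == 1, assuming nearFreq >= 1), so heavier
// duplication raises the metric.
def weightedNearDuplicateRatio(doc: Doc): Double = {
  val weightedLength = doc.paragraphs.map { p =>
    val w = math.log(p.nearFreq) / (math.log(p.nearFreq) + 1)
    p.text.length * w
  }.sum
  val totalLength = doc.paragraphs.map(_.text.length).sum
  if (totalLength == 0) 0.0 else weightedLength / totalLength
}
```

With the capped variant, the weight would instead be `math.log(math.min(p.nearFreq, 100)) / math.log(100)`, reaching 1.0 at 100 occurrences.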
That said, it might be more natural to implement it so that thresholdProb decreases as the average freq increases.
In that case, one option would be to feed the average freq into the parameters of the random distribution.
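One possible reading of that suggestion, purely as a sketch with assumed names: lower the mean of the threshold distribution as the document's average nearFreq grows, so heavily duplicated documents clear the removal bar more often:

```scala
import scala.util.Random

// Sketch only: the expected threshold shrinks as avgNearFreq grows,
// so documents with higher average duplication are removed more readily.
def freqAwareThreshold(avgNearFreq: Double, docId: String): Double = {
  val rng = new Random(docId.hashCode)
  val mu  = 1.0 / (1.0 + math.log1p(avgNearFreq)) // 1.0 at avgNearFreq = 0, decays with frequency
  math.max(0.0, math.min(1.0, mu + 0.1 * rng.nextGaussian()))
}
```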
It's a metric that approaches 1 as the document contains more paragraphs with a large freq.
As a result, how many duplicate documents are likely to remain?
@eiennohito I did a quick check, and the results were as follows:
- (before filtering) -> (after filtering)
- 244 -> 228
- 276 -> 266
- 267 -> 229
- 239 -> 228

Not much is being filtered out.
It seems we should tune the random number used as the threshold.
Filtering on this metric with normal random thresholds (mu=0.1, sd=0.1) gave:
- 244 -> 137
- 276 -> 147
- 267 -> 139
- 239 -> 124

Roughly half of the documents were removed.
I'll see whether I can extract the documents that were filtered out and take a look at them.
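For reference, the N(mu=0.1, sd=0.1) threshold described here could be sketched as follows; the per-document seeding is an assumption:

```scala
import scala.util.Random

// Sketch of the experiment's threshold: a normal sample with mu = 0.1 and
// sd = 0.1, clamped to [0, 1] so it is comparable to the ratio.
def gaussianThreshold(docId: String, mu: Double = 0.1, sd: Double = 0.1): Double = {
  val rng = new Random(docId.hashCode)
  math.max(0.0, math.min(1.0, mu + sd * rng.nextGaussian()))
}
```

A document would then be removed when its near-duplicate ratio meets or exceeds this sample.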
I'd like to merge at this point so we can run experiments. What do you think?
@eiennohito It's fine to go ahead and merge now!
Add a per-document deduplication filter.
[TODO]