Develop DeduplicateDocuments filter #25
Conversation
```scala
  val randomGenerator: DocumentRandomGeneratorBase = new DocumentRandomGenerator
) extends DocFilter {

  def computeNearDuplicateTextRatio(doc: Document): Float = {
```
I think the metric looks good!
```scala
  def shouldRemoveDocument(doc: Document) = {
    val nearDuplicateTextRatio = computeNearDuplicateTextRatio(doc)
    val thresholdProb = randomGenerator.randomDouble(doc.docId)
    nearDuplicateTextRatio >= thresholdProb
```
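Assuming `randomGenerator.randomDouble(doc.docId)` yields a uniform value in [0, 1) keyed on the document ID (an assumption; the diff alone does not confirm it), this condition removes each document with probability equal to its near-duplicate ratio. A minimal standalone sketch of that behavior:

```scala
import scala.util.Random

// Sketch only: with a uniform threshold in [0, 1), a document whose
// near-duplicate ratio is r is removed with probability r.
def shouldRemoveSketch(ratio: Float, docId: String): Boolean = {
  val rng = new Random(docId.hashCode) // per-document seed keeps the decision deterministic
  ratio >= rng.nextDouble()
}
```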
I don't think this condition is correct.
It isn't right to give the same removal probability to a document 90% of whose text appeared in the corpus at least once and to one 90% of whose text appeared 100 or more times.
@eiennohito I see. You're right that the current implementation treats any duplication count of two or more the same and just computes the ratio of that duplicated text.
So the suggestion is that it would be better to, for example, compute a weighted ratio using the nearFreq values, or adjust the threshold with something like the average nearFreq?
This is pseudocode, but how about an implementation along these lines?
The idea is that the larger nearFreq is, the larger the ratio (the duplication metric) becomes.
```
weights = ln(nearFreqList) / (ln(nearFreqList) + 1)
# or, capping nearFreq at 100:
# nearFreqListCut100 = [nearFreq if nearFreq < 100 else 100 for nearFreq in nearFreqList]
# weights = ln(nearFreqListCut100) / ln(100)
ratio = weightedSum(paragraphLengths, weights) / documentLength
```
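To make this concrete in the project's language, here is a minimal Scala sketch of the weighted ratio. The `Paragraph`/`Doc` types and the `nearFreq` field are hypothetical stand-ins for illustration, not the project's actual model:

```scala
// Hypothetical stand-ins for the project's document model.
case class Paragraph(text: String, nearFreq: Int)
case class Doc(paragraphs: Seq[Paragraph])

// Weighted near-duplicate ratio: a paragraph's weight grows with its
// nearFreq (0 when nearFreq == 1, assuming nearFreq >= 1), so heavier
// duplication raises the metric.
def weightedNearDuplicateRatio(doc: Doc): Double = {
  val weightedLength = doc.paragraphs.map { p =>
    val w = math.log(p.nearFreq) / (math.log(p.nearFreq) + 1)
    p.text.length * w
  }.sum
  val totalLength = doc.paragraphs.map(_.text.length).sum
  if (totalLength == 0) 0.0 else weightedLength / totalLength
}
```

With the capped variant, the weight would instead be `math.log(math.min(p.nearFreq, 100)) / math.log(100)`, reaching 1.0 at 100 occurrences.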
That said, it might be more natural to implement it so that thresholdProb decreases as the average freq increases.
In that case, one option would be to feed the average freq into the parameters of the random distribution.
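One possible reading of that suggestion, purely as a sketch with assumed names: lower the mean of the threshold distribution as the document's average nearFreq grows, so heavily duplicated documents clear the removal bar more often:

```scala
import scala.util.Random

// Sketch only: the expected threshold shrinks as avgNearFreq grows,
// so documents with higher average duplication are removed more readily.
def freqAwareThreshold(avgNearFreq: Double, docId: String): Double = {
  val rng = new Random(docId.hashCode)
  val mu  = 1.0 / (1.0 + math.log1p(avgNearFreq)) // 1.0 at avgNearFreq = 0, decays with frequency
  math.max(0.0, math.min(1.0, mu + 0.1 * rng.nextGaussian()))
}
```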
It's a metric that approaches 1 as the document contains more paragraphs with a large freq.
As a result, how many duplicate documents are likely to remain?
@eiennohito I did a quick check, and the results were as follows:
- (before filtering) -> (after filtering)
- 244 -> 228
- 276 -> 266
- 267 -> 229
- 239 -> 228

Not much is being filtered out.
It seems we should tune the random number used as the threshold.
Filtering on this metric with normal random thresholds (mu=0.1, sd=0.1) gave:
- 244 -> 137
- 276 -> 147
- 267 -> 139
- 239 -> 124

Roughly half of the documents were removed.
I'll see whether I can extract the documents that were filtered out and take a look at them.
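For reference, the N(mu=0.1, sd=0.1) threshold described here could be sketched as follows; the per-document seeding is an assumption:

```scala
import scala.util.Random

// Sketch of the experiment's threshold: a normal sample with mu = 0.1 and
// sd = 0.1, clamped to [0, 1] so it is comparable to the ratio.
def gaussianThreshold(docId: String, mu: Double = 0.1, sd: Double = 0.1): Double = {
  val rng = new Random(docId.hashCode)
  math.max(0.0, math.min(1.0, mu + sd * rng.nextGaussian()))
}
```

A document would then be removed when its near-duplicate ratio meets or exceeds this sample.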
I'd like to merge at this point so we can run experiments. What do you think?
@eiennohito It's fine to go ahead and merge now!
Add a per-document deduplication filter.
[TODO]