[Near Deduplication] Benchmark #7

Open
ChenghaoMou opened this issue Oct 1, 2022 · 2 comments

@ChenghaoMou
Collaborator

Provide results on a large dataset for different near-deduplication methods:

  1. MinHash + LSH
  2. SimHash
  3. any other relevant methods
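
For reference, here is a minimal sketch of what option 1 (MinHash + LSH) could look like in pure Python. It is an illustration under stated assumptions, not the implementation this benchmark will use: the whitespace tokenization, the 3-token shingles, the 64 permutations, the 16 bands, and every function name below are hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Split on whitespace and return the set of n-token shingles (assumed tokenization)."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items, num_perm=64):
    """Approximate num_perm random permutations with salted 64-bit BLAKE2b hashes."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents sharing any identical band become a candidate pair."""
    if not signatures:
        return set()
    # Rows per band; trailing rows are ignored if the length is not divisible by bands.
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(docs, threshold=0.8, num_perm=64, bands=16):
    """Return candidate pairs whose estimated Jaccard similarity meets the threshold."""
    sigs = {k: minhash_signature(shingles(v), num_perm) for k, v in docs.items()}
    return {
        (a, b) for a, b in candidate_pairs(sigs, bands)
        if estimated_jaccard(sigs[a], sigs[b]) >= threshold
    }
```

Calling `near_duplicates` on a dict mapping document ids to text returns the id pairs flagged as near duplicates. The number of bands and the threshold trade recall against the number of candidate pairs to verify, so both belong in the reported method parameters.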

Details to be included:

  • tokenization method
  • method parameters
  • hardware
  • memory usage
  • running time
  • deduplication results, with examples
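
One possible way to collect the time and memory numbers consistently across methods is a small wrapper like the sketch below. It is only a suggestion: the `benchmark` helper and the report keys are hypothetical names, and `tracemalloc` only sees allocations made through Python's allocator, so native memory used by extension libraries would need a separate measurement.

```python
import platform
import time
import tracemalloc


def benchmark(fn, *args, **kwargs):
    """Run fn once and report wall-clock time and peak traced Python memory."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    report = {
        "seconds": elapsed,
        "peak_bytes": peak,
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    return result, report
```

For example, `benchmark(dedup_fn, docs)` (where `dedup_fn` is whichever method is under test) returns the method's output together with a report dict that can be pasted into the results.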

@ChenghaoMou
Collaborator Author

| Model | Deduplication Method | Type | Comment | Src |
| --- | --- | --- | --- | --- |
| CodeGeeX | Paper not available | | | https://models.aminer.cn/codegeex/blog/ |
| InCoder | Exact match based on alphanumeric tokens/md5 + Bloom filter | Exact | Many other analyses on decontamination, filtering | https://arxiv.org/abs/2204.05999 |
| CodeGen | Exact match based on sha256 hashes | Exact | | https://arxiv.org/abs/2203.13474 |
| AlphaCode | Exact match ignoring whitespace | Exact | | https://arxiv.org/abs/2203.07814 |
| PolyCoder | Exact match based on sha256 hashes | Exact | | https://github.com/VHellendoorn/Code-LMs/blob/main/Data/deduplicate.py |
| PaLM Coder | Levenshtein distance | Near | | https://arxiv.org/abs/2204.02311 |
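
As a baseline for comparison, the exact-match strategies listed above can be sketched roughly as follows. This is an assumption-laden illustration and does not reproduce any listed model's actual pipeline: `normalized_sha256` and `exact_dedup` are hypothetical helpers, and "ignoring whitespace" is interpreted here as removing all whitespace characters before hashing.

```python
import hashlib


def normalized_sha256(text, ignore_whitespace=True):
    """Hash a document with SHA-256, optionally after stripping all whitespace."""
    if ignore_whitespace:
        text = "".join(text.split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_dedup(docs, ignore_whitespace=True):
    """Keep the first occurrence of each distinct (normalized) document, in order."""
    seen = set()
    kept = []
    for doc in docs:
        digest = normalized_sha256(doc, ignore_whitespace)
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

With `ignore_whitespace=True`, `"a = 1"` and `"a=1"` collapse into one entry; with `ignore_whitespace=False` only byte-identical documents are removed.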

@lvwerra
Contributor

lvwerra commented Oct 5, 2022

If we have a handful of deduplication strategies, we could train some smaller models to evaluate these approaches. We'll be working on the science plan over the next few days or weeks, and preprocessing ablations (including dedup) will probably be part of it for some studies.
