md5sum mismatch #1

Open
senwu opened this issue Aug 7, 2020 · 3 comments

senwu commented Aug 7, 2020

Thanks for sharing this awesome work. I really appreciate it!

I want to use this revised TACRED dataset for my study, but I found that my MD5 checksums don't match the ones listed in the README.

Here are my MD5 checksums:

1c090c0e3861d6ecccfd199fdf439bed  train.json 
393e7200a63ffd10a16072cbbee464dd  dev.json
d287fb2377747b74e6feae2e2bcd9264  dev_rev.json
aba500ef2f60c32bc41e366383e8cda8  test.json
4c9dfcb4c8d523420dbf0f34858362f3  test_rev.json

Also, from the patch files, I found that there are 1590 samples in the dev file and 936 samples in the test file. (It seems those numbers don't match the numbers reported in the paper?)

Please let me know if I am doing anything wrong. Thanks!

@ChristophAlt
Collaborator

@senwu Thanks for your interest in our work.

Let's see if we can narrow down the issue.

The checksums of the original TACRED files (train.json, dev.json, and test.json) match, so that part is fine. Could you tell me a little more about your setup, e.g., operating system and Python version? Writing a JSON file can behave differently across platforms, e.g., in its line endings.
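To illustrate why serialization details matter for checksums, here is a minimal Python sketch. It is not taken from this repository: the `sample` record and the chosen `json.dumps` settings are assumptions picked for illustration only. It shows that the same data, written with different separators, escaping, or line endings, produces different MD5 digests even though the parsed content is identical.

```python
import hashlib
import json


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical record, NOT real TACRED data; it only needs one non-ASCII character.
sample = [{"id": "example-1", "relation": "no_relation", "token": ["Zürich", "is", "nice"]}]

# The same data serialized under different (assumed) settings.
variants = {
    "default": json.dumps(sample),
    "compact separators": json.dumps(sample, separators=(",", ":")),
    "non-ASCII kept": json.dumps(sample, ensure_ascii=False),
    "indented, LF": json.dumps(sample, indent=2),
    "indented, CRLF": json.dumps(sample, indent=2).replace("\n", "\r\n"),
}

# Hash the bytes that would end up on disk for each variant.
digests = {name: md5_hex(text.encode("utf-8")) for name, text in variants.items()}

# Every variant still parses back to exactly the same Python object.
same_content = all(json.loads(text) == sample for text in variants.values())

for name, digest in digests.items():
    print(f"{digest}  {name}")
print("identical parsed content:", same_content)
```

A checksum mismatch on a generated JSON file therefore does not by itself mean the data differs; it may only mean that the bytes were written differently.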

Your observation is correct: there are fewer samples in the patch files than reported in the paper. The reason is that the number of "revised" samples also includes those that were assigned a second label by our annotators. As the TACRED format does not support multiple labels per sample, we chose not to patch those instances.

@liviosoares

Just wanted to report, in case it's helpful, that I originally had the same MD5 problem as @senwu when using Python 2. When running with Python 3, the MD5 checksums were consistent with the ones published by @ChristophAlt.

I have not spent time identifying if the problem is just JSON formatting differences or whether there are other potentially important content differences.
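One way to separate the two possibilities is to compare the files both byte-for-byte and after parsing. The sketch below is only an assumption about how such a check could be done, not something from this repository; the helper names `file_md5`, `canonical_md5`, and `compare` are hypothetical, and the file paths in the usage note are placeholders.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path) -> str:
    """MD5 of the raw bytes on disk (what `md5sum` reports)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def canonical_md5(path) -> str:
    """MD5 of a re-serialized, formatting-independent form of the JSON content."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Sorted keys and fixed separators remove whitespace and key-order differences.
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def compare(path_a, path_b) -> dict:
    """Report whether two JSON files match as bytes and as parsed content."""
    return {
        "bytes_match": file_md5(path_a) == file_md5(path_b),
        "content_match": canonical_md5(path_a) == canonical_md5(path_b),
    }
```

For example, `compare("py2/dev_rev.json", "py3/dev_rev.json")` (placeholder paths) returning `bytes_match: False` together with `content_match: True` would suggest a pure formatting difference, whereas `content_match: False` would mean the parsed data itself differs.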

@ChristophAlt
Collaborator

@liviosoares Thank you! Your feedback is very much appreciated. I'll try to identify the root cause of the problem.
