A gamedev question-answer dataset proposal #3465

kaydotdev · 2023-02-04T17:07:46Z

kaydotdev
Feb 4, 2023

Hey, everyone! 👋

Back in the day, I collected a question-answer dataset from a Unity3D Gaming Forum to build a chatbot for assisting in game development. Although the project was abandoned due to circumstances, I believe the dataset will be helpful for the Open Assistant. It contains both unprocessed forum questions in the raw.json file and cleaned + labeled question-answer pairs (12K+ samples) in the intents.json file. It's publicly available on the Kaggle platform under the CC0: Public Domain license: https://www.kaggle.com/datasets/antonkozyriev/unity3d-faq?select=intents.json

Also, the dataset does not contain personal information about users of the forum. The pipelines preprocessed all IDs and user mentions.

I hope you will find my dataset useful. Let me know whether I should upload it to HuggingFace!

AbdBarho · 2023-02-05T08:14:03Z

AbdBarho
Feb 5, 2023
Collaborator

@antonace that is really cool, and it is great that the data is public domain. I wonder if it was also public domain on the game dev forums it was scraped from?

kaydotdev · 2023-02-05T21:12:12Z

kaydotdev
Feb 5, 2023
Author

@AbdBarho yep, all information scraped from the website is accessible to a member of the general public; both raw and labeled data are in the public domain.

huu4ontocord · 2023-02-07T04:21:13Z

huu4ontocord
Feb 7, 2023
Maintainer

thank you!

Sobsz · 2023-02-07T13:19:41Z

Sobsz
Feb 7, 2023

> yep, all information scraped from the website is accessible to a member of the general public, both raw and labeled data are in the public domain.

For the record, "public domain" doesn't refer to data being publicly available, but rather being free of copyright. The data here is most likely copyrighted, as the Berne Convention protects everything with some modicum of creativity. (Learn more here.)

That being said, scraping copyrighted data is par for the course for large language models, including the pretrained models this project will be using, so it's Probably Fine™. If not, there's always StackExchange, where answers are explicitly licensed for reuse under CC BY-SA (same as Wikipedia, though the weak-copyleft aspect can be problematic).

kaydotdev · 2023-02-09T15:12:04Z

kaydotdev
Feb 9, 2023
Author

> > yep, all information scraped from the website is accessible to a member of the general public, both raw and labeled data are in the public domain.
>
> For the record, "public domain" doesn't refer to data being publicly available, but rather being free of copyright. The data here is most likely copyrighted, as the Berne Convention protects everything with some modicum of creativity. (Learn more here.)
>
> That being said, scraping copyrighted data is par for the course for large language models, including the pretrained models this project will be using, so it's Probably Fine™. If not, there's always StackExchange, where answers are explicitly licensed for reuse under CC BY-SA (same as Wikipedia, though the weak-copyleft aspect can be problematic).

@Sobsz thank you for the suggestion! I was not able to find any information about forum data licensing on the website. But I've contacted the Unity Support team about this issue, and I'm waiting for a response.

bitplane · 2023-02-13T01:09:48Z

bitplane
Feb 13, 2023
Collaborator

I might be wrong here (I'm not a lawyer) but I'll weigh in:

Usually with web forums, the individual posts are the copyright of the users who posted them. As part of sign-up they give the forum owners a license to use their contributions, which may or may not be transferable. This is the great web2.0 data heist; the platforms that collected the data can sell it (or themselves) as they have permission to use it, while anyone else would have to contact each individual user and ask for explicit permission.

In reality, very few of those users know or care about that. They expected their posts to be publicly used, it's arguably fair use, and if someone does complain you can filter their posts out. Anonymize the poster IDs and put the anonymized ones in the metadata, so there's no PII. Using a subset rather than all the data might also help with fair use (using 10 posts someone put on a public forum is more likely to be fair use than using 1000 things they posted).

The forum owners may have additional rights over the entire collection a) in areas of the world where "database rights" are strong and b) where there's been creative input from a curator of said database. Posts in a manually created "best of" forum may have database/collection rights; community-moderated ones won't.

If you can get permission from the forum owner then it can be released under whatever license they say. If not and they don't assert any right over the collection, I'd say go for it and see if anyone complains.
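For what it's worth, the three practical steps above (anonymize poster IDs, drop posts from anyone who complains, cap how many posts are taken per user) could be sketched roughly like this. This is only an illustration: the field names `user_id` and `text`, the helper names, the salt, and the default cap of 10 are all assumptions, not the actual schema of raw.json or intents.json. Note also that a salted hash is pseudonymization, so the salt has to stay private.

```python
import hashlib
from collections import defaultdict


def anonymize_id(user_id: str, salt: str) -> str:
    """Replace a poster ID with a truncated, salted SHA-256 digest."""
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    return digest[:12]


def prepare_posts(posts, salt, opted_out=frozenset(), max_per_user=10):
    """Filter and anonymize forum posts.

    `posts` is assumed to be an iterable of dicts with hypothetical
    keys "user_id" and "text"; the real dataset may be shaped differently.
    """
    kept = []
    counts = defaultdict(int)
    for post in posts:
        uid = post["user_id"]
        # Drop everything from users who asked to be excluded.
        if uid in opted_out:
            continue
        # Keep only a small subset of each user's posts.
        if counts[uid] >= max_per_user:
            continue
        counts[uid] += 1
        # Store only the anonymized ID, and only in the metadata.
        kept.append({
            "text": post["text"],
            "metadata": {"poster": anonymize_id(uid, salt)},
        })
    return kept
```

Handling a complaint would then just mean adding that user's original ID to `opted_out` and regenerating the release.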
