Replies: 6 comments
-
@antonace that is really cool, and it's great that the data is public domain. I wonder whether it was also public domain on the game dev forums it was scraped from?
-
@AbdBarho yep, all information scraped from the website is accessible to a member of the general public; both raw and labeled data are in the public domain.
-
Thank you!
-
For the record, "public domain" doesn't refer to data being publicly available, but rather being free of copyright. The data here is most likely copyrighted, as the Berne Convention protects everything with some modicum of creativity. (Learn more here.) That being said, scraping copyrighted data is par for the course for large language models, including the pretrained models this project will be using, so it's Probably Fine™. If not, there's always StackExchange, where answers are explicitly licensed for reuse under CC BY-SA (same as Wikipedia, though the weak-copyleft aspect can be problematic).
-
@Sobsz thank you for the suggestion! I was not able to find any information about forum data licensing on the website, but I've contacted the Unity Support team about this issue and am waiting for a response.
-
I might be wrong here (I'm not a lawyer), but I'll weigh in.

Usually with web forums, the copyright in individual posts belongs to the users who posted them. As part of sign-up, they give the forum owners a license to use their contributions, which may or may not be transferable. This is the great web2.0 data heist: the platforms that collected the data can sell it (or themselves) because they have permission to use it, while anyone else would have to contact each individual user and ask for explicit permission.

In reality, very few of those users know or care about that. They expected their posts to be publicly used, it's arguably fair use, and if someone does complain you can filter their posts out. Anonymize the poster IDs and put the anonymized ones in the metadata, so there's no PII. Using a subset rather than all the data might also help with fair use (using 10 posts someone put on a public forum is more likely to be fair use than using 1000 things they posted).

The forum owners may have additional rights over the entire collection a) in areas of the world where "database rights" are strong and b) where there's been creative input from a curator of said database. Posts in a manually created "best of" forum may have database/collection rights; community-moderated ones won't.

If you can get permission from the forum owner, then the data can be released under whatever license they say. If not, and they don't assert any right over the collection, I'd say go for it and see if anyone complains.
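The filtering steps mentioned above (dropping posts from users who complain, capping posts per user, and replacing poster IDs with anonymized ones in the metadata) could be sketched roughly as follows. This is only an illustration: the `author` and `text` field names, the `prepare_posts` and `anonymize_id` helpers, and the default cap of 10 posts are all assumptions, not the actual schema or pipeline of the dataset discussed here. A keyed hash is one possible way to anonymize IDs, and it only helps if the key stays private.

```python
import hashlib
import hmac
from collections import defaultdict


def anonymize_id(user_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for user_id using a keyed SHA-256 hash (HMAC)."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:12]


def prepare_posts(posts, secret_key, opted_out=frozenset(), max_per_user=10):
    """Filter and anonymize posts.

    `posts` is assumed to be a list of dicts with hypothetical
    "author" and "text" keys; adapt the names to the real data.
    """
    kept = []
    per_user = defaultdict(int)
    for post in posts:
        author = post["author"]
        if author in opted_out:
            # Drop everything from users who asked to be removed.
            continue
        if per_user[author] >= max_per_user:
            # Keep only a subset of each user's posts.
            continue
        per_user[author] += 1
        kept.append({
            "text": post["text"],
            # Only the pseudonym goes into the metadata, never the raw ID.
            "metadata": {"author": anonymize_id(author, secret_key)},
        })
    return kept
```

The same author always maps to the same pseudonym, so posts by one person can still be grouped, and a later removal request can be honored by adding that user to `opted_out` and rerunning the step.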
-
Hey, everyone! 👋
Back in the day, I collected a question-answer dataset from a Unity3D gaming forum to build a chatbot for assisting in game development. Although the project was abandoned due to circumstances, I believe the dataset will be helpful for Open Assistant. It contains both unprocessed forum questions in the raw.json file and cleaned, labeled question-answer pairs (12K+ samples) in the intents.json file. It's publicly available on the Kaggle platform under the CC0: Public Domain license: https://www.kaggle.com/datasets/antonkozyriev/unity3d-faq?select=intents.json
Also, the dataset does not contain personal information about forum users. The pipelines preprocessed all IDs and user mentions.
I hope you will find my dataset useful. Let me know whether I should upload it to Hugging Face!