Replies: 1 comment
Hi @epec254, thank you for your proposal, and apologies for my late response - I was on PTO. Regarding the license, it is indeed based solely on what the original dataset author provided. For SODA, I am not sure whether it is suitable for commercial usage, even though the author commented in a Twitter reply that the dataset is compatible with OpenAI's TOS. As for CoQA, we do need to add more details for such datasets :). I fully support your proposal! I've been maintaining a messy internal beta Excel sheet that highlights datasets suitable for commercial usage and those that may require closer scrutiny. Due to other priorities, I have not yet made it public. I'll be updating the document with an additional column based on your suggestions, so that people can consult their legal teams for further details. I look forward to collaborating on improving the document further :).
Hey folks,
First - awesome collection and effort to standardize formatting for chat-based datasets!
I was curious to understand the logic / approach for identifying the licenses as listed in https://github.com/salesforce/DialogStudio/blob/main/Dataset_Stats.csv. As I understand it, the license listed is the license provided by the dataset author for their work.
I propose adding a column that records "the most restrictive license that applies to the dataset."
Why? For someone who is looking for commercially friendly datasets (e.g., ones their company can use for commercial purposes), I think the as-is approach can be misleading. For example:
SODA - it is listed as MIT License. While that isn't incorrect based on their GitHub repo, if you review their HuggingFace page / academic paper, you come to understand that they produced the data by querying InstructGPT, an OpenAI model. Given this, and given OpenAI's TOS, the data is technically NOT MIT licensed - it is "corrupted" by the OpenAI license and can't be used for (per OpenAI's license) "developing models that compete with OpenAI." Hence, someone looking to use SODA for commercially related work (e.g., as data for a model they are training for their company) would, at the bare minimum, want to check with their legal team about their level of comfort with using this data.
CoQA - it is listed as MIT License. If you review their webpage, they state the below about the underlying licenses. To use this data for commercially related work, one would want to filter for the literature / wikipedia / CNN subset (since CC-BY-SA & Apache allow commercial use) and likely have their legal team review the MCTest and RACE licenses for commercial friendliness.
Thoughts?
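To make the proposed column concrete, here is a rough Python sketch of how it could be derived. Everything in it is an assumption for illustration: the column names (`dataset`, `license`, `most_restrictive_license`, `needs_legal_review`), the `UPSTREAM` table, and the numeric restrictiveness levels are hypothetical and are not taken from Dataset_Stats.csv, and the levels are not legal conclusions.

```python
from typing import NamedTuple


class Term(NamedTuple):
    name: str
    # Hypothetical score: 0 = permissive, 1 = share-alike, 2 = needs legal review.
    level: int


# Hypothetical table of upstream terms, filled in only from the two examples above.
UPSTREAM = {
    "SODA": [Term("OpenAI TOS (data generated with InstructGPT)", 2)],
    "CoQA": [
        Term("CC-BY-SA", 1),
        Term("Apache", 0),
        Term("MCTest license", 2),
        Term("RACE license", 2),
    ],
}


def annotate(row: dict) -> dict:
    """Return a copy of the row with the two proposed columns added."""
    # Assumption: the author-provided license is treated as permissive (level 0).
    terms = [Term(row["license"], 0)] + UPSTREAM.get(row["dataset"], [])
    top = max(t.level for t in terms)
    # Report every term that sits at the highest level, in the order listed.
    names = [t.name for t in terms if t.level == top]
    return {
        **row,
        "most_restrictive_license": "; ".join(names),
        "needs_legal_review": top >= 2,
    }


rows = [
    {"dataset": "SODA", "license": "MIT License"},
    {"dataset": "CoQA", "license": "MIT License"},
]
annotated = [annotate(r) for r in rows]
```

A dataset with no entry in `UPSTREAM` simply keeps its listed license, so the extra column never hides the author's original license; it only surfaces stricter upstream terms when someone has recorded them.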