Datasets to add #61
Comments
@x-tabdeveloping I believe these datasets mostly cover what we want in SEB? Let me know if there is anything specific that we are missing.
I'm sure we can get a bitext mining task for Swedish from OPUS, like Swedish-Norwegian or Swedish-Danish. I can do that if need be.
Could we perchance use (i.e. scrape or find) DBA entries and categories for Danish clustering?
We could, though I am unsure whether we could share the dataset.
A simpler solution might be to simply use dagw domains.
I have split up all the datasets to add into issues for now. Unless we specifically want to add some new dataset I believe we can close this issue.
Ah there might be a review sentiment dataset available here: https://github.com/esbenkc/Emma
There is also this intent classification dataset from Certainly.io: https://github.com/certainlyio/corona_dataset/blob/master/danish.csv
Seems like we have everything in the list. We might add a few other datasets, but we can make custom issues for those.
jolly, let's close this |
We are still missing a few datasets. I believe these can fill out the ranks and cover most use-cases for Scandinavian languages.
NordJylland News, STSVGSummerization, STSSNL, stsWe notably need
(before running model we should probably consider implementing #18 and #94).