Datasets to add #61

Closed
12 tasks done
KennethEnevoldsen opened this issue Jan 15, 2024 · 10 comments
Labels
dataset new dataset to add

Comments

@KennethEnevoldsen
Owner

KennethEnevoldsen commented Jan 15, 2024

We are still missing a few datasets. I believe these can fill out the ranks and cover most use-cases for Scandinavian languages.

We notably need:

  • bitext for Swedish
  • clustering for Danish

(Before running models, we should probably consider implementing #18 and #94.)

@KennethEnevoldsen
Owner Author

@x-tabdeveloping I believe these datasets mostly cover what we want in SEB? Let me know if there is anything specific that we are missing.

@x-tabdeveloping
Collaborator

I'm sure we can get a bitext mining task for Swedish from OPUS, such as Swedish-Norwegian or Swedish-Danish. I can do that if need be.
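
For reference, a bitext mining task of this kind could be scored roughly as follows. This is only a sketch: `embed` is a hypothetical character-trigram stand-in for whatever sentence encoder is being benchmarked, the function names are made up for illustration, and the assumption that `source[i]` is the gold translation of `target[i]` is mine, not something taken from OPUS or SEB.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_bitext(source, target):
    """For each source sentence, return the index of the most similar target sentence."""
    target_vectors = [embed(sentence) for sentence in target]
    matches = []
    for sentence in source:
        vector = embed(sentence)
        scores = [cosine(vector, candidate) for candidate in target_vectors]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches


def accuracy(matches):
    """Share of source sentences matched to their aligned target (gold index == position)."""
    return sum(index == match for index, match in enumerate(matches)) / len(matches)
```

With a real encoder swapped in for `embed`, `accuracy(mine_bitext(swedish, danish))` would give the fraction of Swedish sentences whose nearest Danish neighbour is the aligned translation.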

@x-tabdeveloping
Collaborator

Could we perchance use (aka. scrape or find) DBA entries and categories for Danish clustering?

@KennethEnevoldsen
Owner Author

> Could we perchance use (aka. scrape or find) DBA entries and categories for Danish clustering?

We could, but I am unsure whether we could share the dataset.

@KennethEnevoldsen KennethEnevoldsen added the dataset new dataset to add label Jan 25, 2024
@KennethEnevoldsen
Owner Author

A simpler solution might be to use the DAGW (Danish Gigaword) domains.
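
To make the idea concrete: the domain of each document would serve as the gold cluster label, and a model's clusters would be compared against those labels. A rough sketch of the scoring is below. It assumes V-measure as the metric (common for embedding clustering benchmarks, but whether SEB would use it for this dataset is an assumption), and the function names are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((count / n) * math.log(count / n) for count in Counter(labels).values())


def conditional_entropy(labels, given):
    """Conditional entropy H(labels | given) in nats."""
    n = len(labels)
    joint_counts = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum(
        (count / n) * math.log(count / given_counts[g])
        for (g, _), count in joint_counts.items()
    )


def v_measure(gold, predicted):
    """Harmonic mean of homogeneity and completeness.

    gold: domain label per document (used as the reference clustering).
    predicted: cluster id per document, e.g. from k-means on embeddings.
    """
    h_gold, h_pred = entropy(gold), entropy(predicted)
    homogeneity = 1.0 if h_gold == 0 else 1.0 - conditional_entropy(gold, predicted) / h_gold
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, gold) / h_pred
    total = homogeneity + completeness
    return 0.0 if total == 0 else 2 * homogeneity * completeness / total
```

A clustering that reproduces the domains exactly scores 1, while one that is independent of the domains scores close to 0.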

@KennethEnevoldsen
Owner Author

I have split all the datasets to add into separate issues for now. Unless we specifically want to add a new dataset, I believe we can close this issue.

@KennethEnevoldsen
Owner Author

KennethEnevoldsen commented Jan 28, 2024

Ah there might be a review sentiment dataset available here: https://github.com/esbenkc/Emma
or here: https://github.com/AlessandroGianfelici/danish_reviews_dataset

@KennethEnevoldsen
Owner Author

There is also this intent classification dataset from Certainly.io: https://github.com/certainlyio/corona_dataset/blob/master/danish.csv

@KennethEnevoldsen
Owner Author

It seems like we have everything in the list. We might add a few other datasets, but we can make custom issues for those.

@x-tabdeveloping
Collaborator

Jolly, let's close this.

2 participants