VG Clustering #83

Closed
Tracked by #61
KennethEnevoldsen opened this issue Jan 22, 2024 · 4 comments
Assignees
Labels
dataset new dataset to add

Comments

@KennethEnevoldsen
Owner

No description provided.

@KennethEnevoldsen KennethEnevoldsen mentioned this issue Jan 22, 2024
12 tasks
@KennethEnevoldsen KennethEnevoldsen self-assigned this Jan 22, 2024
@KennethEnevoldsen
Owner Author

@Muennighoff and @x-tabdeveloping the VG_summarization dataset has multiple classes per article:

dataset[1:10]["classes"]
#['nyheter,utenriks', 'sport,fotball', 'nyheter,innenriks', 'nyheter,utenriks', 'sport,ski,langrenn', 'sport', 'sport,fotball', 'nyheter,utenriks', 'sport,norges-idrettsforbund']

Most of them are finer-grained versions of a main category, e.g. 'nyheter,innenriks' (news, domestic) vs. 'nyheter,utenriks' (news, international).

In total, there are 18763 documents in the test set, with 1147 unique labels. Keeping only the first class gives 15 main categories, keeping the first two gives 220, and keeping the first three gives 1131.
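
For reference, counts like these can be reproduced roughly as follows. This is only a sketch: `count_labels_by_depth` is a hypothetical helper name, it assumes each entry in "classes" is a comma-separated string as in the output above, and it runs on the nine sample labels from that output rather than the full test set.

```python
def count_labels_by_depth(classes, depth):
    """Count unique labels when keeping only the first `depth` comma-separated parts."""
    return len({",".join(c.split(",")[:depth]) for c in classes})


# The nine labels shown in the snippet above (not the full test set)
sample = [
    "nyheter,utenriks", "sport,fotball", "nyheter,innenriks",
    "nyheter,utenriks", "sport,ski,langrenn", "sport", "sport,fotball",
    "nyheter,utenriks", "sport,norges-idrettsforbund",
]

for depth in (1, 2, 3):
    print(depth, count_labels_by_depth(sample, depth))
```

On the full test set, the same loop should give the counts for the first one, two and three classes.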

Unless you guys have anything to add, I will go with the 15 main categories only.
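
A minimal sketch of what "only the main category" would mean for the clustering labels. `main_category` is a hypothetical name and not necessarily how the final task implements it; the input is again the sample from above.

```python
from collections import Counter


def main_category(classes):
    """Return the first comma-separated class, e.g. 'sport,fotball' -> 'sport'."""
    return classes.split(",", 1)[0]


# The nine labels shown in the snippet above (not the full test set)
sample = [
    "nyheter,utenriks", "sport,fotball", "nyheter,innenriks",
    "nyheter,utenriks", "sport,ski,langrenn", "sport", "sport,fotball",
    "nyheter,utenriks", "sport,norges-idrettsforbund",
]

labels = [main_category(c) for c in sample]
# Label distribution, useful for spotting very small clusters
print(Counter(labels))
```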

@x-tabdeveloping
Collaborator

Sounds very reasonable, I think we should roll with that.

@KennethEnevoldsen KennethEnevoldsen added the dataset new dataset to add label Jan 25, 2024
@KennethEnevoldsen
Owner Author

Clustering added in #96

@KennethEnevoldsen KennethEnevoldsen changed the title from "VG_summarization, clustering and STS" to "VG Clustering" Jan 28, 2024
@KennethEnevoldsen
Owner Author

Moved VG STS to a new issue.
