Code-prose-composition tagger #234

Open: wants to merge 4 commits into base: learn2code


@no0p commented Feb 13, 2025

Tagger for Code Prose Composition

Add a tagger that adds attributes for code-prose-other composition of files based on line classifications.

Produces tags like the following:

{"id":"","attributes":{"exp__code_prose_composition__prose":[[0,26095,0.73]],"exp__code_prose_composition__other":[[0,26095,0.03]],"exp__code_prose_composition__code":[[0,26095,0.24]],"exp__code_prose_composition__prose_count":[[0,26095,123.0]],"exp__code_prose_composition__prose_mean_entropy":[[0,26095,0.33532]],"exp__code_prose_composition__other_count":[[0,26095,5.0]],"exp__code_prose_composition__other_mean_entropy":[[0,26095,0.00184]],"exp__code_prose_composition__code_count":[[0,26095,40.0]],"exp__code_prose_composition__code_mean_entropy":[[0,26095,1.10486]]},"source":""}

Recommended filter for mixed prose/code content based on these tags is:

exp__code_prose_composition__code > 0.05
exp__code_prose_composition__prose > 0.3
exp__code_prose_composition__code_count >= 8
exp__code_prose_composition__code_mean_entropy < 0.5
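
A minimal sketch of applying this filter to one tagged record, under two assumptions that the description does not state: each attribute value is a list of `[start, end, score]` spans with a single document-level span, and the four conditions are combined with AND. The helper names `tag_value` and `is_mixed_prose_code` are hypothetical, not part of the tagger.

```python
PREFIX = "exp__code_prose_composition__"


def tag_value(attributes, name):
    """Return the score of the first span for a tag, or None if the tag is absent.

    Assumes each span has the form [start, end, score].
    """
    spans = attributes.get(PREFIX + name)
    if not spans:
        return None
    return spans[0][2]


def is_mixed_prose_code(record):
    """Apply the recommended thresholds, assuming all four must hold."""
    attrs = record.get("attributes", {})
    code = tag_value(attrs, "code")
    prose = tag_value(attrs, "prose")
    code_count = tag_value(attrs, "code_count")
    code_entropy = tag_value(attrs, "code_mean_entropy")
    if None in (code, prose, code_count, code_entropy):
        return False  # treat records with missing tags as not matching
    return (
        code > 0.05
        and prose > 0.3
        and code_count >= 8
        and code_entropy < 0.5
    )


# Trimmed copy of the example record above (only the tags the filter reads).
example = {
    "id": "",
    "attributes": {
        PREFIX + "prose": [[0, 26095, 0.73]],
        PREFIX + "code": [[0, 26095, 0.24]],
        PREFIX + "code_count": [[0, 26095, 40.0]],
        PREFIX + "code_mean_entropy": [[0, 26095, 1.10486]],
    },
    "source": "",
}
```

With these assumptions, the example record passes the first three conditions but is rejected by the entropy condition, since its `code_mean_entropy` of 1.10486 is not below 0.5.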

The code entropy condition compensates for the classifier's bias toward labeling short strings that contain "code-y" characters such as (, ), [, ], and : as code, a bias caused by a lack of good negative examples. Generating an appropriate set of examples to balance this is a TODO. Regardless, for now, filtering for high-confidence code predictions works well.
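
To illustrate why a low mean entropy corresponds to high-confidence predictions, here is a small sketch. The description does not say how the tagger computes entropy, so this assumes Shannon entropy (in bits) over per-line class probabilities, averaged across the lines assigned to a class. The function names and the probability rows are hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy of one probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mean_entropy(rows):
    """Average entropy over a list of per-line distributions (0.0 if empty)."""
    if not rows:
        return 0.0
    return sum(entropy_bits(row) for row in rows) / len(rows)


# Hypothetical (code, prose, other) probabilities for lines labeled as code.
confident_lines = [[0.98, 0.01, 0.01], [0.95, 0.04, 0.01]]
ambiguous_lines = [[0.40, 0.35, 0.25], [0.50, 0.45, 0.05]]
```

Under this assumption, `mean_entropy(confident_lines)` is much smaller than `mean_entropy(ambiguous_lines)`, so an upper bound on the mean entropy keeps only files whose code lines were classified with high confidence.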

Usage Detail

The model path references a private Hugging Face model under allenai, so running the tagger requires an access token with read permissions. I'm open to discussing making the model public for simplicity; my only reluctance is that it is still very much a prototype and has a long way to go. See the filter discussion above.

cmwilhelm and others added 4 commits February 12, 2025 16:48
This is a combination of @revbucket 's desired heuristic filters for learn2code,
plus the existing filters we had previously implemented in prior iterations.

A starting point. I am a bit suspicious about the precision of the regexes
I lifted from the StarCoder paper.

Note: plan is to merge this into long-lived "learn2code"
feature branch rather than main, as I suspect we will be iterating
a lot.
Add a tagger that adds attributes for code-prose-other
composition of files based on line classifications.