Code-prose-composition tagger #234

no0p · 2025-02-13T18:24:55Z

Tagger for Code Prose Composition

Adds a tagger that emits attributes describing each file's code/prose/other composition, based on per-line classifications.

Produces tags like the following:

{"id":"","attributes":{"exp__code_prose_composition__prose":[[0,26095,0.73]],"exp__code_prose_composition__other":[[0,26095,0.03]],"exp__code_prose_composition__code":[[0,26095,0.24]],"exp__code_prose_composition__prose_count":[[0,26095,123.0]],"exp__code_prose_composition__prose_mean_entropy":[[0,26095,0.33532]],"exp__code_prose_composition__other_count":[[0,26095,5.0]],"exp__code_prose_composition__other_mean_entropy":[[0,26095,0.00184]],"exp__code_prose_composition__code_count":[[0,26095,40.0]],"exp__code_prose_composition__code_mean_entropy":[[0,26095,1.10486]]},"source":""}
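As a rough illustration of how the share and count attributes above could be derived from per-line labels, here is a minimal sketch. It assumes each line has already been classified as `prose`, `other`, or `code`, that each share is the label's fraction of lines rounded to two decimals (which is consistent with the example: 123 of 168 lines is 0.73), and that the span covers the whole document. The function name `composition_attributes` is hypothetical and not taken from this PR, and the mean-entropy attributes are left out.

```python
from collections import Counter

LABELS = ("prose", "other", "code")
PREFIX = "exp__code_prose_composition"


def composition_attributes(line_labels, doc_length):
    """Build share and count attributes from per-line labels.

    Hypothetical helper: assumes one label per line and a single span
    [0, doc_length] covering the whole document.
    """
    counts = Counter(line_labels)
    total = len(line_labels)
    attributes = {}
    for label in LABELS:
        # Share of lines with this label; 0.0 for an empty document.
        fraction = counts[label] / total if total else 0.0
        attributes[f"{PREFIX}__{label}"] = [[0, doc_length, round(fraction, 2)]]
        attributes[f"{PREFIX}__{label}_count"] = [[0, doc_length, float(counts[label])]]
    return attributes


# Reproduce the counts from the example tag: 123 prose, 5 other, 40 code lines.
labels = ["prose"] * 123 + ["other"] * 5 + ["code"] * 40
attrs = composition_attributes(labels, 26095)
print(attrs[f"{PREFIX}__prose"])
print(attrs[f"{PREFIX}__code_count"])
```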

The recommended filter for mixed prose/code content, based on these tags, is:

exp__code_prose_composition__code > 0.05
exp__code_prose_composition__prose > 0.3
exp__code_prose_composition__code_count >= 8
exp__code_prose_composition__code_mean_entropy < 0.5
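The four conditions can be applied to a tag record along these lines. This is a sketch only: it assumes the conditions are meant to be combined with a logical AND, that each attribute holds a single `[start, end, value]` span, and that a record missing any of the attributes should be rejected. The helper names `attribute_value` and `is_mixed_prose_code` are hypothetical.

```python
PREFIX = "exp__code_prose_composition"


def attribute_value(attributes, name):
    """Return the value of the first [start, end, value] span, or None if absent."""
    spans = attributes.get(f"{PREFIX}__{name}")
    return spans[0][2] if spans else None


def is_mixed_prose_code(record):
    """Apply the recommended thresholds, assuming all four must hold."""
    attributes = record.get("attributes", {})
    code = attribute_value(attributes, "code")
    prose = attribute_value(attributes, "prose")
    code_count = attribute_value(attributes, "code_count")
    code_entropy = attribute_value(attributes, "code_mean_entropy")
    if None in (code, prose, code_count, code_entropy):
        return False  # Assumption: records without the tags are dropped.
    return (
        code > 0.05
        and prose > 0.3
        and code_count >= 8
        and code_entropy < 0.5
    )


# Values taken from the example tag in this description.
example = {
    "id": "",
    "attributes": {
        f"{PREFIX}__prose": [[0, 26095, 0.73]],
        f"{PREFIX}__code": [[0, 26095, 0.24]],
        f"{PREFIX}__code_count": [[0, 26095, 40.0]],
        f"{PREFIX}__code_mean_entropy": [[0, 26095, 1.10486]],
    },
    "source": "",
}
print(is_mixed_prose_code(example))
```

Under these assumptions the example record passes the share and count thresholds but is rejected by the entropy condition, since its code mean entropy of 1.10486 is above 0.5.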

The code entropy threshold compensates for a bias towards labeling short strings as code when they contain "code-y" characters such as (, ), [, ], and :, which is due to a lack of good negative examples. Generating an appropriate set of examples to balance this is a TODO. Regardless, for now, filtering for high-confidence (low-entropy) code predictions works well.
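To make the link between entropy and confidence concrete, here is a small sketch. It assumes the mean entropy is the Shannon entropy of each line's predicted class distribution, averaged over the lines assigned to a class; the PR does not state how the value is computed or which log base is used, so base 2 and the function names below are assumptions.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of one predicted class distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def mean_entropy(line_probabilities):
    """Average entropy over a list of per-line class distributions."""
    if not line_probabilities:
        return 0.0
    return sum(entropy(p) for p in line_probabilities) / len(line_probabilities)


# Hypothetical distributions over (code, prose, other) for lines labeled code.
confident_lines = [[0.98, 0.01, 0.01], [0.95, 0.04, 0.01]]
ambiguous_lines = [[0.50, 0.40, 0.10], [0.45, 0.45, 0.10]]

print(mean_entropy(confident_lines))
print(mean_entropy(ambiguous_lines))
```

In this sketch a confident prediction has entropy near zero, while a short, ambiguous line that is only marginally classified as code has high entropy, so an upper bound on mean entropy removes files whose code lines are mostly uncertain guesses.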

Usage Detail

The model path references a private Hugging Face model under allenai, so it requires an access token with read permissions. I am open to discussing making the model public for simplicity; my only reluctance is that it is very much still a prototype and has a long way to go. See the filter discussion above.

@revbucket

This is a combination of @revbucket's desired heuristic filters for learn2code and the filters we implemented in prior iterations. It is a starting point. I am a bit suspicious about the precision of the regexes I lifted from the StarCoder paper. Note: the plan is to merge this into the long-lived "learn2code" feature branch rather than main, as I suspect we will be iterating a lot.

cmwilhelm and others added 4 commits February 12, 2025 16:48

perform CI on learn2code branch and prs into it

f611478

linting issues

b19fe23

Code-prose-composition tagger.

95042fc

cmwilhelm force-pushed the learn2code branch from 2659e3d to 63facf6 on February 13, 2025 22:22
