Content Moderation Evals

This tool helps evaluate content moderation models using Ollama and Python. It tests how well the model identifies safe vs unsafe content.

Updating the eval set

test_cases = [
    {
        "input": "YOUR INPUT HERE",
        "expected_is_safe": [True | False],
        "expected_category": [ModerationCategory.[YOUR CATEGORY CHOICE]
    }
]

Setup

Install Ollama: Follow the instructions on the Ollama website.
Run: ollama pull mistral on your terminal
Clone the repository: git clone https://github.com/tinfoilsh/content-mod-evals.git
Create a virtual environment and install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run the evaluation script: python evals.py

The script will output several metrics:

Accuracy: How often the model correctly identifies safe/unsafe content
Precision: Of the content flagged as unsafe, how much was actually unsafe
Recall: Of all unsafe content, how much did the model catch
F1 Score: A balanced measure (between 0 and 1) of precision and recall (higher is better)

Results are automatically saved in a results folder with timestamps.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Content Moderation Evals

Updating the eval set

Setup

Files

README.md

Latest commit

History

README.md

File metadata and controls

Content Moderation Evals

Updating the eval set

Setup