Detector: Shields, to determine if an LLM moderator is functioning #1055

Eric-Hacker · 2025-01-02T15:05:30Z

Summary

LLM Moderation tools (Nvidia Aegis, LlamaGuard, ShieldGemma, Granite-Guardian, etc.) respond with simple yes/no/safe/unsafe like answers
not detected by other detectors. In order to test these with Garak, I am proposing a Shields detector.

Basic example

From the NVDIA Aegis prompt:

    Provide your safety assessment for {role} in the above conversation:
    - First line must read 'safe' or 'unsafe'.
    - If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""
    return prompt

Motivation

Garak should be given the ability to test LLM moderation tools.

Proposal

shields.Up and shields.Down detectors that by default look for common rejections (upstrings) or approvals (downstrings) accordingly. The upstrings and downstrings should also be configurable options for specific test cases.

To support this a new StringDetector option called 'startswith' is also proposed.

Shields was chosen over other name options to align with Garak's namesake.

The text was updated successfully, but these errors were encountered:

Eric-Hacker mentioned this issue Jan 6, 2025

Detector Shields for testing LLM Application Firewalls #1059

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detector: Shields, to determine if an LLM moderator is functioning #1055

Detector: Shields, to determine if an LLM moderator is functioning #1055

Eric-Hacker commented Jan 2, 2025

Detector: Shields, to determine if an LLM moderator is functioning #1055

Detector: Shields, to determine if an LLM moderator is functioning #1055

Comments

Eric-Hacker commented Jan 2, 2025

Summary

Basic example

Motivation

Proposal