You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
LLM Moderation tools (Nvidia Aegis, LlamaGuard, ShieldGemma, Granite-Guardian, etc.) respond with simple yes/no/safe/unsafe like answers
not detected by other detectors. In order to test these with Garak, I am proposing a Shields detector.
Basic example
From the NVDIA Aegis prompt:
Provide your safety assessment for {role} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""
return prompt
Motivation
Garak should be given the ability to test LLM moderation tools.
Proposal
shields.Up and shields.Down detectors that by default look for common rejections (upstrings) or approvals (downstrings) accordingly. The upstrings and downstrings should also be configurable options for specific test cases.
To support this a new StringDetector option called 'startswith' is also proposed.
Shields was chosen over other name options to align with Garak's namesake.
The text was updated successfully, but these errors were encountered:
Summary
LLM Moderation tools (Nvidia Aegis, LlamaGuard, ShieldGemma, Granite-Guardian, etc.) respond with simple yes/no/safe/unsafe like answers
not detected by other detectors. In order to test these with Garak, I am proposing a Shields detector.
Basic example
From the NVDIA Aegis prompt:
Motivation
Garak should be given the ability to test LLM moderation tools.
Proposal
shields.Up and shields.Down detectors that by default look for common rejections (upstrings) or approvals (downstrings) accordingly. The upstrings and downstrings should also be configurable options for specific test cases.
To support this a new StringDetector option called 'startswith' is also proposed.
Shields was chosen over other name options to align with Garak's namesake.
The text was updated successfully, but these errors were encountered: