diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml index b671bac..3822c1f 100644 --- a/.github/workflows/docs.yml +++ b/.github/workflows/docs.yml @@ -36,6 +36,8 @@ jobs: with: submodules: recursive fetch-depth: 0 + - name: Install packages + run: sudo apt-get install -y pandoc - name: Install poetry run: pipx install poetry - uses: actions/setup-python@v4 diff --git a/aisploit/scanner/__init__.py b/aisploit/scanner/__init__.py index ab4bae3..5d8b437 100644 --- a/aisploit/scanner/__init__.py +++ b/aisploit/scanner/__init__.py @@ -1,5 +1,11 @@ from .job import ScannerJob +from .plugin import Plugin, SendPromptsPlugin +from .report import Issue, IssueCategory __all__ = [ "ScannerJob", + "Plugin", + "SendPromptsPlugin", + "Issue", + "IssueCategory", ] diff --git a/docs/classifiers.ipynb b/docs/classifiers.ipynb new file mode 100644 index 0000000..15788e8 --- /dev/null +++ b/docs/classifiers.ipynb @@ -0,0 +1,308 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Classifiers\n", + "\n", + "The `Classifiers` module in AISploit provides a framework for building and using classifiers to score inputs based on certain criteria. Classifiers are essential components in red teaming and security testing tasks, where they are used to evaluate the characteristics or properties of input data.\n", + "\n", + "### Overview\n", + "\n", + "A classifier in AISploit is represented by a class that inherits from the `BaseClassifier` abstract base class. This base class defines the interface that all classifiers must implement. Additionally, there is a specialization for text classifiers, represented by the `BaseTextClassifier` class.\n", + "\n", + "### How Classifiers Work\n", + "\n", + "Classifiers in AISploit typically operate by analyzing input data and assigning a score based on predefined criteria. The `score()` method is the main entry point for scoring input data. This method takes the input data and optionally a list of reference inputs, and returns a `Score` object representing the score of the input." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating Custom Classifiers\n", + "\n", + "To create a custom classifier, you need to define a new class that inherits from `BaseClassifier` or `BaseTextClassifier`, depending on the nature of the input data. You then need to implement the `score()` method in your subclass, providing the logic for scoring input data." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import Any, List\n", + "\n", + "from aisploit.core import BaseClassifier, Score\n", + "\n", + "class CustomClassifier(BaseClassifier[float, Any]):\n", + " def score(self, input: Any, references: List[Any] | None = None) -> Score[float]:\n", + " # Implement scoring logic here\n", + " pass" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example Usage\n", + "\n", + "Here's an example of how to use a classifier in AISploit:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# Create an instance of the classifier\n", + "classifier = CustomClassifier()\n", + "\n", + "# Score an input\n", + "score = classifier.score(input=\"Hello World\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This documentation provides an overview of how classifiers work in AISploit, explains how to create custom classifiers, and gives an example of how to use a classifier in practice. Users can refer to this documentation to understand how to work with classifiers in their security testing tasks.\n", + "\n", + "### Ready-to-Use Classifiers\n", + "\n", + "AISploit also provides several ready-to-use classifiers that cover common use cases. These classifiers can be directly instantiated and used in your projects. Some of the ready-to-use classifiers include:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### SelfSimilarityClassifier\n", + "\n", + "The SelfSimilarityClassifier is a text classifier based on self-similarity using cosine similarity scores. It calculates the cosine similarity between the input text and reference texts to determine the degree of similarity." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'flagged': True,\n", + " 'value': {'aggregated_score': 0.6721476912498474,\n", + " 'scores': [0.6721476912498474]},\n", + " 'description': 'Returns True if the aggregated cosine similarity score is less than the threshold',\n", + " 'explanation': 'The aggregated cosine similarity score for the input is 0.6721476912498474'}" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from aisploit.classifiers import SelfSimilarityClassifier\n", + "\n", + "# What's the name of the Voodoo Lady in the Monkey Island games?\n", + "classifier = SelfSimilarityClassifier()\n", + "score = classifier.score(\n", + " \"The Voodoo Lady's name is Elaine Marley.\", \n", + " [\n", + " \"The Voodoo Lady's name is never revealed in the games, and she is simply referred to as the Voodoo Lady.\",\n", + " ],\n", + ")\n", + "\n", + "\n", + "vars(score)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### PresidioAnalyserClassifier\n", + "\n", + "The PresidioAnalyserClassifier is a text classifier that utilizes the Presidio Analyzer for detecting Personally Identifiable Information (PII) entities in text." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'flagged': True,\n", + " 'value': [type: PERSON, start: 11, end: 19, score: 0.85,\n", + " type: PHONE_NUMBER, start: 43, end: 55, score: 0.75],\n", + " 'description': 'Returns True if entities are found in the input',\n", + " 'explanation': 'Found 2 entities in input'}" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from aisploit.classifiers.presidio import PresidioAnalyserClassifier\n", + "\n", + "classifier = PresidioAnalyserClassifier()\n", + "score = classifier.score(\"My name is John Doo and my phone number is 212-555-5555\")\n", + "\n", + "vars(score)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### ComprehendPIIClassifier\n", + "\n", + "The ComprehendPIIClassifier is a text classifier that utilizes Amazon Comprehend for detecting Personally Identifiable Information (PII) entities in text." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'flagged': True,\n", + " 'value': [{'Score': 0.9999950528144836,\n", + " 'Type': 'NAME',\n", + " 'BeginOffset': 11,\n", + " 'EndOffset': 19},\n", + " {'Score': 0.9999926090240479,\n", + " 'Type': 'PHONE',\n", + " 'BeginOffset': 43,\n", + " 'EndOffset': 55}],\n", + " 'description': 'Returns True if entities are found in the input',\n", + " 'explanation': 'Found 2 entities in input'}" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from aisploit.classifiers.amazon import ComprehendPIIClassifier\n", + "\n", + "classifier = ComprehendPIIClassifier()\n", + "score = classifier.score(\"My name is John Doo and my phone number is 212-555-5555\")\n", + "\n", + "vars(score)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### ComprehendToxicityClassifier\n", + "\n", + "The ComprehendToxicityClassifier is a text classifier that leverages Amazon Comprehend for detecting toxic content in text." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'flagged': True,\n", + " 'value': {'Toxicity': 0.8208000063896179,\n", + " 'Labels': [{'Name': 'PROFANITY', 'Score': 0.19329999387264252},\n", + " {'Name': 'HATE_SPEECH', 'Score': 0.2694000005722046},\n", + " {'Name': 'INSULT', 'Score': 0.2587999999523163},\n", + " {'Name': 'GRAPHIC', 'Score': 0.19329999387264252},\n", + " {'Name': 'HARASSMENT_OR_ABUSE', 'Score': 0.18960000574588776},\n", + " {'Name': 'SEXUAL', 'Score': 0.21789999306201935},\n", + " {'Name': 'VIOLENCE_OR_THREAT', 'Score': 0.9879999756813049}]},\n", + " 'description': 'Returns True if the overall toxicity score is greater than or equal to the threshold',\n", + " 'explanation': 'The overall toxicity score for the input is 0.8208000063896179'}" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from aisploit.classifiers.amazon import ComprehendToxicityClassifier\n", + "\n", + "classifier = ComprehendToxicityClassifier()\n", + "score = classifier.score(\"I will kill you\")\n", + "\n", + "vars(score)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### ModerationClassifier\n", + "\n", + "The ModerationClassifier is a text classifier that utilizes OpenAI's Moderation API for detecting toxic content in text." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'flagged': True,\n", + " 'value': Moderation(categories=Categories(harassment=True, harassment_threatening=True, hate=False, hate_threatening=False, self_harm=False, self_harm_instructions=False, self_harm_intent=False, sexual=False, sexual_minors=False, violence=True, violence_graphic=False, self-harm=False, sexual/minors=False, hate/threatening=False, violence/graphic=False, self-harm/intent=False, self-harm/instructions=False, harassment/threatening=True), category_scores=CategoryScores(harassment=0.4573294222354889, harassment_threatening=0.35159170627593994, hate=0.0006792626227252185, hate_threatening=4.232471837894991e-05, self_harm=4.82136874779826e-06, self_harm_instructions=3.341407150969644e-08, self_harm_intent=2.0083894014533143e-06, sexual=4.86759927298408e-05, sexual_minors=1.9414277119267354e-07, violence=0.9988717436790466, violence_graphic=1.050253467838047e-05, self-harm=4.82136874779826e-06, sexual/minors=1.9414277119267354e-07, hate/threatening=4.232471837894991e-05, violence/graphic=1.050253467838047e-05, self-harm/intent=2.0083894014533143e-06, self-harm/instructions=3.341407150969644e-08, harassment/threatening=0.35159170627593994), flagged=True),\n", + " 'description': 'Moderation score for the given input',\n", + " 'explanation': 'Details about the moderation score'}" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from aisploit.classifiers.openai import ModerationClassifier\n", + "\n", + "classifier = ModerationClassifier()\n", + "score = classifier.score(\"I will kill you\")\n", + "\n", + "vars(score)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "aisploit-MWVof28N-py3.12", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/index.rst b/docs/index.rst index 61874b2..4649464 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -4,5 +4,7 @@ :hidden: :maxdepth: 1 + classifiers + scanner api/index diff --git a/docs/scanner.ipynb b/docs/scanner.ipynb new file mode 100644 index 0000000..8ca0b51 --- /dev/null +++ b/docs/scanner.ipynb @@ -0,0 +1,89 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scanner\n", + "\n", + "The scanner in AISploit is a versatile tool designed for detecting vulnerabilities, exploits, or sensitive information in various AI models or systems. It offers extensibility through plugins, allowing users to customize and extend its functionality according to their specific needs.\n", + "\n", + "### Usage\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_openai import ChatOpenAI\n", + "from aisploit.scanner import ScannerJob\n", + "from aisploit.targets import LangchainTarget\n", + "from aisploit.scanner.plugins import ImageMarkdownInjectionPlugin\n", + "\n", + "\n", + "chat_model = ChatOpenAI(model=\"gpt-3.5-turbo\")\n", + "\n", + "job = ScannerJob(\n", + " target=LangchainTarget(model=chat_model),\n", + " plugins=[\n", + " ImageMarkdownInjectionPlugin(\n", + " domain=\"cxd47vgx2z2qyzr637trlgzogfm6ayyn.oastify.com\"\n", + " ),\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Extensibility with Plugins\n", + "\n", + "The scanner is extensible with plugins, which allow users to add custom scanning capabilities tailored to their requirements. Below is an example of how to create a custom plugin:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import Sequence\n", + "from aisploit.core import BaseTarget\n", + "from aisploit.scanner import Issue, Plugin\n", + "\n", + "\n", + "class CustomPlugin(Plugin):\n", + " \"\"\"A custom scanner plugin for detecting specific issues.\"\"\"\n", + "\n", + " def run(self, *, run_id: str, target: BaseTarget) -> Sequence[Issue]:\n", + " # Define your logic to identify issues\n", + " pass" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "aisploit-MWVof28N-py3.12", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}