
Commit

Add module docstring
mostafa committed Apr 6, 2024
1 parent 69bb4df commit 43945d9
Showing 2 changed files with 120 additions and 12 deletions.
92 changes: 81 additions & 11 deletions README.md
@@ -1,35 +1,105 @@
# DeepSQLi

Deep learning model, dataset, trained model and related code for SQL injection detection.
This repository contains the code for the DeepSQLi project, which aims to detect SQL injection attacks using deep learning models. The project consists of two main components: the **Tokenizer API** and the **Serving API**. The Tokenizer API tokenizes and sequences the input query, while the Serving API predicts whether the query is a SQL injection attempt.

The Tokenizer API is built with Flask and TensorFlow, and the Serving API is built with TensorFlow Serving. The Tokenizer API tokenizes and sequences the input query using the corresponding [dataset](./dataset/), and the Serving API predicts whether the query is a SQL injection attempt using the trained deep learning [model](./sqli_model/).

The project also includes a [SQL IDS/IPS plugin](https://github.com/gatewayd-io/gatewayd-plugin-sql-ids-ips) that integrates with the GatewayD database gateway and serves as a frontend for these APIs. The plugin intercepts incoming queries and sends them to the Serving API for prediction. If a query is predicted to be SQL injection, the plugin terminates the request; otherwise, it forwards the query to the database.

The following diagram shows the architecture of the project:

```mermaid
flowchart TD
Client <-- PostgreSQL wire protocol:15432 --> GatewayD
GatewayD <--> Sq["SQL IDS/IPS plugin"]
Sq <-- http:8000 --> T["Tokenizer API"]
Sq <-- http:8501 --> S["Serving API"]
S -- loads --> SM["SQLi models"]
T -- loads --> Dataset
Sq -- threshold: 80% --> D{Malicious query?}
D -->|No: send to| Database
D -->|Yes: terminate request| GatewayD
```

There are currently two trained models, both built from the [dataset](./dataset/). They share the same architecture but are trained on different dataset versions: the first model on the [SQLi dataset v1](./dataset/sqli_dataset1.csv) and the second on the [SQLi dataset v2](./dataset/sqli_dataset2.csv). The models are trained with the following hyperparameters (an illustrative sketch follows the list):

- Model architecture: LSTM (Long Short-Term Memory)
- Embedding dimension: 128
- LSTM units: 64
- Dropout rate: 0.2
- Learning rate: 0.001
- Loss function: Binary crossentropy
- Optimizer: Adam
- Metrics: Accuracy, precision, recall, and F1 score
- Validation split: 0.2
- Dense layer units: 1
- Activation function: Sigmoid
- Maximum sequence length: 100
- Maximum number of tokens: 10000
- Maximum number of epochs: 11
- Batch size: 32
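
For illustration, a minimal Keras sketch that wires these hyperparameters together might look like the following. It is a simplified approximation, not the exact code in [training/train.py](./training/train.py), and the variable names are placeholders:

```python
# Simplified sketch of the architecture described above (not the project's training script).
import tensorflow as tf

# Queries are tokenized and padded to a maximum length of 100 before being fed to the model.
model = tf.keras.Sequential(
    [
        tf.keras.layers.Embedding(input_dim=10000, output_dim=128),  # max tokens: 10000, embedding dim: 128
        tf.keras.layers.LSTM(64, dropout=0.2),                       # LSTM units: 64, dropout rate: 0.2
        tf.keras.layers.Dense(1, activation="sigmoid"),              # single sigmoid output unit
    ]
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    # F1 score can be derived from precision and recall after evaluation.
    metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)

# model.fit(padded_sequences, labels, validation_split=0.2, epochs=11, batch_size=32)
```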

## Installation

The fastest way to get started is to use Docker and Docker Compose. If you don't have Docker installed, you can install it by following the instructions [here](https://docs.docker.com/get-docker/).

### Docker Compose

Use the following command to build and run the Tokenizer and Serving API containers using Docker Compose (recommended):

```bash
docker compose up -d
```

To stop the containers, use the following command:

```bash
docker compose stop
```

To remove the containers and release their resources, use the following command:

```bash
docker compose down
```

### Docker

#### Build the images

```bash
# Build the images
docker build --no-cache --tag tokenizer-api:latest -f Dockerfile.tokenizer-api .
docker build --no-cache --tag serving-api:latest -f Dockerfile.serving-api .
```

#### Run the containers

```bash
# Run the Tokenizer and Serving API containers
docker run --rm --name tokenizer-api -p 8000:8000 -d tokenizer-api:latest
docker run --rm --name serving-api -p 8500-8501:8500-8501 -d serving-api:latest
```

### Test

You can test the APIs using the following commands:

#### Tokenizer API

```bash
# Tokenize and sequence the query
curl 'http://localhost:8000/tokenize_and_sequence' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' --data-raw '{"query":"select * from users where id = 1 or 1=1"}'
```
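
The same request can be made from Python; this sketch assumes, as the one-liner further below does, that the response is a JSON object with a `tokens` field:

```python
# Illustrative Python equivalent of the curl command above.
import requests

resp = requests.post(
    "http://localhost:8000/tokenize_and_sequence",
    json={"query": "select * from users where id = 1 or 1=1"},
    timeout=5,
)
resp.raise_for_status()
print(resp.json()["tokens"])  # padded sequence of token IDs (assumed field name)
```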

#### Serving API

```bash
# Predict whether the query is SQLi or not
curl 'http://localhost:8501/v1/models/sqli_model:predict' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' --data-raw '{"inputs":[[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,21,4,32,3,10,3,3]]}'
```

#### One-liner

```bash
# Tokenize the query and get a prediction in a single command
curl -s 'http://localhost:8501/v1/models/sqli_model:predict' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' --data-raw '{"inputs":['$(curl -s 'http://localhost:8000/tokenize_and_sequence' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' --data-raw '{"query":"select * from users where id = 1 or 1=1"}' | jq -c .tokens)']}' | jq
```
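
The same end-to-end flow in Python, applying the 80% threshold shown in the architecture diagram. This is a sketch that assumes the Tokenizer API returns a `tokens` field and that TensorFlow Serving returns the sigmoid score under `outputs` (the REST response field that pairs with a request body using `inputs`):

```python
# Sketch: tokenize a query, request a prediction, and apply an 80% threshold.
import requests

QUERY = "select * from users where id = 1 or 1=1"

# 1. Tokenize and sequence the query.
tokens = requests.post(
    "http://localhost:8000/tokenize_and_sequence",
    json={"query": QUERY},
    timeout=5,
).json()["tokens"]

# 2. Ask the Serving API for a prediction.
outputs = requests.post(
    "http://localhost:8501/v1/models/sqli_model:predict",
    json={"inputs": [tokens]},
    timeout=5,
).json()["outputs"]

# 3. Interpret the sigmoid score with the same 80% threshold the plugin uses.
score = float(outputs[0][0])
print(f"SQLi score: {score:.2%} -> {'malicious' if score >= 0.80 else 'benign'}")
```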
40 changes: 39 additions & 1 deletion training/train.py
@@ -1,4 +1,42 @@
"""Deep Learning Model Training with LSTM
This Python script is used for training a deep learning model using
Long Short-Term Memory (LSTM) networks.
The script starts by importing necessary libraries. These include `sys`
for interacting with the system, `pandas` for data manipulation, `tensorflow`
for building and training the model, `sklearn` for splitting the dataset and
calculating metrics, and `numpy` for numerical operations.
The script expects two command-line arguments: the input file and the output directory.
If these are not provided, the script will exit with a usage message.
The input file is expected to be a CSV file, which is loaded into a pandas DataFrame.
The script assumes that this DataFrame has a column named "Query" containing the text
data to be processed, and a column named "Label" containing the target labels.
The text data is then tokenized using the `Tokenizer` class from
`tensorflow.keras.preprocessing.text` (TF/IDF). The tokenizer is fit on the text data
and then used to convert the text into sequences of integers. The sequences are then
padded to a maximum length of 100 using the `pad_sequences` function.
The data is split into a training set and a test set using the `train_test_split` function
from `sklearn.model_selection`. The split is stratified, meaning that the distribution of
labels in the training and test sets should be similar.
A Sequential model is created using the `Sequential` class from `tensorflow.keras.models`.
The model consists of an Embedding layer, an LSTM layer, and a Dense layer. The model is
compiled with the Adam optimizer and binary cross-entropy loss function, and it is trained
on the training data.
After training, the model is used to predict the labels of the test set. The predictions
are then compared with the true labels to calculate various performance metrics, including
accuracy, recall, precision, F1 score, specificity, and ROC. These metrics are printed to
the console.
Finally, the trained model is saved in the SavedModel format to the output directory
specified by the second command-line argument.
"""

import sys
import pandas as pd
