
Commit

Add module docstring
mostafa committed Apr 6, 2024
1 parent 69bb4df commit 43945d9
Showing 2 changed files with 120 additions and 12 deletions.
92 changes: 81 additions & 11 deletions README.md
@@ -1,35 +1,105 @@
# DeepSQLi

Deep learning model, dataset, trained model and related code for SQL injection detection.
This repository contains the code for the DeepSQLi project, which aims to detect SQL injection attacks using deep learning models. The project consists of two main components: the **Tokenizer API** and the **Serving API**. The Tokenizer API tokenizes and sequences the input query, while the Serving API predicts whether the query is a SQL injection attempt.

The Tokenizer API is built with Flask and TensorFlow, and the Serving API is built with TensorFlow Serving. The Tokenizer API tokenizes and sequences the input query using the corresponding [dataset](./dataset/), and the Serving API predicts whether the query is a SQL injection attempt using the trained deep learning [model](./sqli_model/).

The project also includes a [SQL IDS/IPS plugin](https://github.com/gatewayd-io/gatewayd-plugin-sql-ids-ips) that integrates with the GatewayD database gateway and serves as a frontend for these APIs. The plugin intercepts incoming queries and sends them to the Serving API for prediction. If a query is predicted to be SQL injection, the plugin terminates the request; otherwise, it forwards the query to the database.

The following diagram shows the architecture of the project:

```mermaid
flowchart TD
Client <-- PostgreSQL wire protocol:15432 --> GatewayD
GatewayD <--> Sq["SQL IDS/IPS plugin"]
Sq <-- http:8000 --> T["Tokenizer API"]
Sq <-- http:8501 --> S["Serving API"]
S -- loads --> SM["SQLi models"]
T -- loads --> Dataset
Sq -- threshold: 80% --> D{Malicious query?}
D -->|No: send to| Database
D -->|Yes: terminate request| GatewayD
```

There are currently two trained models, both built from the [dataset](./dataset/). They share the same architecture but are trained on different dataset versions: the first model on the [SQLi dataset v1](./dataset/sqli_dataset1.csv) and the second on the [SQLi dataset v2](./dataset/sqli_dataset2.csv). The models are trained with the following hyperparameters (an illustrative sketch follows the list):

- Model architecture: LSTM (Long Short-Term Memory)
- Embedding dimension: 128
- LSTM units: 64
- Dropout rate: 0.2
- Learning rate: 0.001
- Loss function: Binary crossentropy
- Optimizer: Adam
- Metrics: Accuracy, precision, recall, and F1 score
- Validation split: 0.2
- Dense layer units: 1
- Activation function: Sigmoid
- Maximum sequence length: 100
- Maximum number of tokens: 10000
- Maximum number of epochs: 11
- Batch size: 32
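
For illustration, a minimal Keras sketch that wires these hyperparameters together might look like the following. It is a simplified approximation, not the exact code in [training/train.py](./training/train.py), and the variable names are placeholders:

```python
# Simplified sketch of the architecture described above (not the project's training script).
import tensorflow as tf

# Queries are tokenized and padded to a maximum length of 100 before being fed to the model.
model = tf.keras.Sequential(
    [
        tf.keras.layers.Embedding(input_dim=10000, output_dim=128),  # max tokens: 10000, embedding dim: 128
        tf.keras.layers.LSTM(64, dropout=0.2),                       # LSTM units: 64, dropout rate: 0.2
        tf.keras.layers.Dense(1, activation="sigmoid"),              # single sigmoid output unit
    ]
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    # F1 score can be derived from precision and recall after evaluation.
    metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)

# model.fit(padded_sequences, labels, validation_split=0.2, epochs=11, batch_size=32)
```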

## Installation

The fastest way to get started is to use Docker and Docker Compose. If you don't have Docker installed, you can install it by following the instructions [here](https://docs.docker.com/get-docker/).

### Docker Compose

Use the following command to build and run the Tokenizer and Serving API containers using Docker Compose (recommended):

```bash
docker compose up -d
```

To stop the containers, use the following command:

```bash
docker compose stop
```

To remove the containers and release their resources, use the following command:

```bash
docker compose down
```

### Docker

#### Build the images

```bash
# Build the images
docker build --no-cache --tag tokenizer-api:latest -f Dockerfile.tokenizer-api .
docker build --no-cache --tag serving-api:latest -f Dockerfile.serving-api .
```

#### Run the containers

```bash
# Run the Tokenizer and Serving API containers
docker run --rm --name tokenizer-api -p 8000:8000 -d tokenizer-api:latest
docker run --rm --name serving-api -p 8500-8501:8500-8501 -d serving-api:latest
```

### Test

You can test the APIs using the following commands:

#### Tokenizer API

```bash
# Tokenize and sequence the query
curl 'http://localhost:8000/tokenize_and_sequence' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' --data-raw '{"query":"select * from users where id = 1 or 1=1"}'
```
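
The same request can be made from Python; this sketch assumes, as the one-liner further below does, that the response is a JSON object with a `tokens` field:

```python
# Illustrative Python equivalent of the curl command above.
import requests

resp = requests.post(
    "http://localhost:8000/tokenize_and_sequence",
    json={"query": "select * from users where id = 1 or 1=1"},
    timeout=5,
)
resp.raise_for_status()
print(resp.json()["tokens"])  # padded sequence of token IDs (assumed field name)
```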

#### Serving API

```bash
# Predict whether the query is SQLi or not
curl 'http://localhost:8501/v1/models/sqli_model:predict' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' --data-raw '{"inputs":[[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,21,4,32,3,10,3,3]]}'
```

#### One-liner

```bash
# Tokenize the query and get a prediction in a single command
curl -s 'http://localhost:8501/v1/models/sqli_model:predict' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' --data-raw '{"inputs":['$(curl -s 'http://localhost:8000/tokenize_and_sequence' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' --data-raw '{"query":"select * from users where id = 1 or 1=1"}' | jq -c .tokens)']}' | jq
```
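
The same end-to-end flow in Python, applying the 80% threshold shown in the architecture diagram. This is a sketch that assumes the Tokenizer API returns a `tokens` field and that TensorFlow Serving returns the sigmoid score under `outputs` (the REST response field that pairs with a request body using `inputs`):

```python
# Sketch: tokenize a query, request a prediction, and apply an 80% threshold.
import requests

QUERY = "select * from users where id = 1 or 1=1"

# 1. Tokenize and sequence the query.
tokens = requests.post(
    "http://localhost:8000/tokenize_and_sequence",
    json={"query": QUERY},
    timeout=5,
).json()["tokens"]

# 2. Ask the Serving API for a prediction.
outputs = requests.post(
    "http://localhost:8501/v1/models/sqli_model:predict",
    json={"inputs": [tokens]},
    timeout=5,
).json()["outputs"]

# 3. Interpret the sigmoid score with the same 80% threshold the plugin uses.
score = float(outputs[0][0])
print(f"SQLi score: {score:.2%} -> {'malicious' if score >= 0.80 else 'benign'}")
```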
40 changes: 39 additions & 1 deletion training/train.py
@@ -1,4 +1,42 @@
"""Deep Learning Model Training with LSTM
This Python script is used for training a deep learning model using
Long Short-Term Memory (LSTM) networks.
The script starts by importing necessary libraries. These include `sys`
for interacting with the system, `pandas` for data manipulation, `tensorflow`
for building and training the model, `sklearn` for splitting the dataset and
calculating metrics, and `numpy` for numerical operations.
The script expects two command-line arguments: the input file and the output directory.
If these are not provided, the script will exit with a usage message.
The input file is expected to be a CSV file, which is loaded into a pandas DataFrame.
The script assumes that this DataFrame has a column named "Query" containing the text
data to be processed, and a column named "Label" containing the target labels.
The text data is then tokenized using the `Tokenizer` class from
`tensorflow.keras.preprocessing.text` (TF/IDF). The tokenizer is fit on the text data
and then used to convert the text into sequences of integers. The sequences are then
padded to a maximum length of 100 using the `pad_sequences` function.
The data is split into a training set and a test set using the `train_test_split` function
from `sklearn.model_selection`. The split is stratified, meaning that the distribution of
labels in the training and test sets should be similar.
A Sequential model is created using the `Sequential` class from `tensorflow.keras.models`.
The model consists of an Embedding layer, an LSTM layer, and a Dense layer. The model is
compiled with the Adam optimizer and binary cross-entropy loss function, and it is trained
on the training data.
After training, the model is used to predict the labels of the test set. The predictions
are then compared with the true labels to calculate various performance metrics, including
accuracy, recall, precision, F1 score, specificity, and ROC. These metrics are printed to
the console.
Finally, the trained model is saved in the SavedModel format to the output directory
specified by the second command-line argument.
"""

import sys
import pandas as pd
