code-philia/PhishLLM

PhishLLM

Official repository for "Less Defined Knowledge and More True Alarms: Reference-based Phishing Detection without a Pre-defined Reference List". Published in USENIX Security 2024.

Read our Paper

Visit our Website

Download our Datasets

Cite our Paper

Introduction

Existing reference-based phishing detection:

  • ❌ Relies on a pre-defined reference list, which lacks comprehensiveness and incurs high maintenance costs
  • ❌ Does not fully make use of the textual semantics present on the webpage

In our PhishLLM, we build a reference-based phishing detection framework:

  • Without the pre-defined reference list: Modern LLMs have encoded far more extensive brand-domain information than any predefined list
  • Chain-of-thought credential-taking prediction: Reasoning about whether the page takes credentials step by step, based on the text on the webpage

Framework

Input: a URL and its screenshot. Output: Phish/Benign and the phishing target.

  • Step 1: Brand recognition model

    • Input: Logo caption, Logo OCR Results
    • Intermediate Output: LLM's predicted brand
    • Output: Validated predicted brand, confirmed through Google Images
  • Step 2: Credential-Requiring-Page classification model

    • Input: Webpage OCR results
    • Output: LLM chooses from A. Credential-Taking Page or B. Non-Credential-Taking Page
    • Go to Step 4 if the LLM chooses 'A'; otherwise go to Step 3.
  • Step 3: Credential-Requiring-Page transition model (activates if LLM chooses 'B' from the last step)

    • Input: All clickable UI elements
    • Intermediate Output: Top-1 most likely login UI
    • Output: Webpage after clicking that UI, go back to Step 1 with the updated webpage and URL
  • Step 4: Output step

    • Case 1: If the domain is a web hosting domain: the page is flagged as phishing if (i) the LLM predicts a targeted brand inconsistent with the webpage's domain and (ii) the LLM chooses 'A' in Step 2

    • Case 2: If the domain is not a web hosting domain: the page is flagged as phishing if (i) the LLM predicts a targeted brand inconsistent with the webpage's domain, (ii) the LLM chooses 'A' in Step 2, and (iii) the domain is not a popular domain indexed by Google

    • Otherwise: reported as benign
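
The Step 4 cases can be sketched as a small decision function. This is a minimal illustration, not the repository's implementation: the `PageSignals` container, its field names, and the `decide` function are hypothetical, and brand consistency is approximated here by exact string comparison of domains, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PageSignals:
    """Hypothetical container for the signals gathered in Steps 1-3."""
    page_domain: str                # domain of the URL under test
    predicted_brand: Optional[str]  # validated brand domain from Step 1, or None
    is_credential_page: bool        # True if the LLM chose 'A' in Step 2
    is_web_hosting_domain: bool     # True if the domain is a web hosting domain
    is_popular_domain: bool         # True if the domain is a popular domain indexed by Google


def decide(signals: PageSignals) -> Tuple[str, Optional[str]]:
    """Return (verdict, phishing_target) following the Step 4 cases."""
    brand = signals.predicted_brand
    # Condition (i): a brand was predicted and it does not match the page's domain.
    # Exact string comparison is an assumption made for this sketch.
    inconsistent = brand is not None and brand != signals.page_domain

    # Conditions (i) and (ii) are required in both cases.
    if not (inconsistent and signals.is_credential_page):
        return "benign", None

    # Case 1: web hosting domain, conditions (i) and (ii) are sufficient.
    if signals.is_web_hosting_domain:
        return "phish", brand

    # Case 2: other domains additionally require condition (iii).
    if not signals.is_popular_domain:
        return "phish", brand

    # Otherwise: reported as benign.
    return "benign", None
```

For example, `decide(PageSignals("unpopular.test", "sonicwall.com", True, False, False))` falls under Case 2 and reports a phishing page targeting sonicwall.com.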

Project structure

scripts/
├── infer/
│   └── test.py       # inference script
├── train/
│   └── train.py      # training script for the CRP transition model (a CLIP model)
├── pipeline/
│   └── test_llm.py   # TestLLM class
├── data/             # data utilities
└── utils/            # other utilities, such as web interaction utility functions

experiments/
├── ablation_study/ # ablation study in RQ2 and public phishing study in RQ4
├── componentwise_evaluation/ # component-wise evaluation experiments in RQ2, RQ3
└── field_study/ # Large/Small-scale field study in RQ4

prompts/ 
├── brand_recog_prompt.json 
└── crp_pred_prompt.json

server/ # deployment scripts to deploy PhishLLM demo website

Setup

  • Step 1: Clone the repository and install the requirements. The setup script creates a new conda environment named "phishllm".
    cd PhishLLM/
    chmod +x ./setup.sh
    export ENV_NAME="phishllm" && ./setup.sh
  • Step 2: Register an OpenAI API key (see the tutorial here). Paste the API key into './datasets/openai_key.txt'.

  • Step 3: Register a Google Programmable Search API key (see the tutorial here). Paste your API key (on the first line) and Search Engine ID (on the second line) into "./datasets/google_api_key.txt":

     [API_KEY]
     [SEARCH_ENGINE_ID]
    
  • Step 4 (Optional): Edit Hyperparameters. All hyperparameter configurations are in param_dict.yaml.
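
Before running inference, it can help to sanity-check the key files from Steps 2 and 3. The following is a minimal sketch that assumes only the file layouts described above; the helper names are hypothetical, and this is not necessarily how PhishLLM itself loads the keys.

```python
from pathlib import Path
from typing import Tuple


def read_openai_key(path: str = "./datasets/openai_key.txt") -> str:
    """Return the OpenAI API key stored in the file, stripped of whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key


def read_google_credentials(
    path: str = "./datasets/google_api_key.txt",
) -> Tuple[str, str]:
    """Return (api_key, search_engine_id) from the first two non-empty lines."""
    text = Path(path).read_text(encoding="utf-8")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError(f"{path} must contain an API key and a Search Engine ID")
    return lines[0], lines[1]
```

Calling both helpers once after setup surfaces a missing or malformed file early, rather than partway through a long inference run.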

Prepare the Dataset

To test on your own dataset, you need to prepare the dataset in the following structure:

testing_dir/
├── aaa.com/
│   ├── shot.png  # save the webpage screenshot
│   ├── info.txt  # save the webpage URL
│   └── html.txt  # save the webpage HTML source
├── bbb.com/
│   ├── shot.png  # save the webpage screenshot
│   ├── info.txt  # save the webpage URL
│   └── html.txt  # save the webpage HTML source
├── ccc.com/
│   ├── shot.png  # save the webpage screenshot
│   ├── info.txt  # save the webpage URL
│   └── html.txt  # save the webpage HTML source
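
A quick way to confirm that a dataset follows this layout is to scan it for missing files. The sketch below assumes the three file names shown above; the function name `find_incomplete_sites` is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import Dict, List

# File names required in each site folder, as described above.
REQUIRED_FILES = ("shot.png", "info.txt", "html.txt")


def find_incomplete_sites(testing_dir: str) -> Dict[str, List[str]]:
    """Map each site folder name to the required files it is missing."""
    missing: Dict[str, List[str]] = {}
    for site_dir in sorted(Path(testing_dir).iterdir()):
        if not site_dir.is_dir():
            continue  # ignore stray files at the top level
        absent = [name for name in REQUIRED_FILES if not (site_dir / name).is_file()]
        if absent:
            missing[site_dir.name] = absent
    return missing


if __name__ == "__main__":
    for site, files in find_incomplete_sites("./testing_dir").items():
        print(f"{site}: missing {', '.join(files)}")
```

An empty result means every site folder contains a screenshot, a URL file, and an HTML source file.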

Inference: Run PhishLLM

  conda activate phishllm
  python -m scripts.infer.test --folder [folder to test, e.g., ./testing_dir]

Understand the Output

  • The console prints logs like the following:

    
      [PhishLLMLogger][DEBUG] Folder ./datasets/field_study/2023-09-01/device-862044b2-5124-4735-b6d5-f114eea4a232.remotewd.com
      [PhishLLMLogger][DEBUG] Logo caption: the logo for sonicwall network security appliance
      [PhishLLMLogger][DEBUG] Logo OCR: SONICWALL Network Security Appliance Username
      [PhishLLMLogger][DEBUG] Industry: Technology
      [PhishLLMLogger][DEBUG] LLM prediction time: 0.9699530601501465
      [PhishLLMLogger][DEBUG] Detected brand: sonicwall.com
      [PhishLLMLogger][DEBUG] Domain sonicwall.com is valid and alive
      [PhishLLMLogger][DEBUG] CRP prediction: There is no confusing token. Then we find the keywords that are related to login: LOG IN. Additionally, the presence of "Username" suggests that this page requires credentials. Therefore, the answer would be A.
      [💥] Phishing discovered, phishing target is sonicwall.com
      [PhishLLMLogger][DEBUG] Folder ./datasets/field_study/2023-09-01/lp.aldooliveira.com
      [PhishLLMLogger][DEBUG] Logo caption: a black and white photo of the word hello world
      [PhishLLMLogger][DEBUG] Logo OCR: Hello world! Welcome to WordPress. This is your first post. Edit or delete it, then start writing! dezembro 2, 2021 publicado
      [PhishLLMLogger][DEBUG] Industry: Uncategorized
      [PhishLLMLogger][DEBUG] LLM prediction time: 0.8813009262084961
      [PhishLLMLogger][DEBUG] Detected brand: wordpress.com
      [PhishLLMLogger][DEBUG] Domain wordpress.com is valid and alive
      [PhishLLMLogger][DEBUG] CRP prediction: There is no token or keyword related to login or sensitive information. Therefore the answer would be B.
      [PhishLLMLogger][DEBUG] No candidate login button to click
       [✅] Benign
    
  • Meanwhile, a txt file named "[today's date]_phishllm.txt" is created with the following columns:

    • "folder": name of the folder
    • "phish_prediction": "phish" | "benign"
    • "target_prediction": phishing target brand's domain, e.g. paypal.com, meta.com
    • "brand_recog_time": time taken for brand recognition
    • "crp_prediction_time": time taken for CRP prediction
    • "crp_transition_time": time taken for CRP transition
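
The results file can be summarized with a few lines of Python. This is a sketch under stated assumptions: the column names come from the list above, but the tab delimiter and the presence of a header row are assumptions not confirmed here, so both are exposed as parameters. The function names are hypothetical.

```python
import csv
from collections import Counter
from typing import Dict, List

# Column names as listed above.
COLUMNS = [
    "folder",
    "phish_prediction",
    "target_prediction",
    "brand_recog_time",
    "crp_prediction_time",
    "crp_transition_time",
]


def load_results(
    path: str, delimiter: str = "\t", has_header: bool = True
) -> List[Dict[str, str]]:
    """Read the results file into a list of row dictionaries.

    The delimiter and header defaults are assumptions; adjust them to match
    the file that PhishLLM actually writes.
    """
    with open(path, newline="", encoding="utf-8") as f:
        if has_header:
            reader = csv.DictReader(f, delimiter=delimiter)
        else:
            reader = csv.DictReader(f, fieldnames=COLUMNS, delimiter=delimiter)
        return [dict(row) for row in reader]


def count_predictions(rows: List[Dict[str, str]]) -> Counter:
    """Count how many folders were labelled "phish" versus "benign"."""
    return Counter(row["phish_prediction"] for row in rows)
```

Filtering the rows on `phish_prediction == "phish"` and grouping by `target_prediction` gives a quick view of which brands are targeted most often.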

(Optional) Re-training

If you would like to re-train the CRP transition model, which is a CLIP model, we have annotated a dataset of login and non-login UI elements.

(Optional) Use other versions of GPT

To use a different GPT model, change the LLM_model argument in param_dict.yaml; the default is "gpt-3.5-turbo-16k". Please check the list of GPT models you can use.

Citations

@inproceedings{liu2024less,
  title={Less Defined Knowledge and More True Alarms: Reference-based Phishing Detection without a Pre-defined Reference List},
  author={Liu, Ruofan and Lin, Yun and Teoh, Xiwen and Liu, Gongshen and Huang, Zhiyong and Dong, Jin Song},
  booktitle={33rd USENIX Security Symposium (USENIX Security 24)},
  pages={523--540},
  year={2024}
}

If you have any issues running our code, you can raise a GitHub issue or email us at [email protected], [email protected], [email protected]
