easy to use

lindsey98 · Jan 26, 2024 · 0f4adf9
1 parent d73d54c
Showing 19 changed files with 5,156 additions and 90 deletions.
diff --git a/README.md b/README.md
@@ -29,7 +29,7 @@
 
 ## Framework
 
-<img src="phishpedia/big_pic/overview.png" style="width:2000px;height:350px"/>
+<img src="./datasets/overview.png" style="width:2000px;height:350px"/>
 
 ```Input```: A URL and its screenshot ```Output```: Phish/Benign, Phishing target
 - Step 1: Enter <b>Deep Object Detection Model</b>, get predicted logos and inputs (inputs are not used for later prediction, just for explanation)
@@ -40,88 +40,54 @@
 
 ## Project structure
 ```
-- src
-    - adv_attack: adversarial attacking scripts
-    - detectron2_pedia: training script for object detector
-     |_ output
-      |_ rcnn_2
-        |_ rcnn_bet365.pth 
-    - siamese_pedia: inference script for siamese
-     |_ siamese_retrain: training script for siamese
-     |_ expand_targetlist
-         |_ 1&1 Ionos
-         |_ ...
-     |_ domain_map.pkl
-     |_ resnetv2_rgb_new.pth.tar
-    - siamese.py: main script for siamese
-    - pipeline_eval.py: evaluation script for general experiment
-
-- tele: telegram scripts to vote for phishing 
-- phishpedia_config.py: config script for phish-discovery experiment 
-- phishpedia_main.py: main script for phish-discovery experiment 
+- logo_recog.py: Deep Object Detection Model
+- logo_matching.py: Deep Siamese Model 
+- configs.yaml: Configuration file
+- phishpedia.py: Main script
 ```
 
 ## Instructions
 Requirements: 
-- CUDA 11
 - Anaconda installed, please refer to the official installation guide: https://docs.anaconda.com/free/anaconda/install/index.html 
 
 1. Create a local clone of Phishpedia
-```
+```bash
 git clone https://github.com/lindsey98/Phishpedia.git
 ```
 
 2. Setup
-```
-cd Phishpedia/
+```bash
 chmod +x ./setup.sh
 ./setup.sh
 ```
-If you encounter any problem in downloading the models, you can manually download them from here https://huggingface.co/Kelsey98/Phishpedia. And put them into the corresponding conda environment.
 
 3. 
 ```
-conda activate myenv
+conda activate phishpedia
 ```
 
-Run in Python to test a single website
-```python
-from phishpedia.phishpedia_main import test
-import matplotlib.pyplot as plt
-from phishpedia.phishpedia_config import load_config
-
-url = open("phishpedia/datasets/test_sites/accounts.g.cdcde.com/info.txt").read().strip()
-screenshot_path = "phishpedia/datasets/test_sites/accounts.g.cdcde.com/shot.png"
-ELE_MODEL, SIAMESE_THRE, SIAMESE_MODEL, LOGO_FEATS, LOGO_FILES, DOMAIN_MAP_PATH = load_config(None)
-
-phish_category, pred_target, plotvis, siamese_conf, pred_boxes = test(url=url, screenshot_path=screenshot_path,
-                                                                       ELE_MODEL=ELE_MODEL,
-                                                                       SIAMESE_THRE=SIAMESE_THRE,
-                                                                       SIAMESE_MODEL=SIAMESE_MODEL,
-                                                                       LOGO_FEATS=LOGO_FEATS,
-                                                                       LOGO_FILES=LOGO_FILES,
-                                                                       DOMAIN_MAP_PATH=DOMAIN_MAP_PATH
-                                                                      )
-
-print('Phishing (1) or Benign (0) ?', phish_category)
-print('What is its targeted brand if it is a phishing ?', pred_target)
-print('What is the siamese matching confidence ?', siamese_conf)
-print('Where is the predicted logo (in [x_min, y_min, x_max, y_max])?', pred_boxes)
-plt.imshow(plotvis[:, :, ::-1])
-plt.title("Predicted screenshot with annotations")
-plt.show()
+4. Run in bash 
+```bash
+python phishpedia.py --folder <folder you want to test e.g. ./datasets/test_sites>
 ```
 
-Or run in bash 
+The testing folder should be in the structure of:
+
 ```
-python run.py --folder <folder you want to test e.g. phishpedia/datasets/test_sites> --results <where you want to save the results e.g. test.txt> 
+test_site_1
+|__ info.txt (Write the URL)
+|__ shot.png (Save the screenshot)
+test_site_2
+|__ info.txt (Write the URL)
+|__ shot.png (Save the screenshot)
+......
 ```
 
 ## Miscellaneous
 - In our paper, we also implement several phishing detection and identification baselines, see [here](https://github.com/lindsey98/PhishingBaseline)
 - The logo targetlist described in our paper includes 181 brands, we have further expanded the targetlist to include 277 brands in this code repository 
 - For the phish discovery experiment, we obtain feed from [Certstream phish_catcher](https://github.com/x0rz/phishing_catcher), we lower the score threshold to be 40 to process more suspicious websites, readers can refer to their repo for details
-- We use Scrapy for website crawling [Repo here](https://github.com/lindsey98/MyScrapy.git) 
+- We use Scrapy for website crawling 
 
 ## Citation 
 If you find our work useful in your research, please consider citing our paper by:
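
The testing-folder layout that the updated README describes (one subfolder per site, each holding `info.txt` and `shot.png`) can be walked with a few lines of Python. This is a minimal sketch, not the repository's own loader: `iter_test_sites` is a hypothetical helper name, and skipping incomplete folders is an assumption rather than documented behavior.

```python
import os
from typing import Iterator, Tuple


def iter_test_sites(folder: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, screenshot_path) for every complete site folder.

    Hypothetical helper for illustration; not part of Phishpedia.
    """
    for name in sorted(os.listdir(folder)):
        site_dir = os.path.join(folder, name)
        info_path = os.path.join(site_dir, "info.txt")
        shot_path = os.path.join(site_dir, "shot.png")
        # Assumption: entries missing either file (or stray files such as
        # .DS_Store) are skipped instead of raising an error.
        if not (os.path.isfile(info_path) and os.path.isfile(shot_path)):
            continue
        with open(info_path, encoding="utf-8") as fh:
            url = fh.read().strip()  # info.txt holds the URL
        yield url, shot_path
```

Each yielded pair matches the `Input` the README names: a URL and the path to its screenshot.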

diff --git a/configs.py b/configs.py
@@ -0,0 +1,63 @@
+# Global configuration
+import subprocess
+from typing import Union
+import yaml
+from logo_matching import cache_reference_list, load_model_weights
+from logo_recog import config_rcnn
+import os
+import numpy as np
+
+def get_absolute_path(relative_path):
+    base_path = os.path.dirname(__file__)
+    return os.path.abspath(os.path.join(base_path, relative_path))
+
+def load_config(reload_targetlist=False):
+
+    with open(os.path.join(os.path.dirname(__file__), 'configs.yaml')) as file:
+        configs = yaml.load(file, Loader=yaml.FullLoader)
+
+    # Iterate through the configuration and update paths
+    for section, settings in configs.items():
+        for key, value in settings.items():
+            if 'PATH' in key and isinstance(value, str):  # Check if the key indicates a path
+                absolute_path = get_absolute_path(value)
+                configs[section][key] = absolute_path
+
+    ELE_CFG_PATH = configs['ELE_MODEL']['CFG_PATH']
+    ELE_WEIGHTS_PATH = configs['ELE_MODEL']['WEIGHTS_PATH']
+    ELE_CONFIG_THRE = configs['ELE_MODEL']['DETECT_THRE']
+    ELE_MODEL = config_rcnn(ELE_CFG_PATH,
+                            ELE_WEIGHTS_PATH,
+                            conf_threshold=ELE_CONFIG_THRE)
+
+    # siamese model
+    SIAMESE_THRE = configs['SIAMESE_MODEL']['MATCH_THRE']
+
+    print('Load protected logo list')
+    targetlist_zip_path = configs['SIAMESE_MODEL']['TARGETLIST_PATH']
+    targetlist_dir = os.path.dirname(targetlist_zip_path)
+    zip_file_name = os.path.basename(targetlist_zip_path)
+    targetlist_folder = zip_file_name.split('.zip')[0]
+    full_targetlist_folder_dir = os.path.join(targetlist_dir, targetlist_folder)
+
+    if reload_targetlist or targetlist_zip_path.endswith('.zip') and not os.path.isdir(full_targetlist_folder_dir):
+        os.makedirs(full_targetlist_folder_dir, exist_ok=True)
+        subprocess.run(f'unzip -o "{targetlist_zip_path}" -d "{full_targetlist_folder_dir}"', shell=True)
+
+    SIAMESE_MODEL = load_model_weights( num_classes=configs['SIAMESE_MODEL']['NUM_CLASSES'],
+                                        weights_path=configs['SIAMESE_MODEL']['WEIGHTS_PATH'])
+
+    if reload_targetlist or (not os.path.exists(os.path.join(os.path.dirname(__file__), 'LOGO_FEATS.npy'))):
+        LOGO_FEATS, LOGO_FILES = cache_reference_list(model=SIAMESE_MODEL,
+                                                      targetlist_path=full_targetlist_folder_dir)
+        print('Finish loading protected logo list')
+        np.save(os.path.join(os.path.dirname(__file__),'LOGO_FEATS.npy'), LOGO_FEATS)
+        np.save(os.path.join(os.path.dirname(__file__),'LOGO_FILES.npy'), LOGO_FILES)
+
+    else:
+        LOGO_FEATS, LOGO_FILES = np.load(os.path.join(os.path.dirname(__file__),'LOGO_FEATS.npy')), \
+                                 np.load(os.path.join(os.path.dirname(__file__),'LOGO_FILES.npy'))
+
+    DOMAIN_MAP_PATH = configs['SIAMESE_MODEL']['DOMAIN_MAP_PATH']
+
+    return ELE_MODEL, SIAMESE_THRE, SIAMESE_MODEL, LOGO_FEATS, LOGO_FILES, DOMAIN_MAP_PATH
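
The extraction step in `load_config` derives the target folder from the archive name and unpacks when `reload_targetlist` is set, or when the path ends in `.zip` and the folder does not exist yet. Python binds `and` tighter than `or`, so the condition already groups that way. The sketch below writes the grouping out with parentheses and swaps the shell `unzip` call for the standard-library `zipfile` module. Both function names are hypothetical, and the `zipfile` substitution is an assumption, not what the commit does.

```python
import os
import zipfile


def targetlist_folder_for(zip_path: str) -> str:
    """Return the folder an archive unpacks into: its path minus '.zip'."""
    folder_name = os.path.basename(zip_path).split(".zip")[0]
    return os.path.join(os.path.dirname(zip_path), folder_name)


def ensure_targetlist(zip_path: str, reload_targetlist: bool = False) -> str:
    """Unpack the archive if needed and return the target folder.

    Hypothetical helper; mirrors the condition in load_config but uses
    zipfile instead of a shell `unzip` call.
    """
    folder = targetlist_folder_for(zip_path)
    # Same truth table as the original condition, with explicit grouping.
    needs_extract = reload_targetlist or (
        zip_path.endswith(".zip") and not os.path.isdir(folder)
    )
    if needs_extract:
        os.makedirs(folder, exist_ok=True)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)  # overwrites existing files
    return folder
```

Using `zipfile` avoids depending on an `unzip` binary and on shell quoting of the paths, at the cost of diverging from the committed code.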
diff --git a/configs.yaml b/configs.yaml
@@ -0,0 +1,11 @@
+ELE_MODEL: # element recognition model -- logo only
+  CFG_PATH: models/faster_rcnn.yaml # os.path.join(os.path.dirname(__file__), xxx)
+  WEIGHTS_PATH: models/rcnn_bet365.pth
+  DETECT_THRE: 0.05
+
+SIAMESE_MODEL:
+  NUM_CLASSES: 277 # number of brands, users don't need to modify this even the targetlist is expanded
+  MATCH_THRE: 0.87 # FIXME: threshold is 0.87 in phish-discovery?
+  WEIGHTS_PATH: models/resnetv2_rgb_new.pth.tar
+  TARGETLIST_PATH: models/expand_targetlist.zip
+  DOMAIN_MAP_PATH: models/domain_map.pkl
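
Every key in `configs.yaml` whose name contains `PATH` is stored relative to the repository, and `load_config` rewrites those values to absolute paths. The sketch below applies the same rule to a plain dict that mirrors the YAML, so it runs without PyYAML. `resolve_paths` and `CONFIGS` are hypothetical names, and returning a new dict instead of updating in place is a choice made here, not the commit's behavior.

```python
import os

# Mirror of configs.yaml as a plain dict, for illustration only.
CONFIGS = {
    "ELE_MODEL": {
        "CFG_PATH": "models/faster_rcnn.yaml",
        "WEIGHTS_PATH": "models/rcnn_bet365.pth",
        "DETECT_THRE": 0.05,
    },
    "SIAMESE_MODEL": {
        "NUM_CLASSES": 277,
        "MATCH_THRE": 0.87,
        "WEIGHTS_PATH": "models/resnetv2_rgb_new.pth.tar",
        "TARGETLIST_PATH": "models/expand_targetlist.zip",
        "DOMAIN_MAP_PATH": "models/domain_map.pkl",
    },
}


def resolve_paths(configs: dict, base_dir: str) -> dict:
    """Return a copy of configs with every *PATH* string made absolute."""
    resolved = {}
    for section, settings in configs.items():
        resolved[section] = {}
        for key, value in settings.items():
            # Only string values under keys containing 'PATH' are rewritten;
            # thresholds and class counts pass through unchanged.
            if "PATH" in key and isinstance(value, str):
                value = os.path.abspath(os.path.join(base_dir, value))
            resolved[section][key] = value
    return resolved
```

Calling `resolve_paths(CONFIGS, os.path.dirname(os.path.abspath(__file__)))` would reproduce the base directory that `get_absolute_path` uses.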
diff --git a/datasets/.DS_Store b/datasets/.DS_Store
diff --git a/datasets/overview.png b/datasets/overview.png
diff --git a/datasets/test_sites/.DS_Store b/datasets/test_sites/.DS_Store
diff --git a/datasets/test_sites/accounts.g.cdcde.com/.DS_Store b/datasets/test_sites/accounts.g.cdcde.com/.DS_Store
diff --git a/datasets/test_sites/accounts.g.cdcde.com/html.txt b/datasets/test_sites/accounts.g.cdcde.com/html.txt
diff --git a/datasets/test_sites/accounts.g.cdcde.com/info.txt b/datasets/test_sites/accounts.g.cdcde.com/info.txt
@@ -0,0 +1 @@
+https://accounts.g.cdcde.com/ServiceLogin?passive=1209600&osid=1&continue=https://plus.g.cdcde.com/&followup=https://plus.g.cdcde.com/
diff --git a/datasets/test_sites/accounts.g.cdcde.com/shot.png b/datasets/test_sites/accounts.g.cdcde.com/shot.png