Skip to content

Latest commit

 

History

History
59 lines (46 loc) · 3.6 KB

paper-summaries.md

File metadata and controls

59 lines (46 loc) · 3.6 KB

Research Paper Summaries

Control Flow Graph Based Multiclass Malware Detection Using Bi-normal Separation (paper)

Published in Defence Science Journal, 2016 (27 citations)

Based on (this paper)

Assumptions: IDA Pro and unpacker (works well for metamorphic malware), size of execution profile is small

Algorithm

1) Disassembly of EXE
  • check if the test executable is packed (using PEiD software)
  • if so, unpack the executable using a suitable unpacker
2) CFG Generation
  • generate a graph (tree, to be specific) where the nodes represent a "basic" block of instructions (has a unique block ID)
  • a basic block is defined to have only one entry and exit point, i.e., there is no change of control in midst of them
  • edges represent flow of control
  • all possible paths starting from the root are traversed to get an extensive signature of the executable (this is done using a slightly modified version of DFS that avoids loops)
  • an execution trace of a particular path is a list of block IDs in the order of execution
  • all paths give rise to a huge list (execution profile) that concatenates the list of each path
  • the execution profile is then traversed to get all the opcodes present in each block
  • all the opcodes are further concatened into a list
3) Feature Extraction
  • 3-grams found to be best by experiments in aforementioned paper
  • Trigrams resulting from the opcode list are considered to be text document for that particular EXE (either all trigrams are considered or unique ones)
4) Feature Scoring
  • features are scored using TF, IDF, BNS
5) Feature Selection
  • features are selected using Information Gain or Chi-square Test

Then, Naive Bayes Classifier, SVM and Random Forests are used to classify the malware.

The fact that this paper considers every execution path provides a good intuition as to why is will work for every file, once it is unpacked

Example

  • Consider an EXE that is (unpacked and) decompiled to construct CFG involving basic blocks
  • Go through all possible execution paths of the CFG (avoid looping) and for each execution, append the order of basic blocks to a global list
  • From the list, replace each basic block with the opcodes in it
  • Go over the list to make trigrams of opcodes (Say we get ["jmp-nop-add", "nop-add-mul", "add-mul-addi"])
  • Score the trigrams using TF-BNS (Say scores are [2.4, 3.5, 0.3] respectively)
  • Select the best performing trigrams with IG or Chi-square Test (threshold = 2.0, therefore first 2 are selected)
  • (Training step) Train on a classifier like NB or SVC
  • Classify as benign or [class of malware]

Demerits

  • using a feature vector might result in loss of sequential data, as only trigrams are used (quite some sequential dependencies maybe higher than 3 opcodes)
  • execution profile too huge for large software
  • operands and loops not considered (loss of info)

Connected Papers

  1. SMASH: A Malware Detection Method Based on Multi-Feature Ensemble Learning (cited) [8 citations]
  2. Automatic malware classification and new malware detection using machine learning (cited) [38 citations]
  3. Malware Detection Method based on Control Flow Analysis (cited) [nil]

Detecting Self-mutating Malware Using Control-Flow Graph Matching (paper)