Research Paper Summaries

Control Flow Graph Based Multiclass Malware Detection Using Bi-normal Separation (paper)

Published in Defence Science Journal, 2016 (27 citations)

Based on (this paper)

Assumptions: IDA Pro and a suitable unpacker are available (works well for metamorphic malware); the execution profile is small

Algorithm

1) Disassembly of EXE

check if the test executable is packed (using PEiD software)
if so, unpack the executable using a suitable unpacker

2) CFG Generation

generate a graph (a tree, to be specific) in which each node represents a "basic" block of instructions with a unique block ID
a basic block has exactly one entry point and one exit point, i.e., control never changes in the middle of the block
edges represent flow of control
all possible paths starting from the root are traversed to get an extensive signature of the executable (this is done using a slightly modified version of DFS that avoids loops)
an execution trace of a particular path is a list of block IDs in the order of execution
concatenating the traces of all paths gives one huge list (the execution profile)
the execution profile is then traversed to get all the opcodes present in each block
all the opcodes are further concatenated into a single list
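The path enumeration and opcode concatenation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the CFG is assumed to be a plain adjacency dictionary, `enumerate_paths` and `opcode_list` are hypothetical helper names, and loops are avoided by never revisiting a block that is already on the current path.

```python
from typing import Dict, List


def enumerate_paths(cfg: Dict[int, List[int]], root: int) -> List[List[int]]:
    """Collect every loop-free path that starts at `root` (depth-first)."""
    paths: List[List[int]] = []

    def dfs(block: int, path: List[int]) -> None:
        path.append(block)
        # A successor already on the current path would close a loop: skip it.
        successors = [s for s in cfg.get(block, []) if s not in path]
        if not successors:
            paths.append(list(path))  # the path cannot be extended further
        for succ in successors:
            dfs(succ, path)
        path.pop()

    dfs(root, [])
    return paths


def opcode_list(cfg, root, block_opcodes):
    """Concatenate all path traces, then expand block IDs into opcodes."""
    execution_profile = [b for path in enumerate_paths(cfg, root) for b in path]
    return [op for b in execution_profile for op in block_opcodes[b]]


# Hypothetical 4-block CFG; block 3 jumps back to block 0 (a loop).
cfg = {0: [1, 2], 1: [3], 2: [3], 3: [0]}
block_opcodes = {0: ["push", "mov"], 1: ["cmp", "jz"], 2: ["add"], 3: ["ret"]}

paths = enumerate_paths(cfg, 0)
opcodes = opcode_list(cfg, 0, block_opcodes)
print(paths)
print(opcodes)
```

Because the number of loop-free paths can grow exponentially with the number of branches, the resulting list gets large quickly for real executables.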

3) Feature Extraction

3-grams were found to work best in the experiments of the aforementioned paper
The trigrams resulting from the opcode list are treated as the text document for that particular EXE (either all trigrams or only the unique ones are considered)
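A minimal sketch of the trigram step, assuming the opcode list is a Python list of mnemonic strings. The function name `opcode_trigrams` and the `unique` flag are illustrative and not taken from the paper.

```python
from collections import Counter


def opcode_trigrams(opcodes, unique=False):
    """Slide a window of three opcodes over the list and join each window."""
    grams = ["-".join(opcodes[i:i + 3]) for i in range(len(opcodes) - 2)]
    if unique:
        # dict.fromkeys drops duplicates while keeping first-occurrence order.
        return list(dict.fromkeys(grams))
    return grams


trigrams = opcode_trigrams(["jmp", "nop", "add", "mul", "addi"])
print(trigrams)

# Term frequencies of the trigrams, i.e. the "document" for one EXE.
term_frequency = Counter(opcode_trigrams(["add", "add", "add", "add"]))
print(term_frequency)
```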

4) Feature Scoring

features are scored using TF, IDF, BNS
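Bi-normal separation scores a feature by how far apart its true positive rate and false positive rate lie under the inverse standard normal CDF: BNS = |F^-1(tpr) - F^-1(fpr)|. The sketch below rests on several assumptions that the summary does not confirm: a one-vs-rest setup per malware class, TF-BNS taken as the product of term frequency and BNS (by analogy with TF-IDF), and a clipping constant of 0.0005. The function names are hypothetical.

```python
from statistics import NormalDist

_INV_CDF = NormalDist().inv_cdf  # inverse CDF of the standard normal


def bns(tp, fp, pos, neg, eps=0.0005):
    """tp/fp: positive/negative documents containing the feature;
    pos/neg: total positive/negative documents."""
    # Clip the rates away from 0 and 1, where the inverse CDF is unbounded.
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(_INV_CDF(tpr) - _INV_CDF(fpr))


def tf_bns(term_frequency, tp, fp, pos, neg):
    # Assumed combination: term frequency weighted by the BNS score.
    return term_frequency * bns(tp, fp, pos, neg)


# A trigram seen in 9 of 10 malware samples and 1 of 10 benign samples.
print(round(bns(9, 1, 10, 10), 3))
# A trigram that is equally common in both classes carries no signal.
print(bns(5, 5, 10, 10))
```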

5) Feature Selection

features are selected using Information Gain or Chi-square Test
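As an illustration of the selection step, the chi-square statistic for one feature against one class can be computed from a 2x2 contingency table. This is a sketch under assumptions: a binary (one-vs-rest) split, top-k selection instead of a fixed threshold, and hypothetical names `chi_square` and `select_features`.

```python
def chi_square(a, b, c, d):
    """a: positive docs with the feature, b: negative docs with it,
    c: positive docs without it, d: negative docs without it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: the feature or class never varies
    return n * (a * d - b * c) ** 2 / denom


def select_features(tables, k):
    """Keep the k features with the highest chi-square statistic."""
    ranked = sorted(tables, key=lambda f: chi_square(*tables[f]), reverse=True)
    return ranked[:k]


# Made-up counts for three trigrams over 10 malware and 10 benign samples.
tables = {
    "jmp-nop-add": (10, 0, 0, 10),
    "nop-add-mul": (8, 2, 2, 8),
    "add-mul-addi": (5, 5, 5, 5),
}
selected = select_features(tables, 2)
print(selected)
```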

Then, Naive Bayes Classifier, SVM and Random Forests are used to classify the malware.

The fact that this paper considers every execution path provides a good intuition as to why it should work for any file once it is unpacked.

Example

Consider an EXE that is (unpacked and) disassembled to construct a CFG of basic blocks
Go through all possible execution paths of the CFG (avoiding loops) and, for each path, append its sequence of basic blocks to a global list
In the list, replace each basic block with the opcodes it contains
Go over the list to make trigrams of opcodes (Say we get ["jmp-nop-add", "nop-add-mul", "add-mul-addi"])
Score the trigrams using TF-BNS (Say scores are [2.4, 3.5, 0.3] respectively)
Select the best performing trigrams with IG or Chi-square Test (threshold = 2.0, therefore first 2 are selected)
(Training step) Train a classifier such as NB or SVC
Classify as benign or [class of malware]

Demerits

using a feature vector might lose sequential information, as only trigrams are used (some sequential dependencies may span more than 3 opcodes)
the execution profile becomes too large for large software
operands and loops not considered (loss of info)

Connected Papers

SMASH: A Malware Detection Method Based on Multi-Feature Ensemble Learning (cited) [8 citations]
Automatic malware classification and new malware detection using machine learning (cited) [38 citations]
Malware Detection Method based on Control Flow Analysis (cited) [nil]
