Control Flow Graph Based Multiclass Malware Detection Using Bi-normal Separation (paper)
Published in Defence Science Journal, 2016 (27 citations)
Based on (this paper)
Assumptions: IDA Pro and unpacker (works well for metamorphic malware), size of execution profile is small
- check if the test executable is packed (using PEiD software)
- if so, unpack the executable using a suitable unpacker
- generate a graph (tree, to be specific) where the nodes represent a "basic" block of instructions (has a unique block ID)
- a basic block is defined to have only one entry and exit point, i.e., there is no change of control in midst of them
- edges represent flow of control
- all possible paths starting from the root are traversed to get an extensive signature of the executable (this is done using a slightly modified version of DFS that avoids loops)
- an execution trace of a particular path is a list of block IDs in the order of execution
- all paths give rise to a huge list (execution profile) that concatenates the list of each path
- the execution profile is then traversed to get all the opcodes present in each block
- all the opcodes are further concatened into a list
- 3-grams found to be best by experiments in aforementioned paper
- Trigrams resulting from the opcode list are considered to be text document for that particular EXE (either all trigrams are considered or unique ones)
- features are scored using TF, IDF, BNS
- features are selected using Information Gain or Chi-square Test
Then, Naive Bayes Classifier, SVM and Random Forests are used to classify the malware.
The fact that this paper considers every execution path provides a good intuition as to why is will work for every file, once it is unpacked
Example
- Consider an EXE that is (unpacked and) decompiled to construct CFG involving basic blocks
- Go through all possible execution paths of the CFG (avoid looping) and for each execution, append the order of basic blocks to a global list
- From the list, replace each basic block with the opcodes in it
- Go over the list to make trigrams of opcodes (Say we get ["jmp-nop-add", "nop-add-mul", "add-mul-addi"])
- Score the trigrams using TF-BNS (Say scores are [2.4, 3.5, 0.3] respectively)
- Select the best performing trigrams with IG or Chi-square Test (threshold = 2.0, therefore first 2 are selected)
- (Training step) Train on a classifier like NB or SVC
- Classify as benign or [class of malware]
Demerits
- using a feature vector might result in loss of sequential data, as only trigrams are used (quite some sequential dependencies maybe higher than 3 opcodes)
- execution profile too huge for large software
- operands and loops not considered (loss of info)
Connected Papers
- SMASH: A Malware Detection Method Based on Multi-Feature Ensemble Learning (cited) [8 citations]
- Automatic malware classification and new malware detection using machine learning (cited) [38 citations]
- Malware Detection Method based on Control Flow Analysis (cited) [nil]