A Hybrid Model to Detect Malicious Executables Mohammad M. Masud Latifur Khan Bhavani Thuraisingham Department of Computer Science The University of Texas at Dallas
Presentation Outline Overview Background Our approach Feature description Feature extraction Experiments Results Conclusion
Overview Goal Detecting Malicious Executables Contribution A new Model that combines Binary, Assembly, and Library Call features An efficient technique to retrieve Assembly features from Binary features A scalable solution to n-gram feature extraction Novelty Combining classical binary n-gram features with the features extracted through reverse-engineering
Malicious Executables: Background Programs that perform malicious activities, such as destroying data, stealing information, clogging networks, etc. They come in different architectures, such as Independent programs (e.g. worms) Dependent (piggybacked) on a host program (e.g. viruses) Propagation mechanisms Mobile: propagates automatically through networks (worms) Static: propagates when infected files are transferred (viruses)
Detecting Malicious Executables Traditional way: signature-based detection Problems: Requires human intervention Not effective against “zero-day attacks”, because signature generation is too slow Requirements Fast detection No human intervention (automatic) Recent techniques Signature auto-generation (Earlybird, Autograph, Polygraph) Data Mining based (Stolfo et al., Maloof et al.)
Our Approach Design goals: to obtain a solution that Is free of signatures Requires no human intervention Can detect new variants and / or zero day attacks Our “Hybrid Feature Retrieval” (HFR) model Is Based on Data Mining Meets all three design goals Steps Collection of Training Data (malicious & benign.exe) Feature Extraction & Selection Training with classifier Testing and detection
Top-level Architecture (diagram) Training: Training Data (Executables) → Feature Extraction → Feature Selection → Training (SVM) → classifier. Testing: New Executable → Feature Extraction → Testing (SVM) → Infected? Yes: Delete; No: Keep.
Features Binary n-grams Assembly instruction sequences (corresponding to the binary n-grams) DLL function calls
Binary n-gram Features Each binary executable is a 'string' of bytes An n-gram of the binary is a sequence of n consecutive bytes Example A string of four bytes: "ab05ef23" (in hexadecimal) 1-grams: "ab", "05", "ef", "23" (single bytes) 2-grams: "ab05", "05ef", "ef23" (2-byte sequences) 3-grams: "ab05ef", "05ef23" (3-byte sequences)
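The example above can be reproduced with a short sketch. The function name `byte_ngrams` is an assumption made for illustration and does not come from the slides.

```python
def byte_ngrams(data: bytes, n: int) -> list[str]:
    """Return every run of n consecutive bytes in data, hex-encoded."""
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]


# The four-byte string from the slide.
sample = bytes.fromhex("ab05ef23")
for n in (1, 2, 3):
    print(n, byte_ngrams(sample, n))
```

A string of length L yields L - n + 1 n-grams, so the four-byte example gives four 1-grams, three 2-grams, and two 3-grams.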
Binary n-gram Feature Extraction Each binary executable is scanned Each extracted n-gram is stored in a balanced binary search tree to avoid duplicates Each n-gram's frequency of occurrence in the training data is also stored in the tree
Binary n-gram Feature Extraction (contd...) Using an AVL tree (a balanced binary search tree), we ensure fast insertion and searching. Using disk I/O, we overcome memory limitations. Example: the executables being scanned are 1: “abcdef”, 2: “93abcd”, 3: “dc0ef2”, 4: “0ef7gh”. At the current scan position, the AVL tree storing 2-grams and their frequencies holds 93ab,1; abcd,2; cdef,1; dc0e,1.
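The counting step can be sketched as follows. This is not the authors' implementation: it swaps the AVL tree for Python's hash-based `Counter`, keeps everything in memory (no disk I/O), and assumes that "frequency" means the total number of occurrences. Only the first three toy executables are used, since "gh" in the fourth is not valid hexadecimal.

```python
from collections import Counter


def count_ngrams(executables: list[bytes], n: int) -> Counter:
    """Count how often each n-gram occurs across all executables."""
    counts: Counter = Counter()
    for data in executables:
        for i in range(len(data) - n + 1):
            counts[data[i:i + n].hex()] += 1
    return counts


# First three toy executables from the slide, given as hex strings.
training = [bytes.fromhex(h) for h in ("abcdef", "93abcd", "dc0ef2")]
counts = count_ngrams(training, 2)
for gram in sorted(counts):  # sorted() mimics an in-order tree traversal
    print(gram, counts[gram])
```

A balanced tree keeps the n-grams in sorted order, which suits merging partial results on disk; a hash map is simpler when everything fits in memory.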
Feature Selection Motivation: The total number of extracted n-grams may be very large (on the order of millions). A classifier can't be trained with so many features. We select the K best n-grams using the Information Gain criterion. The Information Gain of a binary attribute A on a collection of examples S is given by $\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$, where Values(A) is the set of all possible values for attribute A and S_v is the subset of S for which attribute A has value v. The selected binary features are called the “Binary Feature Set” or BFS
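As a rough illustration, information gain for a binary attribute (an n-gram that is present or absent) can be computed as the entropy of S minus the size-weighted entropy of the two subsets. The function names and the toy inputs below are assumptions for illustration, not part of the original work.

```python
import math


def entropy(pos: int, neg: int) -> float:
    """Shannon entropy (bits) of a set with pos/neg class counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(present: list[bool], malicious: list[bool]) -> float:
    """Gain(S, A) for a binary attribute A (n-gram present or absent)."""
    total = len(malicious)
    gain = entropy(sum(malicious), total - sum(malicious))
    for value in (True, False):
        subset = [m for a, m in zip(present, malicious) if a == value]
        if subset:
            pos = sum(subset)
            gain -= len(subset) / total * entropy(pos, len(subset) - pos)
    return gain


# Toy data: four executables, the first two malicious.
labels = [True, True, False, False]
print(information_gain([True, True, False, False], labels))  # separates the classes
print(information_gain([True, False, True, False], labels))  # carries no information
```

Selecting the K best n-grams then amounts to sorting all candidates by this score and keeping the top K.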
Assembly Features An assembly feature is a sequence of assembly instructions. We call these features “Derived Assembly Features” or DAF. Every DAF corresponds to a selected binary n-gram. Motivation for extracting DAF: an n-gram may contain only partial information; a DAF contains more complete information
Assembly Feature Extraction Disassemble all executables. For each selected binary n-gram Q do: S ← all assembly instruction sequences in the disassembled executables corresponding to Q; DAF_Q ← best assembly instruction sequence in S according to information gain
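The selection step of this loop can be sketched as below. The disassembly step and the mapping from an n-gram Q to its candidate sequences S are not shown; the instruction strings, presence vectors, and the names `gain` and `select_daf` are all hypothetical and chosen only for illustration.

```python
import math


def entropy(labels: list[int]) -> float:
    """Shannon entropy (bits) of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def gain(present: list[int], labels: list[int]) -> float:
    """Information gain of a 0/1 attribute; labels must be non-empty."""
    result = entropy(labels)
    for value in (0, 1):
        subset = [y for a, y in zip(present, labels) if a == value]
        result -= len(subset) / len(labels) * entropy(subset)
    return result


def select_daf(candidates: dict[str, list[int]], labels: list[int]) -> str:
    """Return the candidate sequence whose presence vector has the highest gain."""
    return max(candidates, key=lambda seq: gain(candidates[seq], labels))


# Hypothetical labels for four executables: 1 = malicious, 0 = benign.
labels = [1, 1, 0, 0]
# Hypothetical candidate sequences S for one n-gram Q, each mapped to a
# presence vector over the same four executables.
candidates = {
    "push eax; call ebx": [1, 1, 0, 0],
    "mov eax, ebx; push eax": [1, 0, 1, 0],
    "call ebx; ret": [1, 1, 1, 0],
}
print(select_daf(candidates, labels))
```

In this toy data the first sequence appears in exactly the malicious executables, so it has the highest gain and would be chosen as DAF_Q.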
Assembly Feature Extraction (Contd...) Example: Let “ ” be a selected 4-gram (Q) Following Assembly instruction sequences (S) corresponding to Q are found in the disassembled executables: DAF Q is selected from these sequences using information gain DAF Q
DLL function call features DLL function call features are the names of system functions called from the executables. Example: call GetProcAddress(). These features are extracted from the executable header. We extract all the DLL call features from the training data and select a subset using information gain
Combining features Each feature is considered as a 'binary' feature We create a vector V of all selected features, where V[i] corresponds to the i-th feature This vector is called the Hybrid Feature Set (HFS) For each executable E in the training data, we create a binary feature vector B corresponding to V, where B[i] is 1 if V[i] is present in E B[i] is 0 if V[i] is absent in E We train a classifier using these vectors
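A minimal sketch of building the binary vector B for one executable follows. The feature strings, their "bin:"/"asm:"/"dll:" prefixes, and the function name `to_binary_vector` are hypothetical, introduced only to make the example concrete.

```python
def to_binary_vector(feature_set: list[str], present: set[str]) -> list[int]:
    """B[i] is 1 if feature_set[i] occurs in the executable, else 0."""
    return [1 if feature in present else 0 for feature in feature_set]


# Hypothetical hybrid feature set V mixing the three feature kinds.
V = ["bin:ab05", "asm:push eax; call ebx", "dll:GetProcAddress"]
# Hypothetical features found in one executable E.
features_in_E = {"bin:ab05", "dll:GetProcAddress", "dll:LoadLibraryA"}
B = to_binary_vector(V, features_in_E)
print(B)  # [1, 0, 1]
```

Features found in E but not selected into V (here "dll:LoadLibraryA") are ignored, so every executable maps to a vector of the same length.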
Experiments Collect real samples of malicious and normal executables. Extract and select features. Combine the features into HFS. We also extract Assembly n-gram features (sequences of n assembly instructions), called the Assembly Feature Set (AFS). Test the accuracy of each of the three kinds of feature sets (BFS, AFS, HFS) using SVM with three-fold cross validation
Data Set There are two datasets, with the following distribution: Malicious instances are collected from Benign instances are collected from Windows XP machines and other sources
Experimental Setup OS & H/W Platform: Sun Solaris & Linux Machines: 2 GHz, 4 GB Disassembler: PEdisassem Disassembles Windows Portable Executables Available from Feature extraction implemented in Java, JDK 1.5 K = 500 (number of binary n-grams selected) Support Vector Machine Tool: libsvm SVM parameters: C-SVC, with polynomial kernel
Results HFS: Hybrid Feature Set, has the highest accuracy (best values are circled) AFS: Assembly Feature Set BFS: Binary Feature Set DLL features are not shown because DLL n-gram features perform poorly for n > 1, so we use only DLL 1-grams in HFS
Results (Contd...) HFS: Hybrid Feature Set – has the lowest False Positive & False Negative AFS: Assembly Feature Set BFS: Binary Feature Set
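For reference, false positive and false negative rates can be computed as sketched below, assuming the conventional definitions (false positives over all benign instances, false negatives over all malicious instances). The labels and predictions are made-up toy values, not results from this work.

```python
def error_rates(actual: list[int], predicted: list[int]) -> tuple[float, float]:
    """Return (false positive rate, false negative rate); 1 = malicious."""
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    negatives = actual.count(0)
    positives = actual.count(1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


# Toy example: three malicious and five benign executables.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]
fpr, fnr = error_rates(actual, predicted)
print(f"FPR={fpr:.2f} FNR={fnr:.2f}")
```

Here one benign file out of five is flagged and one malicious file out of three is missed.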
Results (Contd...) Receiver Operating Characteristic (ROC) curves. HFS has the best ROC curve (better curve => greater area under the curve)
Results (Contd...) HFS has the greatest Area Under the Curve
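The area under the ROC curve can be computed without plotting the curve, using its equivalence to the probability that a randomly chosen malicious instance scores higher than a randomly chosen benign one. The sketch below uses that pairwise formulation; the scores are made-up toy values and the function name `auc` is an assumption.

```python
def auc(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve via pairwise comparison (1 = malicious)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy classifier scores; higher means "more likely malicious".
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```

The pairwise form is quadratic in the number of instances, which is fine for a sketch; a rank-based computation would scale better on large test sets.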
Conclusion The Hybrid Feature Retrieval (HFR) model retrieves a novel combination of three different kinds of features. We have implemented an efficient, scalable solution to n-gram feature extraction in general. In our experiments, HFS achieves higher accuracy, lower false positive and false negative rates, and a greater area under the ROC curve than BFS and AFS. Future work: Handle obfuscation; Operate online, in real time
Thank you