1
A Hybrid Model to Detect Malicious Executables
Mohammad M. Masud, Latifur Khan, Bhavani Thuraisingham
Department of Computer Science, The University of Texas at Dallas
2
Presentation Outline
- Overview
- Background
- Our approach
- Feature description
- Feature extraction
- Experiments
- Results
- Conclusion
3
Overview
- Goal: detecting malicious executables
- Contribution:
  - A new model that combines binary, assembly, and library call features
  - An efficient technique to retrieve assembly features from binary features
  - A scalable solution to n-gram feature extraction
- Novelty: combining classical binary n-gram features with features extracted through reverse engineering
4
Malicious Executables: Background
- Programs that perform malicious activities, such as destroying data, stealing information, clogging networks, etc.
- Come in different architectures:
  - Independent programs (e.g., worms)
  - Dependent (piggybacked) on a host program (e.g., viruses)
- Propagation mechanisms:
  - Mobile: propagates automatically through networks (worms)
  - Static: propagates when infected files are transferred (viruses)
5
Detecting Malicious Executables
- Traditional way: signature-based detection
- Problems:
  - Requires human intervention
  - Not effective against "zero-day" attacks, because it is too slow
- Requirements:
  - Fast detection
  - No human intervention (automatic)
- Recent techniques:
  - Signature auto-generation (Earlybird, Autograph, Polygraph)
  - Data mining based (Stolfo et al., Maloof et al.)
6
Our Approach
- Design goals: obtain a solution that
  - Is free of signatures
  - Requires no human intervention
  - Can detect new variants and/or zero-day attacks
- Our "Hybrid Feature Retrieval" (HFR) model
  - Is based on data mining
  - Meets all three design goals
- Steps:
  - Collection of training data (malicious and benign executables)
  - Feature extraction and selection
  - Training a classifier
  - Testing and detection
7
Top-level Architecture
[Diagram: in the training phase, the training executables (training data) pass through feature extraction and feature selection, and an SVM classifier is trained on the selected features. In the testing phase, a new executable passes through feature extraction and is tested with the trained SVM classifier; if it is classified as infected it is deleted, otherwise it is kept.]
8
Features
- Binary n-grams
- Assembly instruction sequences (corresponding to the binary n-grams)
- DLL function calls
9
Binary n-gram Features
- Each binary executable is a 'string' of bytes
- An n-gram of the binary is a sequence of n consecutive bytes
- Example: a string of four bytes, "ab05ef23" (in hexadecimal)
  - 1-grams: "ab", "05", "ef", "23" (single bytes)
  - 2-grams: "ab05", "05ef", "ef23" (2-byte sequences)
  - 3-grams: "ab05ef", "05ef23" (3-byte sequences)
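A minimal Java sketch of this idea (not the authors' implementation): it enumerates the n-grams of a byte array, encoding each one as a hexadecimal string as in the example above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class BinaryNGrams {

    // Collect the distinct n-grams of a byte string, each rendered in hex.
    static Set<String> byteNGrams(byte[] data, int n) {
        Set<String> grams = new LinkedHashSet<String>();
        for (int i = 0; i + n <= data.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < n; j++) {
                sb.append(String.format("%02x", data[i + j] & 0xff));
            }
            grams.add(sb.toString());
        }
        return grams;
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xab, 0x05, (byte) 0xef, 0x23};
        System.out.println(byteNGrams(bytes, 2));   // prints [ab05, 05ef, ef23]
    }
}
```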
10
Binary n-gram Feature Extraction
- Each binary executable is scanned
- Each extracted n-gram is stored in a balanced binary search tree to avoid duplicates
- Each n-gram's frequency of occurrence in the training data is also stored in the tree
11
Binary n-gram Feature Extraction (contd...)
- Using an AVL tree (a balanced binary search tree), we ensure fast insertion and searching
- Using disk I/O, we overcome memory limitations
- Example: executables being scanned: 1: "abcdef", 2: "93abcd", 3: "dc0ef2", 4: "0ef7gh" (current scan position shown in the original figure)
- AVL tree storing 2-grams and their frequencies: (93ab, 1), (abcd, 2), (cdef, 1), (dc0e, 1)
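The deck describes a custom AVL tree with disk I/O; as a simplified illustration of the same bookkeeping (duplicate-free storage with logarithmic insert and search, plus a per-n-gram frequency), a sketch using Java's TreeMap, which is also a balanced binary search tree, might look like the following. The disk-based part is omitted.

```java
import java.util.Map;
import java.util.TreeMap;

public class NGramIndex {
    // TreeMap is a red-black tree: sorted, duplicate-free keys,
    // O(log n) insertion and lookup, as with the AVL tree in the deck.
    private final Map<String, Integer> freq = new TreeMap<String, Integer>();

    // Insert the n-gram once; on repeat sightings just bump its frequency.
    void add(String gram) {
        Integer c = freq.get(gram);
        freq.put(gram, c == null ? 1 : c + 1);
    }

    int frequency(String gram) {
        Integer c = freq.get(gram);
        return c == null ? 0 : c;
    }
}
```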
12
Feature Selection
- Motivation:
  - The total number of extracted n-grams may be very large (on the order of millions)
  - A classifier can't be trained with so many features
- We select the K best n-grams using the information gain criterion
- The information gain of a binary attribute A on a collection of examples S is given by
  Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
  - Values(A): the set of all possible values of attribute A
  - S_v: the subset of S for which attribute A has value v
- The selected binary features are called the "Binary Feature Set" (BFS)
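As a concrete reading of the formula above, here is a small Java sketch that computes the information gain of one binary feature (present/absent) over a two-class training set (malicious/benign); the K features with the largest gain would then be kept. Class and method names are illustrative, not from the paper.

```java
public class InfoGain {

    // Shannon entropy of a two-class split, in bits.
    static double entropy(double pos, double neg) {
        double total = pos + neg;
        if (total == 0) return 0.0;
        double h = 0.0;
        for (double p : new double[] {pos / total, neg / total}) {
            if (p > 0) h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // posWith/negWith: malicious/benign executables containing the feature;
    // posWithout/negWithout: those that do not contain it.
    static double gain(int posWith, int negWith, int posWithout, int negWithout) {
        double total = posWith + negWith + posWithout + negWithout;
        double hS = entropy(posWith + posWithout, negWith + negWithout);
        double hWith = entropy(posWith, negWith);
        double hWithout = entropy(posWithout, negWithout);
        return hS - ((posWith + negWith) / total) * hWith
                  - ((posWithout + negWithout) / total) * hWithout;
    }
}
```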
13
Assembly Features
- An assembly feature is a sequence of assembly instructions
- We call these features "Derived Assembly Features" (DAF)
- Every DAF corresponds to a selected binary n-gram
- Motivation for extracting DAF:
  - An n-gram may contain only partial information
  - A DAF contains more complete information
14
Assembly Feature Extraction
- Disassemble all executables
- For each selected binary n-gram Q:
  - S <- all assembly instruction sequences in the disassembled executables corresponding to Q
  - DAF(Q) <- the best assembly instruction sequence in S according to information gain
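A sketch of the selection step for one n-gram Q, assuming the candidate assembly sequences have already been gathered from the disassembled executables and that an information-gain scorer (as in the earlier sketch) is available; the names here are hypothetical, not from the paper.

```java
import java.util.List;

public class DafSelector {

    // Assumed hook that scores a candidate sequence over the training set,
    // e.g. by treating its presence/absence as a binary attribute.
    interface Scorer {
        double infoGain(String assemblySequence);
    }

    // Pick the candidate sequence with the highest information gain;
    // the winner becomes the Derived Assembly Feature (DAF) for this n-gram.
    static String bestSequence(List<String> candidates, Scorer scorer) {
        String best = null;
        double bestGain = Double.NEGATIVE_INFINITY;
        for (String seq : candidates) {
            double g = scorer.infoGain(seq);
            if (g > bestGain) {
                bestGain = g;
                best = seq;
            }
        }
        return best;
    }
}
```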
15
Assembly Feature Extraction (Contd...)
- Example: let "00005068" be a selected 4-gram (Q)
- Several assembly instruction sequences (S) corresponding to Q are found in the disassembled executables (listed in the original slide)
- DAF(Q) is selected from these sequences using information gain
16
DLL Function Call Features
- DLL function call features are the names of system functions called from the executables, e.g. call getProcAddress()
- These features are extracted from the executable header
- We extract all DLL call features from the training data and select a subset using information gain
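A hedged illustration of the bookkeeping for these features, assuming the imported function names have already been recovered from each executable's header (for example by the disassembler): it counts how many training executables contain each DLL call, after which the calls can be ranked by information gain like any other feature.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class DllCallFeatures {

    // importsPerExe: one set of imported function names per training executable,
    // assumed to have been parsed from the header beforehand.
    static Map<String, Integer> documentFrequency(Iterable<Set<String>> importsPerExe) {
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (Set<String> imports : importsPerExe) {
            for (String call : imports) {            // e.g. "getProcAddress"
                Integer c = df.get(call);
                df.put(call, c == null ? 1 : c + 1); // count executables containing it
            }
        }
        return df;
    }
}
```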
17
Combining Features
- Each feature is treated as a 'binary' (present/absent) feature
- We create a vector V of all selected features, where V[i] corresponds to the i-th feature; this vector is called the Hybrid Feature Set (HFS)
- For each executable E in the training data, we create a binary feature vector B corresponding to V, where
  - B[i] = 1 if V[i] is present in E
  - B[i] = 0 if V[i] is absent in E
- We train a classifier using these vectors
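A small sketch of this vectorization step: given the HFS as an ordered list V and the set of features observed in one executable, it produces the 0/1 vector B described above.

```java
import java.util.List;
import java.util.Set;

public class FeatureVector {

    // B[i] = 1 if feature V[i] occurs in the executable, else 0.
    static int[] toVector(List<String> hfs, Set<String> featuresInExecutable) {
        int[] b = new int[hfs.size()];
        for (int i = 0; i < hfs.size(); i++) {
            b[i] = featuresInExecutable.contains(hfs.get(i)) ? 1 : 0;
        }
        return b;
    }
}
```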
18
Experiments
- Collect real samples of malicious and benign executables
- Extract and select features; combine the features into the HFS
- We also extract assembly n-gram features (sequences of n assembly instructions), called the Assembly Feature Set (AFS)
- Test the accuracy of each of the three feature sets (BFS, AFS, HFS) using SVM with three-fold cross-validation
19
Data Set
- There are two datasets, with the distribution shown in the original slide
- Malicious instances are collected from http://vx.netlux.org/
- Benign instances are collected from Windows XP machines and other sources
20
Experimental Setup
- OS & H/W platform: Sun Solaris & Linux machines, 2 GHz, 4 GB
- Disassembler: PEdisassem (disassembles Windows Portable Executables), available from http://www.geocities.com/~sangcho/
- Feature extraction implemented in Java, JDK 1.5
- K = 500 (number of binary n-grams selected)
- Support Vector Machine tool: libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
- SVM parameters: C-SVC with a polynomial kernel
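The slide names libsvm with C-SVC and a polynomial kernel; a hedged sketch of driving libsvm's Java API with those settings might look like the following. The degree, gamma, and C values are illustrative placeholders, not the parameters reported in the paper.

```java
import libsvm.*;

public class TrainHfsSvm {

    // vectors: one 0/1 HFS feature vector per training executable;
    // labels: +1 for malicious, -1 for benign.
    static svm_model train(int[][] vectors, double[] labels) {
        svm_problem prob = new svm_problem();
        prob.l = vectors.length;
        prob.y = labels;
        prob.x = new svm_node[prob.l][];
        for (int i = 0; i < prob.l; i++) {
            svm_node[] row = new svm_node[vectors[i].length];
            for (int j = 0; j < vectors[i].length; j++) {
                row[j] = new svm_node();
                row[j].index = j + 1;          // libsvm feature indices start at 1
                row[j].value = vectors[i][j];
            }
            prob.x[i] = row;
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;    // C-SVC, as on the slide
        param.kernel_type = svm_parameter.POLY;  // polynomial kernel, as on the slide
        param.degree = 3;                        // illustrative
        param.gamma = 1.0 / 500;                 // illustrative (1 / number of features)
        param.coef0 = 0;
        param.C = 1;                             // illustrative
        param.cache_size = 100;
        param.eps = 1e-3;
        return svm.svm_train(prob, param);
    }
}
```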
21
Results
- HFS (Hybrid Feature Set) has the highest accuracy in the results table (best values are circled in the original slide)
- AFS: Assembly Feature Set
- BFS: Binary Feature Set
- DLL features are not shown because DLL n-gram features perform poorly for n > 1, so we use only DLL 1-grams in the HFS
22
Results (Contd...)
- HFS (Hybrid Feature Set) has the lowest false positive and false negative rates
- AFS: Assembly Feature Set
- BFS: Binary Feature Set
23
Results (Contd...)
- Receiver Operating Characteristic (ROC) curves
- HFS has the best ROC curve (a better curve means a greater area under the curve)
24
Results (Contd...) HFS has the greatest Area Under the Curve
25
Conclusion
- The Hybrid Feature Retrieval (HFR) model retrieves a novel combination of three different kinds of features
- We have implemented an efficient, scalable solution to n-gram feature extraction in general
- Our results are better than those of other techniques
- Future work:
  - Handle obfuscation
  - Operate online, in real time
26
Thank you