1
Data Mining for Security Applications: Detecting Malicious Executables Mr. Mehedy M. Masud (PhD Student) Prof. Latifur Khan Prof. Bhavani Thuraisingham Department of Computer Science The University of Texas at Dallas
2
Outline and Acknowledgement ● Vision for Assured Information Sharing ● Handling Different Trust levels ● Defensive Operations between Untrustworthy Partners – Detecting Malicious Executables using Data Mining ● Research Funded by Air Force Office of Scientific Research and Texas Enterprise Funds
3
Vision: Assured Information Sharing
[Diagram: Agencies A, B, and C each publish their data/policy component to a shared coalition data/policy.]
1. Trustworthy Partners 2. Semi-Trustworthy Partners 3. Untrustworthy Partners 4. Dynamic Trust
4
Our Approach ● Integrate the Medicaid claims data and mine the data; then enforce policies and determine how much information is lost by enforcing them – Prof. Khan, Dr. Awad (Postdoc) and Student Workers (MS students) ● Apply game theory and probing techniques to extract information from semi-trustworthy partners – Prof. Murat Kantarcioglu and Ryan Layfield (PhD Student) ● Data Mining for Defensive and Offensive Operations – e.g., malicious code detection, honeypots – Prof. Latifur Khan and Mehedy Masud ● Dynamic Trust Levels, Peer-to-Peer Communication – Prof. Kevin Hamlen and Nathalie Tsybulnik (PhD Student)
5
Introduction: Detecting Malicious Executables using Data Mining
● What are malicious executables?
– Programs that harm computer systems: viruses, exploits, denial of service (DoS), flooders, sniffers, spoofers, Trojans, etc.
– Exploit a software vulnerability on a victim
– May remotely infect other victims
– Incur great loss; for example, the Code Red epidemic cost $2.6 billion
● Malicious code detection: traditional approach
– Signature based
– Requires signatures to be generated by human experts
– So it is not effective against “zero-day” attacks
6
State of the Art: Automated Detection
● Automated detection approaches:
● Behavioural: analyse behaviours such as source and destination addresses, attachment types, statistical anomalies, etc.
● Content-based: analyse the content of the malicious executable
– Autograph (H.-A. Kim, CMU): based on an automated signature generation process
– N-gram analysis (Kolter, J. Z. and Maloof, M. A.): based on mining features and using machine learning
7
New Ideas ✗ Content-based approaches consider only machine code (byte code). ✗ Is it possible to consider higher-level source code for malicious code detection? ✗ Yes: disassemble the binary executable and retrieve the assembly program ✗ Extract important features from the assembly program ✗ Combine these with the machine-code features
8
Feature Extraction ✗ Binary n-gram features – Sequence of n consecutive bytes of binary executable ✗ Assembly n-gram features – Sequence of n consecutive assembly instructions ✗ System API call features – DLL function call information
9
The Hybrid Feature Retrieval Model ● Collect training samples of normal and malicious executables. ● Extract features ● Train a Classifier and build a model ● Test the model against test samples
10
Hybrid Feature Retrieval (HFR) ● Training
11
Hybrid Feature Retrieval (HFR) ● Testing
12
Feature Extraction: Binary n-gram features
– Features are extracted from the byte code in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
– Example: given an 11-byte sequence 0123456789abcdef012345, the 2-grams (2-byte sequences) are 0123, 2345, 4567, 6789, 89ab, abcd, cdef, ef01, 0123, 2345; the 4-grams (4-byte sequences) are 01234567, 23456789, 456789ab, ..., ef012345; and so on.
– Problem: large dataset, too many features (millions!)
– Solution: use secondary memory and efficient data structures; apply feature selection
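As a concrete illustration (not the authors' implementation, which relies on secondary memory and specialised data structures to cope with millions of features), a minimal Python sketch of byte n-gram extraction might look like this:

```python
from collections import Counter

def binary_ngrams(path, n=2):
    """Count the n-byte sequences (as hex strings) appearing in a file."""
    with open(path, "rb") as f:
        data = f.read()
    return Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))

# The 11-byte example from this slide:
sample = bytes.fromhex("0123456789abcdef012345")
print([sample[i:i + 2].hex() for i in range(len(sample) - 1)])
# -> ['0123', '2345', '4567', '6789', '89ab', 'abcd', 'cdef', 'ef01', '0123', '2345']
```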
13
Feature Extraction: Assembly n-gram features
– Features are extracted from the assembly programs in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
– Example: for the three instructions “push eax”; “mov eax, dword[0f34]”; “add ecx, eax”; the 2-grams are (1) “push eax”; “mov eax, dword[0f34]” and (2) “mov eax, dword[0f34]”; “add ecx, eax”
– Problem: same problem as with binary n-grams
– Solution: same solution
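The same windowing idea applies to the disassembled program; a short sketch (the input list is hypothetical, in practice it would come from the disassembler's output) is:

```python
def assembly_ngrams(instructions, n=2):
    """Return the n-instruction sequences in a disassembled program."""
    return [tuple(instructions[i:i + n]) for i in range(len(instructions) - n + 1)]

instrs = ["push eax", "mov eax, dword[0f34]", "add ecx, eax"]   # the slide's example
print(assembly_ngrams(instrs, 2))
# -> [('push eax', 'mov eax, dword[0f34]'), ('mov eax, dword[0f34]', 'add ecx, eax')]
```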
14
Feature Selection
● Select the best K features
● Selection criterion: Information Gain
● The gain of an attribute A on a collection of examples S is given by Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
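A small sketch of how that criterion could be computed for binary (present/absent) n-gram features follows; the function and variable names are illustrative, not taken from the authors' code.

```python
import math

def entropy(labels):
    """Entropy(S) = -sum_c p_c * log2(p_c)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(feature_values, labels):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    gain, n = entropy(labels), len(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy example: a feature present in 2 of the 3 malicious samples and absent elsewhere.
print(info_gain([1, 1, 0, 0, 0, 0], ["mal", "mal", "mal", "ben", "ben", "ben"]))
```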
15
Experiments
● Dataset
– Dataset 1: 838 malicious and 597 benign executables
– Dataset 2: 1082 malicious and 1370 benign executables
– Malicious code collected from VX Heavens (http://vx.netlux.org)
● Disassembly
– Pedisassem (http://www.geocities.com/~sangcho/index.html)
● Training, Testing
– Support Vector Machine (SVM)
– C-Support Vector Classifier with an RBF kernel
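For readers who want to reproduce this step, a hedged sketch of training and testing a C-SVC with an RBF kernel on the selected feature vectors is shown below, using scikit-learn (which is assumed here and is not necessarily the SVM package the authors used). The data is a random toy stand-in for the real feature matrix.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy stand-in: one row per executable, one 0/1 column per selected n-gram
# feature; label 1 = malicious, 0 = benign.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 500)).astype(float)
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(C=1.0, kernel="rbf", gamma="scale")   # C-SVC with an RBF kernel
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```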
16
Results ● HFS = Hybrid Feature Set ● BFS = Binary Feature Set ● AFS = Assembly Feature Set
17
Results ● HFS = Hybrid Feature Set ● BFS = Binary Feature Set ● AFS = Assembly Feature Set
18
Results ● HFS = Hybrid Feature Set ● BFS = Binary Feature Set ● AFS = Assembly Feature Set
19
Future Plans ● System calls: – seem to be very useful – need to consider the frequency of calls – call sequence patterns (following program paths) – actions immediately preceding or following a call ● Detect malicious code by program slicing – requires further analysis
20
Data Mining to Detect Buffer Overflow Attack Mohammad M. Masud, Latifur Khan, Bhavani Thuraisingham Department of Computer Science The University of Texas at Dallas
21
Introduction ● Goal – Intrusion detection, e.g. worm attacks and buffer overflow attacks ● Main Contributions – “Worm” code detection by data mining coupled with reverse engineering – Buffer overflow detection by combining data mining with static analysis of assembly code
22
Background ● What is a buffer overflow? – A situation in which a fixed-size buffer is overrun by a larger input ● How does it happen? – Example: ... char buff[100]; gets(buff); ... [Diagram: the input string is copied into buff on the stack.]
23
Background (cont...) ● Then what? [Diagram: gets(buff) writes past the end of buff on the stack, overwriting the saved return address; the new return address points to the attacker's code placed in the buffer.]
24
Background (cont...) ● So what? – The program may crash, or – the attacker can execute arbitrary code ● The attacker can then – execute any system function – communicate with some host, download “worm” code and install it – open a backdoor to take full control of the victim ● How do we stop it?
25
Background (cont...) ● Stopping buffer overflow – Preventive approaches – Detection approaches ● Preventive approaches – Finding bugs in source code. Problem: can only work when source code is available. – Compiler extension. Same problem. – OS/HW modification ● Detection approaches – Capture code running symptoms. Problem: may require long running time. – Automatically generating signatures of buffer overflow attacks.
26
CodeBlocker (Our approach) ● A detection approach ● Based on the observation: – attack messages usually contain code while normal messages contain data ● Main idea – Check whether a message contains code ● Problem to solve: – Distinguishing code from data
27
● Statistics to support this observation:
(a) On Windows platforms – most web servers (port 80) accept data only; – remote access services (ports 111, 137, 138, 139) accept data only; – Microsoft SQL Servers (port 1434) accept data only; – workstation services (ports 139 and 445) accept data only.
(b) On Linux platforms – most Apache web servers (port 80) accept data only; – BIND (port 53) accepts data only; – SNMP (port 161) accepts data only; – most mail transport (port 25) accepts data only; – database servers (Oracle, MySQL, PostgreSQL) at ports 1521, 3306 and 5432 accept data only.
28
Severity of the problem ● It is not easy to determine the actual instruction sequence encoded in a given string of bits
29
Our solution ● Apply data mining ● Formulate the problem as a classification problem (code vs. data) ● Collect a set of training examples containing both classes ● Train a classifier on the data with a machine learning algorithm to obtain a model ● Test the model against new messages
30
CodeBlocker Model
31
Feature Extraction
32
Disassembly ● We apply the SigFree tool – implemented by Xinran Wang et al. (Penn State)
33
Feature extraction ● Features are extracted using – n-gram analysis – control flow analysis ● N-gram analysis [Figure: assembly program and its corresponding instruction flow graph (IFG)] – What is an n-gram? A sequence of n instructions – Traditional approach: the flow of control is ignored – The 2-grams are: 02, 24, 46, ..., CE
34
Feature extraction (cont...) ● Control-flow based n-gram analysis [Figure: assembly program and its corresponding instruction flow graph (IFG)] – What is an n-gram? A sequence of n instructions – Proposed control-flow based approach: the flow of control is considered – The 2-grams are: 02, 24, 46, ..., CE, E6
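To make the difference from the traditional approach concrete, here is a minimal sketch (the graph below is a made-up stand-in for the slide's IFG) that enumerates 2-grams by walking control-flow edges instead of consecutive byte positions:

```python
def flow_ngrams(successors, n=2):
    """Enumerate n-node paths by following the edges of an instruction flow graph."""
    found = set()

    def walk(node, path):
        path = path + (node,)
        if len(path) == n:
            found.add(path)
            return
        for nxt in successors.get(node, ()):
            walk(nxt, path)

    for start in successors:
        walk(start, ())
    return found

# Hypothetical IFG: a jump adds the edge 0xE -> 0x6, so the 2-gram (E, 6) is
# reported even though instruction 6 does not follow E in the byte stream.
ifg = {0x0: [0x2], 0x2: [0x4], 0x4: [0x6], 0x6: [0x8], 0xE: [0x6]}
print(flow_ngrams(ifg, 2))
```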
35
Feature extraction (cont...) ● Control flow analysis. Generated features: – Invalid Memory Reference (IMR) – Undefined Register (UR) – Invalid Jump Target (IJT) ● Checking IMR – Memory is referenced using register addressing and the register value is undefined – e.g.: mov ax, [dx + 5] ● Checking UR – Check whether the register value is set properly ● Checking IJT – Check whether the jump target violates an instruction boundary
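A rough sketch of the undefined-register idea (deliberately naive parsing, purely to illustrate the check, not the authors' analysis) could look like this:

```python
import re

def undefined_register_refs(instructions):
    """Flag memory references whose addressing register was never assigned earlier."""
    defined, flagged = set(), []
    for ins in instructions:
        dest = re.match(r"mov\s+(\w+),", ins)
        mem = re.search(r"\[(\w+)", ins)
        if mem and mem.group(1) not in defined:
            flagged.append(ins)          # register used for addressing before being set
        if dest:
            defined.add(dest.group(1))   # destination register is now defined
    return flagged

print(undefined_register_refs(["mov ax, [dx + 5]", "mov dx, 0f34", "mov bx, [dx]"]))
# -> ['mov ax, [dx + 5]']  (dx was undefined at the first reference)
```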
36
Feature extraction (cont...) ● Why n-gram analysis? – Intuition: in general, disassembled executables should have a different pattern of instruction usage than disassembled data. ● Why control flow analysis? – Intuition: there should be no invalid memory references or invalid jump targets.
37
Putting it together ● Compute all possible n-grams ● Select best k of them ● Compute feature vector (binary vector) for each training example ● Supply these vectors to the training algorithm
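Putting those steps into code, a binary feature vector for one example might be built as in the sketch below (the selected n-grams are hypothetical placeholders):

```python
def to_feature_vector(example_ngrams, selected_ngrams):
    """Return a 0/1 vector indexed by the selected best-k n-grams."""
    present = set(example_ngrams)
    return [1 if g in present else 0 for g in selected_ngrams]

selected = ["0123", "89ab", "cdef"]                   # hypothetical best-k n-grams
print(to_feature_vector(["89ab", "ffff"], selected))  # -> [0, 1, 0]
```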
38
Experiments ● Dataset – Real traces of normal messages – Real attack messages – Polymorphic shellcodes ● Training, Testing – Support Vector Machine (SVM)
39
Results ● CFBn: Control-Flow Based n-gram feature ● CFF: Control-flow feature
40
Novelty / contribution ● We introduce the notion of control-flow based n-grams ● We combine control flow analysis with data mining to distinguish code from data ● Significant improvement over other methods (e.g. SigFree)
41
Advantages ● Fast testing ● Signature-free operation ● Low overhead ● Robust against many obfuscations
42
Limitations ● Need samples of attack and normal messages. ● May not be able to detect a completely new type of attack.
43
Future Work ● Find more features ● Apply dynamic analysis techniques ● Semantic analysis
44
Reference / suggested readings – X. Wang, C. Pan, P. Liu, and S. Zhu. SigFree: A signature-free buffer overflow attack blocker. In USENIX Security, July 2006. – J. Z. Kolter and M. A. Maloof. Learning to detect malicious executables in the wild. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, pages 470–478, 2004.
45
Email Worm Detection (behavioural approach) [Diagram: outgoing emails → feature extraction → training data and test data → machine learning → classifier (the model) → clean or infected?]
46
Feature Extraction
● Per-email features
– Binary-valued features: presence of HTML; script tags/attributes; embedded images; hyperlinks; presence of binary or text attachments; MIME types of file attachments
– Continuous-valued features: number of attachments; number of words/characters in the subject and body
● Per-window features
– Number of emails sent; number of unique email recipients; number of unique sender addresses; average number of words/characters per subject and body; average word length; variance in number of words/characters per subject and body; variance in word length
– Ratio of emails with attachments
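A small sketch of pulling a few of these per-email features with Python's standard email parser is given below; the feature names and the toy message are illustrative, not taken from the authors' feature extractor.

```python
from email import message_from_string

def email_features(raw_email):
    """Extract a few of the per-email features listed above from a raw message."""
    msg = message_from_string(raw_email)
    body = msg.get_payload() if not msg.is_multipart() else ""
    attachments = [p for p in msg.walk() if p.get_filename()] if msg.is_multipart() else []
    return {
        "has_html": int("<html" in raw_email.lower()),
        "num_attachments": len(attachments),
        "subject_words": len((msg.get("Subject") or "").split()),
        "body_chars": len(body),
    }

print(email_features("Subject: hello\n\nSee the attached report."))
```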
47
Feature Reduction & Selection ● Principal Component Analysis (PCA) – Reduces higher-dimensional data to a lower dimension – Helps reduce noise and overfitting ● Decision Tree – Used to select the best features
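A hedged sketch of these two steps with scikit-learn (assumed library; the data is a random stand-in for the per-email feature matrix) might be:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 40))            # toy stand-in for the per-email feature matrix
y = rng.integers(0, 2, size=300)     # 1 = infected, 0 = clean

# Dimensionality reduction: project onto the top principal components.
X_reduced = PCA(n_components=10).fit_transform(X)

# Feature selection: a decision tree's split choices rank feature usefulness.
tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
best = np.argsort(tree.feature_importances_)[::-1][:10]
print("selected feature indices:", best)
```

Either X_reduced or the columns indexed by best would then be fed to the classifier.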
48
Experiments ● Data Set – Contains instances of both normal and viral emails – Six worm types: bagle.f, bubbleboy, mydoom.m, mydoom.u, netsky.d, sobig.f – Collected from UC Berkeley ● Training, Testing – Decision Tree: C4.5 algorithm (J48) in the Weka system – Support Vector Machine (SVM) and Naïve Bayes (NB)
49
Results
50
Conclusion & Future Work ● Three approaches have been tested – Apply a classifier directly – Apply dimension reduction (PCA) and then classify – Apply feature selection (decision tree) and then classify ● The decision tree has the best performance ● Future Plans – Combine content-based with behavioural approaches ● Offensive Operations – Honeypots, information operations