1
Data Mining for Security Applications: Detecting Malicious Executables Mr. Mehedy M. Masud (PhD Student) Prof. Latifur Khan Prof. Bhavani Thuraisingham Department of Computer Science The University of Texas at Dallas
2
Outline and Acknowledgement ● Vision for Assured Information Sharing ● Handling Different Trust levels ● Defensive Operations between Untrustworthy Partners – Detecting Malicious Executables using Data Mining ● Research Funded by Air Force Office of Scientific Research and Texas Enterprise Funds
3
Vision: Assured Information Sharing
[Diagram: Agencies A, B, and C each publish their data/policy component to a shared coalition data/policy.]
1. Trustworthy Partners 2. Semi-Trustworthy Partners 3. Untrustworthy Partners 4. Dynamic Trust
4
Our Approach ● Integrate the Medicaid claims data and mine the data; then enforce policies and determine how much information is lost by enforcing them – Prof. Khan, Dr. Awad (Postdoc) and Student Workers (MS students) ● Apply game theory and probing techniques to extract information from semi-trustworthy partners – Prof. Murat Kantarcioglu and Ryan Layfield (PhD Student) ● Data Mining for Defensive and Offensive Operations – e.g., malicious code detection, honeypots – Prof. Latifur Khan and Mehedy Masud ● Dynamic Trust Levels, Peer-to-Peer Communication – Prof. Kevin Hamlen and Nathalie Tsybulnik (PhD Student)
5
Introduction: Detecting Malicious Executables using Data Mining
● What are malicious executables?
– Programs that harm computer systems: viruses, exploits, denial of service (DoS), flooders, sniffers, spoofers, Trojans, etc.
– Exploit a software vulnerability on a victim
– May remotely infect other victims
– Incur great loss; for example, the Code Red epidemic cost $2.6 billion
● Malicious code detection: traditional approach
– Signature based
– Requires signatures to be generated by human experts
– So it is not effective against “zero-day” attacks
6
State of the Art: Automated Detection
● Automated detection approaches:
● Behavioural: analyse behaviours such as source and destination addresses, attachment types, statistical anomalies, etc.
● Content-based: analyse the content of the malicious executable
– Autograph (H.-A. Kim, CMU): based on an automated signature generation process
– N-gram analysis (Kolter, J. Z. and Maloof, M. A.): based on mining features and using machine learning
7
New Ideas ✗ Content-based approaches consider only machine code (byte code). ✗ Is it possible to consider higher-level source code for malicious code detection? ✗ Yes: disassemble the binary executable and retrieve the assembly program ✗ Extract important features from the assembly program ✗ Combine these with the machine-code features
8
Feature Extraction ✗ Binary n-gram features – Sequence of n consecutive bytes of binary executable ✗ Assembly n-gram features – Sequence of n consecutive assembly instructions ✗ System API call features – DLL function call information
9
The Hybrid Feature Retrieval Model ● Collect training samples of normal and malicious executables. ● Extract features ● Train a Classifier and build a model ● Test the model against test samples
10
Hybrid Feature Retrieval (HFR) ● Training
11
Hybrid Feature Retrieval (HFR) ● Testing
12
Feature Extraction: Binary n-gram features
– Features are extracted from the byte code in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
– Example: given an 11-byte sequence 0123456789abcdef012345, the 2-grams (2-byte sequences) are 0123, 2345, 4567, 6789, 89ab, abcd, cdef, ef01, 0123, 2345; the 4-grams (4-byte sequences) are 01234567, 23456789, 456789ab, ..., ef012345; and so on.
– Problem: large dataset, too many features (millions!)
– Solution: use secondary memory and efficient data structures; apply feature selection
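As a concrete illustration (not the authors' implementation, which relies on secondary memory and specialised data structures to cope with millions of features), a minimal Python sketch of byte n-gram extraction might look like this:

```python
from collections import Counter

def binary_ngrams(path, n=2):
    """Count the n-byte sequences (as hex strings) appearing in a file."""
    with open(path, "rb") as f:
        data = f.read()
    return Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))

# The 11-byte example from this slide:
sample = bytes.fromhex("0123456789abcdef012345")
print([sample[i:i + 2].hex() for i in range(len(sample) - 1)])
# -> ['0123', '2345', '4567', '6789', '89ab', 'abcd', 'cdef', 'ef01', '0123', '2345']
```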
13
Feature Extraction: Assembly n-gram features
– Features are extracted from the assembly programs in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
– Example: for the three instructions “push eax”; “mov eax, dword[0f34]”; “add ecx, eax”; the 2-grams are (1) “push eax”; “mov eax, dword[0f34]” and (2) “mov eax, dword[0f34]”; “add ecx, eax”
– Problem: same problem as with binary n-grams
– Solution: same solution
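The same windowing idea applies to the disassembled program; a short sketch (the input list is hypothetical, in practice it would come from the disassembler's output) is:

```python
def assembly_ngrams(instructions, n=2):
    """Return the n-instruction sequences in a disassembled program."""
    return [tuple(instructions[i:i + n]) for i in range(len(instructions) - n + 1)]

instrs = ["push eax", "mov eax, dword[0f34]", "add ecx, eax"]   # the slide's example
print(assembly_ngrams(instrs, 2))
# -> [('push eax', 'mov eax, dword[0f34]'), ('mov eax, dword[0f34]', 'add ecx, eax')]
```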
14
Feature Selection
● Select the best K features
● Selection criterion: Information Gain
● The gain of an attribute A on a collection of examples S is given by Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
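A small sketch of how that criterion could be computed for binary (present/absent) n-gram features follows; the function and variable names are illustrative, not taken from the authors' code.

```python
import math

def entropy(labels):
    """Entropy(S) = -sum_c p_c * log2(p_c)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(feature_values, labels):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    gain, n = entropy(labels), len(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy example: a feature present in 2 of the 3 malicious samples and absent elsewhere.
print(info_gain([1, 1, 0, 0, 0, 0], ["mal", "mal", "mal", "ben", "ben", "ben"]))
```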
15
Experiments
● Dataset
– Dataset 1: 838 malicious and 597 benign executables
– Dataset 2: 1082 malicious and 1370 benign executables
– Malicious code collected from VX Heavens (http://vx.netlux.org)
● Disassembly
– Pedisassem (http://www.geocities.com/~sangcho/index.html)
● Training, Testing
– Support Vector Machine (SVM)
– C-Support Vector Classifier with an RBF kernel
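For readers who want to reproduce this step, a hedged sketch of training and testing a C-SVC with an RBF kernel on the selected feature vectors is shown below, using scikit-learn (which is assumed here and is not necessarily the SVM package the authors used). The data is a random toy stand-in for the real feature matrix.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy stand-in: one row per executable, one 0/1 column per selected n-gram
# feature; label 1 = malicious, 0 = benign.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 500)).astype(float)
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(C=1.0, kernel="rbf", gamma="scale")   # C-SVC with an RBF kernel
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```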
16
Results ● HFS = Hybrid Feature Set ● BFS = Binary Feature Set ● AFS = Assembly Feature Set
17
Results ● HFS = Hybrid Feature Set ● BFS = Binary Feature Set ● AFS = Assembly Feature Set
18
Results ● HFS = Hybrid Feature Set ● BFS = Binary Feature Set ● AFS = Assembly Feature Set
19
Future Plans ● System calls: – seem to be very useful – need to consider the frequency of calls – call sequence patterns (following program paths) – actions immediately preceding or following a call ● Detect malicious code by program slicing – requires further analysis
20
Data Mining to Detect Buffer Overflow Attack Mohammad M. Masud, Latifur Khan, Bhavani Thuraisingham Department of Computer Science The University of Texas at Dallas
21
Introduction ● Goal – Intrusion detection, e.g. worm attacks and buffer overflow attacks ● Main Contributions – “Worm” code detection by data mining coupled with reverse engineering – Buffer overflow detection by combining data mining with static analysis of assembly code
22
Background ● What is a buffer overflow? – A situation in which a fixed-size buffer is overrun by a larger input ● How does it happen? – Example: ... char buff[100]; gets(buff); ... [Diagram: the input string is copied into buff on the stack.]
23
Background (cont...) ● Then what? [Diagram: gets(buff) writes past the end of buff on the stack, overwriting the saved return address; the new return address points to the attacker's code placed in the buffer.]
24
Background (cont...) ● So what? – The program may crash, or – the attacker can execute arbitrary code ● The attacker can then – execute any system function – communicate with some host, download “worm” code and install it – open a backdoor to take full control of the victim ● How do we stop it?
25
Background (cont...) ● Stopping buffer overflow – Preventive approaches – Detection approaches ● Preventive approaches – Finding bugs in source code. Problem: can only work when source code is available. – Compiler extension. Same problem. – OS/HW modification ● Detection approaches – Capture code running symptoms. Problem: may require long running time. – Automatically generating signatures of buffer overflow attacks.
26
CodeBlocker (Our approach) ● A detection approach ● Based on the observation: – attack messages usually contain code while normal messages contain data ● Main idea – Check whether a message contains code ● Problem to solve: – Distinguishing code from data
27
● Statistics to support this observation:
(a) On Windows platforms – most web servers (port 80) accept data only; – remote access services (ports 111, 137, 138, 139) accept data only; – Microsoft SQL Servers (port 1434) accept data only; – workstation services (ports 139 and 445) accept data only.
(b) On Linux platforms – most Apache web servers (port 80) accept data only; – BIND (port 53) accepts data only; – SNMP (port 161) accepts data only; – most mail transport (port 25) accepts data only; – database servers (Oracle, MySQL, PostgreSQL) at ports 1521, 3306 and 5432 accept data only.
28
Severity of the problem ● It is not easy to determine the actual instruction sequence encoded in a given string of bits
29
Our solution ● Apply data mining ● Formulate the problem as a classification problem (code vs. data) ● Collect a set of training examples containing both classes ● Train a classifier on the data with a machine learning algorithm to obtain a model ● Test the model against new messages
30
CodeBlocker Model
31
Feature Extraction
32
Disassembly ● We apply the SigFree tool – implemented by Xinran Wang et al. (Penn State)
33
Feature extraction ● Features are extracted using – n-gram analysis – control flow analysis ● N-gram analysis [Figure: assembly program and its corresponding instruction flow graph (IFG)] – What is an n-gram? A sequence of n instructions – Traditional approach: the flow of control is ignored – The 2-grams are: 02, 24, 46, ..., CE
34
Feature extraction (cont...) ● Control-flow based n-gram analysis [Figure: assembly program and its corresponding instruction flow graph (IFG)] – What is an n-gram? A sequence of n instructions – Proposed control-flow based approach: the flow of control is considered – The 2-grams are: 02, 24, 46, ..., CE, E6
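To make the difference from the traditional approach concrete, here is a minimal sketch (the graph below is a made-up stand-in for the slide's IFG) that enumerates 2-grams by walking control-flow edges instead of consecutive byte positions:

```python
def flow_ngrams(successors, n=2):
    """Enumerate n-node paths by following the edges of an instruction flow graph."""
    found = set()

    def walk(node, path):
        path = path + (node,)
        if len(path) == n:
            found.add(path)
            return
        for nxt in successors.get(node, ()):
            walk(nxt, path)

    for start in successors:
        walk(start, ())
    return found

# Hypothetical IFG: a jump adds the edge 0xE -> 0x6, so the 2-gram (E, 6) is
# reported even though instruction 6 does not follow E in the byte stream.
ifg = {0x0: [0x2], 0x2: [0x4], 0x4: [0x6], 0x6: [0x8], 0xE: [0x6]}
print(flow_ngrams(ifg, 2))
```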
35
Feature extraction (cont...) ● Control flow analysis. Generated features: – Invalid Memory Reference (IMR) – Undefined Register (UR) – Invalid Jump Target (IJT) ● Checking IMR – Memory is referenced using register addressing and the register value is undefined – e.g.: mov ax, [dx + 5] ● Checking UR – Check whether the register value is set properly ● Checking IJT – Check whether the jump target violates an instruction boundary
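A rough sketch of the undefined-register idea (deliberately naive parsing, purely to illustrate the check, not the authors' analysis) could look like this:

```python
import re

def undefined_register_refs(instructions):
    """Flag memory references whose addressing register was never assigned earlier."""
    defined, flagged = set(), []
    for ins in instructions:
        dest = re.match(r"mov\s+(\w+),", ins)
        mem = re.search(r"\[(\w+)", ins)
        if mem and mem.group(1) not in defined:
            flagged.append(ins)          # register used for addressing before being set
        if dest:
            defined.add(dest.group(1))   # destination register is now defined
    return flagged

print(undefined_register_refs(["mov ax, [dx + 5]", "mov dx, 0f34", "mov bx, [dx]"]))
# -> ['mov ax, [dx + 5]']  (dx was undefined at the first reference)
```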
36
Feature extraction (cont...) ● Why n-gram analysis? – Intuition: in general, disassembled executables should have a different pattern of instruction usage than disassembled data. ● Why control flow analysis? – Intuition: there should be no invalid memory references or invalid jump targets.
37
Putting it together ● Compute all possible n-grams ● Select best k of them ● Compute feature vector (binary vector) for each training example ● Supply these vectors to the training algorithm
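Putting those steps into code, a binary feature vector for one example might be built as in the sketch below (the selected n-grams are hypothetical placeholders):

```python
def to_feature_vector(example_ngrams, selected_ngrams):
    """Return a 0/1 vector indexed by the selected best-k n-grams."""
    present = set(example_ngrams)
    return [1 if g in present else 0 for g in selected_ngrams]

selected = ["0123", "89ab", "cdef"]                   # hypothetical best-k n-grams
print(to_feature_vector(["89ab", "ffff"], selected))  # -> [0, 1, 0]
```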
38
Experiments ● Dataset – Real traces of normal messages – Real attack messages – Polymorphic shellcodes ● Training, Testing – Support Vector Machine (SVM)
39
Results ● CFBn: Control-Flow Based n-gram feature ● CFF: Control-flow feature
40
Novelty / contribution ● We introduce the notion of control-flow based n-grams ● We combine control flow analysis with data mining to distinguish code from data ● Significant improvement over other methods (e.g. SigFree)
41
Advantages ● Fast testing ● Signature-free operation ● Low overhead ● Robust against many obfuscations
42
Limitations ● Need samples of attack and normal messages. ● May not be able to detect a completely new type of attack.
43
Future Work ● Find more features ● Apply dynamic analysis techniques ● Semantic analysis
44
Reference / suggested readings – X. Wang, C. Pan, P. Liu, and S. Zhu. SigFree: A signature-free buffer overflow attack blocker. In USENIX Security, July 2006. – J. Z. Kolter and M. A. Maloof. Learning to detect malicious executables in the wild. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, pages 470–478, 2004.
45
Email Worm Detection (behavioural approach) [Diagram: outgoing emails → feature extraction → training data and test data → machine learning → classifier (the model) → clean or infected?]
46
Feature Extraction
● Per-email features
– Binary-valued features: presence of HTML; script tags/attributes; embedded images; hyperlinks; presence of binary or text attachments; MIME types of file attachments
– Continuous-valued features: number of attachments; number of words/characters in the subject and body
● Per-window features
– Number of emails sent; number of unique email recipients; number of unique sender addresses; average number of words/characters per subject and body; average word length; variance in number of words/characters per subject and body; variance in word length
– Ratio of emails with attachments
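A small sketch of pulling a few of these per-email features with Python's standard email parser is given below; the feature names and the toy message are illustrative, not taken from the authors' feature extractor.

```python
from email import message_from_string

def email_features(raw_email):
    """Extract a few of the per-email features listed above from a raw message."""
    msg = message_from_string(raw_email)
    body = msg.get_payload() if not msg.is_multipart() else ""
    attachments = [p for p in msg.walk() if p.get_filename()] if msg.is_multipart() else []
    return {
        "has_html": int("<html" in raw_email.lower()),
        "num_attachments": len(attachments),
        "subject_words": len((msg.get("Subject") or "").split()),
        "body_chars": len(body),
    }

print(email_features("Subject: hello\n\nSee the attached report."))
```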
47
Feature Reduction & Selection ● Principal Component Analysis (PCA) – Reduces higher-dimensional data to a lower dimension – Helps reduce noise and overfitting ● Decision Tree – Used to select the best features
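A hedged sketch of these two steps with scikit-learn (assumed library; the data is a random stand-in for the per-email feature matrix) might be:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 40))            # toy stand-in for the per-email feature matrix
y = rng.integers(0, 2, size=300)     # 1 = infected, 0 = clean

# Dimensionality reduction: project onto the top principal components.
X_reduced = PCA(n_components=10).fit_transform(X)

# Feature selection: a decision tree's split choices rank feature usefulness.
tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
best = np.argsort(tree.feature_importances_)[::-1][:10]
print("selected feature indices:", best)
```

Either X_reduced or the columns indexed by best would then be fed to the classifier.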
48
Experiments ● Data Set – Contains instances of both normal and viral emails – Six worm types: bagle.f, bubbleboy, mydoom.m, mydoom.u, netsky.d, sobig.f – Collected from UC Berkeley ● Training, Testing – Decision Tree: C4.5 algorithm (J48) in the Weka system – Support Vector Machine (SVM) and Naïve Bayes (NB)
49
Results
50
Conclusion & Future Work ● Three approaches have been tested – Apply a classifier directly – Apply dimension reduction (PCA) and then classify – Apply feature selection (decision tree) and then classify ● The decision tree has the best performance ● Future Plans – Combine content-based with behavioural approaches ● Offensive Operations – Honeypots, information operations