Data Mining for Malware Detection Lecture #2 May 27, 2011 Dr. Bhavani Thuraisingham The University of Texas at Dallas.


What is Data Mining?
Related terms: Data Mining, Knowledge Mining, Knowledge Discovery in Databases, Data Archaeology, Data Dredging, Database Mining, Knowledge Extraction, Data Pattern Processing, Information Harvesting, Siftware
Definition: The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data, often previously unknown, using pattern recognition technologies and statistical and mathematical techniques (Thuraisingham, Data Mining, CRC Press 1998)

What’s going on in data mining?
- What are the technologies for data mining?
  - Database management, data warehousing, machine learning, statistics, pattern recognition, visualization, parallel processing
- What can data mining do for you?
  - Data mining outcomes: Classification, Clustering, Association, Anomaly detection, Prediction, Estimation, ...
- How do you carry out data mining?
  - Data mining techniques: Decision trees, Neural networks, Market-basket analysis, Link analysis, Genetic algorithms, ...
- What is the current status?
  - Many commercial products mine relational databases
- What are some of the challenges?
  - Mining unstructured data, extracting useful patterns, web mining, data mining, security and privacy

Data Mining for Intrusion Detection: Problem
- An intrusion can be defined as “any set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource”.
- Attacks are:
  - Host-based attacks
  - Network-based attacks
- Intrusion detection systems are split into two groups:
  - Anomaly detection systems
  - Misuse detection systems
- Use audit logs
  - Capture all activities in network and hosts.
  - But the amount of data is huge!

Misuse Detection
[Figure: misuse detection]

Problem: Anomaly Detection
[Figure: anomaly detection]

Our Approach: Overview
[Figure: flow diagram with the labels Training Data, Hierarchical Clustering (DGSOT), SVM Class Training, Testing Data, Testing, Class]
DGSOT: Dynamically Growing Self-Organizing Tree

Our Approach: Hierarchical Clustering
[Figure: hierarchical clustering with SVM flow chart]

Results: Training Time, FP and FN Rates of Various Methods
Methods                 Average Accuracy   Total Training Time   Average FP Rate (%)   Average FN Rate (%)
Random Selection        52%                0.44 hours            40                    47
Pure SVM                57.6%              17.34 hours
SVM + Rocchio Bundling  51.6%              26.7 hours
SVM + DGSOT             69.8%              13.18 hours

Introduction: Detecting Malicious Executables using Data Mining
- What are malicious executables?
  - Harm computer systems
  - Virus, Exploit, Denial of Service (DoS), Flooder, Sniffer, Spoofer, Trojan etc.
  - Exploit a software vulnerability on a victim
  - May remotely infect other victims
  - Incur great loss. Example: the Code Red epidemic cost $2.6 billion
- Malicious code detection: traditional approach
  - Signature based
  - Requires signatures to be generated by human experts
  - So, not effective against “zero day” attacks

State of the Art in Automated Detection
- Automated detection approaches:
  - Behavioural: analyse behaviours like source and destination address, attachment type, statistical anomaly etc.
  - Content-based: analyse the content of the malicious executable
    - Autograph (H.-A. Kim, CMU): based on an automated signature generation process
    - N-gram analysis (Maloof, M.A. et al.): based on mining features and using machine learning

Our New Ideas (Khan, Masud and Thuraisingham)
- Content-based approaches consider only machine code (byte code).
- Is it possible to consider higher-level source code for malicious code detection?
- Yes: disassemble the binary executable and retrieve the assembly program
- Extract important features from the assembly program
- Combine them with the machine-code features

The Hybrid Feature Retrieval Model
- Collect training samples of normal and malicious executables
- Extract features
  - Binary n-gram features: sequence of n consecutive bytes of the binary executable
  - Assembly n-gram features: sequence of n consecutive assembly instructions
  - System API call features
- Train a classifier and build a model
- Test the model against test samples

Hybrid Feature Retrieval (HFR): Training and Testing
[Figure: HFR training and testing]

Feature Extraction: Binary n-gram features
- Features are extracted from the byte codes in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
- Example: given the 11-byte sequence 0123456789abcdef012345
  - The 2-grams (2-byte sequences) are: 0123, 2345, 4567, 6789, 89ab, abcd, cdef, ef01, 0123, 2345
  - The 4-grams (4-byte sequences) are: 01234567, 23456789, 456789ab, ..., ef012345 and so on
- Problem: large dataset, too many features (millions!)
- Solution:
  - Use secondary memory and efficient data structures
  - Apply feature selection
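The sliding-window extraction in the example can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `byte_ngrams` is hypothetical, and the input is assumed to be the 11-byte sequence 0123456789abcdef012345, which is the sequence consistent with the 2-grams listed on the slide.

```python
def byte_ngrams(hex_string, n):
    """Return every run of n consecutive bytes, as hex strings, in order."""
    data = bytes.fromhex(hex_string)
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]


# Assumed example input: 22 hex digits, i.e. 11 bytes.
sequence = "0123456789abcdef012345"

two_grams = byte_ngrams(sequence, 2)    # 10 overlapping 2-byte windows
four_grams = byte_ngrams(sequence, 4)   # 8 overlapping 4-byte windows

# Repeated n-grams (0123 and 2345 here) collapse into one feature each.
unique_two_grams = sorted(set(two_grams))
```

Even this short sequence yields 8 distinct 2-grams, which hints at why real executables produce millions of candidate features and why feature selection is needed.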

Feature Extraction: Assembly n-gram features
- Features are extracted from the assembly programs in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
- Example: three instructions “push eax”; “mov eax, dword[0f34]”; “add ecx, eax”
  - The 2-grams are:
    (1) “push eax”; “mov eax, dword[0f34]”
    (2) “mov eax, dword[0f34]”; “add ecx, eax”
- Problem: same problem as binary n-grams
- Solution: same solution

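Feature Selection
- Select the best K features
- Selection criterion: Information Gain
- The gain of an attribute A on a collection of examples S is given by

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

  where S_v is the subset of S for which attribute A has value v.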
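A minimal sketch of information-gain ranking, assuming the standard textbook definition: Gain(S, A) is Entropy(S) minus the weighted sum, over the values v of A, of (|S_v| / |S|) times Entropy(S_v). The function names, the toy n-gram names, and the four labelled examples below are all hypothetical and are not taken from the lecture's datasets.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


def select_best_k(features, labels, k):
    """features maps a feature name to its value in each example; return the k top-ranked names."""
    ranked = sorted(features,
                    key=lambda name: information_gain(features[name], labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy data: 1 means the n-gram occurs in the executable.
labels = ["malicious", "malicious", "benign", "benign"]
features = {
    "89ab": [1, 1, 0, 0],   # separates the classes perfectly
    "0123": [1, 0, 1, 0],   # carries no information about the class
    "cdef": [1, 1, 1, 0],   # partially informative
}
best_two = select_best_k(features, labels, 2)
```

With millions of candidate n-grams, a ranking like this would typically be computed in a streaming fashion rather than with all feature columns in memory.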

Experiments
- Dataset
  - Dataset1: 838 malicious and 597 benign executables
  - Dataset2: 1082 malicious and 1370 benign executables
  - Collected malicious code from VX Heavens
- Disassembly
  - Pedisassem
- Training, Testing
  - Support Vector Machine (SVM)
  - C-Support Vector Classifiers with an RBF kernel

Results
- HFS = Hybrid Feature Set
- BFS = Binary Feature Set
- AFS = Assembly Feature Set

Future Plans
- System calls:
  - Seem to be very useful
  - Need to consider the frequency of calls
  - Call sequence patterns (following the program path)
  - Actions immediately preceding or following a call
- Detect malicious code by program slicing
  - Requires analysis

Data Mining for Buffer Overflow: Introduction
- Goal
  - Intrusion detection
  - e.g.: worm attack, buffer overflow attack
- Main Contribution
  - 'Worm' code detection by data mining coupled with 'reverse engineering'
  - Buffer overflow detection by combining data mining with static analysis of assembly code

Background
- What is a 'buffer overflow'?
  - A situation in which a fixed-size buffer is overflowed by a larger input.
- How does it happen?
  - Example: char buff[100]; gets(buff);
[Figure: stack memory holding buff and the input string]

Background (cont...)
- Then what?
  - Example: char buff[100]; gets(buff);
[Figure: stack memory before and after the overflow. The return address on the stack is overwritten, and the new return address points to the memory location of the attacker's code in buff]

Background (cont...)
- So what?
  - The program may crash, or
  - The attacker can execute arbitrary code
- The attacker's code can now
  - Execute any system function
  - Communicate with some host, download some 'worm' code, and install it!
  - Open a backdoor to take full control of the victim
- How to stop it?

Background (cont...)
- Stopping buffer overflow
  - Preventive approaches
  - Detection approaches
- Preventive approaches
  - Finding bugs in source code. Problem: can only work when source code is available.
  - Compiler extension. Same problem.
  - OS/HW modification
- Detection approaches
  - Capture code running symptoms. Problem: may require long running time.
  - Automatically generating signatures of buffer overflow attacks.

CodeBlocker (Our approach)
- A detection approach
- Based on the observation:
  - Attack messages usually contain code while normal messages contain data.
- Main idea
  - Check whether a message contains code
- Problem to solve:
  - Distinguishing code from data

Some Statistics
- Statistics to support this observation
- (a) On Windows platforms
  - most web servers (port 80) accept data only;
  - remote access services (ports 111, 137, 138, 139) accept data only;
  - Microsoft SQL Servers (port 1434) accept data only;
  - workstation services (ports 139 and 445) accept data only.
- (b) On Linux platforms, most
  - Apache web servers (port 80) accept data only;
  - BIND (port 53) accepts data only;
  - SNMP (port 161) accepts data only;
  - Mail Transport (port 25) accepts data only;
  - Database servers (Oracle, MySQL, PostgreSQL) at ports 1521, 3306 and 5432 accept data only.

Severity of the problem
- It is not easy to detect an actual instruction sequence in a given string of bits

Our solution
- Apply data mining.
- Formulate the problem as a classification problem (code, data)
- Collect a set of training examples containing instances of both classes
- Train a machine learning algorithm on the data to get the model
- Test this model against a new message

CodeBlocker Model
[Figure: CodeBlocker model]

Feature Extraction
[Figure: feature extraction]

Disassembly
- We apply the SigFree tool
  - Implemented by Xinran Wang et al. (PennState)

Feature extraction
- Features are extracted using
  - N-gram analysis
  - Control flow analysis
- N-gram analysis
  - What is an n-gram? A sequence of n instructions
  - Traditional approach: flow of control is ignored
  - 2-grams are: 02, 24, 46, ..., CE
[Figure: assembly program and corresponding IFG]

Feature extraction (cont...)
- Control-flow based n-gram analysis
  - What is an n-gram? A sequence of n instructions
  - Proposed control-flow based approach: flow of control is considered
  - 2-grams are: 02, 24, 46, ..., CE, E6
[Figure: assembly program and corresponding IFG]
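The difference between the two approaches can be sketched as follows. The figure is not available in the transcript, so the program layout here is an assumption: instructions sit at addresses 0, 2, 4, ..., E, and the instruction at E jumps back to 6, which would account for the extra 2-gram E6. The function names are hypothetical, and the sketch treats every jump as adding an edge without removing the fall-through pair.

```python
def sequential_2grams(addresses):
    """Traditional approach: pair each instruction with the next one in memory."""
    return [f"{a:X}{b:X}" for a, b in zip(addresses, addresses[1:])]


def control_flow_2grams(addresses, jumps):
    """Control-flow based approach: also pair each jump with its target.

    jumps maps the address of a jump instruction to its target address.
    """
    grams = sequential_2grams(addresses)
    known = set(addresses)
    for source, target in jumps.items():
        gram = f"{source:X}{target:X}"
        if source in known and target in known and gram not in grams:
            grams.append(gram)
    return grams


# Assumed layout: eight 2-byte instructions, the last one jumping back to 6.
addresses = [0x0, 0x2, 0x4, 0x6, 0x8, 0xA, 0xC, 0xE]
jumps = {0xE: 0x6}

traditional = sequential_2grams(addresses)
with_control_flow = control_flow_2grams(addresses, jumps)
```

Here `traditional` covers 02 through CE, while `with_control_flow` additionally contains E6.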

Feature extraction (cont...)
- Control flow analysis. Generated features:
  - Invalid Memory Reference (IMR)
  - Undefined Register (UR)
  - Invalid Jump Target (IJT)
- Checking IMR
  - Memory is referenced using register addressing and the register value is undefined
  - e.g.: mov ax, [dx + 5]
- Checking UR
  - Check whether the register value is set properly
- Checking IJT
  - Check that the jump target does not violate an instruction boundary
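A rough sketch of how the three counts might be computed over a disassembled message. Everything here is an assumption made for illustration: the `Instr` record, the single linear pass (which ignores branches), and the premise that no register is defined at the start. The slides do not describe how CodeBlocker actually performs these checks.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instr:
    address: int
    length: int
    reads: set = field(default_factory=set)    # registers read as plain operands
    writes: set = field(default_factory=set)   # registers written
    mem_base: Optional[str] = None             # register used to address memory
    jump_target: Optional[int] = None          # target address, for jumps only


def control_flow_features(instrs):
    """Count IMR, UR and IJT occurrences in one linear pass over the instructions."""
    starts = {ins.address for ins in instrs}
    defined = set()
    imr = ur = ijt = 0
    for ins in instrs:
        # IMR: memory addressed through a register that was never set.
        if ins.mem_base is not None and ins.mem_base not in defined:
            imr += 1
        # UR: an operand register is read before any instruction wrote it.
        ur += sum(1 for reg in ins.reads if reg not in defined)
        # IJT: the jump lands somewhere other than the start of an instruction.
        if ins.jump_target is not None and ins.jump_target not in starts:
            ijt += 1
        defined |= ins.writes
    return {"IMR": imr, "UR": ur, "IJT": ijt}


# Hypothetical disassembly of a data message that happens to decode as code.
program = [
    Instr(0, 2, writes={"cx"}),                       # mov cx, 1
    Instr(2, 3, writes={"ax"}, mem_base="dx"),        # mov ax, [dx + 5], dx never set
    Instr(5, 2, reads={"ax", "bx"}, writes={"ax"}),   # add ax, bx, bx never set
    Instr(7, 2, jump_target=3),                       # jmp 3, inside the instruction at 2
]
features = control_flow_features(program)
```

Real code is expected to score near zero on all three counts, whereas data that is force-disassembled tends to trigger them.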

Putting it together
- Why n-gram analysis?
  - Intuition: in general, disassembled executables should have a different pattern of instruction usage than disassembled data.
- Why control flow analysis?
  - Intuition: there should be no invalid memory references or invalid jump targets.
- Approach
  - Compute all possible n-grams
  - Select the best k of them
  - Compute a feature vector (binary vector) for each training example
  - Supply these vectors to the training algorithm
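The "compute feature vector" step can be sketched as a presence test against the selected n-grams. The function name and the n-gram strings below are made up for illustration; in practice the selected list would come from the feature selection step.

```python
def binary_feature_vector(example_ngrams, selected_ngrams):
    """Return 1 for each selected n-gram that occurs in the example, else 0."""
    present = set(example_ngrams)
    return [1 if gram in present else 0 for gram in selected_ngrams]


# Hypothetical best-k n-grams and the n-grams found in one message.
selected = ["push|mov", "mov|add", "xor|jmp"]
message_ngrams = ["push|mov", "mov|add", "add|ret"]

vector = binary_feature_vector(message_ngrams, selected)
```

Keeping the order of `selected` fixed across all examples is what makes the resulting vectors comparable when they are handed to the training algorithm.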

Experiments
- Dataset
  - Real traces of normal messages
  - Real attack messages
  - Polymorphic shellcodes
- Training, Testing
  - Support Vector Machine (SVM)

Results
- CFBn: Control-Flow Based n-gram feature
- CFF: Control-flow feature

Novelty, Advantages, Limitations, Future
- Novelty
  - We introduce the notion of control-flow based n-grams
  - We combine control flow analysis with data mining to distinguish code from data
  - Significant improvement over other methods (e.g. SigFree)
- Advantages
  - Fast testing
  - Signature-free operation
  - Low overhead
  - Robust against many obfuscations
- Limitations
  - Needs samples of attack and normal messages.
  - May not be able to detect a completely new type of attack.
- Future
  - Find more features
  - Apply dynamic analysis techniques
  - Semantic analysis

Worm Detection: Introduction
- What are worms?
  - Self-replicating programs; exploit a software vulnerability on a victim; remotely infect other victims
- Evil worms
  - Severe effect; the Code Red epidemic cost $2.6 billion
- Goals of worm detection
  - Real-time detection
- Issues
  - Substantial volume of identical traffic, random probing
- Methods for worm detection
  - Count the number of sources/destinations; count the number of failed connection attempts
- Worm types
  - Email worms, Instant Messaging worms, Internet worms, IRC worms, File-sharing Networks worms
- Automatic signature generation possible
  - EarlyBird System (S. Singh, UCSD); Autograph (H.-A. Kim, CMU)

Worm Detection using Data Mining
[Figure: pipeline with the labels Training data, Test data, Outgoing emails, Feature extraction, Machine Learning, The Model, Classifier, Clean or Infected?]
- Task: given some training instances of both “normal” and “viral” emails, induce a hypothesis to detect “viral” emails.
- We used:
  - Naïve Bayes
  - SVM

Assumptions
- Features are based on outgoing emails.
- Different users have different “normal” behaviour.
- Analysis should be on a per-user basis.
- Two groups of features
  - Per email (number of attachments, HTML in body, text/binary attachments)
  - Per window (mean words in body, variance of words in subject)
- Total of 24 features identified
- Goal: identify “normal” and “viral” emails based on these features

Feature sets
- Per email features
  - Binary-valued features: presence of HTML; script tags/attributes; embedded images; hyperlinks; presence of binary, text attachments; MIME types of file attachments
  - Continuous-valued features: number of attachments; number of words/characters in the subject and body
- Per window features
  - Number of emails sent; number of unique recipients; number of unique sender addresses; average number of words/characters per subject, body; average word length; variance in number of words/characters per subject, body; variance in word length
  - Ratio of emails with attachments
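A few of the per-window features can be computed as sketched below. The dictionary layout of an email record and the function name are assumptions made for this example, and only a subset of the 24 features is shown.

```python
from statistics import mean, pvariance


def window_features(emails):
    """Summarise one user's outgoing emails within a time window.

    Each email is a dict with the keys 'recipients' (list of addresses),
    'subject' (str), 'body' (str) and 'attachments' (int).
    """
    if not emails:
        raise ValueError("window contains no emails")
    body_words = [len(e["body"].split()) for e in emails]
    subject_words = [len(e["subject"].split()) for e in emails]
    recipients = {r for e in emails for r in e["recipients"]}
    with_attachment = sum(1 for e in emails if e["attachments"] > 0)
    return {
        "emails_sent": len(emails),
        "unique_recipients": len(recipients),
        "mean_body_words": mean(body_words),
        "var_body_words": pvariance(body_words),
        "mean_subject_words": mean(subject_words),
        "var_subject_words": pvariance(subject_words),
        "attachment_ratio": with_attachment / len(emails),
    }


# Hypothetical window containing two outgoing emails.
window = [
    {"recipients": ["a@example.com", "b@example.com"],
     "subject": "Meeting notes", "body": "See the attached file", "attachments": 1},
    {"recipients": ["a@example.com"],
     "subject": "Hi", "body": "Lunch today", "attachments": 0},
]
features = window_features(window)
```

Because the slides state that different users have different normal behaviour, a function like this would be applied separately to each user's outgoing mail.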

Data Mining Approach
[Figure: decision flow with the labels Test instance, Classifier, SVM, Naïve Bayes, Infected?, Clean?, Clean, Clean/Infected]

Data set
- Collected from UC Berkeley.
  - Contains instances for both normal and viral emails.
- Six worm types:
  - bagle.f, bubbleboy, mydoom.m, mydoom.u, netsky.d, sobig.f
- Originally six sets of data:
  - Training instances: normal (400) + five worms (5x200)
  - Testing instances: normal (1200) + the sixth worm (200)
- Problem: not balanced, no cross-validation reported
- Solution: re-arrange the data and apply cross-validation
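One way to re-arrange labelled instances for cross-validation is a stratified k-fold split, sketched below. The slides do not say which scheme or how many folds were used, so the stratification, the fold count, and the toy labels are all assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Return k lists of example indices with each class spread evenly across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validation_splits(labels, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


# Hypothetical toy labels: 10 normal and 10 viral instances, 5 folds.
labels = ["normal"] * 10 + ["viral"] * 10
splits = list(cross_validation_splits(labels, 5))
```

Each test fold then holds the same mix of normal and viral instances, which addresses the imbalance noted on the slide.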

Our Implementation and Analysis
- Implementation
  - Naïve Bayes: assumes a “Normal” distribution for numeric and real-valued data; smoothing applied
  - SVM: one-class SVM with the radial basis function kernel, parameterized by “gamma” and “nu”
- Analysis
  - NB alone performs better than other techniques
  - SVM alone also performs better if parameters are set correctly
  - The mydoom.m and VBS.Bubbleboy data sets are not sufficient (very low detection accuracy in all classifiers)
  - The feature-based approach seems to be useful only when we have
    - identified the relevant features
    - gathered enough training data
    - implemented classifiers with the best parameter settings
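A Naïve Bayes classifier that assumes a Normal distribution for each numeric feature can be sketched from scratch as below. This is not the authors' implementation: the slides do not say which kind of smoothing was applied, so the small constant added to each variance (`var_smoothing`) is an assumption, and the two-feature training rows are invented for illustration.

```python
import math
from collections import defaultdict


def train_gaussian_nb(rows, labels, var_smoothing=1e-3):
    """Estimate a prior and per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # var_smoothing keeps the variance away from zero (assumed smoothing scheme).
        variances = [sum((x - m) ** 2 for x in col) / n + var_smoothing
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log prior plus Gaussian log-likelihood."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical features per email: (number of attachments, words in body).
rows = [[0, 120], [1, 90], [0, 150], [1, 10], [1, 12], [2, 8]]
labels = ["normal", "normal", "normal", "viral", "viral", "viral"]

model = train_gaussian_nb(rows, labels)
short_with_attachment = predict_gaussian_nb(model, [1, 11])
long_without_attachment = predict_gaussian_nb(model, [0, 130])
```

Working in log space avoids numeric underflow when many features are multiplied together, which matters more as the feature count approaches the 24 used in the lecture.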