Scalable, Behavior-Based Malware Clustering. Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. Technical University Vienna.


Scalable, Behavior-Based Malware Clustering. Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. Technical University Vienna; University of California, Santa Barbara; Institute Eurecom, Sophia Antipolis. Presented by Mike Hsiao. NDSS 2009.

Outline: Introduction, System Overview, Dynamic Analysis, Behavior Profile, Scalable Clustering, Evaluation, Limitations and Future Work, Related Work

Introduction

Thousands of new malware samples appear each day. Automatic analysis systems allow us to create thousands of analysis reports; now a way to group these reports is needed. We would like to cluster them into sets of malware reports that exhibit similar behavior. ◦ We require automated clustering techniques.

Clustering allows us to ◦ discard reports of samples that have been seen before ◦ guide an analyst in the selection of those samples that require the most attention ◦ derive generalized signatures and implement removal procedures that work for a whole class of samples

Scalable, Behavior-Based Malware Clustering

Malware clustering: find a partitioning of a given set of malware samples into subsets so that the subsets share some common traits (i.e., find "virus families").
Behavior-based: a malware sample is represented by the actions it performs at run-time.
Scalable: it has to work for large sets of malware samples.

System Overview (system architecture figure)

Dynamic Analysis

Based on our existing automatic, dynamic analysis system called Anubis (based on QEMU). ◦ Anubis is a full-system emulator. ◦ Anubis generates an execution trace listing all invoked system calls.

In this work, we extended Anubis with: ◦ system call dependencies (tainting) ◦ control flow dependencies ◦ network analysis (for accurately describing a sample's network behavior)

Output of this step: an execution trace augmented with taint information and network analysis results.

Dynamic Analysis (cont'd)

The goal is to identify how the program uses information that it obtains from the OS.

Tainting system ◦ We attach (taint) labels to certain interesting bytes in memory and propagate these labels whenever they are copied or otherwise manipulated. ◦ System calls serve as taint sources; i.e., we taint the out-arguments and return values of all system calls.
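To make the tainting idea concrete, here is a minimal Python sketch (not Anubis code; the class and labels are hypothetical) of byte-level taint labels that are attached at a taint source and propagated whenever tainted bytes are copied:

```python
class TaintedBytes:
    """A byte buffer where each byte carries a set of taint labels."""

    def __init__(self, data, label=None):
        self.data = bytearray(data)
        # one label set per byte; a fresh label marks a taint source
        self.labels = [{label} if label else set() for _ in data]

    def copy_from(self, src, src_off, dst_off, n):
        """Copying bytes propagates their taint labels along with the data."""
        self.data[dst_off:dst_off + n] = src.data[src_off:src_off + n]
        self.labels[dst_off:dst_off + n] = [
            set(s) for s in src.labels[src_off:src_off + n]
        ]


# the out-argument of a (hypothetical) system call acts as the taint source
sys_time = TaintedBytes(b"20081031", label="NtQuerySystemTime")

# the program copies the year into its own buffer; the labels travel with it
buf = TaintedBytes(b"\x00" * 4)
buf.copy_from(sys_time, 0, 0, 4)

assert buf.labels[0] == {"NtQuerySystemTime"}
```

A real taint engine must also propagate labels through arithmetic and control flow; this sketch only shows the copy case the slide describes.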

Dynamic Analysis (cont'd)

Examples ◦ 1) the return value of GetDate is later used in a CreateFile call. ◦ 2) a program that reads its own code segment might be executing worm propagation code.

Record program control flow ◦ to identify similarities between programs that perform the same actions.

Network ◦ we use Bro to analyze the sent/received data, recognizing and parsing application-level protocols.

Extraction Of The Behavioral Profile

In this step, we process the execution trace provided by the dynamic analysis step.

Goal: abstract from the system call trace ◦ system calls can vary significantly, even between programs that exhibit the same behavior ◦ remove execution-specific artifacts from the trace

A behavioral profile is an abstraction of the program's execution trace that accurately captures the behavior of the binary.

Reasons For An Abstract Behavioral Description

Different ways to read from a file. Different system calls with similar semantics ◦ e.g., NtCreateProcess, NtCreateProcessEx. You can easily interleave the trace with unrelated calls:

A: f = fopen("C:\\test"); read(f, 1);
B: f = fopen("C:\\test"); read(f, 3);
C: f = fopen("C:\\test"); read(f, 1); readRegValue(..); read(f, 1);
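The three traces above should abstract to (almost) the same behavior. A toy Python sketch (an assumed simplification of the paper's profile extraction, with a made-up generalization table) shows how dropping read sizes, repetition, and call-name variants collapses A and B into the identical profile:

```python
def to_profile(trace):
    """Abstract a raw call trace into a set of (resource, operation) pairs.

    trace: list of (syscall, resource) tuples; argument values such as read
    sizes are discarded, and syscalls with similar semantics are generalized
    to one operation name.
    """
    generalize = {
        "fopen": "open",
        "read": "read",
        "NtCreateProcess": "create_process",
        "NtCreateProcessEx": "create_process",
    }
    return {(res, generalize.get(call, call)) for call, res in trace}


A = [("fopen", "C:\\test"), ("read", "C:\\test")]            # read(f, 1)
B = [("fopen", "C:\\test"), ("read", "C:\\test")]            # read(f, 3): size ignored
C = [("fopen", "C:\\test"), ("read", "C:\\test"),
     ("readRegValue", "HKLM\\Sample"),                        # interleaved, unrelated
     ("read", "C:\\test")]                                    # repeated read: ignored

# A and B yield the identical profile; C differs only by the extra registry
# operation, not by the interleaving or the repetition of reads.
assert to_profile(A) == to_profile(B)
assert to_profile(A) <= to_profile(C)
```

The real profiles additionally record dependencies and taint usage, as the next slides describe.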

Elements Of A Behavioral Profile

OS Objects: represent a resource, such as a file, that can be manipulated via system calls ◦ each has a name and a type
OS Operations: a generalization of a system call ◦ carried out on an OS object ◦ the order of operations is irrelevant ◦ the number of operations on a certain resource does not matter
Object Dependencies: model dependencies between OS objects (e.g., a copy operation from a source file to a target file) ◦ also reflect the true order of operations
Control Flow Dependencies: reflect how tainted data is used by the program (comparisons with tainted data)

Example: Behavioral Profile

src = NtOpenFile("C:\\sample.exe");
// memory-map the target file
dst = NtCreateFile("C:\\Windows\\" + GetTempFilename());
dst_section = NtCreateSection(dst);
char *base = NtMapViewOfSection(dst_section);
while (len < length(src)) {
  *(base + len) = NtReadFile(src, 1);
  len++;
}

Resulting profile:
Op  | File    | C:\sample.exe  open:1, read:1
Op  | File    | RANDOM_1       create:1
Op  | Section | RANDOM_1       open:1, map:1, mem_write:1
Dep | File    | C:\sample.exe -> Section | RANDOM_1  read – mem_write


Scalable Clustering

Most clustering algorithms require computing the distances between all pairs of points => O(n^2). We use LSH (locality-sensitive hashing), a technique introduced by Indyk and Motwani, to compute an approximate clustering that requires fewer than n^2 distance computations.

Our clustering algorithm takes as input a set of malware samples, where each malware sample is represented as a set of features. ◦ We have to transform each behavioral profile into a feature set first.

Consider two samples a and b. Our similarity measure is the Jaccard index, defined as J(a, b) = |a ∩ b| / |a ∪ b|.
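The Jaccard index over feature sets is straightforward to compute; a small Python sketch (the feature strings below are made up for illustration):

```python
def jaccard(a, b):
    """Jaccard index J(a, b) = |a ∩ b| / |a ∪ b| between two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty profiles are considered identical
    return len(a & b) / len(a | b)


# two hypothetical feature sets derived from behavioral profiles
s1 = {"file|C:\\sample.exe|read", "file|RANDOM_1|create", "section|RANDOM_1|map"}
s2 = {"file|C:\\sample.exe|read", "file|RANDOM_1|create"}

assert jaccard(s1, s2) == 2 / 3
assert jaccard(s1, s1) == 1.0
```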

LSH Clustering

We perform an approximate, single-linkage hierarchical clustering:
Step 1: Locality-sensitive hashing ◦ to cluster a set of samples, we have to choose a similarity threshold t ◦ the result is an approximation of the true set of all near pairs (as defined by the parameter t)
Step 2: Single-linkage hierarchical clustering
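The slide does not spell out the hash family; for Jaccard similarity the standard choice is min-wise hashing, where the probability that two sets agree on a min-hash equals their Jaccard index, so concatenating k min-hashes into one bucket key makes only sufficiently similar samples collide. A minimal sketch (seeded built-in hashing stands in for random permutations):

```python
import random


def minhash_key(features, seeds):
    """One min-hash per seed; the concatenation serves as an LSH bucket key."""
    return tuple(min(hash((seed, f)) for f in features) for seed in seeds)


random.seed(0)
seeds = [random.random() for _ in range(3)]

near1 = {"a", "b", "c", "d"}
near2 = {"a", "b", "c", "d", "e"}   # Jaccard 0.8 with near1
far   = {"x", "y", "z"}             # Jaccard 0.0 with near1

# similar samples collide in a bucket with high probability; a disjoint set
# lands elsewhere (collision chance is negligible)
assert minhash_key(near1, seeds) != minhash_key(far, seeds)
```

Candidate near pairs are then read off the hash buckets, and only those pairs' exact Jaccard distances feed the single-linkage step, which is what keeps the total number of distance computations below n^2.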

Evaluating Clustering Quality

To assess the quality of the clustering algorithm, we compare our clustering results with a reference clustering of the same sample set ◦ since no reference clustering for malware exists, we had to create it first.

Reference Clustering:
1. We obtained a random sampling of 14,212 malware samples that were submitted to Anubis from Oct. 27th, 2007 to Jan. 31st, 2008.
2. We scanned each sample with 6 different virus scanners.
3. We selected only those samples for which the majority of the antivirus programs reported the same malware family. This resulted in a total of 2,658 samples.
4. We manually corrected classification problems.
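The majority-vote selection in step 3 can be sketched as follows (a hedged illustration, not the authors' code; the family labels are hypothetical):

```python
from collections import Counter


def majority_family(labels, total_scanners=6):
    """Return the family label if a strict majority of scanners agree, else None.

    labels: one family string per scanner, or None if the scanner had no
    detection / no recognizable family.
    """
    counts = Counter(l for l in labels if l is not None)
    if not counts:
        return None
    family, votes = counts.most_common(1)[0]
    return family if votes > total_scanners // 2 else None


# 4 of 6 scanners agree: the sample is kept with family "allaple"
assert majority_family(["allaple", "allaple", "allaple", "allaple", "rbot", None]) == "allaple"

# a 2-2 split is no majority: the sample is discarded
assert majority_family(["allaple", "allaple", "rbot", "rbot", None, None]) is None
```

Real AV labels first need normalization (vendors name the same family differently), which is part of why manual correction (step 4) was still required.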

Quantitative Evaluation

We ran our clustering algorithm with a similarity threshold t = 0.7 on the reference set of 2,658 samples. Our system produced 87 clusters, while the reference clustering consists of 84 clusters.

Precision: ◦ measures how well a clustering algorithm distinguishes between samples that are different
Recall: ◦ measures how well a clustering algorithm recognizes similar samples

(T: reference clustering, C: authors' clustering)

Comparative Evaluation

Behavioral Description | Similarity Measure | Clustering | Quality = Precision * Recall
Bailey-profile         | NCD                | Exact      | 0.916 = 0.979 * 0.935
Bailey-profile         | Jaccard Index      | Exact      | 0.801 = 0.971 * 0.825
Syscalls               | Jaccard Index      | Exact      | 0.656 = 0.874 * 0.750
Our Profile            | Jaccard Index      | Exact      | 0.959 = 0.977 * 0.981
Our Profile            | Jaccard Index      | LSH        | 0.959 = 0.979 * 0.980

NCD: Normalized Compression Distance. LSH: Locality Sensitive Hashing. Exact: all ~n^2/2 pairwise distances are computed.

Precision and Recall (figure: the relationship between precision/recall and the threshold t)

Performance Evaluation

Input: 75,692 malware samples.

Previous work by Bailey et al. (extrapolated from their results on 500 samples): ◦ number of distance calculations: 2,864,639,432 ◦ time for a single distance calculation: 1.25 ms ◦ runtime: 995 hours (~6 weeks)

Our results: ◦ number of distance calculations: 66,528,049 ◦ runtime: 2h 18min

Performance Evaluation (runtime comparison figure)

More Results

The 4 largest clusters (accounting for 86% of all samples):
◦ Allaple.1 (1,289 samples): a polymorphic worm with ICMP scans
◦ Allaple.2 (717 samples): exploits target systems using a wider variety of propagation behaviors, including DNS lookups
◦ DOS (179 samples): various DOS malware samples, but with similar behavior
◦ GBDialer.j (106 samples): similar startup actions and system modifications, attempting to dial out via modem

More Results (cont'd)

Similar registry key accesses to check whether an anti-virus product is installed.
Comparison of a fixed value with the system time ◦ to launch a specific action at a specific time.

Incorrect cluster (one cluster with 25 samples) ◦ the samples crash ◦ which activates the debugger ◦ generates a crash report ◦ and displays a popup message, so their observed crash-handling behavior looks similar.

Limitations and Future Work

Trace dependence ◦ more behavior might be hidden in other execution contexts.

Evasion ◦ We are not concerned with labor-intensive, manual evasion. ◦ We consider an adversary who attempts to automatically produce an arbitrary number of mutations of a malware sample, such that most of these mutations are assigned to different clusters by our tools.

Related Work

Behavioral analysis ◦ All of these works focus on system call analysis.

Dynamic data tainting ◦ Recently, dynamic taint analysis has also been used for the automatic analysis of network protocols. ◦ [18] D. Song et al., "Polyglot: Automatic Extraction of Protocol Message Format using Dynamic Binary Analysis," in CCS 2007. ◦ [43] C. Kruegel et al., "Automatic Network Protocol Analysis," in NDSS 2008. ◦ But these works focus on protocol format or syntax analysis, not execution behavior.

Conclusions

A novel approach for clustering large collections of malware samples ◦ dynamic analysis ◦ extraction of behavioral profiles ◦ a clustering algorithm that requires fewer than a quadratic number of distance computations

Experiments on real-world datasets demonstrate that our techniques can accurately recognize malicious code that behaves in a similar fashion.

Available online:

Coding for profile (figure)

Example of profile (figure)


Coding for Cluster

Transform a behavioral profile into a set of features:
1. For each object, and for each operation
2. For each dependence
3. For each label-value comparison
4. For each label-label comparison