Automatic Extraction of Malicious Behaviors


Automatic Extraction of Malicious Behaviors
Khanh-Huu-The Dam, University Paris Diderot and LIPN
Tayssir Touili, LIPN, CNRS and University Paris 13

Motivation Symantec reported 317M malwares in 2014 vs. 431M malwares in 2015: an increase of 36% in one year, with more than 1M new malwares released every day. Malware detection is a big challenge.

The problem is… Extracting malicious behaviors requires a huge amount of engineering effort: a tedious, manual study of the code, and a lot of time for that study. The main challenge is how to perform this step automatically.

Our goal is… To extract the malicious behaviors automatically!

Modeling Malicious Behaviors What does a malicious behavior look like? What is a good model for a malicious behavior? Let us continue with the example of a Trojan downloader.

Trojan Downloader
n15  push 0FEh
n16  push offset dword_4097A4
n17  call GetSystemDirectoryA
n18  push 0
n19  push 0
n20  lea eax, [ebp-1Ch]
n21  mov ebx, eax
n22  push ebx
n23  push eax
n24  push 0
n25  call URLDownloadToFileA
n26  push 5
n27  call sub_4038B4
n28  push ebx
n29  call WinExec
Behavior: transfer data from the Internet into a file stored in the system folder, then execute this file.
*This code is extracted from Trojan-Downloader.Win32.Delf.abk

Trojan Downloader How can we extract such a graph automatically? From the assembly code above (same listing as the previous slide), the behavior is captured by the malicious API graph:
GetSystemDirectoryA (get the path of the system folder)
→ URLDownloadToFileA (transfer data from a URL address into a file)
→ WinExec (execute this file in the system folder)

Modeling a program
…
n1   push offset Text
n2   push 0
n3   call MessageBoxA
n4   push 0FFFFFFF5h
n5   call GetStdHandle
n6   push eax
n7   call WriteFile
n8   push offset dword_4097A4
n9   call GetSystemDirectoryA
n10  push 0
n11  call URLDownloadToFileA
n12  push ebx
n13  call WinExec
An API call graph represents the order of execution of the different API functions in a program. Here it is the chain:
(n3, MessageBoxA) → (n5, GetStdHandle) → (n7, WriteFile) → (n9, GetSystemDirectoryA) → (n11, URLDownloadToFileA) → (n13, WinExec)
*Assembly code of Trojan-Downloader.Win32.Delf.abk
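As a hedged illustration (not from the slides), a minimal Python sketch of how such an API call graph might be assembled once the call instructions have been identified; the helper name and the straight-line control-flow assumption are ours:

# API calls of the example above, in execution order: (node, API name) pairs.
calls = [("n3", "MessageBoxA"), ("n5", "GetStdHandle"), ("n7", "WriteFile"),
         ("n9", "GetSystemDirectoryA"), ("n11", "URLDownloadToFileA"),
         ("n13", "WinExec")]

def api_call_graph(calls):
    """Build the API call graph of a straight-line block: each call node
    gets an edge to the next call executed. Real code would follow the
    control-flow graph and add one edge per CFG successor instead."""
    nodes = set(calls)
    edges = {(calls[k], calls[k + 1]) for k in range(len(calls) - 1)}
    return nodes, edges

nodes, edges = api_call_graph(calls)
# e.g. (("n9", "GetSystemDirectoryA"), ("n11", "URLDownloadToFileA")) is an edge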

Modeling a program Our goal is to extract the malicious behavior from this API call graph: among all its nodes and edges, only the sub-chain (n9, GetSystemDirectoryA) → (n11, URLDownloadToFileA) → (n13, WinExec) is the malicious behavior.

How to extract malicious behaviors? From a set of malwares and a set of benwares, we build their API call graphs and compute malicious API graphs. Our goal: isolate the few relevant subgraphs (occurring in malwares) from the nonrelevant ones (occurring in benwares). This is an Information Retrieval (IR) problem.

IR Problem vs. Our Problem The IR problem: retrieve relevant documents and reject nonrelevant ones in a collection of documents, by extracting from each document the set of terms that distinguish it from the other documents in the collection. Our problem: isolate the few relevant subgraphs (in malwares) from the nonrelevant ones (in benwares).

Information Retrieval Community Information Retrieval (IR) consists of retrieving documents with relevant information from a collection of documents: web search, email search, etc. The IR community has studied this problem extensively over the past 35 years and has produced several techniques that were proven to be efficient. In order to benefit from these 35 years of IR techniques, our goal is to…

Our goal is… To adapt and apply the knowledge and experience of the IR community to our malicious behavior extraction problem.

Information Retrieval IR research has focused on the retrieval of text documents and images. It is based on extracting from each document a set of terms that distinguish this document from the other documents in the collection, and on measuring the relevance of a term in a document by a term weighting scheme.

Term weighting scheme in IR The term weight represents the relevance of a term in a document: the higher the term weight, the more relevant the term is in the document. A large number of weighting functions have been investigated; the TFIDF scheme is the most popular term weighting scheme in the IR community.

Basic TFIDF scheme The TFIDF term weight is measured from the occurrences of terms in a document and their appearances in the other documents:
w(i,j) = tf(i,j) × idf(i)
where w(i,j) is the weight of term i in document j, tf(i,j) is the frequency of term i in document j, and idf(i) = log(N / df(i)) is the inverse document frequency of term i, with df(i) the number of documents containing term i and N the size of the collection.
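As a quick illustration, a minimal Python sketch of the basic TFIDF weight; the toy documents are invented:

import math
from collections import Counter

docs = [["download", "file", "execute", "file"],
        ["open", "file", "read"],
        ["download", "file", "page"]]
N = len(docs)                      # N: size of the collection
df = Counter()                     # df(i): number of documents containing term i
for d in docs:
    df.update(set(d))

def tfidf(term, doc):
    tf = doc.count(term)           # tf(i,j): frequency of term i in document j
    return tf * math.log(N / df[term])   # w(i,j) = tf(i,j) x idf(i)

print(tfidf("download", docs[0]))  # rare across the collection -> weight > 0
print(tfidf("file", docs[0]))      # occurs in every document -> idf = 0, weight 0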

Properties of the TFIDF scheme Terms are the words of a document. Common words like "the", "a", "with", "of", etc., which can be found in every document, are irrelevant. The scheme ensures that a term is relevant to a document if it occurs frequently in this document and rarely appears in the other documents.

Basic TFIDF Scheme Issues Term frequencies are usually bigger for longer documents, and with raw frequencies a document with a very high tf for a single relevant term can be ranked ahead of documents that contain several relevant terms. The fix is to adjust the term frequency by a function F:
w(i,j) = F(tf(i,j)) × idf(i)
where F(tf) takes long-document normalization into account and ensures a high rank for relevant documents.

Some Functions of the Term Frequency (The slide shows a table of candidate functions for F(tf), referred to as F1, F2, F3, … below; the table is not reproduced in this transcript.) Depending on the application, one function can be better than the others.
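Since the slide's table is not reproduced, here is a hedged sketch of transformations commonly used in IR for F; these are standard choices and not necessarily the slide's exact F1, F2, F3, …:

import math

def f_raw(tf):
    return float(tf)                   # raw frequency: no dampening at all

def f_log(tf):
    return 1 + math.log(tf) if tf > 0 else 0.0   # dampens repeated occurrences

def f_augmented(tf, max_tf):
    return 0.5 + 0.5 * tf / max_tf     # normalizes by the most frequent term

# Any of these can play the role of F in w(i,j) = F(tf(i,j)) x idf(i).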

How to apply this to our graphs? The technique applies to documents, while we want to apply it to graphs:
Documents → Graphs
Terms are words → Terms are nodes or edges
Term weights of words → Term weights of nodes or edges
A relevant graph consists of relevant nodes and edges.

How to apply this to our graphs? The weight of a term (node or edge) i in graph j is computed by
w(i,j) = F(tf(i,j)) × idf(i)
where tf(i,j) is the frequency of term i in graph j, and idf(i) = log(N / df(i)) is the inverse graph frequency of term i, with df(i) the number of graphs containing term i and N the size of the collection.
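Transposed to graphs, the same computation might look as follows (a sketch; the encoding of graphs as API-name lists and API-name pairs is ours):

import math
from collections import Counter

def graph_terms(graph):
    """The terms of a graph: its API names (nodes) and API-name pairs (edges)."""
    nodes, edges = graph
    return Counter(nodes) + Counter(edges)

def weight(term, graph, graphs, F=lambda tf: tf):
    """w(i,j) = F(tf(i,j)) x idf(i), over a collection of graphs."""
    N = len(graphs)
    df = sum(1 for g in graphs if term in graph_terms(g))
    tf = graph_terms(graph)[term]
    return F(tf) * (math.log(N / df) if df else 0.0)

g1 = (["GetSystemDirectoryA", "URLDownloadToFileA", "WinExec"],
      [("GetSystemDirectoryA", "URLDownloadToFileA"),
       ("URLDownloadToFileA", "WinExec")])
g2 = (["MessageBoxA", "WinExec"], [("MessageBoxA", "WinExec")])
print(weight(("URLDownloadToFileA", "WinExec"), g1, [g1, g2]))  # edge only in g1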

Relevance of a term in a graph We have the malware graph set M (API call graphs m1, m2, …) and the benware graph set B (API call graphs b1, b2, …). Given a term i (node or edge), what is the relevance of term i to a graph mj, and to a graph bj?

Relevance of a term in a set Given a term i (node or edge), what is the relevance of term i in the malware set M as a whole, and in the benware set B as a whole?

Relevance of a term w.r.t. M and B Given a term i (node or edge), how is it relevant in M and not in B? We want to compute a weight W(i,M,B) that is high when W(i,M) is high and W(i,B) is low.

Relevance of a term w.r.t. M and B: Rocchio weight The Rocchio weight is measured by the distance between the weight of i in the set M and its weight in the set B, with term weights normalized by the sizes of the sets, and coefficients α and β to adjust the effect of the term weights in M and in B:
W(i,M,B) = (α / |M|) Σ_{m ∈ M} w(i,m) − (β / |B|) Σ_{b ∈ B} w(i,b)
W(i,M,B) is high if W(i,M) is high and W(i,B) is low.

Relevance of a term w.r.t. M and B: Ratio weight The Ratio weight is measured by the ratio of the weight of term i in M to its weight in B, a kind of quotient between W(i,M) and W(i,B), again with term weights normalized by the sizes of the sets:
W(i,M,B) = W(i,M) / (W(i,B) + 1)
where the +1 in the denominator avoids a division by zero in case W(i,B) = 0. W(i,M,B) is high if W(i,M) is high and W(i,B) is low.
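A sketch of both set-level weights, reusing weight() from the previous sketch; the slides' exact normalization and coefficients are not reproduced, so this is one plausible reading:

def set_weight(term, graphs, all_graphs, F=lambda tf: tf):
    # W(i,S): sum of per-graph weights, normalized by the size of the set S.
    return sum(weight(term, g, all_graphs, F) for g in graphs) / len(graphs)

def rocchio(term, M, B, alpha=1.0, beta=1.0, F=lambda tf: tf):
    # Distance between the weight of the term in M and its weight in B.
    both = M + B
    return alpha * set_weight(term, M, both, F) - beta * set_weight(term, B, both, F)

def ratio(term, M, B, F=lambda tf: tf):
    # Quotient of the two weights; the +1 avoids division by zero when W(i,B) = 0.
    both = M + B
    return set_weight(term, M, both, F) / (set_weight(term, B, both, F) + 1.0)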

Relevance of a term w.r.t. M and B For each term (node or edge) i, a high weight W(i,M,B) means that term i is relevant to M and not to B. How do we use these term weights to extract malicious graphs?

Construct malicious API graphs A malicious API graph consists of the nodes and edges with the highest weights. The question is how to link these nodes and edges into a graph; there are different possibilities for computing such a graph.

Strategy S0 Take the n nodes with the highest weight, for n given by the user. Then choose the outgoing edges with the highest weight to connect these nodes. (The slide illustrates this on an example graph with n = 3; see the sketch after strategy S3.)

Strategy S1 Take the n nodes with the highest weight, for n given by the user. Then choose the edges with the highest weight that start from one of these nodes, possibly adding the nodes they reach. (Illustrated on an example graph with n = 3.)

Strategy S2 Take the n nodes with the highest weight, for n given by the user. Then connect each pair of these nodes by the path with the highest weight. (Illustrated on an example graph with n = 3.)

Strategy S3 Take the n edges with the highest weight, for n given by the user. (Illustrated on an example graph with n = 3.)
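As an illustration, one reading of strategy S0 in Python (a sketch; "connect" is interpreted as keeping, for each selected node, its best outgoing edge into the selected set, and all weights below are hypothetical):

def strategy_s0(node_w, edge_w, n):
    top = sorted(node_w, key=node_w.get, reverse=True)[:n]   # n best nodes
    keep = set(top)
    graph = []
    for u in top:
        # outgoing edges of u that land on another selected node, best first
        out = sorted(((w, v) for (a, v), w in edge_w.items()
                      if a == u and v in keep and v != u), reverse=True)
        if out:
            graph.append((u, out[0][1]))
    return graph

node_w = {"GetSystemDirectoryA": 3.1, "URLDownloadToFileA": 2.9,
          "WinExec": 2.5, "MessageBoxA": 0.1}
edge_w = {("GetSystemDirectoryA", "URLDownloadToFileA"): 2.0,
          ("URLDownloadToFileA", "WinExec"): 1.8,
          ("MessageBoxA", "WinExec"): 0.2}
print(strategy_s0(node_w, edge_w, n=3))
# [('GetSystemDirectoryA', 'URLDownloadToFileA'), ('URLDownloadToFileA', 'WinExec')]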

Summary For each term (node or edge) i, the higher the weight W(i,M,B), the more relevant term i is to M and not to B. We use these weights to compute malicious API graphs with the different strategies above.

How to detect malwares? From the training set (malwares + benwares) we extract the malicious API graphs. Given a new program, we build its API call graph and check for common paths with the malicious API graphs: does the program contain any malicious behavior? If yes, it is a malware; if no, it is a benware.
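A sketch of the common-path check, under the assumption that "common path" means that every edge of some malicious path also occurs in the program's API call graph (the precise matching criterion of the paper may differ):

def has_common_path(program_edges, malicious_paths):
    prog = set(program_edges)
    for path in malicious_paths:               # a path is a list of API names
        edges = list(zip(path, path[1:]))
        if edges and all(e in prog for e in edges):
            return True                        # malicious behavior found
    return False

malicious_paths = [["GetSystemDirectoryA", "URLDownloadToFileA", "WinExec"]]
new_program = [("GetSystemDirectoryA", "URLDownloadToFileA"),
               ("URLDownloadToFileA", "WinExec"),
               ("MessageBoxA", "GetStdHandle")]
print("Malware" if has_common_path(new_program, malicious_paths) else "Benware")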

Experiments We apply our method to a dataset of 1980 benign programs and 3980 malwares collected from VX Heaven. The training set consists of 1000 benwares and 2420 malwares, used to extract the malicious graphs. The test set consists of 980 benwares and 1560 malwares, used to evaluate the malicious graphs. Since we have no a priori evidence that one formula performs better than the others, we evaluate the different formulas and strategies; the results below are the best ones for each combination.

Performance Measurement Recall = (number of relevant items retrieved) / (total number of relevant items); high recall means that most of the relevant items were computed. In our setting, recall is the detection rate. Precision = (number of relevant items retrieved) / (total number of items retrieved); high precision means that the technique computes more relevant items than irrelevant ones.

Performance Measurement The F-Measure is the harmonic mean of precision and recall: F = 2 × Precision × Recall / (Precision + Recall). The F-Measure is 1 if all retrieved items are relevant and all relevant items have been retrieved. For example, a precision of 0.8 and a recall of 0.9 give F = 2 × 0.72 / 1.7 ≈ 0.85.

Evaluating the performance of the different strategies (The slide shows a table with the best performance of each strategy; the table is not reproduced in this transcript.)

Evaluating the performance of the different strategies The best performance is obtained with the Rocchio weight, strategy S0, formula F3, and n = 85. For the next experiment, we use this configuration (the best one) to detect new malwares.

Comparison with well-known antiviruses Goal: detect new, unknown malwares. Since antiviruses can easily update their malware signatures, a fair comparison requires fresh samples: we generated 180 new malwares with NGVCK, RCWG and VCL32, the best-known virus generators, and collected 32 new malwares from the Internet*.
* https://malwr.com/

Comparison with well-known antiviruses With our method, we detect all of the new malwares, while no well-known antivirus detects all of them. (The slide shows a comparison table of our method against well-known antiviruses, not reproduced in this transcript.)

Summary We apply the TFIDF scheme to automatically extract malicious behaviors from a collection of malwares and benwares. We compare different formulas and strategies. The detection rate is 99.04%. Our tool is able to detect malwares that well-known antiviruses could not detect.