Statistical Tools for Linking Engine-generated Malware to its Engine Edna C. Milgo M.S. Student in Applied Computer Science TSYS School of Computer Science.


Statistical Tools for Linking Engine-generated Malware to its Engine Edna C. Milgo M.S. Student in Applied Computer Science TSYS School of Computer Science Columbus State University November 19 th, 2009

Malware: State of the Threat
- Anti-virus firms have to analyze between 15,000 and 20,000 new malware instances a day. [AV Test Lab 2007, McAfee 2009]
- 1.6 million malware instances were detected in 2008. [F-Secure 2008]
- Professionals are being recruited to write stealthier malware. [ESET 2008]
- Automation is being used to generate malware. [ESET 2008]
- Generic detection performs poorly: 630 out of 1,000 instances of new malware went unnoticed. [Team Cymru 2008]
- Classes: viruses, worms, Trojans
- Malware-generating engines: script kiddies, morphers, metamorphic engines, and virus-generating toolkits

Malware is Hard to Detect
- Static program analysis may be imprecise and inefficient (e.g., def-use analysis).
- Static program analysis may be challenged by obfuscation, e.g., dead code inserted under an opaque predicate such as if (x*x*(x+1)*(x+1) % 4 == 0), which always holds.
- Dynamic program analysis may be challenged by malware that tests the patience of the emulator. [Aycock2005]
- Conclusion: stop early in the program analysis pipeline (disassembly, extract procedures, control flow graphs, signature verification, then a malicious/benign verdict on the suspect program).

Engine-Generated Malware
[Diagram: an ENGINE takes Variant 0 as input and outputs Variants 1, 2, 3, …, n, each of which must be handled by the MALWARE DETECTOR]
- The engine generates new variants at a high rate.
- Malware detectors typically store one signature per variant.
- Too many signatures challenge the detector.

Proposal: View the Engine as an Author
- Goal 1: Reduce the number of steps required in the program analysis pipeline.
- Goal 2: Eliminate the need for a signature per variant.
- Goal 3: Remain satisfactorily accurate.
Proposed model [Chouchane2006]: the ENGINE produces Variants 1 through n, and the MALWARE DETECTOR screens them against a single engine signature.
The proposed approach was inspired by Keselj's work on authorship analysis of natural-language text produced by humans. [Keselj2003]

Feature 1: Instruction Frequency Vector (IFV)
Example program P (opcode sequence):
add, push, pop, add, and, jmp, pop, and, mov, jmp, mov, push, jmp, jmp, push, jmp, add, pop, mov, add, mov, push, jmp, mov, mov, jmp, push
IFV(P), counts over the alphabet (mov, push, add, and, jmp, pop): (6, 5, 4, 2, 7, 3)
Normalized IFV(P), each count divided by the 27 instructions of P: approximately (0.222, 0.185, 0.148, 0.074, 0.259, 0.111)
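As a sketch of how this feature could be computed (not taken from the slides; the six-opcode alphabet and normalization by program length follow the example above):

```python
from collections import Counter

def instruction_frequency_vector(opcodes, alphabet):
    """Count each opcode of interest and normalize by the program length."""
    counts = Counter(opcodes)
    total = len(opcodes)
    return [counts[op] / total for op in alphabet]

# The example program P from the slide, as its opcode sequence.
P = ("add push pop add and jmp pop and mov jmp mov push jmp jmp push "
     "jmp add pop mov add mov push jmp mov mov jmp push").split()

ALPHABET = ["mov", "push", "add", "and", "jmp", "pop"]
ifv = instruction_frequency_vector(P, ALPHABET)
# Raw counts are (6, 5, 4, 2, 7, 3) over the 27 instructions of P.
```

Because every instruction of P belongs to the alphabet here, the normalized components sum to 1.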

IFV Classification
1. Given a sample of malicious programs and a sample of benign ones, select a set of trainers from each sample.
2. Compute the IFVs of all trainers.
3. Choose a threshold ε.
4. Input: IFV_suspect, where "suspect" is a program that is not among the trainers.
5. Count the number of malicious training IFVs within ε of IFV_suspect, under the chosen distance measure.
6. Count the number of benign training IFVs within ε of IFV_suspect.
7. Output: the family with the most trainers within ε is declared to be that of the suspect program. If there is a tie, pick one at random.
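A minimal sketch of steps 5 through 7, assuming Euclidean distance (the slides leave the distance measure open):

```python
import math
import random

def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify_by_threshold(suspect_ifv, malicious_ifvs, benign_ifvs, eps):
    """Count training IFVs within eps of the suspect, per family (steps 5-6),
    and return the family with the larger count (step 7)."""
    n_mal = sum(1 for t in malicious_ifvs if euclidean(suspect_ifv, t) <= eps)
    n_ben = sum(1 for t in benign_ifvs if euclidean(suspect_ifv, t) <= eps)
    if n_mal > n_ben:
        return "malicious"
    if n_ben > n_mal:
        return "benign"
    return random.choice(["malicious", "benign"])  # tie broken at random
```

For example, a suspect IFV of (0.85, 0.15) with malicious trainers near (0.9, 0.1) and a benign trainer at (0.1, 0.9) is classified as malicious for ε = 0.2.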

Experimental Setup
- Metamorphic malware (from vx.netlux.org): W32.Simile (100 samples)
- Benign programs (from download.com, sourceforge.net): 100 samples
Thanks to Jessica Turner for extracting the original variant of W32.Simile.

Classifying W32.Simile vs. Benign Programs
- RI is the number of instructions considered in the IFV.
- For RI = 4 and 0.1 ≤ ε ≤ 0.7: 98% ≤ accuracy ≤ 100%.
- For RI = 5 and 0.1 ≤ ε ≤ 0.7: 96% ≤ accuracy ≤ 100%.
- Very small signatures (4 and 5 doubles per IFV), but this approach does not use a single signature per family.

Feature 2: N-gram Frequency Vector (NFV)
Example program P (opcode sequence):
add, push, call, pop, call, add, push, call, pop, call, add, mov, add, add, mov, add, add, mov, add, push, call, push, call, call, pop, call, push, mov, add, mov, add, push, call, pop, pop, call, pop, call, pop, call, mov, add, mov, add
NFV(P), counts over the selected bigrams (add-push, push-call, call-pop, pop-call, call-mov, mov-add): (4, 5, 6, 6, 1, 7)
Normalized NFV(P): each count divided by the 43 bigrams of P.
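The bigram counting could be sketched as follows (not from the slides; normalization by the total number of bigrams in P is an assumption mirroring the IFV example):

```python
from collections import Counter

def ngram_frequency_vector(opcodes, selected, n=2):
    """Count each selected n-gram and normalize by the total number of
    n-grams in the program."""
    grams = Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))
    total = len(opcodes) - n + 1
    return [grams[g] / total for g in selected]

# The example program P from the slide, as its opcode sequence.
P = ("add push call pop call add push call pop call add mov add add mov add add "
     "mov add push call push call call pop call push mov add mov add push call "
     "pop pop call pop call pop call mov add mov add").split()

BIGRAMS = [("add", "push"), ("push", "call"), ("call", "pop"),
           ("pop", "call"), ("call", "mov"), ("mov", "add")]
nfv = ngram_frequency_vector(P, BIGRAMS)
# Raw counts are (4, 5, 6, 6, 1, 7) over the 43 bigrams of P.
```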

N-Gram Authorship Attribution (Proposed)
1. Choose a set of trainers from each of the families (Benign, Simile, Evol, VCL, NGVCK).
2. For each family, average the NFVs of the family's trainers to obtain a Family Signature (FS_B, FS_S, FS_E, FS_V, FS_N).
3. Input: NFV_suspect, where "suspect" is a program that is not among the trainers.
4. Compute the distances D_B, D_S, D_E, D_V, D_N between each family signature and NFV_suspect.
5. Output: the suspect program is classified into the family achieving MIN(D_B, D_S, D_E, D_V, D_N). If there are ties, choose one at random.
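A sketch of steps 2, 4, and 5, again assuming Euclidean distance; the family names are supplied by the caller rather than fixed by the slides:

```python
import math

def family_signature(trainer_nfvs):
    """Average the NFVs of a family's trainers (step 2)."""
    n = len(trainer_nfvs)
    return [sum(col) / n for col in zip(*trainer_nfvs)]

def classify_by_signature(suspect_nfv, signatures):
    """Return the family whose signature is closest to the suspect
    (steps 4-5); `signatures` maps family name -> family signature."""
    def dist(sig):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(suspect_nfv, sig)))
    return min(signatures, key=lambda fam: dist(signatures[fam]))
```

For example, averaging two toy Simile trainers (0.4, 0.6) and (0.2, 0.8) gives the signature (0.3, 0.7), and a suspect at (0.25, 0.75) lands closer to it than to a benign signature at (0.9, 0.1).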

k-NN Classification
1. Given samples from each family (NGVCK, Evol, VCL, Simile, Benign), select a set of trainers from each sample.
2. Choose a value for k.
3. Input: NFV_suspect of the suspect program.
4. Find the k training NFVs closest to NFV_suspect under the chosen distance measure (its k nearest neighbors).
5. Output: the suspect program is classified into the family with the most neighbors among the k. If there are ties, choose one at random.
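The k-NN procedure can be sketched as below, again with Euclidean distance as an assumed measure (ties, which the slide breaks at random, are resolved deterministically here for brevity):

```python
import math
from collections import Counter

def knn_classify(suspect_nfv, trainers, k):
    """`trainers` is a list of (nfv, family) pairs; a majority vote over
    the k nearest training NFVs decides the suspect's family."""
    def dist(t):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(suspect_nfv, t[0])))
    nearest = sorted(trainers, key=dist)[:k]
    votes = Counter(family for _, family in nearest)
    return votes.most_common(1)[0][0]
```

With toy trainers at (0.0, 1.0) and (0.1, 0.9) labeled Benign and one at (0.9, 0.1) labeled NGVCK, a suspect near (0.05, 0.95) is voted Benign for k = 3.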

Experimental Setup
- Metamorphic malware (from vx.netlux.org): W32.Simile + W32.Evol (100 samples each)
- Malware generation toolkits (from vx.netlux.org): VCL + NGVCK (100 samples each)
- Benign programs (from download.com, sourceforge.net): 100 samples
Thanks to Yasmine Kandissounon for collecting the NGVCK and VCL variants.

Ten-fold Cross-Validation
- Divide each family into a training set of 90 instances and a testing set of 10 instances.
- Perform 10-fold cross-validation, using a new testing set and a new training set each time.
- The cross-validation accuracy is the average accuracy across all ten folds.
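The folding scheme above can be sketched as follows; `train_and_score` is a hypothetical callback standing in for any of the classifiers described earlier, not a function defined by the slides:

```python
def ten_fold_cross_validation(family_samples, train_and_score):
    """`family_samples` maps family -> list of 100 feature vectors.
    `train_and_score(train, test)` returns the accuracy on `test`; both
    arguments map family -> list of vectors. Returns the mean accuracy."""
    accuracies = []
    for fold in range(10):
        # Each fold holds out a disjoint block of 10 instances per family.
        test = {fam: vecs[fold * 10:(fold + 1) * 10]
                for fam, vecs in family_samples.items()}
        train = {fam: vecs[:fold * 10] + vecs[(fold + 1) * 10:]
                 for fam, vecs in family_samples.items()}
        accuracies.append(train_and_score(train, test))
    return sum(accuracies) / len(accuracies)
```

Each fold sees 90 training and 10 testing instances per family, matching the split described above.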

Bigram Selection (Relevant Instructions)
- RI is the number of most relevant instructions across the samples used to construct the features.
- Best accuracy: 85%, for RI = 3, RI = 4, and RI = 9.

Bigram Selection (Relevant Bigrams)
- RB is the number of most relevant bigrams across the samples used to construct the features.
- Best accuracy: 95%, for a signature of 17 doubles.
- Accuracies of 94.8% for 6, 8, and 14 doubles.
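The slides do not define the relevance measure; the sketch below assumes the simplest one, overall bigram frequency across the training samples, to illustrate selecting the top-RB bigrams:

```python
from collections import Counter

def top_relevant_bigrams(training_programs, rb):
    """Pick the RB bigrams occurring most often across all training
    programs (frequency as a stand-in for the slides' relevance measure)."""
    counts = Counter()
    for opcodes in training_programs:
        counts.update(zip(opcodes, opcodes[1:]))
    return [gram for gram, _ in counts.most_common(rb)]
```

The returned bigram list then fixes the NFV dimensions used for every trainer and suspect.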

Successful Evaluation
A single, small family signature of 17 doubles per family induced a 95% detection accuracy:
- W32.Simile engine signature = (0.190, 0.030, 0.155, 0.048, 0.043, 0.057, 0.063, 0.020, 0.076, 0.022, 0.0, 0.041, 0.109, 0.0, 0.122, 0.022, 0.0)
- W32.Evol engine signature = (0.074, 0.026, 0.006, 0.326, 0.208, 0.014, 0.024, 0.073, 0.043, 0.048, 0.0, 0.071, 0.042, 0.0, 0.026, 0.019, 0.0)
- W32.VCL engine signature = (0.111, 0.238, 0.142, 0.027, 0.076, 0.063, 0.063, 0.033, 0.009, 0.018, 0.018, 0.054, 0.042, 0.0, 0.040, 0.052, 0.013)
- W32.NGVCK engine signature = (0.132, 0.113, 0.106, 0.048, 0.203, 0.018, 0.055, 0.038, 0.022, 0.017, 0.070, 0.122, 0.007, 0.0, 0.007, 0.020, 0.017)
- Benign "engine signature" = (0.165, 0.173, 0.091, 0.061, 0.052, 0.060, 0.052, 0.046, 0.060, 0.028, 0.019, 0.043, 0.024, 0.029, 0.02, 0.031, 0.029)
A single, small family signature of 6 doubles per family induced a 94.8% detection accuracy:
- W32.Simile engine signature = (0.362, 0.058, 0.295, 0.093, 0.082, …)
- W32.Evol engine signature = (0.113, 0.039, 0.010, 0.497, 0.319, …)
- W32.VCL engine signature = (0.176, 0.358, 0.212, 0.041, 0.115, …)
- W32.NGVCK engine signature = (0.212, 0.182, 0.171, 0.078, 0.327, …)
- Benign "engine signature" = (0.265, 0.279, 0.147, 0.102, 0.098, …)

Successful Evaluation (cont'd): Re-examining Our Goals
[Diagram: the MALWARE-GENERATING ENGINE emits Variants 1, 2, 3, …, n, which the MALWARE DETECTOR screens using a single engine signature (ES)]
- Goal 1: Simplified analysis. Analysis involves only the disassembly and signature verification stages of the program analysis pipeline.
- Goal 2: One signature per family (the Family Signature).
- Goal 3: Accuracy of 95% using only 17 doubles as a signature.

Directions for Future Work
- Experiment with other malware instances and families. Address the scalability issue.
- Experiment with other feature selection methods. Could we do better than 95% for a signature of 17 doubles?
- Try other classifiers and other distance measures.
- Try byte NFVs instead of opcode NFVs, to take into account malware that comes only as a binary.
- Import existing forensic linguistics methods into malware detection.

References
A paper documenting this work has been submitted for possible publication in the Journal in Computer Virology.
- E. Milgo. A Fast Approximate Detection of Win.32 Simile Malware. Columbus State University Colloquium Series, Feb. 2009; Best Paper Award, 2nd Place, Master's category, ACM MidSE 2008.
- M. R. Chouchane and A. Lakhotia. Using Engine Signature to Detect Metamorphic Malware. WORM 2006.
- M. R. Chouchane. Approximate Detection of Machine-morphed Malware. Ph.D. Dissertation, University of Louisiana at Lafayette, 2008.
- P. Szor. The Art of Computer Virus Research and Defense. 2005.
- J. D. Aycock. Computer Viruses and Malware. 2006.
- V. Keselj, F. Peng, N. Cercone, and C. Thomas. N-gram-based Author Profiles for Authorship Attribution. PACLING 2003.
- T. Abou-Assaleh, N. Cercone, V. Keselj, and R. Sweidan. N-gram-based Detection of New Malicious Code. COMPSAC 2004.