A Malware Similarity Testing Framework

Variant: A Malware Similarity Testing Framework

Purpose
- Define a standard dataset for use in malware variant testing
- Align malware variant detection with the broader binary classification field
- Test current variant detection tools against the proposed solution

Setting
- Sources are not testing datasets: VirusTotal, AV, open source, malware code
- Derivations of these sources are poor testing datasets
  - Testing against poor results (AV signatures)
  - Lack of breadth in code modification

Previous Work
- Variant detection papers (BitShred, TLSH, FRASH)
  - Derived from AV signatures or code
  - Varied sources; not reproducible with accuracy
- File similarity (FRASH, CTPH)
  - Not all variation is as complex as malware

Hypothesis
- A static, reproducible malware dataset based on human grouping will provide more critical testing of proposed variant detection engines
- Benefits: static, reproducible, results based on the best known classification

Findings

Deriving Datasets through Algorithms
- Selection of sets from malware sources, via antivirus identification
- Varying source code: use available source code, vary it by algorithm, and compile

What are we testing against?
- Reproducing AV signature results? That is reproduction of a flawed system.
- Detecting a few untested, constructed variant engines? What about real-world breadth?
- Can we reproduce the dataset for further testing and comparison?

Gold Standard Dataset
- Representative of real-world data
- Knowledge of the dataset derived from the best available source
- Tests enable reproduction for peer review and further comparison
- Samples are real, wild malware
- Knowledge of the dataset is derived from manual analysis
- Dataset is static and its information is known

Alignment with the Broader Field
- Malware variant detection is binary/statistical classification, yet the field has disparate measurements and terms
- Alignment of nomenclature
  - Apples-to-apples comparison against other malware projects
  - Apples-to-apples comparison with broader statistical classification projects
  - Removal of ambiguous terms, e.g. "accuracy" (the aligned metrics are sketched below)
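The aligned measurements are the standard binary classification metrics. A minimal sketch over pairwise decision counts (the counts here are hypothetical, not results from the framework):

```python
def precision_recall_fmeasure(tp: int, fp: int, fn: int):
    """Compute the aligned metrics from pairwise decision counts:
    tp = variant pairs correctly matched, fp = non-variant pairs
    wrongly matched, fn = variant pairs missed."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fmeasure = (2 * precision * recall / (precision + recall)
                if (precision + recall) else 0.0)
    return precision, recall, fmeasure

# Hypothetical example: 90 correct matches, 10 false matches, 30 misses
p, r, f = precision_recall_fmeasure(90, 10, 30)
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

One reason "accuracy" is ambiguous here: it is inflated by the huge number of true-negative pairs that dominate any pairwise malware comparison, whereas precision, recall, and F-measure are not.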

Measurements

Dataset
- Group 1: Ziyang RAT – 12 samples
- Group 2: LinseningSvr – 19 samples
- Group 3: BeepService – 20 samples
- Group 4: SimpleFileMover – 13 samples
- Group 5: DD Keylogger – 5 samples
- Group 6: PUP – 10 samples
- Group 7: Unspecified Backdoor – 3 samples
- Group 8: SvcInstaller – 3 samples

Dataset (cont.)
- Manually analyzed: best possible information
- Static: reproducible results
- Small, but can be grown

Candidate Solutions
- CTPH (fuzzy hash, ssdeep, as published): triggered, n-gram, raw input, pairwise comparisons
- TLSH (as published): selective, n-gram, raw input, LSH comparisons
- sdhash (as published): full, n-gram, raw input, pairwise comparisons
- BitShred (re-implemented): full, n-gram, section input, pairwise comparisons
- FirstByte (in house): selective, n-gram, normalized input, LSH comparisons
(A simplified sketch of the shared n-gram, pairwise-comparison pattern follows below.)
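All five candidates share the high-level pattern of extracting n-gram features and comparing signatures. A deliberately simplified sketch of that shared pattern (none of the published tools is this naive; the sample names and bytes are hypothetical):

```python
from itertools import combinations

def ngram_features(data: bytes, n: int = 4) -> set:
    """Extract the set of byte n-grams from raw input."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two feature sets, scaled 0-100."""
    return 100.0 * len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical samples: in practice these would be raw file bytes
samples = {"sample_a": b"\x55\x89\xe5\x83\xec\x10",
           "sample_b": b"\x55\x89\xe5\x83\xec\x20"}
feats = {name: ngram_features(data) for name, data in samples.items()}
for x, y in combinations(sorted(feats), 2):  # pairwise comparison
    print(x, y, jaccard(feats[x], feats[y]))
```

The real tools differ in which n-grams they keep (triggered, selective, or full), what input they run over (raw bytes, sections, or normalized code), and how signatures are compared (all pairs vs. LSH lookup), which is exactly the taxonomy the slide uses.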

Limiting and Equal Footing in Measurements
- 2x2 options of FirstByte
  - Recursive disassembly vs. linear sweep disassembly
  - Library filtering on/off
- Selection of linear sweep, no library filtering (L-noLib)
  - Faster signature generation
  - Near the performance curve of R-noLib
(A generic linear-sweep sketch follows below.)
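For readers unfamiliar with the distinction: linear sweep decodes instructions sequentially through a code region, while recursive disassembly follows control flow from known entry points. A generic linear-sweep sketch using the Capstone library (illustrative only, not FirstByte's implementation; the input bytes are a hypothetical x86 function, and `pip install capstone` is assumed):

```python
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

# Hypothetical x86 bytes: push ebp; mov ebp, esp; sub esp, 0x10;
# xor eax, eax; leave; ret
code = b"\x55\x89\xe5\x83\xec\x10\x31\xc0\xc9\xc3"

md = Cs(CS_ARCH_X86, CS_MODE_32)
# Linear sweep: decode every instruction in order from the start,
# without following branches or call targets.
for insn in md.disasm(code, 0x1000):
    print(f"0x{insn.address:x}: {insn.mnemonic} {insn.op_str}")
```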

Limiting and Equal Footing in Measurements (two figure slides; charts not reproduced in the transcript)

Limiting and Equal Footing in Measurements: TLSH Bounding
- TLSH is a distance measurement, not a similarity; its authors argue distance is a better approach
- Options: change four other projects, or change TLSH
- The authors state a distance of 300 is very dissimilar, so:
  - Sim = (300 - Distance) / 3
  - Bound Sim < 0 to 0
(The conversion is sketched in code below.)
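The bounding rule maps directly to a few lines of code; a minimal sketch of the conversion stated on the slide (the 300 cut-off comes from the TLSH authors' guidance quoted above):

```python
def tlsh_distance_to_similarity(distance: int) -> float:
    """Convert a TLSH distance into a bounded similarity score.

    Distance 0 (identical digests) maps to similarity 100;
    distances of 300 or more ("very dissimilar") are clamped to 0.
    """
    return max(0.0, (300 - distance) / 3)

print(tlsh_distance_to_similarity(0))    # 100.0
print(tlsh_distance_to_similarity(150))  # 50.0
print(tlsh_distance_to_similarity(400))  # 0.0 (bounded)
```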

F-measure over Threshold (figure; chart not reproduced in the transcript. A sketch of how such a curve is computed follows.)
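How such a curve is typically produced (a generic sketch, not the framework's actual evaluation harness; the pairwise scores and ground-truth labels are hypothetical): sweep the decision threshold over the similarity scores and record the F-measure against the human grouping at each step.

```python
def fmeasure_over_thresholds(scores, labels, thresholds):
    """Sweep a similarity threshold; return (threshold, F-measure) points.

    scores: one similarity score (0-100) per sample pair
    labels: True where the pair is a known variant pair (human grouping)
    """
    curve = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        curve.append((t, 2 * p * r / (p + r) if (p + r) else 0.0))
    return curve

# Hypothetical pairwise scores and ground-truth labels
scores = [95, 80, 60, 40, 20, 10]
labels = [True, True, True, False, False, False]
for t, f in fmeasure_over_thresholds(scores, labels, range(0, 101, 10)):
    print(f"threshold={t:3d} f-measure={f:.2f}")
```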

ROC Curve at Peak Threshold (figure; chart not reproduced in the transcript)

Peak Recall and Precision (figure; chart not reproduced in the transcript)

Signature Generation Performance (figure; chart not reproduced in the transcript)

Comparison Performance (figure; chart not reproduced in the transcript)

Conclusion - Inconsistencies
- A common dataset used in testing reveals inconsistencies in other tests
- Most easily attributed to dataset generation techniques

Conclusion - Reproducibility and Alignment
- The dataset can be reproduced exactly
- The dataset represents a gold standard approach: best known (human) information vs. the tested project
- Measurements are aligned with the greater binary classification field: recall, precision, F-measure

Conclusion - Comparative Results
- Top ranked overall speed: TLSH
  - 2.5x slower signature generation than sdhash
  - n log n all-pairs comparison easily overcomes the slower signature generation
- Top ranked precision and recall: FirstByte
  - 42% better than BitShred in F-measure
  - 95% better than TLSH in F-measure
  - Limited to 365K signatures/node/day
- Top ranked if signature generation rate is a concern (365K/node/day limit):
  - BitShred, if n^2 comparisons are not a concern
  - TLSH, if n^2 comparisons are a concern
(The n^2 vs. n log n trade-off is sketched below.)
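The comparison-cost trade-off behind these rankings, in a hedged sketch (generic LSH-style bucketing, not TLSH's actual index; the signatures are hypothetical hex strings): all-pairs comparison must touch every one of the n(n-1)/2 pairs, while an LSH index only compares signatures that collide in a bucket.

```python
from collections import defaultdict
from itertools import combinations

def all_pairs_count(signatures):
    """O(n^2): every pair of signatures is compared."""
    return len(list(combinations(signatures, 2)))

def lsh_candidate_pairs(signatures, bucket_key):
    """LSH-style: only signatures that collide in a bucket are compared."""
    buckets = defaultdict(list)
    for sig in signatures:
        buckets[bucket_key(sig)].append(sig)
    return [pair for group in buckets.values()
            for pair in combinations(group, 2)]

# Hypothetical signatures (hex strings); bucket on a 2-character prefix
sigs = ["a1f3", "a1f9", "b2c4", "b2c5", "c0d1"]
print(all_pairs_count(sigs))                            # 10 comparisons
print(len(lsh_candidate_pairs(sigs, lambda s: s[:2])))  # 2 candidate pairs
```

This is why a slower per-signature generation step can still win overall once the comparison stage dominates.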

Impact: Gold Standard
A static, reproducible dataset based on human classification is superior for assessing malware variant detection techniques, in that it is more representative of the "best known" classification requirement of a "Gold Standard Dataset".

Broader Contributions
- Testing for malware variant binary classification to date is suspect
- Inaccuracies in existing testing

Summary and Conclusions
- Variant reveals much lower recall and precision scores across projects
- Variant is reproducible
- Variant represents an evaluation of the ability to reproduce human results
- Variant aligns the field with binary classification
- Variant tests multiple tools under the same conditions and can be used in future tests

Remaining Questions and Future Work
- Growing the dataset
- Open contributions
- Retesting of proposed works

Jason Upchurch Jason.R.Upchurch@Intel.com