1 Limits of Learning-based Signature Generation with Adversaries
Shobha Venkataraman, Carnegie Mellon University
Avrim Blum, Carnegie Mellon University
Dawn Song, University of California, Berkeley

2 Signatures
Signature: a function that acts as a classifier.
- Input: a byte string
- Output: is the byte string malicious or benign?
e.g., signature for the Lion worm: "\xFF\xBF" && "\x00\x00\xFA"
A toy example, "aaaa" && "bbbb":
- if both tokens are present in a byte string: MALICIOUS
- if either one is not present: BENIGN
This talk focuses on signatures that are sets of byte patterns, i.e., a signature is a conjunction of byte patterns. Our results for conjunctions imply results for more complex functions, e.g., regular expressions of byte patterns. (A minimal classifier sketch follows.)
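
The slide's conjunction signature, written as a tiny classifier (a minimal sketch using the toy tokens above; the token values and function name are illustrative, not a real worm signature):

```python
TOY_SIGNATURE = [b"aaaa", b"bbbb"]  # conjunction of byte patterns

def classify(byte_string: bytes) -> str:
    """MALICIOUS iff every token of the conjunction appears in the string."""
    if all(token in byte_string for token in TOY_SIGNATURE):
        return "MALICIOUS"
    return "BENIGN"

assert classify(b"xx aaaa yy bbbb zz") == "MALICIOUS"
assert classify(b"xx aaaa yy") == "BENIGN"  # one token missing: benign
```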

3 Automatic Signature Generation
Generating signatures automatically is important:
- signatures need to be generated quickly,
- manual analysis is slow and error-prone.
Pattern-extraction techniques generate signatures from a training pool: the signature generator takes malicious strings and normal strings as input, and outputs a signature for use, e.g., 'aaaa' && 'bbbb'. (A sketch of the extraction step follows.)
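
A minimal Polygraph-style conjunction extractor, to make the pattern-extraction step concrete (a sketch under simplifying assumptions; the fixed token length and the frequency threshold are invented here, not taken from any particular system):

```python
def extract_signature(malicious, normal, token_len=4, max_normal_freq=0.01):
    """Tokens present in every malicious string and rare in the normal pool."""
    def tokens(s: bytes) -> set:
        return {s[i:i + token_len] for i in range(len(s) - token_len + 1)}

    # Candidate tokens: those common to all malicious samples ...
    common = set.intersection(*(tokens(s) for s in malicious))

    # ... kept only if they are rare in normal traffic.
    def normal_freq(tok: bytes) -> float:
        return sum(tok in s for s in normal) / max(len(normal), 1)

    return {tok for tok in common if normal_freq(tok) <= max_normal_freq}
```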

4 History of Pattern-Extraction Techniques
Signature generation systems: EarlyBird [SEVS], Autograph [KK], Honeycomb [KC], Polygraph [NKS], Hamsa [LSCCK], Anagram [WPS], ...
Evasion techniques: polymorphic worms, malicious noise injection [PDLFS], Paragraph [NKS], allergy attacks [CM], ...
Our work: lower bounds on how quickly ALL such algorithms converge to the signature in the presence of adversaries.

5 Learning-based Signature Generation
The signature generator learns a signature from a training pool of malicious and normal strings, and the signature is then evaluated on a test pool.
Signature generator's goal: learn as quickly as possible.
Adversary's goal: force as many errors as possible.

6 Our Contributions
Formalize a framework for analyzing the performance of pattern-extraction algorithms under adversarial evasion:
- show fundamental limits on the accuracy of pattern-extraction algorithms under adversarial evasion,
- generalizing earlier work (e.g., [PDLFS], [NKS], [CM]) that focused on individual systems.
Analyze when these fundamental limits are weakened:
- characterize the kinds of exploits for which pattern-extraction algorithms may work.
Our results also apply to other learning-based algorithms that use similar adversarial information (e.g., COVERS [LS]).

7 Outline
- Introduction
- Formalizing Adversarial Evasion
- Learning Framework
- Results
- Conclusions

8 Strategy for Adversarial Evasion
Increase the resemblance between tokens in the true signature and spurious tokens, e.g.:
- add infrequent tokens (red herrings [NKS]),
- change token distributions (pool poisoning [NKS]),
- mislabel samples (noise injection [PDLFS]).
The generator may then output a spurious signature such as 'aaaa' && 'dddd', 'cccc' && 'bbbb', or 'cccc' && 'dddd' instead of the true signature 'aaaa' && 'bbbb', generating high false positives or high false negatives. (A sketch of the red-herring strategy follows.)
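
A sketch of the red-herring idea on this slide (the token pools, names, and payload layout are illustrative assumptions, not the attack from [NKS]): each worm variant carries the true tokens plus decoy tokens chosen so that, to the generator, the decoys look as plausible as the true tokens.

```python
import random

TRUE_TOKENS = [b"aaaa", b"bbbb"]       # tokens of the true signature
DECOY_POOLS = [[b"cccc"], [b"dddd"]]   # spurious look-alikes, one pool per token

def worm_variant(exploit_payload: bytes) -> bytes:
    """One polymorphic variant: true tokens plus one decoy from each pool."""
    decoys = [random.choice(pool) for pool in DECOY_POOLS]
    return exploit_payload + b"".join(TRUE_TOKENS) + b"".join(decoys)
```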

9 Definition: Reflecting Set
Reflecting sets are sets of resembling tokens.
- Critical token: a token in the true signature S, e.g., 'aaaa', 'bbbb'.
- Reflecting set of a critical token i: all tokens that, to the current signature generator, appear as likely to be in S as token i.
e.g., with true signature S = 'aaaa' && 'bbbb':
- reflecting set of 'aaaa': {'aaaa', 'cccc'}
- reflecting set of 'bbbb': {'bbbb', 'dddd'}
- set of potential signatures T: 'aaaa' && 'bbbb', 'aaaa' && 'dddd', 'cccc' && 'bbbb', 'cccc' && 'dddd'

10 Reflecting Sets and Algorithms
Reflecting sets are specific to the family of algorithms under consideration:
- a coarse-grained generator (e.g., one using only first-order statistics, treating all tokens infrequent in normal traffic as candidates) has large reflecting sets, e.g., R1 = {'aaaa', 'cccc', 'eeee', 'gggg'} and R2 = {'bbbb', 'dddd', 'ffff', 'hhhh'};
- a fine-grained generator (e.g., one requiring individual tokens and pairs of tokens to be infrequent) has smaller reflecting sets, e.g., R1 = {'aaaa', 'cccc'} and R2 = {'bbbb', 'dddd'}.
By the definition of a reflecting set, to the signature-generation algorithm the true signature appears to be drawn at random from R1 x R2 (see the sketch below).
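
The induced candidate space, made explicit (a sketch; the token values follow the slide's fine-grained example):

```python
from itertools import product

R1 = [b"aaaa", b"cccc"]   # reflecting set of 'aaaa'
R2 = [b"bbbb", b"dddd"]   # reflecting set of 'bbbb'

# From the generator's view, the true signature is a uniform draw
# from R1 x R2: here r^n = 2^2 = 4 potential signatures.
candidates = list(product(R1, R2))
assert (b"aaaa", b"bbbb") in candidates and len(candidates) == 4
```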

11 Learning-based Signature Generation
Problem: learning a signature when a malicious adversary constructs reflecting sets for each critical token (e.g., {'aaaa', 'cccc'} and {'bbbb', 'dddd'}).
The lower bounds depend on the size of the reflecting sets, which reflects:
- the power of the adversary,
- the nature of the exploit,
- the algorithms used for signature generation.

12 Outline
- Introduction
- Formalizing Adversarial Evasion
- Learning Framework
- Results
- Conclusions

13 Framework: Online Learning Model
Signature generator's goal: learn as quickly as possible.
- It is optimal to update with new information from the test pool (feedback on each sample).
Adversary's goal: force as many errors as possible.
- It is optimal to present only one new sample before each update.
This setting is equivalent to the mistake-bound model of online learning [LW]. (The protocol is sketched below.)
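
The protocol as a loop (a sketch; `generator` is any object exposing predict()/update(), an interface assumed here for illustration):

```python
def run_mistake_bound_game(generator, adversary_stream):
    """adversary_stream yields (byte_string, true_label) pairs, one per round."""
    mistakes = 0
    for byte_string, true_label in adversary_stream:
        predicted = generator.predict(byte_string)   # 1. predict on the new sample
        if predicted != true_label:                  # 2. every disagreement is a mistake
            mistakes += 1
        generator.update(byte_string, true_label)    # 3. learn from the feedback
    return mistakes
```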

14 Learning Framework: Problem
Mistake-bound model of learning: after initial training, the signature generator repeatedly (1) receives a byte string, (2) predicts a label, and (3) receives the correct label.
Notation:
- n: number of critical tokens
- r: size of the reflecting set of each critical token
Assumption: the true signature is a conjunction of tokens, so the set of potential signatures has size r^n.
Goal: find the true signature among the r^n potential signatures, minimizing mistakes in prediction while learning. (The counting intuition is sketched below.)
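
The counting intuition behind the lower bounds on the following slides (a sketch of the intuition only, not the paper's proof):

```latex
% There are r^n candidate signatures, and the adversary can arrange
% that each mistake reveals at most one bit about which one is true.
\[
  |\mathcal{T}| = r^{n}
  \quad\Longrightarrow\quad
  \text{mistakes} \;\ge\; \log_2\!\left(r^{n}\right) = n \log_2 r .
\]
```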

15 Learning Framework: Assumptions
No assumptions on the learning algorithms used:
- the algorithm may learn any function as its signature; it need not learn only conjunctions.
Adversary knowledge:
- the adversary knows the system, algorithms, and features used to generate signatures,
- but does not necessarily know how the system or algorithm is tuned.
No mislabeled samples:
- no incorrect labels due to noise or malicious errors; e.g., host-monitoring techniques [NS] can be used to achieve this.

16 Outline
- Introduction
- Formalizing Adversarial Evasion
- Learning Framework
- Results:
  - General Adversarial Model
  - Can General Bounds be Improved?
- Conclusions

17 Deterministic Algorithms
Theorem: For any deterministic algorithm, there exists a sequence of samples that forces the algorithm to make at least n log r mistakes.
Practical implication: for arbitrary exploits, any pattern-extraction algorithm can be forced to make this many mistakes:
- even if extremely sophisticated pattern-extraction algorithms are used,
- even if all labels are accurate, e.g., if TaintCheck [NS] is used.
In addition, there exists an algorithm (Winnow) that achieves a mistake bound of n(log r + log n). (A Winnow sketch follows.)
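
A Winnow sketch for this setting (a hypothetical feature encoding, not the paper's implementation). A string is malicious iff every critical token is present, i.e., benign iff at least one critical token is absent: a monotone disjunction over "token absent" indicators, which Winnow learns with O(n(log r + log n)) mistakes over all N = r·n candidate tokens.

```python
def winnow_learner(token_universe, stream):
    """stream yields (tokens_present: set, label: 'malicious' | 'benign')."""
    theta = len(token_universe) / 2.0        # standard Winnow threshold N/2
    w = {t: 1.0 for t in token_universe}     # one weight per candidate token
    mistakes = 0
    for tokens_present, label in stream:
        absent = [t for t in token_universe if t not in tokens_present]
        guess = 'benign' if sum(w[t] for t in absent) >= theta else 'malicious'
        if guess != label:
            mistakes += 1
            factor = 2.0 if label == 'benign' else 0.5
            for t in absent:                 # promote or demote the active features
                w[t] *= factor
    return w, mistakes
```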

18 Randomized Algorithms
Theorem: For any randomized algorithm, there exists a sequence of samples that forces the algorithm to make at least (1/2) n log r mistakes in expectation.
Practical implication: for arbitrary exploits, any pattern-extraction algorithm can be forced to make this many mistakes:
- even if extremely sophisticated pattern-extraction algorithms are used,
- even if all labels are accurate (e.g., if TaintCheck [NS] is used),
- even if the algorithm is randomized.

19 One-Sided Error: False Positives
Theorem: Let t < n. Any algorithm constrained to make fewer than t false positives can be forced to make at least (n - t)(r - 1) mistakes on malicious samples.
Practical implication: algorithms allowed only a few false positives make significantly more mistakes than general algorithms. e.g., at t = 0: bounded false positives, n(r - 1) mistakes; general case, n log r mistakes.

20 One-Sided Error: False Negatives
Theorem: Let t < n. Any algorithm constrained to make fewer than t false negatives can be forced to make at least r^(n/(t+1)) - 1 mistakes on non-malicious samples.
Practical implication: algorithms allowed a bounded number of false negatives have far worse bounds than general algorithms. e.g., at t = 0: bounded false negatives, r^n - 1 mistakes; general algorithms, n log r mistakes.

21 Different Bounds for False Positives and Negatives!
Bounded false positives: Omega(r(n - t)) mistakes, i.e., learning from positive data only.
- No mistakes are allowed on negatives, so the adversary forces mistakes with positives.
Bounded false negatives: Omega(r^(n/(t+1))) mistakes, i.e., learning from negative data only.
- No mistakes are allowed on positives, so the adversary forces mistakes with negatives.
There is much more "information" about the signature in a malicious sample. Analogy: learning "what is a flower?" from positive data only versus negative data only. (A numeric comparison follows.)
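
Plugging in illustrative numbers (chosen here for concreteness, not taken from the paper):

```latex
% n = 10 critical tokens, reflecting sets of size r = 100, t = 0:
\begin{align*}
  \text{general case:}             &\quad n \log_2 r \approx 66 \text{ mistakes} \\
  \text{bounded false positives:}  &\quad n(r - 1) = 990 \text{ mistakes} \\
  \text{bounded false negatives:}  &\quad r^{n} - 1 = 10^{20} - 1 \text{ mistakes}
\end{align*}
```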

22 Outline
- Introduction
- Formalizing Adversarial Evasion
- Learning Framework
- Results:
  - General Adversarial Model
  - Can General Bounds be Improved?
- Conclusions

23 Can General Bounds be Improved?
Consider a relaxed problem:
- Requirement: classify correctly only (a) malicious packets and (b) non-malicious packets regularly present in normal traffic.
- The classification does NOT have to match the true signature on the rest.
Characterize the "gap" between malicious and normal traffic:
- Overlap ratio d: of the tokens in the true signature, the fraction that appear together in normal traffic. e.g., if the signature has 10 tokens but only 5 appear together in normal traffic, d = 0.5 (see the sketch below).
- The bounds are a function of the overlap ratio.
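
One plausible way to measure the overlap ratio d (an assumption about the measurement, not code from the paper):

```python
def overlap_ratio(signature_tokens, normal_pool):
    """Largest fraction of the signature's tokens co-occurring in one normal string."""
    best = 0
    for s in normal_pool:
        present = sum(tok in s for tok in signature_tokens)
        best = max(best, present)
    return best / len(signature_tokens)

# e.g., a 10-token signature where at most 5 tokens appear together
# in any normal string gives d = 0.5.
```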

24 Lower Bounds with Gaps in Traffic
Theorem: Let d < 1. For the class of functions called linear separators, any deterministic algorithm can be forced to make log_{1/d}(r) mistakes, and any randomized algorithm can be forced to make, in expectation, (1/2) log_{1/(2d)}(r) mistakes.
Practical implication: pattern-extraction algorithms may work for exploits if:
- the signature overlaps very little with normal traffic, and
- the algorithm is given few (or no) mislabeled samples.

25 Related Work
Learning-based signature-generation algorithms: Honeycomb [KC03], EarlyBird [SEVS04], Autograph [KK04], Polygraph [NKS05], COVERS [LS06], Hamsa [LSCCK06], Anagram [WPS06].
Evasions: [PDLFS06], [NKS06], [CM07], [GBV07].
Adversarial learning:
- closely related: [Angluin88], [Littlestone88]
- others: [A97], [ML93], [LM05], [BEK97], [DDMSV04]

26 Conclusions
Formalized a framework for analyzing the performance of pattern-extraction algorithms under adversarial evasion:
- showed fundamental limits on the accuracy of pattern-extraction algorithms under adversarial evasion,
- generalizing earlier work that focused on individual systems.
Analyzed when these fundamental limits are weakened:
- characterized the kinds of exploits for which pattern-extraction algorithms may work.

27 Thank you!