One-class Training for Masquerade Detection
Ke Wang, Sal Stolfo
Columbia University Computer Science, IDS Lab

Masquerade Attack
One user impersonates another.
Access control and authentication cannot detect it (legitimate credentials are presented).
It can be the most serious form of computer abuse.
The common solution is to detect significant departures from normal user behavior.

Schonlau Dataset
15,000 truncated UNIX commands for each of 70 users.
100 commands form one block; each block is treated as a "document".
50 users were randomly chosen as victims.
Each victim's first 5,000 commands are clean; the remaining 10,000 have randomly inserted dirty blocks from the other 20 users.
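A minimal sketch of how such data could be prepared, assuming each user's commands sit in a plain-text file with one command per line (the file name "User1" and the helper name are hypothetical, not from the paper):

```python
def load_blocks(path, block_size=100):
    """Split a user's command stream into non-overlapping 100-command blocks."""
    with open(path) as f:
        commands = [line.strip() for line in f if line.strip()]
    # Each block of 100 commands is treated as one "document".
    return [commands[i:i + block_size]
            for i in range(0, len(commands), block_size)]

blocks = load_blocks("User1")
# First 5,000 commands = 50 clean training blocks; the rest are test blocks.
train_blocks, test_blocks = blocks[:50], blocks[50:]
```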

Previous work
Uses a two-class classifier: a self profile and a non-self profile for each user.
The first 5,000 commands serve as self examples, and the first 5,000 commands of all other 49 users as masquerade examples.
Examples: Naïve Bayes [Maxion], 1-step Markov, Sequence Matching [Schonlau].

Why two-class?
It is reasonable to assume the self examples (the user's own data) are consistent in some way, but masquerader examples can belong to any other user and so look very different.
Since true masquerader training data is unavailable, the other users' data stands in for it.

Benefits of the one-class approach
Practical advantages:
Much less data collection
Decentralized management
Independent training
Faster training and testing
No need to define a masquerader; instead, detect "impersonators".

One-class algorithms
One-class Naïve Bayes (e.g., Maxion)
One-class SVM

Naïve Bayes Classifier
Bayes rule.
Assume each command is independent of the others (the "naïve" part).
Estimate the parameters during training; at test time choose the class with the higher probability.
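As a reminder (a standard formulation, not copied from the slide), for a block d made up of commands w_1, ..., w_n:

```latex
P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)}
            \propto P(c)\prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c}\, P(c)\prod_{i=1}^{n} P(w_i \mid c).
```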

Multi-variate Bernoulli model
Each block is an N-dimensional binary feature vector, where N is the number of unique commands, each assigned an index in the vector.
A feature is set to 1 if the command occurs in the block, 0 otherwise.
Each dimension is a Bernoulli variable, so the whole vector is a multivariate Bernoulli.
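A minimal sketch of this representation, assuming `vocabulary` maps each distinct command (collected from training data) to an index; the helper name is hypothetical:

```python
import numpy as np

def bernoulli_vector(block, vocabulary):
    """Binary presence/absence vector over the command vocabulary."""
    x = np.zeros(len(vocabulary), dtype=np.int8)
    for cmd in block:
        if cmd in vocabulary:
            x[vocabulary[cmd]] = 1   # occurrence only; repeated uses are ignored
    return x
```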

Multinomial model (bag of words)
Each block is an N-dimensional feature vector, as before.
Each feature is the number of times the command occurs in the block.
Each block is therefore a vector of multinomial counts.
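A minimal sketch of fitting and scoring a multinomial model for one class, assuming `train_vectors` is a 2-D array of per-block command-count vectors (names are illustrative, not from the paper):

```python
import numpy as np

def fit_multinomial(train_vectors, alpha=1.0):
    """Laplace-smoothed estimate of P(command | class) from count vectors."""
    counts = train_vectors.sum(axis=0)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

def log_likelihood(count_vector, theta):
    # log P(block | class) up to the multinomial coefficient,
    # which is identical for every class and can be dropped
    return float(count_vector @ np.log(theta))
```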

Model comparison (McCallum & Nigam ’98)

One-class Naïve Bayes
Assume every command has equal probability under the masquerader model.
Only the threshold on the probability of being the user/self can be adjusted, i.e. the ratio of the estimated self probability to the uniform distribution.
No information about masqueraders is needed at all.
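A minimal sketch of the resulting decision rule, assuming `theta_self` was estimated from the user's own blocks (e.g. with the multinomial sketch above) and the threshold `t` is tuned on training data:

```python
import numpy as np

def self_score(count_vector, theta_self):
    """Log-ratio of the self model against a uniform masquerader model."""
    n_commands = len(theta_self)
    log_self = count_vector @ np.log(theta_self)
    # The masquerader "model" is just the uniform distribution over commands,
    # so no masquerader data is needed at all.
    log_uniform = count_vector.sum() * np.log(1.0 / n_commands)
    return float(log_self - log_uniform)

# A block is flagged as a masquerade when its score drops below the threshold:
# is_masquerade = self_score(x, theta_self) < t
```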

SVM (Support Vector Machine)

One-class SVM
Map the data into a feature space using a kernel.
Find the hyperplane S separating the positive data from the origin (treated as the negative class) with maximum margin.
The probability that a positive test point lies outside S is bounded by the prior ν.
Relaxation (slack) parameters allow some outliers.
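A minimal sketch using scikit-learn's OneClassSVM, assuming X_train and X_test are matrices of binary command-occurrence vectors (one row per 100-command block); the random placeholders are only there to make the snippet runnable:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(50, 500))    # placeholder: the user's own blocks
X_test = rng.integers(0, 2, size=(100, 500))    # placeholder: blocks to be tested

# nu upper-bounds the fraction of training blocks allowed to fall outside
# the boundary (the prior v on the slide); a linear kernel on binary
# features is one reasonable choice.
clf = OneClassSVM(kernel="linear", nu=0.05).fit(X_train)

scores = clf.decision_function(X_test)   # larger = more self-like
flagged = scores < 0                     # suspected masquerade blocks
```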

One-class SVM

Experimental setting (revisited)
50 users. Each user's first 5,000 commands are clean; the remaining 10,000 have randomly inserted dirty blocks from the other 20 users.
The first 5,000 commands serve as positive (self) examples, and the first 5,000 commands of all other 49 users as negative (masquerade) examples.

Bernoulli vs. Multinomial

One-class vs. two-class result

ocSVM binary vs. previous best-outcome results

Comparing different classifiers across users
The same classifier performs differently for different users. (ocSVM binary)

Problem with the dataset
Each user has a different number of masquerade blocks, and the origins of the masquerade blocks also differ.
So this experiment may not reflect the real performance of the classifiers.

Alternative data configuration: 1v49
Only the first 5,000 commands are used as the user/self's training examples.
All other 49 users' first 5,000 commands serve as masquerade data, tested against the clean data among the self user's remaining 10,000 commands.
Each user then has almost the same masquerade blocks to detect, which makes this a better way to compare the classifiers.

ROC Score
The ROC score is the area under the ROC curve, expressed as a fraction of the unit square; the larger, the better.
A ROC score of 1 means perfect detection without any false positives.
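A minimal sketch of computing it, assuming masquerade blocks are labeled 1 and `scores` is a per-block suspicion score (e.g. the negated one-class SVM decision value); the tiny arrays are placeholders:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0])                 # 1 = masquerade block
scores = np.array([0.1, 0.3, 0.8, 0.2, 0.9, 0.6, 0.4])   # higher = more suspicious

roc_score = roc_auc_score(y_true, scores)   # 1.0 means perfect separation
```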

ROC Score

Comparison using ROC score

ROC-P Score: false positive rate <= p%
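One way to compute such a score (a sketch under the assumption that ROC-p is the area under the ROC curve restricted to false positive rates <= p%, normalized by p%; the paper's exact normalization may differ):

```python
import numpy as np
from sklearn.metrics import roc_curve

def roc_p(y_true, scores, p=0.05):
    """Normalized area under the ROC curve for false positive rates <= p."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    grid = np.linspace(0.0, p, 200)        # uniform FPR grid up to the cut-off
    tpr_at = np.interp(grid, fpr, tpr)     # interpolate the ROC curve on it
    return float(tpr_at.mean())            # mean TPR on a uniform grid = area / p
```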

ROC-5: false positive rate <= 5%

ROC-1: false positive rate <= 1%

Conclusion
One-class training can achieve performance similar to that of the two-class methods.
One-class training has practical benefits.
One-class SVM with binary features performs better, especially when the false positive rate must be low.

Future work
Include command arguments as features.
Feature selection?
Real-time detection.
Combine user commands with file access and system calls.