Published by Marybeth Harrell. Modified over 9 years ago.
1
Masquerade Detection
Mark Stamp
2
Masquerader: someone who makes unauthorized use of a computer
- How to detect a masquerader?
- Here, we consider anomaly-based intrusion detection (IDS)
- Detection is based on UNIX commands
- Lots of prior work on this problem
- We attempt to apply profile hidden Markov models (PHMMs)
- For comparison, we also implement other techniques (HMM and N-gram)
3
Schonlau Data Set
- Schonlau et al. collected a large data set
- Contains UNIX commands for 50 users
  - 50 files, one for each user
  - Each file has 15k commands: 5k from the user, plus 10k for masquerade test data
- Test data: 100 blocks of 100 commands each
- Data set includes a map file
  - 100 rows (test blocks), 50 columns (users)
  - 0 if block is user data, 1 if masquerade data
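The layout described on this slide can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the commands are assumed to be already loaded into a list (one command per entry), and a synthetic list stands in for a real user file.

```python
def split_user_commands(commands, train_size=5000, block_size=100):
    """Split one user's command list into training data and test blocks."""
    train = commands[:train_size]
    test = commands[train_size:]
    blocks = [test[i:i + block_size] for i in range(0, len(test), block_size)]
    return train, blocks


def masquerade_blocks(map_rows, user_index):
    """Return indices of test blocks flagged 1 (masquerade) for one user.

    map_rows is assumed to be 100 rows (test blocks) of 50 ints (users).
    """
    return [b for b, row in enumerate(map_rows) if row[user_index] == 1]


# Synthetic stand-in for one user's 15k-command file (not real data).
commands = [f"cmd{i % 7}" for i in range(15000)]
train, blocks = split_user_commands(commands)

# Synthetic map: everything is user data except block 3 of user 2.
map_rows = [[0] * 50 for _ in range(100)]
map_rows[3][2] = 1
```

With these inputs, `train` holds the first 5k commands and `blocks` holds the 100 test blocks that the map file labels.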
4
Schonlau Data Set
- Map file structure
- This data set has been used in many studies
  - Approximately 50 published papers
5
Previous Work
- Approaches to masquerade detection
  - Information theoretic
  - Text mining
  - Hidden Markov models (HMM)
  - Naïve Bayes
  - Sequences and bioinformatics
  - Support vector machines (SVM)
  - Other approaches
- We briefly look at each of these
6
Information Theoretic
- Original work by Schonlau included a compression technique
- Based on the theory (hope?) that a legitimate user's commands compress more than attack commands
- Results were disappointing
- Some additional recent work
  - Still not competitive with the best approaches
7
Text Mining
- A few papers in this area
- One approach extracts repetitive sequences from training data
- Another paper uses principal component analysis (PCA)
  - Method of “exploratory data analysis”
  - Good results on Schonlau data set
  - But high cost during training phase
8
Hidden Markov Models
- Several authors have used HMMs
  - One of the best-known approaches
- We have implemented an HMM detector
- We do sensitivity analysis on the parameters
  - In particular, determine optimal N (number of hidden states)
- We also use HMMs for comparison with our PHMM results
9
Naïve Bayes
- In simplest form, relies only on command frequencies
  - That is, no sequence info is used
- Several papers analyze this approach
- Among the simplest approaches
  - And results are good
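A frequency-only scorer of the kind this slide describes might look like the sketch below. The smoothing constant `alpha`, the function names, and the toy commands are all assumptions made for illustration; the vocabulary is assumed to cover every command that will be scored.

```python
import math
from collections import Counter


def train_frequencies(train_commands, vocabulary, alpha=0.01):
    """Estimate smoothed log-probabilities of each command for one user."""
    counts = Counter(train_commands)
    total = len(train_commands) + alpha * len(vocabulary)
    return {c: math.log((counts[c] + alpha) / total) for c in vocabulary}


def score_block(block, log_probs):
    """Sum of per-command log-probabilities; command order is ignored."""
    return sum(log_probs[c] for c in block)


# Toy data (not from the Schonlau set).
train_cmds = ["ls", "cd", "ls", "cat", "ls", "cd"]
vocabulary = {"ls", "cd", "cat", "gcc"}
log_probs = train_frequencies(train_cmds, vocabulary)

usual = score_block(["ls", "cd", "ls"], log_probs)
unusual = score_block(["gcc", "gcc", "gcc"], log_probs)
```

A block made of commands the user rarely issues receives a lower score, and reordering the commands in a block does not change its score, which is exactly the "no sequence info" property.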
10
Sequences
- In a sense, this is the opposite extreme from naïve Bayes
  - Naïve Bayes only considers frequency stats
  - Sequence/bioinformatics approaches focus on sequence-related information
- Schonlau’s original work included elementary sequence-based analysis
11
Bioinformatics
- We are aware of only one previous paper that uses a bioinformatics approach
  - Uses the Smith-Waterman algorithm to create local alignments
  - Alignments are then used directly for detection
- In contrast, we do pairwise alignments, multiple sequence alignment (MSA), PHMM
  - PHMM is used for scoring (forward algorithm)
- Our scoring is much more efficient
- Also, our results are at least as strong
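For readers unfamiliar with Smith-Waterman, the sketch below computes the best local alignment score between two command sequences. It is a generic textbook version with linear gap penalties, not the method of the paper mentioned on the slide; the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Local alignment: a cell never drops below zero.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


# Toy command sequences (not real user data).
shared = smith_waterman_score(["ls", "cd", "cat"],
                              ["vi", "ls", "cd", "cat", "gcc"])
disjoint = smith_waterman_score(["ls"], ["vi"])
```

Here the three-command run `ls cd cat` appears in both sequences, so `shared` equals three matches, while sequences with nothing in common score zero.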
12
Support Vector Machines
- Support vector machines (SVM)
  - Machine learning technique
- Separate data points (i.e., classify) based on hyperplanes in high-dimensional space
  - Original data mapped to higher dimension, where separation is likely easier
- SVMs maximize separation
  - And have low computational costs
- Used for classification and regression analysis
13
SVMs & Masquerade Detection
- SVMs have been applied to the masquerade detection problem
- Results are good
  - Comparable to naïve Bayes
- Recent work using SVMs has focused on improved efficiency
14
Other Approaches
- The following have also been studied
  - Detect using low-frequency commands
  - Detect using high-frequency commands
  - Hybrid Bayes “one step Markov”
    - Natural to consider hybrid approaches
  - Multistep Markov
    - Markov process of order greater than 1
- None of these particularly successful
15
Other Approaches (Continued)
- Non-negative matrix factorization (NMF)
  - At least 2 papers on this topic
  - Appears to be competitive
- Other hybrids that attempt to combine several approaches
  - So far, no significant improvement over individual techniques
16
HMMs
- See previous presentation
17
HMM for Masquerade Detection
- Using the Schonlau data set we…
  - Train an HMM for each user
  - Set thresholds
  - Test the models and plot results
- Note that this has been done before
- Here, we perform sensitivity analysis
  - That is, we test different numbers of hidden states, N
- Also use it for comparison with PHMM
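The scoring and thresholding steps can be sketched as follows. Training (for example with Baum-Welch) is omitted, and the two-state model parameters, the symbol encoding (commands mapped to integers), and the threshold value are all made up for illustration rather than taken from the experiments.

```python
import math


def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: returns log P(obs | model).

    pi[i]   : initial probability of hidden state i
    A[i][j] : transition probability from state i to state j
    B[i][o] : probability of emitting symbol o in state i
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)  # rescale each step to avoid underflow
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob


def is_masquerade(block, model, threshold):
    """Flag a block whose per-command log-likelihood is below threshold."""
    pi, A, B = model
    return forward_log_likelihood(block, pi, A, B) / len(block) < threshold


# Toy model with N=2 hidden states and 3 symbols; symbol 2 is rare.
model = (
    [0.6, 0.4],
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
)
typical_block = [0, 0, 1, 0]
odd_block = [2, 2, 2, 2]
```

Dividing by the block length keeps the threshold comparable across blocks of different sizes; with fixed 100-command blocks this normalization is optional.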
18
HMM Experiments
- Plotted as “ROC” curves
  - Closer to origin is better
- Useful region
  - That is, false positives below 5%
  - The shaded region
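One way to produce the points for such a plot is to sweep a threshold over block scores and record the error rates at each setting. The sketch assumes the two axes are false-positive and false-negative rates, which would explain why points near the origin are better; the slide does not state the axes, and the scores below are invented.

```python
def error_rates(user_scores, attack_scores, threshold):
    """Blocks scoring below threshold are flagged as masquerade."""
    fp = sum(s < threshold for s in user_scores) / len(user_scores)
    fn = sum(s >= threshold for s in attack_scores) / len(attack_scores)
    return fp, fn


def tradeoff_curve(user_scores, attack_scores):
    """(false positive rate, false negative rate) at each candidate threshold."""
    thresholds = sorted(set(user_scores) | set(attack_scores))
    return [error_rates(user_scores, attack_scores, t) for t in thresholds]


# Invented per-block scores (higher means more like the legitimate user).
user_scores = [-1.0, -1.2, -0.9, -2.5]
attack_scores = [-3.0, -2.8, -1.1]

curve = tradeoff_curve(user_scores, attack_scores)
# Keep only the "useful region": false positives below 5%.
useful = [(fp, fn) for fp, fn in curve if fp < 0.05]
```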
19
HMM Conclusion
- Number of hidden states does not matter
- So, use N=2
  - Since it is the most efficient
20
PHMM
- See previous presentation
21
PHMM Experiments
- A problem with Schonlau data…
  - For a given user, 5000 commands
  - No begin/end session markers
- So, must split it up to obtain multiple sequences
- But where to split the sequence?
- And what about the tradeoff between number of sequences and length of each sequence?
  - That is, how to decide length and number?
22
PHMM Experiments
- Experiments done for the following cases: see next slide…
23
PHMM Experiments
- Tests of various numbers of sequences
- Best results: 5 sequences, 1k commands each
  - This case is in the next slide
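Since the data has no session markers, one simple way to obtain training sequences is to cut the 5000 commands into equal, contiguous pieces. The sketch below assumes that approach; the slides do not say exactly how the split points were chosen, and the function name is hypothetical.

```python
def split_into_sequences(commands, num_sequences):
    """Cut a command list into equal-length contiguous sequences.

    Any leftover commands at the end are dropped when the length is
    not evenly divisible.
    """
    length = len(commands) // num_sequences
    return [commands[i * length:(i + 1) * length] for i in range(num_sequences)]


# Synthetic stand-in for one user's 5000 training commands.
training = [f"cmd{i % 11}" for i in range(5000)]
sequences = split_into_sequences(training, 5)
```

Varying `num_sequences` trades off the number of sequences against the length of each, which is the tradeoff the experiments explore.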
24
PHMM Comparison
- Compare PHMM to “weighted N-gram” and HMM
- HMM is best
- PHMM is competitive
25
PHMM Detector
- PHMM is at a disadvantage on Schonlau data
  - PHMM uses positional information
  - Such info is not available for Schonlau data
  - We have to guess the positions for PHMM
- How to get a fairer comparison between HMM and PHMM?
- We need a different data set
- Only option is a simulated data set
26
Simulated Data
- We generate simulated data as follows
  - Using Schonlau data, construct a Markov chain for each user
  - Use the resulting Markov chain to generate sequences representing user behavior
  - Restrict “begin” to more common commands
- What’s the point?
  - Simulated sequences have sensible begin and end
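The generation procedure can be sketched as below. This assumes a first-order chain, a fixed target length, and a "begin" set made of the `top_k` most frequent commands; the slides do not specify the chain order, how sequence ends were chosen, or how "more common" was defined, so those choices and all names here are illustrative.

```python
import random
from collections import Counter, defaultdict


def build_markov_chain(commands):
    """Count first-order transitions: chain[a][b] = times b followed a."""
    chain = defaultdict(Counter)
    for current, nxt in zip(commands, commands[1:]):
        chain[current][nxt] += 1
    return chain


def common_commands(commands, top_k):
    """The top_k most frequent commands, used to restrict where sequences begin."""
    return [c for c, _ in Counter(commands).most_common(top_k)]


def generate_sequence(chain, start_choices, length, rng):
    """Random walk on the chain, starting from one of the common commands."""
    current = rng.choice(start_choices)
    seq = [current]
    while len(seq) < length:
        successors = chain.get(current)
        if not successors:  # dead end: command never seen with a successor
            break
        options = list(successors)
        weights = [successors[o] for o in options]
        current = rng.choices(options, weights=weights, k=1)[0]
        seq.append(current)
    return seq


# Toy training data (not from the Schonlau set).
toy_commands = ["ls", "cd", "ls", "cat", "ls", "cd", "vi", "ls"]
chain = build_markov_chain(toy_commands)
starts = common_commands(toy_commands, 2)
simulated = generate_sequence(chain, starts, 20, random.Random(0))
```

Every consecutive pair in `simulated` is a transition that was observed in the training commands, so the generated data preserves the user's one-step habits.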
27
Simulated Data
- Training data and user data for scoring are generated using the Markov chain
- Attack data is taken from Schonlau data
- How much data to generate?
- In the first test, we generate the same amount of simulated data as is in the Schonlau set
  - That is, 5k commands per user
28
Detection with Simulated Data
- PHMM vs HMM, Round 2
- It’s close, but HMM still wins!
29
Limited Training Data
- What if less training data is available?
- In a real application, training data is limited initially
  - Can’t detect attacks until sufficient training data has been accumulated
  - So, the less data required, the better
- Experiments using simulated data with limited training data
  - Used 200 to 800 commands for training
30
Limited Training Data
- PHMM vs HMM, Round 3
- With 400 or fewer training commands, PHMM wins big!
31
Conclusion
- PHMM is competitive with the best approaches
- PHMM is likely to do better, given better training data (begin/end info)
- PHMM is much better than HMM when limited training data is available
  - Of practical importance
- Why does it make sense that PHMM would do better with limited training data?
32
Conclusion
- Given current state of research…
- Optimal masquerade detection approach
  - Initially, collect a small training set
  - Train PHMM and use it for detection
  - If there is no attack, continue to collect data
  - When sufficient data is available, train HMM
  - From then on, use HMM for detection
33
Future Work
- Collect a better real data set!
  - Many problems/limitations with Schonlau data
  - An improved data set could be the basis for a great deal of research
- Directly compare PHMM/bioinformatics approaches with previous work (HMM, naïve Bayes, SVM, etc.)
- Consider hybrid techniques
- Other techniques?
34
References
- L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, to appear in Computers and Security
- M. Schonlau, Masquerading user data