Presentation on theme: "Vipin Kumar, AHPCRC, University of Minnesota"— Presentation transcript:

1 Data Mining for Network Intrusion Detection: Experience with the KDDCup’99 Data Set
Vipin Kumar, AHPCRC, University of Minnesota
Group members: L. Ertoz, M. Joshi, A. Lazarevic, H. Ramnani, P. Tan, J. Srivastava

2 Introduction
Key challenge: maintain a high detection rate while keeping the false alarm rate low
Misuse detection: two-phase rule learning (PNrule); Classification Based on Associations (CBA) approach
Anomaly detection: unsupervised (e.g., clustering) and supervised methods to detect novel attacks

3 DARPA 1998 - KDDCup’99 Data Set
Modification of the DARPA 1998 data set prepared and managed by MIT Lincoln Lab
DARPA 1998 data includes a wide variety of intrusions simulated in a military network environment
9 weeks of raw TCP dump data simulating a typical U.S. Air Force LAN
7 weeks for training (5 million connection records)
2 weeks for testing (2 million connection records)

4 KDDCup’99 Data Set
Connections are labeled as normal or attacks
Attacks fall into 4 main categories (38 attack types):
DOS - denial of service
Probe - e.g., port scanning
U2R - unauthorized access to root privileges
R2L - unauthorized remote login to a machine
U2R and R2L are extremely small classes
3 groups of features: basic, content-based, and time-based features

5 KDDCup’99 Data Set
Training set - ~5 million connections
10% training set - 494,021 connections
Test set - 311,029 connections
Test data has attack types that are not present in the training data => problem is more realistic
Training set contains 22 attack types
Test data contains an additional 17 new attack types that belong to one of the four main categories

6 Performance of Winning Strategy
Cost-sensitive bagged boosting (B. Pfahringer)

7 Simple RIPPER classification
RIPPER trained on 10% of the data (494,021 connections)
Tested on the entire test set (311,029 connections)

8 Simple RIPPER on modified data
Remove duplicates and merge the train and test data sets
Sample 69,980 examples from the merged data set
Sample only from the neptune and normal subclasses; other subclasses remain intact
Divide in equal proportion into new training and test sets
Apply the RIPPER algorithm to the new data set
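The preprocessing steps above can be sketched roughly as follows (a simplified sketch: `prepare_modified_dataset` is a hypothetical helper, the class-specific sampling of only the neptune and normal subclasses is omitted, and connection records are assumed to be hashable tuples):

```python
import random

def prepare_modified_dataset(train, test, sample_size, seed=0):
    # Merge train and test, then drop exact duplicate records
    # (dict.fromkeys preserves first-seen order while deduplicating).
    merged = list(dict.fromkeys(train + test))
    # Sample without replacement from the merged, deduplicated data.
    rng = random.Random(seed)
    sampled = rng.sample(merged, min(sample_size, len(merged)))
    # Divide in equal proportion into new training and test sets.
    half = len(sampled) // 2
    return sampled[:half], sampled[half:]
```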

9 Building Predictive Models in NID
Models should handle skewed class distributions
Accuracy is not a sufficient metric for evaluation
Focus on both recall and precision:
Recall (R) = TP / (TP + FN)
Precision (P) = TP / (TP + FP)
F-measure = 2*R*P / (R + P)
(Notation: C - rare class, NC - large class)

10 Predictive Models for Rare Classes
Over-sampling the small class [Ling, Li, KDD 1998]
Down-sizing the large class [Kubat, ICML 1997]
Internally bias the discrimination process to compensate for class imbalance [Fawcett, DMKDD 1997]
PNrule and related work [Joshi, Agarwal, Kumar, SIAM, SIGMOD 2001]
RIPPER with stratification
SMOTE algorithm [Chawla, JAIR 2002]
RareBoost [Joshi, Agarwal, Kumar, ICDM 2001]

11 PNrule Learning
P-phase: cover most of the positive examples with rules of high support; seek good recall
N-phase: remove false positives from the examples covered in the P-phase; N-rules give high accuracy and significant support
Existing techniques can learn erroneous small signatures for the absence of C; PNrule can instead learn strong signatures for the presence of NC in the N-phase
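A toy sketch of the two-phase idea (hypothetical helper names; rules are modeled as boolean predicates, and the real PNrule algorithm grows rules greedily rather than filtering a fixed candidate pool as done here):

```python
def pnrule_predict(x, p_rules, n_rules):
    # Positive iff some P-rule fires and no N-rule vetoes it.
    return any(r(x) for r in p_rules) and not any(r(x) for r in n_rules)

def learn_pnrule(data, labels, candidates, p_recall_min=0.5, n_acc_min=0.9):
    # P-phase: keep candidate rules that cover many positives (high recall).
    pos = [x for x, y in zip(data, labels) if y]
    p_rules = [r for r in candidates
               if pos and sum(r(x) for x in pos) / len(pos) >= p_recall_min]
    # Examples covered by the P-phase (predicted positive so far).
    covered = [(x, y) for x, y in zip(data, labels)
               if any(r(x) for r in p_rules)]
    # N-phase: keep rules that accurately pick out the false positives.
    fps = [x for x, y in covered if not y]
    n_rules = []
    for r in candidates:
        fired = [x for x, _ in covered if r(x)]
        if fired and sum(1 for x in fired if x in fps) / len(fired) >= n_acc_min:
            n_rules.append(r)
    return p_rules, n_rules
```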

12 RIPPER vs. PNrule Classification

Model   Attack  Recall (%)  Precision (%)  F-value
RIPPER  U2R     17.1        6.7            9.6
        R2L     13.9        84.9           23.9
        Probe   77.8        64.7           70.7
PNrule  U2R     18.4        56.8           27.8
        R2L     14.1        72.8           23.7
        Probe   83.8        69.2           75.9

5% sample from normal, smurf (DOS), neptune (DOS) from 10% of training data (494,021 connections)
Test on entire test set (311,029 connections)

13 Classification Based on Associations (CBA)
What are association patterns?
Frequent itemset: captures a set of “items” that co-occur together frequently in a transaction database
Association rule: predicts the occurrence of a set of items in a transaction given the presence of other items
An association rule X ⇒ Y is qualified by:
Support s: the fraction of transactions that contain both X and Y
Confidence c: the fraction of transactions containing X that also contain Y
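Support and confidence can be computed directly from a transaction database (a minimal sketch; the function names are my own):

```python
def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t)) / len(transactions)

def confidence(lhs, rhs, transactions):
    # Of the transactions containing lhs, the fraction that also contain rhs.
    denom = support(lhs, transactions)
    return support(set(lhs) | set(rhs), transactions) / denom if denom else 0.0
```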

14 Classification Based on Associations (CBA)
Previous work: Use association patterns to improve the overall performance of traditional classifiers. Integrating Classification and Association Rule Mining [Liu, Li, KDD 1998] CMAR: Accurate Classification Based on Multiple Class-Association Rules [Han, ICDM 2001] Associations in Network Intrusion Detection Use classification based on associations for anomaly detection and misuse detection [Lee, Stolfo, Mok 1999] Look for abnormal associations [Barbara, Wu, Jajodia, 2001]

15 Frequent Itemset Generation
Methodology (diagram): overall data set → feature selection → stratification by class (dos, probe, u2r, r2l, normal) → frequent itemset generation per class → feed the resulting itemsets to the classifier
Example per-class itemsets: {A, B, C} ⇒ dos; {B, F} ⇒ probe; {A, C, D} ⇒ u2r; {C, K, L} ⇒ r2l; {A, B} ⇒ normal

16 Methodology
Current approaches use confidence-like measures to select the best rules to be added as features into the classifiers
This may work well only if each class is well represented in the data set
For rare-class problems, some of the high-recall itemsets could be potentially useful, as long as their precision is not too low
Our approach:
Apply a frequent itemset generation algorithm to each class
Select itemsets to be added as features based on precision, recall, and F-measure
Apply a classification algorithm, i.e., RIPPER, to the new data set
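The itemset-selection step might look like this (a sketch with hypothetical names; each itemset is scored as the rule "itemset ⇒ target class" and ranked by F-measure):

```python
def f_measure_of_itemset(itemset, records, labels, target):
    # Score an itemset as the rule 'itemset => target' by its F-measure.
    s = set(itemset)
    tp = fp = fn = 0
    for rec, y in zip(records, labels):
        fires = s <= set(rec)
        if fires and y == target:
            tp += 1
        elif fires:
            fp += 1
        elif y == target:
            fn += 1
    if tp == 0:
        return 0.0
    r, p = tp / (tp + fn), tp / (tp + fp)
    return 2 * r * p / (r + p)

def select_itemsets(itemsets, records, labels, target, k=10):
    # Keep the top-k itemsets by F-measure, to be added as binary features.
    return sorted(itemsets,
                  key=lambda i: -f_measure_of_itemset(i, records, labels, target))[:k]
```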

17 Experimental Results (on modified data)
[Chart: original RIPPER vs. RIPPER with high-Precision, high-Recall, and high-F-measure rules]

18 Experimental Results (on modified data)
[Chart: original RIPPER vs. RIPPER with high-Precision, high-Recall, and high-F-measure rules]
For rare classes, rules ordered according to F-measure produce the best results.

19 CBA Summary
Association rules can improve the overall performance of classifiers
The measure used to select rules for feature addition can affect the performance of classifiers
The proposed F-measure rule selection approach leads to better overall performance

20 Anomaly Detection – Related Work
Detect novel intrusions using pseudo-Bayesian estimators to estimate prior and posterior probabilities of new attacks [Barbara, Wu, SIAM 2001]
Generate artificial anomalies (intrusions) and then use RIPPER to learn intrusions [Fan et al., ICDM 2001]
Detect intrusions by computing changes in estimated probability distributions [Eskin, ICML 2000]
Clustering-based approaches [Portnoy et al., 2001]

21 SNN Clustering on KDDCup’99 Data
SNN clustering is suited for finding clusters of varying sizes, shapes, and densities in the presence of noise
Dataset: 10,000 examples were sampled from the neptune, smurf, and normal classes, from both the training and test sets
Other sub-classes remain intact
Total number of instances: 97,000
Applied shared nearest neighbor (SNN) based clustering and k-means clustering
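The core of SNN clustering is the shared-nearest-neighbor similarity, sketched below (hypothetical helper names; the O(n²) neighbor search is for illustration only, and the clustering step that follows from these similarities is omitted):

```python
def knn_lists(points, k, dist):
    # Indices of each point's k nearest neighbors (excluding itself).
    nbrs = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: dist(p, points[j]))
        nbrs.append(set(order[:k]))
    return nbrs

def snn_similarity(i, j, nbrs):
    # SNN similarity: size of the neighbor-list overlap, counted only
    # if i and j appear in each other's k-NN lists.
    if i in nbrs[j] and j in nbrs[i]:
        return len(nbrs[i] & nbrs[j])
    return 0
```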

22 Clustering Results
SNN clusters of pure new attack types are found

Cluster name         Size  Same category  Wrong category
apache2 (dos)        211   183            4
mscan (probe)        142   118
xterm + ps (u2r)     117   57             24 (r2l), 36 (normal)
snmpgetattack (r2l)  69                   34 (normal)
                     131   104
processtable (dos)   146   87             1 (dos), 3 (r2l)

23 Clustering Results
[Charts: k-means performance (all k-means clusters, tightest k-means clusters) vs. SNN clustering performance]

24 Nearest Neighbor (NN) based Outlier Detection
For each point in the training set, calculate the distance to the closest other point
Build a histogram of these distances
Choose a threshold such that a small percentage (e.g., 2%) of the training set is classified as outliers
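The three steps above translate into a short sketch (hypothetical names; the percentile choice mirrors the ~2% outlier fraction mentioned in the slide):

```python
def nn_distances(points, dist):
    # Distance from each point to its closest other point.
    return [min(dist(p, q) for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]

def outlier_threshold(distances, outlier_frac=0.02):
    # Threshold such that roughly outlier_frac of the points exceed it.
    ranked = sorted(distances)
    idx = max(0, int(len(ranked) * (1 - outlier_frac)) - 1)
    return ranked[idx]
```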

25 Anomaly Detection using NN Scheme

26 Novel Attack Detection Using NN Scheme

                Normal  Correct attack group  Incorrect attack group  Anomaly  Total
Normal          12040   176                   -                       173      12389
Known Attacks   1119    7581                  225                     1814     10739
Novel Attacks   781     347                   139                     2755     4022

Detection Rate for Novel Attacks = 68.50%
False Positive Rate for Normal connections = 2.82%
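The two summary rates follow directly from the row totals (a sketch; the dictionary layout and function name are my own):

```python
def nn_scheme_rates(normal_row, novel_row):
    # Detection rate: novel-attack connections flagged as anomalies.
    detection = novel_row["anomaly"] / sum(novel_row.values())
    # False positive rate: normal connections flagged as anything but normal.
    total_normal = sum(normal_row.values())
    false_pos = (total_normal - normal_row["normal"]) / total_normal
    return detection, false_pos
```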

27 Novel Attack Detection Using NN Scheme
[Figure: detection results for individual novel attack types]

28 Conclusions
Predictive models specifically designed for rare classes can help improve the detection of small attack types
The SNN clustering based approach shows promise in identifying novel attack types
Simple nearest neighbor based approaches appear capable of detecting anomalies

29 KDDCup’99 Data Set
KDDCup’99 contains derived high-level features
3 groups of features:
Basic features of individual TCP connections (duration, protocol type, service, src & dest bytes, ...)
Content features within a connection suggested by domain knowledge (e.g., # of failed login attempts)
Time-based traffic features of the connection records:
''same host'' features examine only the connections that have the same destination host as the current connection
''same service'' features examine only the connections that have the same service as the current connection

30 1-NN on Anomalies

31 1-NN on Known Attacks

