Exploiting the Power of Group Differences to Solve Data Analysis Problems
Outlier & Intrusion Detection
Guozhu Dong, PhD, Professor, CSE
Where Are We Now
Introduction and overview
Preliminaries
Emerging patterns: definitions and mining
Using emerging patterns as features and regression terms
Classification using emerging patterns
Clustering and clustering evaluation using emerging patterns
Outlier and intrusion detection using emerging patterns
Ranking attributes for problems with complex multi-attribute interactions using emerging patterns
Pattern aided regression and classification
Interesting applications of emerging patterns
Emerging Pattern Based Outlier Detection and Intrusion Detection
Preliminaries: outlier detection, intrusion detection
OCLEP: One Class Classification using Length of Emerging Patterns
OCLEP+: an extension of OCLEP
Why EPs are useful for outlier detection: EP length indicates the degree of deviation from normal
Strengths of the method:
One-class training
The method is not model based
No need for a distance function
Patterns reveal properties of possible outliers
Guozhu Dong 2019
What Are Outliers?
Outlier: a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.
Examples: an unusual credit card purchase; Michael Jordan as a basketball player.
The definition of "deviates" is not unique, and different definitions lead to different kinds of outliers.
Applications: credit card fraud detection, telecom fraud detection, intrusion detection, masquerader detection, medical analysis.
Intrusion Detection and Masquerader Detection
Intrusion Detection: detecting unauthorized entry into a protected system.
Network intrusion detection
Host intrusion detection
Masquerader Detection: detecting unauthorized activities inside a protected system.
Both are examples of outlier detection, and both can be handled by one-class classification.
Using Pattern Lengths in Outlier Detection
Emerging patterns represent group differences. [Chen+Dong 2006] proposed an outlier detection method based on the following observations:
Patterns describing differences between similar objects (e.g. objects from the same class) are typically long.
Patterns describing differences between highly different objects (e.g. objects from different classes) are typically short.
Length of a pattern: the number of items (attributes) it contains.
Using Emerging Pattern Lengths in Masquerader Detection
[Chen+Dong 2006] proposed OCLEP, which uses the length of emerging patterns for masquerader detection. The method is based on this observation:
Given an instance t and a set T of normal instances, let BorderDiff(t,T) be the set of minimal jumping emerging patterns that occur in t but not in T.
If t is a normal instance, then the patterns in BorderDiff(t,T) are typically long.
If t is not normal, then the patterns in BorderDiff(t,T) are typically short.
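For instances represented as item sets, the minimal jumping emerging patterns of t vs T are exactly the minimal subsets of t that intersect every difference set t - s for s in T. A brute-force sketch (the function name border_diff is mine, not the paper's; real implementations use the border-differential algorithm rather than subset enumeration):

```python
from itertools import combinations

def border_diff(t, T):
    """Minimal jumping emerging patterns of instance t vs instance set T.

    t: set of items; T: iterable of sets of items.
    A pattern p (a subset of t) "jumps" iff p is contained in no s in T,
    i.e. p intersects every difference set t - s.  Enumerating subsets
    of t by increasing size yields the minimal such patterns.
    """
    diffs = [t - s for s in T]
    if any(not d for d in diffs):  # some s in T contains all of t,
        return []                  # so no jumping emerging pattern exists
    items = sorted(t)
    minimal = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            p = set(combo)
            # skip supersets of an already-found (hence smaller) pattern
            if any(m <= p for m in minimal):
                continue
            if all(p & d for d in diffs):
                minimal.append(p)
    return minimal
```

For t = {a, b, c} and T = {{a, b}, {b, c}}, the only minimal jumping emerging pattern is {a, c}; the pattern length OCLEP uses is simply the number of items in such a pattern.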
Challenges
But we only have one class of data: some normal instances.
We need a way to decide whether an instance s is an outlier or normal just from the lengths of certain emerging patterns for s.
We also need to design repeated experiments to estimate a length threshold for dividing instances into outlier vs normal.
A Solution: Estimating Length of Emerging Patterns for One Instance
Training data: a dataset N of normal instances.
Problem: given an instance t, how do we estimate the length of emerging patterns for t vs N?
Solution: perform BorderDiff(t,T) multiple times.
Each T is a small sample of tuples from N.
For each T, define avgLength = the average length of the jumping emerging patterns in BorderDiff(t,T).
The OCLEP Algorithm
Training data: a dataset N of normal instances.
Preprocessing: transform N into feature vectors if needed.
Training:
Randomly pick m pairs (ti, Ti): ti is a tuple in N, Ti is a small subset of N-{ti}.
For each i, compute the average length, ai, of the minimal jumping emerging patterns in BorderDiff(ti,Ti).
Let a be the minimum, and b the maximum, of a1, a2, ..., am.
Use N to pick a value c (a <= c <= b) as the threshold for prediction, maximizing an objective function on TP and FP.
Example: m = 200 or 500.
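The training loop above can be sketched as follows (a minimal sketch: oclep_train and the plug-in border_diff_avg_len are hypothetical names, and the search for the cutoff c over [a, b] is left to the caller):

```python
import random

def oclep_train(N, border_diff_avg_len, m=200, sample_size=25, seed=0):
    """Estimate the average JEP lengths a_1..a_m on normal data N.

    N: list of normal instances (e.g. sets of items).
    border_diff_avg_len(t, T): average length of the minimal jumping
    emerging patterns in BorderDiff(t, T) -- any implementation works.
    Returns the list of averages and the range (a, b) from which the
    cutoff c is then chosen by maximizing an objective on TP and FP.
    """
    rng = random.Random(seed)
    avgs = []
    for _ in range(m):
        i = rng.randrange(len(N))           # pick t_i from N
        rest = N[:i] + N[i + 1:]            # T_i is a small subset of N - {t_i}
        T = rng.sample(rest, min(sample_size, len(rest)))
        avgs.append(border_diff_avg_len(N[i], T))
    return avgs, (min(avgs), max(avgs))
```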
The OCLEP Algorithm
Testing:
Given t, (*) randomly pick 20 small sets Ti from N.
For each i, compute the average length, bi, of the minimal jumping emerging patterns in BorderDiff(t,Ti).
Let avgL be the average of b1, ..., b20.
If avgL > c then t is deemed normal; otherwise, t is deemed an outlier.
(*): This helps increase the robustness of OCLEP.
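The test-time rule can be sketched as below (names are mine; border_diff_avg_len is the same hypothetical plug-in as at training time):

```python
import random

def oclep_classify(t, N, c, border_diff_avg_len, rounds=20,
                   sample_size=25, seed=0):
    """Classify t as 'normal' or 'outlier' using the trained cutoff c.

    Averages the per-round average JEP length over `rounds` random
    samples T_i drawn from the normal data N; using several T_i rather
    than one is what makes the decision robust.
    """
    rng = random.Random(seed)
    lengths = [border_diff_avg_len(t, rng.sample(N, min(sample_size, len(N))))
               for _ in range(rounds)]
    avgL = sum(lengths) / len(lengths)
    return "normal" if avgL > c else "outlier"
```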
Experiment Data Used for OCLEP
Data: user-command sequences in a computer environment [DuMouchel et al 2001].
Collected sequences of "truncated" commands of 50 users.
Each user is represented by a sequence of 15,000 commands.
The first 5,000 commands of each user are "clean data" (i.e. legitimately issued by the user).
The last 10,000 commands were probabilistically injected with commands issued by 20 users outside the community of the 50.
Goal: decide which blocks are "self" and which are "masquerader" (not done by self).
A block contains 100 commands.
Experiment Setting
'1v49': this is a one-class training experiment setting.
Only the first 5,000 commands of one person are used as training data for a "self".
The first 5,000 commands of the other 49 users (considered as masqueraders), together with the remaining 10,000 commands of the self, are used as testing data.
Several other settings were examined but are not discussed here.
Features for Converting Sequences into Vectors
We studied these six feature types:
(1) Binary. A command is a feature. There are ~870 distinct commands. A command block is represented as a vector of 870 binary values: 0 (absence) or 1 (presence).
(2) Frequency with equal-length binning. The frequency of each command in a command block is transformed into binary format using equal-length binning.
(3) Frequency with equal-density binning.
(4) Pair. Each adjacent command pair is a feature. There are at most 99 such features in a block.
(5) Skip-one-pair. A pair of commands is a feature if they are separated by exactly one command.
(6) Triple. Each adjacent command triple is a feature.
Command blocks (100 commands) can be treated as sets, sequences, or bags.
Example: if c1c2c3 is a subsequence in a block, then (4) c1c2 and c2c3 are features, (5) c1c3 is a feature, and (6) c1c2c3 is a feature.
We report results on "binary" below.
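The set-style feature types (1), (4), (5), and (6) can be sketched as below (the binning variants (2)-(3) are omitted; block_features is a hypothetical name):

```python
def block_features(block, kind="binary"):
    """Turn a command block (a list of command strings) into a feature set.

    kind: 'binary' (each command is a feature),
          'pair'   (adjacent pairs),
          'skip1'  (pairs separated by exactly one command),
          'triple' (adjacent triples).
    """
    if kind == "binary":
        return set(block)
    if kind == "pair":
        return set(zip(block, block[1:]))
    if kind == "skip1":
        return set(zip(block, block[2:]))
    if kind == "triple":
        return set(zip(block, block[1:], block[2:]))
    raise ValueError(f"unknown feature kind: {kind}")
```

For the subsequence c1 c2 c3 this reproduces the slide's example: pairs (c1,c2) and (c2,c3), skip-one-pair (c1,c3), and triple (c1,c2,c3).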
Experiment Results
OCLEP's performance is comparable with that of ocSVM.
Binary features are used.
OCLEP+
[Dong+Pentukar 2017] introduced OCLEP+, with several new techniques:
It uses the minimal length instead of the average length of JEPs.
It also introduces a fixed percentile as the starting cutoff threshold.
OCLEP+: Training
r = 7 by default; ell = 400.
ell is chosen as follows: if N is small, then ell is |N|-1; if N is considerably large, then ell can be any number such as 300 or 500.
For the NSL-KDD dataset we choose ell to be 400.
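A rough sketch of the two new techniques, under my reading of the slides (all names, and the aggregation of the r per-round values by a minimum, are assumptions rather than the paper's exact procedure):

```python
import random

def oclep_plus_score(t, N, min_jep_len, r=7, ell=400, seed=0):
    """Score t by the minimum (not average) JEP length over r rounds.

    Each round draws a sample of ell normal instances from N
    (ell = |N|-1 when N is small; e.g. ell = 400 for NSL-KDD).
    min_jep_len(t, T): length of the shortest minimal jumping emerging
    pattern of t vs T (a hypothetical plug-in).
    """
    rng = random.Random(seed)
    return min(min_jep_len(t, rng.sample(N, min(ell, len(N))))
               for _ in range(r))

def percentile_cutoff(train_scores, pct):
    """Fixed-percentile starting cutoff: the score at (roughly) the
    pct-th percentile of scores computed on the normal training data."""
    s = sorted(train_scores)
    idx = max(0, int(len(s) * pct / 100) - 1)
    return s[idx]
```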
OCLEP+: Testing
r = 7 by default.
Experiment Data for OCLEP+
NSL-KDD: an improved version of the KDDCUP'99 dataset.
KDDCUP'99 includes a wide variety of intrusions simulated in a military network environment. It has been widely used in evaluating intrusion detection algorithms, but it suffers from many disadvantages, such as having many redundant records, which biased results toward algorithms based on the frequency of records.
NSL-KDD addressed KDDCUP'99's disadvantages by removing duplicate records. Moreover, records were selected such that the percentage of records is inversely proportional to their difficulty level, promoting more accurate evaluation of different learning techniques.
Performance of OCLEP+
The NSL-KDD dataset has a KDDTrain+_20Percent file, containing both anomaly and normal instances with 41 features.
We did one-class training: abnormal instances were removed from the file before training.
We used the KDDTest+ file (22,544 instances) to evaluate OCLEP+.
OCLEP+ Experiment Settings
The NSL-KDD dataset provides a KDDTrain+_20Percent file that contains both anomaly and normal instances with 41 features.
We did one-class training; so we removed the abnormal instances from the file and trained our classifiers on the remaining normal instances.
In Experiment B, we used the KDDTest+ file, which contains 22,544 instances, to evaluate OCLEP+.