CS562 – Advanced Topics in Databases

Nikos Kardoulakis

“Distance-based Outlier Detection in Data Streams”
Authors:
Luan Tran, Computer Science Dept., Univ. of Southern California
Liyue Fan, Integrated Media Systems Center, Univ. of Southern California
Cyrus Shahabi, Integrated Media Systems Center, Univ. of Southern California

Contents
Introduction: Outlier Detection Applications, Motivation
Preliminaries
Streaming Outliers Detection: Problem Statement
Related Work
DODDS Algorithms
Experimental Evaluation
Conclusions
Assessment

Outlier Detection Applications
Outlier detection is an important task in many domains, such as:
  Fraud detection
  Network security
  Medical and public health
Data objects arrive in a streaming manner → new challenges arise in:
  Space efficiency (storing the data).
  Time efficiency (processing the data).
Outlier: a data object is considered an outlier if it does not conform to the expected behavior, which corresponds to either noise or anomaly.

Motivation
Distance-based Outlier Detection in Data Streams (DODDS):
  Adopts an unsupervised definition.
  Makes no distributional assumptions on data values.
  Processes data streams in a sliding window, i.e. a set of active data points.
Many algorithms have been proposed using different approaches, both exact and approximate.
A comparative evaluation of the algorithms' performance is lacking.
Need for benchmarking:
  Compare the algorithms using the same platform and datasets.
  Vary stream parameters to examine their behavior.
Gain insights into:
  CPU time for processing each window, including the new slide, the expired slide and outlier detection.
  Peak memory for each window, which includes the data storage and the algorithm-specific structures.
  Effectiveness.

Neighbor
Given a distance threshold R (R > 0), a data point o is a neighbor of data point o′ if the distance between o and o′ is not greater than R. A data point is not considered a neighbor of itself.

Distance Based Outlier
Given a dataset D, a count threshold k (k > 0) and a distance threshold R (R > 0), a distance-based outlier in D is a data point that has fewer than k neighbors in D. A data point that has at least k neighbors is called an inlier.
Example (figure): two outliers with k=4: o1 with 3 neighbors and o2 with 1 neighbor.
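
The definition above can be checked directly with a brute-force scan. The following is a minimal sketch, not any of the algorithms in the paper; the function names and the toy points are assumptions made for illustration, and Euclidean distance is assumed as the distance function.

```python
import math

def neighbors(points, i, R):
    """Indices of the points within distance R of points[i]; a point is not its own neighbor."""
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= R]

def distance_outliers(points, R, k):
    """Indices of the points that have fewer than k neighbors."""
    return [i for i in range(len(points))
            if len(neighbors(points, i, R)) < k]

# Toy data (hypothetical): four points close together and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (5.0, 5.0)]
print(distance_outliers(points, R=1.0, k=2))  # prints [4]
```

Each of the four clustered points has three neighbors, so only the isolated point is reported.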

Data Stream
A data stream is a possibly infinite series of data points ..., o_{n−2}, o_{n−1}, o_n, ..., where data point o_n is received at time o_n.t.

Count-Based VS Time-Based Windows
Time-based window: given data point o_n and a time period T, the time-based window D(n, T) is the set of W_n data points o_{n′}, o_{n′+1}, ..., o_n, with W_n = n − n′ + 1 and o_n.t − o_{n′}.t = T.
Count-based window: given data point o_n and a fixed window size W, the count-based window D_n is the set of W data points o_{n−W+1}, o_{n−W+2}, ..., o_n.

Window slide
S denotes the slide size, which characterizes the speed of the data stream. When new data points arrive, the window slides to incorporate S new data points from the stream, and the oldest S data points are discarded from the current window.

Preceding/Succeeding Neighbor
A data point o is a preceding neighbor of a data point o′ if o is a neighbor of o′ and expires before o′ does.
A data point o is a succeeding neighbor of a data point o′ if o is a neighbor of o′ and o expires in the same slide as o′ or after it.
Example (figure): two consecutive windows with W=6 and S=2. When the new slide with o7 and o8 arrives, window D6 slides, resulting in the expiration of o1 and o2. o8 has one succeeding neighbor, o7, and four preceding neighbors, i.e. o3, o4, o5, o6.

Safe/Unsafe Inlier
An inlier which has at least k succeeding neighbors will never become an outlier in the future → safe inlier.
An inlier which has fewer than k succeeding neighbors may become an outlier when its preceding neighbors expire → unsafe inlier.
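
The distinction can be made concrete with a small sketch. It assumes one-dimensional values, arrival numbers starting at 1, and that the stream is cut into slides of S consecutive points, so that the slide of a point determines when it expires. The function name and the toy window are hypothetical.

```python
def classify_points(window, R, k, S):
    """Label every point of a window as 'outlier', 'safe inlier' or 'unsafe inlier'.

    window: list of (seq, value) pairs, where seq is the 1-based arrival number.
    """
    labels = {}
    for seq, value in window:
        own_slide = (seq - 1) // S
        # Slides of all neighbors of this point (the point itself is excluded).
        neighbor_slides = [(other - 1) // S for other, v in window
                           if other != seq and abs(v - value) <= R]
        # Succeeding neighbors expire in the same slide or later.
        succeeding = sum(1 for s in neighbor_slides if s >= own_slide)
        if len(neighbor_slides) < k:
            labels[seq] = "outlier"
        elif succeeding >= k:
            labels[seq] = "safe inlier"
        else:
            labels[seq] = "unsafe inlier"
    return labels

# Toy window (hypothetical): W=6, S=2, R=1, k=2.
window = [(1, 1.0), (2, 1.2), (3, 1.1), (4, 5.0), (5, 1.3), (6, 0.9)]
labels = classify_points(window, R=1.0, k=2, S=2)
print(labels)
```

Here points 5 and 6 are inliers only thanks to older points, so they are unsafe, while points 1 to 3 already have enough succeeding neighbors to stay inliers until they expire.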

Streaming Outliers Detection: Problem Statement
Given the window size W, the slide size S, the count threshold k, and the distance threshold R, detect the distance-based outliers in every sliding window ..., Dn, Dn+S, ....
Example with W=7, k=3, S=5:
  In D7, o7 has 4 neighbors → inlier.
  In D12, o7 has 0 neighbors → outlier.
[Figure: values of o1 to o12 plotted against times t1 to t12, with the windows D7 and D12, the new slide and the expired slide marked.]
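
As a reference point for the algorithms that follow, the problem statement can be solved naively by recomputing all neighbor counts in every window. The sketch below assumes count-based windows and one-dimensional values; the function name and the stream are made up for illustration and are not the data from the slide.

```python
def naive_dodds(stream, W, S, R, k):
    """Yield (n, outliers) for every count-based window D_n, using 1-based point numbers."""
    for end in range(W, len(stream) + 1, S):
        start = end - W
        outliers = []
        for i in range(start, end):
            count = sum(1 for j in range(start, end)
                        if j != i and abs(stream[i] - stream[j]) <= R)
            if count < k:
                outliers.append(i + 1)
        yield end, outliers

# Hypothetical stream o1..o12: o7 is an inlier in D7 but loses all neighbors in D12.
stream = [10.0, 10.2, 9.8, 10.1, 9.9, 15.0, 10.0, 20.0, 20.1, 19.9, 20.2, 19.8]
results = list(naive_dodds(stream, W=7, S=5, R=1.0, k=3))
print(results)  # prints [(7, [6]), (12, [6, 7])]
```

Every window costs a quadratic number of distance computations here, which is the cost the exact algorithms try to avoid by reusing neighbor information across windows.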

The need for Approximate Algorithms
Exact Algorithms
  Store all data points and the neighborhood information needed for outlier reporting.
  Effective: detect outliers accurately in each window.
  Not applicable when:
    The window does not fit in memory, or only limited memory can be allocated.
    The volume of the data is big and the application fails to process the current window before the next one arrives.
Approximate Algorithms
  Store a subset of the current window and less neighborhood information.
  More efficient than exact algorithms: require less CPU time and memory.
  Less effective than exact algorithms: may yield false alarms.

Related Work(1/4)
Distance Based Anomaly Detection
  Centralized Approach
    Exact Algorithms: Exact-Storm, Abstract-C, LUE - DUE, COD - MCOD, Thresh_LEAP
    Approximate Algorithms: Approx-Storm
  Distributed Approach

Related Work(2/4)
Centralized Approach: data streams are produced in one node.
Exact Algorithms: detect outliers accurately in every sliding window.
  Exact-Storm: reports outliers by storing k preceding neighbors and the number of succeeding neighbors.
  Abstract-C: intuition: the number of windows a point participates in is constant.

Related Work(3/4)
Exact Algorithms (continued)
  LUE – DUE: intuition: only the points which are neighbors of the expired data points need to be updated. DUE is an optimization of LUE.
  MCOD: stores data points in micro-clusters to avoid expensive range queries.
  Thresh_LEAP: uses an index per slide, not per window.
Approximate Algorithms: do not guarantee the exact result.
  Approx-Storm: adapts Exact-Storm with two approximations:
    Reduce the number of data points stored in each window.
    Reduce the space of the neighbor store for each data point.

Related Work(4/4)
Distributed Approach: data points are generated at multiple nodes.
  Histogram-based method for outlier detection in sensor networks (exact algorithm)
    Collects hints (in the form of a histogram) about the data distribution.
    Uses the hints to identify potential outliers.
    Assumes all data points are available at the time of computation and thus is inapplicable to data streams.
  Distributed density estimation (approximate algorithm)
    Framework that computes an approximation of data distributions.
    No guarantee for distance-based outlier detection.

exact-Storm(1/2)
Example with W=5, R=2, k=2, S=2.
[Figure: index over the values of o1 to o7 at times t1 to t7, with windows D5 and D7, and a table listing for each point its preceding neighbors and its number of succeeding neighbors.]

Exact-Storm(2/2)
Time Complexity: O(W log k)
  Uses an index structure.
Space Complexity: O(kW)
  For each data point o, exact-Storm:
    Stores up to k preceding neighbors of o.
    Stores the number of succeeding neighbors of o.
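
The bookkeeping described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes one-dimensional values, a count-based window that slides by one point per insertion, and a linear scan in place of the index structure. The class name and the stream are hypothetical.

```python
from collections import deque

class ExactStormSketch:
    def __init__(self, W, R, k):
        self.W, self.R, self.k = W, R, k
        self.window = deque()  # active points, oldest first
        self.seq = 0           # arrival number of the newest point

    def insert(self, value):
        self.seq += 1
        # Drop the points that have left the window.
        while self.window and self.window[0]["seq"] <= self.seq - self.W:
            self.window.popleft()
        # deque(maxlen=k) keeps only the k most recent preceding neighbors.
        node = {"seq": self.seq, "value": value,
                "prec": deque(maxlen=self.k), "succ": 0}
        for other in self.window:  # stands in for a range query on the index
            if abs(other["value"] - value) <= self.R:
                other["succ"] += 1
                node["prec"].append(other["seq"])
        self.window.append(node)

    def outliers(self):
        oldest = self.seq - self.W + 1
        result = []
        for node in self.window:
            # Expired preceding neighbors are still stored, so filter them here.
            alive = sum(1 for s in node["prec"] if s >= oldest)
            if alive + node["succ"] < self.k:
                result.append(node["seq"])
        return result

stream = [10.0, 10.2, 9.8, 10.1, 9.9, 15.0, 10.0, 20.0, 20.1, 19.9, 20.2, 19.8]
storm = ExactStormSketch(W=7, R=1.0, k=3)
for v in stream[:7]:
    storm.insert(v)
at_7 = storm.outliers()
for v in stream[7:]:
    storm.insert(v)
at_12 = storm.outliers()
print(at_7, at_12)  # prints [6] [6, 7]
```

Keeping only the k most recent preceding neighbors is enough for an exact answer: if any stored neighbor has expired, all older ones have expired as well.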

approx-Storm(1/2)
Example with W=5, R=2, k=2, S=2, p=0.20.
Frac_before: the ratio between the number of o’s preceding neighbors which are safe inliers and the number of safe inliers in the window.
Only pW safe inliers are preserved for each window.
[Figure: index over the values of o1 to o7 at times t1 to t7, with windows D5 and D7 and the safe inliers marked, and a table listing Frac_before and the number of succeeding neighbors per point; o5 and o6 each have Frac_before 0.5 and 1 succeeding neighbor.]
If o.frac_before * (W − t + o.t) + o.sn < k, then o is an outlier.
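
The outlier test of approx-Storm is a one-line estimate: the number of preceding neighbors still alive at time t is approximated by frac_before times the number of active points older than o. A small sketch of that condition follows; the function name and the example numbers are assumptions for illustration.

```python
def approx_storm_is_outlier(frac_before, sn, o_t, t, W, k):
    """Evaluate o.frac_before * (W - t + o.t) + o.sn < k."""
    estimated_preceding = frac_before * (W - t + o_t)
    return estimated_preceding + sn < k

# Hypothetical point that arrived at time 5 with frac_before = 0.5 and one succeeding neighbor.
print(approx_storm_is_outlier(0.5, 1, o_t=5, t=7, W=5, k=2))  # prints False (1.5 + 1 >= 2)
print(approx_storm_is_outlier(0.5, 1, o_t=5, t=9, W=5, k=2))  # prints True (0.5 + 1 < 2)
```

As t grows, the estimated share of surviving preceding neighbors shrinks, so a point with few succeeding neighbors eventually gets reported.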

approx-Storm(2/2)
Time Complexity: O(W)
  Uses an index structure.
Space Complexity: O(W)
  For each data point o, approx-Storm:
    Stores the ratio between the number of o’s preceding neighbors which are safe inliers and the number of safe inliers in the window.
    Stores the number of succeeding neighbors of o.

Exact-Storm VS Approx-Storm
Exact-Storm (exact algorithm):
  Uses an index per window structure to store data points.
  Stores k preceding neighbors and the number of succeeding neighbors of o.
  Not optimal in memory usage and CPU time, since expired preceding neighbors are still stored in the list of o.
  No potential outlier store.
Approx-Storm (approximate algorithm):
  Uses an index per window structure to store data points.
  Stores the ratio between the number of o’s preceding neighbors which are safe inliers and the number of safe inliers in the window, and the number of succeeding neighbors of o.
  Reduced processing time for expired data points, since only a portion of the safe inliers is stored.
  No potential outlier store.

Abstract-C(1/3)
Example with W=3, S=1.
The number of windows that each point participates in is a constant, i.e., W/S.
D3: o3 has 1 neighbor → o2. o2 is also a neighbor in D4 → [1, 1, 0]
D4: o3 has 2 neighbors → o2, o4. o4 is also a neighbor in D5 → [2, 1]
D5: o3 has 2 neighbors → o4, o5 → [2]
[Figure: values of o1 to o5 plotted against times t1 to t5, with windows D3, D4 and D5.]

Abstract-C(2/3)
Exact-Storm:
  Uses an index per window structure to store data points.
  Stores k preceding neighbors and the number of succeeding neighbors of o.
  Not optimal in memory usage and CPU time, since expired preceding neighbors are still stored in the list of o.
  No potential outlier store.
Abstract-C:
  Uses an index per window structure to store data points.
  Stores the number of neighbors of o in every window that o participates in.
  Does not spend time on finding active preceding neighbors for each data point.
  No potential outlier store.

Abstract-C(3/3)
Time Complexity: O(W^2/S)
  Uses an index structure.
  The number of windows that each point participates in is a constant, i.e., W/S.
Space Complexity: O(W^2/S + W)
  The memory requirement heavily depends on the input data stream, i.e., W/S.
  For each data point o, Abstract-C stores the number of neighbors of o in every window that o participates in.
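
The per-window neighbor counts can be maintained as a short list per point that loses its first entry each time the window slides. The sketch below is an illustration under assumptions: one-dimensional values, W divisible by S, a linear scan instead of an index, and toy values chosen so that o3 has the neighbors o2, o4 and o5 but not o1. The function name and the data are hypothetical.

```python
def abstract_c_trace(values, W, S, R):
    """Return {n: {seq: counts}} where counts[j] is the number of neighbors in the j-th window from now."""
    assert W % S == 0
    lifetime = W // S   # number of windows every point participates in
    window = []         # entries: [seq, value, counts]
    snapshots = {}
    for start in range(0, len(values), S):
        # The window slides: every stored point leaves one window behind.
        for entry in window:
            entry[2].pop(0)
        window = [e for e in window if e[2]]  # points with no windows left have expired
        for offset, v in enumerate(values[start:start + S]):
            new = [start + offset + 1, v, [0] * lifetime]
            for old in window:
                if abs(old[1] - v) <= R:
                    # Both points are neighbors in every remaining window of the older one.
                    for j in range(len(old[2])):
                        old[2][j] += 1
                        new[2][j] += 1
            window.append(new)
        n = window[-1][0]
        if n >= W:
            snapshots[n] = {seq: list(counts) for seq, _, counts in window}
    return snapshots

values = [10.0, 3.5, 3.0, 2.5, 3.2]  # o1..o5 (hypothetical values)
snaps = abstract_c_trace(values, W=3, S=1, R=0.6)
print(snaps[3][3], snaps[4][3], snaps[5][3])  # prints [1, 1, 0] [2, 1] [2]
```

A point is an outlier in the current window when the first entry of its list is below k, so no search for active preceding neighbors is needed at report time.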

Direct Update of Events – DUE(1/3)
Example with W=8, k=4.
[Figure: index over points p1 to p9, a table of preceding neighbors and number of succeeding neighbors per point, the outlier list and the event queue. Points are marked as inliers or outliers, and an event time of 1 + 8 = 9 is shown, i.e. the time at which a point that arrived at time 1 expires.]

DUE(2/3) Exact-Storm Abstract-C DUE
Exact-Storm:
  Index per window structure to store data points.
  Stores k preceding neighbors and the number of succeeding neighbors of o.
  Not optimal in memory usage and CPU time, since expired preceding neighbors are still stored in the list of o.
  No potential outlier store.
Abstract-C:
  Index per window structure to store data points.
  Stores the number of neighbors of o in every window that o participates in.
  Does not spend time on finding active preceding neighbors for each data point.
  No potential outlier store.
DUE:
  Index per window structure to store data points.
  Stores the number of succeeding neighbors of o (o.sn) and the k − o.sn most recent preceding neighbors.
  Event queue provides efficient re-evaluation of the data points (extra CPU time and memory to keep it sorted).
  Potential outlier store: event queue, outlier list.

DUE(3/3)
Time Complexity: O(W log W)
  Uses an index structure.
  A priority queue called the event queue stores all unsafe inliers, sorted in increasing order of the smallest expiration time.
  Stores outliers in an outlier list.
Space Complexity: O(kW)
  For each data point o, DUE:
    Stores the number of succeeding neighbors.
    Stores up to k preceding neighbors.
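
The event queue idea can be sketched with a binary heap: each unsafe inlier is queued under the earliest expiration time among its preceding neighbors and is re-evaluated only when that time is reached. This is an illustration under assumptions, not the authors' code; a point that arrives at time a is taken to expire at time a + W, and all names and numbers are hypothetical.

```python
import heapq

class DueEventQueue:
    def __init__(self, W):
        self.W = W
        self._heap = []  # (event_time, point_id), smallest event time first

    def schedule(self, point_id, preceding_arrivals):
        """Queue an unsafe inlier at the expiration time of its oldest preceding neighbor."""
        event_time = min(preceding_arrivals) + self.W
        heapq.heappush(self._heap, (event_time, point_id))
        return event_time

    def due(self, now):
        """Pop the ids of all points whose event time has been reached."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready

def process_events(queue, points, now, k):
    """Re-evaluate the due points; return the ones that have become outliers."""
    new_outliers = []
    for pid in queue.due(now):
        info = points[pid]
        info["prec"] = [a for a in info["prec"] if a + queue.W > now]
        if len(info["prec"]) + info["sn"] < k:
            new_outliers.append(pid)
        elif info["sn"] < k:  # still an unsafe inlier: wait for the next expiration
            queue.schedule(pid, info["prec"])
    return new_outliers

# "prec" holds arrival times of preceding neighbors, "sn" the number of succeeding neighbors.
points = {"p6": {"prec": [1, 2, 4, 5], "sn": 0},
          "p8": {"prec": [1, 2, 4, 5], "sn": 1}}
queue = DueEventQueue(W=8)
for pid, info in points.items():
    queue.schedule(pid, info["prec"])
first = process_events(queue, points, now=9, k=4)
second = process_events(queue, points, now=10, k=4)
print(first, second)  # prints ['p6'] ['p8']
```

At time 9 the neighbor that arrived at time 1 expires: p6 drops to three neighbors and becomes an outlier, while p8 still has four and is rescheduled for time 10.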

Micro-Cluster Based Algorithm – MCOD(1/3)
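Example with k=4.
[Figure: micro-clusters MC1, MC2 and MC3 of radius R/2 over points o1 to o6, a group with cluster size < k + 1, and the PD list of points o1 to o5 with their expiration.]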

MCOD(2/3) Exact-Storm Abstract-C DUE MCOD
Exact-Storm:
  Index per window structure to store data points.
  Stores k preceding neighbors and the number of succeeding neighbors of o.
  Not optimal in memory usage and CPU time, since expired preceding neighbors are still stored in the list of o.
  No potential outlier store.
Abstract-C:
  Index per window structure to store data points.
  Stores the number of neighbors of o in every window that o participates in.
  Does not spend time on finding active preceding neighbors for each data point.
  No potential outlier store.
DUE:
  Index per window structure to store data points.
  Stores the number of succeeding neighbors of o (o.sn) and the k − o.sn most recent preceding neighbors.
  Event queue provides efficient re-evaluation of the data points (extra CPU time and memory to keep it sorted).
  Potential outlier store: event queue, outlier list.
MCOD:
  Micro-clusters.
  Stores micro-clusters, plus the number of succeeding neighbors and k preceding neighbors for points that belong to no cluster.
  Efficient pair-wise distance computations and neighbor information due to micro-clusters (no range queries).
  Potential outlier store: event queue, PD list.

MCOD(3/3)
Time Complexity: O((1 − c)W log((1 − c)W) + kW log k)
  Uses micro-clusters.
  A priority queue called the event queue stores all unsafe inliers, sorted in increasing order of the smallest expiration time.
  Stores data points that belong to no cluster in a list called PD.
Space Complexity: O(cW + (1 − c)kW)
  0 ≤ c ≤ 1 denotes the fraction of the window stored in micro-clusters.
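
The micro-cluster idea can be illustrated with a simplified insertion routine. It assumes one-dimensional values, ignores expiration and the event queue, and forms a new micro-cluster as soon as a point has at least k points of the PD list within R/2. The class name and the data are hypothetical, and this is a sketch of the principle rather than the MCOD algorithm itself.

```python
class MicroClusterSketch:
    def __init__(self, R, k):
        self.R, self.k = R, k
        self.clusters = []  # list of (center, members)
        self.pd = []        # points that belong to no micro-cluster

    def insert(self, x):
        half = self.R / 2
        near = [c for c in self.clusters if abs(x - c[0]) <= half]
        if near:
            # Join the closest micro-cluster; all members are within R of each other,
            # so a member of a cluster with at least k + 1 points is an inlier.
            center, members = min(near, key=lambda c: abs(x - c[0]))
            members.append(x)
            return "cluster"
        close = [y for y in self.pd if abs(x - y) <= half]
        if len(close) >= self.k:
            # Enough nearby points: promote them to a new micro-cluster centered at x.
            for y in close:
                self.pd.remove(y)
            self.clusters.append((x, [x] + close))
            return "new cluster"
        self.pd.append(x)
        return "pd"

mc = MicroClusterSketch(R=1.0, k=2)
steps = [mc.insert(x) for x in [0.0, 0.2, 0.4, 0.5, 3.0]]
print(steps)  # prints ['pd', 'pd', 'new cluster', 'cluster', 'pd']
```

Only the points left in PD need explicit neighbor lists, which is where the factor (1 − c) in the complexities comes from.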

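Thresh_LEAP(1/3)
Example with k=4.
[Figure: slides S1 to S6, each with its own index (e.g. the index for slide S4), and a data point o.]
Probing order: succeeding slides first, preceding slides next.
New probing is needed when S1 and S2 expire.
Each data point stores the number of neighbors in every slide, e.g. for o: [1, 1, 1, 1].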

Thresh_LEAP(2/3) Exact-Storm Abstract-C DUE MCOD Thresh_LEAP
Exact-Storm:
  Index per window structure to store data points.
  Stores k preceding neighbors and the number of succeeding neighbors of o.
  Not optimal in memory usage and CPU time, since expired preceding neighbors are still stored in the list of o.
  No potential outlier store.
Abstract-C:
  Index per window structure to store data points.
  Stores the number of neighbors of o in every window that o participates in.
  Does not spend time on finding active preceding neighbors for each data point.
  No potential outlier store.
DUE:
  Index per window structure to store data points.
  Stores the number of succeeding neighbors of o (o.sn) and the k − o.sn most recent preceding neighbors.
  Event queue provides efficient re-evaluation of the data points (extra CPU time and memory to keep it sorted).
  Potential outlier store: event queue, outlier list.
MCOD:
  Micro-clusters.
  Stores micro-clusters, plus the number of succeeding neighbors and k preceding neighbors for points that belong to no cluster.
  Efficient pair-wise distance computations and neighbor information due to micro-clusters (no range queries).
  Potential outlier store: event queue, PD list.
Thresh_LEAP:
  Index per slide.
  Stores the number of neighbors in every slide and the number of succeeding neighbors.
  Efficient range queries due to the smaller index structures per slide.
  Potential outlier store: trigger list per slide.

Thresh_LEAP(3/3)
Time Complexity: O(W^2 log S / S)
  Uses an index structure per slide.
Space Complexity: O(W^2/S)
  Stores the number of neighbors in every slide and the number of succeeding neighbors.
  Stores a trigger list for each slide.
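
The per-slide probing can be sketched as a loop that stops as soon as k neighbors have been found. The sketch assumes one-dimensional values, a linear scan per slide in place of the per-slide index, and a probing order of the point's own and later slides first, then earlier slides from newest to oldest. The function name and the data are hypothetical.

```python
def leap_probe(point, point_slide, slides, R, k):
    """Probe slides until k neighbors are found.

    point: (id, value); slides: {slide_id: [(id, value), ...]}.
    Returns (neighbors found per probed slide, whether the point is an inlier).
    """
    pid, value = point
    succeeding = sorted(s for s in slides if s >= point_slide)
    preceding = sorted((s for s in slides if s < point_slide), reverse=True)
    evidence = {}
    found = 0
    for sid in succeeding + preceding:
        if found >= k:
            break  # enough neighbors: the remaining slides are never probed
        count = sum(1 for qid, q in slides[sid]
                    if qid != pid and abs(q - value) <= R)
        evidence[sid] = count
        found += count
    return evidence, found >= k

slides = {
    3: [("a", 1.0), ("b", 9.0)],
    4: [("o", 1.1), ("c", 1.2)],
    5: [("d", 0.9), ("e", 7.0)],
    6: [("f", 1.3), ("g", 8.0)],
}
full = leap_probe(("o", 1.1), 4, slides, R=0.5, k=4)
short = leap_probe(("o", 1.1), 4, slides, R=0.5, k=2)
print(full)   # ({4: 1, 5: 1, 6: 1, 3: 1}, True)
print(short)  # ({4: 1, 5: 1}, True)
```

With k=4 every slide must be probed, giving one neighbor per slide as in the example [1, 1, 1, 1]; with k=2 the probing stops after two slides.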

Experimental Evaluation
All algorithms were implemented in Java and the experiments were carried out on the same system.
Four real-world datasets were used.
Evaluation was based on varying the following parameters that affect the algorithms’ performance:
  Window size W
  Slide size S
  Neighbor count threshold k
  Distance threshold R
  Dimensionality
The default value of k for all datasets is 50.

Varying Window Size(1/3)
CPU Time
W increases → more data points to process in each window.
The CPU time of each algorithm is expected to increase.
MCOD is faster, as a large portion of the data, i.e. the inliers, can be stored in micro-clusters, where range queries are performed more efficiently than with index structures.

Peak Memory
W increases → more data points and their neighborhood information to store in each window.
The peak memory of each algorithm is expected to increase.
MCOD requires the lowest memory due to micro-clusters: one micro-cluster can efficiently capture the neighborhood information for each data point in the same cluster.

Varying Slide Size(1/2) CPU Time
S increases → more data points arrive and expire at the same time.
When S = W, every data point resides in only one window.
As S increases from 1%W to 50%W, the CPU time of all algorithms (except Thresh_LEAP) increases.
As S increases from 50%W to 100%W, the CPU time drops, since when S = W the index of each window can be discarded at once.
In Thresh_LEAP, the CPU time is high when S is small (10%W – 20%W), since more slides need to be probed. As S increases, fewer slides need to be probed and the trigger list of each slide becomes shorter.

Varying Slide Size(2/2) Peak Memory
When S increases, the memory cost of all algorithms decreases.
When S = W, all the algorithms show similar memory consumption, since no data point has any preceding neighbors and every data point participates in only one window.
MCOD shows superior performance thanks to micro-clusters: one micro-cluster can efficiently capture the neighborhood information for each data point in the same cluster.

Varying k(1/2)
k is an important parameter affecting the outlier rate and the space requirement for the neighbor store.
CPU Time
The CPU time of most algorithms does not vary a lot; these algorithms do not heavily depend on k.
The CPU time of Abstract-C is stable, since its neighbor store depends only on W and S.
The CPU time of MCOD increases when k increases, since fewer points fall in micro-clusters.
Thresh_LEAP needs to probe more slides to find k neighbors as k increases.

Varying k(2/2) Peak Memory
An increasing storage requirement is expected, since the storage of neighbors depends on k.
The memory requirement of Abstract-C is stable, since its neighbor store depends only on W and S.
MCOD’s memory requirement increases as k increases, because a larger number of data points are not in any micro-cluster. It is still superior to the other algorithms.

Varying R(1/2) CPU Time
When R increases, every data point has more neighbors and the number of outliers decreases.
MCOD: when R initially increases, the CPU time is high due to sorting, adding and removing many points in the PD list. As R continues to increase, the CPU time drops, since the number of points in micro-clusters increases.
Thresh_LEAP: when R initially increases, the CPU time is high because each point has a longer trigger list. When R increases further, the CPU time decreases, since each point can find its neighbors in a small number of probes, which results in a shorter trigger list.

Varying R(2/2) Peak Memory
For each algorithm, when R first increases, the memory requirement increases.
When R increases further, the memory requirement decreases:
  Shorter trigger lists in Thresh_LEAP.
  More data points in micro-clusters in MCOD.

Varying Dimensionality(1/2)
Vary the input data dimensionality (D) to analyze its impact on performance.
With the FC dataset, D is varied from 1 to 55.
10 runs were used, with D randomly selected attributes each time, and the results were averaged.
When D increases, the distance computation requires more time.
Larger distances lead to increased outlier rates.

Varying Dimensionality(2/2)
CPU Time
The CPU time of exact-Storm, DUE and Abstract-C decreases as D increases: each data point has fewer neighbors and less information needs to be updated.
MCOD’s CPU time increases as D increases: fewer points fall in micro-clusters because of the larger distances.
Thresh_LEAP’s CPU time increases as D increases: each data point has to probe more slides to find neighbors.
Peak Memory
When D increases, the memory consumption of all algorithms increases, since the storage for the attribute values of the data points increases.

Approximate Solution(1/2)
Up to pW safe inliers are preserved for each window.
The parameter p is varied from 0.01 to 1.
approx-Storm may miss or falsely report some outliers.
Precision and recall are reported for TAO and Gauss, with the ground truth provided by exact-Storm.
approx-Storm was compared to exact-Storm, MCOD and Thresh_LEAP, with an optimal value of p for each dataset.
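
Precision and recall of an approximate detector against the exact result can be computed per window as sketched below. The function name and the example sets are hypothetical; treating an empty set as a score of 1.0 is an assumption of this sketch.

```python
def precision_recall(reported, truth):
    """Precision and recall of the reported outliers against the exact outliers."""
    reported, truth = set(reported), set(truth)
    true_positives = len(reported & truth)
    precision = true_positives / len(reported) if reported else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall

# Hypothetical window: the exact algorithm reports 4 outliers, the approximate one 3.
precision, recall = precision_recall(reported=[3, 7, 10], truth=[3, 7, 9, 12])
print(round(precision, 3), recall)  # prints 0.667 0.5
```

A missed outlier lowers recall, while a falsely reported one lowers precision.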

Approximate Solution(2/2)
The CPU time of approx-Storm is superior to exact-Storm’s and comparable to MCOD’s and Thresh_LEAP’s: approx-Storm does not have to update the neighbor information for each data point when the window slides.
approx-Storm has the lowest peak memory requirement: for each data point o, only two numbers, o.sn and o.frac_before, are stored as neighbor information.

Conclusions
Comprehensive comparative evaluation of the state-of-the-art algorithms for distance-based outlier detection in data streams.
Evaluated the CPU time and peak memory of five exact algorithms and one approximate algorithm with various datasets and stream settings.
Concluded that MCOD provides superior performance across multiple datasets in most streaming settings by utilizing the micro-cluster structure.
Future Work
  Storing several consecutive slides in Thresh_LEAP.
  Considering the data expiration time in micro-clusters in MCOD.
  Designing hybrid approaches, like MCOD combined with Thresh_LEAP.
  Optimizing the processing of new/expired slides when the window slides.
  Designing approximate DODDS solutions.
  Designing DODDS solutions in a decentralized setting.

Assessment
Motivation: The paper tackles the problem of Distance-Based Outlier Detection in Data Streams.
Soundness: The evaluation used 4 real-world datasets and produced high-quality results. However:
  It would be interesting to include approximate versions of more exact algorithms (not only exact/approx Storm).
  What about the distributed approach for DODDS? Only a brief reference is made to it and no algorithm is provided. Distributed solutions are the next step, since they are more agile and able to scale to larger sizes.
Novelty: This is the first approach that provides benchmarking of the state-of-the-art algorithms for DODDS.
Technical Depth: Easy to follow.
Presentation: Enough examples and figures. Examples are provided with plots and bar graphs.

Thank You! Questions?
