A Robust Outlier Detection Scheme for Large Data Sets
Jian Tang, Zhixiang Chen, Ada Wai-chee Fu, David Cheung
Presented by David Lopez
3
A Robust Outlier Detection Scheme for Large Data Sets Outlier: Outlier: – An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. D. Hawkins

Recent Detection Schemes
– Distance based
  DB(n, q): an object is an outlier with respect to n and q if its q-neighborhood contains fewer than n objects.
  (t, k) nearest neighbor: ranks objects by the distance to their kth nearest neighbor and reports the top t as outliers.
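
The two distance-based schemes can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the authors' implementation: the function names, the toy `points` list, and the use of Euclidean distance via `math.dist` are assumptions made for the example.

```python
import math

def neighborhood_size(points, i, q):
    """Number of objects other than points[i] within distance q of it."""
    return sum(1 for j, b in enumerate(points)
               if j != i and math.dist(points[i], b) <= q)

def db_outliers(points, n, q):
    """DB(n, q): indices whose q-neighborhood holds fewer than n objects."""
    return [i for i in range(len(points))
            if neighborhood_size(points, i, q) < n]

def k_distance(points, i, k):
    """Distance from points[i] to its kth nearest neighbor (k >= 1)."""
    dists = sorted(math.dist(points[i], b)
                   for j, b in enumerate(points) if j != i)
    return dists[k - 1]

def top_t_knn_outliers(points, t, k):
    """(t, k) nearest neighbor: the t indices with the largest k-distance."""
    ranked = sorted(range(len(points)),
                    key=lambda i: k_distance(points, i, k), reverse=True)
    return ranked[:t]

# Hypothetical toy data: a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(points, n=2, q=1.5))       # [4]
print(top_t_knn_outliers(points, t=1, k=2))  # [4]
```

Both schemes flag the isolated point here, but DB(n, q) needs a radius and a count chosen in advance, while the (t, k) scheme only needs the number of outliers to report.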

Recent Detection Schemes (cont.)
– Density based
  Let p, o be members of D and let k be a positive integer.
  k-distance(o): the distance from o to its kth nearest neighbor.
  Reachability distance of p with respect to o, for a given k:
  $\text{reach-dist}_k(p, o) = \max\{\text{k-distance}(o),\ \text{dist}(p, o)\}$
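
A small sketch of the reachability distance may help show its smoothing effect: for objects close to o, the value is clamped to k-distance(o). The helper names and the sample points are hypothetical, and Euclidean distance is assumed.

```python
import math

def k_distance(points, o, k):
    """Distance from points[o] to its kth nearest neighbor (k >= 1)."""
    dists = sorted(math.dist(points[o], b)
                   for j, b in enumerate(points) if j != o)
    return dists[k - 1]

def reach_dist(points, p, o, k):
    """reach-dist_k(p, o) = max{k-distance(o), dist(p, o)}."""
    return max(k_distance(points, o, k), math.dist(points[p], points[o]))

# Hypothetical points on a line.
pts = [(0, 0), (1, 0), (2, 0), (10, 0)]

# Point 1 is at distance 1 from point 0, but 2-distance(point 0) is 2,
# so the reachability distance is raised to 2.
print(reach_dist(pts, p=1, o=0, k=2))  # 2.0
# A far-away point keeps its true distance.
print(reach_dist(pts, p=3, o=1, k=2))  # 9.0
```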

Recent Detection Schemes (cont.)
– Density based (cont.)
  The local reachability density of p for k, $\text{lrd}_k(p)$, is the inverse of the average reachability distance from p to the objects in its k-distance neighborhood.
  Let $N_k(p)$ stand for $N_{\text{k-distance}(p)}(p)$.
  $\text{lrd}_k(p)$ is defined as:
  $$\text{lrd}_k(p) = \left( \frac{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}{|N_k(p)|} \right)^{-1}$$
  The local outlier factor of p, $\text{LOF}_k(p)$, is the average ratio of the local reachability densities of p's k-distance neighbors to that of p.
  $\text{LOF}_k(p)$ is defined as:
  $$\text{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}$$
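
The verbal definitions of the local reachability density and the local outlier factor translate into the following brute-force sketch. It assumes Euclidean distance and no duplicate points (a duplicate would make a reachability sum zero); the function names and the toy data are illustrative only.

```python
import math

def k_neighbors(points, p, k):
    """Return (indices in the k-distance neighborhood of p, k-distance(p)).

    Objects tied at exactly the k-distance are all included.
    """
    dists = sorted((math.dist(points[p], b), j)
                   for j, b in enumerate(points) if j != p)
    kdist = dists[k - 1][0]
    return [j for d, j in dists if d <= kdist], kdist

def lrd(points, p, k):
    """Inverse of the average reachability distance from p to its neighbors."""
    nbrs, _ = k_neighbors(points, p, k)
    total = sum(max(k_neighbors(points, o, k)[1],
                    math.dist(points[p], points[o])) for o in nbrs)
    return len(nbrs) / total

def lof(points, p, k):
    """Average ratio of the neighbors' lrd to the lrd of p."""
    nbrs, _ = k_neighbors(points, p, k)
    return sum(lrd(points, o, k) for o in nbrs) / (len(nbrs) * lrd(points, p, k))

# Hypothetical toy data: a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(lof(points, 0, k=2))  # 1.0 for a point inside the cluster
print(lof(points, 4, k=2))  # much larger than 1 for the isolated point
```

A value near 1 means the object is about as dense as its neighbors; values well above 1 indicate a point in a sparser region than its neighbors.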

Recent Detection Schemes (cont.)
– Advantages of distance based
– Disadvantages of distance based
– Advantages of density based
– Disadvantages of density based
– Where does this leave us?

A Unified Model for Outliers
– First some terms
  Let $D = \{I_1, \dots, I_N\}$ be a data set in a multi-dimensional space S.
  $N_v(p) = \{b : \text{dist}(p, b) \le v \text{ and } b \ne p\}$ is known as the v-neighborhood of p.
– Some functions
  $d(\cdot) : D \to R^+$
  $m(\cdot) : D \to R^+$
  $F(\cdot, \cdot) : R^+ \times R^+ \to R_0^+$
  $F(m(p), |N_{d(p)}(p)|)$ for every p in D is called an outlier measure on D.
  $d(\cdot)$ and $m(\cdot)$ are known as the characteristic functions.
We can now construct the new functions
– DB(n, q)
  d(p) = q and m(p) = n for all p in D
  F(x, y) = 1 if x > y and 0 otherwise
  The outlier measure function for DB(n, q) is $F(n, |N_q(p)|)$, shortened as $F_1(n, q, p)$:
  $F_1(n, q, p) = 1$ if $n > |N_q(p)|$, and 0 otherwise
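
To make the unified model concrete, the sketch below passes the characteristic functions d and m and the combining function F as plain Python callables, then instantiates DB(n, q) from them. The names `v_neighborhood`, `outlier_measure`, and `db_measure` are hypothetical, objects are identified by their index in the list, and Euclidean distance is assumed.

```python
import math

def v_neighborhood(data, p, v):
    """N_v(p): objects other than data[p] within distance v of it."""
    return [b for j, b in enumerate(data)
            if j != p and math.dist(data[p], b) <= v]

def outlier_measure(data, p, d, m, F):
    """F(m(p), |N_{d(p)}(p)|) for the object with index p."""
    return F(m(p), len(v_neighborhood(data, p, d(p))))

def db_measure(data, p, n, q):
    """F_1(n, q, p): constant characteristic functions, threshold-style F."""
    return outlier_measure(
        data, p,
        d=lambda _p: q,                    # d(p) = q for all p
        m=lambda _p: n,                    # m(p) = n for all p
        F=lambda x, y: 1 if x > y else 0,  # 1 when fewer than n neighbors
    )

# Hypothetical toy data: a unit square plus one far-away point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print([db_measure(data, p, n=2, q=1.5) for p in range(len(data))])
# [0, 0, 0, 0, 1]
```

Because d and m are ordinary functions of p, a scheme whose radius or threshold varies from object to object fits the same interface without changing `outlier_measure`.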

– (t, k) nearest neighbor is a special case of DB(n, q) where
  $q = (\text{k-distance}_t + \text{k-distance}_{t+1}) / 2$
  Outlier function: $F(k, |N_q(p)|)$, written $F_2(t, k, p)$:
  $F_2(t, k, p) = 1$ if $k > |N_q(p)|$, and 0 otherwise
– Density based scheme
  d(p) = k-distance(p)
  $F(x, y) = x / y^2$
  This is the same as $\text{LOF}_k(p)$:
  $F_3(k, p) = \text{LOF}_k(p)$
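
The reduction of the (t, k) scheme to DB(n, q) can be checked numerically. The sketch below assumes that k-distance_t denotes the t-th largest k-distance in the data set (the slide does not spell this out), that the t-th and (t+1)-th largest values differ, and that t is smaller than the number of objects. Function names and data are illustrative.

```python
import math

def k_distance(points, i, k):
    """Distance from points[i] to its kth nearest neighbor (k >= 1)."""
    dists = sorted(math.dist(points[i], b)
                   for j, b in enumerate(points) if j != i)
    return dists[k - 1]

def compare_schemes(points, t, k):
    """Return (top-t set by k-distance, set flagged by DB(k, q), q)."""
    kd = [k_distance(points, i, k) for i in range(len(points))]
    order = sorted(range(len(points)), key=lambda i: kd[i], reverse=True)
    top = set(order[:t])
    # Midpoint between the t-th and (t+1)-th largest k-distances.
    q = (kd[order[t - 1]] + kd[order[t]]) / 2
    flagged = {
        i for i in range(len(points))
        if sum(1 for j, b in enumerate(points)
               if j != i and math.dist(points[i], b) <= q) < k
    }
    return top, flagged, q

# Hypothetical toy data: a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
top, flagged, q = compare_schemes(points, t=1, k=2)
print(top == flagged)  # True
```

An object in the top t has its kth nearest neighbor farther away than q, so fewer than k objects fall inside its q-neighborhood; every other object has at least k objects within q.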

Thoughts on the previous
– For the DB(n, q) outlier model, the characteristic functions take the same values for every object.
– To detect outliers whose neighborhoods possess different kinds of structures, we should use characteristic functions with different values for different structures.
Enhancing the expressive power of a formulation scheme
– Formulation schemes have a tough time describing outliers in terms of a user's intuition
  User's view of an outlier
  Outlier measure function's view of an outlier
– Question to answer: Under the constraint that the multiple patterns of a user's interest for any data set are not available, can we enhance the expressive power of these schemes?

More useful notations
– For any $C \subseteq D$ and $p \in D$
  $\text{dist}_{\max}(C) = \max\{\text{dist}(x, y) : x, y \in C\}$
  $\text{dist}_{\min}(C) = \min\{\text{dist}(x, y) : x, y \in C,\ x \ne y\}$
  $\text{dist}(p, C) = \min\{\text{dist}(p, x) : x \in C\}$
  Any outlier measure function is denoted by O(r, d, p), where $0 \le d \le \text{dist}_{\max}(D)$, $p \in D$, and $r \in \text{Dom}_O(D)$, the domain for the variable r of the function O.
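
The three set distances are straightforward to compute by brute force. The sketch below assumes Euclidean distance and that C contains no duplicate points; the function names are hypothetical.

```python
import math
from itertools import combinations

def dist_max(C):
    """Largest pairwise distance in C (0.0 when C has a single object)."""
    return max((math.dist(x, y) for x, y in combinations(C, 2)), default=0.0)

def dist_min(C):
    """Smallest distance between two distinct objects of C (needs |C| >= 2)."""
    return min(math.dist(x, y) for x, y in combinations(C, 2))

def dist_to_set(p, C):
    """dist(p, C): distance from p to the nearest object of C."""
    return min(math.dist(p, x) for x in C)

# Hypothetical sample set.
C = [(0, 0), (3, 4), (0, 1)]
print(dist_max(C))             # 5.0
print(dist_min(C))             # 1.0
print(dist_to_set((0, 3), C))  # 2.0
```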

Construct the new functions
– For DB(n, q): $O(n, q, p) = F_1(n, q, p)$, where $n \in \text{Dom}_O(D) = \{0, 1, \dots, |D| + 1\}$
– For (t, k) nearest neighbor: $O(t, k, p) = F_2(t, k, p)$, where $t \in \text{Dom}_O(D) = \{1, 2, \dots, |D|\}$
– For the density based scheme: $O(r, k, p) = F_3(k, p)$, where the variable r is not needed

Some definitions
– Definition 1
  Let D be a data set.
  An interpretation of D is a partition $D = D_o \cup D_n$, where $D_o$ and $D_n$ denote the outlier set and the non-outlier set, respectively.
– Definition 2
  Let O(r, q, p) be an outlier measure function and I be an interpretation $D = D_o \cup D_n$.
  1. O(r, q, p) is O-compatible with I if there exist a u > 0 and a sequence $(r_1, q_1), (r_2, q_2), \dots, (r_i, q_i)$ with $i \ge 1$ and $q_1 < \dots < q_i$ such that
     (1.1) $O(r_j, q_j, p) \ge u$ for every $p \in D_o$ and every $1 \le j \le i$
     (1.2) for every $p \in D_n$ there is some $1 \le j \le i$ with $O(r_j, q_j, p) < u$
  2. O(r, q, p) is N-compatible with I if there exist a u > 0 and a sequence $(r_1, q_1), (r_2, q_2), \dots, (r_i, q_i)$ with $i \ge 1$ and $q_1 < \dots < q_i$ such that
     (2.1) for every $p \in D_o$ there is some $1 \le j \le i$ with $O(r_j, q_j, p) \ge u$
     (2.2) $O(r_j, q_j, p) < u$ for every $p \in D_n$ and every $1 \le j \le i$

For O-compatibility, the entire sequence must agree for an object to be an outlier, but one member is enough for it to be a non-outlier.
For N-compatibility, it is the other way around.
Thoughts
– Objective: produce an outlier function that fits the user's intuition.
– An O-compatibility scheme may filter out many objects
– An N-compatibility scheme may let unworthy objects pass through
– So, pick a scheme based on the user's requirements
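
The two notions of compatibility can be sketched as checks of a given witness (a threshold u and a sequence of (r, q) pairs) against an interpretation. The checkers below follow the description above: under O-compatibility every pair must flag an outlier and at least one pair must clear a non-outlier, and under N-compatibility the roles are swapped. The data set is a hypothetical, scaled-down arrangement of a tight cluster, a larger loose cluster, and one outlier; it is not the data set from the paper, and all names are illustrative.

```python
import math

def db_measure(data, r, q, p):
    """O(r, q, p) for DB(r, q): 1 if fewer than r objects lie within q of data[p]."""
    size = sum(1 for j, b in enumerate(data)
               if j != p and math.dist(data[p], b) <= q)
    return 1 if r > size else 0

def is_o_compatible(O, data, outliers, u, seq):
    """All pairs flag each outlier; some pair clears each non-outlier."""
    others = [p for p in range(len(data)) if p not in outliers]
    return (all(O(data, r, q, p) >= u for p in outliers for r, q in seq)
            and all(any(O(data, r, q, p) < u for r, q in seq) for p in others))

def is_n_compatible(O, data, outliers, u, seq):
    """Some pair flags each outlier; all pairs clear each non-outlier."""
    others = [p for p in range(len(data)) if p not in outliers]
    return (all(any(O(data, r, q, p) >= u for r, q in seq) for p in outliers)
            and all(O(data, r, q, p) < u for p in others for r, q in seq))

c1 = [(0, 0), (0.1, 0)]                                  # tight cluster
c2 = [(10, 10), (12, 10), (10, 12), (12, 12), (11, 11)]  # loose cluster
data = c1 + c2 + [(1, 1)]                                # last object is the outlier
outliers = {7}
seq = [(1, 0.5), (4, 3)]                                 # q values increase

print(is_o_compatible(db_measure, data, outliers, 1, seq))       # True
print(is_n_compatible(db_measure, data, outliers, 1, seq))       # False
print(is_o_compatible(db_measure, data, outliers, 1, [(4, 3)]))  # False
```

The last line shows why a sequence is needed here: the single pair (4, 3) flags the tight cluster along with the outlier. These functions only test one supplied witness, so a `False` result does not by itself prove that no suitable u and sequence exist.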

A concrete example
– Consider the data set $D = C_1 \cup C_2 \cup \{o\}$
  Assume $|C_1| = 400$, $|C_2| = 403$
  Assume $\text{dist}_{\min}(C_2) > \text{dist}(o, x_3)$
  Assume $\text{dist}_{\max}(C_1) = \text{dist}(x_1, x_3) \le \text{dist}(o, x_1) < \text{dist}(o, x_2)$
  Assume $\text{dist}(o, C_2) = \text{dist}(o, x_2) = \text{dist}_{\max}(C_2)$

Assertion: Let D be the data set shown in Figure 1(a). Then the DB(n, q) outlier scheme is O-compatible but not N-compatible with I.
Proof: Recall that the outlier measure function O for the DB(r, q) scheme is
$O(r, q, p) = F_1(r, q, p) = 1$ if $r > |N_q(p)|$, and 0 otherwise

We choose u = 1.
Let:
  $q_1 = \text{dist}(o, C_1) = \text{dist}(o, x_1)$, $r_1 = 2$
  $q_2 = \text{dist}(o, C_2) = \text{dist}(o, x_2)$, $r_2 = 402$
Use the properties given in the example to verify that u and the sequence $(r_1, q_1)$, $(r_2, q_2)$ satisfy the condition of Definition 2(1) for the outlier measure function O(r, q, p).
Since $q_1$ […], $O(r_1, q_1, o) = 1 \ge u$.

Since $|C_1| = 400$, o and $x_2$ are on the diagonal line, $x_2$ is the bottom-left corner point of the circle that covers $C_2$, and $\text{dist}_{\max}(C_1)$ […], $O(r_2, q_2, o) = 1 \ge u$.
For any $p \in C_1$, since $q_1 \ge \text{dist}_{\max}(C_1)$, $N_{q_1}(p)$ has all points in $C_1 - \{p\}$ but may or may not have the point o, i.e. $|N_{q_1}(p)| \ge |C_1| - 1 = 399 \ge r_1$. Thus $O(r_1, q_1, p) = 0 < u$ for all $p \in C_1$.
For any $p \in C_2$, since $q_2 = \text{dist}_{\max}(C_2)$, $N_{q_2}(p)$ has all points in $C_2 - \{p\}$, i.e. $|N_{q_2}(p)| \ge |C_2| - 1 = 402 \ge r_2$. Thus $O(r_2, q_2, p) = 0 < u$ for all $p \in C_2$.
It follows that u and the sequence $(r_1, q_1)$, $(r_2, q_2)$ satisfy the O-compatibility conditions (1.1) and (1.2). Therefore, O(r, q, p) is O-compatible.

References
1. Jian Tang, Zhixiang Chen, Ada Wai-chee Fu, David Cheung, "A Robust Outlier Detection Scheme for Large Data Sets"