A Robust Outlier Detection Scheme for Large Data Sets Jian Tang Zhixiang Chen Ada Wai-chee Fu David Cheung Presented By David Lopez.



Outlier
– An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. (D. Hawkins)

Recent Detection Schemes
– Distance Based
DB(n, q): if an object's q-neighborhood contains fewer than n objects, then it is called an outlier with respect to n and q.
(t, k) nearest neighbor: ranks as outliers the top t objects with the maximum distance to their kth nearest neighbors.

Recent Detection Schemes (cont.)
– Density Based
Let p, o be members of D and let k be a positive integer.
k-distance(o): the distance from o to its kth nearest neighbor.
Reachability distance of p with respect to k: reach-dist_k(p, o) = max { k-distance(o), dist(p, o) }

Recent Detection Schemes (cont.)
– Density Based (cont.)
The local reachability density of p for k, lrd_k(p), is the inverse of the average reachability distance from p to the objects in its k-distance neighborhood.
Let N_k(p) stand for N_{k-distance(p)}(p).
lrd_k(p) is defined as: lrd_k(p) = |N_k(p)| / Σ_{o ∈ N_k(p)} reach-dist_k(p, o)
The local outlier factor of p, LOF_k(p), is just the average ratio of the reachability densities of p's k-distance neighbors to that of p.
LOF_k(p) is defined as: LOF_k(p) = ( Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p) ) / |N_k(p)|
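The lrd/LOF definitions above translate almost directly into code. A minimal brute-force sketch (standard LOF as on this slide, assuming distinct points and Euclidean distance; no indexing or caching, so it recomputes neighborhoods freely):

```python
from math import dist

def k_distance_and_neighbors(p, points, k):
    """Return (k-distance(p), N_k(p)): the distance to p's kth nearest
    neighbor and every other point within that distance."""
    others = sorted((b for b in points if b != p), key=lambda b: dist(p, b))
    kd = dist(p, others[k - 1])
    return kd, [b for b in others if dist(p, b) <= kd]

def reach_dist(p, o, points, k):
    """reach-dist_k(p, o) = max{ k-distance(o), dist(p, o) }."""
    kd_o, _ = k_distance_and_neighbors(o, points, k)
    return max(kd_o, dist(p, o))

def lrd(p, points, k):
    """Inverse of the average reachability distance from p to N_k(p)."""
    _, nbrs = k_distance_and_neighbors(p, points, k)
    return len(nbrs) / sum(reach_dist(p, o, points, k) for o in nbrs)

def lof(p, points, k):
    """Average ratio of the neighbors' lrd to p's own lrd."""
    _, nbrs = k_distance_and_neighbors(p, points, k)
    return sum(lrd(o, points, k) for o in nbrs) / (len(nbrs) * lrd(p, points, k))

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(lof((0, 0), data, 2))    # -> 1.0 (inside the cluster)
print(lof((10, 10), data, 2))  # -> about 13 (isolated point)
```

LOF near 1 means p is about as dense as its neighbors; values well above 1 mark local outliers.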

Recent Detection Schemes (cont.)
– Advantages of Distance Based
– Disadvantages of Distance Based
– Advantages of Density Based
– Disadvantages of Density Based
– Where does this leave us?

A Unified Model for Outliers
– First, some terms
Let D = {I_1, …, I_N} be a data set in a multi-dimensional space S.
N_v(p) = { b : dist(p, b) <= v and b != p } is known as the v-neighborhood of p.
– Some functions
d( ) : D → R+
m( ) : D → R+
F( ) : R+ x R+ → R0+
F(m(p), |N_{d(p)}(p)|) for every p in D is called an outlier measure on D.
d( ) and m( ) are known as the characteristic functions.
– We can now construct the new functions. For DB(n, q):
d(p) = q and m(p) = n for all p in D
F(x, y) = 1 if x > y and 0 otherwise
The outlier measure function for DB(n, q) is F(n, |N_q(p)|), shortened as F_1(n, q, p):
F_1(n, q, p) = 1 if n > |N_q(p)|, 0 otherwise
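The unified model says a scheme is fixed by its characteristic functions d( ), m( ) and a combining function F. A small sketch of that factoring, with DB(n, q) plugged in as an instance (names are illustrative):

```python
from math import dist

def neighborhood_size(p, points, v):
    """|N_v(p)|: the number of objects other than p within distance v."""
    return sum(1 for b in points if b != p and dist(p, b) <= v)

def outlier_measure(p, points, d_fn, m_fn, F):
    """The unified measure F(m(p), |N_{d(p)}(p)|)."""
    return F(m_fn(p), neighborhood_size(p, points, d_fn(p)))

# DB(n, q) as an instance: d(p) = q, m(p) = n, F(x, y) = 1 if x > y else 0.
n, q = 2, 2.0
F1 = lambda x, y: 1 if x > y else 0
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
flags = [outlier_measure(p, data, lambda _: q, lambda _: n, F1) for p in data]
print(flags)  # -> [0, 0, 0, 0, 1]
```

Swapping in a per-object d(p) (e.g. d(p) = k-distance(p)) changes the scheme without touching the framework, which is the point of the model.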

– (t, k) nearest neighbor is just a special case of DB(n, q) where
q = ( k-distance_t + k-distance_{t+1} ) / 2
Outlier function: F(k, |N_{( k-distance_t + k-distance_{t+1} ) / 2}(p)|), written F_2(t, k, p):
F_2(t, k, p) = 1 if t > |N_{( k-distance_t + k-distance_{t+1} ) / 2}(p)|, 0 otherwise
– Density based scheme
d(p) = k-distance(p)
F(x, y) = x / y^2
This is the same as LOF_k(p): F_3(k, p) = LOF_k(p)

Thoughts on the previous
– For the DB(n, q) outlier model, the characteristic functions do not change as objects change.
– To detect outliers whose neighborhoods possess different kinds of structures, we should use characteristic functions with different values for different structures.
Enhancing the expressive power of a formulation scheme
– Formulation schemes have a tough time describing outliers in terms of a user's intuition:
the user's view of an outlier
vs. the outlier measure function's view of an outlier
– Question to answer: under the constraint that the multiple patterns of a user's interest for any data set are not available, can we enhance the expressive power of these schemes?

More useful notations
– For any C subset of D and p member of D:
dist_max(C) = max { dist(x, y) : x and y are members of C }
dist_min(C) = min { dist(x, y) : x and y are members of C and x != y }
dist(p, C) = min { dist(p, x) : x member of C }
Any outlier measure function is denoted by O(r, d, p), where 0 <= d <= dist_max(D), p is a member of D, and r is a member of Dom_O(D), the domain for the variable r of the function O.
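The three set-distance notations are one-liners; a small sketch for concreteness (illustrative names, Euclidean distance assumed):

```python
from itertools import combinations
from math import dist

def dist_max(C):
    """max pairwise distance within C (C needs at least two points)."""
    return max(dist(x, y) for x, y in combinations(C, 2))

def dist_min(C):
    """min pairwise distance between distinct members of C."""
    return min(dist(x, y) for x, y in combinations(C, 2))

def dist_to_set(p, C):
    """dist(p, C): distance from p to its nearest member of C."""
    return min(dist(p, x) for x in C)

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(dist_max(square))             # -> 1.414... (the diagonal)
print(dist_min(square))             # -> 1.0 (a side)
print(dist_to_set((3, 0), square))  # -> 2.0
```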

Construct the new functions
– For DB(n, q): O(n, q, p) = F_1(n, q, p), where n is a member of Dom_O(D) = {0, 1, …, |D| + 1}
– For (t, k) nearest neighbor: O(t, k, p) = F_2(t, k, p), where t is a member of Dom_O(D) = {1, 2, …, |D|}
– For the density based scheme: O(r, k, p) = F_3(k, p), where the r variable is not needed

Some definitions
– Definition 1
Let D be a data set. An interpretation of D is a partition D = D_o U D_n, where D_o and D_n denote the outlier set and non-outlier set, respectively.
– Definition 2
Let O(r, q, p) be an outlier measure function and I be an interpretation D = D_o U D_n.
1. O(r, q, p) is O-compatible with I if there exists a u > 0 and a sequence (r_1, q_1), (r_2, q_2), …, (r_i, q_i) with i >= 1 and q_1 < … < q_i such that (1.1) O(r_j, q_j, p) >= u for all 1 <= j <= i whenever p is in D_o, and (1.2) O(r_j, q_j, p) < u for some 1 <= j <= i whenever p is in D_n.
2. O(r, q, p) is N-compatible with I if there exists a u > 0 and a sequence (r_1, q_1), (r_2, q_2), …, (r_i, q_i) with i >= 1 and q_1 < … < q_i such that O(r_j, q_j, p) >= u for some 1 <= j <= i whenever p is in D_o, and O(r_j, q_j, p) < u for all 1 <= j <= i whenever p is in D_n.

For O-compatibility, the entire sequence must consent for the object to be an outlier, but one member is enough for it to be a non-outlier. For N-compatibility, it's just the other way around.
Thoughts
– Objective: to produce an outlier function that fits the user's intuition.
– An O-compatibility scheme may filter out many objects.
– An N-compatibility scheme may allow unworthy objects to pass through.
– So, pick a scheme based upon the user's requirements.
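The all/some quantifier swap between the two notions can be checked mechanically. A sketch, assuming the interpretation is given as explicit sets and O is any callable O(r, q, p) (names are illustrative):

```python
def o_compatible(O, seq, u, outliers, non_outliers):
    """O-compatibility: every (r_j, q_j) must flag each outlier
    (O >= u for all j), while a single member of the sequence is
    enough to clear each non-outlier (O < u for some j)."""
    return (all(O(r, q, p) >= u for p in outliers for r, q in seq) and
            all(any(O(r, q, p) < u for r, q in seq) for p in non_outliers))

def n_compatible(O, seq, u, outliers, non_outliers):
    """N-compatibility: the quantifiers swapped — one member is enough
    to flag an outlier, the whole sequence must clear a non-outlier."""
    return (all(any(O(r, q, p) >= u for r, q in seq) for p in outliers) and
            all(O(r, q, p) < u for p in non_outliers for r, q in seq))

# Toy measure: 'o' is flagged only at the first radius in the sequence.
O = lambda r, q, p: 1 if (p == 'o' and q == 1) else 0
seq, u = [(2, 1), (3, 2)], 1
print(n_compatible(O, seq, u, {'o'}, {'a', 'b'}))  # -> True
print(o_compatible(O, seq, u, {'o'}, {'a', 'b'}))  # -> False (only some j flag 'o')
```

The toy run illustrates the asymmetry the slide describes: a measure that flags the outlier at only one scale is N-compatible but not O-compatible.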

A concrete example:
– Consider the data set D = C_1 U C_2 U {o}
Assume |C_1| = 400, |C_2| = 403
Assume dist_min(C_2) > dist(o, x_3)
Assume dist_max(C_1) = dist(x_1, x_3) <= dist(o, x_1) < dist(o, x_2)
Assume dist(o, C_2) = dist(o, x_2) = dist_max(C_2)

Assertion: Let D be the data set shown in Figure 1(a). Then the DB(n, q) outlier scheme is O-compatible but not N-compatible with I.
Proof: Recall that the outlier measure function O for the DB(r, q) scheme is
O(r, q, p) = F_1(r, q, p) = 1 if r > |N_q(p)|, 0 otherwise

We choose u = 1. Let:
q_1 = dist(o, C_1) = dist(o, x_1), r_1 = 2
q_2 = dist(o, C_2) = dist(o, x_2), r_2 = 402
Use the properties given in the example to verify that u and the sequence of (r_1, q_1) and (r_2, q_2) satisfy the condition of Definition 2(1) for the outlier measure function O(r, q, p). Since q_1 < q_2 and, by the assumptions above, |N_{q_1}(o)| < r_1 and |N_{q_2}(o)| < r_2, we have O(r_1, q_1, o) = O(r_2, q_2, o) = 1 >= u.

Since |C_1| = 400, o and x_2 are on the diagonal line, and x_2 is the bottom left corner point of the circle that covers C_2, we have dist_max(C_1) <= q_1 < q_2. For any p member of C_1, since dist_max(C_1) <= q_1, N_{q_1}(p) has all points in C_1 – {p}, but may or may not have the point o, i.e. |N_{q_1}(p)| >= |C_1| – 1 = 399 >= r_1; thus O(r_1, q_1, p) = 0 < u. For any p member of C_2, since dist_max(C_2) = q_2, N_{q_2}(p) has all points in C_2 – {p}, i.e. |N_{q_2}(p)| >= |C_2| – 1 = 402 >= r_2; thus O(r_2, q_2, p) = 0 < u. It follows that u and the sequence of (r_1, q_1) and (r_2, q_2) satisfy the O-compatibility conditions (1.1) and (1.2). Therefore, O(r, q, p) is O-compatible.
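The counting argument in this proof can be checked mechanically. A small sketch that plugs the proof's neighborhood-size bounds into F_1; the neighborhood counts for o at each radius (only x_1 at radius q_1; C_1 plus x_2, i.e. 401 objects, at radius q_2) follow from the figure's geometry and are assumed here rather than computed:

```python
def F1(r, nbr_count):
    """DB scheme measure: 1 when r > |N_q(p)|, else 0."""
    return 1 if r > nbr_count else 0

C1_size, C2_size = 400, 403
r1, r2, u = 2, 402, 1

# o is flagged at both radii: it sees 1 object at q1, 401 at q2.
assert F1(r1, 1) >= u
assert F1(r2, C1_size + 1) >= u
# Any p in C1 is cleared at (r1, q1): it sees at least |C1| - 1 = 399 objects.
assert F1(r1, C1_size - 1) == 0  # 0 < u
# Any p in C2 is cleared at (r2, q2): it sees at least |C2| - 1 = 402 objects.
assert F1(r2, C2_size - 1) == 0  # 0 < u
print("O-compatibility conditions (1.1) and (1.2) hold for this sequence")
```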

References:
1. Jian Tang, Zhixiang Chen, Ada Wai-chee Fu, David Cheung, "A Robust Outlier Detection Scheme for Large Data Sets".