COMP5331 Outlier Prepared by Raymond Wong Presented by Raymond Wong

COMP5331 Outlier Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

Outlier
Example data (scores in two subjects):
          Computer  History
Raymond   100       40
Louis     90        45
Wyman     20        95
…
Clustering on the Computer and History scores:
Cluster 1 (e.g. High Score in Computer and Low Score in History)
Cluster 2 (e.g. High Score in History and Low Score in Computer)
Outlier (e.g. Low Score in Computer and Low Score in History)
Outlier (e.g. High Score in Computer and High Score in History)
Problem: to find all outliers

Outlier Applications
Fraud Detection: Detect unusual usage of credit cards or telecommunication services
Medical Analysis: Find unusual responses to various medical treatments
Customized Marketing: Customers with extremely low or extremely high incomes
Network: A potential network attack
Software: A potential bug

Outlier
Statistical Model
Distance-based Model
Density-Based Model

Statistical Model
An outlier is an observation that is numerically distant from the rest of the data.
E.g., consider 1-dimensional data. When is a data point considered an outlier?

Statistical Model
Assume the 1-dimensional data follows the normal distribution.
[Figure: density p(x) plotted against x]
P(x > 10000) is a small value, or P(x < 5) is a small value.
Outlier: all values > 10000 or all values < 5
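
The statistical model above can be sketched as follows. This is a minimal illustration, not the lecturer's code: the function name, the sample values, the parameters mu and sigma (assumed to be known here) and the tail probability alpha are all hypothetical. A value is flagged when it falls in either tail of the assumed normal distribution, i.e. where P(x < lower) or P(x > upper) equals alpha.

```python
from statistics import NormalDist

def normal_tail_outliers(values, mu, sigma, alpha=0.001):
    """Return the values lying in either alpha-tail of Normal(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    lower = dist.inv_cdf(alpha)      # P(x < lower) = alpha
    upper = dist.inv_cdf(1 - alpha)  # P(x > upper) = alpha
    return [x for x in values if x < lower or x > upper]

# Hypothetical 1-dimensional data, assumed to follow Normal(50, 2).
scores = [48, 52, 50, 47, 53, 51, 49, 95, 3]
outliers = normal_tail_outliers(scores, mu=50, sigma=2)
print(outliers)
```

If mu and sigma were instead estimated from the same data, extreme values would inflate the estimates, so the thresholds would need more care.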

Statistical Model
Disadvantage: It assumes that the data follows a particular distribution.

Outlier
Statistical Model
Distance-based Model
Density-Based Model

Distance-based Model
Advantage: This model does not assume any distribution.
Idea: A point p is considered an outlier if there are too few data points close to p.

Distance-based Model
Given a point p and a non-negative real number ε, the ε-neighborhood of point p, denoted by N_ε(p), is the set of points q (including point p itself) such that the distance between p and q is within ε.
Given a non-negative integer No and a non-negative real number ε, a point p is said to be an outlier if |N_ε(p)| <= No, i.e. if its ε-neighborhood contains at most No points.
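
The definition above can be sketched in a few lines. This is only an illustration under stated assumptions: Euclidean distance is assumed (the slides do not fix a distance function), and the function names, sample points and parameter values are hypothetical.

```python
import math

def eps_neighborhood(points, p, eps):
    """All points q (including p itself) whose distance to p is within eps."""
    return [q for q in points if math.dist(p, q) <= eps]

def distance_based_outliers(points, eps, n0):
    """Points whose eps-neighborhood contains at most n0 points."""
    return [p for p in points if len(eps_neighborhood(points, p, eps)) <= n0]

# Hypothetical data: one tight cluster of four points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = distance_based_outliers(points, eps=1.5, n0=2)
print(outliers)
```

This brute-force version compares every pair of points, so it takes time quadratic in the number of points.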

Distance-based Model
[Figure: clusters C1 and C2 and point a, with radius ε marked; No = 2]

Distance-based Model
Is the distance-based model “perfect” for finding outliers?

Distance-based Model
[Figure: clusters C1 and C2 and points a and b, with radius ε marked; No = 2]

Outlier
Statistical Model
Distance-based Model
Density-Based Model

Density-Based Model Advantage: This model can find some “local” outliers

Density-Based Model
Idea
[Figure: clusters C1 and C2 and points a and b]
Density is high
Density is low
The ratio of these densities is large => outlier

Density-Based Model
Idea
[Figure: clusters C1 and C2 and points a and b]
Density is high
Density is very low
The ratio of these densities is large => outlier

Density-Based Model
Idea
[Figure: clusters C1 and C2 and points a and b]
Density is high
Density is high
These densities are “similar” => NOT outlier

Density-Based Model
Idea
[Figure: clusters C1 and C2 and points a and b]
Density is high
Density is high
These densities are “similar” => NOT outlier

Density-Based Model
Formal definition
Given an integer k and a point p, N_k(p) is defined to be the ε-neighborhood of p (excluding point p), where ε is the distance between p and its k-th nearest neighbor.
[Figure: points a, b, c, d and e]
N_1(a) = ?
N_2(a) = ?
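
A minimal sketch of N_k(p) as defined above. The coordinates of a to e are invented for illustration, since the positions in the slide's figure are not recoverable from the transcript, so the results below are not answers to the slide's exercise. Euclidean distance and the function name are also assumptions.

```python
import math

def k_neighborhood(points, p, k):
    """N_k(p): points within eps of p, excluding p, where eps is the
    distance from p to its k-th nearest neighbor."""
    others = [q for q in points if q != p]  # also drops exact duplicates of p
    eps = sorted(math.dist(p, q) for q in others)[k - 1]
    return [q for q in others if math.dist(p, q) <= eps]

# Hypothetical coordinates, not those of the slide's figure.
a, b, c, d, e = (0, 0), (1, 0), (2, 1), (3, 2), (5, 3)
points = [a, b, c, d, e]
print(k_neighborhood(points, a, 1))
print(k_neighborhood(points, a, 2))
```

When several points are tied at distance ε, all of them are included, so N_k(p) can contain more than k points.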

Density-Based Model
Reachability Distance of p with respect to o
Given two points p and o and an integer k, Reach_dist_k(p, o) is defined to be max{dist(p, o), ε}, where ε is the distance between p and its k-th nearest neighbor.
[Figure: points a, b, c, d and e, with ε marked; k = 2]
Reach_dist_2(a, b) = ?
Reach_dist_2(a, c) = ?
Reach_dist_2(a, d) = ?
Reach_dist_2(a, e) = ?
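
The reachability distance as defined on this slide can be sketched as below. The coordinates are again hypothetical, and Euclidean distance is assumed. Note that this follows the slide's definition, where ε is measured from p; the original LOF paper by Breunig et al. instead uses the k-distance of o, so library implementations may give different values.

```python
import math

def k_distance(points, p, k):
    """Distance from p to its k-th nearest neighbor (p itself excluded)."""
    return sorted(math.dist(p, q) for q in points if q != p)[k - 1]

def reach_dist(points, p, o, k):
    """Slide definition: max{dist(p, o), eps}, with eps the k-distance of p."""
    return max(math.dist(p, o), k_distance(points, p, k))

# Hypothetical coordinates, not those of the slide's figure.
a, b, c, d, e = (0, 0), (1, 0), (2, 1), (3, 2), (5, 3)
points = [a, b, c, d, e]
for o in (b, c, d, e):
    print(o, reach_dist(points, a, o, k=2))
```

For points inside N_2(a) the result is ε itself; for points farther away it is the actual distance.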

Density-Based Model
The local reachability density of p (denoted by lrd_k(p)) is defined to be 1/ε, where ε is the distance between p and its k-th nearest neighbor.
Why? The average reachability distance of p among all k nearest neighbors is equal to ε, so lrd_k(p) is the inverse of that average.
[Figure: points a, b, c, d and e, with ε marked; k = 2]
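
The "Why?" step can be written out as a short derivation. It assumes the definition of Reach_dist_k from the previous slide, with ε denoting the distance between p and its k-th nearest neighbor, and takes the average over the points of N_k(p).

```latex
% Every o in N_k(p) lies within \varepsilon of p, so the maximum is \varepsilon.
\forall o \in N_k(p):\quad
  \mathrm{dist}(p, o) \le \varepsilon
  \;\Longrightarrow\;
  \mathrm{Reach\_dist}_k(p, o) = \max\{\mathrm{dist}(p, o),\, \varepsilon\} = \varepsilon

% Averaging a constant gives the same constant.
\frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \mathrm{Reach\_dist}_k(p, o)
  = \frac{1}{|N_k(p)|} \cdot |N_k(p)| \cdot \varepsilon
  = \varepsilon

% The local reachability density is the inverse of this average.
\mathrm{lrd}_k(p) = \frac{1}{\varepsilon}
```

A small ε means the k nearest neighbors are close to p, which gives a high local reachability density.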

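Density-Based Model
The local outlier factor (LOF) of a point p is equal to the average, over all points o in N_k(p), of the ratio lrd_k(o) / lrd_k(p):
LOF_k(p) = (1 / |N_k(p)|) * (sum over o in N_k(p) of lrd_k(o) / lrd_k(p))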
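
A minimal sketch of the LOF computation, assuming the standard definition (the average of lrd_k(o) / lrd_k(p) over the points o in N_k(p)) together with the slides' simplified lrd_k(p) = 1/ε. The function names and sample points are hypothetical, Euclidean distance is assumed, and the data must not contain duplicate points (otherwise ε could be 0).

```python
import math

def k_neighborhood(points, p, k):
    """Return (eps, N_k(p)), where eps is the distance from p to its
    k-th nearest neighbor and N_k(p) excludes p itself."""
    others = [q for q in points if q != p]
    eps = sorted(math.dist(p, q) for q in others)[k - 1]
    return eps, [q for q in others if math.dist(p, q) <= eps]

def lrd(points, p, k):
    """Local reachability density under the slides' definition: 1/eps."""
    eps, _ = k_neighborhood(points, p, k)
    return 1.0 / eps

def lof(points, p, k):
    """Average ratio of the neighbors' density to p's own density."""
    _, neighbors = k_neighborhood(points, p, k)
    total = sum(lrd(points, o, k) for o in neighbors)
    return total / (len(neighbors) * lrd(points, p, k))

# Hypothetical data: a tight cluster and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
for p in points:
    print(p, round(lof(points, p, k=2), 3))
```

A value close to 1 means p is about as dense as its neighbors, while a value well above 1 means the neighbors are much denser than p, which marks p as a local outlier.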

Density-Based Model
Idea
[Figure: clusters C1 and C2 and points a and b]
Local reachability density is high
Local reachability density is low
The ratio of these densities is large => outlier