Graph preprocessing. Framework for validating data cleaning techniques on binary data.

Slides:



Advertisements
Similar presentations
Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.
Advertisements

Data Mining Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Minqi Zhou © Tan,Steinbach, Kumar Introduction to Data Mining.
Efficient Density-Based Clustering of Complex Objects Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle University of Munich Institute for Computer.
Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Stephen D. Bay 1 and Mark Schwabacher 2 1 Institute for.
DBSCAN – Density-Based Spatial Clustering of Applications with Noise M.Ester, H.P.Kriegel, J.Sander and Xu. A density-based algorithm for discovering clusters.
OPTICS: Ordering Points To Identify the Clustering Structure Mihael Ankerst, Markus M. Breunig, Hans- Peter Kriegel, Jörg Sander Presented by Chris Mueller.
Qiang Yang Adapted from Tan et al. and Han et al.
Clustering Prof. Navneet Goyal BITS, Pilani
More on Clustering Hierarchical Clustering to be discussed in Clustering Part2 DBSCAN will be used in programming project.
Chapter 3: Cluster Analysis
Intelligent Database Systems Lab N.Y.U.S.T. I. M. local-density based spatial clustering algorithm with noise Presenter : Lin, Shu-Han Authors : Lian Duan,
Efficient Anomaly Monitoring over Moving Object Trajectory Streams joint work with Lei Chen (HKUST) Ada Wai-Chee Fu (CUHK) Dawei Liu (CUHK) Yingyi Bu (Microsoft)
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 12 —
Cluster Analysis.
SAK 5609 DATA MINING Prof. Madya Dr. Md. Nasir bin Sulaiman
An Introduction to Clustering
A Robust Outlier Detection Scheme for Large Data Sets Jian Tang Zhixiang Chen Ada Wai-chee Fu David Cheung Presented By David Lopez.
Overview Of Clustering Techniques D. Gunopulos, UCR.
Cluster Analysis.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Anomaly Detection. Anomaly/Outlier Detection  What are anomalies/outliers? The set of data points that are considerably different than the remainder.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by.
The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.
An Unbiased Distance-based Outlier Detection Approach for High Dimensional Data DASFAA 2011 By Hoang Vu Nguyen, Vivekanand Gopalkrishnan and Ira Assent.
Modeling and Finding Abnormal Nodes (chapter 2) 駱宏毅 Hung-Yi Lo Social Network Mining Lab Seminar July 18, 2007.
Clustering Part2 BIRCH Density-based Clustering --- DBSCAN and DENCLUE
Outlier Detection Using k-Nearest Neighbour Graph Ville Hautamäki, Ismo Kärkkäinen and Pasi Fränti Department of Computer Science University of Joensuu,
A Vertical Outlier Detection Algorithm with Clusters as by-product Dongmei Ren, Imad Rahal, and William Perrizo Computer Science and Operations Research.
1 CSE 980: Data Mining Lecture 17: Density-based and Other Clustering Algorithms.
Outlier Detection Lian Duan Management Sciences, UIOWA.
Density-Based Clustering Algorithms
RDF: A Density-based Outlier Detection Method Using Vertical Data Representation Dongmei Ren, Baoying Wang, William Perrizo North Dakota State University,
October 27, 2015Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — Slides for Textbook — — Chapter 7 — ©Jiawei Han and Micheline.
1 Clustering Sunita Sarawagi
Han/Eick: Clustering II 1 Clustering Part2 continued 1. BIRCH skipped 2. Density-based Clustering --- DBSCAN and DENCLUE 3. GRID-based Approaches --- STING.
Topic9: Density-based Clustering
Han/Eick: Clustering II 1 Clustering Part2 continued 1. BIRCH skipped 2. Density-based Clustering --- DBSCAN and DENCLUE 3. GRID-based Approaches --- STING.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by.
Data Mining Anomaly Detection © Tan,Steinbach, Kumar Introduction to Data Mining.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Lecture 7: Outlier Detection Introduction to Data Mining Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
Guided Learning for Role Discovery (GLRD) Presented by Rui Liu Gilpin, Sean, Tina Eliassi-Rad, and Ian Davidson. "Guided learning for role discovery (glrd):
COMP5331 Outlier Prepared by Raymond Wong Presented by Raymond Wong
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP Research Seminar BCB 713 Module Spring 2011 Wei Wang.
Presented by Ho Wai Shing
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Density-Based Clustering Methods. Clustering based on density (local cluster criterion), such as density-connected points Major features: –Discover clusters.
Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar.
Ch. Eick: Introduction to Hierarchical Clustering and DBSCAN 1 Remaining Lectures in Advanced Clustering and Outlier Detection 2.Advanced Classification.
RDF: A Density-based Outlier Detection Method Using Vertical Data Representation Dongmei Ren, Baoying Wang, William Perrizo North Dakota State University,
Clustering By : Babu Ram Dawadi. 2 Clustering cluster is a collection of data objects, in which the objects similar to one another within the same cluster.
Parameter Reduction for Density-based Clustering on Large Data Sets Elizabeth Wang.
Privacy Preserving Outlier Detection using Locality Sensitive Hashing
DATA MINING: CLUSTER ANALYSIS (3) Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
Mining Top-n Local Outliers in Large Databases Author: Wen Jin, Anthony K. H. Tung, Jiawei Han Advisor: Dr. Hsu Graduate: Chia- Hsien Wu.
1 Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Density-Based.
Anomaly Detection Carolina Ruiz Department of Computer Science WPI Slides based on Chapter 10 of “Introduction to Data Mining” textbook by Tan, Steinbach,
Distance-Based Outlier Detection: Consolidation and Renewed Bearing Gustavo Henrique Orair Federal University of Minas Gerais Wagner Meira Jr. Federal.
Anomaly Detection CS 5323 Lecture 14
Dr. Hongqin FAN Department of Building and Real Estate
More on Clustering in COSC 4335
Machine Learning University of Eastern Finland
Lecture Notes for Chapter 9 Introduction to Data Mining, 2nd Edition
Data Mining Anomaly Detection
Outlier Discovery/Anomaly Detection
Data Mining Anomaly/Outlier Detection
University of Crete Department Computer Science CS-562
Presentation transcript:

Graph preprocessing

Framework for validating data cleaning techniques on binary data

Motivation & Problem Statement

Data cleaning techniques at the data analysis stage

Distance based Outlier Detection * Knorr, Ng,Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB98 ** S. Ramaswamy, R. Rastogi, S. Kyuseok: Efficient Algorithms for Mining Outliers from Large Data Sets, ACM SIGMOD Conf. On Management of Data, 2000.

Nearest Neighbour Based Techniques

*Knorr, Ng,Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB98

Distance based approaches

Data cleaning techniques at the data analysis stage

Local Outlier Factor (LOF)* For each data point q compute the distance to the k-th nearest neighbor (k- distance) Compute reachability distance (reach-dist) for each data example q with respect to data example p as: o reach-dist(q, p) = max{k-distance(p), d(q,p)} Compute local reachability density (lrd) of data example q as inverse of the average reachabaility distance based on the MinPts nearest neighbors of data example q o lrd(q) = Compaute LOF(q) as ratio of average local reachability density of q’s k-nearest neighbors and local reachability density of the data record q o LOF(q) = * - Breunig, et al, LOF: Identifying Density-Based Local Outliers, KDD 2000.

Advantages of Density based Techniques Local Outlier Factor (LOF) approach Example: p 2  p 1  In the NN approach, p 2 is not considered as outlier, while the LOF approach find both p 1 and p 2 as outliers NN approach may consider p 3 as outlier, but LOF approach does not  p 3 Distance from p 3 to nearest neighbor Distance from p 2 to nearest neighbor

Local Outlier Factor (LOF)* For each data point q compute the distance to the k-th nearest neighbor (k- distance) Compute reachability distance (reach-dist) for each data example q with respect to data example p as: reach-dist(q, p) = max{k-distance(p), d(q,p)} Compute local reachability density (lrd) of data example q as inverse of the average reachabaility distance based on the MinPts nearest neighbors of data example q lrd(q) = Compaute LOF(q) as ratio of average local reachability density of q’s k-nearest neighbors and local reachability density of the data record q LOF(q) = * - Breunig, et al, LOF: Identifying Density-Based Local Outliers, KDD 2000.