On Simultaneous Clustering and Cleaning over Dirty Data


Turn Waste into Wealth: On Simultaneous Clustering and Cleaning over Dirty Data
Shaoxu Song, Chunping Li, Xiaoquan Zhang
Tsinghua University

Motivation
- Dirty data commonly exist, often in a (very) large portion, e.g., GPS readings.
- Density-based clustering, such as DBSCAN, successfully identifies noise: non-noise points are grouped into clusters, while noise points are discarded.
KDD 2015

[Diagram: Mining finds the useless (dirty) data; Cleaning makes it valuable; each task guides the other.]

Mining + Repairing
[Diagram: constraint-based repairing uses external knowledge (constraints, rules) to repair the (dirty) data; the proposed approach instead discovers density from the (dirty) data itself to guide repairing.]

Discarding vs. Repairing
- Simply discarding a large number of dirty points (as noise) can greatly distort the clustering results.
- We propose to repair and utilize the noise points to support clustering.
- Basic idea: simultaneously repair the noise points w.r.t. the density of the data during the clustering process.

Density-based Cleaning
Both the clustering and repairing tasks benefit:
- Clustering: gains more support from the repaired noise points.
- Repairing: guided by density information already embedded in the data, rather than by manually specified knowledge.

Basics
DBSCAN: density-based identification of noise points.
- Distance threshold ε and density threshold η.
- ε-neighbors: two points are ε-neighbors if their distance is at most ε.
- Noise point: a point with fewer than η ε-neighbors that is also not an ε-neighbor of any core point (a point with at least η ε-neighbors).
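As a concrete illustration, the DBSCAN-style labelling described above can be sketched in a few lines of Python. This is a minimal brute-force version, not the paper's implementation; `eps` and `eta` correspond to ε and η, and a point's ε-neighbors are counted excluding the point itself.

```python
import math

def eps_neighbors(points, i, eps):
    """Indices j != i with Euclidean distance(points[i], points[j]) <= eps."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if j != i and math.hypot(xi - xj, yi - yj) <= eps]

def label_points(points, eps, eta):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN rules above."""
    n = len(points)
    nb = [eps_neighbors(points, i, eps) for i in range(n)]
    core = {i for i in range(n) if len(nb[i]) >= eta}   # at least eta eps-neighbors
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb[i]):             # eps-neighbor of a core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels
```

On a tight group of points plus a distant outlier, the group is labelled core while the outlier comes out as noise, matching the definition on the slide.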

Modification Repair [SIGMOD'05] [ICDT'09]
- A repair over a set of points P is a mapping λ : P → P.
- We denote by λ(pi) the location of point pi after repairing.
- The ε-neighbors of λ(pi) after repairing are Cλ(pi) = { pj ∈ P | δ( λ(pi) , λ(pj) ) ≤ ε }.

Repair Cost
Following the minimum-change principle in data cleaning:
- Intuition: systems and humans try to minimize mistakes in practice, so a repair close to the input is preferred.
- The repair cost ∆(λ) is defined as ∆(λ) = ∑i w( pi , λ(pi) ), where w( pi , λ(pi) ) is the cost of repairing point pi to the new location λ(pi), e.g., by counting the modified data points.
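With the counting cost mentioned on the slide, ∆(λ) is simply the number of points moved. The sketch below is illustrative rather than the paper's exact formulation; it also shows an alternative Euclidean-distance weight `w` as an assumption.

```python
import math

def repair_cost(points, repaired, count_modified=True):
    """Delta(lambda) = sum_i w(p_i, lambda(p_i)).

    With count_modified=True, w counts modified points (as on the slide);
    otherwise w is the Euclidean distance each point is moved (illustrative).
    """
    total = 0.0
    for p, q in zip(points, repaired):
        if count_modified:
            total += 0 if p == q else 1
        else:
            total += math.hypot(p[0] - q[0], p[1] - q[1])
    return total
```

For example, moving one of two points yields a counting cost of 1, regardless of how far it moved.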

Problem Statement
Given a set of data points P, a distance threshold ε, and a density threshold η, the Density-based Optimal Repairing and Clustering (DORC) problem is to find a repair λ : P → P such that
(1) the repair cost ∆(λ) is minimized, and
(2) each repaired λ(pi) is either a core point or a border point, i.e., |Cλ(pi)| ≥ η (core point), or |Cλ(pj)| ≥ η for some pj with δ(λ(pi),λ(pj)) ≤ ε (border point).
All the points are utilized; no noise remains.
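Constraint (2) can be checked mechanically for any candidate repair. The helper names below are hypothetical; following the set definition of Cλ(pi), the neighborhood here includes the point itself.

```python
import math

def c_lambda(repaired, i, eps):
    """C_lambda(p_i) = { p_j in P | dist(lambda(p_i), lambda(p_j)) <= eps };
    note the set contains p_i itself (distance 0)."""
    xi, yi = repaired[i]
    return [j for j, (xj, yj) in enumerate(repaired)
            if math.hypot(xi - xj, yi - yj) <= eps]

def is_valid_dorc_repair(repaired, eps, eta):
    """True iff every repaired point is a core point (|C_lambda(p_i)| >= eta)
    or a border point (within eps of some core point)."""
    n = len(repaired)
    nb = [c_lambda(repaired, i, eps) for i in range(n)]
    core = {i for i in range(n) if len(nb[i]) >= eta}
    return all(i in core or any(j in core for j in nb[i]) for i in range(n))
```

A repair that leaves an isolated outlier in place fails the check, while moving that outlier into a dense region makes every point core or border.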

Technique Concern
- Simply repairing noise points to their closest clusters is not sufficient, e.g., repairing all the noise points to C1 does not help in identifying the second cluster C2.
- Indeed, dirty points may themselves form new clusters after repairing (i.e., C2), which must be taken into account.

Problem Solving
- No additional parameters are introduced for DORC besides the density and distance requirements η and ε already needed for clustering.
- ILP formulation: efficient solvers can be applied.
- Quadratic-time approximation via LP relaxation.
- Trade-off between effectiveness and efficiency by locally grouping data points into partitions.

Experimental Results
The experiments answer the following questions:
- By utilizing dirty data, can the proposal form more accurate clusters?
- By simultaneously repairing and clustering, is the repairing accuracy in practice improved over existing data repairing approaches?
- How do the approaches scale?
Criteria:
- Clustering accuracy: purity and NMI.
- Repairing accuracy: root-mean-square error (RMS) between the ground truth and the repair results.
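For reference, the RMS repairing-accuracy measure can be computed as follows. This is a straightforward sketch under the assumption of 2D points with Euclidean per-point error; lower values mean repairs closer to the ground truth.

```python
def rms_error(truth, repaired):
    """Root-mean-square Euclidean distance between the ground-truth
    locations and the repaired locations (lower is better)."""
    n = len(truth)
    sq = sum((t[0] - r[0]) ** 2 + (t[1] - r[1]) ** 2
             for t, r in zip(truth, repaired))
    return (sq / n) ** 0.5
```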

Artificial Data Set
- Compared to existing methods without repairing: DBSCAN and OPTICS.
- The proposed DORC (ILP and quadratic-time approximation) shows higher clustering purity.

Real GPS Data
- Errors are naturally embedded, and manually labelled.
- Compared to the Median Filter (MF), a filtering technique for cleaning noisy, time-space correlated time series.
- DORC outperforms MF + DBSCAN.

Restaurant Data
- Tabular data with artificially injected noise, widely considered in conventional data cleaning.
- Compared to FD, a repairing approach under integrity constraints (functional dependencies), e.g., [name, address → city].

More Results
- Two labelled, publicly available benchmark data sets, Iris and Ecoli, from UCI.
- Clustering accuracy measured by normalized mutual information (NMI).
- Similar results are observed: DORC shows higher accuracy than DBSCAN and OPTICS.

Summary
- Plain density-based clustering can successfully identify noisy data, but does not clean it.
- Existing constraint-based repairing relies on external constraint knowledge, without utilizing the density information embedded inside the data.
- With this happy marriage of clustering and repairing, both the clustering and repairing accuracies are significantly improved.

References (data repairing)
[SIGMOD'05] P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD Conference, pages 143–154, 2005.
[TODS'05] J. Wijsen. Database repairing using updates. ACM Trans. Database Syst. (TODS), 30(3):722–768, 2005.
[PODS'08] W. Fan. Dependencies revisited for improving data quality. In PODS, pages 159–170, 2008.
[ICDT'09] S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53–62, 2009.

Thanks