Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
Efficient clustering of high-dimensional data sets with application to reference matching
Authors: Andrew McCallum, Kamal Nigam and Lyle H. Ungar (ACM 2000)
Advisor: Dr. Hsu
Graduate: Chun-Kai Chen
N.Y.U.S.T. I. M.
Outline
─ Motivation
─ Objective
─ Introduction
─ Efficient Clustering with Canopies
─ Experimental Results
─ Conclusions
─ Personal Opinion
Motivation
Traditional clustering algorithms become computationally expensive when the data set to be clustered is large:
─ a large number of elements in the data set
─ many features
─ many clusters to discover
Objective
─ Introduce a technique for clustering that is efficient when the problem is large in all three of these ways at once
─ Show that using canopies for clustering can increase computational efficiency without losing clustering accuracy
Introduction
Divide the clustering process into two stages:
─ first, cheaply divide the data into overlapping subsets we call "canopies" (increases computational efficiency)
─ then complete the clustering by running a standard clustering algorithm, measuring exact distances only within canopies (reduces the number of distance computations)
Efficient Clustering with Canopies
The key idea of the canopy algorithm:
─ greatly reduce the number of distance computations required for clustering
─ by first cheaply partitioning the data into overlapping subsets
─ then only measuring distances among pairs of data points that belong to a common subset
The algorithm uses two different sources of information:
─ a cheap and approximate similarity measure
─ an expensive and accurate similarity measure
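As a concrete illustration of the two measures (a sketch only; the function names and the particular choice of token overlap and edit distance are assumptions for illustration, not taken from the paper), the cheap measure can be a shared-token count supported by an inverted index, while the expensive measure can be a character-level edit distance:

```python
from collections import defaultdict

def build_inverted_index(records):
    """Map token -> set of record ids; lets the cheap measure touch
    only records sharing at least one token with the query."""
    index = defaultdict(set)
    for rid, text in enumerate(records):
        for token in text.lower().split():
            index[token].add(rid)
    return index

def cheap_similarity(query, index):
    """Cheap, approximate measure: number of shared tokens,
    found via the inverted index (records with no shared token
    are never even visited)."""
    counts = defaultdict(int)
    for token in query.lower().split():
        for rid in index.get(token, ()):
            counts[rid] += 1
    return counts

def edit_distance(a, b):
    """Expensive, accurate measure: Levenshtein distance,
    O(len(a) * len(b)) per pair of strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

The inverted index makes the cheap measure near-free per query, while the quadratic edit distance is reserved for pairs that share a canopy.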
Divide the clustering process into two stages
First stage:
─ use the cheap distance measure to create some number of overlapping subsets, called "canopies"
Second stage:
─ execute some traditional clustering algorithm
─ using the accurate distance measure
─ but with the restriction that we do not calculate the distance between two points that never appear in the same canopy
Create canopies
Start with a list of the data points in any order, and two distance thresholds, T1 and T2, where T1 > T2. Then:
─ Pick a point off the list and approximately measure its distance to all other points. (This is extremely cheap with an inverted index.)
─ Put all points that are within distance threshold T1 into a canopy
─ Remove from the list all points that are within distance threshold T2
─ Repeat until the list is empty
Figure 1 shows some canopies that were created by this procedure
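The canopy-creation steps above can be sketched as follows (a minimal illustration; the function name and the list-based representation are choices of this sketch, not of the paper):

```python
def make_canopies(points, cheap_dist, t1, t2):
    """Create overlapping canopies using two thresholds t1 > t2.
    Points within t1 of a canopy center join the canopy; points
    within t2 are also removed from the candidate list, so only
    points in the (t2, t1) band can belong to several canopies."""
    assert t1 > t2
    remaining = list(points)   # candidate canopy centers, in any order
    canopies = []
    while remaining:
        center = remaining.pop(0)        # pick a point off the list
        canopy = [center]
        still_on_list = []
        for p in remaining:
            d = cheap_dist(center, p)    # cheap, approximate distance
            if d < t1:
                canopy.append(p)         # within T1: joins this canopy
            if d >= t2:
                still_on_list.append(p)  # outside T2: stays on the list
        remaining = still_on_list
        canopies.append(canopy)
    return canopies
```

On 1-D points with absolute difference as the cheap distance, `make_canopies([0, 1, 2, 10, 11, 20], lambda a, b: abs(a - b), 3, 1.5)` produces overlapping canopies: the point 2 lies in the (T2, T1) band around the first center, so it appears both in that canopy and as the center of its own.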
Example
Canopies with Greedy Agglomerative Clustering
─ GAC is used to group items together based on similarity
─ In a standard GAC implementation, we need to apply the distance function O(n²) times to calculate all pairwise distances between items
─ A canopies-based implementation of GAC can drastically reduce this required number of comparisons
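The canopy-restricted idea can be sketched as below (an illustrative single-link variant using union-find, assuming hashable points and a merge threshold; the paper's own GAC details may differ). The expensive distance is evaluated only on pairs that share a canopy:

```python
from itertools import combinations

def canopy_gac(points, canopies, dist, threshold):
    """Greedy agglomerative (single-link) clustering in which the
    accurate distance is computed only for pairs of points that
    appear together in at least one canopy."""
    # collect the allowed pairs from the canopies
    pairs = set()
    for canopy in canopies:
        for a, b in combinations(sorted(canopy), 2):
            pairs.add((a, b))
    # sort allowed pairs by the expensive distance, closest first
    edges = sorted(pairs, key=lambda ab: dist(*ab))

    # union-find over the points
    parent = {p: p for p in points}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # greedily merge the closest canopy-sharing pairs
    for a, b in edges:
        if dist(a, b) > threshold:
            break
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    clusters = {}
    for p in points:
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())
```

With n points split across canopies of size k, only O(k²) distances are computed per canopy instead of O(n²) overall, which is the source of the speedup reported on the following slides.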
Experimental Results
The error and time costs of different methods of clustering references
The accuracy of the clustering
Conclusions
─ Canopies provide a principled approach to efficient clustering
─ The canopy approach is widely applicable
─ The authors demonstrated the success of the canopies approach on a reference matching problem
Personal Opinion
─ The problem of clustering high-dimensional data sets will become more and more important