Clustering with k-means: faster, smarter, cheaper Charles Elkan University of California, San Diego April 24, 2004.


Clustering is difficult! Source: Patrick de Smet, University of Ghent

The standard k-means algorithm
Input: n points, a distance function d(), and the number k of clusters to find.
1. Start with k centers.
2. Compute d(each point x, each center c).
3. For each x, find the closest center c(x).  ["ALLOCATE"]
4. If no point has changed "owner" c(x), stop.
5. Set each center c to the mean of the points it owns.  ["LOCATE"]
6. Repeat from step 2.
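
A minimal sketch of these steps in Python with NumPy, assuming Euclidean distance and k random data points as the initial centers; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means: ALLOCATE each point to its nearest center, then LOCATE each
    center at the mean of the points it owns; stop when no point changes owner."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # 1. Start with k centers (here: k distinct data points chosen at random).
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    owner = np.full(len(points), -1)
    for _ in range(max_iters):
        # 2. Compute d(each point x, each center c) -- Euclidean distance here.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # 3. ALLOCATE: for each x, find the closest center c(x).
        new_owner = dists.argmin(axis=1)
        # 4. If no point has changed owner, stop.
        if np.array_equal(new_owner, owner):
            break
        owner = new_owner
        # 5. LOCATE: move each center to the mean of the points it owns.
        for j in range(k):
            members = points[owner == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, owner
```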

A typical k-means result

Observations
Theorem: If d() is Euclidean, then k-means converges monotonically to a local minimum of the within-class squared distortion Σ_x d(c(x), x)².
1. Many variants; a complex history since 1956, with over 100 papers per year currently.
2. Iterative; related to expectation-maximization (EM).
3. The number of iterations to converge grows slowly with n, k, and d.
4. No accepted method exists to discover k.
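
The distortion in the theorem is easy to compute directly; a small sketch (Euclidean d, reusing the centers and owner arrays from the k-means sketch above) returns the value that, by the theorem, never increases from one iteration to the next.

```python
import numpy as np

def distortion(points, centers, owner):
    """Within-class squared distortion: sum over all points x of d(c(x), x)^2."""
    diffs = np.asarray(points, dtype=float) - centers[owner]  # each point minus its assigned center
    return float(np.sum(diffs ** 2))
```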

We want to:
1. Make the algorithm faster.
2. Find lower-cost local minima. (Finding the global optimum is NP-hard.)
3. Choose the correct k intelligently.
With success at (1), we can try more alternatives for (2). With success at (2), comparisons for different k are less likely to be misleading.

Standard initialization methods
Forgy initialization: choose k data points at random as the starting center locations.
Random partitions: divide the data points randomly into k subsets and start from the subset means.
Both of these methods are bad.
E. Forgy. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21(3):768, 1965.
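
For reference, both initializations are only a few lines each; this is a hedged sketch (hypothetical helper names, a NumPy random generator passed in by the caller), and empty subsets are left unhandled.

```python
import numpy as np

def forgy_init(points, k, rng):
    """Forgy: choose k distinct data points at random as the starting centers."""
    return points[rng.choice(len(points), size=k, replace=False)].copy()

def random_partition_init(points, k, rng):
    """Random partitions: assign every point to one of k random subsets and
    start from the subset means (empty subsets not handled in this sketch)."""
    labels = rng.integers(0, k, size=len(points))
    return np.stack([points[labels == j].mean(axis=0) for j in range(k)])
```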

Forgy initialization

k-means result

Smarter initialization
The "furthest-first" algorithm (FF):
1. Pick the first center randomly.
2. The next center is the point furthest from the first center.
3. The third is the point furthest from both previous centers.
4. In general, the next center is argmax_x min_c d(x, c), over the centers c chosen so far.
D. Hochbaum and D. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2), 1985.
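
A sketch of FF as described above, keeping for every point its distance to the closest center chosen so far; the names are illustrative.

```python
import numpy as np

def furthest_first(points, k, rng):
    """Furthest-first (FF): each new center is argmax_x min_c d(x, c)
    over the centers c chosen so far."""
    points = np.asarray(points, dtype=float)
    centers = [points[rng.integers(len(points))]]            # 1. first center at random
    # Distance from every point to its closest chosen center so far.
    closest = np.linalg.norm(points - centers[0], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(closest))                        # furthest from all chosen centers
        centers.append(points[nxt])
        closest = np.minimum(closest, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(centers)
```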

Furthest-first initialization (FF)

Subset furthest-first (SFF)
FF finds outliers, which by definition are not good cluster centers!
Can we choose points that are far apart and typical of the dataset?
Idea: a random sample includes many representative points, but few outliers.
But: how big should the random sample be?
Lemma: Given k equal-size sets and c > 1, with high probability ck log k random points intersect each set.
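
A sketch of SFF building on the furthest_first sketch above, using a sample of about c·k·log k points as suggested by the lemma (c = 2 on the next slide); the exact sample-size formula and names are assumptions for illustration.

```python
import numpy as np

def subset_furthest_first(points, k, rng, c=2.0):
    """SFF: draw a random sample of roughly c * k * log(k) points, then run
    furthest-first on the sample only, so outliers are unlikely to be picked."""
    sample_size = min(len(points), int(np.ceil(c * k * np.log(max(k, 2)))))
    sample = points[rng.choice(len(points), size=sample_size, replace=False)]
    return furthest_first(sample, k, rng)   # furthest_first as sketched earlier
```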

Subset furthest-first (c = 2)

Comparing initialization methods
Table comparing the mean, standard deviation, best, and worst clustering cost for Forgy, furthest-first, and subset furthest-first initialization. Costs are relative to the best clustering known (218 means 218% worse); lower is better.

Goal: make k-means faster, but with the same answer
Allow any black-box d() and any initialization method.
1. In later iterations, there is little movement of the centers.
2. Distance calculations use most of the time.
3. Geometrically, these calculations are mostly redundant.
Source: D. Pelleg.

Lemma 1: Let x be a point, and let b and c be centers. If d(b,c) ≥ 2 d(x,b), then d(x,c) ≥ d(x,b).
Proof: By the triangle inequality, d(b,c) ≤ d(b,x) + d(x,c), so d(b,c) - d(x,b) ≤ d(x,c).
Now d(b,c) - d(x,b) ≥ 2 d(x,b) - d(x,b) = d(x,b).
So d(x,c) ≥ d(x,b).
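
This lemma is the pruning test behind the speed-up described here: if the point's current best center b satisfies d(b,c) ≥ 2 d(x,b) for another center c, then d(x,c) need not be computed at all. The fragment below shows only that test inside a nearest-center assignment; it is a sketch, not the full accelerated algorithm (which also maintains per-point upper and lower bounds).

```python
import numpy as np

def assign_with_pruning(points, centers):
    """Nearest-center assignment that uses Lemma 1 to skip provably useless distances."""
    points = np.asarray(points, dtype=float)
    center_dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    owner = np.empty(len(points), dtype=int)
    for i, x in enumerate(points):
        best, best_d = 0, np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            # Lemma 1: if d(best, c_j) >= 2 * d(x, best), then d(x, c_j) >= d(x, best).
            if center_dists[best, j] >= 2.0 * best_d:
                continue                      # skip: c_j cannot beat the current best
            d = np.linalg.norm(x - centers[j])
            if d < best_d:
                best, best_d = j, d
        owner[i] = best
    return owner
```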

Beyond the triangle inequality
Geometrically, the triangle inequality is a coarse screen. Consider a 2-d case with all points on a line: the data point is at (-4,0), center1 is at (0,0), and center2 is at (4,0). Here d(x, center1) = 4 and d(center1, center2) = 4, so the test d(center1, center2) ≥ 2 d(x, center1) fails and the triangle inequality is ineffective, even though the actual safety margin is huge: center2 is 8 away, twice as far as center1.

Remembering the margins
Idea: when the triangle-inequality test fails, cache the margin and refer to it in subsequent iterations.
Benefit: a further 1.5X to 30X reduction over Elkan's already excellent result.
Conclusion: if distance calculations weren't much of a problem before, they really aren't a problem now. So what is the new bottleneck?

Memory bandwidth is the current bottleneck
The main loop reduces to:
1) fetch the previous margin
2) update it per the most recent centroid movement
3) compare against the current best and swap if needed
Most compares are favorable (no change results). The memory bandwidth requirement is one read and one write per cell of an N x K margin array, reducible to one read if we store the margins as deltas.
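
A hedged reconstruction of that loop, assuming margin[i, j] caches a lower bound on d(x_i, c_j) - d(x_i, c_owner(i)) from the previous iteration; the data layout and names are guesses for illustration, not the speaker's code.

```python
import numpy as np

def update_margins_and_reassign(points, centers, old_centers, owner, margin):
    """One pass of the margin-cached loop: loosen each cached margin by the centroid
    movement, and recompute distances only where the margin no longer proves the
    current owner is still closest."""
    move = np.linalg.norm(centers - old_centers, axis=1)   # how far each center moved
    n, k = margin.shape
    for i in range(n):
        b = owner[i]
        for j in range(k):
            if j == b:
                continue
            # 1) fetch the previous margin, 2) loosen it by the recent centroid movement:
            # d(x, c_j) can shrink by at most move[j]; d(x, c_b) can grow by at most move[b].
            margin[i, j] -= move[j] + move[b]
            if margin[i, j] > 0.0:
                continue      # favorable compare: c_j still provably not closer than c_b
            # 3) otherwise recompute both distances, refresh the margin, swap if needed.
            d_b = np.linalg.norm(points[i] - centers[b])
            d_j = np.linalg.norm(points[i] - centers[j])
            margin[i, j] = d_j - d_b
            if d_j < d_b:
                owner[i] = b = j
                # (a full implementation would also refresh point i's other cached
                #  margins, since they were computed against the previous owner)
    return owner, margin
```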

Going parallel – data partitioning
Use a PRNG (pseudo-random) distribution of the points across partitions; this avoids hotspots "statically". Besides having N more compare loops running, we may also get a super-scalar benefit if our working set moves from main memory into L2, or from L2 into L1.
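
A sketch of the partitioning idea: shuffle the points with a PRNG, split them into equal chunks, and run the compare loop on each chunk in parallel. The Python process pool and helper names below are purely illustrative of the idea, not the talk's actual implementation.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _nearest_center(chunk, centers):
    """Per-worker compare loop: nearest center for each point in the chunk."""
    dists = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def parallel_assign(points, centers, n_workers=4, seed=0):
    """Assign points to their nearest centers in parallel, after a pseudo-random
    shuffle so each worker gets a statistically similar chunk (no hotspots)."""
    perm = np.random.default_rng(seed).permutation(len(points))   # PRNG partitioning
    chunks = np.array_split(points[perm], n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(_nearest_center, chunks, [centers] * n_workers))
    owner = np.empty(len(points), dtype=int)
    owner[perm] = np.concatenate(parts)       # undo the shuffle
    return owner
```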

Misdirections – trying to exploit the sparsity
- Full sorting
- Waterfall priority queues