K-Means Clustering: Accelerated Algorithms Using the Triangle Inequality

Alejandra Ornelas Barajas
Ottawa-Carleton Institute for Computer Science, School of Computer Science
Ottawa, Canada K1S 5B6

Overview
• Introduction
• The Triangle Inequality in k-means
• Maintaining Distance Bounds with the Triangle Inequality
• Accelerated Algorithms
  – Elkan’s Algorithm
  – Hamerly’s Algorithm
  – Drake’s Algorithm
• Conclusions
• References

1. Introduction
• The k-means clustering algorithm is a very popular tool for data analysis.
• The k-means optimization problem: separate n data points into k clusters while minimizing the squared distance between each data point and the center of the cluster it belongs to.
• Lloyd’s algorithm is the standard approach to this problem: it alternates between assigning each point to its nearest center and recomputing each center as the mean of its assigned points (see the sketch below).
• The total running time of Lloyd’s algorithm is O(wnkd) for w iterations, k centers, and n points in d dimensions.
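For concreteness, here is a minimal sketch of Lloyd’s algorithm in Python/NumPy; the function name, parameters, and random initialization are illustrative choices, not taken from the slides:

```python
import numpy as np

def lloyd_kmeans(X, k, max_iters=100, seed=0):
    """Minimal sketch of Lloyd's algorithm: O(w n k d) for w iterations."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    # Initialize centers by picking k distinct data points at random.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    assign = np.full(n, -1)
    for _ in range(max_iters):
        # Assignment step: n*k distance computations, each costing O(d).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # converged: no point changed its cluster
        assign = new_assign
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            pts = X[assign == j]
            if len(pts) > 0:
                centers[j] = pts.mean(axis=0)
    return centers, assign
```

The assignment step is where all nk distance computations happen; it is exactly this work that the bound-based algorithms below try to avoid.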

• Observation: Lloyd’s algorithm spends a large share of its processing time computing the distances between each of the k cluster centers and the n data points. Since points usually stay in the same clusters after the first few iterations, much of this work is unnecessary, making the naive implementation very inefficient. How can we make it more efficient? By using the triangle inequality.

2. The Triangle Inequality in k-means

[Figure: a data point x and two cluster centers c and c′.]
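Two standard pruning facts follow from the triangle inequality (Lemmas 1 and 2 in Elkan [2]), stated here in LaTeX:

```latex
% Lemma 1: if a center c' is far from x's assigned center c,
% then c' cannot be closer to x than c is.
\[
  d(c, c') \ge 2\, d(x, c) \;\Longrightarrow\; d(x, c') \ge d(x, c)
\]
% Lemma 2: a cheap lower bound on d(x, c') from already-known distances.
\[
  d(x, c') \ge \max\bigl(0,\; d(x, c) - d(c, c')\bigr)
\]
```

Lemma 1 lets a center be ruled out using only center-to-center distances; Lemma 2 yields per-center lower bounds that can be maintained cheaply across iterations.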

3. Maintaining Distance Bounds with the Triangle Inequality
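Following Elkan [2] and Hamerly [3], each point x keeps an upper bound u(x) on the distance to its assigned center a(x), and lower bounds l(x, c) on its distances to other centers. When a center c moves during an update step, the triangle inequality gives cheap refresh rules that keep the bounds valid without recomputing any point-to-center distance (a summary of the standard update rules; the notation is chosen here):

```latex
% After an update step in which each center c moves by
% \delta(c) = d(c_{\mathrm{old}}, c_{\mathrm{new}}):
\[
  u(x) \leftarrow u(x) + \delta\bigl(a(x)\bigr), \qquad
  l(x, c) \leftarrow \max\bigl(0,\; l(x, c) - \delta(c)\bigr)
\]
```

A center c can then be skipped for point x whenever u(x) ≤ l(x, c), since the true distance d(x, c) cannot beat the current assignment.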

4. Accelerated Algorithms: How do we accelerate Lloyd’s algorithm? By using upper and lower bounds.

4.1. Elkan’s Algorithm
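A minimal sketch of the per-point pruning logic in Elkan’s algorithm [2], assuming the bounds u and l and the center-to-center distances cc_dist are maintained as in Section 3 (the function and variable names are illustrative, not from the original slides):

```python
import numpy as np

def elkan_assign_point(x, centers, a, u, l, cc_dist):
    """One point's assignment step with Elkan-style pruning.

    a       -- index of x's currently assigned center
    u       -- upper bound on d(x, centers[a])
    l       -- l[j] is a lower bound on d(x, centers[j])
    cc_dist -- precomputed center-to-center distance matrix
    Returns updated (a, u, l).
    """
    k = len(centers)
    # Global test: if u is at most half the distance from a to its
    # nearest other center, no center can beat the current one (Lemma 1).
    s = 0.5 * min(cc_dist[a, j] for j in range(k) if j != a)
    if u <= s:
        return a, u, l
    tight = False  # does u hold the exact distance to centers[a]?
    for j in range(k):
        if j == a:
            continue
        # Per-center tests: the lower bound or Lemma 1 may rule out j.
        if u <= l[j] or u <= 0.5 * cc_dist[a, j]:
            continue
        if not tight:
            u = float(np.linalg.norm(x - centers[a]))  # tighten u
            l[a] = u
            tight = True
            if u <= l[j] or u <= 0.5 * cc_dist[a, j]:
                continue
        d_j = float(np.linalg.norm(x - centers[j]))  # unavoidable computation
        l[j] = d_j
        if d_j < u:
            a, u = j, d_j  # reassign x to the closer center j
    return a, u, l
```

The k lower bounds per point are what give the algorithm its pruning power, and also its memory cost, as noted in the conclusions.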

4.2. Hamerly’s Algorithm
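A corresponding sketch for Hamerly’s algorithm [3], which replaces the k lower bounds per point with a single bound l on the distance to the second-closest center (again, the names are illustrative):

```python
import numpy as np

def hamerly_assign_point(x, centers, a, u, l, s):
    """One point's assignment step with Hamerly-style pruning.

    u -- upper bound on d(x, centers[a])
    l -- lower bound on the distance to x's second-closest center
    s -- s[c] is half the distance from center c to its nearest other center
    Returns updated (a, u, l).
    """
    m = max(s[a], l)
    if u <= m:
        return a, u, l  # the whole inner loop is skipped
    u = float(np.linalg.norm(x - centers[a]))  # tighten u and retest
    if u <= m:
        return a, u, l
    # Both tests failed: fall back to a full scan over all k centers.
    dists = np.linalg.norm(centers - x, axis=1)
    order = np.argsort(dists)
    a = int(order[0])
    u = float(dists[order[0]])  # exact distance to the closest center
    l = float(dists[order[1]])  # exact distance to the second closest
    return a, u, l
```

Because a single comparison can skip the whole inner loop, the per-point bookkeeping is much lighter than Elkan’s, at the price of more full scans when the test fails.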

4.3. Drake’s Algorithm
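A simplified sketch of the search in Drake’s algorithm [1], which keeps b < k lower bounds per point, sorted so that the last one also covers every remaining center; the re-sorting of bounds and the adaptive choice of b are omitted here, and the names are illustrative:

```python
import numpy as np

def drake_assign_point(x, centers, a, u, near, lb, lb_far):
    """One point's assignment step with Drake-style sorted bounds.

    near   -- indices of x's b closest centers, ordered by bound value
    lb     -- lb[i] is a lower bound on d(x, centers[near[i]]), ascending
    lb_far -- a single lower bound valid for every center not in near
    Returns updated (a, u).
    """
    for i, j in enumerate(near):
        if u <= lb[i]:
            break  # ascending bounds: no later nearby center can beat u
        d_j = float(np.linalg.norm(x - centers[j]))
        lb[i] = d_j
        if d_j < u:
            a, u = int(j), d_j
    if u > lb_far:
        # Even the far-centers bound failed: fall back to a full scan.
        dists = np.linalg.norm(centers - x, axis=1)
        a = int(dists.argmin())
        u = float(dists[a])
    return a, u
```

Increasing b makes an early break more likely but adds the overhead of updating and sorting more bounds per point, which is exactly the trade-off the conclusions describe.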

5. Conclusions
• Elkan’s algorithm computes fewer distances than Hamerly’s, since it has more bounds with which to prune distance calculations, but it requires more memory to store those bounds.
• Hamerly’s algorithm uses less memory and spends less time checking and updating bounds. Keeping a single lower bound lets it avoid entering the innermost loop more often than Elkan’s algorithm. It also works better in low dimensions, because in high dimensions all centers tend to move a lot.
• For Drake’s algorithm, increasing b incurs more computational overhead to update the bound values and to sort each point’s closest centers by their bound, but it also makes it more likely that one of the bounds will prevent a search over all k centers.

6. References
[1] Jonathan Drake and Greg Hamerly. Accelerated k-means with adaptive distance bounds. The 5th NIPS Workshop on Optimization for Machine Learning (OPT2012), pages 2–5, 2012.
[2] Charles Elkan. Using the Triangle Inequality to Accelerate k-Means. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), pages 147–153, 2003.
[3] Greg Hamerly. Making k-means even faster. Proceedings of the 2010 SIAM International Conference on Data Mining, pages 130–140, 2010.
[4] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[5] Steven J. Phillips. Acceleration of K-Means and Related Clustering Algorithms. Volume 2409 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2002.
[6] Rui Xu and Donald C. Wunsch. Partitional Clustering.