Performance guarantees for hierarchical clustering. Sanjoy Dasgupta, University of California, San Diego. Philip Long, Genomics Institute of Singapore.

Hierarchical clustering. Recursive partitioning of a data set. [Figure: a dendrogram over the data points; cutting it at successive levels induces a 2-clustering, 3-clustering, 4-clustering, and so on.]

Popular form of data analysis. No need to specify the number of clusters. Can view data at many levels of granularity, all at the same time. Simple heuristics exist for constructing hierarchical clusterings.

Applications. Has long been used by biologists and social scientists. A standard part of the statistician's toolbox since the 60s or 70s. Recently: a common tool for analyzing gene expression data.

Performance guarantees. There are many simple greedy schemes for constructing hierarchical clusterings. But are the resulting clusterings any good, or are they essentially arbitrary?

One basic problem. In fact, the whole enterprise of hierarchical clustering could use more justification. For example:

An existence question. Must there always exist a hierarchical clustering which is close to optimal at every level of granularity, simultaneously? [That is, such that for all k, the induced k-clustering is close to the best k-clustering?]

What is the best k-clustering? The k-clustering problem. Input: data points in a metric space, and a number k. Output: a partition of the points into k clusters C_1, ..., C_k with centers μ_1, ..., μ_k. Goal: minimize the cost of the clustering.

Cost functions for clustering. Two cost functions are commonly used. Maximum radius (k-center): max { d(x, μ_i) : i = 1, ..., k, x in C_i }. Average radius (k-median): avg { d(x, μ_i) : i = 1, ..., k, x in C_i }. Both yield NP-hard optimization problems, but both have constant-factor approximation algorithms.
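To make the two cost functions concrete, here is a minimal sketch, not from the talk, of how the k-center and k-median costs of a given clustering could be computed; the function names and the toy example are mine, and `d` stands for whatever metric the data come with.

```python
def k_center_cost(clusters, centers, d):
    """Maximum radius: the largest distance from any point to its cluster center."""
    return max(d(x, centers[i]) for i, cluster in enumerate(clusters) for x in cluster)

def k_median_cost(clusters, centers, d):
    """Average radius: the mean distance from a point to its cluster center."""
    dists = [d(x, centers[i]) for i, cluster in enumerate(clusters) for x in cluster]
    return sum(dists) / len(dists)

# Tiny example: points on the real line, two clusters.
d = lambda x, y: abs(x - y)
clusters = [[0.0, 1.0, 2.0], [10.0, 11.0]]
centers = [1.0, 10.5]
print(k_center_cost(clusters, centers, d))   # 1.0
print(k_median_cost(clusters, centers, d))   # 0.6
```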

Maximum-radius cost function

Our main result. Adopt the maximum-radius cost function. Our algorithm returns a hierarchical clustering such that for every k, the induced k-clustering is guaranteed to be within a factor of eight of optimal.

Standard heuristics. The standard heuristics for hierarchical clustering are greedy and work bottom-up: single-linkage, average-linkage, complete-linkage. Their k-clusterings can be off by a factor of at least log_2 k (average- and complete-linkage) and at least k (single-linkage). Our algorithm is similar in efficiency and simplicity, but works top-down.

A heuristic for k-clustering [Hochbaum and Shmoys, 1985]. E.g., k = 4. [Figure: four centers chosen among the data points, with radius R.] This 4-clustering has cost R ≤ 2 · OPT_4.

Algorithm: step one. Number all points by farthest-first traversal. [Figure: points numbered 1 through 10, with the distances R_2, R_3, R_4, R_5, R_6 marked.] For all k, the k-clustering defined by centers {1, 2, ..., k} has radius R_{k+1} ≤ 2 · OPT_k. (Note: R_2 ≥ R_3 ≥ ... ≥ R_n.)
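A minimal sketch of farthest-first traversal, the numbering this step refers to; the function name and the choice to also return the R values are mine, and `d` is again an arbitrary metric.

```python
def farthest_first_traversal(points, d):
    """Order the points so that each new point is as far as possible from all
    previously chosen points.  Also returns R, where R[k-1] is the slide's R_k:
    the distance from the k-th chosen point to the earlier ones."""
    n = len(points)
    order = [0]                                   # start from an arbitrary point
    dist = [d(points[0], p) for p in points]      # distance to the chosen set so far
    R = [float("inf")]                            # R_1 is conventionally infinite
    for _ in range(1, n):
        i = max(range(n), key=lambda j: dist[j])  # farthest remaining point
        order.append(i)
        R.append(dist[i])                         # this is R_k for the k-th chosen point
        for j in range(n):                        # update distances to the chosen set
            dist[j] = min(dist[j], d(points[i], points[j]))
    return order, R
```

Taking the first k points of `order` as centers gives a k-clustering of radius R_{k+1}, which by the slide's claim is at most 2 · OPT_k.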

A possible hierarchical clustering. [Figure: points 1 through 10, each joined to an earlier point by an edge, with edge lengths R_2, ..., R_10.] Hierarchical clustering specified by a parent function: π(j) = closest point to j in {1, 2, ..., j-1}. Note: R_k = d(k, π(k)).
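In code, the parent function on this slide is simply the nearest earlier point in the traversal order; a short sketch (names are mine), assuming `points` has already been renumbered by farthest-first traversal:

```python
def parent(points, d):
    """pi(j) = index of the point closest to j among {0, ..., j-1}
    (0-based; the slide numbers points from 1)."""
    pi = {0: None}                                # the first point has no parent
    for j in range(1, len(points)):
        pi[j] = min(range(j), key=lambda i: d(points[j], points[i]))
    return pi

# As on the slide, d(points[j], points[pi[j]]) recovers R_{j+1}
# (the value R[j] from the farthest-first sketch above).
```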

Algorithm: step two. Divide the points into levels of granularity. Set R = R_2, and fix some β > 1. The j-th level has points {i : R/β^j ≥ R_i > R/β^{j+1}}. [Figure: points 1 through 10 grouped by level.]
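A small sketch of how a point's level could be computed directly from its R value under this slide's definition (the summary slide below indexes levels slightly differently, reserving level 0 for the first point); the function name is mine.

```python
import math

def granularity_level(R_i, R, beta):
    """The level j with R / beta**j >= R_i > R / beta**(j + 1)."""
    if R_i >= R:
        return 0
    return math.floor(math.log(R / R_i, beta))
```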

Algorithm: step two, cont'd. [Figure: points 1 through 10 arranged by level of granularity.] A different parent function: π*(j) = closest point to j at a lower level of granularity.

Algorithm: summary.
1. Number the points by farthest-first traversal; note the values R_i = d(i, {1, 2, ..., i-1}).
2. Choose R = α · R_2.
3. L(0) = {1}; for j > 0, L(j) = {i : R/β^{j-1} ≥ R_i > R/β^j}.
4. If point i is in L(j), set π*(i) = the closest point to i in L(0), ..., L(j-1).
Theorem: Fix α = 1, β = 2. If the data points lie in a metric space, then for all k simultaneously, the induced k-clustering is within a factor of eight of optimal.
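Putting the four steps together, here is a hedged end-to-end sketch of the top-down algorithm as summarized on this slide; it is self-contained, the function and variable names are mine, and `d` is whatever metric the data come with. With alpha = 1 and beta = 2 it follows the procedure the theorem refers to.

```python
import math

def hierarchical_clustering(points, d, alpha=1.0, beta=2.0):
    """Returns the farthest-first order, the R values, each point's granularity
    level, and the parent pointers pi_star that define the hierarchy."""
    n = len(points)

    # Step 1: farthest-first traversal; R[k-1] is the slide's R_k.
    order = [0]
    dist = [d(points[0], p) for p in points]
    R = [float("inf")]
    for _ in range(1, n):
        i = max(range(n), key=lambda j: dist[j])
        order.append(i)
        R.append(dist[i])
        for j in range(n):
            dist[j] = min(dist[j], d(points[i], points[j]))

    # Step 2: the base scale R = alpha * R_2.
    base = alpha * R[1]

    # Step 3: granularity levels.  L(0) = {first point};
    # L(j) = {i : base / beta**(j-1) >= R_i > base / beta**j} for j > 0.
    def level(k):                      # k indexes the traversal order, 0-based
        if k == 0:
            return 0
        return 1 + max(0, math.floor(math.log(base / R[k], beta)))
    levels = [level(k) for k in range(n)]

    # Step 4: pi_star(i) = closest point to i among all points at a
    # strictly coarser level, i.e. in L(0), ..., L(j-1).
    pi_star = {order[0]: None}
    for k in range(1, n):
        i = order[k]
        candidates = [order[m] for m in range(k) if levels[m] < levels[k]]
        pi_star[i] = min(candidates, key=lambda c: d(points[i], points[c]))
    return order, R, levels, pi_star
```

One way to read off the induced k-clustering: take the first k points of `order` as centers and assign every other point to the first of those k points reached by following its chain of `pi_star` pointers. The theorem says that for alpha = 1, beta = 2 this is within a factor of eight of the optimal maximum radius, for every k at once.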

Randomization trick. Choose the scale at random: pick γ from the distribution U[0,1] and set α = e^γ (so R = e^γ · R_2); set β = e. Then for all k, the induced k-clustering has expected cost at most 2e ≈ 5.44 times optimal. Thanks to Rajeev Motwani for suggesting this.

What does a constant-factor approximation mean? Prevent the worst.

Standard agglomerative heuristics.
1. Initially, each point is its own cluster.
2. Repeatedly merge the two "closest" clusters.
Need to define the distance between clusters:
Single-linkage: distance between the closest pair of points.
Average-linkage: distance between the centroids.
Complete-linkage: distance between the farthest pair of points.
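For concreteness, a small illustrative sketch (not from the talk) of the three inter-cluster distances together with the generic merge loop they all plug into; names are mine, and points are taken to be coordinate tuples so that centroids make sense.

```python
def single_linkage(A, B, d):
    """Distance between the closest pair of points, one from each cluster."""
    return min(d(a, b) for a in A for b in B)

def complete_linkage(A, B, d):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(d(a, b) for a in A for b in B)

def average_linkage(A, B, d):
    """Distance between the centroids, as defined on the slide."""
    centroid = lambda C: tuple(sum(coords) / len(C) for coords in zip(*C))
    return d(centroid(A), centroid(B))

def agglomerate(points, cluster_dist, d):
    """Start with singletons and repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]], d))
        merges.append((list(clusters[i]), list(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```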

Single-linkage clustering. Chaining effect. [Figure: n points spaced almost evenly along a line.] The k-clustering will have diameter about n − k, instead of n/k. Therefore: off by a factor of k.
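The exact coordinates in the slide's figure did not survive extraction, but the chaining effect is easy to reproduce; a hedged numerical illustration, assuming numpy and scipy are available, using n points on a line whose consecutive gaps grow very slightly so that single-linkage always extends one long chain.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

n, k = 100, 10
gaps = 1.0 + 1e-4 * np.arange(n - 1)          # strictly increasing gaps, all about 1
x = np.concatenate([[0.0], np.cumsum(gaps)]).reshape(-1, 1)

Z = linkage(x, method="single")               # single-linkage dendrogram
labels = fcluster(Z, t=k, criterion="maxclust")

diameter = max(x[labels == c].max() - x[labels == c].min() for c in np.unique(labels))
print(diameter)      # about n - k (roughly 90): one huge chain plus k - 1 singletons
print(n / k)         # an even split would have diameter about n / k (roughly 10)
```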

Average-linkage clustering. Points in d-dimensional space, d = log_2 k, under an ℓ_1 metric. The final radius should be 1; instead it is d. Therefore: off by a factor of log_2 k.

Complete-linkage clustering. One can similarly construct a bad case. Off by a factor of at least log_2 k.

Summary. There is a basic existence question about hierarchical clustering which needs to be addressed: must there always exist a hierarchical clustering in which, for each k, the induced k-clustering is close to optimal? It turns out the answer is yes.

Summary, cont'd. In fact, there is a simple, fast algorithm to construct such hierarchical clusterings. Meanwhile, the standard agglomerative heuristics do not always produce close-to-optimal clusterings.

Where next?
1. Reduce the approximation factor.
2. Other cost functions for clustering.
3. For average- and complete-linkage, is the log k lower bound also an upper bound?
4. Local improvement procedures for hierarchical clustering?