John Oliver from “The Daily Show”, supporting worthy causes at the G20 Pittsburgh Summit: “Bayesians Against Discrimination”, “Ban Genetic Algorithms”, “Support Vector Machines”. (Picture: Arthur Gretton.) Watch out for the protests tonight on The Daily Show!

TOWARDS A PRINCIPLED THEORY OF CLUSTERING
Reza Bosagh Zadeh (joint with Shai Ben-David)
CMU Machine Learning Lunch, September 2009

WHAT IS CLUSTERING?
Given a collection of objects (characterized by feature vectors, or just a matrix of pairwise similarities), detect the presence of distinct groups and assign objects to groups.

THERE ARE MANY CLUSTERING TASKS
“Clustering” is an ill-defined problem: there are many different clustering tasks, leading to different clustering paradigms.

TALK OUTLINE
Questions being addressed
Introduce Axioms & Properties
Characterization of Single-Linkage and Max-Sum
Taxonomy of Partitioning Functions

SOME BASIC UNANSWERED QUESTIONS
Are there principles governing all clustering paradigms?
Which clustering paradigm should I use for a given task?

WE WOULD LIKE TO DISCUSS THE BROAD NOTION OF CLUSTERING
Independently of any particular algorithm, objective function, or generative data model.

WHAT FOR?
Choosing a suitable algorithm for a given task.
Axioms: capture intuition about clustering in general; expected to be satisfied by all clustering paradigms.
Properties: capture differences between clustering paradigms.

TIMELINE – AXIOMATIC APPROACH
Jardine & Sibson 1971: considered only hierarchical functions.
Kleinberg 2003: presented an impossibility result.
Ackerman & Ben-David 2008: formalization of clustering-quality measures.
These are only the axiomatic approaches; there are other ways of building a principled theory for clustering, e.g. Balcan, Blum, Vempala (STOC 2008).

THE BASIC SETTING
For a finite domain set A, a similarity function s(x, y) is a symmetric mapping to a similarity score s(x, y) > 0, with s(x, y) → ∞ iff x = y.
A partitioning function takes a similarity function on A and returns a partition of A.
We wish to define axioms that distinguish clustering functions from other partitioning functions.
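For concreteness, this setting can be rendered as a few Python type aliases (a sketch of our own; none of these names come from the talk):

```python
from typing import Callable, FrozenSet, Set

# A similarity function on a finite domain {0, ..., n-1}: symmetric and
# positive, maximal (conceptually infinite) exactly when x == y.
Similarity = Callable[[int, int], float]

# A partition of the domain, written as a set of disjoint blocks.
Partition = Set[FrozenSet[int]]

# A partitioning function maps a similarity function (and, once k is
# fixed later in the talk, the number of clusters) to a partition.
PartitioningFunction = Callable[[Similarity, int], Partition]
```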

KLEINBERG’S AXIOMS (NIPS 2001)
Scale Invariance: F(λs) = F(s) for all s and all strictly positive λ.
Richness: the range of F(s) over all s is the set of all possible partitionings.
Consistency: if s’ equals s except for increasing similarities within clusters of F(s) or decreasing between-cluster similarities, then F(s) = F(s’).

These axioms are inconsistent! No algorithm can satisfy all 3 of them. (The proof, pictured on the original slides, scales a clustering up and then applies Consistency; see the impossibility slide below.)

CONSISTENT AXIOMS (UAI 2009)
Fix k.
Scale Invariance: F(λs, k) = F(s, k) for all s and all strictly positive λ.
k-Richness: the range of F(s, k) over all s is the set of all possible k-partitionings.
Consistency: if s’ equals s except for increasing similarities within clusters of F(s, k) or decreasing between-cluster similarities, then F(s, k) = F(s’, k).
These axioms are consistent! (And satisfied by Single-Linkage, Max-Sum, …)

CLUSTERING FUNCTIONS
Definition. Call any partitioning function which satisfies Scale Invariance, k-Richness, and Consistency a Clustering Function.

TWO CLUSTERING FUNCTIONS
Single-Linkage (hierarchical):
1. Start with all points in their own cluster.
2. While there are more than k clusters, merge the two most similar clusters, where the similarity between two clusters is the similarity of the most similar pair of points from differing clusters.
Max-Sum k-Clustering (not hierarchical): find the k-partitioning Γ which maximizes the within-cluster similarity Σ_{c ∈ Γ} Σ_{x,y ∈ c} s(x, y). (NP-hard to optimize.)
Both functions satisfy Scale Invariance, k-Richness, and Consistency; proofs in the paper.
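To make the two definitions concrete, here is a minimal Python sketch of both functions (our own illustration, operating on a numpy similarity matrix; `max_sum` is brute force, since the objective is NP-hard):

```python
from itertools import product
import numpy as np

def single_linkage(S, k):
    """Single-Linkage on a similarity matrix S (larger = more similar)."""
    n = S.shape[0]
    clusters = [{i} for i in range(n)]           # every point starts alone
    while len(clusters) > k:
        # Find the two clusters joined by the most similar cross-cluster pair.
        _, a, b = max(
            (max(S[x, y] for x in clusters[i] for y in clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[a] |= clusters.pop(b)           # merge (b > a, so a is stable)
    return clusters

def max_sum(S, k):
    """Brute-force Max-Sum k-clustering; only feasible for tiny n."""
    n = S.shape[0]
    best, best_labels = -np.inf, None
    for labels in product(range(k), repeat=n):
        if len(set(labels)) != k:                # insist on exactly k clusters
            continue
        score = sum(S[i, j] for i in range(n) for j in range(i + 1, n)
                    if labels[i] == labels[j])   # within-cluster similarity
        if score > best:
            best, best_labels = score, labels
    return best_labels
```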

CLUSTERING FUNCTIONS
Single-Linkage and Max-Sum are both clustering functions. How do we distinguish between them in an axiomatic framework? Use properties.
Not all properties are desired in every clustering situation: pick and choose properties for your task.

PROPERTIES - ORDER-CONSISTENCY
Order-Consistency: if two datasets s and s’ have the same ordering of similarity scores, then for all k, F(s, k) = F(s’, k).
In other words, the clustering function only cares about whether one pair of points is more or less similar than another pair: only relative similarity matters.
Satisfied by Single-Linkage, Max-Linkage, Average-Linkage, …
NOT satisfied by most objective functions (Max-Sum, k-means, …).
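As a quick empirical check of this property (a sketch reusing `single_linkage` from the earlier block; the random matrix is toy data of our own), a monotone transformation such as s ↦ s³ preserves the ordering of similarity scores, so the output should not change:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
S = rng.random((n, n))
S = (S + S.T) / 2            # symmetrize the toy similarities

# Cubing is monotone on positive scores, so the edge ordering is unchanged.
assert single_linkage(S, 3) == single_linkage(S ** 3, 3)

# No such guarantee exists for max_sum: its output can change under
# order-preserving transformations, which is why it fails Order-Consistency.
```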

PATH-SIMILARITY
We find the path from x to y which has the largest bottleneck: P_s(x, y) is the maximum, over paths p from x to y, of the smallest similarity along p.
E.g., for the two points pictured, P_s = 3, since the path through the bottom of the graph has a bottleneck of 3. (Undrawn edges are small.)

PATH-SIMILARITY
Imagine each point is an island out in the ocean, with bridges that have some weight restriction, and we would like to go from island to island. Having some mass, we are restricted in which bridges we can take. Path-Similarity would have us find the path with the largest bottleneck, ensuring that we can complete all the crossings successfully, or fail if there is no path with a large enough bottleneck.
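A small Python sketch of path-similarity (our own illustration): a Dijkstra-style widest-path search that always expands the node with the strongest bottleneck found so far.

```python
import heapq
import numpy as np

def path_similarity(S, x, y):
    """Max-bottleneck path similarity between x and y on similarity matrix S."""
    n = S.shape[0]
    best = np.full(n, -np.inf)        # best bottleneck found to each node
    best[x] = np.inf                  # the empty path imposes no bottleneck
    heap = [(-best[x], x)]            # max-heap via negated keys
    while heap:
        neg_b, u = heapq.heappop(heap)
        if u == y:
            return -neg_b
        if -neg_b < best[u]:
            continue                  # stale heap entry
        for v in range(n):
            if v == u:
                continue
            b = min(best[u], S[u, v])     # bottleneck if we extend through u
            if b > best[v]:
                best[v] = b
                heapq.heappush(heap, (-b, v))
    return best[y]
```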

PROPERTIES - PATH-SIMILARITY COHERENCE
Path-Similarity Coherence: if two datasets s and s’ have the same induced path-similarity edge ordering, then for all k, F(s, k) = F(s’, k).

UNIQUENESS THEOREM: SINGLE-LINKAGE
Theorem (Bosagh Zadeh 2009). Single-Linkage is the only clustering function satisfying Path-Similarity Coherence.

UNIQUENESS THEOREM: SINGLE-LINKAGE (CONTINUED)
Is Path-Similarity Coherence doing all the work? No:
Consistency is necessary for uniqueness.
k-Richness is necessary for uniqueness.
(“X is necessary”: with all the other axioms/properties satisfied and just X missing, uniqueness no longer holds.)

UNIQUENESS THEOREM: MAX-SUM
Time to characterize another clustering function, using a different property in lieu of path-similarity. It turns out that generalizing Path-Similarity does the trick.

GENERALIZED PATH SIMILARITY
Claims: [the definition and claims appeared as formulas on the slide and are not recoverable from this transcript]

UNIQUENESS THEOREMS
Theorem. Single-Linkage is the clustering function satisfying the corresponding generalized-path-similarity coherence property.
Theorem. Max-Sum is the clustering function satisfying the corresponding generalized-path-similarity coherence property, for two-class clustering (k = 2) only.

TWO CLUSTERING FUNCTIONS (REVISITED)
We can use the uniqueness theorems as alternate definitions of Single-Linkage and Max-Sum, replacing the two original definitions, which on the surface seem unrelated.

PRACTICAL CONSIDERATIONS
Single-Linkage and Max-Sum are not always the right functions to use, because Generalized Path-Similarity is not always desirable, and it is not always immediately obvious when we want a function to focus on it. So we introduce a different formulation, involving tree constructions.

ASIDE: MINIMUM CUT TREES
(Pictured: a graph G on 6 nodes and its Min-Cut tree; picture from the Encyclopedia of Algorithms.)
The Min-Cut tree can be computed in at most n-1 Min-Cut/Max-Flow computations!
The weight of the minimum cut between nodes x and y is the weight of the smallest edge on the unique x-y path in the tree; cutting that edge gives the two sides of the cut in the original graph.
Nodes in the Min-Cut tree correspond to nodes in G, but the edges do not.
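For experimenting with this construction, here is a hedged sketch using networkx’s Gomory-Hu (min-cut) tree routine (assumes networkx 2.x; the toy capacities are invented for illustration):

```python
import networkx as nx

# Toy graph: 6 nodes with made-up capacities standing in for similarities.
G = nx.complete_graph(6)
for u, v in G.edges:
    G[u][v]["capacity"] = 1 + abs(u - v)

T = nx.gomory_hu_tree(G)      # built from n-1 max-flow computations

# Min-cut weight between x and y = smallest edge on the unique x-y tree path.
x, y = 0, 5
path = nx.shortest_path(T, x, y)
print(min(T[u][v]["weight"] for u, v in zip(path, path[1:])))
```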

ASIDE: MAXIMUM SPANNING TREES
Spanning tree: a tree subgraph of the original graph which touches all nodes; the weight of a tree is the sum of its edge weights. Ordering spanning trees by weight, we are interested in the Maximum Spanning Tree.
(Picture: Wikipedia; the bold edges form a minimum spanning tree of the pictured graph.)
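A minimal Kruskal-style sketch for computing a maximum spanning tree from a similarity matrix (our own illustration, not code from the talk):

```python
import numpy as np

def maximum_spanning_tree(S):
    """Kruskal's algorithm run on edges sorted by descending similarity."""
    n = S.shape[0]
    parent = list(range(n))

    def find(u):                      # union-find with path halving
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    edges = sorted(((S[u, v], u, v)
                    for u in range(n) for v in range(u + 1, n)), reverse=True)
    tree = []
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                  # edge joins two components: keep it
            parent[ru] = rv
            tree.append((u, v, w))
    return tree
```

Incidentally, stopping this loop once only k components remain yields exactly the Single-Linkage k-clustering, which is one way to see why Single-Linkage depends only on the maximum spanning tree’s edge ordering (the MST-Coherence property on the next slide).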

PROPERTIES - MST-COHERENCE
MST-Coherence: if two datasets s and s’ have the same Maximum-Spanning-Tree edge ordering, then for all k, F(s, k) = F(s’, k).
MCT-Coherence: if two datasets s and s’ have the same Minimum-Cut-Tree edge ordering, then for all k, F(s, k) = F(s’, k).

The uniqueness theorems apply in the same way to the tree constructions: MST-Coherence characterizes Single-Linkage, and MCT-Coherence characterizes Max-Sum.

A TAXONOMY OF CLUSTERING FUNCTIONS
Min-Sum satisfies neither MST-Coherence nor Order-Consistency.
Future work: characterize other clustering functions.

TAKEAWAY LESSONS
The impossibility result wasn’t too bad: we can go a long way by fixing k.
Uniqueness theorems can help you decide when to use a function.
An axiomatic approach can bring out underlying motivating principles, which in the case of Max-Sum and Single-Linkage are very similar.

CLUSTERING THEORY WORKSHOP
The axiomatic approach is only one approach; there are others. Come hear about them at our workshop: ClusteringTheory.org, a NIPS 2009 workshop. Deadline: 30th October 2009.

THANKS FOR YOUR ATTENTION!

ASIDE: MINIMUM SPANNING TREES
Spanning tree: a tree subgraph of the original graph which touches all nodes; the weight of a tree is the sum of its edge weights. Ordering spanning trees by weight, we are interested in the Minimum Spanning Tree.
(Picture: Wikipedia; the bold edges form the minimum spanning tree of the graph.)

PROOF OUTLINE: CHARACTERIZATION OF SINGLE-LINKAGE
1. Start with arbitrary d, k.
2. By k-Richness, there exists a d_1 such that F(d_1, k) = SL(d, k).
3. Through a series of Consistent transformations, transform d_1 into d_6, which has the same MST as d.
4. Invoke MST-Coherence to get F(d_1, k) = F(d_6, k) = F(d, k) = SL(d, k).

KLEINBERG’S IMPOSSIBILITY RESULT
There exists no clustering function satisfying all 3 properties.
Proof idea: scaling up plus Consistency. (Together, Scale Invariance and Consistency force the range of F to be an antichain under refinement, while Richness requires the range to contain partitionings that refine one another: contradiction.)

AXIOMS AS A TOOL FOR A TAXONOMY OF CLUSTERING PARADIGMS
The goal is to generate a variety of axioms (or properties) over a fixed framework, so that different clustering approaches can be classified by the different subsets of axioms they satisfy.
(The slide’s table scored Single Linkage, Center Based, Spectral, MDL, and Rate Distortion against Scale Invariance, k-Richness, and Consistency (“axioms”) and Separability, Order Invariance, and Hierarchy (“properties”); most of the +/- entries are not recoverable from this transcript.)

PROPERTIES
Order-Consistency: the function only compares distances to one another, never using their absolute values.
Minimum-Spanning-Tree Coherence: if two datasets d and d’ have the same Minimum Spanning Tree, then for all k, F(d, k) = F(d’, k); the function makes all its decisions using the Minimum Spanning Tree.

SOME MORE EXAMPLES

AXIOMS - SCALE INVARIANCE
Scale Invariance: F(λd) = F(d) for all d and all strictly positive λ.
(Pictured: doubling the distances, e.g. 3 becomes 6, leaves the clustering unchanged.)

AXIOMS - RICHNESS
Richness: the range of F(d) over all d is the set of all possible partitionings.
(Pictured: by varying d we can obtain every partitioning of the points.)

AXIOMS - CONSISTENCY
Consistency: if d’ equals d except for shrinking distances within clusters of F(d) or stretching between-cluster distances, then F(d) = F(d’).
(Pictured: shrinking within-cluster distances and stretching between-cluster distances leaves the clustering unchanged.)

PROPERTIES - ORDER-CONSISTENCY
Order-Consistency: if two datasets d and d’ have the same ordering of the distances, then for all k, F(d, k) = F(d’, k).
(Pictured: the distances change, but the edge ordering is maintained, so the clustering is unchanged.)