Reza Bosagh Zadeh (Carnegie Mellon) Shai Ben-David (Waterloo) UAI 09, Montréal, June 2009 A UNIQUENESS THEOREM FOR CLUSTERING
TALK OUTLINE Questions being addressed Introduce Axioms Introduce Properties Uniqueness Theorem for Single-Linkage Taxonomy of Partitioning Functions Bosagh Zadeh, Ben-David, UAI 2009
WHAT IS CLUSTERING? Given a collection of objects (characterized by feature vectors, or just a matrix of pairwise similarities), clustering detects the presence of distinct groups and assigns objects to groups.
SOME BASIC UNANSWERED QUESTIONS Are there principles governing all clustering paradigms? Which clustering paradigm should I use for a given task?
THERE ARE MANY CLUSTERING TASKS “Clustering” is an ill-defined problem. There are many different clustering tasks, leading to different clustering paradigms.
WE WOULD LIKE TO DISCUSS THE BROAD NOTION OF CLUSTERING Independently of any particular algorithm, particular objective function, or particular generative data model.
WHAT FOR? Choosing a suitable algorithm for a given task. Axioms: to capture intuition about clustering in general; expected to be satisfied by all clustering functions. Properties: to capture differences between different clustering paradigms.
TIMELINE Jardine and Sibson, 1971: considered only hierarchical functions. Kleinberg, 2003: presented an impossibility result. This paper: presents a uniqueness result for Single-Linkage.
THE BASIC SETTING For a finite domain set S, a distance function is a symmetric mapping d: S × S → R⁺ such that d(x, y) = 0 iff x = y. A partitioning function takes a distance (dissimilarity) function on S and returns a partition of S. We wish to define the axioms that distinguish clustering functions from any other functions that output domain partitions.
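The definition of a distance function above can be checked mechanically. The following is a minimal sketch, assuming the domain S is {0, ..., n-1} and d is stored as a square list of lists; the name `is_distance_function` is hypothetical. No triangle inequality is checked, because the slide does not require one.

```python
from itertools import combinations

def is_distance_function(d):
    """Check the slide's conditions on d, a square list of lists over S = {0..n-1}."""
    n = len(d)
    if any(len(row) != n for row in d):
        return False                      # not a mapping on S x S
    if any(d[i][i] != 0 for i in range(n)):
        return False                      # d(x, x) must be 0
    for i, j in combinations(range(n), 2):
        if d[i][j] != d[j][i]:
            return False                  # symmetry
        if d[i][j] <= 0:
            return False                  # d(x, y) = 0 only when x = y
    return True
```

For example, `[[0, 1], [1, 0]]` passes, while an asymmetric matrix or one with a zero off the diagonal does not.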
KLEINBERG’S AXIOMS Scale Invariance: F(λd) = F(d) for all d and all strictly positive λ. Richness: the range of F(d) over all d is the set of all possible partitionings. Consistency: if d’ equals d except for shrinking distances within clusters of F(d) or stretching between-cluster distances, then F(d) = F(d’). Inconsistent! No algorithm can satisfy all 3 of these.
CONSISTENT AXIOMS (fix k): Scale Invariance: F(λd, k) = F(d, k) for all d and all strictly positive λ. k-Richness: the range of F(d, k) over all d is the set of all possible k-partitionings. Consistency: if d’ equals d except for shrinking distances within clusters of F(d, k) or stretching between-cluster distances, then F(d, k) = F(d’, k). Consistent! (And satisfied by Single-Linkage, Min-Sum, …)
CLUSTERING FUNCTIONS Definition. Call any partitioning function which satisfies Scale Invariance, k-Richness, and Consistency a Clustering Function.
TWO CLUSTERING FUNCTIONS Single-Linkage (hierarchical): 1. Start with all points in their own cluster. 2. While there are more than k clusters, merge the two most similar clusters. Similarity between two clusters is the similarity of the most similar two points from differing clusters. Min-Sum k-Clustering (not hierarchical): find the k-partitioning Γ which minimizes the sum of within-cluster pairwise distances, Σ_{C ∈ Γ} Σ_{x,y ∈ C} d(x, y) (NP-hard to optimize). Both functions satisfy Scale Invariance, k-Richness, and Consistency. Proofs in paper.
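The Single-Linkage merge loop described on this slide can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the name `single_linkage`, the list-of-lists distance matrix, and the tie-breaking rule (first closest pair found) are all assumptions. "Most similar" is read as "smallest distance".

```python
def single_linkage(d, k):
    """Return k clusters of the points 0..n-1 as a set of frozensets."""
    n = len(d)
    clusters = [frozenset([i]) for i in range(n)]   # every point on its own
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Linkage = distance between the closest pair across the two clusters.
                link = min(d[x][y] for x in clusters[a] for y in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        merged = clusters[a] | clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return set(clusters)

# Four points on a line at positions 0, 1, 5, 6.
d = [[0, 1, 5, 6],
     [1, 0, 4, 5],
     [5, 4, 0, 1],
     [6, 5, 1, 0]]
print(single_linkage(d, 2))
```

With k = 2 the two tight pairs {0, 1} and {2, 3} are returned, and doubling every entry of `d` gives the same partition, which is what Scale Invariance asks for.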
CLUSTERING FUNCTIONS Single-Linkage and Min-Sum are both Clustering Functions. How can we distinguish between them in an axiomatic framework? Use Properties. Not all properties are desired in every clustering situation: pick and choose properties for your task.
PROPERTIES - ORDER-CONSISTENCY Order-Consistency: if two datasets d and d’ have the same ordering of the distances, then for all k, F(d, k) = F(d’, k). o In other words, the clustering function only cares about whether a pair of points is closer or further than another pair of points. o Satisfied by Single-Linkage, Max-Linkage, Average-Linkage… o NOT satisfied by most objective functions (Min-Sum, k-means, …)
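To make "the same ordering of the distances" concrete, here is a small sketch that tests the premise of Order-Consistency for two distance matrices over the same points. The function name `same_ordering` and the matrix representation are assumptions made for illustration.

```python
from itertools import combinations

def same_ordering(d1, d2):
    """True if every two pairs of points compare the same way under d1 and d2."""
    pairs = list(combinations(range(len(d1)), 2))
    for a, b in pairs:
        for c, e in pairs:
            # Sign of the comparison: -1, 0, or +1.
            s1 = (d1[a][b] > d1[c][e]) - (d1[a][b] < d1[c][e])
            s2 = (d2[a][b] > d2[c][e]) - (d2[a][b] < d2[c][e])
            if s1 != s2:
                return False
    return True

d = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
squared = [[v * v for v in row] for row in d]
print(same_ordering(d, squared))
```

Squaring positive distances is a monotone transformation, so `d` and `squared` have the same ordering; an order-consistent function must therefore return the same clustering on both for every k.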
PATH-DISTANCE P_d(x, y): we find the path from x to y which has the smallest longest jump in it; P_d(x, y) is the length of that jump. E.g. P_d = 2 for the pair of points in the example graph, since the path between them has a longest jump of distance 2. Undrawn edges are large. [Figure: example graph]
PATH-DISTANCE Imagine each point is an island, and we would like to go from island a to island b, as if we were trying to cross a river by jumping on rocks. Being human, we are restricted in how far we can jump from island to island. Path-Distance would have us find the path with the smallest longest jump, ensuring that we could complete all the jumps successfully.
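The "smallest longest jump" idea can be computed for all pairs at once. The sketch below uses a minimax variant of the Floyd-Warshall recurrence, which is a swapped-in standard technique and not something the slides prescribe; the name `path_distance` and the matrix representation are assumptions.

```python
def path_distance(d):
    """Return the matrix P_d of minimax path distances for distance matrix d."""
    n = len(d)
    p = [row[:] for row in d]           # start from the direct jumps
    for m in range(n):                  # allow island m as a stepping stone
        for i in range(n):
            for j in range(n):
                via = max(p[i][m], p[m][j])   # longest jump when routing via m
                if via < p[i][j]:
                    p[i][j] = via
    return p

# Islands 0 and 2 are far apart directly (10) but island 1 sits between them.
d = [[0, 2, 10],
     [2, 0, 2],
     [10, 2, 0]]
print(path_distance(d)[0][2])
```

In this example the direct jump from 0 to 2 has length 10, but going through island 1 needs no jump longer than 2, so the path distance between 0 and 2 is 2.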
PROPERTIES – PATH-DISTANCE COHERENCE Path-Distance Coherence: if two datasets d and d’ have the same induced path distance, then for all k, F(d, k) = F(d’, k).
UNIQUENESS THEOREM Theorem (this work): Single-Linkage is the only clustering function satisfying Order-Consistency and Path-Distance-Coherence. Is Path-Distance-Coherence doing all the work? No: Consistency is necessary for uniqueness, and k-Richness is necessary. “X is necessary” means: with all other axioms/properties satisfied and just X missing, there is still not enough to get uniqueness.
PRACTICAL CONSIDERATIONS Single-Linkage is not always the right function to use, because Path-Distance-Coherence is not always desirable. It’s not always immediately obvious when we want a function to focus on the path distance. We therefore introduce a different formulation involving Minimum Spanning Trees.
PROPERTIES - MST-COHERENCE MST-Coherence: if two datasets d and d’ have the same Minimum Spanning Tree, then for all k, F(d, k) = F(d’, k). [Figure: If/Then example applying F to two weighted graphs]
A TAXONOMY OF CLUSTERING FUNCTIONS Min-Sum satisfies neither MST-Coherence nor Order-Consistency. Future work: characterize other clustering functions.
THANKS FOR YOUR ATTENTION!
ASIDE: MINIMUM SPANNING TREES Spanning tree: a tree subgraph of the original graph which touches all nodes. The weight of a tree is the sum of all its edge weights. With spanning trees ordered by weight, we are interested in the Minimum Spanning Tree. [Figure: the Minimum Spanning Tree of the graph is shown in bold. Picture: Wikipedia]
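For readers who want to experiment with MST-Coherence, the following sketch builds a Minimum Spanning Tree of the complete graph defined by a distance matrix. It uses Kruskal's algorithm with a simple union-find, which is one standard choice and not something the slides specify; the name `minimum_spanning_tree` is hypothetical.

```python
def minimum_spanning_tree(d):
    """Return MST edges of the complete graph on 0..n-1 as (i, j, weight) tuples."""
    n = len(d)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Consider edges from lightest to heaviest.
    edges = sorted((d[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # edge joins two components: keep it
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

d = [[0, 1, 5, 6],
     [1, 0, 4, 5],
     [5, 4, 0, 1],
     [6, 5, 1, 0]]
print(minimum_spanning_tree(d))
```

A standard fact connects this to Single-Linkage: when all distances are distinct, removing the k-1 heaviest edges of the MST leaves exactly the Single-Linkage k-clusters as connected components. In the example, dropping the edge of weight 4 separates {0, 1} from {2, 3}.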
PROOF OUTLINE: CHARACTERIZATION OF SINGLE-LINKAGE 1. Start with arbitrary d, k. 2. By k-Richness, there exists a d_1 such that F(d_1, k) = SL(d, k). 3. Through a series of Consistent transformations, we can transform d_1 into d_6, which will have the same MST as d. 4. Invoke MST-Coherence to get F(d_1, k) = F(d_6, k) = F(d, k) = SL(d, k).
KLEINBERG’S IMPOSSIBILITY RESULT There exists no clustering function satisfying all 3 properties. Proof: [Figure with steps labeled “Scaling up” and “Consistency”]
AXIOMS AS A TOOL FOR A TAXONOMY OF CLUSTERING PARADIGMS The goal is to generate a variety of axioms (or properties) over a fixed framework, so that different clustering approaches could be classified by the different subsets of axioms they satisfy. [Table: columns, grouped as “Axioms” and “Properties”: Scale Invariance, k-Richness, Consistency, Separability, Order Invariance, Hierarchy. Rows: Single Linkage, Center Based, Spectral, MDL (+ + -), Rate Distortion (+ + -). Remaining cell entries were not recoverable.]
PROPERTIES Order-Consistency: the function only compares distances with one another, without using their absolute values. Minimum Spanning Tree Coherence: if two datasets d and d’ have the same Minimum Spanning Tree, then for all k, F(d, k) = F(d’, k); the function makes all its decisions using the Minimum Spanning Tree.
SOME MORE EXAMPLES
AXIOMS - SCALE INVARIANCE Scale Invariance: F(λd) = F(d) for all d and all strictly positive λ. [Figure: If/Then example in which the distances are doubled, e.g. 3 becomes 6]
AXIOMS - RICHNESS Richness: the range of F(d) over all d is the set of all possible partitionings. [Figure: by varying d, F can produce all partitionings of the points]
AXIOMS - CONSISTENCY Consistency: if d’ equals d except for shrinking distances within clusters of F(d) or stretching between-cluster distances, then F(d) = F(d’). [Figure: If/Then example]
PROPERTIES - ORDER-CONSISTENCY Order-Consistency: if two datasets d and d’ have the same ordering of the distances, then for all k, F(d, k) = F(d’, k). [Figure: If/Then example in which the edge ordering is maintained]