8. Estimating the cluster tree of a density from the MST by Runt Pruning

Problem: the 1-NN density estimate is very noisy. It has a singularity at each observation
  => cluster tree would have n leaves

Idea: control the size of the cluster tree with a runt size threshold
  A split of a connected component of L(c, p*) is considered “significant” if both daughter components are larger than the runt size threshold.

Sketch of algorithm:
  Repeat {
    Break longest edge of MST
  } Until min (size of left subtree, size of right subtree) > runt size threshold
  If … apply recursively to subtrees

Runt analysis

Define the runt size (J. H.) of an MST edge e:
  Break all MST edges that are longer than e
  runt_size (e) = min (#obs in left subtree, #obs in right subtree)

Algorithm:

  compute_cluster_tree (mst, runt_size_threshold) {
    node = new_cluster_tree_node;
    node.leftson = node.rightson = NULL;
    node.obs = leaves (mst);
    cut_edge = longest_edge_with_large_runt_size (mst, runt_size_threshold);
    if (cut_edge) {
      node.leftson = compute_cluster_tree (left_subtree(mst, cut_edge), runt_size_threshold);
      node.rightson = compute_cluster_tree (right_subtree(mst, cut_edge), runt_size_threshold);
    }
    return(node);
  }

Figure labels: rs = 1, rs = 5, rs = 2
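The runt size definition and the recursive algorithm above can be sketched in Python as follows. This is a minimal sketch, not the author's implementation. It assumes the MST is given as a list of `(length, u, v)` tuples with distinct lengths, and that "large runt size" means strictly greater than the threshold, as in the earlier algorithm sketch. The helper names `runt_sizes`, `split`, and `ClusterNode` are hypothetical. Runt sizes are recomputed inside each subtree for simplicity, which costs more than a single pass.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

Edge = Tuple[float, int, int]  # (length, u, v); u and v are observation indices


def runt_sizes(edges: List[Edge]) -> Dict[Edge, int]:
    """Runt size of every tree edge, via union-find over edges sorted by length."""
    parent: Dict[int, int] = {}
    size: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        size.setdefault(x, 1)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    result: Dict[Edge, int] = {}
    for edge in sorted(edges):
        _, u, v = edge
        ru, rv = find(u), find(v)
        # Only shorter edges have been merged so far, so these two groups are
        # the subtrees on either side of `edge` once all longer edges are broken.
        result[edge] = min(size[ru], size[rv])
        if size[ru] < size[rv]:
            ru, rv = rv, ru
        parent[rv] = ru
        size[ru] += size[rv]
    return result


def split(vertices: Iterable[int], edges: List[Edge], cut: Edge):
    """Remove `cut` from the tree and return (vertices, edges) for each side."""
    vertices = set(vertices)
    adj = {v: [] for v in vertices}
    for e in edges:
        if e != cut:
            adj[e[1]].append(e[2])
            adj[e[2]].append(e[1])
    left, stack = {cut[1]}, [cut[1]]
    while stack:  # depth-first search from one endpoint of the cut edge
        for w in adj[stack.pop()]:
            if w not in left:
                left.add(w)
                stack.append(w)
    right = vertices - left
    left_edges = [e for e in edges if e != cut and e[1] in left]
    right_edges = [e for e in edges if e != cut and e[1] in right]
    return (left, left_edges), (right, right_edges)


@dataclass
class ClusterNode:
    obs: List[int]
    leftson: Optional["ClusterNode"] = None
    rightson: Optional["ClusterNode"] = None


def compute_cluster_tree(vertices, edges, runt_size_threshold) -> ClusterNode:
    node = ClusterNode(obs=sorted(vertices))
    rs = runt_sizes(edges)
    candidates = [e for e in edges if rs[e] > runt_size_threshold]
    if candidates:
        cut_edge = max(candidates)  # longest edge with large runt size
        (lv, le), (rv, re_) = split(vertices, edges, cut_edge)
        node.leftson = compute_cluster_tree(lv, le, runt_size_threshold)
        node.rightson = compute_cluster_tree(rv, re_, runt_size_threshold)
    return node


# Toy tree: two groups of three observations plus one straggler (index 6).
toy_edges = [(1.0, 0, 1), (1.1, 1, 2), (7.9, 2, 3),
             (1.2, 3, 4), (1.3, 4, 5), (17.5, 5, 6)]
toy_tree = compute_cluster_tree(range(7), toy_edges, runt_size_threshold=2)
print(toy_tree.leftson.obs, toy_tree.rightson.obs)  # [0, 1, 2] [3, 4, 5, 6]
```

In this toy example the longest edge (length 17.5) has runt size 1, so it is never cut and the straggler stays with its nearest group. The edge of length 7.9 has runt size 3 and produces the only split.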

Heuristic justification

MST edges with large runt size indicate the presence of multiple modes.

Recall the multi-fragment algorithm for MST construction:
  Define the distance d (G1, G2) between groups as the minimum distance between an observation in G1 and an observation in G2
  Initialize each obs to form its own group
  Repeat {
    Find closest groups
    Add shortest edge connecting them
    Merge closest groups
  } Until only one group remains

What will happen?
  Fragments will start and grow in high density regions, where distances are small
  Eventually, those fragments will be joined by edges
  Those edges will have large runt size
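The fragment-growing argument above can be illustrated with a small Python sketch. It runs the merge loop by scanning all pairwise distances in increasing order and skipping pairs already in the same group, which is one way to find the closest groups at each step. It is a brute-force O(n^2 log n) illustration, not the spatial-hashing construction. The function name `mst_with_runt_sizes` and the toy points are assumptions made for this example.

```python
import math
from itertools import combinations


def mst_with_runt_sizes(points):
    """Return MST edges as (length, i, j, runt_size), in the order they are added."""
    n = len(points)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # All pairwise distances, shortest first.
    pairs = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    mst = []
    for length, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # both endpoints already belong to the same fragment
        # The two fragments being joined are the subtrees on either side of
        # the new edge, so the smaller fragment size is the edge's runt size.
        mst.append((length, i, j, min(size[ri], size[rj])))
        if size[ri] < size[rj]:
            ri, rj = rj, ri
        parent[rj] = ri
        size[ri] += size[rj]
    return mst


# Two tight groups of three points each, far apart from one another.
toy_points = [(0, 0), (1, 0), (0, 1.2), (10, 0), (11.1, 0), (10, 1.3)]
for edge in mst_with_runt_sizes(toy_points):
    print(edge)
```

In this toy data the fragments grow inside each group first, and the last edge added, between points 1 and 3, joins two fragments of three points each. It is the only edge with runt size 3. All others have runt size 1.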

Illustration

Left: data set
Middle: rootogram of runt sizes
Right: MST after removal of all edges longer than the edge with the largest runt size

Computational complexity

Computing MST: O (n log n) using spatial hashing
Computing runt sizes for edges of MST: O (n log n)
Deciding whether a cluster with m observations should be split: O (m)

However: spatial partitioning is most effective if n is large relative to the dimension d.

Relationship to single linkage clustering

Single linkage clustering = standard way of extracting clusters from the MST
  To obtain k clusters, break the k-1 longest edges in the MST

Problems:
  Breaking the longest edges tends to separate stragglers from the bulk of the data and often results in one large and many small clusters (“chaining”)
  Choosing a single threshold for edge length corresponds to choosing a single cut level for the 1-NN density estimate. However, there might not be a single cut level that reveals all the leaves of the mode tree.

Figure annotations:
  Cut at upper level reveals two leftmost modes.
  Cut at lower level reveals right mode.

Need to consider cuts at all levels
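The straggler problem described above can be seen in a small Python sketch of the "break the k-1 longest edges" rule. The function name `single_linkage_sizes` and the toy edge list are assumptions made for this example. The edges describe two groups of three observations plus one straggler (index 6).

```python
def single_linkage_sizes(n, mst_edges, k):
    """Break the k-1 longest MST edges; return component sizes, largest first."""
    keep = sorted(mst_edges)[: len(mst_edges) - (k - 1)]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, u, v in keep:
        parent[find(u)] = find(v)  # merge the two components joined by a kept edge

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


# MST edges as (length, u, v); the longest edge attaches the straggler.
toy_edges = [(1.0, 0, 1), (1.1, 1, 2), (7.9, 2, 3),
             (1.2, 3, 4), (1.3, 4, 5), (17.5, 5, 6)]
print(single_linkage_sizes(7, toy_edges, k=2))  # [6, 1]
print(single_linkage_sizes(7, toy_edges, k=3))  # [3, 3, 1]
```

With k = 2 the single cut is spent on the straggler, and the two groups only separate at k = 3, with the straggler left as a singleton cluster.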

Illustration - olive oil data

Objects: 572 olive oil samples from 9 different areas, grouped into 3 regions: (1, 2, 3, 4), (5, 6), (7, 8, 9)
Features: concentrations of 8 different chemicals
Question: how well can we recover the grouping into regions and areas?
Note: to evaluate the performance of unsupervised learning methods, we need labeled data

20 largest runt sizes: (figure)
Fairly clear gap: choose runt size 33 as threshold
Note: the situation is not always that clear cut

Estimate of cluster tree, olive oil data

Interpretation:
  Bottom split separates region 3 from regions 1 and 2
  Next split on left separates region 1 from region 2
  Not able to correctly partition region 1 into areas

Areas vs clusters

Interpretation of table:
  There are 25 olive oil samples from area 1. One of them ended up in cluster 2, 17 in cluster 6, and 7 in cluster 8.
  Not able to recognize areas 1-4 in region 1
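A table of this kind can be built by cross-tabulating the true area label against the assigned cluster, as in the minimal Python sketch below. The function name `crosstab` is an assumption. The example reproduces only the area 1 row quoted above, because the rest of the table is not available here.

```python
from collections import Counter


def crosstab(areas, clusters):
    """Count observations for each (area, cluster) pair, as nested dicts."""
    counts = Counter(zip(areas, clusters))
    rows = sorted(set(areas))
    cols = sorted(set(clusters))
    return {a: {c: counts[(a, c)] for c in cols} for a in rows}


# Area 1 only: 25 samples, with 1 in cluster 2, 17 in cluster 6, 7 in cluster 8.
areas = [1] * 25
clusters = [2] * 1 + [6] * 17 + [8] * 7
table = crosstab(areas, clusters)
print(table[1])  # {2: 1, 6: 17, 8: 7}
```

A row spread over several clusters, like this one, means the area was not recovered as a single cluster.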

Diagnostic plot: Do the two clusters in area 3 really correspond to modes?

(a) cluster tree with node splitting area 3 selected
(b) projection of data in node on Fisher discriminant direction separating daughters
(c) cluster tree with node separating area 3 from area 2 selected
(d) projection of data on Fisher direction
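The projection used in panels (b) and (d) can be sketched as follows. This is a minimal pure-Python sketch of the standard two-class Fisher discriminant direction, w = S_w^{-1} (mean_a - mean_b), where S_w is the pooled within-class scatter matrix. It assumes S_w is nonsingular. The helper names `solve`, `fisher_direction`, and `project`, and the toy daughter clusters, are assumptions made for this example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fisher_direction(a, b):
    """Direction w solving S_w w = mean(a) - mean(b) for two daughter clusters."""
    d = len(a[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(d)]

    ma, mb = mean(a), mean(b)
    S = [[0.0] * d for _ in range(d)]  # pooled within-class scatter matrix
    for rows, m in ((a, ma), (b, mb)):
        for r in rows:
            for i in range(d):
                for j in range(d):
                    S[i][j] += (r[i] - m[i]) * (r[j] - m[j])
    return solve(S, [ma[i] - mb[i] for i in range(d)])


def project(rows, w):
    """One-dimensional projection of each observation on direction w."""
    return [sum(x * wj for x, wj in zip(r, w)) for r in rows]


# Toy daughter clusters in two dimensions.
left = [(0, 0), (1, 0.5), (0, 1), (1, 1.5)]
right = [(5, 0), (6, 0.5), (5, 1), (6, 1.5)]
w = fisher_direction(left, right)
print(project(left, w))
print(project(right, w))
```

A histogram of the projected values for all observations in the node can then be inspected for two separated peaks. In this toy example the two sets of projected values do not overlap.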

Diagnostic plot: Do areas 1 and 4 really correspond to modes?

Projection of areas 1 (black), 2 (green), 3 (blue), and 4 (red) on the plane spanned by the first two discriminant coordinates

Note: not an operational diagnostic, since it assumes knowledge of the true labels