1 8. Estimating the cluster tree of a density from the MST by Runt Pruning

Problem: the 1-NN density estimate is very noisy (it has a singularity at each observation), so its cluster tree would have n leaves.

Idea: control the size of the cluster tree with a runt size threshold. A split of a connected component of the level set L(c, p*) = {x : p*(x) ≥ c} is considered "significant" only if both daughter components are larger than the runt size threshold.

Sketch of the algorithm (a runnable Python version follows the pseudocode on the next slide):

    Repeat { break the longest edge of the MST }
    Until min(size of left subtree, size of right subtree) > runt size threshold
    If … apply recursively to the subtrees

2 Runt analysis

Define the runt size (J. Hartigan) of an MST edge e: break all MST edges that are longer than e; then

    runt_size(e) = min(#obs in left subtree, #obs in right subtree)

Algorithm:

    compute_cluster_tree (mst, runt_size_threshold) {
        node = new_cluster_tree_node;
        node.leftson = node.rightson = NULL;
        node.obs = leaves (mst);
        cut_edge = longest_edge_with_large_runt_size (mst, runt_size_threshold);
        if (cut_edge) {
            node.leftson  = compute_cluster_tree (left_subtree (mst, cut_edge),  runt_size_threshold);
            node.rightson = compute_cluster_tree (right_subtree (mst, cut_edge), runt_size_threshold);
        }
        return (node);
    }

[Figure: example MST with edge runt sizes rs = 1, rs = 5, rs = 2 marked on its edges]
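Below is a minimal runnable Python sketch of the runt analysis (an illustration added to this transcript, not code from the slides). It assumes a Euclidean MST computed with scipy; the function names and the union-find shortcut are our own choices. The shortcut: scanning MST edges in increasing length order, an edge's runt size is the size of the smaller of the two components it joins at that moment, which matches the definition above (up to ties in edge lengths).

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import minimum_spanning_tree

    def mst_edges(X):
        """Euclidean MST as a list of (length, u, v) tuples."""
        T = minimum_spanning_tree(squareform(pdist(X))).tocoo()
        return [(float(w), int(u), int(v)) for u, v, w in zip(T.row, T.col, T.data)]

    def runt_sizes(edges, n):
        """Runt size of each MST edge: scan edges by increasing length;
        an edge's runt size is the size of the smaller component it joins."""
        parent, size = list(range(n)), [1] * n
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        rs = {}
        for w, u, v in sorted(edges):
            ru, rv = find(u), find(v)
            rs[(u, v)] = min(size[ru], size[rv])
            parent[ru] = rv
            size[rv] += size[ru]
        return rs

    def split(rest, cut):
        """Node sets of the two components after the cut edge is removed."""
        comp = {}
        for w, u, v in rest:
            su, sv = comp.setdefault(u, {u}), comp.setdefault(v, {v})
            if su is not sv:
                su |= sv
                for x in sv:
                    comp[x] = su
        return comp.get(cut[1], {cut[1]}), comp.get(cut[2], {cut[2]})

    def cluster_tree(edges, nodes, rs, threshold):
        """Runt pruning: cut the longest edge whose runt size reaches the
        threshold, then recurse into the left and right subtrees."""
        cands = [e for e in edges if rs[(e[1], e[2])] >= threshold]
        if not cands:
            return {"obs": sorted(nodes)}          # leaf of the cluster tree
        cut = max(cands)                           # longest qualifying edge
        rest = [e for e in edges if e != cut]
        left, right = split(rest, cut)
        return {"cut": cut,
                "left":  cluster_tree([e for e in rest if e[1] in left],  left,  rs, threshold),
                "right": cluster_tree([e for e in rest if e[1] in right], right, rs, threshold)}

    # Example: two well-separated Gaussian blobs
    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
    edges = mst_edges(X)
    tree = cluster_tree(edges, set(range(len(X))), runt_sizes(edges, len(X)), threshold=20)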

3 Heuristic justification: MST edges with large runt size indicate the presence of multiple modes

Recall the multi-fragment algorithm for MST construction (a sketch follows below):

    Define the distance d(G1, G2) between groups as the minimum distance between their observations
    Initialize each observation to form its own group
    Repeat {
        Find the two closest groups
        Add the shortest edge connecting them
        Merge the two closest groups
    } Until only one group remains

What will happen? Fragments start and grow in high-density regions, where distances are small. Eventually those fragments are joined by edges, and those edges have large runt size.
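To make the heuristic concrete, here is a deliberately naive (roughly cubic) Python sketch of the multi-fragment construction; it is our own illustration, not the slides' code. Because it records edges in merge order, the last entries are the long edges that join already-large fragments, i.e. exactly the edges with large runt size.

    import numpy as np
    from scipy.spatial.distance import cdist

    def multi_fragment_mst(X):
        """Grow MST fragments by repeatedly merging the two closest groups;
        group distance = minimum distance between their observations."""
        groups = [[i] for i in range(len(X))]
        edges = []
        while len(groups) > 1:
            best = None
            for a in range(len(groups)):
                for b in range(a + 1, len(groups)):
                    D = cdist(X[groups[a]], X[groups[b]])
                    i, j = np.unravel_index(D.argmin(), D.shape)
                    if best is None or D[i, j] < best[0]:
                        best = (D[i, j], a, b, groups[a][i], groups[b][j])
            d, a, b, u, v = best
            edges.append((d, u, v))   # the shortest edge connecting the two groups
            groups[a] += groups[b]    # merge the two closest groups
            del groups[b]
        return edges                  # MST edges, in merge order

Run on the two-blob example from the previous slide, the final edge appended is the long bridge between the blobs.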

4 Illustration
Left: the data set.
Middle: rootogram of the runt sizes.
Right: the MST after removal of all edges longer than the edge with the largest runt size.

5 Computational complexity
Computing the MST: O(n log n) using spatial hashing.
Computing runt sizes for the edges of the MST: O(n log n).
Deciding whether a cluster with m observations should be split: O(m).
However: spatial partitioning is most effective if n is large relative to the dimension d (see the sketch below).
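The O(n log n) claims presuppose a spatial index. A common practical shortcut (our illustration, not necessarily the implementation behind the slides) is to build the MST on a sparse k-nearest-neighbor graph, which a KD-tree delivers in about O(n log n) for moderate d; for small k this only approximates the exact Euclidean MST, and k must be raised until the graph is connected. The same KD-tree caveat explains the last line above: in high dimension the index degenerates and the speedup disappears.

    import numpy as np
    from sklearn.neighbors import kneighbors_graph
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

    X = np.random.randn(10000, 2)
    G = kneighbors_graph(X, n_neighbors=10, mode="distance")  # KD-tree backed, sparse
    G = G.maximum(G.T)                        # symmetrize the directed kNN graph
    ncomp, _ = connected_components(G, directed=False)
    assert ncomp == 1, "increase n_neighbors until the kNN graph is connected"
    T = minimum_spanning_tree(G)              # MST restricted to kNN edges
    print(T.nnz)                              # n - 1 edges when connected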

6 Relationship to single linkage clustering
Single linkage clustering is the standard way of extracting clusters from the MST: to obtain k clusters, break the k-1 longest edges of the MST (see the snippet below).
Problems:
Breaking the longest edges tends to separate stragglers from the bulk of the data and often results in one large cluster and many small ones ("chaining").
Choosing a single threshold for edge length corresponds to choosing a single cut level for the 1-NN density estimate. However, there might not be a single cut level that reveals all the leaves of the mode tree: a cut at an upper level reveals the two leftmost modes, while a cut at a lower level reveals the right mode. We need to consider cuts at all levels.
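For comparison, plain single linkage with a fixed number of clusters is a few lines with scipy (a sketch; k = 3 here corresponds to breaking the 2 longest MST edges):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    X = np.random.randn(200, 2)
    Z = linkage(pdist(X), method="single")           # merge order follows the MST
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into k = 3 clusters

On data with stragglers, most observations land in one big cluster, which is the chaining problem described above.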

7 Illustration: olive oil data
Objects: 572 olive oil samples coming from 9 different areas, grouped into 3 regions: (1, 2, 3, 4), (5, 6), (7, 8, 9).
Features: concentrations of 8 different chemicals.
Question: how well can we recover the grouping into regions and areas?
Note: to evaluate the performance of unsupervised learning methods, we need labeled data.
The 20 largest runt sizes show a fairly clear gap: choose runt size 33 as the threshold (see the snippet below).
Note: the situation is not always that clear cut.
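Reading off the gap can reuse the `rs` dictionary from the sketch on slide 2 (the threshold value 33 is specific to this olive oil example):

    top20 = sorted(rs.values(), reverse=True)[:20]  # the 20 largest runt sizes
    print(top20)                                    # look for a clear gap in this sequence
    runt_size_threshold = 33                        # chosen at the gap for the olive oil data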

8 Estimate of the cluster tree, olive oil data
Interpretation:
The bottom split separates region 3 from regions 1 and 2.
The next split on the left separates region 1 from region 2.
The tree is not able to correctly partition region 1 into areas.

9 Areas vs. clusters
Interpretation of the table: there are 25 olive oil samples from area 1; one of them ended up in cluster 2, 17 in cluster 6, and 7 in cluster 8.
The method is not able to recognize areas 1-4 in region 1.

10 Diagnostic plot: do the two clusters in area 3 really correspond to modes?
(a) cluster tree with the node splitting area 3 selected;
(b) projection of the data in that node on the Fisher discriminant direction separating its daughters;
(c) cluster tree with the node separating area 3 from area 2 selected;
(d) projection of the data on the Fisher direction.
(A sketch for reproducing panels (b) and (d) follows below.)
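Panels (b) and (d) can be reproduced with a short sketch (our own; `left_idx` and `right_idx` are hypothetical index arrays holding a node's two daughter clusters). For two classes, LDA with one component recovers the Fisher discriminant direction.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def daughter_projection(X, left_idx, right_idx):
        """Project a node's observations on the Fisher direction separating
        its daughters; clear bimodality supports the split as two modes."""
        idx = np.concatenate([left_idx, right_idx])
        y = np.concatenate([np.zeros(len(left_idx)), np.ones(len(right_idx))])
        proj = LinearDiscriminantAnalysis(n_components=1).fit_transform(X[idx], y).ravel()
        plt.hist(proj[y == 0], bins=20, alpha=0.6, label="left daughter")
        plt.hist(proj[y == 1], bins=20, alpha=0.6, label="right daughter")
        plt.xlabel("Fisher discriminant direction")
        plt.legend()
        plt.show()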

11 Diagnostic plot: do areas 1 and 4 really correspond to modes?
Projection of areas 1 (black), 2 (green), 3 (blue), and 4 (red) on the plane spanned by the first two discriminant coordinates.
Note: this is not an operational diagnostic, since it assumes knowledge of the true labels.