1 BIRCH: A New Data Clustering Algorithm and Its Applications
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Presented by Qiang Jing, CS 331, Spring 2006; Paul Haake, Spring 2007

2 Problem Introduction
Data clustering:
How do we divide n data points into k groups?
How do we minimize the difference within the groups?
How do we maximize the difference between different groups?
How do we avoid trying all possible solutions?
Very large data sets
Limited computational resources: memory, I/O

3 Outline
Problem introduction
Previous work
Introduction to BIRCH
The algorithm
Experimental results
Conclusions & practical use

4 Previous Work
Two classes of clustering algorithms:
Probability-based: incremental, top-down sorting, probabilistic objective function (e.g., category utility); examples are COBWEB (discrete attributes) and CLASSIT (continuous attributes)
Distance-based: KMEANS, KMEDOIDS, and CLARANS

5 Previous work: COBWEB
Probabilistic approach to making decisions
Probabilistic measure: Category Utility
Clusters are represented with a probabilistic description
Incrementally generates a hierarchy, clustering data points one at a time
Maximizes Category Utility at each decision point
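
For reference (the slide does not show it), the category utility measure that COBWEB maximizes is usually written as follows for a partition into clusters $C_1,\dots,C_K$ over nominal attributes $A_i$ with values $V_{ij}$; this is the standard Gluck-Corter formulation quoted as background, not taken from the slides:

$$\mathrm{CU} = \frac{1}{K}\sum_{k=1}^{K} P(C_k)\left[\sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2 \;-\; \sum_i \sum_j P(A_i = V_{ij})^2\right]$$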

6 Previous work: COBWEB limitations
Computing Category Utility is very expensive
Attributes are assumed to be statistically independent
Every instance translates into a terminal node in the hierarchy, which is infeasible for large data sets
Large hierarchies tend to overfit the data

7 Previous work: distance-based clustering
Partition clustering: starts with an initial clustering, then moves data points between clusters to find better clusterings; each cluster is represented by a "centroid"
Hierarchical clustering: repeatedly merges the closest pairs of objects and splits the farthest pairs of objects

8 Previous work: KMEANS
Distance-based approach: there must be a distance measure between any two instances (data points)
Iteratively assigns instances to the nearest centroid to minimize within-cluster distances
Converges to a local minimum
Sensitive to instance order
May have exponential run time in the worst case
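
As a reminder of what this iteration looks like in practice, here is a minimal Lloyd-style k-means sketch; the function name, defaults, and structure are illustrative, not taken from the paper or the slides:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal Lloyd-style k-means sketch: assign each point to the nearest
    centroid, recompute centroids, repeat until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance from every point to every centroid -> nearest-centroid labels.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged (to a local minimum, as the slide notes)
        centroids = new_centroids
    return labels, centroids
```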

9 Previous work: KMEANS limitations
All instances must be available up front and stored in memory
Frequent scans of the data (non-incremental)
A global method at the granularity of data points: all instances are considered individually
But not all data points are equally important for clustering
Possible improvement (foreshadowing!): close or dense data points could be treated collectively

10 Previous work: KMEDOIDS, CLARANS
KMEDOIDS: similar to KMEANS, except that each cluster is represented by one centrally located object (a medoid) rather than a centroid
CLARANS = KMEDOIDS + a randomized partial search strategy; should theoretically outperform KMEDOIDS

11 Introduction to BIRCH
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
Only works with "metric" attributes (values must be Euclidean coordinates)
Designed for very large data sets: time and memory constraints are explicit
Treats dense regions of data points collectively, so not all data points are equally important for clustering
The problem is transformed into clustering a set of "summaries" rather than a set of data points
Only one scan of the data is necessary

12 Introduction to BIRCH
Incremental, distance-based approach: decisions are made without scanning all data points or all currently existing clusters
Does not need the whole data set in advance (unusual: distance-based algorithms generally need all the data points to work)
Makes the best use of available memory while minimizing I/O costs
Does not assume that the probability distributions over attributes are independent

13 Background
Given a cluster of N instances, we define:
Centroid: the mean of the member points
Radius (R): the average (root-mean-square) distance from member points to the centroid
Diameter (D): the average pairwise distance within the cluster
The radius and diameter are two alternative measures of cluster "tightness"
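
The formulas on this slide are images in the original transcript; the definitions from the BIRCH paper, for a cluster of $N$ points $\vec{X}_1,\dots,\vec{X}_N$, are:

$$\vec{X}_0 = \frac{1}{N}\sum_{i=1}^{N}\vec{X}_i, \qquad
R = \left(\frac{1}{N}\sum_{i=1}^{N}\left\|\vec{X}_i-\vec{X}_0\right\|^2\right)^{1/2}, \qquad
D = \left(\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}\left\|\vec{X}_i-\vec{X}_j\right\|^2}{N(N-1)}\right)^{1/2}$$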

14 Background: cluster-to-cluster distance measures
We define the centroid Euclidean distance (D0) and the centroid Manhattan distance (D1) between any two clusters as follows.
Anyone know how these are different?
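
The formulas themselves are images in the original; in the paper's notation, with cluster centroids $\vec{X}_{01}$ and $\vec{X}_{02}$ in $d$ dimensions, they are:

$$D0 = \left\|\vec{X}_{01}-\vec{X}_{02}\right\| = \left(\sum_{m=1}^{d}\left(X_{01}^{(m)}-X_{02}^{(m)}\right)^2\right)^{1/2}, \qquad
D1 = \sum_{m=1}^{d}\left|X_{01}^{(m)}-X_{02}^{(m)}\right|$$

The difference: D0 is the straight-line (L2) distance between the two centroids, while D1 is the L1 ("city block") distance, the sum of per-dimension absolute differences.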

15 Background: cluster-to-cluster distance measures
We define the average inter-cluster distance (D2), the average intra-cluster distance (D3), and the variance increase distance (D4) as follows.
Note: D3 is the diameter of the merged cluster.
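
Again the formulas are images in the original; using the cluster indexing of the next two slides ($\{\vec{X}_i\}_{i=1}^{N_1}$ and $\{\vec{X}_j\}_{j=N_1+1}^{N_1+N_2}$), the paper's definitions are, up to notational detail:

$$D2 = \left(\frac{\sum_{i=1}^{N_1}\sum_{j=N_1+1}^{N_1+N_2}\left\|\vec{X}_i-\vec{X}_j\right\|^2}{N_1 N_2}\right)^{1/2}$$

$$D3 = \left(\frac{\sum_{l=1}^{N_1+N_2}\sum_{m=1}^{N_1+N_2}\left\|\vec{X}_l-\vec{X}_m\right\|^2}{(N_1+N_2)(N_1+N_2-1)}\right)^{1/2}
\quad\text{(the diameter of the merged cluster)}$$

$$D4 = \sum_{l=1}^{N_1+N_2}\left\|\vec{X}_l-\vec{X}_0\right\|^2
      -\sum_{i=1}^{N_1}\left\|\vec{X}_i-\vec{X}_{01}\right\|^2
      -\sum_{j=N_1+1}^{N_1+N_2}\left\|\vec{X}_j-\vec{X}_{02}\right\|^2$$

where $\vec{X}_0$, $\vec{X}_{01}$, and $\vec{X}_{02}$ are the centroids of the merged cluster and of the two original clusters, so D4 measures the increase in total squared deviation caused by the merge.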

16 Background
Cluster 1: {X_i}, i = 1, 2, …, N1
Cluster 2: {X_j}, j = N1+1, N1+2, …, N1+N2

17 Background
Merged cluster {X_l} = {X_i} + {X_j}: l = 1, 2, …, N1, N1+1, N1+2, …, N1+N2

18 Background: optional data preprocessing (normalization)
Avoids bias caused by dimensions with a large spread
But a large spread may naturally describe the data!
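
A minimal sketch of the kind of normalization the slide means (z-scoring each dimension so that no large-spread dimension dominates the distance measures); the function is illustrative and not part of BIRCH itself:

```python
import numpy as np

def zscore(X):
    """Scale each dimension to zero mean and unit variance so that a
    dimension with a naturally large spread does not dominate distances."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant dimensions
    return (X - X.mean(axis=0)) / std
```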

19 Clustering Feature
"How much information should be kept for each subcluster?"
"How is the information about subclusters organized?"
"How efficiently is the organization maintained?"

20 Clustering Feature
A Clustering Feature (CF) summarizes a sub-cluster of data points as the triple CF = (N, LS, SS): the number of points, their linear sum, and their sum of squares.
Additivity theorem: for disjoint sub-clusters, CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2), so sub-clusters can be merged by adding their CFs.

21 Properties of Clustering Feature
A CF entry is compact: it stores significantly less than all of the data points in the sub-cluster
A CF entry has sufficient information to calculate the centroid, R, D, and the distances D0-D4
The additivity theorem allows us to merge sub-clusters incrementally and consistently
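
A minimal sketch of a CF entry, assuming the paper's definition CF = (N, LS, SS); here SS is kept as a single scalar (the sum of squared norms), which is enough to recover the centroid, R, and D via the identities in the comments:

```python
import numpy as np

class CF:
    """Clustering Feature sketch: a sub-cluster summarized by N (point count),
    LS (linear sum vector), and SS (sum of squared norms)."""

    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, np.asarray(ls, dtype=float), float(ss)

    @classmethod
    def from_point(cls, x):
        x = np.asarray(x, dtype=float)
        return cls(1, x, float(x @ x))

    def __add__(self, other):
        # Additivity theorem: merging two disjoint sub-clusters is just
        # component-wise addition of their CFs.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2 (mean squared distance to the centroid).
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

    def diameter(self):
        # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)) (mean squared pairwise distance).
        if self.n < 2:
            return 0.0
        return float(np.sqrt(max(
            (2 * self.n * self.ss - 2 * float(self.ls @ self.ls))
            / (self.n * (self.n - 1)), 0.0)))
```

For example, `CF.from_point([1.0, 2.0]) + CF.from_point([3.0, 4.0])` summarizes both points, and its `.centroid()` is `[2.0, 3.0]`.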

22 CF-Tree (diagram of the tree of CF entries in the original slides)

23 Properties of the CF-tree
Each non-leaf node has at most B entries
Each leaf node has at most L CF entries, each of which satisfies threshold T (a maximum diameter or radius)
P (the page size in bytes) is the maximum size of a node
Compact: each leaf entry is a sub-cluster, not a data point!
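
A bare-bones sketch of a node in such a tree, just to make the roles of B, L, and T concrete; the names and fields are illustrative, not the paper's data layout:

```python
from dataclasses import dataclass, field

@dataclass
class CFNode:
    """CF-tree node sketch: a non-leaf node keeps at most B (CF, child) entries;
    a leaf node keeps at most L CF entries, each with diameter (or radius) <= T.
    In the paper, B and L are derived from the page size P."""
    is_leaf: bool
    entries: list = field(default_factory=list)    # CF summaries (one per child or sub-cluster)
    children: list = field(default_factory=list)   # child CFNodes (empty for a leaf)
```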

24 CF-Tree Insertion
Recurse down from the root to find the appropriate leaf: follow the "closest" CF path with respect to D0 / … / D4
Modify the leaf: if the closest leaf CF cannot absorb the new entry without exceeding the threshold, make a new CF entry; if there is no room for the new leaf entry, split the node, recursively back toward the root if necessary
Traverse back up: update the CFs on the path (or the split nodes) to reflect the new addition to the cluster
Merging refinement: splits are caused by the page size, which may result in unnatural clusters; merge nodes to try to compensate for this
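
A runnable sketch of just the leaf-level absorb-or-create step, reusing the CF class sketched earlier and D0 (centroid distance) for closeness; the full algorithm additionally descends a tree of non-leaf CFs, splits overflowing nodes, and updates the CFs on the path back to the root:

```python
import numpy as np

def leaf_insert(leaf_entries, point, threshold):
    """Try to absorb `point` into the closest leaf CF entry; if the merged
    entry would exceed the diameter threshold T, start a new CF entry."""
    new_cf = CF.from_point(point)
    if leaf_entries:
        dists = [np.linalg.norm(cf.centroid() - new_cf.centroid())
                 for cf in leaf_entries]                      # D0 to each entry
        i = int(np.argmin(dists))
        if (leaf_entries[i] + new_cf).diameter() <= threshold:
            leaf_entries[i] = leaf_entries[i] + new_cf        # absorbed
            return leaf_entries
    leaf_entries.append(new_cf)                               # new sub-cluster
    return leaf_entries
```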

25 CF-Tree Anomalies
Anomaly 1: because node size is limited by the page size, two sub-clusters that should be together may be split across two nodes, or two sub-clusters that should not be in the same cluster may end up in the same node

26 CF-Tree Anomalies
Anomaly 2: equal-valued data points inserted at different times may end up in different leaf entries

27 CF-Tree Rebuilding
If we run out of memory, increase the threshold T: CFs absorb more data but become less granular (leaf-entry clusters grow larger)
Rebuilding "pushes" CFs together: the larger T allows different CFs to merge
Reducibility theorem: increasing T results in a CF-tree as small as or smaller than the original
Rebuilding needs at most h (the height of the tree) extra pages of memory
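
In the same flat-list sketch used above, rebuilding amounts to re-inserting the existing leaf CFs under a larger threshold; the real rebuild instead walks the old tree path by path and reuses its pages, which is why only about h extra pages are needed:

```python
import numpy as np

def rebuild(leaf_entries, new_threshold):
    """Re-insert existing leaf CFs with a larger T: nearby CFs merge, so the
    result has at most as many entries as before (reducibility theorem)."""
    rebuilt = []
    for cf in leaf_entries:
        if rebuilt:
            dists = [np.linalg.norm(old.centroid() - cf.centroid()) for old in rebuilt]
            i = int(np.argmin(dists))
            if (rebuilt[i] + cf).diameter() <= new_threshold:
                rebuilt[i] = rebuilt[i] + cf    # absorbed into an existing CF
                continue
        rebuilt.append(cf)                      # still its own sub-cluster
    return rebuilt
```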

28 BIRCH Overview (diagram of the four phases in the original slides)

29 The Algorithm: BIRCH
Phase 1: load data into memory
Build an initial in-memory CF-tree from the data (one scan)
Subsequent phases are then:
fast (no more I/O is needed; they work on sub-clusters instead of individual data points)
accurate (outliers are removed)
less order-sensitive (the CF-tree imposes an initial ordering on the data)

30 The Algorithm: BIRCH (diagram in the original slides)

31 The Algorithm: BIRCH
Phase 2: condense data (optional)
Resizes the data set so that Phase 3 runs on an optimally sized input
Rebuild the CF-tree with a larger T, remove more outliers, and group together crowded sub-clusters

32 The Algorithm: BIRCH
Phase 3: global clustering
Apply an existing clustering algorithm (e.g., hierarchical clustering, KMEANS, CLARANS) to the leaf CF entries
Helps fix the problem where natural clusters span nodes (Anomaly 1)
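
For present-day practical use, scikit-learn ships a BIRCH implementation whose parameters map onto the quantities in these slides (threshold plays the role of T, branching_factor of B/L, and n_clusters triggers a global clustering over the leaf sub-clusters); a small usage sketch with illustrative parameter values and synthetic data:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic data standing in for a large data set.
X, _ = make_blobs(n_samples=10_000, centers=3, random_state=0)

# threshold ~ T, branching_factor ~ B/L; setting n_clusters runs a global
# clustering step over the leaf sub-cluster centroids (Phase 3).
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print(len(model.subcluster_centers_), "CF sub-clusters summarize", len(X), "points")
```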

33 The Algorithm: BIRCH
Phase 4: cluster refining (optional)
Make additional passes over the data set and reassign each data point to the closest centroid from Phase 3
Fixes the CF-tree problem where equal-valued data points may be assigned to different leaf entries (Anomaly 2)
Always converges to a minimum
Allows us to discard more outliers

34 Memory Management
BIRCH's memory usage is determined by the data distribution, not the data size
In phases 1-3, BIRCH uses all available memory to generate clusters that are as granular as possible
Phase 4 can correct inaccuracies caused by insufficient memory (i.e., lack of granularity)
Time/space trade-off: if memory is low, spend more time in phase 4

35 Experimental Results
Input parameters:
Memory (M): 5% of the data set size
Distance measure: D2 (average inter-cluster distance)
Quality measure: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes
Phase 3 algorithm: an agglomerative hierarchical clustering algorithm
One refinement pass in Phase 4

36 Experimental Results
Three synthetic data sets are created for testing, plus an ordered copy of each for testing sensitivity to input order
KMEANS and CLARANS require the entire data set to be in memory: the initial scan is from disk, subsequent scans are in memory

37 Experimental Results: intended clustering (figure in the original slides)

38 Experimental Results: KMEANS clustering
DS   Time   D     # Scan      DS   Time   D     # Scan
1    43.9   2.09  289         1o   33.8   1.97  197
2    13.2   4.43  51          2o   12.7   4.20  29
3    32.9   3.66  187         3o   36.0   4.35  241
(Data sets 1o-3o are the ordered copies of the data.)

39 Experimental Results: CLARANS clustering
DS   Time   D     # Scan      DS   Time   D     # Scan
1    932    2.10  3307        1o   794    2.11  2854
2    758    2.63  2661        2o   816    2.31  2933
3    835    3.39  2959        3o   924    3.28  3369

40 Experimental Results: BIRCH clustering
DS   Time   D     # Scan      DS   Time   D     # Scan
1    11.5   1.87  2           1o   13.6   1.87  2
2    10.7   1.99  2           2o   12.1   1.99  2
3    11.4   3.95  2           3o   12.2   3.99  2

41 Conclusions & Practical Use
Pixel classification in images (figure in the original slides; from top to bottom: BIRCH classification, visible wavelength band, near-infrared band)

42 Conclusions & Practical Use
Image compression using vector quantization: generate a codebook of frequently occurring patterns
BIRCH runs faster than CLARANS or LBG, while achieving better compression and nearly as good quality

43 Conclusions & Practical Use
BIRCH works with very large data sets: it is explicitly bounded by computational resources and runs within a specified amount of memory
Superior to CLARANS and KMEANS in quality, speed, stability, and scalability

44 Exam Questions
What is the main limitation of BIRCH? BIRCH only works with metric attributes (i.e., Euclidean coordinates)
Name the two algorithms in BIRCH clustering: CF-tree insertion and CF-tree rebuilding

45 Exam Questions
What is the purpose of Phase 4 in BIRCH? (Slides 33-34.) Guaranteed convergence to a minimum; discards outliers; ensures duplicate data points end up in the same cluster; compensates for low memory availability during previous phases

