BIRCH: A New Data Clustering Algorithm and Its Applications
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Presented by: Peter Vile
Data Clustering
The problem of dividing N data points into K groups so as to minimize an intra-group difference metric. Many algorithms already exist. Problem: because of the abundance of local minima, there is no way to find a globally minimal solution without trying all possible partitions.
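For concreteness, a common intra-group difference metric is the within-cluster sum of squared distances to each group's centroid (the K-means objective); a minimal Python sketch, assuming NumPy is available:

import numpy as np

def within_group_sse(points, labels):
    # Sum of squared distances from each point to its group's centroid.
    total = 0.0
    for k in np.unique(labels):
        group = points[labels == k]
        total += ((group - group.mean(axis=0)) ** 2).sum()
    return total

# Example: two well-separated groups give a small objective value.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
print(within_group_sse(X, np.array([0, 0, 1, 1])))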
Probability-Based Clustering
COBWEB
- uses probabilistic measurements (Category Utility) for making decisions
- hierarchical, iterative
Disadvantages
- Category Utility takes time and memory
- tends to overfit
Other methods exist as well.
COBWEB's Limits
- assumes the probability distributions of attributes are independent
- can only handle discrete values; approximates continuous data by discretization
- storing and manipulating large data sets becomes infeasible
The Competition: Distance-Based Clustering
KMEANS, CLARANS
CLARANS
- like KMEANS
- a node is a set of medoids
- starts at a random node and moves to a neighboring node when doing so improves the clustering cost
BIRCH: The Good
- does not assume attributes are independent
- minimizes memory usage
- scans the data once from disk
- can handle very large data sets (uses the concept of summarization)
- exploits the observation that not every data point is equally important for clustering purposes
Limitations of BIRCH
Handles only metric data.
Definitions
For a cluster of N points X_1, ..., X_N:
- Centroid: the average value, X0 = (sum_{i=1}^{N} X_i) / N
- Radius: the RMS distance from the members to the centroid (a standard deviation), R = (sum_{i=1}^{N} ||X_i - X0||^2 / N)^{1/2}
- Diameter: the average pairwise distance within the cluster, D = (sum_{i=1}^{N} sum_{j=1}^{N} ||X_i - X_j||^2 / (N(N-1)))^{1/2}
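A minimal sketch of these three statistics in Python (NumPy assumed), following the root-mean-square definitions above:

import numpy as np

def cluster_stats(X):
    """Centroid, radius, and diameter of a point set X with shape (N, d), N >= 2."""
    N = len(X)
    centroid = X.mean(axis=0)
    # Radius: RMS distance from the members to the centroid.
    radius = np.sqrt(((X - centroid) ** 2).sum(axis=1).mean())
    # Diameter: RMS pairwise distance among members (the double sum counts both orders).
    pair_sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    diameter = np.sqrt(pair_sq.sum() / (N * (N - 1)))
    return centroid, radius, diameter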
How Far Away?
Given the centroids X0_1 and X0_2 of two clusters:
- Centroid Euclidean distance: D0 = ||X0_1 - X0_2||
- Centroid Manhattan distance: D1 = sum_{i=1}^{d} |X0_1^(i) - X0_2^(i)|, where d is the dimension
More Distances
For clusters {X_1, ..., X_{N1}} and {Y_1, ..., Y_{N2}}:
- Average inter-cluster distance: D2 = (sum_{i=1}^{N1} sum_{j=1}^{N2} ||X_i - Y_j||^2 / (N1 N2))^{1/2}
- Average intra-cluster distance: D3 = the diameter of the merged cluster
- Variance increase distance: D4 = the growth in total squared deviation from the centroid caused by merging the two clusters
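Sketches of D0, D1, and D2 in the same vein (X and Y are the two clusters' point arrays; NumPy assumed):

import numpy as np

def d0(c1, c2):
    # Centroid Euclidean distance.
    return np.linalg.norm(c1 - c2)

def d1(c1, c2):
    # Centroid Manhattan distance.
    return np.abs(c1 - c2).sum()

def d2(X, Y):
    # Average inter-cluster distance: RMS over all cross-cluster pairs.
    pair_sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(pair_sq.mean())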
Clustering Feature (CF)
“A Clustering Feature (CF) entry is a triple summarizing the information that we maintain about a sub-cluster of data points.”
CF definition: CF = (N, LS, SS)
- N: number of data points in the cluster
- LS: linear sum of the N data points, LS = sum_{i=1}^{N} X_i
- SS: square sum of the N data points, SS = sum_{i=1}^{N} X_i^2
CF Representativity Theorem
Given CF entries for all sub-clusters, all of the measurements above (centroid, radius, diameter, and the distances D0-D4) can be computed accurately.
CF Additivity Theorem
Assume that CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters. Then the CF entry of the sub-cluster formed by merging them is:
CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
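A minimal CF sketch in Python showing the triple, the additivity theorem, and the statistics recoverable from it (names are illustrative; SS is kept as a scalar, which is all the radius and diameter need):

import numpy as np

class CF:
    """Clustering Feature (N, LS, SS) summarizing a sub-cluster."""
    def __init__(self, point):
        self.n = 1
        self.ls = np.asarray(point, dtype=float)   # linear sum of the points
        self.ss = float((self.ls ** 2).sum())      # square sum of the points

    def merge(self, other):
        # Additivity: the merged sub-cluster's CF is the component-wise sum.
        self.n += other.n
        self.ls = self.ls + other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2; max() guards against floating-point error.
        return np.sqrt(max(self.ss / self.n - float((self.centroid() ** 2).sum()), 0.0))

    def diameter(self):
        # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)) for N >= 2.
        if self.n < 2:
            return 0.0
        num = 2 * self.n * self.ss - 2 * float((self.ls ** 2).sum())
        return np.sqrt(max(num / (self.n * (self.n - 1)), 0.0))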
CF-tree Features
Has two parameters:
1. Branching factor
   - non-leaf node: [CF_i, child_i], i = 1..B, where child_i is a pointer to the i-th child node
   - leaf node: [CF_j, prev, next], j = 1..L
2. Threshold T, which limits the size of each leaf entry:
   - diameter (D) of each leaf entry < T
   - or radius (R) of each leaf entry < T
CF-tree Features (continued)
Tree size is a function of T:
- tree size = f(T)
- as T increases, tree size decreases
Page size (P):
- a node is required to fit in a page of size P
- P can be varied for performance tuning
The CF-tree is built dynamically as new data objects are inserted.
CF-tree
Two algorithms are used to build a CF-tree:
1. Insertion algorithm: builds the CF-tree
2. Rebuilding algorithm: rebuilds the whole tree with a larger T (and hence a smaller size); this happens when the CF-tree size limit is exceeded
Insertion Algorithm
1. Identify the appropriate leaf
   - at each non-leaf node, use the distance metric to choose the closest branch
2. Insert the entry into the leaf node
   - merge with the closest leaf entry [CF_i, prev, next]
   - if T is violated, make a new leaf entry [CF_{i+1}, prev, next]
   - if L is violated, split into two leaf nodes: choose the two entries that are farthest apart as seeds, then put each remaining entry in the node with the closer seed
Insertion Algorithm
3. Update the tree path
   - if B is violated, split the node
   - each CF must be the sum of its child CFs
4. A merging refinement
   - try to combine the two closest children of the node that did not split
   - might free space
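A simplified sketch of the leaf step (step 2), reusing the CF class above; a real CF-tree node also keeps prev/next pointers and propagates splits upward, which is elided here:

import copy
import numpy as np

def insert_into_leaf(entries, new_cf, T, L):
    """Absorb new_cf into the closest leaf entry if the threshold allows,
    else add it as a new entry. Returns True if the node must split (> L entries)."""
    if entries:
        closest = min(entries,
                      key=lambda e: np.linalg.norm(e.centroid() - new_cf.centroid()))
        trial = copy.deepcopy(closest)
        trial.merge(new_cf)
        if trial.diameter() <= T:      # threshold test on the would-be merge
            closest.merge(new_cf)
            return False
    entries.append(new_cf)             # T violated (or empty leaf): new entry
    return len(entries) > L            # caller must split this node if True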
Rebuilding Algorithm
When is it run?
- when the CF-tree size limit is exceeded
What does it do?
- creates a new tree with a larger T (diameter/radius threshold); a larger T means a smaller tree
How?
- deletes each path from the old tree as it adds the corresponding path to the new tree
Rebuilding Algorithm
Procedure ('OldCurrentPath' starts at the leftmost branch):
1. Create 'NewCurrentPath' in the new tree by copying the nodes of OldCurrentPath
2. Re-insert the leaf entries of OldCurrentPath into the new tree; any entry that does not land in NewCurrentPath is removed from it
Rebuilding Algorithm
3. Free space in OldCurrentPath and NewCurrentPath: delete the nodes of OldCurrentPath, and delete any nodes of NewCurrentPath that are left empty
4. Process the next path in the old tree
The rebuild only needs enough extra pages to cover one OldCurrentPath, usually the height of the tree.
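The path-by-path procedure above is what keeps the extra memory down to one path; a memory-naive sketch that yields the same final tree simply reinserts every leaf entry under the larger threshold (CFTree, leaf_entries, and insert_cf are hypothetical names, not part of the paper):

def rebuild(old_tree, new_T):
    # Naive rebuild: start an empty tree with the larger threshold and
    # reinsert every leaf CF entry; entries may now merge under new_T.
    new_tree = CFTree(threshold=new_T, branching=old_tree.branching)
    for entry in old_tree.leaf_entries():
        new_tree.insert_cf(entry)
    return new_tree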
Potential Problems
Anomaly 1: a natural cluster is split across two leaf nodes, or two distant clusters are placed in the same node.
Anomaly 2: a sub-cluster ends up in the wrong leaf node.
Reducibility Theorem
Assume we rebuild CF-tree t_{i+1} of threshold T_{i+1} from tree t_i of threshold T_i by the above algorithm, and let S_i and S_{i+1} be the sizes of t_i and t_{i+1} respectively. If T_{i+1} >= T_i, then S_{i+1} <= S_i, and the transformation from t_i to t_{i+1} needs at most h extra pages of memory, where h is the height of t_i.
BIRCH Clustering Algorithm
Four phases:
1. Loading: scan all data and build an initial in-memory CF-tree
2. Optional condensing: rebuild the CF-tree to make it smaller; faster analysis, but reduced accuracy
3. Global clustering: run a clustering algorithm (e.g. KMEANS) on the CF entries; handles anomaly 1
BIRCH Clustering Algorithm
4. Optional refining: use the centroids of the clusters as seeds, scan the data again, and assign each point to the closest seed; handles anomaly 2
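scikit-learn ships a BIRCH implementation covering this pipeline; a minimal usage example (parameter values are illustrative, and note that the library's global step is agglomerative clustering by default rather than KMEANS):

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three synthetic 2-D blobs.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0.0, 3.0, 6.0)])

# threshold plays the role of T, branching_factor of B;
# n_clusters drives the global clustering phase.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))   # roughly 100 points per cluster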
BIRCH Phase 1
Building the CF-tree:
- heuristic threshold (T defaults to 0)
- when rebuilding, a new T (diameter/radius threshold) is needed; use the average distance between the closest pairs of leaf entries within the same node
- this should reduce the size of the tree by about half
BIRCH Phase 1
Outlier-handling option:
- a CF with a small N (number of data points) is saved to disk
- re-insertion is attempted when memory runs out or when all the data has been read
- if the data is noisy, this improves both runtime and accuracy
Delay-split option:
- when memory is about to run out, CFs that would cause the tree to split are saved to disk instead
How does BIRCH compare to other clustering methods? It was run against KMEANS and CLARANS.
Results [figures]
Runtime [figures]
Conclusions
BIRCH vs. CLARANS and KMEANS:
- runs faster (fewer scans)
- less order-sensitive
- uses less memory
Where can I use this?
Interactive and iterative pixel classification:
- MVI (Multi-band Vegetation Imager)
- BIRCH helps classify pixels through clustering
Codebook generation in image compression:
- compressing visual data to save space
- code book: a set of vector code words for image blocks
- BIRCH assigns the nearest code word to each image-block vector
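The code-book step is nearest-neighbor assignment; a minimal sketch with illustrative arrays:

import numpy as np

def assign_code_words(blocks, code_book):
    # Squared distances between every block vector and every code word,
    # then the index of the nearest code word per block.
    d2 = ((blocks[:, None, :] - code_book[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Example: 4-D block vectors against a 3-word code book.
blocks = np.random.rand(10, 4)
code_book = np.random.rand(3, 4)
print(assign_code_words(blocks, code_book))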
What are the main limitations of BIRCH?
It can only handle metric data.
Name the two algorithms used in BIRCH clustering.
1. Insertion
2. Rebuilding
What is the purpose of phase 4 in the BIRCH clustering algorithm?
- all copies of a given data point go to the same cluster
- it provides an option to discard outliers
- it can converge to a minimum