BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Data Mining and Knowledge Discovery, Volume 1, Issue 2, 1997, pp. 141-182
Presented by Zhao Li, Spring 2009

Outline
Introduction to Clustering
Main Techniques in Clustering
Hybrid Algorithm: BIRCH
Example of the BIRCH Algorithm
Experimental Results
Conclusions

Clustering Introduction
Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space. Main methods:
Partitioning: K-Means, ...
Hierarchical: BIRCH, ROCK, ...
Density-based: DBSCAN, ...
A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity. Partitioning clustering, especially the k-means algorithm, is widely used and often regarded as a benchmark clustering algorithm. Hierarchical clustering is the approach on which the BIRCH algorithm is based.

Main Techniques (1): Partitioning Clustering (K-Means)
Step 1 (initial centers): Given the number of clusters, choose the initial centers randomly.

K-Means Example, Step 2 (new centers after the 1st iteration): Assign every data instance to the closest cluster, based on the distance between the instance and the center of the cluster, then compute the new centers of the k clusters.

K-Means Example, Step 3 (new centers after the 2nd iteration): The assignment and update steps repeat until the centers stop moving.
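These three steps can be written down directly; below is a minimal NumPy sketch of k-means (illustrative code, not from the slides; the function name and parameters are mine):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: choose the initial centers randomly from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every instance to the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its members
        # (assumes no cluster becomes empty; real code should handle that).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```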

Main Techniques (2): Hierarchical Clustering
Multilevel clustering: level 1 has n clusters, level n has one cluster (or the reverse).
Agglomerative HC starts with singleton clusters and merges them (bottom-up).
Divisive HC starts with one cluster containing all samples and splits it (top-down).
The result can be visualized as a dendrogram.

Agglomerative HC Example (Nearest Neighbor), Level 2, k = 7 clusters.
Step 1: Calculate the similarity between all possible pairs of data instances.

Nearest Neighbor, Level 3, k = 6 clusters.

Nearest Neighbor, Level 4, k = 5 clusters.

Nearest Neighbor, Level 5, k = 4 clusters.
Step 2: The two most similar clusters are grouped together to form a new cluster.

Nearest Neighbor, Level 6, k = 3 clusters.
Step 3: Calculate the similarity between the new cluster and all remaining clusters.

Nearest Neighbor, Level 7, k = 2 clusters.

Nearest Neighbor, Level 8, k = 1 cluster.
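The merge sequence illustrated above can be reproduced with SciPy's hierarchical clustering routines; a small sketch on made-up toy data (method='single' corresponds to the nearest-neighbor rule):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five toy points; linkage() computes the full merge sequence.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [5.0, 5.0]])
Z = linkage(X, method='single')

# Cut the dendrogram at the level that yields k = 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # cluster label per point
```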

Remarks

                  Partitioning                Hierarchical
Time complexity   O(n)                        O(n² log n)
Pros              Easy to use and             Outputs a dendrogram, which is
                  relatively efficient        desired in many applications
Cons              Sensitive to initialization; Needs to store all data in memory;
                  bad initialization can      higher time complexity
                  lead to bad results

Notes: (1) The time complexity of computing the distance between every pair of data instances is O(n²). (2) The time complexity of creating the sorted list of inter-cluster distances is O(n² log n). In terms of both space and time, therefore, neither family of algorithms handles large datasets effectively.

Introduction to BIRCH
Designed for very large data sets, where time and memory are limited.
Incremental and dynamic clustering of incoming objects: only one scan of the data is necessary, and the whole data set is not needed in advance.
Two key phases: scan the database to build an in-memory tree, then apply a clustering algorithm to the leaf nodes.

Similarity Metric (1)
Given a cluster of $N$ instances $\{\vec{X}_i\}$, we define:
Centroid: $\vec{X}_0 = \frac{1}{N}\sum_{i=1}^{N}\vec{X}_i$
Radius, the average distance from member points to the centroid: $R = \left(\frac{1}{N}\sum_{i=1}^{N}(\vec{X}_i - \vec{X}_0)^2\right)^{1/2}$
Diameter, the average pairwise distance within the cluster: $D = \left(\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}(\vec{X}_i - \vec{X}_j)^2}{N(N-1)}\right)^{1/2}$
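As a quick sanity check of these definitions, a small NumPy computation of centroid, radius, and diameter from raw points (toy data; variable names are mine):

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
N = len(X)

x0 = X.mean(axis=0)                              # centroid
R = np.sqrt(((X - x0) ** 2).sum(axis=1).mean())  # radius
# Diameter: average over the N(N-1) ordered pairs of distinct points
# (the i = j terms contribute zero to the sum).
diffs = X[:, None, :] - X[None, :, :]
D = np.sqrt((diffs ** 2).sum(axis=2).sum() / (N * (N - 1)))
print(x0, R, D)
```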

Similarity Metric (2)
For two clusters with centroids $\vec{X}_{0,1}$ and $\vec{X}_{0,2}$ and points $\{\vec{X}_i\}_{i=1}^{N_1}$ and $\{\vec{Y}_j\}_{j=1}^{N_2}$:
Centroid Euclidean distance: $D_0 = \left((\vec{X}_{0,1} - \vec{X}_{0,2})^2\right)^{1/2}$
Centroid Manhattan distance: $D_1 = \sum_{k=1}^{d}\left|X_{0,1}^{(k)} - X_{0,2}^{(k)}\right|$
Average inter-cluster distance: $D_2 = \left(\frac{\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}(\vec{X}_i - \vec{Y}_j)^2}{N_1 N_2}\right)^{1/2}$
Average intra-cluster distance, i.e. the diameter of the merged cluster: $D_3 = \left(\frac{\sum_{i}\sum_{j}(\vec{Z}_i - \vec{Z}_j)^2}{(N_1+N_2)(N_1+N_2-1)}\right)^{1/2}$, where $\{\vec{Z}_i\}$ ranges over the points of both clusters
Variance increase distance: $D_4$, the increase in the sum of squared distances to the centroid caused by merging the two clusters
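The first three metrics are easy to compute directly from two point sets; a minimal sketch (function names d0/d1/d2 are mine, mirroring D0-D2):

```python
import numpy as np

def d0(X1, X2):
    """Centroid Euclidean distance (D0)."""
    return np.linalg.norm(X1.mean(axis=0) - X2.mean(axis=0))

def d1(X1, X2):
    """Centroid Manhattan distance (D1)."""
    return np.abs(X1.mean(axis=0) - X2.mean(axis=0)).sum()

def d2(X1, X2):
    """Average inter-cluster distance (D2): root mean squared
    distance over all pairs drawn from the two clusters."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(sq.mean())
```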

Clustering Feature
While scanning the data set, BIRCH builds a dendrogram-like structure called the clustering feature tree (CF tree). Each entry in the CF tree represents a cluster of objects and is characterized by a triple $(N, LS, SS)$, where:
$N$ is the number of data points in the cluster
$LS = \sum_{i=1}^{N}\vec{X}_i$ is the linear sum of the $N$ points
$SS = \sum_{i=1}^{N}\vec{X}_i^{\,2}$ is the square sum of the $N$ points
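Computing the CF triple for a point set takes one line per component; a minimal sketch (SS is kept as a scalar here, matching the radius formula used below; some formulations keep a per-dimension vector):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
N = len(X)                   # number of data points
LS = X.sum(axis=0)           # linear sum: a d-dimensional vector
SS = float((X ** 2).sum())   # square sum: a scalar
print(N, LS, SS)             # -> 3 [ 9. 12.] 91.0
```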

Properties of the Clustering Feature
A CF entry is compact: it stores significantly less than all of the data points in the subcluster.
A CF entry has sufficient information to calculate the metrics D0-D4.
The additivity theorem allows subclusters to be merged incrementally and consistently. CF Additivity Theorem: if $CF_1 = (N_1, LS_1, SS_1)$ and $CF_2 = (N_2, LS_2, SS_2)$ summarize two disjoint subclusters, the CF of their union is $CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)$.
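Both properties can be demonstrated in a few lines: merging is a component-wise sum, and the centroid and radius fall out of (N, LS, SS) alone (helper names are mine):

```python
import numpy as np

def merge(cf1, cf2):
    """Additivity theorem: the CF of the union of two disjoint
    subclusters is the component-wise sum of their CFs."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, ls1 + ls2, ss1 + ss2)

def centroid(cf):
    n, ls, _ = cf
    return ls / n

def radius(cf):
    """R^2 = SS/N - ||LS/N||^2: the average squared distance
    to the centroid, computed from the CF triple alone."""
    n, ls, ss = cf
    return np.sqrt(max(ss / n - (ls / n) @ (ls / n), 0.0))
```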

CF-Tree
Each non-leaf node holds at most B entries (B is called the branching factor), where each entry is a CF triple plus a link to a child node.
Each leaf node holds at most L CF entries, each representing a subcluster of data instances whose diameter cannot exceed the threshold T; leaf nodes also carry links to the previous and next leaf.
Node size is determined by the dimensionality of the data space and the input parameter P (page size).
The CF tree can be viewed as a multilevel compression of the data that tries to preserve the inherent clustering structure of the data.

CF-Tree Insertion
Recurse down from the root to find the appropriate leaf, following the closest-CF path with respect to any of D0-D4.
Modify the leaf: if the closest CF entry can absorb the new point without violating the threshold, update it; otherwise make a new CF entry. If there is no room for the new entry, split the leaf node: take the two farthest CFs as seeds of two new leaves and put the remaining CFs (including the new one that caused the split) into the closer of the two.
Traverse back up, updating the CFs on the path and splitting nodes that overflow. Splitting the root increases the tree height by one.
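The leaf-level absorb-or-create step can be sketched in a self-contained way; this simplified sketch elides the tree descent, node splits, and path updates, and the names and the radius-vs-T test are my choices:

```python
import numpy as np

def cf_of(x):
    """CF entry of a single point, stored as a mutable [N, LS, SS]."""
    return [1, x.copy(), float(x @ x)]

def merged_radius(cf, x):
    """Radius of the subcluster if cf absorbed point x."""
    n, ls, ss = cf[0] + 1, cf[1] + x, cf[2] + float(x @ x)
    return np.sqrt(max(ss / n - (ls / n) @ (ls / n), 0.0))

def insert_point(leaf_entries, x, T):
    """Absorb x into the closest CF entry if the result satisfies the
    threshold T; otherwise start a new CF entry (which, in the full
    algorithm, may overflow the leaf and force a split)."""
    if leaf_entries:
        closest = min(leaf_entries,
                      key=lambda cf: np.linalg.norm(cf[1] / cf[0] - x))  # D0
        if merged_radius(closest, x) <= T:
            closest[0] += 1
            closest[1] += x
            closest[2] += float(x @ x)
            return
    leaf_entries.append(cf_of(x))

leaf = []
for x in np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]):
    insert_point(leaf, x, T=0.5)
print(len(leaf))  # -> 2 subclusters
```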

CF-Tree Rebuilding
If the tree runs out of memory, increase the threshold T: with a larger T, each CF can absorb more data, so CFs that were previously separate can be grouped together and the tree shrinks. In short, bigger T means a smaller CF tree.
Reducibility theorem: rebuilding with a larger T results in a CF tree no larger than the original, and the rebuild needs at most h extra pages of memory, where h is the height of the original tree.

Example of BIRCH
[Figure: a CF tree with a root node pointing to leaf nodes LN1, LN2, LN3, which hold subclusters sc1-sc8; a new subcluster arrives at LN1.]

Insertion Operation in BIRCH
If a leaf node can hold at most 3 entries, inserting the new subcluster overflows LN1, so LN1 is split into LN1' and LN1''.
[Figure: the root now points to LN1', LN1'', LN2, LN3, with subclusters sc1-sc8 redistributed.]

If the branching factor of a non-leaf node cannot exceed 3, the root (now with four children) overflows, so the root is split into NLN1 and NLN2 and the height of the CF tree increases by one.
[Figure: a new root points to non-leaf nodes NLN1 and NLN2, which in turn point to LN1', LN1'', LN2, LN3.]

BIRCH Overview
Phase 1: Load data into memory. Build an initial in-memory CF tree with one scan of the data; subsequent phases become fast, accurate, and less order-sensitive.
Phase 2: Condense data (optional). Rebuild the CF tree with a larger T.
Phase 3: Global clustering. Apply an existing clustering algorithm to the leaf CF entries.
Phase 4: Cluster refining (optional). Do additional passes over the dataset and reassign data points to the closest centroid found in Phase 3.
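These phases map closely onto scikit-learn's Birch estimator, which builds the CF tree incrementally and then clusters the leaf subclusters globally; a usage sketch on made-up data:

```python
import numpy as np
from sklearn.cluster import Birch

# Three well-separated Gaussian blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 3, 6)])

# threshold plays the role of T, branching_factor of B; n_clusters drives
# the global clustering step over the leaf subclusters (Phase 3).
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), "leaf subclusters ->",
      len(np.unique(labels)), "final clusters")
```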

Experimental Results: Input Parameters
Memory (M): 5% of the data set size
Disk space (R): 20% of M
Distance metric: D2
Quality metric: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes (node size is determined by the dimensionality of the data space and P)

Experimental Results

KMEANS clustering:
DS    Time   D      # Scan
1     43.9   2.09   289
1o    33.8   1.97   197
2     13.2   4.43   51
2o    12.7   4.20   29
3     32.9   3.66   187
3o    36.0   4.35   241

BIRCH clustering (missing entries marked "-"):
DS    Time   D      # Scan
1     11.5   1.87   2
1o    13.6   -      -
2     10.7   1.99   -
2o    12.1   -      -
3     11.4   3.95   -
3o    12.2   3.99   -

Page size: when using Phase 4, P can vary from 256 to 4096 bytes without much effect on the final results.
Memory vs. time: results generated with low memory can be compensated for by multiple iterations of Phase 4.

Conclusions
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
Given a limited amount of main memory, BIRCH minimizes the time required for I/O.
BIRCH is a scalable clustering algorithm with respect to the number of objects, and it produces clusterings of good quality.

Exam Questions
What is the main limitation of BIRCH?
Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user would consider a natural cluster. Moreover, if the clusters are not spherical, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.

Exam Questions
Name the two algorithms in BIRCH clustering: CF-tree insertion and CF-tree rebuilding.
What is the purpose of Phase 4 in BIRCH? To do additional passes over the dataset and reassign data points to the closest centroid.

Q&A
Thank you for your patience, and good luck on the final exam!