Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10


What is Data Clustering? A cluster is a closely packed collection of data objects that are similar to one another and can be treated collectively as a group. Data clustering is the partitioning of a dataset into such clusters.

Data Clustering Helps understand the natural grouping or structure in a dataset. Given a large set of multidimensional data: – the data space is usually not uniformly occupied – clustering identifies the sparse and the crowded regions – it also helps visualization

Some Clustering Applications Biology – building groups of genes with related patterns. Marketing – partitioning the population of consumers into market segments. Web – division of WWW pages into genres. Image segmentation – for object recognition. Land use – identification of areas of similar land use from satellite images.

Clustering Problems Today many datasets are too large to fit into main memory. The dominating cost of any clustering algorithm is then I/O, because seek times on disk are orders of magnitude higher than RAM access times.

Previous Work Two classes of clustering algorithms: Probability-based – examples: COBWEB and CLASSIT. Distance-based – examples: KMEANS, KMEDOIDS, and CLARANS.

Previous Work: COBWEB Takes a probabilistic approach to its decisions; clusters are represented with probabilistic descriptions. Probabilistic representations of clusters are expensive to maintain. Every instance (data point) translates into a terminal node in the hierarchy, so large hierarchies tend to overfit the data.

Previous Work: KMeans A distance-based approach, so there must be a distance measure between any two instances. Sensitive to instance order. Instances must be stored in memory and all must be initially available. May have exponential run time.

Previous Work: CLARANS Also a distance-based approach, so there must be a distance measure between any two instances. The computational complexity of CLARANS is about O(n²). Sensitive to instance order. Ignores the fact that not all data points in the dataset are equally important.

Contributions of BIRCH Each clustering decision is made without scanning all data points. BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes. BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency).

Background Knowledge (1) Given a cluster of N instances {x_i}, i = 1, …, N, we define:
Centroid: x0 = (Σ_{i=1}^{N} x_i) / N
Radius: R = ( Σ_{i=1}^{N} ||x_i − x0||² / N )^{1/2} – the average distance from member points to the centroid
Diameter: D = ( Σ_{i=1}^{N} Σ_{j=1}^{N} ||x_i − x_j||² / (N(N−1)) )^{1/2} – the average pairwise distance within the cluster
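The three quantities above can be computed directly from the raw points. A minimal sketch (the function names are illustrative, not from the paper):

```python
import math

def centroid(points):
    """Centroid x0: the component-wise mean of the points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def radius(points):
    """R: RMS distance from the member points to the centroid."""
    n, x0 = len(points), centroid(points)
    return math.sqrt(sum(sum((p[d] - x0[d]) ** 2 for d in range(len(p)))
                         for p in points) / n)

def diameter(points):
    """D: RMS pairwise distance within the cluster."""
    n = len(points)
    total = sum(sum((p[d] - q[d]) ** 2 for d in range(len(p)))
                for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(centroid(pts))  # (3.2, 6.0)
print(radius(pts))    # ≈ 1.6
```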

Background Knowledge (2) Distance measures between two clusters (used to decide which entry is closest):
Centroid Euclidean distance: D0 = ||x0_1 − x0_2||
Centroid Manhattan distance: D1 = Σ_d |x0_1(d) − x0_2(d)|
Average inter-cluster distance: D2 = ( Σ_i Σ_j ||x_i − y_j||² / (N1·N2) )^{1/2}
Average intra-cluster distance: D3 = ( Σ_i Σ_j ||z_i − z_j||² / ((N1+N2)(N1+N2−1)) )^{1/2}, computed over the merged cluster
Variance increase distance: D4 = the increase in total variance caused by merging the two clusters
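The key point is that these distances never require the raw points: they can be computed from cluster summaries alone. A sketch of D0 and D2, assuming each summary is a tuple (N, LS, SS) with LS and SS stored per dimension (the function names are illustrative):

```python
import math

def d0(cf1, cf2):
    """D0: Euclidean distance between the two centroids LS/N."""
    n1, ls1, _ = cf1
    n2, ls2, _ = cf2
    return math.sqrt(sum((a / n1 - b / n2) ** 2 for a, b in zip(ls1, ls2)))

def d2(cf1, cf2):
    """D2: average inter-cluster distance, computed from summaries only,
    using sum_ij ||x_i - y_j||^2 = N2*SS1 + N1*SS2 - 2*<LS1, LS2>."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    cross = sum(a * b for a, b in zip(ls1, ls2))
    return math.sqrt((n2 * sum(ss1) + n1 * sum(ss2) - 2 * cross) / (n1 * n2))

cf_a = (2, (2, 0), (4, 0))   # summary of the points (0,0) and (2,0)
cf_b = (1, (5, 0), (25, 0))  # summary of the point (5,0)
print(d0(cf_a, cf_b))        # 4.0
```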

Clustering Features (CF) The BIRCH algorithm builds a height-balanced tree called the clustering feature tree (CF tree) while scanning the dataset. Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined on the next slide.

Clustering Feature (CF) Given N d-dimensional data points in a cluster {x_i}, i = 1, 2, …, N: CF = (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the N data points, and SS is the square sum of the N data points.
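A minimal sketch of building a CF entry from raw points, with LS and SS kept per dimension (the function name is illustrative):

```python
def cf_from_points(points):
    """Build the clustering feature (N, LS, SS) for a set of
    d-dimensional points; LS and SS are per-dimension sums."""
    dims = len(points[0])
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(dims))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(dims))
    return (n, ls, ss)

print(cf_from_points([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# (5, (16, 30), (54, 190))
```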

CF Additivity Theorem (1) If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

CF Additivity Theorem (2) Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) form a cluster with CF = (5, (16,30), (54,190)).
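The additivity theorem can be checked on this example by splitting the five points into two disjoint sub-clusters and summing their CFs component-wise (a sketch; the function name is illustrative):

```python
def cf_merge(cf1, cf2):
    """Additivity: the CF of the union of two disjoint sub-clusters
    is the component-wise sum of the two CF entries."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

cf_a = (3, (9, 15), (29, 77))   # CF of (3,4), (2,6), (4,5)
cf_b = (2, (7, 15), (25, 113))  # CF of (4,7), (3,8)
print(cf_merge(cf_a, cf_b))     # (5, (16, 30), (54, 190))
```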

Properties of CF-Tree Each non-leaf node has at most B entries. Each leaf node has at most L CF entries, each of which satisfies threshold T. Node size is determined by the dimensionality of the data space and the input parameter P (page size).

CF Tree Insertion Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric. Modifying the leaf: test whether the leaf can absorb the new entry without violating the threshold; if there is no room, split the node. Modifying the path: update the CF information up the path.
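The leaf-level step above – absorb the point into the closest entry if the merged radius stays within T, otherwise start a new entry – can be sketched on a flat list of leaf entries. This is a simplification: a real CF tree also splits full leaves, maintains non-leaf entries, and links leaves together, all of which this sketch omits; the function names are illustrative.

```python
import math

def cf_merge(cf1, cf2):
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

def cf_radius(cf):
    """Radius from the CF alone: R^2 = SS/N - ||LS/N||^2."""
    n, ls, ss = cf
    return math.sqrt(max(0.0, sum(ss) / n - sum((c / n) ** 2 for c in ls)))

def centroid_dist(cf1, cf2):
    n1, ls1, _ = cf1
    n2, ls2, _ = cf2
    return math.sqrt(sum((a / n1 - b / n2) ** 2 for a, b in zip(ls1, ls2)))

def insert_point(leaf_entries, point, threshold):
    """Absorb the point into the closest entry if the merged radius
    stays within the threshold T; otherwise start a new subcluster."""
    cf = (1, point, tuple(c * c for c in point))
    if leaf_entries:
        i = min(range(len(leaf_entries)),
                key=lambda j: centroid_dist(leaf_entries[j], cf))
        merged = cf_merge(leaf_entries[i], cf)
        if cf_radius(merged) <= threshold:
            leaf_entries[i] = merged
            return
    leaf_entries.append(cf)  # no entry can absorb it: new subcluster

leaf = []
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:
    insert_point(leaf, p, threshold=1.0)
```

With T = 1.0 the five example points end up in three leaf entries; a larger T would let more points be absorbed into fewer subclusters.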

Example of the BIRCH Algorithm (figure): a CF tree whose root points to leaf nodes LN1, LN2, and LN3, which hold the subclusters sc1–sc7; sc8 is a new subcluster being inserted.

Merge Operation in BIRCH (figure): if the branching factor of a leaf node cannot exceed 3, then LN1 is split into LN1' and LN1''.

Merge Operation in BIRCH (figure): if the branching factor of a non-leaf node cannot exceed 3, then the root is split into NLN1 and NLN2, and the height of the CF tree increases by one.

Merge Operation in BIRCH (figure): a root with leaf nodes LN1 and LN2 holding the subclusters sc1–sc6; assume that the subclusters are numbered according to the order of formation.

Merge Operation in BIRCH (figure): if the branching factor of a leaf node cannot exceed 3, then LN2 is split into LN2' and LN2''.

Merge Operation in BIRCH (figure): LN2' and LN1 will be merged, and the newly formed node will be split immediately into LN3' and LN3''.

Birch Clustering Algorithm (1) Phase 1: scan all data and build an initial in-memory CF tree. Phase 2: condense the tree to a desirable size by building a smaller CF tree. Phase 3: global clustering. Phase 4: cluster refining – optional; requires more passes over the data to refine the results.

Birch Clustering Algorithm (2) (figure: overview diagram of the four BIRCH phases)

Birch – Phase 1 Start with an initial threshold and insert points into the tree. If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the entries of the old tree, and then the remaining values. A good initial threshold is important but hard to figure out. Outlier removal – outliers are removed when the tree is rebuilt.

Birch - Phase 2 Optional. The phase 3 algorithm sometimes has an input size at which it performs well, so phase 2 prepares the tree for phase 3. BIRCH applies a (selected) clustering algorithm to cluster the leaf entries of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.

Birch – Phase 3 Problems after phase 1: – input order affects results – splitting is triggered by node size. Phase 3: – cluster all leaf entries on their CF values using an existing algorithm – the algorithm used here: agglomerative hierarchical clustering.

Birch – Phase 4 Optional. Performs additional passes over the dataset, reassigning each data point to the closest centroid found in phase 3, then recalculating the centroids and redistributing the items. Always converges, no matter how many times phase 4 is repeated.
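One refinement pass as described above – assign each point to its closest centroid, then recompute the centroids from the assignments – can be sketched as follows (the function name and data are illustrative):

```python
import math

def refine(points, centroids, passes=2):
    """Phase-4-style refinement: reassign every point to its closest
    centroid, recompute the centroids, and repeat."""
    for _ in range(passes):
        groups = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda j: math.dist(p, centroids[j]))
            groups[i].append(p)
        centroids = [
            tuple(sum(c[d] for c in g) / len(g) for d in range(len(g[0])))
            if g else centroids[i]          # keep an empty cluster's centroid
            for i, g in enumerate(groups)
        ]
    return centroids

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(refine(pts, [(1, 1), (9, 9)]))  # [(0.0, 0.5), (10.0, 10.5)]
```

Note that this is exactly a k-means-style iteration seeded with the phase-3 centroids, which is why the refinement always converges.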

Conclusions (1) BIRCH performs faster than existing algorithms (CLARANS and KMEANS) on large datasets. It scans the whole dataset only once, handles outliers better, and is superior to other algorithms in stability and scalability.

Conclusions (2) Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.

References
T. Zhang, R. Ramakrishnan, M. Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD '96.
J. Oberst: Efficient Data Clustering and How to Groom Fast-Growing Trees.
P.-N. Tan, M. Steinbach, V. Kumar: Introduction to Data Mining.