The BIRCH Algorithm
Davitkov Miroslav, 2011/3116
Faculty of Electrical Engineering, University of Belgrade

1. BIRCH – the definition
Balanced Iterative Reducing and Clustering using Hierarchies

1. BIRCH – the definition
An unsupervised data mining algorithm used to perform hierarchical clustering over particularly large datasets.

2. Data Clustering
Cluster – a closely packed group; a collection of data objects that are similar to one another and treated collectively as a group.
Data clustering – partitioning of a dataset into clusters.

2. Data Clustering – problems
The dataset is too large to fit in main memory.
I/O operations cost the most (seek times on disk are orders of magnitude higher than RAM access times).
BIRCH offers I/O cost linear in the size of the dataset.

2. Data Clustering – other solutions
Probability-based clustering algorithms (COBWEB and CLASSIT)
Distance-based clustering algorithms (KMEANS, KMEDOIDS and CLARANS)

3. BIRCH advantages
It is local: each clustering decision is made without scanning all data points and currently existing clusters.
It exploits the observation that the data space is usually not uniformly occupied and that not every data point is equally important.
It makes full use of available memory to derive the finest possible subclusters while minimizing I/O costs.
It is an incremental method that does not require the whole dataset in advance.

4. BIRCH concepts and terminology
Hierarchical clustering

4. BIRCH concepts and terminology
Hierarchical clustering
The algorithm starts with single-point clusters (every point in the database is its own cluster). It then repeatedly merges the closest clusters, continuing until only one cluster remains. The computation relies on a distance matrix, which takes O(n²) space and O(n²) time.
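As an illustration (my sketch, not from the slides), a naive agglomerative procedure in Python; scanning all pairwise distances on every merge is exactly the O(n²) cost that motivates BIRCH:

    import math

    def agglomerative(points, k):
        """Naive agglomerative clustering: repeatedly merge the two
        closest clusters (single linkage) until k clusters remain."""
        clusters = [[p] for p in points]       # start: one cluster per point
        while len(clusters) > k:
            best = (float("inf"), 0, 1)
            for i in range(len(clusters)):     # O(n^2) pairwise scan
                for j in range(i + 1, len(clusters)):
                    d = min(math.dist(a, b)
                            for a in clusters[i] for b in clusters[j])
                    if d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i].extend(clusters.pop(j))   # merge the closest pair
        return clusters

    print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], 2))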

4. BIRCH concepts and terminology
Clustering Feature
The BIRCH algorithm builds a clustering feature tree (CF tree) while scanning the dataset. Each entry in the CF tree represents a cluster of objects and is characterized by a triple (N, LS, SS).

4. BIRCH concepts and terminology
Clustering Feature
Given N d-dimensional data points Xi (i = 1, 2, ..., N) in a cluster, the CF vector of the cluster is defined as the triple CF = (N, LS, SS):
N – the number of data points in the cluster
LS – the linear sum of the N data points (Σ Xi)
SS – the square sum of the N data points (Σ Xi²)
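A minimal sketch of a CF in Python (my illustration, not part of the slides). The key property is additivity: two CFs merge by component-wise addition, and the centroid and radius can be computed from (N, LS, SS) alone:

    import math
    from dataclasses import dataclass, field

    @dataclass
    class CF:
        """Clustering Feature: (N, LS, SS) summary of a set of points."""
        n: int = 0
        ls: list = field(default_factory=list)  # linear sum, per dimension
        ss: float = 0.0                         # sum of squared norms

        def add_point(self, x):
            if not self.ls:
                self.ls = [0.0] * len(x)
            self.n += 1
            for d, v in enumerate(x):
                self.ls[d] += v
            self.ss += sum(v * v for v in x)

        def merged(self, other):
            # CF additivity: CF1 + CF2 summarizes the union of both clusters.
            return CF(self.n + other.n,
                      [a + b for a, b in zip(self.ls, other.ls)],
                      self.ss + other.ss)

        def centroid(self):
            return [v / self.n for v in self.ls]

        def radius(self):
            # R = sqrt(SS/N - ||LS/N||^2): average distance from the centroid.
            c = self.centroid()
            return math.sqrt(max(self.ss / self.n - sum(v * v for v in c), 0.0))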

4. BIRCH concepts and terminology
CF Tree
A height-balanced tree with two parameters:
- branching factor B
- threshold T
Each non-leaf node contains at most B entries of the form [CFi, childi], where childi is a pointer to its i-th child node and CFi is the CF of the subcluster represented by this child. So a non-leaf node represents a cluster made up of all the subclusters represented by its entries.

4. BIRCH concepts and terminology
CF Tree
A leaf node contains at most L entries, each of the form [CFi], where i = 1, 2, ..., L. It also has two pointers, prev and next, which chain all leaf nodes together for efficient scans. A leaf node likewise represents a cluster made up of all the subclusters represented by its entries, but every entry in a leaf node must satisfy a threshold requirement with respect to the threshold value T: its diameter (or radius) has to be less than T.
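A structural sketch of the two node types, continuing the CF class above (the names are mine, not BIRCH's):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class NonLeafNode:
        # At most B entries; each pairs a child's summary CF with the child.
        entries: list = field(default_factory=list)   # [(CF, node), ...]

    @dataclass
    class LeafNode:
        # At most L subcluster CFs, each with radius (or diameter) < T.
        entries: list = field(default_factory=list)   # [CF, ...]
        prev: Optional["LeafNode"] = None   # leaf chain for efficient scans
        next: Optional["LeafNode"] = None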

4. BIRCH concepts and terminology
CF Tree
(figure: the structure of a CF tree)

4. BIRCH concepts and terminology
CF Tree
The tree size is a function of T (the larger T is, the smaller the tree). We require each node to fit in a page of size P; B and L are determined by P, and P can be varied for performance tuning. The result is a very compact representation of the dataset, because each entry in a leaf node is not a single data point but a subcluster.

4. BIRCH concepts and terminology
CF Tree
The leaves contain the actual clusters. The size of any cluster in a leaf is not larger than T.

5. BIRCH algorithm
An example of the CF tree
Initially, all data points are in one cluster, A, under the root.

5. BIRCH algorithm
An example of the CF tree
As data arrives, a check is made whether the size of cluster A exceeds the threshold T.

5. BIRCH algorithm
An example of the CF tree
If the cluster grows too big, it is split into two clusters, A and B, and the points are redistributed.

5. BIRCH algorithm
An example of the CF tree
At each node, the CF tree keeps information about the mean of the cluster and the mean of the sum of squares, so the size of the clusters can be computed efficiently.

5. BIRCH algorithm
Another example of CF tree insertion
(figure: a root with three leaf nodes, LN1 = {sc1, sc2, sc3}, LN2 = {sc4, sc5} and LN3 = {sc6, sc7}; a new subcluster sc8 arrives at LN1)

5. BIRCH algorithm
Another example of CF tree insertion
Since the branching factor of a leaf node cannot exceed 3, inserting sc8 causes LN1 to split into LN1' and LN1''.

5. BIRCH algorithm
Another example of CF tree insertion
Since the branching factor of a non-leaf node cannot exceed 3 either, the root is split (into NLN1 and NLN2) and the height of the CF tree increases by one.
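The insertion logic these two examples trace can be sketched as follows, building on the CF and node classes above (a simplification: real BIRCH splits by redistributing entries around the farthest pair, while this sketch just halves the entry list):

    import math

    def nearest(entries, cf, key=lambda e: e):
        """Index of the entry whose CF centroid is closest to cf's centroid."""
        return min(range(len(entries)),
                   key=lambda i: math.dist(key(entries[i]).centroid(),
                                           cf.centroid()))

    def summarize(node):
        """Recompute the CF that summarizes a whole node."""
        cfs = (node.entries if isinstance(node, LeafNode)
               else [c for c, _ in node.entries])
        total = cfs[0]
        for c in cfs[1:]:
            total = total.merged(c)
        return total

    def insert(node, cf, T, L, B):
        """Insert subcluster cf; return a new sibling node if node split."""
        if isinstance(node, LeafNode):
            if node.entries:
                i = nearest(node.entries, cf)
                merged = node.entries[i].merged(cf)
                if merged.radius() < T:        # absorb: threshold still holds
                    node.entries[i] = merged
                    return None
            node.entries.append(cf)            # otherwise, a new leaf entry
            if len(node.entries) > L:          # leaf overflow: split it
                mid = len(node.entries) // 2
                sib = LeafNode(entries=node.entries[mid:])
                node.entries = node.entries[:mid]
                sib.next, sib.prev = node.next, node   # keep the leaf chain
                if node.next:
                    node.next.prev = sib
                node.next = sib
                return sib
            return None
        i = nearest(node.entries, cf, key=lambda e: e[0])  # closest child
        child = node.entries[i][1]
        sib = insert(child, cf, T, L, B)
        node.entries[i] = (summarize(child), child)        # refresh summary
        if sib is not None:
            node.entries.append((summarize(sib), sib))
            if len(node.entries) > B:          # non-leaf overflow: split too
                mid = len(node.entries) // 2
                new = NonLeafNode(entries=node.entries[mid:])
                node.entries = node.entries[:mid]
                return new
        return None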

5. BIRCH algorithm
Phase 1: Scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk.
Phase 2: Condense the tree to a desirable size by building a smaller CF tree.
Phase 3: Global clustering.
Phase 4: Cluster refining – optional, and requires more passes over the data to refine the results.

5. BIRCH algorithm
5.1. Phase 1
Starts with an initial threshold, scans the data, and inserts points into the tree. If it runs out of memory before the scan finishes, it increases the threshold value and rebuilds a new, smaller CF tree by re-inserting the leaf entries of the old tree, then resumes scanning the data from the point of interruption. A good initial threshold is important but hard to determine. Outliers can be removed while rebuilding the tree.
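Schematically, the rebuild loop looks like this (an illustration only; the entry-count budget and the doubling of T are stand-ins for BIRCH's actual memory accounting and threshold heuristic):

    def leaf_cfs(node):
        """Collect all leaf-level subcluster CFs in a subtree."""
        if isinstance(node, LeafNode):
            return list(node.entries)
        return [c for _, child in node.entries for c in leaf_cfs(child)]

    def insert_root(root, cf, T, L, B):
        """Insert, growing a new root whenever the old one splits."""
        sib = insert(root, cf, T, L, B)
        if sib is not None:
            root = NonLeafNode(entries=[(summarize(root), root),
                                        (summarize(sib), sib)])
        return root

    def phase1(points, T, L, B, max_entries):
        """Build the initial CF tree, raising T whenever memory runs out."""
        root = LeafNode()
        for x in points:
            cf = CF()
            cf.add_point(x)
            root = insert_root(root, cf, T, L, B)
            if len(leaf_cfs(root)) > max_entries:  # "out of memory"
                T *= 2                             # heuristic threshold growth
                old = leaf_cfs(root)               # re-insert old leaf entries
                root = LeafNode()
                for c in old:
                    root = insert_root(root, c, T, L, B)
        return root, T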

5. BIRCH algorithm
5.2. Phase 2 (optional)
A preparation for Phase 3: potentially, there is a gap between the size of the Phase 1 result and the input range of Phase 3. This phase scans the leaf entries of the initial CF tree to rebuild a smaller CF tree, while removing more outliers and grouping crowded subclusters into larger ones.

5. BIRCH algorithm
5.3. Phase 3
Problems after Phase 1: the input order affects the results, and splitting is triggered by node size. Phase 3 therefore uses a global or semi-global algorithm to cluster all leaf entries: an adapted agglomerative hierarchical clustering algorithm is applied directly to the subclusters, represented by their CF vectors.
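In the running sketch, Phase 3 can be approximated by feeding the leaf subcluster centroids to the naive agglomerative routine from earlier (a simplification of the adapted algorithm, which works on the CF vectors themselves):

    def phase3(root, k):
        """Globally cluster the leaf subclusters via their centroids."""
        centroids = [tuple(cf.centroid()) for cf in leaf_cfs(root)]
        return agglomerative(centroids, k)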

5. BIRCH algorithm
5.4. Phase 4 (optional)
Additional passes over the data correct inaccuracies and refine the clusters further. This phase uses the centroids of the clusters produced by Phase 3 as seeds, and redistributes the data points to their closest seeds to obtain a set of new clusters. It converges to a minimum no matter how many times it is repeated, and offers the option of discarding outliers.
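The redistribution step is essentially a single k-means-style assignment pass (again a sketch; seeds are the Phase 3 centroids):

    def phase4(points, seeds):
        """Reassign every original data point to its nearest seed."""
        clusters = [[] for _ in seeds]
        for x in points:
            i = min(range(len(seeds)), key=lambda j: math.dist(x, seeds[j]))
            clusters[i].append(x)
        return clusters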

6. Conclusion – pros
BIRCH performs faster than existing algorithms (CLARANS and KMEANS) on large datasets.
It scans the whole dataset only once.
It handles outliers better.
It is superior to other algorithms in stability and scalability.

6. Conclusion – cons
Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.

7. References
T. Zhang, R. Ramakrishnan, M. Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases
T. Zhang, R. Ramakrishnan, M. Livny: BIRCH: A New Data Clustering Algorithm and Its Applications

Thank you for your attention! Questions?
davitkov.miroslav@gmail.com
dm113116m@student.etf.rs