BIRCH: An Efficient Data Clustering Method for Very Large Databases. Tian Zhang, Raghu Ramakrishnan, Miron Livny, University of Wisconsin-Madison.

Presentation transcript:

BIRCH: An Efficient Data Clustering Method for Very Large Databases. Tian Zhang, Raghu Ramakrishnan, Miron Livny, University of Wisconsin-Madison. Presented by Zhirong Tao.

Outline of the Paper
- Background
- Clustering Feature and CF Tree
- The BIRCH Clustering Algorithm
- Performance Studies

Background A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.

Background (Contd) Given N d-dimensional data points in a cluster {X_i}, i = 1, 2, …, N, the centroid X_0, radius R, and diameter D of the cluster are defined as:
X_0 = \sum_{i=1}^{N} X_i / N
R = \left( \sum_{i=1}^{N} (X_i - X_0)^2 / N \right)^{1/2}
D = \left( \sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2 / (N(N-1)) \right)^{1/2}
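As a quick illustration, here is a minimal NumPy sketch of these three definitions (the function name and the sample points are my own, not from the paper):

```python
import numpy as np

def cluster_stats(X):
    """Centroid X0, radius R, and diameter D of a cluster of N d-dimensional points.

    X is an (N, d) array; the formulas follow the definitions on this slide.
    """
    N = X.shape[0]
    x0 = X.sum(axis=0) / N                             # centroid
    R = np.sqrt(((X - x0) ** 2).sum() / N)             # average distance to the centroid
    diffs = X[:, None, :] - X[None, :, :]              # all pairwise differences
    D = np.sqrt((diffs ** 2).sum() / (N * (N - 1)))    # average pairwise distance
    return x0, R, D

# Made-up points, purely for illustration
print(cluster_stats(np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])))
```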

Background (Contd) Given the centroids of two clusters, X_{01} and X_{02}, the centroid Euclidean distance D0 and the centroid Manhattan distance D1 are defined as:
D_0 = \left( (X_{01} - X_{02})^2 \right)^{1/2}
D_1 = |X_{01} - X_{02}| = \sum_{i=1}^{d} |X_{01}^{(i)} - X_{02}^{(i)}|
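A correspondingly small sketch of the two centroid distances (the centroids below are made up):

```python
import numpy as np

def centroid_distances(x01, x02):
    """Centroid Euclidean distance D0 and centroid Manhattan distance D1."""
    d0 = np.sqrt(((x01 - x02) ** 2).sum())   # D0
    d1 = np.abs(x01 - x02).sum()             # D1
    return d0, d1

# Example centroids: D0 = 5.0, D1 = 7.0
print(centroid_distances(np.array([0.0, 0.0]), np.array([3.0, 4.0])))
```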

BIRCH: Hierarchical Method. A distance-based approach: assume there is a distance measure between any two instances, and represent clusters by some kind of 'center' measure. A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence.

Clustering Feature Definition. Given N d-dimensional data points in a cluster {X_i}, i = 1, 2, …, N, the Clustering Feature is the triple CF = (N, LS, SS), where N is the number of data points in the cluster, LS = \sum_{i=1}^{N} X_i is the linear sum of the N data points, and SS = \sum_{i=1}^{N} X_i^2 is the square sum of the N data points.

CF Additive Theorem. Assume that CF_1 = (N_1, LS_1, SS_1) and CF_2 = (N_2, LS_2, SS_2) are the CF entries of two disjoint subclusters. The CF entry of the subcluster formed by merging the two disjoint subclusters is CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2). The CF entries can therefore be stored and calculated incrementally and consistently as subclusters are merged or new data points are inserted.
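The additivity theorem is what lets BIRCH keep only CF triples instead of raw points. Below is a minimal Python sketch of a CF with merging, and of how the centroid and radius can be recovered from (N, LS, SS) alone; the class layout and the choice of storing SS as a scalar sum of squared norms are my own assumptions, not the authors' code:

```python
import numpy as np

class CF:
    """Clustering Feature: CF = (N, LS, SS).

    LS is the per-dimension linear sum; SS is kept here as the scalar sum of
    squared norms (some implementations store it per dimension instead).
    """
    def __init__(self, N=0, LS=None, SS=0.0, d=2):
        self.N = N
        self.LS = np.zeros(d) if LS is None else LS
        self.SS = SS

    @classmethod
    def from_point(cls, x):
        x = np.asarray(x, dtype=float)
        return cls(N=1, LS=x.copy(), SS=float(x @ x), d=len(x))

    def merge(self, other):
        # CF additive theorem: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)
        return CF(self.N + other.N, self.LS + other.LS, self.SS + other.SS, d=len(self.LS))

    # Statistics recoverable from the CF alone, no raw points needed
    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, derived from the radius definition above
        c = self.centroid()
        return np.sqrt(max(self.SS / self.N - float(c @ c), 0.0))

# Usage: absorbing points one at a time is just repeated merging
cf = CF.from_point([1.0, 2.0]).merge(CF.from_point([3.0, 4.0]))
print(cf.N, cf.LS, cf.SS, cf.centroid(), cf.radius())
```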

CF-Tree. A CF-tree is a height-balanced tree with two parameters: the branching factor (B for nonleaf nodes and L for leaf nodes) and the threshold T. Each entry in a nonleaf node has the form [CF_i, child_i]. Each entry in a leaf node is a CF, and each leaf node has two pointers, 'prev' and 'next', which chain the leaf nodes together. Threshold T: the diameter (alternatively, the radius) of each leaf entry has to be less than T.
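For concreteness, a minimal sketch of the node layout this describes (field names are assumptions, not the authors' implementation):

```python
class LeafNode:
    def __init__(self, L, T):
        self.entries = []    # at most L leaf entries, each a CF whose diameter < T
        self.prev = None     # leaf nodes are chained via 'prev'/'next' pointers
        self.next = None
        self.L = L
        self.T = T

class NonLeafNode:
    def __init__(self, B):
        self.entries = []    # at most B pairs [CF_i, child_i]; CF_i summarizes child_i's subtree
        self.B = B
```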

BIRCH Algorithm Overview. BIRCH proceeds in four phases: Phase 1 scans the data and builds an initial in-memory CF tree; Phase 2 (optional) condenses the tree into a smaller CF tree; Phase 3 applies a global clustering algorithm to the leaf entries; Phase 4 (optional) refines the clusters with additional passes over the data.

Phase 1: scan all the data and build an initial in-memory CF tree using the given amount of memory and recycling space on disk. If memory runs out, increase the threshold T and rebuild a smaller tree by re-inserting the existing leaf entries.

Insertion Algorithm (inserting an entry 'Ent')
- Identify the appropriate leaf: descend the tree, at each level following the child whose CF is closest to 'Ent'.
- Modify the leaf: find the closest leaf entry, say L_i. If L_i can 'absorb' 'Ent' without violating the threshold T, merge them; otherwise add a new entry for 'Ent' to the leaf, and if the leaf then holds more than L entries, split the leaf node.
- Modify the path to the leaf: update the CF entries on the path. If a split occurred and the parent has space for the new entry, insert it; otherwise split the parent, and so on up to the root.
A sketch of the leaf-level step appears below.
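Here is a simplified, self-contained sketch of the leaf-level absorb-or-add step with a split signal; the helper names, the use of the centroid distance D0 to pick the closest entry, and the diameter test are illustrative choices, not the authors' exact code:

```python
import numpy as np

def cf_of(x):
    x = np.asarray(x, float)
    return {'N': 1, 'LS': x.copy(), 'SS': float(x @ x)}

def merge(a, b):
    # CF additive theorem
    return {'N': a['N'] + b['N'], 'LS': a['LS'] + b['LS'], 'SS': a['SS'] + b['SS']}

def diameter(cf):
    # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)), derived from the definition of D
    n, ls, ss = cf['N'], cf['LS'], cf['SS']
    if n < 2:
        return 0.0
    return np.sqrt(max((2 * n * ss - 2 * float(ls @ ls)) / (n * (n - 1)), 0.0))

def insert_into_leaf(leaf, x, L=4, T=1.5):
    """Absorb x into the closest leaf entry if the merged diameter stays below T,
    otherwise add a new entry; signal a split if the leaf exceeds L entries."""
    ent = cf_of(x)
    if leaf:
        # closest leaf entry by centroid distance (D0)
        closest = min(leaf, key=lambda cf: np.linalg.norm(cf['LS'] / cf['N'] - np.asarray(x)))
        if diameter(merge(closest, ent)) < T:     # L_i can 'absorb' Ent
            closest.update(merge(closest, ent))
            return 'absorbed'
    leaf.append(ent)                              # add a new entry for Ent
    return 'split needed' if len(leaf) > L else 'added'

leaf = []
for p in [[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9], [10, 10]]:
    print(p, insert_into_leaf(leaf, p, L=2, T=1.5))
```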

Phase 3: Global Clustering. Use an existing global or semi-global algorithm to cluster all the leaf entries across the boundaries of different nodes. This overcomes Anomaly 1: depending on the order of data input and the degree of skew, subclusters that belong in one cluster may end up in different nodes, and subclusters that should not be in one cluster may be kept in the same node.
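One way to realize this phase, sketched with scikit-learn purely for illustration: treat each leaf entry's centroid as a point weighted by its N and run an ordinary clustering algorithm over those points. The paper describes using an agglomerative hierarchical algorithm here; k-means is used below only as an example of a "global" algorithm, and the leaf entries are made up:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical leaf entries as (N, LS) pairs; each centroid is LS / N
leaf_entries = [(3, np.array([3.0, 3.3])),
                (2, np.array([10.2, 10.0])),
                (4, np.array([4.4, 4.0]))]
centroids = np.array([ls / n for n, ls in leaf_entries])
weights = np.array([n for n, _ in leaf_entries], dtype=float)

# Cluster the leaf-entry centroids, weighting each by the number of points it summarizes
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(centroids, sample_weight=weights)
print(labels)  # leaf entries grouped across node boundaries
```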

Comparison of BIRCH and CLARANS with synthetically generated datasets: in the paper's experiments, BIRCH produced better-quality clusters than CLARANS, ran much faster, and was far less sensitive to the input order of the data.

Summary. Compared with previous distance-based approaches (e.g., K-Means and CLARANS), BIRCH is appropriate for very large datasets. BIRCH can work with any given amount of memory, and its I/O complexity is little more than one scan of the data.