Database Management Systems, R. Ramakrishnan 1 Algorithms for clustering large datasets in arbitrary metric spaces.

Introduction
- A set of 2-dimensional points is shown in the adjacent figure.
- The points clearly form three distinct groups (called clusters).
- The goal of any clustering algorithm is to find such groups in the data in order to better understand its distribution.

Introduction: What is Clustering?
Input:
  - A database of objects.
  - A distance function that captures the notion of similarity between objects.
  - The number of groups.
Goal:
  - Partition the database into the specified number of groups such that each group consists of "similar" objects.

Goals of our clustering algorithm
- Good clustering quality
- Scalability
- Use only a bounded amount of main memory

Outline
- Introduction
- The BIRCH* framework
- BIRCH for n-dimensional spaces
- BUBBLE for arbitrary metric spaces
- BUBBLE-FM: an improvement over BUBBLE
- Experimental evaluation
- Conclusions

BIRCH*: Introduction
- BIRCH* is a framework for scalable incremental clustering algorithms.
  - Its output is a set of sub-clusters, which can be analyzed further by a more expensive, domain-specific clustering algorithm.
- BIRCH* can be instantiated to yield different clustering algorithms.

BIRCH*: Incremental Algorithm
- Clusters evolve as the data is scanned.
- A current set of clusters is always maintained in memory.
- Each new object is either
  - inserted into the cluster to which it is "closest", or
  - made into a cluster of its own.
Requirements:
  - A representation for clusters.
  - A structure to search for the closest cluster.
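The incremental scheme above can be sketched in a few lines. This is a minimal illustration, not the BIRCH* implementation: the function name `incremental_cluster`, the flat list of clusters (instead of a tree), the use of the first member as the cluster's stand-in, and the fixed `threshold` are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def incremental_cluster(
    objects: Iterable[T],
    distance: Callable[[T, T], float],
    threshold: float,
) -> List[List[T]]:
    """Scan the data once; each object joins its closest cluster or starts a new one."""
    clusters: List[List[T]] = []
    for obj in objects:
        best, best_d = None, float("inf")
        for cluster in clusters:
            # Assumption: the first member stands in for the whole cluster.
            d = distance(obj, cluster[0])
            if d < best_d:
                best, best_d = cluster, d
        if best is not None and best_d <= threshold:
            best.append(obj)          # absorbed by the closest cluster
        else:
            clusters.append([obj])    # forms a cluster of its own
    return clusters
```

A linear scan over all clusters is used here only for brevity; the search structure listed under the requirements is what replaces it in the framework.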

BIRCH*: Important features
- Cluster features (CFs)
  - A condensed representation for a cluster of objects.
- CF-tree
  - A height-balanced index for CFs.
- Rebuilding algorithm
  - When the allocated amount of memory is exhausted, a smaller CF-tree is built from the old tree.

BIRCH*: Cluster Feature (CF)
- CFs are summarized representations of clusters.
- They contain sufficient information to find
  - the distance between a cluster and an object;
  - the distance between any two clusters.
- They are incrementally maintainable
  - when new objects are inserted into clusters;
  - when two clusters are merged.

BIRCH*: CF-tree
- Two parameters
  - Branching factor
  - Threshold
- Each entry contains the CF of the cluster of objects in the sub-tree beneath it.
- To insert an object, start at the root and repeatedly follow the "closest" entry downwards until a leaf node is reached.

BIRCH*: CF-tree Insertion (contd.)
- At the leaf node, the cluster closest to the object is selected.
- If the threshold criterion is satisfied, the object is absorbed into that cluster. Otherwise, it forms a new cluster on the leaf node.
- The path from the root to the leaf is updated to reflect the insertion.

BIRCH*: CF-tree Insertion (contd.)
- If there is no space on the leaf node, the node is split and its entries are redistributed based on the "closeness" criterion.
- A new entry is created at the parent to reflect the formation of the new leaf node.

BIRCH*: Rebuilding Algorithm
- If the CF-tree grows to occupy more space than it has been allocated, the threshold is increased and the CF-tree is rebuilt.
- The CFs of the leaf clusters are inserted into the new tree. The insertion algorithm is the same as for individual objects.
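The rebuild step can be illustrated with a deliberately simplified sketch. Several things here are assumptions made for the example and are not taken from the slides: clusters are kept in a flat list rather than a CF-tree, data is one-dimensional, each cluster summary is a `(count, sum)` pair, the memory budget is expressed as `max_clusters`, and the threshold is multiplied by a hypothetical `growth` factor on each rebuild. What the sketch does preserve is the key idea that the same insertion routine handles both single objects and whole cluster summaries.

```python
from typing import Iterable, List, Tuple

Summary = Tuple[int, float]  # (count, sum) of a 1-D cluster; illustrative only


def insert(clusters: List[Summary], item: Summary, threshold: float) -> None:
    """Absorb `item` into the closest cluster if within `threshold`, else add it."""
    n, s = item
    best_i, best_d = -1, float("inf")
    for i, (cn, cs) in enumerate(clusters):
        d = abs(s / n - cs / cn)  # distance between centroids
        if d < best_d:
            best_i, best_d = i, d
    if best_i >= 0 and best_d <= threshold:
        cn, cs = clusters[best_i]
        clusters[best_i] = (cn + n, cs + s)  # summaries are additive
    else:
        clusters.append(item)


def cluster_with_budget(
    values: Iterable[float],
    threshold: float,
    max_clusters: int,
    growth: float = 2.0,
) -> Tuple[List[Summary], float]:
    """Incremental clustering that rebuilds with a larger threshold when over budget."""
    if threshold <= 0 or growth <= 1 or max_clusters < 1:
        raise ValueError("need threshold > 0, growth > 1 and max_clusters >= 1")
    clusters: List[Summary] = []
    for v in values:
        insert(clusters, (1, float(v)), threshold)
        while len(clusters) > max_clusters:
            threshold *= growth              # relax the threshold
            old, clusters = clusters, []
            for summary in old:              # reinsert summaries, not raw objects
                insert(clusters, summary, threshold)
    return clusters, threshold
```

Because only the summaries are reinserted, a rebuild never rescans the original data, which is what keeps the memory footprint bounded.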

BIRCH*: Instantiation Summary
To instantiate BIRCH* we have to define:
- Cluster features at the leaf and non-leaf levels.
- Incremental maintenance of leaf-level CFs, and updates to non-leaf-level CFs, when new objects are inserted.
- Distance measures between any two CFs, to define the notion of closeness.

BIRCH*: Instantiation of BIRCH
- The CF of a cluster of n k-dimensional vectors V_1, …, V_n is defined as (n, LS, SS), where
  - n is the number of vectors;
  - LS is the sum of the vectors;
  - SS is the sum of squares of the vectors.
- CF_1 + CF_2 = (n_1 + n_2, LS_1 + LS_2, SS_1 + SS_2)
  - This additivity property is used to maintain cluster features incrementally.
- The distance between two clusters C_1 and C_2 is defined to be the distance between their centroids.
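The additivity of (n, LS, SS) is easy to see in code. The sketch below is illustrative: the class name `CF`, the helper `of`, and the choice to store SS as a single scalar (the sum of squared vector norms) are assumptions, and the `radius` method uses the standard identity R^2 = SS/n - ||LS/n||^2, which the slides do not state.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CF:
    n: int                  # number of vectors
    ls: Tuple[float, ...]   # linear sum, one entry per dimension
    ss: float               # sum of squared norms (assumed to be a scalar here)

    @staticmethod
    def of(vector: Sequence[float]) -> "CF":
        """CF of a cluster containing a single vector."""
        return CF(1, tuple(float(x) for x in vector), float(sum(x * x for x in vector)))

    def __add__(self, other: "CF") -> "CF":
        """Merging two clusters only requires adding their CFs component-wise."""
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(x / self.n for x in self.ls)

    def radius(self) -> float:
        """Root-mean-square distance of the members from the centroid."""
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))


def centroid_distance(a: CF, b: CF) -> float:
    """Distance between two clusters, taken as the distance between centroids."""
    return math.dist(a.centroid(), b.centroid())
```

Inserting a single object is the same operation as merging with `CF.of(vector)`, so one `+` covers both maintenance cases.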

Arbitrary metric space (AMS): Issues
- The only operation allowed between objects is distance computation.
  - In particular, the notion of a centroid of a set of objects does not exist.
- The distance function can be computationally very expensive, for example the edit distance between strings.

Definitions
Given a set O of objects O_1, …, O_n and a distance function d:
- The row sum of O_i is defined as
  $\mathrm{RowSum}(O_i) = \sum_{j=1}^{n} d^2(O_i, O_j)$
- The clustroid of O is the object with the least row sum value.
  - The clustroid is the counterpart of the centroid in Euclidean space.
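A direct, quadratic-time computation of the clustroid makes the definition concrete. The sketch assumes the row sum of an object is the sum of its squared distances to every object in the set; the function names are illustrative, and the `edit_distance` helper is a textbook Levenshtein implementation included only to show a metric that has no centroid.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program (row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def row_sum(obj: T, objects: Sequence[T], d: Callable[[T, T], float]) -> float:
    """Assumed definition: sum of squared distances from obj to every object."""
    return sum(d(obj, other) ** 2 for other in objects)


def clustroid(objects: Sequence[T], d: Callable[[T, T], float]) -> T:
    """The object with the least row sum (ties go to the earliest object)."""
    return min(objects, key=lambda o: row_sum(o, objects, d))
```

This exact computation needs n^2 distance calls, which is why the leaf CFs that follow keep only a small set of representative objects instead of the whole cluster.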

BUBBLE: CF
- The CF of a set O of objects O_1, …, O_n is defined as (n, O_0, SS, R, RS), where
  - n: the number of objects
  - O_0: the clustroid
  - SS: the sum of squared distances of all objects from O_0
  - R: the set of representative objects (explained later)
  - RS: the row sum values of the representative objects

BUBBLE: Non-leaf CFs
- Non-leaf CFs direct a new object to an appropriate child node.
  - They capture the distribution of objects in the sub-tree below them.
- The CF of a non-leaf entry is a set of sample objects collected at random from the sub-tree beneath that entry.

BUBBLE: Incremental Maintenance (Leaf CF)
Types of insertion:
  - Type I: insertion of a single object.
  - Type II: insertion of a cluster of objects.
- Under Type I insertion, the new clustroid lies within a bounded distance of the old clustroid. (The bound depends on the threshold of the cluster.)
- Heuristic 1: Maintain a few objects close to the clustroid.

BUBBLE: Incremental Maintenance (Leaf CF)
- Under Type II insertions, the new clustroid lies between the two old clustroids.
- Heuristic 2: Maintain a few objects farthest from the clustroid in the leaf CF.
- The objects maintained at each leaf cluster are its representative objects.

BUBBLE: Updates to Non-leaf CFs
- The sample objects at a non-leaf entry are updated whenever its child node splits.
  - The distribution of clusters changes significantly whenever a node splits.

BUBBLE: Distance measures
- The distance between two leaf-level clusters is defined to be the distance between their clustroids.
  - If C_1, C_2 are leaf clusters with clustroids O_10, O_20, then D(C_1, C_2) = d(O_10, O_20).
- The distance D(C_1, C_2) between two non-leaf-level clusters C_1, C_2 with sample objects S_1, S_2 is defined to be the average distance between S_1 and S_2.
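Both measures can be written down directly. In the sketch below the function names are illustrative, and "average distance between S_1 and S_2" is read literally as the arithmetic mean of d(a, b) over all pairs with a in S_1 and b in S_2. That reading is an assumption, since the transcript does not preserve the slide's formula.

```python
from itertools import product
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def leaf_distance(clustroid_1: T, clustroid_2: T, d: Callable[[T, T], float]) -> float:
    """Leaf level: one distance call between the two clustroids."""
    return d(clustroid_1, clustroid_2)


def nonleaf_distance(
    samples_1: Sequence[T],
    samples_2: Sequence[T],
    d: Callable[[T, T], float],
) -> float:
    """Non-leaf level: mean pairwise distance between the two sample sets (assumed form)."""
    if not samples_1 or not samples_2:
        raise ValueError("both sample sets must be non-empty")
    total = sum(d(a, b) for a, b in product(samples_1, samples_2))
    return total / (len(samples_1) * len(samples_2))
```

Note the cost difference: the leaf measure needs one distance computation, while the non-leaf measure needs |S_1| * |S_2| of them, which matters when d is expensive.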

BUBBLE-FM
- Distance functions in arbitrary metric spaces can be computationally expensive.
- Idea: Use the Euclidean distance function instead.

BUBBLE-FM: Non-leaf CF
- The sample objects S of a non-leaf entry are mapped by FastMap into a k-dimensional Euclidean image space.
- Each non-leaf CF now contains the centroid of the image vectors of its sample objects.
- New objects are mapped into the image space, where the Euclidean distance function is used.
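The slides do not spell out FastMap itself, so the following is a sketch of the commonly published formulation rather than of BUBBLE-FM's code. For each output dimension it picks two far-apart pivot objects a and b, places every object at x = (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2 d(a,b)) by the cosine law, and then continues with the residual squared distances. The function name `fastmap`, the pivot heuristic starting from index 0, and the clamping of negative residuals to zero are choices made for this example.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def fastmap(objects: Sequence[T], d: Callable[[T, T], float], k: int) -> List[List[float]]:
    """Map each object to a k-dimensional vector using only calls to d."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def dist2(i: int, j: int, dim: int) -> float:
        # Squared distance left over after the first `dim` coordinates are removed.
        d2 = d(objects[i], objects[j]) ** 2
        for c in range(dim):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return max(d2, 0.0)  # clamp: non-Euclidean metrics can go slightly negative

    for dim in range(k):
        # Pivot heuristic: start anywhere, then jump to the farthest object twice.
        a = 0
        b = max(range(n), key=lambda j: dist2(a, j, dim))
        a = max(range(n), key=lambda j: dist2(b, j, dim))
        dab2 = dist2(a, b, dim)
        if dab2 == 0.0:
            break  # nothing left to separate; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            coords[i][dim] = (dist2(a, i, dim) + dab2 - dist2(b, i, dim)) / (2 * dab)
    return coords
```

This version recomputes d on every call for clarity; an implementation aimed at reducing distance computations would cache the pivot distances. Once the sample objects have image vectors, their centroid can be computed with ordinary vector arithmetic.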

Scalability

Conclusions
- BIRCH*: a framework for scalable incremental clustering algorithms.
- An instantiation for n-dimensional spaces (BIRCH).
- An instantiation for arbitrary metric spaces (BUBBLE).
- FastMap is used to reduce the number of times the distance function is called (BUBBLE-FM).