Database Management Systems, R. Ramakrishnan

Algorithms for clustering large datasets in arbitrary metric spaces

Introduction
• A set of 2-dimensional points is shown alongside.
• They clearly form three distinct groups (called clusters).
• The goal of any clustering algorithm is to find such groups in the data, to better understand its distribution.

Introduction: What is Clustering?
Input:
  – A database of objects.
  – A distance function that captures the notion of similarity between objects.
  – The number of groups.
Goal:
  – Partition the database into the specified number of groups such that each group consists of “similar” objects.

Goals of our clustering algorithm
• Good clustering quality
• Scalability
• Use only a bounded amount of main memory

Outline
• Introduction
• The BIRCH* framework
• BIRCH for n-dimensional spaces
• BUBBLE for arbitrary metric spaces
• BUBBLE-FM: an improvement over BUBBLE
• Experimental evaluation
• Conclusions

BIRCH*: Introduction
• BIRCH* is a framework for scalable incremental clustering algorithms.
  – The output is a set of sub-clusters, which can be analyzed further by a more expensive domain-specific clustering algorithm.
• BIRCH* can be instantiated to yield different clustering algorithms.

BIRCH*: Incremental Algorithm
• Clusters evolve as the data is scanned.
• A current set of clusters is always maintained in memory.
• Each new object either
  – is inserted into the cluster to which it is “closest”, or
  – forms a cluster of its own.
Requirements:
  – A representation for clusters.
  – A structure to search for the closest cluster.
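The incremental scan described above can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not from the slides: clusters are kept in a flat list rather than a tree, objects are numeric tuples, a cluster is summarized by a count and a coordinate sum, and "closest" means the smallest Euclidean distance to the cluster centroid. The names `incremental_cluster` and `threshold` are hypothetical.

```python
from math import sqrt

def euclid(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def incremental_cluster(points, threshold):
    """Single scan: absorb each point into the closest cluster if that
    cluster's centroid is within `threshold`, else start a new cluster.
    A flat list stands in for the search structure (assumption)."""
    clusters = []  # each cluster: {"n": object count, "sum": coordinate sums}
    for p in points:
        best, best_d = None, float("inf")
        for c in clusters:
            centroid = [s / c["n"] for s in c["sum"]]
            d = euclid(p, centroid)
            if d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= threshold:
            # Absorb: the summary is updated without storing the point.
            best["n"] += 1
            best["sum"] = [s + x for s, x in zip(best["sum"], p)]
        else:
            clusters.append({"n": 1, "sum": list(p)})
    return clusters

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
clusters = incremental_cluster(points, threshold=1.0)
print([c["n"] for c in clusters])
```

The linear search over all clusters is what the second requirement (a structure to search for the closest cluster) is meant to replace.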

BIRCH*: Important features
• Cluster features (CF)
  – A condensed representation for a cluster of objects.
• CF-tree
  – A height-balanced index for CFs.
• Rebuilding algorithm
  – When the allocated amount of memory is exhausted, a smaller CF-tree is built from the old tree.

BIRCH*: Cluster Feature (CF)
• CFs are summarized representations of clusters.
• They contain sufficient information to find
  – the distance between a cluster and an object.
  – the distance between any two clusters.
• They are incrementally maintainable
  – when new objects are inserted into clusters.
  – when two clusters are merged.

BIRCH*: CF-tree
• Two parameters
  – Branching factor
  – Threshold
• Each entry contains the CF of the cluster of objects in the sub-tree beneath it.
• To insert an object, start at the root and repeatedly descend through the “closest” entry until a leaf node is reached.

BIRCH*: CF-tree Insertion (contd)
• At the leaf node, the closest cluster is selected to insert the object.
• If the threshold criterion is satisfied, the object is absorbed into the cluster. Otherwise, it forms a new cluster on the leaf node.
• The path from the root to the leaf is updated to reflect the insertion.

BIRCH*: CF-tree Insertion (contd)
• If there is no space on the leaf node, it is split and its entries are redistributed based on the “closeness” criterion.
• A new entry is created at its parent to reflect the formation of a new leaf node.

BIRCH*: Rebuilding Algorithm
• If the CF-tree grows to occupy more space than it is allocated, the threshold is increased and the CF-tree is rebuilt.
• The CFs of the leaf clusters are inserted into the new tree. The insertion algorithm is the same as for individual objects.
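The rebuild step can be illustrated with a deliberately small sketch. Everything specific here is an assumption made for the example and not taken from the slides: values are 1-dimensional, a flat list of `[n, sum]` summaries stands in for the CF-tree, the memory limit is expressed as a maximum number of clusters (`max_clusters`), the threshold test compares centroid distance, and the threshold is multiplied by a hypothetical `growth` factor on each rebuild. The point it shows is that existing cluster summaries are re-inserted with the same routine used for single objects.

```python
def insert(clusters, n, total, threshold):
    """Insert a summary of n values with sum `total`; a single value x
    is simply the summary (1, x). Clusters are [n, sum] lists."""
    centroid = total / n
    best = min(clusters, key=lambda c: abs(c[1] / c[0] - centroid), default=None)
    if best is not None and abs(best[1] / best[0] - centroid) <= threshold:
        best[0] += n          # absorb into the closest cluster
        best[1] += total
    else:
        clusters.append([n, total])  # start a new cluster

def cluster_with_budget(values, threshold, max_clusters, growth=2.0):
    """Scan `values` once; whenever the cluster count exceeds the budget,
    raise the threshold and rebuild from the existing summaries."""
    clusters = []
    for x in values:
        insert(clusters, 1, x, threshold)
        while len(clusters) > max_clusters:
            threshold *= growth              # assumed growth policy
            old, clusters = clusters, []
            for n, total in old:             # same insertion routine
                insert(clusters, n, total, threshold)
    return clusters, threshold

clusters, final_threshold = cluster_with_budget(
    [0.0, 1.0, 2.0, 10.0, 11.0], threshold=0.5, max_clusters=2)
print(clusters, final_threshold)
```

Because only summaries are re-inserted, the rebuild never needs to rescan the original data.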

BIRCH*: Instantiation Summary
To instantiate BIRCH* we have to define:
• Cluster features at the leaf and non-leaf levels.
• Incremental maintenance of leaf-level CFs, and updates to non-leaf-level CFs, when new objects are inserted.
• Distance measures between any two CFs, to define the notion of closeness.

BIRCH*: Instantiation of BIRCH
• The CF of a cluster of n k-dimensional vectors V_1, …, V_n is defined as (n, LS, SS):
  – n is the number of vectors
  – LS is the sum of the vectors
  – SS is the sum of squares of the vectors
• CF_1 + CF_2 = (n_1 + n_2, LS_1 + LS_2, SS_1 + SS_2)
  – This additivity property is used to maintain cluster features incrementally.
• The distance between two clusters C_1 and C_2 is defined to be the distance between their centroids.
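A minimal sketch of the (n, LS, SS) triple may make the additivity property concrete. It assumes SS is stored as a single scalar (the sum of squared vector norms), which is one reading of "sum of squares of vectors"; the class and method names (`CF`, `of`, `centroid`, `radius`, `centroid_distance`) are hypothetical. The `radius` method is included only to show that a spread measure can be derived from the summary alone.

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class CF:
    n: int       # number of vectors
    ls: tuple    # per-coordinate linear sum
    ss: float    # sum of squared norms (scalar; an assumption)

    @classmethod
    def of(cls, v):
        """CF of a single vector."""
        return cls(1, tuple(v), sum(x * x for x in v))

    def __add__(self, other):
        # Additivity: merging two clusters adds the summaries componentwise.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(s / self.n for s in self.ls)

    def radius(self):
        # Root mean squared distance to the centroid: SS/n - |centroid|^2.
        c2 = sum(c * c for c in self.centroid())
        return sqrt(max(self.ss / self.n - c2, 0.0))

def centroid_distance(a, b):
    """Distance between two clusters, taken between their centroids."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a.centroid(), b.centroid())))

c1 = CF.of((0.0, 0.0)) + CF.of((2.0, 0.0))
c2 = CF.of((4.0, 3.0)) + CF.of((6.0, 3.0))
print(c1.centroid(), c1.radius(), centroid_distance(c1, c2))
```

Inserting one object is the special case `cf + CF.of(v)`, so a single operation covers both insertion and merging.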

Arbitrary metric space (AMS): Issues
• The only operation allowed between objects is distance computation.
  – Specifically, the notion of a centroid of a set of objects does not exist.
• The distance function can be computationally very expensive, e.g., the edit distance between strings.

Definitions
Given a set O of objects O_1, …, O_n:
• The row sum of O_i is defined as
    RowSum(O_i) = \sum_{j=1}^{n} d^2(O_i, O_j)
• The clustroid of O is the object with the least row sum value.
  – The clustroid is a concept parallel to that of the centroid in Euclidean space.
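As a small worked example of these definitions, the sketch below finds the clustroid of a handful of strings under edit distance. It takes the row sum of an object to be the sum of its squared distances to all objects in the set, which is an assumption of this sketch, and the function names are hypothetical. It computes every pairwise distance, so it is only meant for tiny inputs.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def row_sum(obj, objects, d):
    """Sum of squared distances from obj to every object in the set."""
    return sum(d(obj, other) ** 2 for other in objects)

def clustroid(objects, d):
    """Object with the smallest row sum (first one wins on ties)."""
    return min(objects, key=lambda o: row_sum(o, objects, d))

words = ["cat", "cart", "bat", "cast"]
for w in words:
    print(w, row_sum(w, words, edit_distance))
print("clustroid:", clustroid(words, edit_distance))
```

Only the distance function is used, never coordinates, which is what makes the clustroid usable where a centroid is not defined.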

BUBBLE: CF
• The CF of a set O of objects O_1, …, O_n is defined as (n, O_0, SS, R, RS):
  – n: number of objects
  – O_0: clustroid
  – SS: sum of squared distances of all objects from O_0
  – R: set of representative objects (explained later)
  – RS: row sum values of the representative objects

BUBBLE: Non-leaf CFs
• Non-leaf CFs direct a new object to an appropriate child node.
  – They capture the distribution of objects in the sub-tree below them.
• A set of sample objects randomly collected from the sub-tree at a non-leaf entry forms its CF.

BUBBLE: Incremental Maintenance (Leaf CF)
Types of insertion:
  – Type I: insertion of a single object.
  – Type II: insertion of a cluster of objects.
• Under Type I insertion, the location of the new clustroid is within a bounded distance of the old clustroid. (The bound depends on the threshold of the cluster.)
• Heuristic 1: Maintain a few objects close to the clustroid.

BUBBLE: Incremental Maintenance (Leaf CF)
• Under Type II insertions, the location of the new clustroid is between the two old clustroids.
• Heuristic 2: Maintain a few objects farthest from the clustroid in the leaf CF.
• The objects maintained at each leaf cluster are its representative objects.

BUBBLE: Updates to Non-leaf CFs
• The sample objects at a non-leaf entry are updated whenever its child node splits.
  – The distribution of clusters changes significantly whenever a node splits.

BUBBLE: Distance measures
• The distance between two leaf-level clusters is defined to be the distance between their clustroids.
  – If C_1, C_2 are leaf clusters with clustroids O_{10}, O_{20}, then D(C_1, C_2) = d(O_{10}, O_{20})
• The distance between two non-leaf-level clusters C_1, C_2 with sample objects S_1, S_2 is defined to be the average distance between S_1 and S_2.
  – D(C_1, C_2) =
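The two measures can be sketched as follows. The leaf-level function follows the definition directly. The non-leaf function reads "average distance between S_1 and S_2" as the arithmetic mean of d over all cross pairs, which is an assumption of this sketch; other averages (for example a root mean square) would fit the same wording. Function names are hypothetical, and absolute difference on numbers stands in for a real metric.

```python
def leaf_distance(clustroid1, clustroid2, d):
    """Distance between two leaf clusters: distance between clustroids."""
    return d(clustroid1, clustroid2)

def nonleaf_distance(samples1, samples2, d):
    """Mean of d(a, b) over all pairs with a in samples1, b in samples2
    (the plain arithmetic mean is an assumed reading of 'average')."""
    total = sum(d(a, b) for a in samples1 for b in samples2)
    return total / (len(samples1) * len(samples2))

def absdiff(x, y):
    # Toy metric on numbers, used only for illustration.
    return abs(x - y)

print(leaf_distance(1.0, 11.0, absdiff))
print(nonleaf_distance([0.0, 2.0], [10.0, 12.0], absdiff))
```

The non-leaf measure costs one distance call per pair of samples, so its cost grows with the product of the two sample-set sizes.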

BUBBLE-FM
• Distance functions in arbitrary metric spaces can be computationally expensive.
• Idea: Use the Euclidean distance function instead.

BUBBLE-FM: Non-leaf CF
• Map the sample objects S at a non-leaf entry into a k-dimensional Euclidean image space using FastMap.
• Each non-leaf CF now contains the centroid of the image vectors of its sample objects.
• New objects are mapped into the image space, and the Euclidean distance function is used there.
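For readers unfamiliar with FastMap, here is a compact sketch of its core projection step. Each coordinate is obtained from two pivot objects a and b by the cosine-law formula x = (d(a,o)^2 + d(a,b)^2 - d(b,o)^2) / (2 d(a,b)), and later coordinates use the distances left over after removing the earlier ones. The pivot choice below (farthest from the first object, then farthest from that) is a simple heuristic chosen for the example, and the function name `fastmap` is hypothetical. Mapping a new object with stored pivots, as BUBBLE-FM requires, is not shown.

```python
from math import sqrt

def fastmap(objects, d, k):
    """Map each object to a k-dimensional vector using only the metric d.
    Returns a list of coordinate lists, one per object."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def dist2(i, j, dim):
        # Squared distance with the first `dim` coordinates projected out.
        r = d(objects[i], objects[j]) ** 2
        for t in range(dim):
            r -= (coords[i][t] - coords[j][t]) ** 2
        return max(r, 0.0)

    for dim in range(k):
        # Pivot heuristic (assumption): two far-apart objects.
        a = max(range(n), key=lambda i: dist2(0, i, dim))
        b = max(range(n), key=lambda i: dist2(a, i, dim))
        dab2 = dist2(a, b, dim)
        if dab2 == 0.0:
            break  # nothing left to separate; remaining coordinates stay 0
        for i in range(n):
            coords[i][dim] = (dist2(a, i, dim) + dab2
                              - dist2(b, i, dim)) / (2 * sqrt(dab2))
    return coords

def absdiff(x, y):
    # Toy metric on numbers, used only for illustration.
    return abs(x - y)

image = fastmap([0.0, 3.0, 10.0], absdiff, k=1)
print(image)
```

After the mapping, distances between image vectors are ordinary Euclidean distances, so the expensive metric is called only while building the image space.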

Scalability

Conclusions
• The BIRCH* framework for scalable incremental clustering algorithms.
• An instantiation for n-dimensional spaces (BIRCH).
• An instantiation for arbitrary metric spaces (BUBBLE).
• FastMap to reduce the number of times the distance function is called (BUBBLE-FM).