Clustering Large Datasets in Arbitrary Metric Space
by Muralikrishna Achari
COSC6341 Information Retrieval Project Presentation
Contents
- Introduction to Clustering
- Problems in Traditional Clustering
- Clustering Large Datasets
- BIRCH*
- BUBBLE
- BUBBLE-FM
- Scalability
- Conclusion
Traditional Clustering
- Unsupervised learning.
- A process of grouping similar objects into groups.
- The distance between objects is commonly used as the measure of similarity.
Types of Clustering Algorithms
- Hierarchical clustering: Minimal Spanning Tree method, BIRCH, BUBBLE
- Partition-based clustering: K-means, CLARANS
Hierarchical Clustering
- The instances are first divided crudely into groups at the top level, and each of these groups is refined further, perhaps all the way down to the individual instances.
Partition-Based Clustering
- A desired number of clusters is assumed at the start, and instances are allocated among the clusters so that a particular clustering criterion is optimized (e.g. minimizing the variability within clusters).
Applications
- Marketing: finding groups of customers with similar behavior.
- Landscapes: characterizing different regions.
- Biology: classification of plants and animals given their features.
- Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones.
- WWW: document classification; clustering weblog data to discover groups of similar access patterns.
Problem with Traditional Clustering
- Handling a large number of dimensions and a large number of data items is problematic because of the time complexity.
Requirements for a Good Clustering Algorithm
- Scalability.
- Dealing with different types of attributes.
- Discovering clusters of arbitrary shape.
- Minimal requirements for domain knowledge to determine input parameters.
- Ability to deal with noise and outliers.
Clustering Large Datasets
- CLARANS: assumes that all objects fit in main memory and is sensitive to the input order. Uses R*-trees to improve efficiency.
- BIRCH: minimizes memory usage and scans the data only once from disk. Uses cluster representatives instead of the actual data points. It is the first algorithm proposed in the database area that addresses outliers.
- DBSCAN: uses a density-based notion of clusters to discover clusters of arbitrary shape. It is sensitive to its input parameters and incurs substantial I/O cost.
Drawbacks
- Both BIRCH and CLARANS work well for clusters that are spherical or convex in shape and uniform in size, and are unsuitable when clusters have different sizes or are non-spherical.
- All three algorithms rely on vector operations, which are defined only in a coordinate space, so they are unsuitable for datasets in a distance space.
Proposed Approach
- Two algorithms for clustering large datasets, both based on the BIRCH* framework:
  - BUBBLE
  - BUBBLE-FM
BIRCH*
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies.
- BIRCH* is a generalized framework for incremental clustering algorithms.
- The components of BIRCH* can be instantiated to generate concrete clustering algorithms.
BIRCH* Components
- Cluster Feature (CF*): a summarized representation of a cluster.
- Cluster Tree (CF*-tree): a height-balanced tree of CF*s.
Clustering Feature
- CFs are summarized representations of clusters.
- Requirements:
  - Incrementally maintainable when a new object is inserted.
  - Contain sufficient information to compute the distance between clusters and objects.
CF*-Tree
- A height-balanced tree with two parameters:
  1. Branching factor, B
  2. Threshold, T
- A non-leaf node has at most B entries of the form [CFi, childi], i = 1..B, where CFi is the CF of the subcluster represented by the ith child and childi is a pointer to that child.
CF*-Tree: Leaf Node
- Each cluster at a leaf node satisfies the threshold T, which controls its tightness and quality: diameter or radius < T.
- The tree size is a function of T: as T increases, the tree size decreases.
CF Tree [figure]
Functionality of CF*-Tree
- Directs a new object, O, to the cluster closest to it.
- Non-leaf nodes exist to guide new objects to the appropriate leaf clusters.
- Leaf nodes absorb the new object.
BIRCH*: Mechanism
- Starts with an initial threshold T.
- Scans the data and inserts the objects into the tree.
- During the scan, existing clusters are updated and new clusters are formed.
- If it runs out of memory, M, it increases T and builds a smaller CF*-tree.
- After re-inserting the old leaf entries into the new tree, it resumes the scan from the point at which it was interrupted.
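The rebuild step can be illustrated with a heavily simplified sketch. It is not the BIRCH* implementation: the CF*-tree is replaced by a flat list, each cluster summary is reduced to a seed object plus a count, and the threshold test is a plain distance to the seed. The names `precluster`, `max_clusters` (standing in for the memory budget M) and `growth` are assumptions made for illustration only.

```python
def _absorb(clusters, obj, count, d, threshold):
    """Add `count` objects represented by `obj` to the closest summary, or open a new one."""
    if clusters:
        closest = min(clusters, key=lambda c: d(obj, c[0]))
        if d(obj, closest[0]) < threshold:
            closest[1] += count
            return
    clusters.append([obj, count])


def precluster(stream, d, threshold, max_clusters, growth=2.0):
    """Single scan over `stream`; grows the threshold whenever the budget is exceeded.

    Simplifications (assumptions, not from the source): flat list instead of a
    CF*-tree, summary = [seed object, count], multiplicative threshold growth.
    """
    if threshold <= 0 or growth <= 1 or max_clusters < 1:
        raise ValueError("need threshold > 0, growth > 1 and max_clusters >= 1")
    clusters = []
    for obj in stream:
        _absorb(clusters, obj, 1, d, threshold)
        # "Out of memory": increase T and rebuild from the old summaries,
        # then continue the scan where it was interrupted.
        while len(clusters) > max_clusters:
            threshold *= growth
            old, clusters = clusters, []
            for seed, count in old:
                _absorb(clusters, seed, count, d, threshold)
    return clusters, threshold
```

The point of the sketch is that the rebuild only touches the existing summaries, never the objects already scanned, so the data is still read once.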
CF*-Tree Insertion
- The CF*-tree insertion mechanism is similar to that of B+ trees.
- Each new object, O:
  - Descends the tree until it reaches a leaf node, L.
  - Is inserted into the closest cluster, C, if the threshold, T, is not violated; otherwise it forms a new cluster.
- If there is not enough space in L, L is split into two leaf nodes and its entries are distributed between the two nodes.
- As in a B+ tree, node splits may propagate up to the root.
- The path from the root to the leaf is updated to reflect the insertion.
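The leaf-level step of the insertion can be sketched as follows. This is a minimal illustration under stated assumptions: a leaf is a plain list of clusters, each cluster keeps all of its objects (a real CF* keeps only a summary), the threshold is checked on the radius, and the function name `insert_into_leaf` and the `capacity` parameter are hypothetical.

```python
import math


def clustroid(cluster, d):
    """Object with the smallest sum of squared distances to the rest of the cluster."""
    return min(cluster, key=lambda o: sum(d(o, x) ** 2 for x in cluster))


def radius(cluster, d):
    """Root mean squared distance of the cluster members to the clustroid."""
    c = clustroid(cluster, d)
    return math.sqrt(sum(d(o, c) ** 2 for o in cluster) / len(cluster))


def insert_into_leaf(leaf, obj, d, threshold, capacity):
    """Insert `obj` into `leaf` (a list of clusters, each a list of objects).

    The object joins the closest cluster if the enlarged cluster still has
    radius < threshold; otherwise it starts a new cluster.
    Returns True when the leaf now holds more than `capacity` clusters,
    i.e. when the caller would have to split it.
    """
    if leaf:
        closest = min(leaf, key=lambda c: d(obj, clustroid(c, d)))
        if radius(closest + [obj], d) < threshold:
            closest.append(obj)
            return False
    leaf.append([obj])
    return len(leaf) > capacity
```

Splitting the leaf and propagating the split towards the root are left out; the return value only signals that a split is due.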
BIRCH*: Instantiation Summary
- Cluster features at leaf and non-leaf levels.
- Incremental maintenance of cluster features at leaf and non-leaf nodes.
- Distance measure between a CF* and an object, and between CF*s.
- Threshold requirement.
BUBBLE
- BIRCH* instantiated in a distance space.
- There is no concept of a centroid.
- For a given set of objects O = {O1, ..., On} with distance function d, BUBBLE defines:
  - RowSum of an object O in the set: $\mathrm{RowSum}(O) = \sum_{j=1}^{n} d^2(O, O_j)$.
  - Clustroid: the object O' in the set with the least RowSum value.
  - Radius: $r = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d^2(O_i, O')}$.
  - Clustroid distance between two clusters with clustroids O1' and O2': D0 = d(O1', O2').
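These quantities need nothing but a distance function, which a short sketch makes concrete. The function names are chosen for illustration, and RowSum and radius are taken as the sum of squared distances and the root mean squared distance to the clustroid, as defined in the primary source; no coordinates or vector operations are used.

```python
import math


def rowsum(obj, objects, d):
    """Sum of squared distances from `obj` to every object in the set."""
    return sum(d(obj, other) ** 2 for other in objects)


def clustroid(objects, d):
    """Object with the least RowSum (on ties, the first such object)."""
    return min(objects, key=lambda o: rowsum(o, objects, d))


def radius(objects, d):
    """Root mean squared distance of the objects to the clustroid."""
    c = clustroid(objects, d)
    return math.sqrt(sum(d(o, c) ** 2 for o in objects) / len(objects))


def clustroid_distance(objects1, objects2, d):
    """D0: distance between the clustroids of two clusters."""
    return d(clustroid(objects1, d), clustroid(objects2, d))
```

For example, with `d = lambda a, b: abs(a - b)` and the objects `[0, 1, 2, 10]`, the RowSum values are 105, 83, 69 and 245, so the clustroid is 2, not the arithmetic mean 3.25, which is not a member of the set.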
BUBBLE: CF at Leaf Nodes
- For a cluster C with objects O = {O1, ..., On}, the CF is a five-tuple (n, O', R, RS, r):
  - n: number of objects in C.
  - O': clustroid of C.
  - R: representatives of the cluster C (a subset of C).
  - RS: RowSum values of the representatives.
  - r: radius of the cluster C.
BUBBLE: CF at Non-Leaf Nodes
- A set of sample objects, S(NLi), randomly collected from the subtree below entry NLi, forms its CFi.
- The CF at a non-leaf node NL, S(NL), is the union of the sample sets S(NLi) of its entries.
- Each child node has at least one representative in S(NL).
- If the child of entry i is a leaf node, the objects in S(NLi) are picked randomly from the clustroids of the clusters in that leaf.
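How sample objects can guide a new object through a non-leaf node is sketched below. The slide does not state the distance used between an object and an entry; the sketch assumes the root mean squared distance from the object to the entry's sample objects. The names `distance_to_samples` and `choose_child` are hypothetical.

```python
import math


def distance_to_samples(obj, samples, d):
    """Assumed object-to-entry distance: RMS distance from obj to the sample objects."""
    return math.sqrt(sum(d(obj, s) ** 2 for s in samples) / len(samples))


def choose_child(obj, entries, d):
    """Pick the child to descend into.

    `entries` is a list of (samples, child) pairs, one per entry of the
    non-leaf node; the child whose sample objects are closest to `obj` wins.
    """
    _, child = min(entries, key=lambda e: distance_to_samples(obj, e[0], d))
    return child
```

Every call to `d` here is a real distance computation, so the cost of routing one object grows with the number of sample objects kept at each node.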
BUBBLE: Incremental Maintenance of CF at Leaf
- Types of insertions:
  - Type I: insertion of a single object.
  - Type II: insertion of a cluster of objects.
Type I Insertion
- Inserting a single object into a leaf cluster C.
- If |C| is small, maintain all the cluster objects and calculate the new clustroid exactly.
- If |C| is large, maintain only a subset of C, the representatives R, that are close to the clustroid.
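A Type I insertion can be sketched with a small leaf CF that holds the five components (n, clustroid, representatives, RowSums, radius). This is a simplified illustration, not the update rule of the paper: RowSums are recomputed over the retained representatives only, and the class name `LeafCF` and the limit `max_reps` are assumptions.

```python
import math
from dataclasses import dataclass, field


@dataclass
class LeafCF:
    max_reps: int                                # how many representatives to keep
    n: int = 0                                   # number of objects absorbed so far
    reps: list = field(default_factory=list)     # representatives R
    rowsums: list = field(default_factory=list)  # RowSum of each representative
    clustroid: object = None
    radius: float = 0.0

    def insert(self, obj, d):
        """Type I insertion of a single object with distance function d."""
        self.n += 1
        self.reps.append(obj)
        self._refresh(d)
        if len(self.reps) > self.max_reps:
            # Cluster is "large": drop the representative farthest from the clustroid.
            far = max(range(len(self.reps)),
                      key=lambda i: d(self.reps[i], self.clustroid))
            del self.reps[far]
            self._refresh(d)

    def _refresh(self, d):
        """Recompute RowSums, clustroid and radius over the kept representatives."""
        self.rowsums = [sum(d(o, x) ** 2 for x in self.reps) for o in self.reps]
        best = min(range(len(self.reps)), key=self.rowsums.__getitem__)
        self.clustroid = self.reps[best]
        self.radius = math.sqrt(self.rowsums[best] / len(self.reps))
```

While the cluster holds at most `max_reps` objects the clustroid is exact; beyond that it is an approximation based on the objects near the current clustroid.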
Type II Insertion
- Inserting a cluster of objects, i.e. merging two clusters C1 and C2.
- C1 and C2 must be non-overlapping but close clusters.
- The new clustroid lies between the two old clustroids.
- By maintaining a few objects far away from the clustroids of C1 and C2, the new clustroid can be calculated.
Incremental Maintenance of CF at Non-Leaf
- The sample objects at a non-leaf entry are updated whenever its child node splits.
- The distribution of clusters changes significantly whenever a node splits.
- To reflect the changes in the distribution at all child nodes, the sample objects at all entries of NL are updated.
Drawbacks of BUBBLE
- BUBBLE computes distances between a new object and the sample objects, and each distance computation can be expensive, e.g. edit distance on strings.
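To make the cost concrete, here is a standard dynamic-programming sketch of the Levenshtein edit distance (unit cost for insertion, deletion and substitution). It is given only as an example of an expensive distance function; the slide does not say which edit-distance variant is meant.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b in O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

A single call is quadratic in the string lengths, and routing one object down the tree needs one call per sample object at every level.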
BUBBLE-FM
- Transforms the distance space into an approximate vector space.
- Maintains all the sample objects at each non-leaf node in the vector space.
- For a new object O, transforms O into the vector space and uses the Euclidean distance metric.
- Does not use the transformation at leaf nodes.
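The mapping into a vector space can be illustrated with a compact FastMap-style sketch: each coordinate is obtained from two pivot objects via the cosine law, and later coordinates use the residual distances. The pivot heuristic, the function name `fastmap` and the absence of distance caching are simplifications assumed for illustration; this is not the BUBBLE-FM code.

```python
import math


def fastmap(objects, d, k):
    """Map each object to a k-dimensional point using only the distance function d."""
    n = len(objects)
    if n == 0:
        return []
    coords = [[0.0] * k for _ in range(n)]

    def dist2(i, j, dim):
        """Squared distance left after removing the first `dim` coordinates."""
        residual = d(objects[i], objects[j]) ** 2
        for c in range(dim):
            residual -= (coords[i][c] - coords[j][c]) ** 2
        return max(residual, 0.0)  # clip small negatives from rounding

    for dim in range(k):
        # Pivot heuristic: farthest object from object 0, then farthest from that.
        b = max(range(n), key=lambda j: dist2(0, j, dim))
        a = max(range(n), key=lambda j: dist2(b, j, dim))
        dab2 = dist2(a, b, dim)
        if dab2 == 0.0:
            break  # nothing left to explain; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine law: projection of object i onto the line through a and b.
            coords[i][dim] = (dist2(a, i, dim) + dab2 - dist2(b, i, dim)) / (2 * dab)
    return coords
```

Once objects have coordinates, comparing a new object with many sample objects costs cheap Euclidean distances instead of one expensive call to `d` per sample.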
Scalability [figure]
Conclusion
- Presented the BIRCH* framework for scalable incremental pre-clustering algorithms.
- BUBBLE: for datasets in arbitrary metric spaces.
- BUBBLE-FM: uses FastMap to reduce the number of calls to an expensive distance function.
References
Primary source:
- Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke. Clustering Large Datasets in Arbitrary Metric Spaces (1999).
Secondary sources:
- Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases (1996).
- Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. CURE: An Efficient Clustering Algorithm for Large Databases (1998).