1
Clustering Large Datasets in Arbitrary Metric Space
by Muralikrishna Achari, COSC6341 Information Retrieval Project Presentation
2
Contents
Introduction to Clustering
Problems in Traditional Clustering
Clustering Large Datasets
BIRCH*
BUBBLE
BUBBLE-FM
Scalability
Conclusion
3
Traditional Clustering
Unsupervised learning: the process of grouping similar objects into groups. Distance between objects is used as a common metric to assess similarity.
4
Types of Clustering Algorithms
Hierarchical clustering: Minimal Spanning Tree method, BIRCH, BUBBLE
Partition-based clustering: K-means, CLARANS
5
Hierarchical Clustering
A crude division of instances into groups is made at the top level; each of these groups is then refined further, perhaps all the way down to individual instances.
6
Partition based clustering
A desired number of clusters is assumed at the start, and instances are allocated among clusters so that a particular clustering criterion is optimized (e.g., minimizing the variability within clusters).
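To make the partition-based idea concrete, here is a minimal k-means sketch in Python (NumPy-based; the function name and parameters are illustrative, not taken from any of the cited papers):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then recompute centroids."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```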
7
Applications
Marketing: finding groups of customers with similar behavior.
Landscapes: characterizing different regions.
Biology: classification of plants and animals given their features.
Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones.
WWW: document classification; clustering weblog data to discover groups of similar access patterns.
8
Problem with Traditional Clustering
Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity.
9
Requirements for a Good Clustering Algorithm
Scalability.
Dealing with different types of attributes.
Discovering clusters with arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters.
Ability to deal with noise and outliers.
10
Clustering Large Datasets
11
Clustering Large Datasets
CLARANS: Assumes all objects fit in main memory; sensitive to input order. Uses R*-trees to improve efficiency.
BIRCH: Minimizes memory usage and scans the data only once from disk. Uses cluster representatives instead of actual data points. The first algorithm proposed in the database area that addresses outliers.
DBSCAN: Uses a distance-based notion of clusters to discover clusters of arbitrary shape. Sensitive to the input parameters and incurs substantial I/O cost.
12
Drawbacks
Both BIRCH and CLARANS work well for clusters with spherical or convex shape and uniform size, and are unsuitable when clusters have different sizes or are non-spherical.
All three algorithms rely on vector operations, which are defined only in coordinate space, and are therefore unsuitable for datasets in distance space.
13
Proposed Approach
Two algorithms for clustering large datasets, based on the BIRCH* framework:
BUBBLE
BUBBLE-FM
14
BIRCH*: Balanced Iterative Reducing and Clustering using Hierarchies.
BIRCH* is a generalized framework for incremental clustering algorithms.
BIRCH* components can be instantiated to generate concrete clustering algorithms.
15
BIRCH* Components
Cluster Feature (CF*): a summarized representation of a cluster.
Cluster Feature tree (CF*-tree): a height-balanced tree of CF*s.
16
Clustering Feature
CFs are summarized representations of clusters.
Requirements:
Incrementally maintainable when a new object is inserted.
Contain sufficient information to compute distances between clusters and between clusters and objects.
17
CF*-Tree
A height-balanced tree with two parameters:
1. Branching factor, B
2. Threshold, T
A non-leaf node has at most B entries of the form [CF_i, child_i], i = 1..B, where CF_i is the CF of the subcluster represented by the i-th child and child_i is a pointer to that child.
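A rough Python sketch of this node layout, with illustrative names (the contents of the CF* summary itself depend on the BIRCH* instantiation):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CFEntry:
    """One [CF_i, child_i] entry of a node."""
    cf: object                         # summarized representation of the subcluster
    child: Optional["CFNode"] = None   # pointer to the i-th child (None in leaf entries)

@dataclass
class CFNode:
    is_leaf: bool
    entries: List[CFEntry] = field(default_factory=list)

B = 4  # branching factor: a non-leaf node holds at most B entries

def is_full(node: CFNode) -> bool:
    return len(node.entries) >= B
```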
18
CF*-tree Leaf Node
Each leaf entry satisfies the threshold T, which controls its tightness and quality: the diameter (or radius) of the cluster must be < T.
Tree size is a function of T: as T increases, tree size decreases.
19
CF*-tree (diagram)
20
Functionality of CF*-tree
Direct a new object, O, to the cluster closest to it.
Non-leaf node: exists to guide new objects to the appropriate leaf clusters.
Leaf node: absorbs the new object.
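A sketch of how a new object could be routed to a leaf, reusing the node layout sketched earlier; dist is a caller-supplied function measuring the distance between a CF summary and an object (an assumption, since the concrete measure depends on the instantiation):

```python
def find_leaf(root, obj, dist):
    """Walk from the root down: at each non-leaf node, follow the entry whose CF summary
    is closest to obj; return the leaf node whose clusters may absorb obj."""
    node = root
    while not node.is_leaf:
        closest = min(node.entries, key=lambda e: dist(e.cf, obj))
        node = closest.child
    return node
```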
21
BIRCH*: Mechanism
Starts with an initial T.
Scans the data and inserts the objects into the tree.
During the scan, existing clusters are updated and new clusters are formed.
If it runs out of memory M, it increases T and builds a smaller CF*-tree by re-inserting the old leaf entries.
It then resumes the scan from the point at which it was interrupted.
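A heavily simplified sketch of this control flow: it uses a flat list of clusters instead of the actual CF*-tree, a cluster-count limit as a stand-in for the memory bound M, and a doubling heuristic for growing T (all of these simplifications are assumptions for illustration):

```python
def cluster_with_rebuilds(objects, d, initial_T, max_clusters):
    """Scan objects once; when the structure grows past the budget, grow T and rebuild
    a smaller set of clusters from the old cluster representatives, then keep scanning."""
    T = initial_T
    clusters = []  # each cluster is a list of objects; its first element acts as representative

    def insert(obj, clusters, T):
        # Absorb obj into the first cluster whose representative is within T, else start a new one.
        for c in clusters:
            if d(c[0], obj) <= T:
                c.append(obj)
                return c
        clusters.append([obj])
        return clusters[-1]

    for obj in objects:
        insert(obj, clusters, T)
        if len(clusters) > max_clusters:       # stand-in for "runs out of memory M"
            T *= 2                             # heuristic threshold growth (assumption)
            old, clusters = clusters, []
            for c in old:                      # re-insert the old leaf entries under the larger T
                target = insert(c[0], clusters, T)
                target.extend(c[1:])
            # the scan then resumes from the point at which it was interrupted
    return clusters
```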
22
CF*-tree Insertion
The CF*-tree insertion mechanism is the same as that of B+-trees.
Each new object O:
Reaches a leaf node, L.
Is inserted into the closest cluster C if the threshold T is not violated; otherwise it forms a new cluster.
If there is not enough space in L, L is split into two leaf nodes and the entries are distributed between them.
As in a B+-tree, node splits may propagate up to the root.
The path from the root to the leaf is updated to reflect the insertion.
23
BIRCH*: Instantiation Summary
Cluster features at leaf and non-leaf levels.
Incremental maintenance of cluster features at leaf and non-leaf nodes.
Distance measures between a CF* and an object, and between CF*s.
Threshold requirement.
24
BUBBLE
BIRCH* instantiated in distance space.
No concept of centroids.
For a given set of objects O = {O_1, ..., O_n}, it defines:
RowSum(O) = sum over i = 1..n of d(O, O_i)^2.
Clustroid of O is the object O' with the least RowSum value.
Radius r(O) = sqrt(RowSum(O') / n).
Clustroid distance D0(O_1, O_2) = d(O_1', O_2'), the distance between the clustroids of the two clusters.
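A naive O(n^2) sketch of these statistics, taking the distance function d as a parameter:

```python
import math

def rowsum(obj, objects, d):
    """RowSum(obj) = sum of squared distances from obj to every object in the set."""
    return sum(d(obj, o) ** 2 for o in objects)

def clustroid(objects, d):
    """The clustroid is the object with the smallest RowSum."""
    return min(objects, key=lambda o: rowsum(o, objects, d))

def radius(objects, d):
    c = clustroid(objects, d)
    return math.sqrt(rowsum(c, objects, d) / len(objects))

def clustroid_distance(objects1, objects2, d):
    """D0(C1, C2) = distance between the two clustroids."""
    return d(clustroid(objects1, d), clustroid(objects2, d))
```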
25
BUBBLE: CF at leaf nodes
For a set of objects O = {O_1, ..., O_n} forming a cluster C, the CF is a five-tuple (n, O', R, RS, r):
n: number of objects in C.
O': clustroid of C.
R: representative objects of the cluster C (R ⊆ C).
RS: the rowsum values of the representatives.
r: radius of the cluster C.
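A sketch of the leaf-level CF as a plain record, with field names following the five-tuple above:

```python
from dataclasses import dataclass

@dataclass
class LeafCF:
    n: int            # number of objects in the cluster
    clustroid: object # O', the object with the least rowsum
    reps: list        # R, representative objects of the cluster (a subset of C)
    rowsums: list     # RS, rowsum values of the representatives (parallel to reps)
    radius: float     # r, radius of the cluster
```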
26
BUBBLE: CF at non-leaf node
A set of sample objects S(NL_i), randomly collected from the subtree rooted at NL_i, forms its CF_i.
CF at NL = S(NL_i).
Each child node has at least one representative in S(NL).
If the child node is a leaf node, then S(NL_i) is randomly picked from the clustroids of its entries.
27
BUBBLE: Incremental Maintenance of CF at leaf
Types of Insertions
Type I: insertion of a single object.
Type II: insertion of a cluster of objects.
28
Type I Insertion
Inserting a single object into a leaf cluster:
If |C| is small, maintain all the cluster objects and compute the new clustroid exactly.
If |C| is large, maintain only a subset of C of size |R| consisting of objects close to the clustroid.
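A sketch of the small-cluster case: every existing rowsum grows by the squared distance to the new object, the new object's rowsum is the sum of those squared distances, and the clustroid is re-picked (a simplified illustration, not the paper's exact maintenance rules):

```python
def insert_object(cluster_objects, rowsums, new_obj, d):
    """Type I insertion for a small cluster: O(|C|) distance computations per insert.
    Returns the (possibly new) clustroid after the insertion."""
    new_rowsum = 0.0
    for i, o in enumerate(cluster_objects):
        dist2 = d(o, new_obj) ** 2
        rowsums[i] += dist2       # each existing rowsum grows by the squared distance
        new_rowsum += dist2
    cluster_objects.append(new_obj)
    rowsums.append(new_rowsum)
    # The clustroid is the object with the smallest updated rowsum.
    clustroid_idx = min(range(len(rowsums)), key=rowsums.__getitem__)
    return cluster_objects[clustroid_idx]
```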
29
Type II Insertion
Inserting a cluster of objects:
C1 and C2 must be non-overlapping but close clusters.
The location of the new clustroid is between the two old clustroids.
By maintaining a few objects far away from the clustroids of C1 and C2, the new clustroid can be calculated.
30
Incremental Maintenance of CF at Non-leaf Nodes
The sample objects at a non-leaf entry are updated whenever its child node splits.
The distribution of clusters changes significantly whenever a node splits.
To reflect the change in distribution across all child nodes, the sample objects at all entries of NL are updated.
31
Drawbacks of BUBBLE
BUBBLE computes distances between sample objects, which can be expensive, e.g., edit distance on strings.
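For example, Levenshtein edit distance between two strings costs O(|s| x |t|) per call, which is why limiting the number of distance computations matters; a standard dynamic-programming sketch:

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(t)]
```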
32
BUBBLE-FM
Uses FastMap to transform the distance space into an approximate vector space.
Maintains all the sample objects at each non-leaf node in the vector space.
For a new object O, transforms O into the vector space and uses the Euclidean distance metric.
Does not use the transformation at leaf nodes.
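A sketch of the FastMap-style projection this relies on: choose two far-apart pivot objects and place every object on the line between them using only the distance function (single-axis version; the full method repeats this for k axes on residual distances):

```python
def fastmap_axis(objects, d):
    """Project each object onto one coordinate using two pivots (simplified FastMap step)."""
    # Heuristic pivot choice: start anywhere, take the farthest object, then the farthest from that.
    a = objects[0]
    b = max(objects, key=lambda o: d(a, o))
    a = max(objects, key=lambda o: d(b, o))
    d_ab = d(a, b)
    coords = []
    for o in objects:
        # Cosine-law projection: x = (d(a,o)^2 + d(a,b)^2 - d(b,o)^2) / (2 * d(a,b))
        x = (d(a, o) ** 2 + d_ab ** 2 - d(b, o) ** 2) / (2 * d_ab) if d_ab > 0 else 0.0
        coords.append(x)
    return coords

# In the embedded space, ordinary Euclidean distance can be used to compare objects.
```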
33
Scalability
34
Scalability
35
Conclusion
Presented the BIRCH* framework for scalable incremental pre-clustering algorithms.
BUBBLE clusters datasets in arbitrary metric spaces.
BUBBLE-FM uses FastMap to reduce the number of calls to an expensive distance function.
36
References
Primary source:
Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke. Clustering Large Datasets in Arbitrary Metric Spaces (1999).
Secondary sources:
Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases (1996).
Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. CURE: An Efficient Clustering Algorithm for Large Databases (1998).