Clustering Large Datasets in Arbitrary Metric Space


Clustering Large Datasets in Arbitrary Metric Space
by Muralikrishna Achari
COSC6341 Information Retrieval Project Presentation

Contents
- Introduction to Clustering
- Problems in Traditional Clustering
- Clustering Large Datasets
- BIRCH*
- BUBBLE
- BUBBLE-FM
- Scalability
- Conclusion

Traditional Clustering
- A form of unsupervised learning.
- The process of grouping similar objects together.
- The distance between objects is the usual measure of similarity.

Types of Clustering Algorithms
- Hierarchical clustering: Minimal Spanning Tree method, BIRCH, BUBBLE
- Partition-based clustering: K-means, CLARANS

Hierarchical Clustering
The instances are first divided crudely into groups at the top level, and each of these groups is then refined further, possibly all the way down to the individual instances.

Partition-Based Clustering
A desired number of clusters is assumed at the start, and instances are allocated among the clusters so that a particular clustering criterion is optimized (e.g. minimizing the variability within clusters).
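
As an illustration of the partition-based idea, here is a minimal sketch of Lloyd-style k-means on 1-D points. The function names and the 1-D simplification are assumptions made for this example; the slides do not prescribe an implementation. The number of clusters is fixed by the number of starting centers, and the criterion being reduced is the within-cluster variability.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd-style k-means on 1-D points; k is the number of starting centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group
        # (a center with an empty group stays where it is).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


def within_cluster_variability(centers, groups):
    """Sum of squared distances from each point to its cluster center."""
    return sum((p - c) ** 2 for c, g in zip(centers, groups) for p in g)


centers, groups = kmeans_1d([1.0, 2.0, 9.0, 11.0], [0.0, 10.0])
print(centers, groups, within_cluster_variability(centers, groups))
```

Note that the update step needs a mean, which is only defined for vector data; this is the limitation the later slides address.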

Applications
- Marketing: finding groups of customers with similar behavior.
- Landscapes: characterizing different regions.
- Biology: classification of plants and animals given their features.
- Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones.
- WWW: document classification; clustering weblog data to discover groups of similar access patterns.

Problem with Traditional Clustering
Handling a large number of dimensions together with a large number of data items can be problematic because of the time complexity.

Requirements for a Good Clustering Algorithm
- Scalability.
- Dealing with different types of attributes.
- Discovering clusters with arbitrary shape.
- Minimal requirements for domain knowledge to determine input parameters.
- Ability to deal with noise and outliers.

Clustering Large Datasets

Clustering Large Datasets
- CLARANS: Assumes all objects fit in main memory and is sensitive to input order. Uses R*-trees to improve efficiency.
- BIRCH: Minimizes memory usage and scans the data only once from disk. Uses cluster representatives instead of the actual data points. The first algorithm proposed in the database area that addresses outliers.
- DBSCAN: Uses a density-based notion of clusters to discover clusters of arbitrary shape. Sensitive to its input parameters and incurs substantial I/O cost.

Drawbacks
- Both BIRCH and CLARANS work well for clusters of spherical or convex shape and uniform size, and are unsuitable when clusters have different sizes or are non-spherical.
- All three algorithms rely on vector operations, which are defined only in a coordinate space, so they are unsuitable for datasets in a distance space.

Proposed Approach
Two algorithms for clustering large datasets, both based on the BIRCH* framework:
- BUBBLE
- BUBBLE-FM

BIRCH*
- BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
- BIRCH* is a generalized framework for incremental clustering algorithms.
- The BIRCH* components can be instantiated to generate concrete clustering algorithms.

BIRCH* Components
- Cluster feature (CF*): a summarized representation of a cluster.
- Cluster tree (CF*-tree): a height-balanced tree of CF*s.

Clustering Feature
CFs are summarized representations of clusters. Requirements:
- Incrementally maintainable when a new object is inserted.
- Contains sufficient information to compute the distance between clusters and objects.

CF*-Tree
A height-balanced tree with two parameters:
1. Branching factor, B
2. Threshold, T
A non-leaf node has at most B entries of the form [CFi, childi], i = 1..B, where CFi is the CF of the subcluster represented by the ith child and childi is a pointer to that child.

CF*-Tree: Leaf Node
- Every cluster at a leaf node must satisfy the threshold T, which controls its tightness and quality: its diameter or radius must be less than T.
- The tree size is a function of T: as T increases, the tree size decreases.

[Figure: CF Tree]

Functionality of the CF*-Tree
The tree directs a new object, O, to the cluster closest to it.
- Non-leaf nodes exist to guide new objects to the appropriate leaf clusters.
- Leaf nodes absorb the new object.
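
The routing role of the non-leaf nodes can be sketched as follows. This is only an illustration: the `Node` class, the representation of a non-leaf entry as a list of summary objects plus a child, and the use of the mean distance to those summary objects are all assumptions of this example, not details given on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    is_leaf: bool
    # Non-leaf node: list of (summary_objects, child_node) pairs.
    # Leaf node: list of clusters, each shown here as a plain list of objects.
    entries: list = field(default_factory=list)


def entry_distance(obj, summary_objects, dist):
    """Assumed measure: mean distance from obj to an entry's summary objects."""
    return sum(dist(obj, s) for s in summary_objects) / len(summary_objects)


def find_leaf(root, obj, dist):
    """Follow the closest entry at each non-leaf node until a leaf is reached."""
    node = root
    while not node.is_leaf:
        _, node = min(node.entries, key=lambda e: entry_distance(obj, e[0], dist))
    return node


leaf_a = Node(True, [[1, 2]])
leaf_b = Node(True, [[9, 10]])
root = Node(False, [([1, 2], leaf_a), ([9, 10], leaf_b)])
print(find_leaf(root, 8, lambda a, b: abs(a - b)) is leaf_b)
```

Only the distance function touches the objects themselves, so the same descent works for any metric.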

BIRCH*: Mechanism
- Starts with an initial threshold T.
- Scans the data and inserts the objects into the tree.
- During the scan, existing clusters are updated and new clusters are formed.
- If it runs out of the available memory, M, it increases T and builds a smaller CF*-tree from the old leaf entries.
- After reinserting the old leaf entries, it resumes the scan from the point at which it was interrupted.
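
The scan-and-rebuild loop can be sketched in a heavily simplified form. The assumptions here are substantial and are made only to keep the example short: the data are 1-D numbers, the tree is flattened to a single list of clusters, each cluster summary is a `[count, sum]` pair, the threshold test compares centroid distance with T, the memory limit is modeled as a maximum number of clusters, and T is doubled on overflow. None of these choices come from the slides.

```python
def insert_entry(clusters, count, total, threshold):
    """Absorb a (count, sum) entry into the nearest cluster, or start a new one."""
    centroid = total / count
    best = None
    for cluster in clusters:
        gap = abs(cluster[1] / cluster[0] - centroid)
        if best is None or gap < best[0]:
            best = (gap, cluster)
    if best is not None and best[0] <= threshold:
        best[1][0] += count
        best[1][1] += total
    else:
        clusters.append([count, total])


def precluster(points, threshold, max_clusters, growth=2.0):
    """Single scan; assumes threshold > 0, growth > 1 and max_clusters >= 1."""
    clusters = []
    for p in points:
        insert_entry(clusters, 1, p, threshold)
        # "Out of memory": raise T and rebuild from the old cluster summaries,
        # then carry on with the scan where it stopped.
        while len(clusters) > max_clusters:
            threshold *= growth
            old, clusters = clusters, []
            for count, total in old:
                insert_entry(clusters, count, total, threshold)
    return clusters, threshold


data = [0.0, 0.1, 5.0, 5.1, 10.0, 10.2]
print(precluster(data, threshold=0.5, max_clusters=3))
```

The rebuild reinserts summaries rather than raw points, which is why the data need to be read only once.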

CF*-Tree Insertion
The CF*-tree insertion mechanism resembles that of B+-trees. Each new object, O:
- Descends to a leaf node, L.
- Is inserted into the closest cluster, C, in L if the threshold T is not violated; otherwise it forms a new cluster.
- If there is not enough space in L for the new cluster, L is split into two leaf nodes and its entries are distributed between them.
- As in a B+-tree, node splits may propagate up to the root.
- The path from the root to the leaf is updated to reflect the insertion.
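
The slide does not say how entries are distributed when a leaf splits. One common heuristic, assumed here purely for illustration, is to take the two entries that are farthest apart as seeds and assign every entry to the nearer seed. The function name `split_entries` is hypothetical.

```python
from itertools import combinations


def split_entries(entries, dist):
    """Split a full node's entries in two, seeded by the farthest pair (assumed heuristic)."""
    # Pick the pair of positions whose entries are farthest apart.
    a, b = max(
        combinations(range(len(entries)), 2),
        key=lambda pair: dist(entries[pair[0]], entries[pair[1]]),
    )
    left, right = [], []
    for entry in entries:
        # Each entry joins the seed it is closer to; ties go to the left node.
        if dist(entry, entries[a]) <= dist(entry, entries[b]):
            left.append(entry)
        else:
            right.append(entry)
    return left, right


print(split_entries([1, 2, 9, 10], lambda x, y: abs(x - y)))
```

Finding the farthest pair costs a quadratic number of distance calls in the node size, which stays small because the node holds a bounded number of entries.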

BIRCH*: Instantiation Summary
To instantiate BIRCH*, specify:
- Cluster features at the leaf and non-leaf levels.
- Incremental maintenance of the cluster features at leaf and non-leaf nodes.
- A distance measure between a CF* and an object, and between CF*s.
- The threshold requirement.

BUBBLE
BIRCH* instantiated in a distance space, where there is no concept of a centroid.
For a given set of objects O = {O1, ..., On}, BUBBLE defines:
- RowSum(Oi) = sum over j = 1..n of d(Oi, Oj)^2
- Clustroid of O: the object O' in O with the least RowSum value.
- Radius: r(O) = sqrt( (sum over j = 1..n of d(Oj, O')^2) / n )
- Clustroid distance between two sets O1 and O2 with clustroids O1' and O2': D0(O1, O2) = d(O1', O2')
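
These quantities can be computed with nothing but a distance function. The sketch below assumes that RowSum is the sum of squared distances from an object to every object in the set and that the radius is the root mean squared distance to the clustroid, following the primary source; the function names are chosen for this example.

```python
import math


def rowsum(obj, objects, dist):
    """Sum of squared distances from obj to every object in the set."""
    return sum(dist(obj, other) ** 2 for other in objects)


def clustroid(objects, dist):
    """The object with the smallest RowSum; it plays the role of a centroid."""
    return min(objects, key=lambda o: rowsum(o, objects, dist))


def radius(objects, dist):
    """Root mean squared distance from the objects to the clustroid."""
    center = clustroid(objects, dist)
    return math.sqrt(sum(dist(o, center) ** 2 for o in objects) / len(objects))


def clustroid_distance(objects_1, objects_2, dist):
    """D0: the distance between the clustroids of two sets."""
    return dist(clustroid(objects_1, dist), clustroid(objects_2, dist))


def absolute(x, y):
    return abs(x - y)


print(clustroid([1, 2, 6], absolute), radius([1, 2, 6], absolute))
```

Unlike a centroid, the clustroid is always one of the actual objects, so no vector arithmetic on the objects is needed.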

BUBBLE: CF at Leaf Nodes
For a cluster C of objects O = {O1, ..., On}, the CF is a five-tuple (n, O', R, RS, r):
- n: the number of objects in C.
- O': the clustroid of C.
- R: the representatives of the cluster C (R is a subset of C).
- RS: the RowSums of all the representatives.
- r: the radius of the cluster C.

BUBBLE: CF at Non-Leaf Nodes
- A set of sample objects, S(NLi), collected at random from the subtree rooted at the ith child NLi, forms the entry CFi.
- The CF at a non-leaf node NL is the union of the S(NLi) over its children.
- Each child node has at least one representative in S(NL).
- If the ith child is a leaf node, the objects in S(NLi) are picked at random from the clustroids of its clusters.

BUBBLE: Incremental Maintenance of CF at Leaf Nodes
Types of insertion:
- Type I: insertion of a single object.
- Type II: insertion of a cluster of objects.

Type I Insertion
Inserting a single object into a leaf cluster C:
- If |C| is small, maintain all the cluster objects and calculate the new clustroid.
- If |C| is large, maintain a subset of C, of size R, consisting of the objects close to the clustroid.
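
A rough sketch of Type I insertion follows. It is a simplification made under stated assumptions: the class `LeafCluster` and its field names are hypothetical, the clustroid is recomputed exactly over the retained objects after every insertion, and the retained subset is simply the `max_reps` objects nearest the current clustroid. The update rule in the primary source is more refined than this.

```python
from dataclasses import dataclass, field


@dataclass
class LeafCluster:
    max_reps: int                 # the size limit R for the retained subset
    n: int = 0                    # total number of objects absorbed so far
    reps: list = field(default_factory=list)

    def clustroid(self, dist):
        """Retained object with the least sum of squared distances to the others."""
        return min(self.reps, key=lambda o: sum(dist(o, p) ** 2 for p in self.reps))

    def insert(self, obj, dist):
        self.n += 1
        self.reps.append(obj)
        if len(self.reps) > self.max_reps:
            # Large cluster: keep only the objects closest to the clustroid.
            center = self.clustroid(dist)
            self.reps.sort(key=lambda o: dist(o, center))
            del self.reps[self.max_reps:]


cluster = LeafCluster(max_reps=3)
for value in [1, 2, 3, 10]:
    cluster.insert(value, lambda x, y: abs(x - y))
print(cluster.n, sorted(cluster.reps))
```

Bounding the retained subset keeps both the memory per cluster and the cost of each clustroid update constant as the cluster grows.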

Type II Insertion
Inserting a cluster of objects, i.e. merging two clusters C1 and C2:
- C1 and C2 must be non-overlapping but close clusters.
- The new clustroid lies between the two old clustroids.
- By maintaining a few objects far away from the clustroids of C1 and C2, the new clustroid can be calculated.

Incremental Maintenance of CF at Non-Leaf Nodes
- The sample objects at a non-leaf entry are updated whenever its child node splits.
- The distribution of clusters changes significantly whenever a node splits.
- To reflect changes in the distribution at all child nodes, the sample objects at all entries of NL are updated.

Drawbacks of BUBBLE
BUBBLE computes distances to the sample objects, and each distance computation can be expensive, e.g. edit distance on strings.
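
To see why a single distance call can be costly, consider a standard dynamic-programming edit (Levenshtein) distance, sketched below with unit costs for insertion, deletion and substitution. The slide names edit distance only as an example and does not specify a variant, so the unit costs are an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, in O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Each call takes time proportional to the product of the two string lengths, so every distance evaluation avoided while descending the tree is a real saving.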

BUBBLE-FM
- Transforms the distance space into an approximate vector space.
- Maintains all the sample objects at each non-leaf node in the vector space.
- For a new object O, transforms O into the vector space and uses the Euclidean distance metric.
- Does not use the transformation at leaf nodes.
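
The transformation BUBBLE-FM relies on is FastMap, which the conclusion slide names. The sketch below shows only the core step, under the assumption that two pivot objects have already been chosen: an object is projected onto the line through the pivots using the cosine law, which needs just two distance calls per coordinate. The function name and the fixed pivots are illustrative, and the pivot selection and the residual distances used for further coordinates are omitted.

```python
def fastmap_coordinate(obj, pivot_a, pivot_b, dist):
    """Coordinate of obj along the line through two pivot objects (cosine law)."""
    d_ab = dist(pivot_a, pivot_b)
    if d_ab == 0:
        # Identical pivots carry no direction; map everything to 0.
        return 0.0
    d_a = dist(pivot_a, obj)
    d_b = dist(pivot_b, obj)
    return (d_a ** 2 + d_ab ** 2 - d_b ** 2) / (2 * d_ab)


def absolute(x, y):
    return abs(x - y)


# With numbers and absolute difference the projection recovers the positions.
print([fastmap_coordinate(x, 0, 10, absolute) for x in (0, 3, 10)])
```

Once the sample objects at a non-leaf node have coordinates, a new object is compared with them by cheap Euclidean distance instead of by repeated calls to the original distance function.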

[Figure: Scalability (two slides of plots)]

Conclusion
- Presented the BIRCH* framework for scalable, incremental pre-clustering algorithms.
- BUBBLE clusters datasets in arbitrary metric spaces.
- BUBBLE-FM uses FastMap to reduce the number of calls to an expensive distance function.

References
Primary source:
- Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke. Clustering Large Datasets in Arbitrary Metric Spaces (1999).
Secondary sources:
- Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases (1996).
- Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. CURE: An Efficient Clustering Algorithm for Large Databases (1998).