Cluto – Clustering toolkit by G. Karypis, UMN


Cluto – Clustering toolkit by G. Karypis, UMN
Andrea Tagarelli, Univ. of Calabria, Italy

What is CLUTO?
CLUstering Toolkit for very large, high-dimensional & sparse datasets
http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download

Main characteristics:
- Seeks to optimize a particular clustering criterion function
- Identifies the features that best describe and discriminate each cluster
- Allows visual examination of the relations between clusters, objects, and features
- Handles sparsity; memory requirements grow roughly linearly with the input size

Analysis goals:
- To understand the relations between the objects assigned to each cluster and the relations between the different clusters
- To visualize the discovered clustering solution

Distributions:
- Stand-alone programs (vcluster and scluster)
- Library through which an application program can access the CLUTO algorithms

Clustering algorithms
Programs:
- vcluster: takes as input a multidimensional representation of the objects to be clustered
- scluster: takes as input the object similarity graph

Method parameter: -clmethod=string
- Direct k-way partitional clustering (direct)
- Bisecting k-way clustering (rb, rbr)
- Agglomerative hierarchical (agglo)
- Partitional-based agglomerative hierarchical (bagglo)
- Graph-partitioning-based (graph)

rb: the desired k-way clustering solution is computed by performing a sequence of k − 1 repeated bisections. The matrix is first clustered into two groups; then one of these groups is selected and bisected further. This process continues until the desired number of clusters is found. At each step, the cluster is bisected so that the resulting 2-way clustering solution optimizes a particular clustering criterion function (selected with the -crfun parameter). This approach ensures that the criterion function is locally optimized within each bisection, but in general it is not globally optimized. The cluster selected for further partitioning is controlled by the -cstype parameter. By default, vcluster uses this approach to find the k-way clustering solution.

bagglo (biased agglomerative): uses a partitional sqrt(n)-way clustering solution to bias the agglomeration process. The key motivation behind these algorithms is to use a partitional clustering solution that optimizes a global criterion function to limit the number of errors made during the early stages of the agglomerative algorithm. Extensive experiments with these algorithms on document datasets show that they lead to superior clustering solutions.

graph: first models the objects with a nearest-neighbor graph (each object becomes a vertex and is connected to its most similar objects), then splits the graph into k clusters using a min-cut graph-partitioning algorithm. Note that if the graph contains more than one connected component, vcluster and scluster return a (k + m)-way clustering solution, where m is the number of connected components in the graph.
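The repeated-bisection idea behind rb can be sketched in a few lines. The following is a minimal, self-contained illustration, not CLUTO's implementation: it always bisects the largest cluster using a basic 2-means step on 1-D data, whereas CLUTO picks the cluster via -cstype and optimizes the criterion chosen with -crfun.

```python
# Minimal sketch of repeated-bisection (rb) clustering on 1-D data.
# Illustration only: CLUTO optimizes a criterion function (-crfun) and
# selects the cluster to bisect via -cstype; here we simply bisect the
# largest cluster with plain 2-means.

def two_means(points, iters=20):
    """Split a list of 1-D points into two groups with basic 2-means."""
    c0, c1 = min(points), max(points)          # deterministic seeds
    g0, g1 = [], []
    for _ in range(iters):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        if not g0 or not g1:
            break
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return g0, g1

def repeated_bisection(points, k):
    """Compute a k-way clustering by a sequence of k - 1 bisections."""
    clusters = [list(points)]
    while len(clusters) < k:
        target = max(clusters, key=len)        # bisect the largest cluster
        clusters.remove(target)
        clusters.extend(two_means(target))
    return clusters

if __name__ == "__main__":
    data = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8, 25.0, 24.5]
    for c in sorted(repeated_bisection(data, 3), key=min):
        print(sorted(c))
```

Each bisection only optimizes the split of one cluster, which mirrors the note above: the result is locally, not globally, optimal.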

Usage
vcluster [optional parameters] MatrixFile NClusters
scluster [optional parameters] GraphFile NClusters

- MatrixFile: the file that stores the objects to be clustered
- GraphFile: the file that stores the adjacency matrix of the object similarity graph
- NClusters: the number of desired clusters

Optional parameters are specified as -paramname or -paramname=value and are categorized into three groups:
- parameters that control various aspects of the clustering algorithm
- parameters that control the type of analysis and reporting performed on the computed clusters
- parameters that control the visualization of the clusters

The output clustering solution is stored in a file named File.clustering.NClusters, where File is the input file name.
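As a concrete illustration of this calling convention, the small helper below assembles a vcluster command line and predicts the name of the resulting solution file. The helper itself is a hypothetical convenience wrapper (not part of CLUTO), and the example file name is made up; the -clmethod and -crfun parameter names are the ones described in this document.

```python
# Hypothetical helper that assembles a vcluster invocation and predicts
# the output file name (File.clustering.NClusters).

def build_vcluster_command(matrix_file, nclusters, **options):
    """Return (argv, output_file) for a vcluster run.

    Options are rendered as -name=value, matching CLUTO's convention.
    """
    argv = ["vcluster"]
    for name, value in sorted(options.items()):
        argv.append(f"-{name}={value}")
    argv += [matrix_file, str(nclusters)]
    # The solution is stored in MatrixFile.clustering.NClusters
    output_file = f"{matrix_file}.clustering.{nclusters}"
    return argv, output_file

if __name__ == "__main__":
    argv, out = build_vcluster_command("docs.mat", 10, clmethod="rb", crfun="i2")
    print(" ".join(argv))   # vcluster -clmethod=rb -crfun=i2 docs.mat 10
    print(out)              # docs.mat.clustering.10
```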

Input file format: matrix file
Plain text with n+1 lines storing the data matrix for n m-dimensional objects. Each row represents a single object, and the columns correspond to the object's attributes.

Dense format:
- Metadata (first line): #rows, #columns
- Each remaining line contains m space-separated floating-point values

Sparse format:
- Metadata (first line): #rows, #columns, #non-zero entries
- Each remaining line stores one object as a sequence of column-index/value pairs for its non-zero entries
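A small sketch of writing both layouts from Python. The function names are our own, and the sparse layout assumes 1-based column indices in the index/value pairs, as in the CLUTO manual:

```python
# Write a data matrix in CLUTO's dense and sparse plain-text formats.
# Assumption: sparse rows store (column-index, value) pairs with
# 1-based column indices.

def write_dense(rows):
    """Dense format: '#rows #cols' header, then one line of values per object."""
    lines = [f"{len(rows)} {len(rows[0])}"]
    for row in rows:
        lines.append(" ".join(f"{v:g}" for v in row))
    return "\n".join(lines) + "\n"

def write_sparse(rows):
    """Sparse format: '#rows #cols #nonzeros' header, then index/value pairs."""
    nnz = sum(1 for row in rows for v in row if v != 0)
    lines = [f"{len(rows)} {len(rows[0])} {nnz}"]
    for row in rows:
        pairs = [f"{j + 1} {v:g}" for j, v in enumerate(row) if v != 0]
        lines.append(" ".join(pairs))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    m = [[1.0, 0.0, 2.0],
         [0.0, 3.0, 0.0]]
    print(write_dense(m))
    print(write_sparse(m))
```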

Input file format: graph file
Plain text with n+1 lines storing the adjacency matrix of the graph that specifies the similarity between the n objects.

Dense format:
- Metadata (first line): #vertices (n)
- Each of the remaining n lines stores n space-separated floating-point values such that the ith value corresponds to the similarity to the ith vertex of the graph

Sparse format:
- Metadata (first line): #vertices (n) and #edges
- Each of the remaining n lines stores the adjacency list of one vertex as a sequence of pairs; each pair contains the number of an adjacent vertex followed by the similarity of the corresponding edge
- Vertex numbers are assumed to be integers; similarities are assumed to be floating-point numbers
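The same idea for the sparse graph layout, again as an illustrative sketch. The helper name is ours, and two conventions are assumptions of this sketch: vertex numbers are 1-based, and the header counts each undirected edge once even though it appears in both adjacency lists (the METIS-style convention):

```python
# Write a similarity graph in a sparse '#vertices #edges' format:
# one adjacency list per vertex as (adjacent-vertex, similarity) pairs.
# Assumptions: 1-based vertex numbers; each undirected edge is listed in
# both adjacency lists but counted once in the header.

def write_sparse_graph(adjacency):
    """adjacency[i] is the list of (neighbor, similarity) pairs of vertex i+1."""
    nentries = sum(len(nbrs) for nbrs in adjacency)
    lines = [f"{len(adjacency)} {nentries // 2}"]   # each edge appears twice
    for nbrs in adjacency:
        lines.append(" ".join(f"{v} {s:g}" for v, s in nbrs))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # A 3-vertex path: edge 1-2 (sim 0.8) and edge 2-3 (sim 0.5)
    g = [[(2, 0.8)],
         [(1, 0.8), (3, 0.5)],
         [(2, 0.5)]]
    print(write_sparse_graph(g))
```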

Input file format: labels
- Row label file (-rlabelfile): stores the label of each row (object) of the matrix
- Column label file (-clabelfile): stores the label of each column (attribute) of the matrix
- Row class label file (-rclassfile): stores the class label of each row (object) of the matrix

Output file format

Clustering solution file:
- n lines with a single number per line; the ith line contains the cluster number that the ith object/row/vertex belongs to
- Cluster numbers run from zero to the number of clusters minus one
- If -zscores is specified, each line contains two additional numbers right after the cluster number: the internal z-score and the external z-score

Tree file:
- Produced by performing agglomerative hierarchical clustering on top of a k-way clustering solution
- Stored in the form of a parent array: 2k−1 lines such that the ith line contains the parent of the ith node of the tree
- For the root node, stored in the last line of the file, the parent is set to −1
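Both output files are easy to read back. A minimal sketch (the parser names are ours):

```python
# Parse the two output files described above: the clustering solution
# file and the parent-array tree file.

def read_solution(text):
    """Solution file: one cluster number per line; when -zscores was given,
    the internal/external z-scores follow on the same line."""
    assignments = []
    for line in text.splitlines():
        if line.strip():
            assignments.append(int(line.split()[0]))  # first field = cluster id
    return assignments

def read_tree(text):
    """Tree file: parent array of 2k-1 lines; the root (last line) has parent -1.
    Returns k and a parent -> children map."""
    parents = [int(line) for line in text.splitlines() if line.strip()]
    assert parents[-1] == -1, "last line must be the root"
    k = (len(parents) + 1) // 2          # leaf clusters are nodes 0..k-1
    children = {}
    for node, parent in enumerate(parents[:-1]):
        children.setdefault(parent, []).append(node)
    return k, children

if __name__ == "__main__":
    print(read_solution("0\n2 1.1 0.4\n1\n"))     # [0, 2, 1]
    k, children = read_tree("3\n3\n4\n4\n-1\n")   # k=3 clusters, root = node 4
    print(k, children)
```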

Output example
The report printed by vcluster/scluster contains:
- Matrix/graph information
- Settings
- Clustering/cluster quality statistics
- Timing information

Internal clustering quality

External clustering quality
Comparison with a reference classification (via -rclassfile):
- Overall entropy and purity
- For each cluster: local entropy and purity, and the distribution of its objects over the classes
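Entropy and purity can be reproduced directly from the cluster-by-class confusion counts. The slide does not give the formulas, so the sketch below follows the standard definitions used in the clustering literature (and by CLUTO's reports, to the best of our knowledge):

```python
# Entropy and purity of a clustering against a reference classification.
# Standard definitions: for a cluster r with n_r objects, n_{r,i} of which
# belong to class i (out of q classes),
#   E_r = -(1 / log q) * sum_i (n_{r,i}/n_r) * log(n_{r,i}/n_r)
#   P_r = max_i n_{r,i} / n_r
# Overall entropy/purity are the cluster-size-weighted averages.
from math import log

def entropy_purity(cluster_class_counts):
    """cluster_class_counts[r][i] = number of objects of class i in cluster r."""
    q = len(cluster_class_counts[0])               # number of classes
    n = sum(sum(row) for row in cluster_class_counts)
    overall_e = overall_p = 0.0
    per_cluster = []
    for row in cluster_class_counts:
        nr = sum(row)
        e = -sum((c / nr) * log(c / nr) for c in row if c) / log(q)
        p = max(row) / nr
        per_cluster.append((e, p))
        overall_e += nr / n * e
        overall_p += nr / n * p
    return overall_e, overall_p, per_cluster

if __name__ == "__main__":
    # Two clusters over two classes: one pure cluster, one 50/50 cluster.
    e, p, per = entropy_purity([[10, 0], [5, 5]])
    print(round(e, 3), round(p, 3))   # 0.5 0.75
```

Lower entropy and higher purity indicate better agreement with the reference classes.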

Cluster description
Determine the best set of descriptive and discriminating features for each cluster (via -showfeatures):
- Top-L most descriptive features, with the percentage of the within-cluster similarity they explain
- Top-L most discriminating features, with the percentage of the dissimilarity between the cluster and the rest of the objects they explain

Cluster tree (1/2)
Via -showtree, the hierarchical tree is displayed in a rotated fashion:
- The first column is the root, and the tree grows from left to right
- The internal nodes are numbered from NClusters to 2*NClusters − 2 (the leaves are the clusters 0 to NClusters − 1)
- If -rclassfile is specified, it also prints information about how the objects of the various classes are distributed in each cluster

Cluster tree (2/2)
Via -showtree and -labeltree, further statistics are reported for each cluster:
- Size: the number of objects in the cluster
- ISim: the average similarity between the objects of the cluster
- XSim: the average similarity between the objects of each pair of clusters that are children of the same node of the tree
- Gain: the change in the value of a particular clustering criterion function achieved by the merge

Cluster visualization
Example 3: produced when -plotcluster is specified for a sparse matrix. It is a color-intensity plot of the relations between the different clusters of documents and features. A subset of the features is displayed: the union of the most descriptive and discriminating features of each cluster, re-ordered according to a hierarchical agglomerative clustering. A brighter red cell for a feature/cluster pair indicates a higher power of that feature to be, for that cluster, descriptive (i.e., the fraction of the within-cluster similarity that the feature explains) and discriminating (i.e., the fraction of the dissimilarity between the cluster and the rest of the objects that the feature explains). The width of each cluster column is proportional to the logarithm of the corresponding cluster's size.

Example 1: produced when -plotmatrix is specified for a sparse matrix.
(a) The rows of the input matrix are re-ordered so that the rows assigned to each of the ten clusters are grouped together; each non-zero positive element of the matrix is displayed in a different shade of red.
(b) Rows and columns are also re-ordered according to a hierarchical clustering.
(c) The same plot for a 10-way clustering obtained by scluster.

Cluster visualization Example 4. produced when –plottree is specified entire hierarchical tree Leaves of the tree are labeled with the particular row-id(or row label if available) Cluster visualization