CS5604 (Midterm Presentation) – October 13, 2010
Virginia Polytechnic Institute and State University
Presented by: Team 4 (Sarosh, Sony, Sherif)

Outline
• What is clustering?
• What is CLUTO?
• History of CLUTO
• CLUTO Schematic Diagram
• Application areas of CLUTO
• Features of CLUTO
• Relation to IR Concepts
• Input file formats in CLUTO
• Output file formats in CLUTO
• Concept Map
• Demo
• Resources
• Other Features
• Questions

What is clustering?
• Clustering algorithms group a set of documents into subsets, or clusters.
• Documents within a cluster should be as similar as possible.
• Documents in one cluster should be as dissimilar as possible from documents in other clusters.
• Clustering can be classified into:
  ◦ Flat clustering and hierarchical clustering
  ◦ Hard clustering and soft clustering

What is CLUTO?
• CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.
• CLUTO provides three different classes of clustering algorithms that operate either directly in the object’s feature space or in the similarity space.
• The algorithms are based on the partitional, agglomerative, and graph-partitioning paradigms.
• CLUTO provides a total of seven different criterion functions.
• CLUTO provides tools for analyzing the discovered clusters to understand the relations between the objects assigned to each cluster and the relations between the different clusters.

History of CLUTO
• CLUTO was developed at the Karypis Lab, University of Minnesota, Twin Cities.
• Ver 1.5 - Added agglomerative clustering algorithms, cluster visualization capability, and dense input file support.
• Ver 2.0 - Added new clustering programs called scluster and vcluster, and graph-partitioning based clustering algorithms.
• Ver 2.1 - Added an agglomerative algorithm that uses partitional clustering to bias the agglomeration.
• Ver [...] - Reduced the memory requirements of the rb-based clustering methods.
• Ver [...] - Experimental support for multi-core processors and SMPs using OpenMP for MS Windows and Linux-i686.
• Ver 2.1.2a - Included a build for Windows X86_64.

CLUTO Schematic Diagram
[Figure: CLUTO schematic diagram. Labels: scluster, vcluster, clustering algorithms, similarity function, criterion function, graph file, matrix file, cluster solution file, tree file, row label file, row class label file, column label file.]

Application areas of CLUTO
• Digital Libraries - To cluster documents (objects) based on the terms (dimensions) they contain.
• Customer Services - Amazon.com may group customers (objects) based on the types of products (books, music products: the dimensions) they purchase.
• Genetics - To cluster genes (objects) based on their expression levels (dimensions).
• Biochemistry - To cluster proteins (objects) based on the motifs (dimensions) they contain.

Features of CLUTO
• Multiple classes of clustering algorithms: partitional, agglomerative, and graph-partitioning based.
• Multiple similarity/distance functions: Euclidean distance, cosine, correlation coefficient, extended Jaccard, user-defined.
• Numerous novel clustering criterion functions and agglomerative merging schemes.
• Traditional agglomerative merging schemes: single-link, complete-link, UPGMA.
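The similarity and distance functions named above can be sketched with their textbook definitions. This is an illustrative sketch only, not CLUTO's own code; the function names are hypothetical, and vectors are assumed to be non-zero lists of equal length.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between x and y (assumes non-zero vectors)."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def euclidean(x, y):
    """Euclidean distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto) coefficient: x.y / (|x|^2 + |y|^2 - x.y)."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

def correlation(x, y):
    """Pearson correlation: cosine of the mean-centered vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])
```

Note that cosine, correlation, and extended Jaccard are similarities (larger means closer), while Euclidean distance is a dissimilarity (smaller means closer).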

• Extensive cluster visualization capabilities and output options: PostScript, SVG, GIF, xfig, etc.
• Multiple methods for effectively summarizing the clusters: most descriptive and discriminating dimensions, cliques, and frequent item sets.
• Can scale to very large datasets containing hundreds of thousands of objects and tens of thousands of dimensions.
• CLUTO provides access to its clustering and analysis algorithms via the stand-alone programs vcluster and scluster.
• vcluster takes as input the actual multi-dimensional representation of the objects to be clustered.
• scluster takes as input the similarity matrix (or graph) between these objects.
• Their overall calling sequence is as follows:
  ◦ vcluster [optional parameters] MatrixFile NClusters
  ◦ scluster [optional parameters] GraphFile NClusters
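The calling sequence above can be assembled programmatically. The sketch below only builds the argument list; the helper name `build_command` is hypothetical, and any optional parameters are passed through unchanged rather than validated, since the available options are not listed here.

```python
def build_command(program, input_file, n_clusters, options=()):
    """Build an argument list of the form:
    program [optional parameters] InputFile NClusters
    """
    if program not in ("vcluster", "scluster"):
        raise ValueError("program must be 'vcluster' or 'scluster'")
    if n_clusters < 1:
        raise ValueError("NClusters must be a positive integer")
    # Optional parameters come before the two positional arguments.
    return [program, *options, input_file, str(n_clusters)]
```

If CLUTO is installed and on the PATH, the resulting list could be handed to `subprocess.run(cmd, check=True)`; a matrix file goes with vcluster and a graph file with scluster.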

Relation to IR Concepts

Chapter      Clustering concept
Chapter 3    Jaccard coefficient
Chapter 6    Cosine similarity measure, Euclidean distance
Chapter 7    Cluster pruning
Chapter 16   Flat clustering
Chapter 17   Hierarchical clustering

Input file formats in CLUTO

Matrix Format:
• This is the primary input for CLUTO’s vcluster program.
• Each row of the matrix represents a single object.
• Columns correspond to the dimensions (i.e., features) of the objects.
• The matrix format can be sparse or dense.

Dense Matrix Format:
• The first line of the matrix file contains exactly two numbers, both of which are integers. The first integer is the number of rows in the matrix (n) and the second integer is the number of columns in the matrix (m).
• Each of the remaining n lines holds one row of the matrix A and contains exactly m space-separated floating point values, such that the ith value corresponds to the ith column of A.

[Figure: example matrix file with the first line annotated: number of rows, number of columns.]
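The dense matrix format described above can be sketched as a small writer. The function name `dense_matrix_text` is hypothetical; it returns the file contents as a string, with the "n m" header followed by one line per row.

```python
def dense_matrix_text(rows):
    """Serialize a list of equal-length rows in the dense matrix format:
    a header line 'n m', then n lines of m space-separated values."""
    n = len(rows)
    m = len(rows[0]) if rows else 0
    if any(len(r) != m for r in rows):
        raise ValueError("all rows must have the same number of columns")
    lines = [f"{n} {m}"]
    for r in rows:
        lines.append(" ".join(repr(float(v)) for v in r))
    return "\n".join(lines) + "\n"
```

For two objects with three features each, `dense_matrix_text([[1, 0, 2.5], [0, 3, 0]])` yields the header `2 3` followed by the lines `1.0 0.0 2.5` and `0.0 3.0 0.0`.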

Sparse Matrix Format:
• The first line contains information about the size of the matrix, while the remaining n lines contain information for each row of A. In CLUTO’s sparse matrix format, only the non-zero entries of the matrix are stored.
• The first line of the matrix file contains exactly three numbers, all of which are integers.
• The first integer is the number of rows in the matrix (n), the second integer is the number of columns in the matrix (m), and the third integer is the total number of non-zero entries in the n × m matrix.
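A minimal sketch of a writer for the sparse matrix format follows. The header line matches the description above. The per-row layout is not spelled out here, so the sketch assumes each row is a space-separated list of pairs (1-based column number followed by the value), by analogy with the sparse graph format; the function name `sparse_matrix_text` is hypothetical.

```python
def sparse_matrix_text(rows):
    """Serialize rows in a sparse layout: header 'n m nnz', then one line
    per row listing only the non-zero entries.

    Assumption: each entry is written as '<column> <value>' with columns
    numbered from 1.
    """
    n = len(rows)
    m = len(rows[0]) if rows else 0
    if any(len(r) != m for r in rows):
        raise ValueError("all rows must have the same number of columns")
    nnz = sum(1 for r in rows for v in r if v != 0)
    lines = [f"{n} {m} {nnz}"]
    for r in rows:
        pairs = [f"{j} {float(v)!r}" for j, v in enumerate(r, start=1) if v != 0]
        lines.append(" ".join(pairs))
    return "\n".join(lines) + "\n"
```

The sparse form pays off for document-term matrices, where most entries are zero and only a few terms occur in any one document.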

Graph Files:
• This is the primary input for CLUTO’s scluster program. It is a square matrix.
• It specifies the similarity between the objects to be clustered.
• A value at the (i, j) location of this matrix indicates the similarity between the ith and the jth object.

Sparse Graph Format:
• The first line of the file contains exactly two numbers, both of which are integers. The first integer is the number of vertices in the graph (n) and the second integer is the number of edges in the graph.
• The (i + 1)st line of the file contains information about the adjacency structure of the ith vertex.
• The adjacency structure of each vertex is specified as a space-separated list of pairs. Each pair contains the number of the adjacent vertex followed by the similarity of the corresponding edge.

[Figure: example graph file with the first line annotated: number of vertices, number of edges.]
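The sparse graph layout described above can be read back with a short parser. This is a sketch under stated assumptions: the function name `parse_sparse_graph` is hypothetical, vertex numbers are kept exactly as written in the file, and the edge count from the header is returned without being checked against the adjacency lists.

```python
def parse_sparse_graph(text):
    """Parse a sparse graph file given as a string.

    Returns (n, n_edges, adjacency), where adjacency[i] is the list of
    (vertex_number, similarity) pairs found on line i + 2 of the file.
    """
    lines = text.splitlines()
    n, n_edges = (int(tok) for tok in lines[0].split())
    body = lines[1:n + 1]
    if len(body) != n:
        raise ValueError("expected one adjacency line per vertex")
    adjacency = []
    for line in body:
        tokens = line.split()
        if len(tokens) % 2 != 0:
            raise ValueError("adjacency line must hold (vertex, similarity) pairs")
        adjacency.append(
            [(int(tokens[k]), float(tokens[k + 1])) for k in range(0, len(tokens), 2)]
        )
    return n, n_edges, adjacency
```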

Dense Graph Format:
• The first line of the file contains exactly one number, which is the number of vertices n of the graph.
• Each of the remaining n lines contains exactly n space-separated floating point values, such that the ith value corresponds to the similarity to the ith vertex of the graph.
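To show how the dense graph format relates to the matrix input, the sketch below turns feature rows into a dense similarity graph using cosine similarity. The function name `cosine_graph_text` is hypothetical, the four-decimal formatting is an arbitrary choice, and rows are assumed to be non-zero vectors of equal length.

```python
import math

def cosine_graph_text(rows):
    """Build dense graph file contents from feature rows: a header line
    with n, then n lines of n pairwise cosine similarities."""
    n = len(rows)
    norms = [math.sqrt(sum(v * v for v in r)) for r in rows]
    lines = [str(n)]
    for i in range(n):
        sims = []
        for j in range(n):
            d = sum(a * b for a, b in zip(rows[i], rows[j]))
            sims.append(d / (norms[i] * norms[j]))
        lines.append(" ".join(f"{s:.4f}" for s in sims))
    return "\n".join(lines) + "\n"
```

The result is a symmetric n × n matrix with ones on the diagonal, which is the kind of input the similarity-space program expects.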

Output file formats in CLUTO
• Clustering Solution File
  ◦ The clustering file of a matrix with n rows consists of n lines with a single number per line. The ith line of the file contains the number of the cluster that the ith object/row/vertex belongs to. Cluster numbers run from zero to the number of clusters minus one.
• Tree File
  ◦ The tree produced by performing a hierarchical agglomerative clustering on top of the k-way clustering solution produced by vcluster is stored in a file in the form of a parent array.
  ◦ The ith line contains the parent of the ith node of the tree.
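Both output formats above are simple enough to read with a few lines of code. The sketch below is illustrative: the function names are hypothetical, objects and nodes are indexed from 0 in the order their lines appear, and a negative parent value is assumed to mark the root of the tree, which the description above does not state.

```python
from collections import defaultdict

def read_clustering(text):
    """Group object indices by cluster number (one number per line)."""
    labels = [int(tok) for tok in text.split()]
    clusters = defaultdict(list)
    for obj, cluster in enumerate(labels):
        clusters[cluster].append(obj)
    return dict(clusters)

def children_from_parents(parents):
    """Turn a parent array into (roots, children-by-node).

    Assumption: a negative parent value marks a root node.
    """
    children = defaultdict(list)
    roots = []
    for node, parent in enumerate(parents):
        if parent < 0:
            roots.append(node)
        else:
            children[parent].append(node)
    return roots, dict(children)
```

For a clustering file containing the lines 0, 1, 0, 2, `read_clustering` groups objects 0 and 2 into cluster 0, object 1 into cluster 1, and object 3 into cluster 2.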

Concept Map

Demo

 Using WinSCP

Resources
• Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An efficient clustering algorithm for large databases. In Proc. of 1998 ACM-SIGMOD Int. Conf. on Management of Data,
• G. Karypis, E.H. Han, and V. Kumar. Chameleon: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68–75,
• G. Karypis and V. Kumar. hMETIS 1.5: A hypergraph partitioning package. Technical report, Department of Computer Science, University of Minnesota, Available on the WWW at URL
• G. Karypis and V. Kumar. METIS 4.0: Unstructured graph partitioning and sparse matrix ordering system. Technical report, Department of Computer Science, University of Minnesota, Available on the WWW at URL
• Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. In CIKM,
• Ying Zhao and George Karypis. Criterion functions for document clustering: Experiments and analysis. Technical Report TR #01–40, Department of Computer Science, University of Minnesota, Minneapolis, MN, Available on the WWW at

Other Features
• gCLUTO
  ◦ A cross-platform graphical application for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.
  ◦ gCLUTO provides tools for visualizing the resulting clustering solutions using tree, matrix, and OpenGL-based mountain visualizations.

• wCLUTO
  ◦ A web-enabled data clustering application designed for the clustering and data-analysis requirements of gene-expression analysis.
  ◦ Users can upload their datasets, select from a number of clustering methods, perform the analysis on the server, and visualize the final results.
  ◦ The wCLUTO web server is hosted by the Center of Computational Genomics and Bioinformatics at the University of Minnesota.