Clustering: What is clustering? Also called “unsupervised learning”

Definition
–Assignment of a set of observations into subsets so that observations in the same subset are similar in some sense

Flat vs. Hierarchical
–Flat: clusters form a single partition with no structure among them
–Hierarchical: clusters form a tree
  - Agglomerative (bottom-up)
  - Divisive (top-down)

Data Representations for Clustering
Input data to the algorithm is usually a vector (also called a “tuple” or “record”)
Types of data:
–Numerical
–Categorical
–Boolean
Example: clinical sample data
–Age (numerical)
–Weight (numerical)
–Gender (categorical)
–Diseased? (boolean)
Must also include a method for computing similarity of, or distance between, vectors
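One way to handle records like the clinical example is a per-feature distance in the spirit of Gower's coefficient: scale numeric differences by the feature's range, and score categorical/boolean features as match-or-mismatch. The sketch below is illustrative only; `mixed_distance`, its `kinds` labels, and the `ranges` argument are hypothetical names, not from the slides.

```python
def mixed_distance(u, v, kinds, ranges):
    """Toy distance for mixed-type records (Gower-style sketch).
    kinds[i] is 'num', 'cat', or 'bool'; ranges[i] scales numeric
    features so each feature contributes a value in [0, 1]."""
    total = 0.0
    for a, b, kind, r in zip(u, v, kinds, ranges):
        if kind == 'num':
            total += abs(a - b) / r          # range-scaled numeric difference
        else:
            total += 0.0 if a == b else 1.0  # simple mismatch for cat/bool
    return total / len(u)                    # average over features
```

For example, two patients differing by 10 years of age (range 50), 10 kg of weight (range 100), and gender would be at distance (0.2 + 0.1 + 1 + 0) / 4 = 0.325.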

K-means: The Algorithm
Given a set of numeric points in d-dimensional space and an integer k, the algorithm generates k (or fewer) clusters as follows:
  Assign all points to a cluster at random
  Repeat until stable:
    Compute the centroid of each cluster
    Reassign each point to the nearest centroid
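The four-line procedure above can be sketched directly in Python. This is a minimal illustration of the slide's variant (random initial assignment, loop until no point moves), not a production implementation; note that a cluster can empty out, which is why the slide says "k (or fewer) clusters".

```python
import random

def kmeans(points, k, iters=100):
    """Sketch of the slide's algorithm; points are tuples of floats.
    May return fewer than k clusters if a cluster empties out."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Assign all points to a cluster at random.
    assign = [random.randrange(k) for _ in points]
    centroids = []
    for _ in range(iters):
        # Compute the centroid (coordinate-wise mean) of each non-empty cluster.
        clusters = [[p for p, a in zip(points, assign) if a == i]
                    for i in range(k)]
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl))
                     for cl in clusters if cl]
        # Reassign each point to its nearest centroid.
        new_assign = [min(range(len(centroids)),
                          key=lambda i: dist2(p, centroids[i]))
                      for p in points]
        if new_assign == assign:   # stable: no point changed cluster
            break
        assign = new_assign
        k = len(centroids)         # clusters that emptied are dropped
    return assign, centroids
```

Because the initial assignment is random, different runs can converge to different clusterings, which is one reason the weaknesses slide suggests trying many values of k.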

K-means: Example, k = 3

K-means: Sample Application
Gene clustering
–Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line
–Normalization allows comparisons across microarrays
–Produce clusters of genes which vary in similar ways over time
–Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway

K-means: Weaknesses
–Must choose the parameter k in advance, or try many values
–Data must be numerical and must be compared via Euclidean distance (there is a variant called the k-medians algorithm to address these concerns)
–The algorithm works best on data which contains spherical clusters; clusters with other geometry may not be found
–The algorithm is sensitive to outliers: points which do not belong in any cluster. These can distort the centroid positions and ruin the clustering

Hierarchical Clustering
–Takes as input a set of points
–Creates a tree in which the points are leaves and the internal nodes reveal the similarity structure of the points
  - The tree is often called a “dendrogram”
–The method is summarized below:
    Place all points into their own clusters
    While there is more than one cluster:
      Merge the closest pair of clusters
–The behavior of the algorithm depends on how “closest pair of clusters” is defined
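The agglomerative loop above can be sketched as follows. The `link` parameter captures the last bullet: how "closest pair of clusters" is defined (`min` over pairwise distances gives single-link, `max` gives complete-link, as discussed on the weaknesses slide). This naive O(n^3)-ish version is for illustration only.

```python
import math

def agglomerate(points, link=min):
    """Bottom-up clustering sketch; link=min is single-link,
    link=max is complete-link."""
    # Place all points into their own clusters.
    clusters = [[p] for p in points]
    merges = []  # (cluster_a, cluster_b, distance): the dendrogram's internal nodes
    # While there is more than one cluster, merge the closest pair.
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Linkage aggregates the pairwise point distances.
                d = link(math.dist(p, q)
                         for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

The returned merge list is exactly the information a dendrogram draws: which clusters joined, in what order, and at what distance.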

Hierarchical Clustering: Merging Clusters

Hierarchical Clustering: Example

Hierarchical Clustering: Sample Application
Multiple sequence alignment
–Given a set of sequences, produce a global alignment of all sequences against the others
–NP-hard
–One popular heuristic is to use hierarchical clustering
The hierarchical clustering approach
–Each cluster is represented by its consensus sequence
–When clusters are merged, their consensus sequences are aligned via optimal pairwise alignment
–The heuristic uses hierarchical clustering to merge the most similar sequences first, the idea being to minimize potential errors in the alignment
–A slightly more sophisticated version of this method is implemented by the popular clustalw program

Hierarchical Clustering: Weaknesses
–The most commonly used type, single-link clustering, is particularly greedy
  - If two points from disjoint clusters happen to be near each other, the distinction between the clusters will be lost
  - On the other hand, average- and complete-link clustering methods are biased towards spherical clusters in the same way as k-means
–Does not really produce clusters; the user must decide where to split the tree into groups
  - Some automated tools exist for this
–As with k-means, sensitive to noise and outliers

Challenges in Clustering
Similarity calculation
–Results of algorithms depend entirely on the similarity used
–Clustering systems provide little guidance on how to pick similarity
–Computing similarity of mixed-type data is hard
–Similarity is very dependent on data representation. Should one:
  - Normalize?
  - Represent one’s data numerically, categorically, etc.?
  - Cluster on only a subset of the data?
–The computer should do more to help the user figure this out!
Parameter selection
–Current algorithms require too many arbitrary, user-specified parameters
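To make the "should one normalize?" question concrete: without normalization, a feature measured in large units (e.g. weight in grams) dominates a Euclidean distance. A common fix, sketched here as a small assumed helper rather than anything from the slides, is column-wise z-scoring.

```python
import statistics

def zscore_columns(rows):
    """Column-wise z-score normalization, so no single feature
    dominates a Euclidean distance just because of its units."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [tuple((x - m) / s for x, m, s in zip(row, means, sds))
            for row in rows]
```

After this transform every column has mean 0 and (population) standard deviation 1, so each feature contributes comparably to the distance, for better or worse; that choice itself is one of the arbitrary decisions the slide is complaining about.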

Conclusion
–Clustering is a useful way of exploring data, but is still very ad hoc
–Good results are often dependent on choosing the right data representation and similarity metric
  - Data: categorical, numerical, boolean
  - Similarity: distance, correlation, etc.
–Many different choices of algorithms, each with different strengths and weaknesses
  - k-means, hierarchical, graph partitioning, etc.