What is Cluster Analysis
Clustering – partitioning a data set into several groups (clusters) such that
- Homogeneity: objects belonging to the same cluster are similar to each other
- Separation: objects belonging to different clusters are dissimilar to each other
Three fundamental elements of clustering
- The set of objects
- The set of attributes
- The distance measure
Supervised versus Unsupervised Learning
Supervised learning (classification)
- Supervision: training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set
Unsupervised learning (clustering)
- Class labels of training data are unknown
- Given a set of measurements, observations, etc., need to establish the existence of classes or clusters in the data
What Is Good Clustering?
A good clustering method will produce high-quality clusters with
- high intra-class similarity
- low inter-class similarity
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Minimal requirements for domain knowledge to determine input parameters
- Able to deal with noise and outliers
- Discovery of clusters with arbitrary shape
- Insensitive to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
Application Examples
- A stand-alone tool: explore data distribution
- A preprocessing step for other algorithms
- Pattern recognition, spatial data analysis, image processing, market research, WWW, …
  - Cluster documents
  - Cluster web log data to discover groups of similar access patterns
Co-expressed Genes
[Figure: gene expression data matrix and the gene expression patterns of groups of co-expressed genes]
Why look for co-expressed genes?
- Co-expression indicates co-function
- Co-expression also indicates co-regulation
Gene-based Clustering
Examples of co-expressed genes and coherent patterns in gene expression data. Gene-based clustering regards each gene as a data object and the experimental conditions as attributes. To illustrate the problem, first look at the parallel coordinates of a time-series gene expression data set: the x-axis represents the time points, the y-axis represents the gene expression level, and the expression profile of a gene over the time series is drawn as a poly-line. At first the figure looks like a mess and nothing interesting stands out. However, there exist groups of genes in the data set whose expression levels rise and fall concordantly over the whole time series. Those genes are called co-expressed genes, and the common trend of expression levels shared by the genes in the same group is called a coherent pattern. Biologists are interested in co-expressed genes and coherent patterns because co-expressed genes may have similar functions and may be regulated by the same mechanisms, while coherent patterns may correspond to important cellular processes. The purpose of gene-based clustering is to identify co-expressed genes and coherent expression patterns in the data set. Example: Iyer's data [2].
[2] Iyer, V.R. et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.
Data Matrix
For memory-based clustering
- Also called object-by-variable structure
- Represents n objects with p variables (attributes, measures)
- A relational table
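A sketch of this structure in the usual notation, writing x_{if} for the value of variable f on object i (the notation is assumed here, not spelled out on the slide):

```latex
% n objects (rows) described by p variables (columns)
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
```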
Two-way Clustering of Microarray Data
[Table: gene-by-sample expression matrix with genes 1–10 as rows and samples 1, 2, 3, 4, … as columns; each cell holds an expression value, e.g. 0.13, 0.72, 0.1, 0.57 for gene 1 (some cells are not shown on the slide)]
Clustering genes
- Samples are attributes
- Find genes with similar function
Clustering samples
- Genes are attributes
- Find samples with similar phenotype, e.g. cancers
- Feature selection
- Informative genes
- Curse of dimensionality
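A minimal sketch of the two directions of clustering, assuming a NumPy array with genes as rows and samples as columns and using scikit-learn's KMeans (the library, data, and parameter choices are assumptions, not taken from the slide):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical expression matrix: rows = genes, columns = samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))  # 100 genes measured under 6 samples/conditions

# Clustering genes: each gene is an object, the samples are its attributes.
gene_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Clustering samples: transpose so each sample is an object and the genes
# become its attributes (this is where the curse of dimensionality appears,
# since there are far more genes than samples).
sample_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X.T)

print(gene_labels[:10], sample_labels)
```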
Dissimilarity Matrix
For memory-based clustering
- Also called object-by-object structure
- Proximities of pairs of objects
- d(i,j): dissimilarity between objects i and j
- Nonnegative
- Close to 0: similar
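A sketch of the usual layout of this structure: an n-by-n matrix of which only the lower triangle needs to be stored, since d(i,i) = 0 and d(i,j) = d(j,i):

```latex
% n-by-n dissimilarity matrix, d(i,j) = dissimilarity between objects i and j
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
```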
Distance Matrix
[Figure: the original data matrix (genes g1, g2, g3, … by samples s1, s2, s3, s4, …) is converted into a gene-by-gene distance matrix with entries D(1,2), D(1,3), D(1,4), D(2,3), D(2,4), D(3,4), …, one for each pair of genes]
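As a minimal sketch of how such a distance matrix is computed from a data matrix (the values and the choice of Euclidean distance are illustrative assumptions, not taken from the slide):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data matrix: 4 genes (objects) x 4 samples (attributes).
data = np.array([
    [0.1, 0.7, 0.1, 0.6],
    [0.3, 1.6, 1.0, 1.2],
    [0.4, 1.1, 1.0, 1.0],
    [-0.9, 1.2, 1.3, 1.1],
])

# pdist returns the condensed vector [D(1,2), D(1,3), D(1,4), D(2,3), D(2,4), D(3,4)];
# squareform expands it into the full symmetric gene-by-gene distance matrix.
D = squareform(pdist(data, metric="euclidean"))
print(np.round(D, 2))
```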
How Good Is the Clustering?
- Dissimilarity/similarity depends on the distance function
- Different applications have different functions
- Inter-cluster distance maximization
- Intra-cluster distance minimization
- Judgment of clustering quality is typically highly subjective
Types of Data in Clustering
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
Interval-valued Variables
- Continuous measurements on a roughly linear scale
- Weight, height, latitude and longitude coordinates, temperature, etc.
- Effect of measurement units in attributes: smaller unit → larger variable range → larger effect on the result (see the sketch below)
- Remedy: standardization + background knowledge
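A tiny illustration of this unit effect (the height/weight numbers are hypothetical, not from the slides): expressing weight in grams instead of kilograms inflates its range so that it dominates the Euclidean distance.

```python
import numpy as np

# Two hypothetical people described by (height in m, weight).
a_kg, b_kg = np.array([1.80, 75.0]),    np.array([1.60, 80.0])     # weight in kg
a_g,  b_g  = np.array([1.80, 75000.0]), np.array([1.60, 80000.0])  # weight in g

print(np.linalg.norm(a_kg - b_kg))  # ~5.0: height still contributes to the distance
print(np.linalg.norm(a_g - b_g))    # ~5000.0: the weight attribute swamps height entirely
```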
Standardization
Calculate the mean absolute deviation of variable f:
s_f = (1/n) (|x_{1f} − m_f| + |x_{2f} − m_f| + … + |x_{nf} − m_f|), where m_f = (1/n)(x_{1f} + x_{2f} + … + x_{nf})
The deviations are not squared, so the effect of outliers is reduced.
Calculate the standardized measurement (z-score):
z_{if} = (x_{if} − m_f) / s_f
Using the mean absolute deviation is more robust than using the standard deviation: the effect of outliers is reduced but remains detectable.
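A minimal sketch of this standardization in code (the function and variable names are mine, not from the slide):

```python
import numpy as np

def standardize(X):
    """Z-score each column of X, using the mean absolute deviation as the scale s_f."""
    m = X.mean(axis=0)              # m_f: per-variable mean
    s = np.abs(X - m).mean(axis=0)  # s_f: mean absolute deviation
    return (X - m) / s              # z_if = (x_if - m_f) / s_f

# Hypothetical data: 4 objects measured on 2 interval-scaled variables.
X = np.array([[1.80, 75.0],
              [1.60, 80.0],
              [1.75, 62.0],
              [1.90, 90.0]])
print(standardize(X))
```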
Minkowski Distance
Minkowski distance: a generalization
d(i, j) = (|x_{i1} − x_{j1}|^q + |x_{i2} − x_{j2}|^q + … + |x_{ip} − x_{jp}|^q)^(1/q),  q ≥ 1
If q = 2, d is the Euclidean distance
If q = 1, d is the Manhattan distance
Example: for x_i = (1, 7) and x_j = (7, 1), the Manhattan distance (q = 1) is 6 + 6 = 12 and the Euclidean distance (q = 2) is √(6² + 6²) = √72 ≈ 8.48
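A quick check of the example above, using SciPy's minkowski distance (the library choice is mine, not the slide's):

```python
from scipy.spatial.distance import minkowski

xi, xj = (1, 7), (7, 1)
print(minkowski(xi, xj, p=1))  # 12.0      Manhattan distance (q = 1)
print(minkowski(xi, xj, p=2))  # 8.485...  Euclidean distance (q = 2)
```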
Properties of Minkowski Distance
- Nonnegative: d(i,j) ≥ 0
- The distance of an object to itself is 0: d(i,i) = 0
- Symmetric: d(i,j) = d(j,i)
- Triangle inequality: d(i,j) ≤ d(i,k) + d(k,j), for any objects i, j, k
Major Clustering Approaches
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
Clustering Algorithms
If we “cluster” the clustering algorithms themselves:
Clustering algorithms
- Partition-based
  - Centroid-based: k-means
  - Medoid-based: PAM, CLARA, CLARANS
- Hierarchical clustering
- Density-based
- Model-based
- …
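As a concrete example of the partition-based, centroid-based family, a minimal k-means run with scikit-learn (the data and parameters here are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of 2-D points.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
               rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)               # centroids, roughly (0, 0) and (5, 5)
print(km.labels_[:5], km.labels_[-5:])   # cluster assignments for the first and last points
```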