Published by Rosamund Watkins. Modified over 9 years ago.
1
Data Mining Course 2007 Eric Postma Clustering
2
Overview
Three approaches to clustering
1. Minimization of reconstruction error: PCA, nlPCA, k-means clustering
2. Distance preservation: Sammon mapping, Isomap, SPE
3. Maximum likelihood density estimation: Gaussian Mixtures
3
These datasets have identical statistics up to 2nd order
4
1. Minimization of reconstruction error
5
Illustration of PCA (1) Face dataset (Rice database)
6
Illustration of PCA (2) Average face
7
Illustration of PCA (3) Top 10 Eigenfaces
8
Each 39-dimensional data item describes different aspects of the welfare and poverty of one country. 2D PCA projection
9
Non-linear PCA Using neural networks (to be discussed tomorrow)
10
2. Distance preservation
11
Sammon mapping
Given a data set X, the distance between any two samples i and j is defined as D_ij.
We consider the projection onto a two-dimensional plane in which the projected points are separated by d_ij.
Define an error function E that measures the mismatch between D_ij and d_ij.
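The slide's error function was not preserved in this transcript. As a hedged sketch, the code below evaluates the usual form of Sammon's error, E = (1 / Σ_{i<j} D_ij) · Σ_{i<j} (D_ij − d_ij)² / D_ij, which is assumed here to be the one intended. The helper names `pairwise_distances` and `sammon_stress` and the toy data are illustrative only; the gradient descent that minimizes E is not shown.

```python
import math


def pairwise_distances(points):
    """Return the full matrix of Euclidean distances between points."""
    return [[math.dist(p, q) for q in points] for p in points]


def sammon_stress(D, d):
    """Sammon's error between input distances D and projected distances d.

    Assumes D[i][j] > 0 for all i != j (no duplicate samples).
    """
    n = len(D)
    total = 0.0      # sum of input distances (normalisation term)
    weighted = 0.0   # sum of squared errors, each divided by D_ij
    for i in range(n):
        for j in range(i + 1, n):
            total += D[i][j]
            weighted += (D[i][j] - d[i][j]) ** 2 / D[i][j]
    return weighted / total


# Toy data (hypothetical): three 3D samples and a candidate 2D projection
# obtained by simply dropping the z coordinate.
X = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
Y = [(x, y) for x, y, _ in X]

stress = sammon_stress(pairwise_distances(X), pairwise_distances(Y))
print(f"Sammon stress of the projection: {stress:.4f}")
```

Dividing each squared error by D_ij gives small input distances more weight, so local structure is preserved preferentially.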
12
Sammon mapping
13
Main limitations of Sammon
The Sammon mapping procedure is a gradient descent method
Main limitation: local minima
MDS may be preferred because it finds the global minimum (being based on PCA, it is solved by eigendecomposition rather than by iterative descent)
Both methods have difficulty with “curved or curly subspaces”
14
Isomap (Tenenbaum)
1. Build a graph in which each node represents a data point
2. Compute shortest distances along the graph (e.g., Dijkstra’s algorithm)
3. Store all distances in a matrix D
4. Perform MDS on the matrix D
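The first three steps can be sketched as follows. This is a minimal illustration, not the original Isomap code: it assumes the graph connects each point to its k nearest neighbours (a later slide mentions K=7), the function names are hypothetical, and the final MDS step on D is left out.

```python
import heapq
import math


def knn_graph(points, k):
    """Connect each point to its k nearest neighbours (undirected, weighted)."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for dist, j in nearest:
            graph[i][j] = dist
            graph[j][i] = dist  # keep the graph symmetric
    return graph


def dijkstra(graph, source):
    """Shortest path lengths from source to every reachable node."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for nbr, weight in graph[node].items():
            cand = dist + weight
            if cand < best.get(nbr, math.inf):
                best[nbr] = cand
                heapq.heappush(queue, (cand, nbr))
    return best


def geodesic_matrix(points, k):
    """Matrix D of graph distances; unreachable pairs get infinity."""
    graph = knn_graph(points, k)
    n = len(points)
    D = []
    for i in range(n):
        best = dijkstra(graph, i)
        D.append([best.get(j, math.inf) for j in range(n)])
    return D


# Toy manifold (hypothetical): nine points on a unit semicircle.
points = [(math.cos(math.pi * m / 8), math.sin(math.pi * m / 8)) for m in range(9)]
D = geodesic_matrix(points, k=2)
print("Euclidean:", math.dist(points[0], points[8]), "graph:", D[0][8])
```

For the two endpoints of the semicircle the graph distance follows the arc and is therefore clearly longer than the straight-line distance of 2, which is the effect Isomap relies on.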
15
Illustration of Isomap (1) For two arbitrary points on the manifold, Euclidean distance does not always reflect similarity (cf. dashed blue line)
16
Illustration of Isomap (2) Isomap finds the appropriate shortest path along the graph (red curve, for K=7, N=1000)
17
Illustration of Isomap (3) Two-dimensional embedding (red line is the shortest path along the graph, blue line is the true distance in the embedding).
18
Illustration of Isomap (4) Isomap’s (●) ability to find the intrinsic dimensionality, compared with PCA and MDS (∆ and o).
19
Illustration of Isomap (5)
20
Illustration of Isomap (6)
21
Illustration of Isomap (7) Interpolation along a straight line
22
Stochastic Proximity Embedding
SPE algorithm: Agrafiotis, D.K. and Xu, H. (2002). A self-organizing principle for learning nonlinear manifolds. Proceedings of the National Academy of Sciences U.S.A.
23
Stress function
Output proximity between points i and j
Input proximity between points i and j
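The stress formula itself did not survive in this transcript. The sketch below assumes a Sammon-style stress, Σ_{i<j} (d_ij − r_ij)² / r_ij divided by Σ_{i<j} r_ij, with d_ij the output proximity and r_ij the input proximity, and pairs it with a simplified SPE-like update: pick a random pair, then move both points along their connecting line so that d_ij approaches r_ij, with a learning rate that decays over time. The symbols, function names, parameter values and toy data are all assumptions for illustration, and every pair is updated (no neighbourhood cutoff).

```python
import math
import random


def stress(coords, R):
    """Sammon-style stress between embedded distances and input proximities R."""
    n = len(coords)
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            num += (d - R[i][j]) ** 2 / R[i][j]
            den += R[i][j]
    return num / den


def spe(R, dim=2, cycles=100, steps=1000, lam=1.0, lam_min=0.01, eps=1e-10, seed=0):
    """Simplified stochastic proximity embedding of the proximity matrix R."""
    rng = random.Random(seed)
    n = len(R)
    coords = [[rng.random() for _ in range(dim)] for _ in range(n)]
    decay = (lam - lam_min) / cycles
    for _ in range(cycles):
        for _ in range(steps):
            i, j = rng.sample(range(n), 2)  # random pair of distinct points
            d = math.dist(coords[i], coords[j])
            # Positive scale pushes the pair apart, negative pulls it together.
            scale = lam * 0.5 * (R[i][j] - d) / (d + eps)
            for k in range(dim):
                delta = scale * (coords[i][k] - coords[j][k])
                coords[i][k] += delta
                coords[j][k] -= delta
        lam -= decay  # anneal the learning rate
    return coords


# Toy input (hypothetical): proximities of the four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
R = [[math.dist(p, q) for q in square] for p in square]

embedding = spe(R)
final_stress = stress(embedding, R)
print(f"final stress: {final_stress:.6f}")
```

Each update touches only one pair, so the cost per step does not depend on the number of points, which is consistent with the linear scaling reported on a later slide.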
24
Swiss roll data set: original 3D set; 2D embedding obtained by SPE
25
Stress as a function of embedding dimension (averaged over 30 runs)
26
Scalability (# steps for four set sizes) Linear scaling
27
Conformations of methylpropylether (atoms labelled C1, C2, C3, O4, C5)
28
Diamine combinatorial library
29
Clustering
Minimize the total within-cluster variance (reconstruction error)
k_ic = 1 if data point i belongs to cluster c, and 0 otherwise
K-means clustering
1. Random selection of C cluster centres
2. Partition the data by assigning each data point to the cluster with the nearest centre
3. The mean of each partition is the new cluster centre
A distance threshold may be used…
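The three steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: Euclidean distance, initial centres drawn from the data, iteration until the centres stop changing, and no distance threshold. The function names and the toy data are hypothetical.

```python
import math
import random


def kmeans(points, C, iterations=100, seed=0):
    """Plain k-means on a list of coordinate tuples; returns (centres, clusters)."""
    rng = random.Random(seed)
    centres = rng.sample(points, C)  # step 1: random initial centres
    clusters = [[] for _ in range(C)]
    for _ in range(iterations):
        # Step 2: assign each point to the nearest centre.
        clusters = [[] for _ in range(C)]
        for p in points:
            nearest = min(range(C), key=lambda c: math.dist(p, centres[c]))
            clusters[nearest].append(p)
        # Step 3: each centre becomes the mean of its partition
        # (an empty partition keeps its old centre).
        new_centres = [
            tuple(sum(x) / len(members) for x in zip(*members)) if members else centres[c]
            for c, members in enumerate(clusters)
        ]
        if new_centres == centres:
            break  # converged: the partition no longer changes
        centres = new_centres
    return centres, clusters


def within_cluster_error(centres, clusters):
    """Total squared distance of every point to its cluster centre."""
    return sum(
        math.dist(p, centre) ** 2
        for centre, members in zip(centres, clusters)
        for p in members
    )


# Toy data (hypothetical): two well-separated groups of four points.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centres, clusters = kmeans(data, C=2)
error = within_cluster_error(centres, clusters)
print(sorted(centres), error)
```

Changing `seed` changes the initial centres; on less well-separated data this can change the final partition.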
30
Effect of distance threshold on the number of clusters
31
Main limitation of k-means clustering
Final partitioning and cluster centres depend on initial configuration
Discrete partitioning may introduce errors
Instead of minimizing the reconstruction error, we may maximize the likelihood of the data (given some probabilistic model)
32
Neural algorithms related to k-means
Kohonen self-organizing feature maps
Competitive learning networks
33
3. Maximum likelihood
34
Gaussian Mixtures
Model the pdf of the data using a mixture of distributions
K is the number of kernels (<< # data points)
Common choice for the component densities p(x|i):
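The formulas on this slide were not preserved in the transcript. A standard way to write the mixture and a spherical Gaussian component is given below as an assumption about what was shown; d denotes the dimensionality of x and is a symbol introduced here.

```latex
\begin{align}
p(\mathbf{x}) &= \sum_{i=1}^{K} P(i)\, p(\mathbf{x} \mid i),
  \qquad \sum_{i=1}^{K} P(i) = 1, \\
p(\mathbf{x} \mid i) &= \frac{1}{(2\pi\sigma_i^{2})^{d/2}}
  \exp\!\left( -\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^{2}}{2\sigma_i^{2}} \right)
\end{align}
```

Here P(i) are the mixing priors, and μ_i and σ_i are the mean and width of kernel i, matching the parameters named in the EM illustration.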
35
Illustration of EM applied to GM model
The solid line gives the initialization of the EM algorithm: two kernels, P(1) = P(2) = 0.5, μ1 = 0.0752, μ2 = 1.0176, σ1 = σ2 = 0.2356
36
Convergence after 10 EM steps.
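A minimal sketch of what ten EM steps look like for a one-dimensional mixture of two Gaussians. The initial parameters are the ones quoted on the previous slide; the data set is hypothetical (the slide's data are not available), and the function names are illustrative.

```python
import math
import random


def gaussian(x, mu, sigma):
    """Density of a 1D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def log_likelihood(data, priors, mus, sigmas):
    """Log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(p * gaussian(x, m, s) for p, m, s in zip(priors, mus, sigmas)))
        for x in data
    )


def em_step(data, priors, mus, sigmas):
    """One EM iteration: responsibilities (E-step), then parameter update (M-step)."""
    K = len(priors)
    resp = []
    for x in data:
        joint = [priors[i] * gaussian(x, mus[i], sigmas[i]) for i in range(K)]
        total = sum(joint)
        resp.append([j / total for j in joint])  # P(i | x)
    new_priors, new_mus, new_sigmas = [], [], []
    for i in range(K):
        weight = sum(r[i] for r in resp)
        mu = sum(r[i] * x for r, x in zip(resp, data)) / weight
        var = sum(r[i] * (x - mu) ** 2 for r, x in zip(resp, data)) / weight
        new_priors.append(weight / len(data))
        new_mus.append(mu)
        new_sigmas.append(math.sqrt(var))
    return new_priors, new_mus, new_sigmas


# Hypothetical data: 50 samples around 0.2 and 50 samples around 0.8.
rng = random.Random(0)
data = [rng.gauss(0.2, 0.1) for _ in range(50)] + [rng.gauss(0.8, 0.15) for _ in range(50)]

# Initialization quoted on the slide.
priors, mus, sigmas = [0.5, 0.5], [0.0752, 1.0176], [0.2356, 0.2356]

lls = [log_likelihood(data, priors, mus, sigmas)]
for _ in range(10):
    priors, mus, sigmas = em_step(data, priors, mus, sigmas)
    lls.append(log_likelihood(data, priors, mus, sigmas))

print("priors:", priors, "means:", mus, "sigmas:", sigmas)
```

The list `lls` records the log-likelihood after each step; EM never decreases it, which makes it a convenient convergence check.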
37
Relevant literature
L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik (submitted). Dimensionality Reduction: A Comparative Review. http://www.cs.unimaas.nl/l.vandermaaten