1
Clustering and Dimensionality Reduction Brendan and Yifang April 21 2015
2
Pre-knowledge For a set A, let Pi_A(x) denote the element of A that minimizes the error ||x - a|| over a in A. We can think of the empirical risk R_n(C) = (1/n) sum_{i=1}^n ||X_i - Pi_C(X_i)||^2 as a sample version of the risk R(C) = E ||X - Pi_C(X)||^2, where Pi_C(X) is the point in C closest to X.
3
Clustering methods K-means clustering Hierarchical clustering Agglomerative clustering Divisive clustering Level set clustering Modal clustering
4
K-partition clustering In a k-partition problem, our goal is to find k points C = {c_1, ..., c_k}. We define T_j = {x : ||x - c_j|| <= ||x - c_s|| for all s != j}. So C are the cluster centers, and we partition the space into k sets T_1, ..., T_k, where T_j contains the points closer to c_j than to any other center.
5
K-partition clustering, cont'd Given the data set X_1, ..., X_n, our goal is to find C-hat = argmin_C R_n(C), where R_n(C) = (1/n) sum_{i=1}^n min_j ||X_i - c_j||^2.
6
K-means clustering
1. Choose k centers c_1, ..., c_k at random from the data.
2. Form the clusters T_1, ..., T_k, where T_j contains the points X_i for which c_j is the closest center.
3. Update each center to the mean of its cluster, c_j <- (1/n_j) sum_{X_i in T_j} X_i, where n_j denotes the number of points in T_j.
4. Repeat Steps 2 and 3 until convergence.
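Below is a minimal NumPy sketch of the Lloyd iteration described above; the synthetic data, the choice of k, and the convergence test are illustrative assumptions, not part of the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd's algorithm: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k centers at random from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Example: two well-separated spherical clusters.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
```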
7
Circular data vs. spherical data Question: Why is K-means clustering good for spherical data? (Grace)
8
Question: What is the relationship between K-means and Naïve Bayes? Answer: They have the following in common: 1. Both estimate a probability density function. 2. Both make a nearest assignment: the target point gets the closest category/label (Naïve Bayes) or the closest centroid (K-means). They differ in these respects: 1. Naïve Bayes is a supervised algorithm; K-means is an unsupervised method. 2. K-means is an iterative optimization procedure; Naïve Bayes is not. 3. K-means is like multiple runs of Naïve Bayes in which the labels are adaptively adjusted on each run.
9
Question: Why does K-means not work well for Figure 35.6, and why does spectral clustering help? (Grace) Answer: Spectral clustering maps data points in R^d to data points in R^k. Circle-shaped clusters in R^d become spherical clusters in R^k, which K-means can then handle. But spectral clustering involves a matrix decomposition, which is rather time-consuming.
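As a hedged illustration of the answer above, the scikit-learn sketch below runs K-means and spectral clustering on two concentric rings; the ring data and the parameter values (affinity, gamma, n_clusters) are assumptions made for the example, not values from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

# Two concentric rings: K-means on the raw coordinates splits them with a
# straight line, while spectral clustering separates ring from ring.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
inner = np.c_[np.cos(theta[:100]), np.sin(theta[:100])]
outer = 3 * np.c_[np.cos(theta[100:]), np.sin(theta[100:])]
X = np.vstack([inner, outer]) + 0.05 * rng.standard_normal((200, 2))

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity="rbf", gamma=5.0,
                               random_state=0).fit_predict(X)
```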
10
Agglomerative clustering Requires a pairwise distance between clusters. Three distances are commonly employed: single linkage, complete linkage (max linkage), and average linkage.
1. Start with each point in a separate cluster.
2. Merge the two closest clusters.
3. Go back to Step 2 until only one cluster remains.
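The following sketch shows the three linkages above with SciPy's hierarchical-clustering routines; the toy data and the cut level are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two loose groups in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(4, 0.5, size=(20, 2))])

# Build the merge tree under each of the three linkages named above.
Z_single   = linkage(X, method="single")    # min distance between clusters
Z_complete = linkage(X, method="complete")  # max distance between clusters
Z_average  = linkage(X, method="average")   # mean distance between clusters

# Cut the single-linkage dendrogram into 2 clusters.
labels = fcluster(Z_single, t=2, criterion="maxclust")
```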
11
An Example, Single Linkage Question: Is Figure 35.6 meant to illustrate when one type of linkage is better than another? (Brad)
12
An example, Complete Linkage
13
Divisive clustering Start with one large cluster and recursively divide the larger clusters into smaller clusters, using any feasible clustering algorithm. A divisive algorithm example:
1. Build a minimum spanning tree of the data.
2. Create a new clustering by removing the link corresponding to the largest distance.
3. Go back to Step 2.
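A sketch of the MST-based divisive procedure, assuming Euclidean distances and SciPy's sparse-graph utilities; the random data and the number of clusters are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_divisive(X, n_clusters):
    """Split X into n_clusters by repeatedly cutting the longest MST edge."""
    # Step 1: build a minimum spanning tree on the pairwise-distance graph.
    mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
    # Step 2: remove the link corresponding to the largest remaining distance,
    # once per additional cluster we want.
    for _ in range(n_clusters - 1):
        i, j = np.unravel_index(np.argmax(mst), mst.shape)
        mst[i, j] = 0.0
    # Each connected component of what remains is one cluster.
    _, labels = connected_components(mst, directed=False)
    return labels

labels = mst_divisive(np.random.randn(30, 2), n_clusters=3)
```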
14
Level set clustering For a fixed non-negative number λ, define the level set L(λ) = {x : p(x) > λ}. We decompose L(λ) into a collection of bounded, connected, disjoint sets: L(λ) = C_1 ∪ ... ∪ C_k. These connected components are the clusters.
15
Level Set Clustering, Cont'd Estimate the density function p using a KDE, giving the estimate p_h. Decide λ: fix a small number. Decide the clusters: take the connected components of the estimated level set {X_i : p_h(X_i) > λ}, e.g., with the Cuevas-Fraiman algorithm on the next slide.
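A small sketch of the estimation step, using SciPy's Gaussian KDE; taking λ as a low quantile of the estimated density is an assumed, illustrative choice rather than one prescribed by the slides.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(60, 2)),
               rng.normal(3, 0.3, size=(60, 2))])

# Estimate the density with a KDE (gaussian_kde expects shape (d, n)).
p_hat = gaussian_kde(X.T)
density = p_hat(X.T)

# Choose a small lambda, here the 10th percentile of the estimated density.
lam = np.quantile(density, 0.10)

# The estimated level set: sample points whose estimated density exceeds lambda.
level_set_points = X[density > lam]
```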
16
Cuevas-Fraiman Algorithm Set j = 0 and I = {1, ..., n}.
1. Choose a point with index in I and call it X(1). Find the nearest remaining point to X(1), call it X(2), and let r_1 = ||X(1) - X(2)||.
2. If r_1 > 2ε: X(1) forms a cluster by itself; set j <- j + 1, remove its index from I, and go to Step 1.
3. If r_1 <= 2ε, let X(3) be the remaining point closest to the set {X(1), X(2)} and let r_2 = min{||X(3) - X(1)||, ||X(3) - X(2)||}.
4. If r_2 > 2ε, the points gathered so far form a cluster: remove their indices from I, set j <- j + 1, and go back to Step 1. Otherwise add X(3) to the cluster and keep absorbing the nearest remaining point in the same way until it is farther than 2ε away; then set j <- j + 1 and go back to Step 1. Stop when I is empty.
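A hedged Python sketch of the grouping step just described: points are absorbed into the current cluster while the nearest remaining point lies within 2ε of it, and a new cluster is started otherwise. The function name and the value of ε are assumptions for illustration.

```python
import numpy as np

def cuevas_fraiman(points, eps):
    """Group points: grow the current cluster while the nearest remaining
    point is within 2*eps of it; otherwise start a new cluster."""
    remaining = list(range(len(points)))
    labels = np.empty(len(points), dtype=int)
    j = 0
    while remaining:
        # Start a new cluster from an arbitrary remaining point.
        cluster = [remaining.pop(0)]
        grew = True
        while grew and remaining:
            # Distance from every remaining point to the current cluster.
            d = np.array([min(np.linalg.norm(points[i] - points[c])
                              for c in cluster) for i in remaining])
            nearest = int(np.argmin(d))
            if d[nearest] <= 2 * eps:
                cluster.append(remaining.pop(nearest))
            else:
                grew = False
        labels[cluster] = j
        j += 1
    return labels

# In practice `points` would be the level-set points from the previous sketch.
labels = cuevas_fraiman(np.random.randn(50, 2), eps=0.25)
```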
17
An example Question: Can you give an example to illustrate level set clustering? (Tavish) [Figure: twelve numbered example points.]
18
Modal Clustering A point x belongs to T_j if and only if the steepest ascent path beginning at x leads to the mode m_j. Finally, each data point is clustered with its closest mode. However, p may not have a finite number of modes, so a refinement is introduced: p_h, a smoothed-out version of p obtained with a Gaussian kernel.
19
Mean shift algorithm
1. Choose a number of starting points x_1, ..., x_N. Set t = 0.
2. Let t = t + 1. For j = 1, ..., N set x_j^(t) = sum_i X_i K_h(x_j^(t-1) - X_i) / sum_i K_h(x_j^(t-1) - X_i), the kernel-weighted average of the data around x_j^(t-1), where K_h is a kernel with bandwidth h.
3. Repeat until convergence.
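A minimal NumPy sketch of the update in Step 2, assuming a Gaussian kernel; the bandwidth h, the tolerance, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def mean_shift(X, h=0.5, max_iter=200, tol=1e-6):
    """Move each starting point toward a mode of the smoothed density."""
    points = X.astype(float).copy()   # start the trajectories at the data points
    for _ in range(max_iter):
        shifted = np.empty_like(points)
        for j, x in enumerate(points):
            # Gaussian kernel weight of every data point relative to x.
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
            # Mean shift update: kernel-weighted average of the data.
            shifted[j] = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.max(np.linalg.norm(shifted - points, axis=1)) < tol:
            break
        points = shifted
    return points  # each row has (approximately) converged to a mode

modes = mean_shift(np.vstack([np.random.randn(50, 2),
                              np.random.randn(50, 2) + 4]))
```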
20
Question: Can you point out the differences and similarities between the clustering algorithms? (Tavish) Can you compare the pros and cons of the clustering algorithms, and what are the suitable situations for each of them? (Sicong) What is the relationship between the clustering algorithms, and what assumptions do they make? (Yuankai) Answer: K-means Pros: 1. Simple and very intuitive; applicable to almost any scenario and any dataset. 2. A fast algorithm. Cons: 1. Does not work well for data with varying density.
21
Contour matters K-means Cons: 2. Does not work well when the data groups have special contours.
22
K-means Cons: 3. Does not handle outliers well. 4. Requires K to be specified in advance.
23
Hierarchical clustering Pros: 1. Its result contains clusterings at every level of granularity; any number of clusters can be obtained by cutting the dendrogram at the corresponding level. 2. It provides a dendrogram, which can be visualized as a hierarchical tree. 3. Does not require a specified K. Cons: 1. Slower than K-means. 2. It is hard to decide where to cut off the dendrogram.
24
Level set clustering Pros: 1. Works well when data groups have special contours, e.g., circles. 2. Handles outliers well, because it works from an estimated density function. 3. Handles varying density well. Cons: 1. Even slower than hierarchical clustering: KDE is O(n^2), and the Cuevas-Fraiman algorithm is also O(n^2).
25
Question: Does K-means clustering guarantee convergence? (Jiyun) Answer: Yes. Its time complexity is upper bounded by O(n^4). Question: In the Cuevas-Fraiman algorithm, does the choice of the starting vertex matter? (Jiyun) Answer: No, the choice of starting vertex does not matter. Question: Does the choice of the starting points x_j in the mean shift algorithm matter? Answer: No. The x_j converge to the modes during the iterative process, so the initial values do not matter.
26
Dimension Reduction
27
Motivation
28
Question – Dimension Reduction Benefits Dimensionality reduction aims at reducing the number of random variables in the data before processing. However, this seems counterintuitive, as it can remove distinct features from the data set, leading to poor results in succeeding steps. So how does it help? - Tavish The implicit assumption is that our data contains more features than are useful or necessary (e.g., features that are highly correlated or purely noise). This is common in big data, and common when data is recorded naively. Reducing the number of dimensions produces a more compact representation and helps with the curse of dimensionality. Some methods (e.g., manifold-based ones) avoid loss.
29
Principal Component Analysis (PCA)
30
Question – Linear Subspace In Principal Component Analysis, the data is projected onto linear subspaces. Could you explain a bit about what a linear subspace is? - Yuankai A linear subspace is a subset of a higher-dimensional vector space that is itself a vector space, e.g., a line or plane through the origin.
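A small NumPy example of the idea: projecting points in R^2 onto a one-dimensional linear subspace (a line through the origin). The particular direction vector is an arbitrary assumption; PCA instead chooses the direction of maximum variance.

```python
import numpy as np

# A 1-D linear subspace of R^2: all scalar multiples of one direction vector.
v = np.array([1.0, 1.0])
v = v / np.linalg.norm(v)          # unit basis vector for the subspace

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))  # points in R^2

# Orthogonal projection of each point onto the subspace spanned by v.
coords = X @ v                     # 1-D coordinate of each point along v
X_proj = np.outer(coords, v)       # projected points, still expressed in R^2
```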
31
Example – Linear Subspace
32
Example – Subspace projection
33
PCA Objective
34
PCA Algorithm
35
Question – Choosing a Good Dimension
37
Example – PCA: d=2, k=1
38
Multidimensional Scaling
39
Example – Multidimensional Scaling
40
Kernel PCA
41
The Kernel Trick
42
Kernel PCA Algorithm
43
Local Linear Embedding (LLE)
44
Question - Manifolds I wanted to know what exactly "manifold" refers to. – Brad "A manifold is a topological space that is locally Euclidean" – Wolfram. For example, the Earth appears flat on a human scale, but we know it's roughly spherical. Maps are useful because they preserve the surface features despite being a projection.
45
Example – Manifolds
46
LLE Algorithm
47
LLE Objective
48
Example – LLE Toy Examples
49
Isomap Similar to LLE in its preservation of the original structure; it provides a "manifold" representation of the higher-dimensional data. It assesses object similarity differently: the distance metric is the path length in a neighborhood graph, which approximates geodesic distance along the manifold. It constructs the low-dimensional mapping differently, using metric multidimensional scaling.
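As a hedged illustration, the scikit-learn sketch below applies Isomap to a synthetic Swiss roll; the dataset, n_neighbors, and n_components are assumptions chosen for the example.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A 3-D Swiss roll is a 2-D manifold embedded in R^3.
X, color = make_swiss_roll(n_samples=1000, random_state=0)

# Isomap: build a k-nearest-neighbour graph, take shortest-path lengths on it
# as (approximate) geodesic distances, then embed with metric MDS.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
# embedding has shape (1000, 2): the "unrolled" coordinates.
```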
50
Isomap Algorithm
51
Laplacian Eigenmaps
52
Estimating the Manifold Dimension
53
Estimating the Manifold Dimension Cont.
54
Principal Curves and Manifolds
55
Principal Curves and Manifolds Cont.
56
Random Projection
57
Question – Making a Random Projection
58
Question – Distance Randomization