Clustering 1 (Introduction and k-means)
COMP1942. Prepared by Raymond Wong; some parts of these notes are borrowed from LW Chan's notes. Screenshots of XLMiner captured by Kai Ho Chan. Presented by Raymond Wong.
Clustering

Name      Computer  History
Raymond   100       40
Louis     90        45
Wyman     20        95
…

Cluster 1: e.g., high score in Computer and low score in History
Cluster 2: e.g., high score in History and low score in Computer

Problem: to find all clusters.
Why Clustering?

Clustering for utility:
- Summarization
- Compression
Why Clustering?

Clustering for understanding. Applications:
- Biology: group different species
- Psychology and medicine: group medicines
- Business: group different customers for marketing
- Network: group different types of traffic patterns
- Software: group different programs for data analysis
Clustering Methods

- K-means clustering
  - Original k-means clustering
  - Sequential k-means clustering
  - Forgetful sequential k-means clustering
  - How to use the data mining tool
- Hierarchical clustering methods
  - Agglomerative methods
  - Divisive methods: polythetic approach and monothetic approach
K-means Clustering

Suppose that we have n example feature vectors x1, x2, …, xn, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. We can say that x is in cluster i if the distance from x to mi is the minimum of all the k distances.
Procedure for finding k-means

1. Make initial guesses for the means m1, m2, …, mk.
2. Until there is no change in any mean:
   a. Assign each data point to the cluster whose mean is nearest.
   b. For i from 1 to k, replace mi with the mean of all examples assigned to cluster i.
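To make the loop concrete, here is a minimal NumPy sketch of the batch procedure (this code is not part of the original notes; the function name and the choice to keep an empty cluster's old mean are our own assumptions):

```python
import numpy as np

def kmeans(X, k, rng=None):
    """Batch k-means. X is an (n, d) array of feature vectors, k < n."""
    rng = np.random.default_rng(rng)
    # Initial guesses for the means: k of the examples, chosen at random.
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    while True:
        # Assign each data point to the cluster whose mean is nearest.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace each mean with the mean of the examples assigned to it;
        # an empty cluster keeps its old mean (see "Disadvantages" below).
        new_means = np.array([X[labels == i].mean(axis=0)
                              if np.any(labels == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):   # no change in any mean: stop
            return means, labels
        means = new_means
```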
Procedure for finding k-means

[Figure: a 2-D example with k = 2 — arbitrarily choose k means; assign each object to the most similar center; update the cluster means; reassign objects; update the means again, repeating until no reassignment occurs.]
Initialization of k-means

The way to initialize the means was not specified above. One popular way is to randomly choose k of the examples. The results produced depend on the initial values for the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points.
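One way to code the "try several starting points" advice, reusing the hypothetical kmeans sketch above (again illustrative, not XLMiner's algorithm): run from several random seeds and keep the partition with the smallest total squared distance from the points to their assigned means.

```python
def kmeans_best_of(X, k, n_starts=10):
    """Run k-means from several starting points; keep the lowest-cost run."""
    best_cost, best = float("inf"), None
    for seed in range(n_starts):
        means, labels = kmeans(X, k, rng=seed)
        cost = ((X - means[labels]) ** 2).sum()   # within-cluster scatter
        if cost < best_cost:
            best_cost, best = cost, (means, labels)
    return best
```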
Disadvantages of k-means

- With a "bad" initial guess, it can happen that no points are assigned to the cluster with initial mean mi, so that mean is never updated.
- The value of k is not user-friendly, because we do not know the number of clusters before we try to find them.
Sequential k-Means Clustering

Another way to modify the k-means procedure is to update the means one example at a time, rather than all at once. This is particularly attractive when we acquire the examples over a period of time and want to start clustering before we have seen all of them. Here is a modification of the k-means procedure that operates sequentially.
Sequential k-Means Clustering

1. Make initial guesses for the means m1, m2, …, mk.
2. Set the counts n1, n2, …, nk to zero.
3. Until interrupted:
   a. Acquire the next example, x.
   b. If mi is the mean closest to x, increment ni and replace mi by mi + (1/ni)(x − mi).
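A minimal sketch of one sequential update (names are illustrative; initialize means and counts as in steps 1 and 2, then call this once per arriving example):

```python
import numpy as np

def sequential_kmeans_update(x, means, counts):
    """Move the single closest mean toward the new example x, in place."""
    i = np.linalg.norm(means - x, axis=1).argmin()   # mi closest to x
    counts[i] += 1                                   # increment ni
    means[i] += (x - means[i]) / counts[i]           # mi <- mi + (1/ni)(x - mi)
```

Note that the first example assigned to a mean replaces the initial guess entirely (ni = 1 gives mi = x); after that, mi is exactly the running average of the examples that have been closest to it.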
Forgetful Sequential k-means

This also suggests another alternative in which we replace the counts by a constant. In particular, suppose that a is a constant between 0 and 1, and consider the following variation:

1. Make initial guesses for the means m1, m2, …, mk.
2. Until interrupted:
   a. Acquire the next example, x.
   b. If mi is the mean closest to x, replace mi by mi + a(x − mi).
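The corresponding one-line change to the sequential sketch above (a is the constant; again an illustrative sketch, not part of the original notes):

```python
def forgetful_kmeans_update(x, means, a=0.1):
    """Move the closest mean a constant fraction a of the way toward x."""
    i = np.linalg.norm(means - x, axis=1).argmin()
    means[i] += a * (x - means[i])    # mi <- mi + a(x - mi)
```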
Forgetful Sequential k-means

The result is called the "forgetful" sequential k-means procedure. It is not hard to show that mi is a weighted average of the examples that were closest to mi, where the weight decreases exponentially with the "age" of the example. That is, if m0 is the initial value of the mean vector and xj is the j-th of the n examples that were used to form mi, then

mn = (1 − a)^n m0 + a Σ_{k=1}^{n} (1 − a)^(n−k) xk
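To see where this closed form comes from, unroll the update mj = (1 − a) m(j−1) + a xj (assuming, for simplicity, that all n examples were assigned to this mean):

```latex
\begin{aligned}
m_1 &= (1-a)\,m_0 + a\,x_1 \\
m_2 &= (1-a)\,m_1 + a\,x_2 = (1-a)^2 m_0 + a(1-a)\,x_1 + a\,x_2 \\
    &\;\;\vdots \\
m_n &= (1-a)^n m_0 + a \sum_{k=1}^{n} (1-a)^{n-k}\,x_k
\end{aligned}
```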
Forgetful Sequential k-means

Thus, the initial value m0 is eventually forgotten, and recent examples receive more weight than ancient examples. This variation of k-means is particularly simple to implement, and it is attractive when the nature of the problem changes over time and the cluster centres "drift".
How to use the data mining tool

We can use XLMiner to perform k-means clustering. Open "cluster.xls".
[Screenshots: opening "cluster.xls" in Excel and launching XLMiner's k-means clustering dialog. The data-source step asks for the Workbook, Worksheet, and Data range.]
[Screenshot: variable selection — "First row contains headers" is checked; the dialog reports # Cols and # Rows for the range, and the variables Name, Computer, and History appear under "Variables In Input Data".]
[Screenshot: parameters — "Normalize input data"; # Clusters: 2; # Iterations: 10.]
[Screenshot: options — choose Fixed Start or Random Starts; the dialog shows Set Seed and No. of Starts fields (values 12345 and 5 in the screenshot).]
[Screenshot: output options — "Show data summary"; "Show distances from each cluster center".]
[The remaining slides are screenshots of the resulting XLMiner clustering output.]