
1 KDD: Part II Clustering
Dae-Won Kim School of Computer Science & Engineering Chung-Ang University

2 What is Data Mining? Too much data and not enough information — this is a problem facing many businesses and industries. A solution lies here, with data mining. Most businesses have an enormous amount of data, with a great deal of information hiding within it, but "hiding" is usually exactly what it is doing: So much data exists that it overwhelms traditional methods of data analysis. Data mining provides a way to get at the information buried in the data. Data mining finds hidden patterns in large, complex collections of data, patterns that elude traditional statistical approaches to analysis. - Oracle Data Mining Solution

3 Issues Classification vs. Clustering vs. Rule Mining

4 What is Classification?
Can you tell me the name of this fish?

5 What is Classification?
Construct a classifier for making a decision. Fish: x^T = (x1, x2) = (Lightness, Width)
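A minimal sketch of such a classifier, assuming made-up (lightness, width) measurements for two fish species; the nearest-mean rule here is only a stand-in for whatever decision rule a full course would build:

```python
import numpy as np

# Hypothetical training data: x = (lightness, width) per fish (made-up values)
class_a = np.array([[3.0, 10.2], [3.4, 11.0], [2.9, 10.5]])  # species A
class_b = np.array([[6.1, 13.5], [5.8, 14.0], [6.5, 13.2]])  # species B

# Nearest-mean classifier: assign a new fish to the class with the closer mean
mean_a, mean_b = class_a.mean(axis=0), class_b.mean(axis=0)

def classify(x):
    x = np.asarray(x)
    return "A" if np.linalg.norm(x - mean_a) < np.linalg.norm(x - mean_b) else "B"

print(classify([3.1, 10.8]))  # -> "A"
```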

6 Classification Definition
“The act of taking in raw data and taking an action based on the category of the pattern.”
“Build a machine that can recognize or predict patterns: character, speech, face, cancer, protein, DNA sequence, etc.”
An example: predict patients by cancer subtype or treatment
Leukemia (Golub et al., Science, 1999): 2-class problem, 38 training samples, 34 test samples, 6,817 genes

7 What is Clustering? Cluster analysis discovers (b) from (a)

8 What is Clustering?

9 Clustering vs. Classification
Clustering (cluster analysis): unsupervised pattern classification; no training patterns, no prior knowledge; discovers homogeneous groups in data based on proximity.
Classification (discriminant analysis): supervised pattern classification; labeled training patterns, so the groups are known a priori; constructs rules for classifying new data into the known groups.

10 Interchangeable Terms
Cluster analysis is the preferred generic term:
- clustering in computer science
- numerical taxonomy in biology
- Q-analysis in psychology
- segmentation in market research
Cluster is the preferred generic term; group and class are also often used.
Proximity is the preferred generic term; (dis)similarity and distance are also often used.

11 Image Segmentation

12 Gene Function Prediction

13 More Examples: Intrusion Detection Systems, Atopic Dermatitis

14 Notation
Given a data set: an ‘n x d’ pattern (or data) matrix, with ‘n’ data patterns (objects, observations, vectors) as rows and ‘d’ variables (features, attributes, dimensions, fields) as columns.
Example: air pollution in US cities

City     SO2   TEMP   WIND   DAYS
Phoenix   10   70.3    6.0     36
Miami          75.5    8.8    128
Seattle   29   51.1    9.4    164
Detroit        49.9    8.4    113

Other n x d examples: genes x time points (e.g., Gene_1 = (2.1, 3.6, …, -2.6)), samples x time points, genes x experimental conditions, EEG channels x time points.
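In code, the n x d pattern matrix maps naturally onto a 2-D array, one row per object and one column per variable; a sketch using the city table above, with NaN marking missing values:

```python
import numpy as np

# n x d pattern matrix: rows = cities (objects), columns = variables
# Columns: SO2, TEMP, WIND, DAYS; np.nan marks missing values
X = np.array([
    [10.0,   70.3, 6.0,  36],   # Phoenix
    [np.nan, 75.5, 8.8, 128],   # Miami
    [29.0,   51.1, 9.4, 164],   # Seattle
    [np.nan, 49.9, 8.4, 113],   # Detroit
])
print(X.shape)  # (n, d) = (4, 4)
```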

15 Q: cluster data into three groups
Observations: A = (1, 1), B = (3, 1), C = (3, 2), D = (9, 2), E = (11, 2), F = (9, 6), G = (11, 6)
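One way to check an answer, sketched with SciPy's hierarchical clustering utilities (the single-link tree is cut at three clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [3, 1], [3, 2],   # A, B, C
                   [9, 2], [11, 2],          # D, E
                   [9, 6], [11, 6]])         # F, G

Z = linkage(points, method="single")             # single-link merge sequence
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 groups
print(labels)  # e.g. A,B,C share one label; D,E another; F,G the third
```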

16 Hierarchical Clustering Algorithm
Basic Algorithms for Cluster Analysis Hierarchical Clustering Algorithm Dae-Won Kim School of Computer Science and Engineering CAU

17 Idea
(figure: the two nearest points, A and B, merge into a cluster AB)

18 Hierarchical Algorithm
1. Start with each point as its own cluster.
2. At each iteration, merge the two clusters with the smallest distance.
(figure: data points plotted in feature space)

19 Hierarchical Algorithm
Eventually all points are linked into a single cluster.
(figure: data points plotted in feature space)

20 Hierarchical Algorithm
The sequence of mergers is represented as a hierarchical tree (dendrogram).
(figure: dendrogram over points a–g)
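A compact from-scratch sketch of this agglomerative loop, assuming the single-link rule (distance between clusters = distance of their closest pair of points):

```python
import numpy as np

def agglomerative(points, num_clusters=1):
    """Merge the two closest clusters until num_clusters remain.

    Single-link cluster distance; clusters kept as sets of point indices.
    Returns the list of clusters (index sets).
    """
    pts = np.asarray(points, dtype=float)
    clusters = [{i} for i in range(len(pts))]        # start: each point alone
    while len(clusters) > num_clusters:
        best = None                                  # (distance, a, b) of closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(pts[i] - pts[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]                   # merge cluster b into a
        del clusters[b]
    return clusters
```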

21 Hierarchical Algorithm
Agglomerative (bottom-up: merge clusters) vs. divisive (top-down: split clusters)

22 Example: observations of students’ heights and weights
Data 1 = (180, 70), Data 2 = (180, 71), Data 3 = (180, 73), …

23 Example
(figure: initial distance matrix over the six students)

24 Example: Single-Link Method
Step 1: group (1,2) identified
Step 2: (1,2,3) identified
Step 3: (4,5) identified
Step 4: (4,5,6) identified
(figure: the distance matrix is updated after each merge)

25 Example
Step 5: (1,2,3) and (4,5,6) are merged; the hierarchy is finalized.
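The bookkeeping behind Steps 1–5 is a distance-matrix update: when clusters i and j merge, the single-link distance from the merged cluster to any other cluster k is simply the smaller of the two old distances, min(d(i,k), d(j,k)). A sketch of that update:

```python
import numpy as np

def single_link_merge(D, i, j):
    """Merge clusters i and j in distance matrix D (single-link update).

    New distance from the merged cluster to every other cluster k is
    min(D[i, k], D[j, k]); row/column j is then removed.
    """
    D = D.copy()
    D[i, :] = np.minimum(D[i, :], D[j, :])   # merged cluster lives at index i
    D[:, i] = D[i, :]                        # keep the matrix symmetric
    D[i, i] = 0.0
    return np.delete(np.delete(D, j, axis=0), j, axis=1)
```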

26 Variant: distance between two groups
Single-link: based on the distance between the two closest elements of the two groups.
(figure: element-wise distance vs. group-wise distance)

27 Variant: Average-Link Method
Average-link: based on the average of all pairwise distances between the two groups.
(figure: element-wise distance vs. group-wise distance)
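With a library such as SciPy, these variants differ only in the `method` argument; a sketch comparing the merge heights they produce on the same toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).random((10, 2))  # toy 2-D data

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)
    # Z[:, 2] holds the distance at which each merge happened
    print(method, "final merge height:", Z[-1, 2].round(3))
```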

28 Quiz: Find two groups using
1) the hierarchical algorithm with single-link, and 2) the hierarchical algorithm with average-link.
(figure: distances between the data points)

29 Example
Student 1 = (180, 70), Student 2 = (176, 70), …
(figures: the initial distance matrix and its updates for the typical (single-link) and variant (average-link) answers)

30 K-Means Clustering Method
Basic Algorithms for Cluster Analysis K-Means Clustering Method Dae-Won Kim School of Computer Science and Engineering CAU

31 Review: Hierarchical 1. Hierarchical algorithm
“Yields a dendrogram representing the nested grouping and the similarity levels at which groupings change.”
2. Three popular schemes
“Two clusters are merged into a larger cluster based on a minimum-distance criterion.”
1) single-link: minimum distance over all pairs of patterns from the two clusters
2) complete-link: maximum distance
3) average-link: average distance
(figure: clusters A and B)

32 Review: Hierarchical
3. Single-link vs. complete-link
Single-link is more versatile than complete-link.
Complete-link does not suffer from the “chaining effect” that single-link is prone to.
(figure: single-link vs. complete-link results)

33 Review: Hierarchical
4. Pros and cons
More versatile than partitional algorithms
The dendrogram provides a visual inspection of the grouping structure
Computationally prohibitive: O(n²)
Cannot repair faulty merges made in previous steps
May produce large chunks of clusters
Most widely used, due to ease of use and visualization
The average-link algorithm has shown superior performance
In some reports, however, its performance was close to random

34 Review: Hierarchical
5. Graph-theoretic clustering is a hierarchical clustering approach
- Single-link clusters are subgraphs of the minimum spanning tree
- Complete-link clusters are maximal complete subgraphs (cliques)
(figure: using the minimal spanning tree to form clusters)
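A sketch of the MST view of single-link, assuming SciPy's graph routines: build the minimum spanning tree of the distance graph, delete its k−1 longest edges, and read the clusters off as connected components:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(points, k):
    """Single-link k-clustering via the minimum spanning tree.

    Cutting the k-1 longest MST edges leaves k connected components,
    which coincide with the single-link k-cluster solution
    (ties in edge weights may cut a few extra edges in this sketch).
    """
    D = squareform(pdist(points))                 # full distance matrix
    mst = minimum_spanning_tree(D).toarray()      # n-1 edges, rest zero
    edges = np.sort(mst[mst > 0])                 # edge weights, ascending
    threshold = edges[-(k - 1)] if k > 1 else np.inf
    mst[mst >= threshold] = 0                     # cut the longest edges
    _, labels = connected_components(mst, directed=False)
    return labels
```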

35 Review: Hierarchical
Tip. To speed up the implementation, use a dissimilarity matrix and an indexing structure.
(figure: the sequence of dissimilarity-matrix updates from the earlier example)

36 K-Means Clustering Algorithm
Fast clustering
Cluster centers represent each cluster
Each cluster center is the average of its members
Each pattern is assigned using its distance to the cluster centers

37 K-Means Clustering Algorithm
1. Objective function of the K-means algorithm
“Yields a single partition of the data at each iteration, using the distance between patterns and centroids, leading to intracluster compactness and intercluster separation.”
2. Incremental greedy procedure
1) Select k initial centroids
2) Assign each pattern to its closest cluster centroid
3) Update the cluster centroids
4) Repeat steps 2–3 until there is no improvement in J(X, k)
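Written out explicitly (the slide leaves J(X, k) implicit; this is the standard within-cluster sum-of-squares form):

```latex
J(X, k) = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \lVert \mathbf{x}_i - \mathbf{c}_j \rVert^2,
\qquad
\mathbf{c}_j = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i
```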

38 Procedure
Step 1. Guess the K centers at random
Step 2. Classify the data into the K groups (by nearest center)
Step 3. Update the centers
Step 4. If there is no change in the centers, stop; otherwise go to Step 2
(figure: two centers, center1 and center2, moving among the data points)
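A minimal NumPy sketch of Steps 1–4, with random initial centers and stopping when the assignments no longer change:

```python
import numpy as np

def kmeans(X, k, seed=0):
    """Plain K-means (Steps 1-4 above): returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # Step 1: random guess
    labels = None
    while True:
        # Step 2: classify each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return labels, centers                           # Step 4: no change
        labels = new_labels
        # Step 3: update each center to the mean of its members
        # (an empty 'dead' cluster would yield NaN here; see the tip on slide 43)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
```

Run on the eight points of the example two slides below (A–H), this sketch typically converges to the partition {A, B, C, D} vs. {E, F, G, H}, matching the hand trace.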

39 Key Steps Classify data: assign each datum to the closest group
Update centers: compute the average of the data in each group
(figure: new c1 = the mean of the four points assigned to c1)

40 Example: Cluster the data into K=2 groups
1. Initialize: G1-center = A = (1, 1), G2-center = B = (2, 1)
2. Classify data: G1 = {A, C}, G2 = {B, D, E, F, G, H}
   Update centers: G1-center = (1.0, 1.5), G2-center = (3.67, 3.5)
   Check stop: change (yes) -> continue
3. Classify data: G1 = {A, B, C, D}, G2 = {E, F, G, H}
   Update centers: G1-center = (1.5, 1.5), G2-center = (4.5, 4.5)
   Check stop: change (yes) -> continue
4. Classify data: G1 = {A, B, C, D}, G2 = {E, F, G, H}
   Check stop: change (no) -> stop
(figure: A = (1, 1), B = (2, 1), C = (1, 2), D = (2, 2), E = (4, 4), F = (5, 4), G = (4, 5), H = (5, 5))

41 Example: Cluster the data into K=2 groups
1. Initialize: G1-center = (0, 0), G2-center = (1, 0)
2. Classify data: G1 = {x1, x3}, G2 = {x2, x4, …, x20}
   Update centers: G1-center = (0.0, 0.5), G2-center = (5.67, 5.33)
   Check stop: change (yes) -> continue
3. Classify data: G1 = {x1, …, x8}, G2 = {x9, …, x20}
   Update centers: G1-center = (1.25, 1.13), G2-center = (7.67, 7.33)
   Check stop: change (yes) -> continue
4. Classify data: G1 = {x1, …, x8}, G2 = {x9, …, x20}
   Check stop: change (no) -> stop
(figure: 20 points x1–x20 arranged in two separated blocks)

42 Issues in K-means Pros and cons
The simplest and most commonly used algorithm
Computationally efficient: O(nks) for n patterns, k clusters, and s iterations
Tends to work well with isolated and compact clusters
Induces fixed cluster shapes, depending on the distance measure
Sensitive to the initial selection of centroids:
1) an ‘ellipsoidal’ clustering result comes from selecting {A, B, C}
2) a ‘rectangular’ clustering result comes from selecting {A, D, F}
3) density-based initial selection is popular, e.g., mountain clustering
(figure: two different clustering results when grouping 7 data points into 3 clusters)

43 Issues in K-means
Tip. Why do K-means-type algorithms remain so popular?
1) The cluster solution is affected by the choice of internal parameters; the K-means algorithm is easy to use in all applications due to its small number of parameters.
2) Mathematically sound: convergence of K-means-type algorithms in a finite number of iterations has been proved, as has local optimality of the partial optimal solution.
Tip. In implementation, a ‘dead’ cluster can arise from random initialization, so always check the number of data points belonging to each cluster in each iteration.
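A sketch of the dead-cluster check the second tip recommends, assuming one common fix (re-seeding an empty cluster at a random data point):

```python
import numpy as np

def update_centers(X, labels, centers, rng):
    """Recompute centers; re-seed any 'dead' (empty) cluster."""
    k = len(centers)
    counts = np.bincount(labels, minlength=k)     # members per cluster
    for j in range(k):
        if counts[j] == 0:                        # dead cluster detected
            centers[j] = X[rng.integers(len(X))]  # re-seed at a random point
        else:
            centers[j] = X[labels == j].mean(axis=0)
    return centers
```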

44 Other Issues in K-means
How to guess desirable initial centers for the K-means algorithm?

45 Other Issues in K-means
How to cluster uncertain data with vague boundaries?

46 Other Issues in K-means
How to determine the optimal number of clusters in the K-means algorithm?

47 Other Issues in K-means
Do K-means-type algorithms detect only sphere-shaped clusters?

48 Other Issues in K-means
How to cluster symbolic (categorical) data?

