Published by Alexia Barnett. Modified over 9 years ago.
1
Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)
An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data
Presenter: Cheng-Han Tsai
Authors: Liang Bai, Jiye Liang, Chuangyin Dang
KBS, 2011
2
Outline
Motivation, Objectives, Methodology, Experiments, Conclusions, Comments
3
Motivation
The k-modes algorithm is sensitive to the initial cluster centers and requires the number of clusters to be specified in advance.
There is no guarantee that the number of clusters we select is the best one.
4
Objectives
To propose an initialization method that finds both the initial cluster centers and the number of clusters.
The method can efficiently handle large categorical data sets in linear time.
5
Methodology
Flowchart: Data set → Construct a potential exemplar set S → Set the estimated number of clusters → k-modes-type algorithm → The clustering result
6
Methodology
The k-modes algorithm
Hamming distance: the number of positions at which two codes differ (computed with XOR)
Example:
    10001001
XOR 10110001
------------
    00111000 → Hamming distance = 3
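The XOR example and the role of the initial centers can be sketched in code. This is a minimal illustration, not the authors' implementation: the function names (`hamming_bits`, `matching_dissimilarity`, `cluster_mode`, `k_modes`) are hypothetical, and the loop is the textbook k-modes scheme (assign each object to the nearest mode, then recompute the modes), assuming the initial centers are supplied by the caller.

```python
from collections import Counter


def hamming_bits(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings, via XOR."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def matching_dissimilarity(x, y) -> int:
    """Number of attributes on which two categorical objects differ."""
    return sum(xi != yi for xi, yi in zip(x, y))


def cluster_mode(objects):
    """Attribute-wise most frequent value of a non-empty cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))


def k_modes(data, centers, max_iter=100):
    """Textbook k-modes loop starting from caller-supplied initial centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest center.
        labels = [
            min(range(len(centers)),
                key=lambda j: matching_dissimilarity(x, centers[j]))
            for x in data
        ]
        # Update step: each center becomes the mode of its members;
        # an empty cluster keeps its previous center.
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, label in zip(data, labels) if label == j]
            new_centers.append(cluster_mode(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

Calling `hamming_bits("10001001", "10110001")` returns 3, matching the slide. Because `k_modes` starts from whatever `centers` it is given, different starting centers can lead to different final clusterings, which is the sensitivity the presentation sets out to address.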
7
Methodology
New cluster centers initialization method
Finding the number of clusters
8
Methodology
New cluster centers initialization method
12
Methodology
Finding the number of clusters
─ We need to input a value k’, which is an estimated number of clusters
─ If k’ can’t be determined, we set k’ = |S|
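The fallback rule for k’ can be written as a small helper. This is only a sketch: `resolve_k_prime` and its parameters are hypothetical names, and the check that k’ is positive is an added assumption rather than something stated on the slide.

```python
def resolve_k_prime(exemplars, k_prime=None):
    """Return the estimated number of clusters k'.

    `exemplars` stands for the potential exemplar set S.
    If no estimate is supplied, fall back to k' = |S|.
    """
    if k_prime is None:
        return len(exemplars)
    if k_prime < 1:
        # Assumption: a usable estimate must be a positive integer.
        raise ValueError("k_prime must be at least 1")
    return k_prime
```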
15
Methodology
─ The function P(k) may have more than one knee point
─ The function C(k) may have more than one peak
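When a curve has several peaks, each one gives a candidate number of clusters. The sketch below lists every strict local maximum of a sampled curve. It is a generic peak finder, not the paper's procedure: the definition of C(k) does not survive in this transcript, `local_peaks` is a hypothetical helper, and the values in the usage example are made up purely for illustration.

```python
def local_peaks(ks, values):
    """Return the k at which `values` has a strict local maximum.

    `ks` and `values` are parallel sequences: values[i] is the curve
    evaluated at ks[i]. The two end points are never reported as peaks.
    """
    if len(ks) != len(values):
        raise ValueError("ks and values must have the same length")
    peaks = []
    for i in range(1, len(values) - 1):
        if values[i] > values[i - 1] and values[i] > values[i + 1]:
            peaks.append(ks[i])
    return peaks


# Illustrative values only; they do not come from the paper.
candidate_ks = local_peaks([2, 3, 4, 5, 6, 7],
                           [0.1, 0.5, 0.2, 0.3, 0.9, 0.4])
```

With these made-up values the curve rises and falls twice, so `candidate_ks` holds two candidates, 3 and 6.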
16
Experiments
Performance analysis
─ Soybean data (4 diseases)
─ Lung cancer data (3 classes)
─ Zoo data (7 classes: 3 large clusters and 4 small clusters)
─ Mushroom data (2 classes)
Scalability analysis
17
Experiments
Performance analysis
19
Experiments
Scalability analysis
─ 67,557 data points and 42 categorical attributes
20
Conclusions
The proposed method is effective and efficient at obtaining good initial cluster centers and the number of clusters.
The analysis shows that its time complexity is linear.
21
Comments
Advantages
─ Improves on earlier approaches to setting the two parameters (the initial cluster centers and the number of clusters)
Applications
─ Data clustering