Slide 1: Title

Unsupervised pattern recognition models for mixed feature-type symbolic data
Francisco de A.T. de Carvalho, Renata M.C.R. de Souza
Pattern Recognition Letters, Vol. 31, 2010, pp. 430–443.
Presenter: Wei-Shen Tai, 2010/3/10
Intelligent Database Systems Lab, National Yunlin University of Science and Technology
Slide 2: Outline

- Introduction
- Dynamic clustering algorithms for mixed feature-type symbolic data
- Cluster interpretation
- Experimental evaluation
- Conclusions and remarks
- Comments
Slide 3: Motivation

- Partitioning dynamic clustering algorithms are well established, but none of the earlier dynamic clustering models can handle mixed feature-type symbolic data.
Slide 4: Objective

- Dynamic clustering methods for mixed feature-type symbolic data, based on suitable adaptive squared Euclidean distances.
- A preprocessing step that homogenizes the mixed feature-type symbolic data into histogram-valued symbolic data before clustering.
Slide 5: Partitioning dynamic clustering

- Iterative two-step relocation algorithms: at each iteration, construct the clusters and identify a suitable representative (prototype) for each cluster, optimizing a criterion that measures the fit between the clusters and their prototypes.
- Adaptive dynamic clustering algorithms: the distances used to compare objects with cluster prototypes can differ from one cluster to another.
Slide 6: Data homogenization pre-processing

Three kinds of symbolic variables are recoded into a common histogram-valued representation:
- Set-valued and list-valued variables
- Ordered list-valued variables
- Interval-valued variables
Slide 7: Interval-valued variables

Example: X1 records the minimum and the maximum of the gross national product (in millions). An interval is recoded as a weight vector over the set of elementary intervals, each weight being the fraction of the interval's length covered by that elementary interval (sketched below).

Country 1, X1 = [10, 30], with elementary intervals I1 = [10, 25] and I2 = [25, 30]:
- I1: l([10, 25] ∩ [10, 30]) / l([10, 30]) = 15/20 = 0.75
- I2: l([25, 30] ∩ [10, 30]) / l([10, 30]) = 5/20 = 0.25
- Q2 = 0.75 + 0.25 = 1.0 (the weights sum to 1)
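A minimal Python sketch of this recoding, assuming (consistently with the slide's example) that the elementary intervals are built from the sorted distinct bounds of all observed intervals; the function names are illustrative, not from the paper.

```python
def elementary_intervals(intervals):
    """Build the elementary intervals from the bounds of all observed intervals."""
    bounds = sorted({b for lo, hi in intervals for b in (lo, hi)})
    return list(zip(bounds[:-1], bounds[1:]))

def interval_to_weights(interval, elem):
    """Weight = fraction of the interval's length covered by each elementary interval."""
    lo, hi = interval
    length = hi - lo
    return [max(0.0, min(hi, b) - max(lo, a)) / length for a, b in elem]

# Slide example: Country 1, X1 = [10, 30], elementary intervals [10, 25] and [25, 30]
elem = [(10, 25), (25, 30)]
print(interval_to_weights((10, 30), elem))  # [0.75, 0.25] -- weights sum to 1
```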
Slide 8: Set-valued and list-valued variables

A set of observed categories is recoded as a weight vector over the full category list, with equal weight 1/|set| on each observed category (sketched below).

- A2 = {A = agriculture, C = chemistry, Co = commerce, E = engineering, En = energy, I = information}
- Country 1, X2 = {A, Co}, over (A, C, Co, E, En, I) => (1/2, 0, 1/2, 0, 0, 0)

For an ordered list-valued variable, e.g. A9 = {worst, bad, fair, good, best}:
- Country 1, A9 = good => (0, 0, 0, 1, 1)
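A minimal sketch of the set-valued recoding (the ordered case is shown on the slide only by a single example, so it is not sketched here); names are illustrative.

```python
def set_to_weights(observed, categories):
    """Equal weight 1/|set| on each observed category, zero elsewhere."""
    w = 1.0 / len(observed)
    return [w if c in observed else 0.0 for c in categories]

categories = ["A", "C", "Co", "E", "En", "I"]   # agriculture ... information
print(set_to_weights({"A", "Co"}, categories))  # [0.5, 0.0, 0.5, 0.0, 0.0, 0.0]
```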
Slide 9: Squared adaptive Euclidean distances

- Single adaptive squared Euclidean distance (global): one weight vector is shared by all clusters.
- Cluster-wise adaptive squared Euclidean distance (local): each cluster has its own weight vector, which differs from one cluster to another.
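A minimal sketch of such a weighted distance between two recoded (homogenized) vectors; the prototype value below is hypothetical, and the paper derives the actual weights from its optimization criterion.

```python
def adaptive_sq_euclidean(x, g, w):
    """Weighted squared Euclidean distance between object x and prototype g;
    w is the single global weight vector or the given cluster's own vector."""
    return sum(wj * (xj - gj) ** 2 for xj, gj, wj in zip(x, g, w))

x = [0.75, 0.25]  # Country 1's recoded interval variable (slide 7)
g = [0.50, 0.50]  # a hypothetical cluster prototype
print(adaptive_sq_euclidean(x, g, [1.0, 1.0]))  # unweighted: 0.125
print(adaptive_sq_euclidean(x, g, [2.0, 0.5]))  # weighted:   0.15625
```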
Slide 10: Algorithm schema

- Pre-processing step: data homogenization.
- Initialization step: randomly choose a partition, or randomly choose K distinct objects of X as initial prototypes.
- Step 1 (definition of the best prototypes): compute the prototype of each cluster for the current partition.
- Step 2 (definition of the best vectors of weights): compute the weight vectors, either a single vector shared by all clusters (global) or one vector per cluster (local).
- Step 3 (definition of the best partition): assign each object to the cluster whose prototype is closest under the corresponding adaptive distance (see the sketch after this list).
- Stopping criterion: stop when no object changes its cluster.
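A minimal sketch of the relocation loop, assuming the homogenized data are plain weight vectors, using cluster means as prototypes, and leaving Step 2 as a placeholder; the paper derives both prototypes and weights in closed form from its adequacy criterion.

```python
import random

def adaptive_sq_euclidean(x, g, w):
    return sum(wj * (xj - gj) ** 2 for xj, gj, wj in zip(x, g, w))

def mean_prototype(members):
    return [sum(col) / len(members) for col in zip(*members)]

def assign(X, prototypes, weights):
    """Step 3: each object joins the cluster with the closest prototype."""
    return [min(range(len(prototypes)),
                key=lambda k: adaptive_sq_euclidean(x, prototypes[k], weights[k]))
            for x in X]

def dynamic_clustering(X, K, max_iter=100, seed=0):
    rng = random.Random(seed)
    prototypes = [list(g) for g in rng.sample(X, K)]  # K distinct objects of X
    weights = [[1.0] * len(X[0]) for _ in range(K)]   # uniform starting weights
    labels = assign(X, prototypes, weights)
    for _ in range(max_iter):
        # Step 1: best prototype of each non-empty cluster (a plain mean here)
        for k in range(K):
            members = [x for x, l in zip(X, labels) if l == k]
            if members:
                prototypes[k] = mean_prototype(members)
        # Step 2: best weights -- placeholder; the paper computes them in closed
        # form, either one shared vector (global) or one per cluster (local)
        # Step 3: best partition under the adaptive distances
        new_labels = assign(X, prototypes, weights)
        if new_labels == labels:  # stopping criterion: no object changed cluster
            break
        labels = new_labels
    return labels, prototypes, weights
```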
Slide 11: Experimental results

Measurement of the quality of the results:
- Overall error rate of classification (OERC)
- Corrected Rand index (CR)

Let U = {u_1, ..., u_i, ..., u_R} and V = {v_1, ..., v_j, ..., v_C} be two partitions of the same data set having R and C clusters, respectively. The corrected Rand index is:
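In the standard Hubert and Arabie (1985) formulation (scikit-learn's adjusted_rand_score computes the same quantity), with n_ij = |u_i ∩ v_j|, n_i. and n_.j the row and column totals of the R x C contingency table, and n the total number of objects:

$$
\mathrm{CR}=\frac{\displaystyle\sum_{i=1}^{R}\sum_{j=1}^{C}\binom{n_{ij}}{2}-\binom{n}{2}^{-1}\sum_{i=1}^{R}\binom{n_{i\cdot}}{2}\sum_{j=1}^{C}\binom{n_{\cdot j}}{2}}{\displaystyle\frac{1}{2}\left[\sum_{i=1}^{R}\binom{n_{i\cdot}}{2}+\sum_{j=1}^{C}\binom{n_{\cdot j}}{2}\right]-\binom{n}{2}^{-1}\sum_{i=1}^{R}\binom{n_{i\cdot}}{2}\sum_{j=1}^{C}\binom{n_{\cdot j}}{2}}
$$

CR equals 1 when the two partitions agree perfectly and has expected value 0 under random labeling.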
Slide 12: Conclusions and remarks

- Clustering methods for mixed feature-type symbolic data, based on the dynamic clustering methodology with adaptive distances.
- The adaptive distances allow the methods to recognize clusters of different shapes and sizes.
- The methods provide, for each cluster, the best prototype together with the best adaptive distance.
Slide 13: Comments

Advantages:
- The proposed framework provides a solution for clustering mixed feature-type symbolic data.
- It also offers an alternative similarity measure between clusters and inputs for categorical data, via the dynamic adaptive distance.

Drawbacks:
- A categorical attribute with a large value set tends to dominate the clustering once the variables are transformed into histograms.
- Hierarchical relationships among categorical values are not considered by this method.

Application:
- Clustering of mixed feature-type data.