Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Unsupervised pattern recognition models for mixed feature-type symbolic data


1 Unsupervised pattern recognition models for mixed feature-type symbolic data
Francisco de A.T. de Carvalho, Renata M.C.R. de Souza
PRL, Vol. 31, 2010, pp. 430–443.
Presenter: Wei-Shen Tai, 2010/3/10

2 Outline
 Introduction
 Dynamic clustering algorithms for mixed feature-type symbolic data
 Cluster interpretation
 Experimental evaluation
 Concluding remarks
 Comments

3 Motivation
Partitioning dynamic clustering algorithms
 None of the former dynamic clustering models is able to handle mixed feature-type symbolic data.

4 Objective
Dynamic clustering methods for mixed feature-type symbolic data, based on suitable adaptive squared Euclidean distances.
 A pre-processing step homogenizes the mixed feature-type symbolic data into histogram-valued symbolic data prior to clustering.

5 Partitioning dynamic clustering
Iterative two-step relocation algorithms
 Construct clusters and identify a suitable representative, or prototype, for each cluster at each iteration.
 Optimize a criterion based on a measure of fit between the clusters and their prototypes.
Adaptive dynamic clustering algorithm
 The distances that compare clusters with their prototypes can differ from one cluster to another.

6 Data homogenization (pre-processing)
 Set-valued and list-valued variables
 Ordered list-valued variables
 Interval-valued variables

7 Interval-valued variables
 X1 records the minimum and the maximum of the gross national product (in millions). Country 1: X1 = [10, 30].
 The set of elementary intervals: I1 = [10, 25], I2 = [25, 30].
 I1 => l([10, 25] ∩ [10, 30]) / l([10, 30]) = 15 / 20 = 0.75
 I2 => l([25, 30] ∩ [10, 30]) / l([10, 30]) = 5 / 20 = 0.25
 Q2 = 0.75 + 0.25 = 1.0 (the weights sum to 1)
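The elementary-interval weights above can be sketched in Python. This is a minimal illustration of the homogenization idea, with function and variable names of my own rather than the paper's:

```python
def interval_weights(obs, elementary):
    """Share of the observed interval obs = (lo, hi) that falls in each
    elementary interval, i.e. l(I ∩ obs) / l(obs) for each I."""
    lo, hi = obs
    length = hi - lo
    weights = []
    for a, b in elementary:
        overlap = max(0.0, min(b, hi) - max(a, lo))  # l(I ∩ obs)
        weights.append(overlap / length)
    return weights

# Country 1: X1 = [10, 30]; elementary intervals I1 = [10, 25], I2 = [25, 30]
w = interval_weights((10, 30), [(10, 25), (25, 30)])
# w == [0.75, 0.25], and the weights sum to 1.0
```

The interval observation is thus turned into a histogram over the elementary intervals, which is what makes it comparable with the other homogenized variable types.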

8 Set-valued and list-valued variables
 Set A2 = {A=agriculture, C=chemistry, Co=commerce, E=engineering, En=energy, I=information}. Country 1: X2 = {A, Co}
 => over {A, C, Co, E, En, I}: (½, 0, ½, 0, 0, 0)
 Ordered list A9 = {worst, bad, fair, good, best}. Country 1: A9 = good
 => (0, 0, 0, 1, 1)
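The two encodings above can be sketched as follows. The unordered case spreads a uniform weight over the observed categories; the ordered case is inferred here from the slide's single example (1 for the observed category and every category above it), so the exact rule in the paper may differ:

```python
def encode_set(categories, observed):
    """Unordered set-valued variable: uniform weight 1/|observed|
    on each observed category (a histogram over the category list)."""
    w = 1.0 / len(observed)
    return [w if c in observed else 0.0 for c in categories]

def encode_ordered(categories, observed):
    """Ordered list-valued variable: 0/1 encoding with 1 for the observed
    category and all categories above it, matching the 'good' example."""
    idx = categories.index(observed)
    return [1 if i >= idx else 0 for i in range(len(categories))]

encode_set(["A", "C", "Co", "E", "En", "I"], {"A", "Co"})
# -> [0.5, 0.0, 0.5, 0.0, 0.0, 0.0]
encode_ordered(["worst", "bad", "fair", "good", "best"], "good")
# -> [0, 0, 0, 1, 1]
```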

9 Squared adaptive Euclidean distances
 Single squared adaptive Euclidean distance (global): the weight vector is the same for all clusters.
 Cluster squared adaptive Euclidean distances (local): the weight vector of each cluster differs from one cluster to another.
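In code, such an adaptive (weighted) squared Euclidean distance amounts to the following sketch; the names are mine, not the paper's:

```python
def adaptive_sq_dist(x, prototype, weights):
    """Weighted squared Euclidean distance between a homogenized object x
    (a histogram vector) and a cluster prototype, under a weight vector."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, prototype))

# Global variant: one shared weight vector for every cluster.
# Local variant: pass a different weight vector per cluster, e.g.
#   adaptive_sq_dist(x, prototypes[k], cluster_weights[k])
```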

10 Algorithm schema
 Pre-processing step: data homogenization.
 Initialization step: randomly choose a partition, or randomly choose K distinct objects belonging to X.
 Step 1: definition of the best prototypes, given the current partition and weights.
 Step 2: definition of the best vectors of weights: a single weight vector shared by all clusters (global), or one weight vector per cluster (local).
 Step 3: definition of the best partition: each object is assigned to the cluster whose representative prototype is closest under the adaptive distance.
 Stopping criterion: no object changes its cluster.
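The schema above can be sketched as a plain relocation loop. This is an illustrative skeleton only, under assumptions of my own: the prototype update is a componentwise mean and the adaptive-weight re-estimation step (Step 2) is omitted, so it is not the paper's exact method:

```python
import random

def dynamic_clustering(X, K, dist, n_iter=100, seed=0):
    """Skeleton of the relocation loop: assign each object to the nearest
    prototype, then recompute each prototype as the cluster mean.
    (The paper additionally re-estimates adaptive weight vectors;
    here the distance's weights are kept fixed for brevity.)"""
    rng = random.Random(seed)
    prototypes = [list(p) for p in rng.sample(X, K)]  # K distinct objects
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(K), key=lambda k: dist(x, prototypes[k]))
                      for x in X]
        if new_labels == labels:   # stopping criterion: no reassignment
            break
        labels = new_labels
        for k in range(K):         # prototype = componentwise mean
            members = [x for x, l in zip(X, labels) if l == k]
            if members:
                prototypes[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels, prototypes
```

A small usage example: with two well-separated groups and a squared Euclidean distance, the loop recovers the expected two clusters.

```python
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, protos = dynamic_clustering(
    X, 2, lambda x, p: sum((a - b) ** 2 for a, b in zip(x, p)))
```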

11 Experimental results
 Measures of the quality of the results: overall error rate of classification (OERC) and corrected Rand index (CR).
 Let U = {u1, …, ui, …, uR} and V = {v1, …, vj, …, vC} be two partitions of the same data set, having respectively R and C clusters. The corrected Rand index measures the agreement between U and V, corrected for chance.
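A straightforward implementation of the corrected Rand index (the standard Hubert–Arabie adjusted Rand form, which is what CR usually denotes in the clustering literature) could look like this sketch:

```python
from math import comb

def corrected_rand(U, V):
    """Corrected (adjusted) Rand index between two partitions given as
    label lists: 1.0 for identical partitions, near 0 for chance agreement."""
    n = len(U)
    pairs = {}                      # n_ij: objects in cluster u_i and v_j
    for u, v in zip(U, V):
        pairs[(u, v)] = pairs.get((u, v), 0) + 1
    row, col = {}, {}               # marginals n_i. and n_.j
    for (u, v), c in pairs.items():
        row[u] = row.get(u, 0) + c
        col[v] = col.get(v, 0) + c
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_i = sum(comb(c, 2) for c in row.values())
    sum_j = sum(comb(c, 2) for c in col.values())
    expected = sum_i * sum_j / comb(n, 2)   # chance agreement
    max_index = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that the index is invariant to cluster relabeling, which is why it suits comparing a clustering result against the known a priori partition.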

12 Conclusions and remarks
 Clustering for mixed feature-type symbolic data, based on the dynamic clustering methodology with adaptive distances.
 The method can recognize clusters of different shapes and sizes.
 It yields the best prototype of each cluster together with the best adaptive distance for each cluster.

13 Comments
Advantage
 The proposed framework provides a solution for clustering mixed feature-type symbolic data.
 It also offers an alternative similarity measure between a cluster and an input object for categorical data, via the dynamic adaptive distance.
Drawback
 If a categorical attribute has a large value set, it becomes the dominant attribute in the clustering after the transformation to histograms.
 Hierarchical relationships among categorical values are not considered in this method.
Application
 Mixed feature-type data clustering.

