
Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology
On Data Labeling for Clustering Categorical Data
Hung-Leng Chen, Kun-Ta Chuang, and Ming-Syan Chen
TKDE, Vol. 20, No. 11, 2008, pp. 1458-1471.
Presenter: Wei-Shen Tai, 2008/11/4

Outline
- Introduction
- Related work
- Model of MARDL (MAximal Resemblance Data Labeling)
- Experimental results
- Conclusions
- Comments

Motivation
Sampling
- Scales down the size of the database and speeds up clustering algorithms.
- The problem is how to allocate the unclustered data to appropriate clusters.
[Figure: a large database is sampled and the sampled data are clustered; how to label the remaining unclustered data is the open question.]

Objective
Data labeling
- Gives each unclustered data point the most appropriate cluster label.
- MARDL is independent of the clustering algorithm; any categorical clustering algorithm can be used in this framework.

Categorical cluster representative
Node
- Attribute name + attribute value. E.g. [A1=a] and [A2=m] are nodes.
N-nodeset
- A set of n nodes in which every node comes from a distinct attribute. E.g. {[A1=a], [A2=m]} is a 2-nodeset.
Independent nodesets
- Two nodesets that contain no nodes from the same attribute are said to be independent of each other in a represented cluster.
- E.g. {[A1=a], [A2=m]} and {[A3=c]}
- p({[A1=a], [A2=m], [A3=c]}) = p({[A1=a], [A2=m]}) * p({[A3=c]})
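
The node and nodeset definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the type aliases `Node` and `Nodeset` and the helpers `make_nodeset`, `independent`, and `joint_probability` are hypothetical names chosen here, not names from the paper.

```python
from typing import FrozenSet, Tuple

Node = Tuple[str, str]        # (attribute name, attribute value), e.g. ("A1", "a")
Nodeset = FrozenSet[Node]     # an n-nodeset holds n nodes


def make_nodeset(*nodes: Node) -> Nodeset:
    """Build an n-nodeset; every node must come from a distinct attribute."""
    attrs = [attr for attr, _ in nodes]
    if len(attrs) != len(set(attrs)):
        raise ValueError("a nodeset may hold at most one node per attribute")
    return frozenset(nodes)


def independent(a: Nodeset, b: Nodeset) -> bool:
    """Two nodesets are independent when they share no attribute."""
    attrs_a = {attr for attr, _ in a}
    attrs_b = {attr for attr, _ in b}
    return attrs_a.isdisjoint(attrs_b)


def joint_probability(a: Nodeset, p_a: float, b: Nodeset, p_b: float) -> float:
    """p(a ∪ b) = p(a) * p(b), valid only for independent nodesets."""
    if not independent(a, b):
        raise ValueError("nodesets are not independent")
    return p_a * p_b
```

For example, `independent(make_nodeset(("A1", "a"), ("A2", "m")), make_nodeset(("A3", "c")))` is true, so the product rule from the slide applies to that pair.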

Node and n-nodeset importance
- Information theory: entropy
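
The formulas on this slide are not reproduced in the transcript. As a rough illustration of how entropy can express importance, the sketch below assumes a normalized-entropy weighting: a node that appears almost only in one cluster gets a weight near 1, and a node spread evenly over all clusters gets a weight near 0. The function names and the exact weighting are assumptions made here and may differ from the definition used in the paper.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weight(counts_per_cluster: Sequence[int]) -> float:
    """Assumed weighting: 1 - H / log(k), where k is the number of clusters.

    counts_per_cluster[i] is how often the node occurs in cluster i.
    """
    k = len(counts_per_cluster)
    total = sum(counts_per_cluster)
    if k < 2 or total == 0:
        return 0.0
    probs = [c / total for c in counts_per_cluster]
    return 1.0 - entropy(probs) / math.log(k)
```

Under this assumption, a node counted as `[10, 0, 0]` over three clusters is fully discriminative, while `[5, 5]` over two clusters carries no information about cluster membership.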

N-nodeset importance representative (NNIR)
NNIR tree construction and pruning
- An Apriori-like algorithm:
  - Initialization
  - Computing candidate nodeset importance and pruning
  - Generating candidate nodesets
Pruning
- Threshold: the importance of a t-nodeset is less than a predefined θ.
- Relative maximum: the importance of a (t+1)-nodeset is larger than the importance of a t-nodeset.
- Hybrid
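
The Apriori-like loop named on this slide (initialize, score and prune, generate candidates) might look roughly like the sketch below. It implements threshold pruning only; the relative-maximum and hybrid strategies are not covered. The `importance` callback, the `max_size` cap, and the function names are hypothetical and stand in for details the slides do not give.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

Node = Tuple[str, str]        # (attribute name, attribute value)
Nodeset = FrozenSet[Node]


def generate_candidates(level: List[Nodeset]) -> List[Nodeset]:
    """Join pairs of t-nodesets into (t+1)-nodesets with distinct attributes."""
    out = set()
    for a, b in combinations(level, 2):
        union = a | b
        attrs = {attr for attr, _ in union}
        # Keep unions that grow by exactly one node and repeat no attribute.
        if len(union) == len(a) + 1 and len(attrs) == len(union):
            out.add(union)
    return sorted(out, key=sorted)


def build_nodesets(
    nodes: Iterable[Node],
    importance: Callable[[Nodeset], float],   # assumed scoring function
    theta: float,
    max_size: int,
) -> Dict[Nodeset, float]:
    """Level-wise construction with threshold pruning at theta."""
    level = [frozenset([n]) for n in nodes]   # initialization: 1-nodesets
    kept: Dict[Nodeset, float] = {}
    size = 1
    while level and size <= max_size:
        scored = {ns: importance(ns) for ns in level}
        survivors = [ns for ns, w in scored.items() if w >= theta]
        kept.update({ns: scored[ns] for ns in survivors})
        level = generate_candidates(survivors)
        size += 1
    return kept
```

Pruned nodesets are never extended, which is what keeps the search Apriori-like: a (t+1)-nodeset is only generated from t-nodesets that survived the threshold.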

Maximal resemblance data labeling
Goal of MARDL
- Decide the most appropriate cluster label c_i for the unlabeled data point.
[Figure: an unclustered data point {[A1=a], [A2=m], [A3=c]} matched to the combination {[A1=a], [A2=m]} and {[A3=c]} in cluster c1.]

Approximate algorithm for MARDL
Only one combination is considered and utilized
- Tree nodes are queued and sorted by importance value.
- The nodeset with maximal importance is selected.
- Nodesets that are not independent of the selected nodeset are removed from the queue.
[Figure: an unclustered data point {[A1=a], [A2=m], [A3=c]} and a tree nodeset queue.]
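
The greedy selection described on this slide can be sketched as follows. Each cluster's tree is assumed to be a plain dictionary from nodeset to importance, and the resemblance of a point to a cluster is assumed to be the sum of the importances of the selected nodesets; both are simplifications made for this example, and `greedy_combination` and `label_point` are hypothetical names.

```python
from typing import Dict, FrozenSet, List, Tuple

Node = Tuple[str, str]        # (attribute name, attribute value)
Nodeset = FrozenSet[Node]


def greedy_combination(
    point: Nodeset, tree: Dict[Nodeset, float]
) -> List[Tuple[Nodeset, float]]:
    """Pick one combination of mutually independent nodesets for a point."""
    # Queue only nodesets contained in the point, highest importance first.
    queue = sorted(
        ((ns, w) for ns, w in tree.items() if ns <= point),
        key=lambda item: item[1],
        reverse=True,
    )
    chosen: List[Tuple[Nodeset, float]] = []
    used_attrs = set()
    for ns, w in queue:
        attrs = {attr for attr, _ in ns}
        # Skipping a dependent nodeset is equivalent to removing it from the queue.
        if attrs.isdisjoint(used_attrs):
            chosen.append((ns, w))
            used_attrs |= attrs
    return chosen


def label_point(point: Nodeset, trees: Dict[str, Dict[Nodeset, float]]) -> str:
    """Return the cluster label with the largest (assumed) resemblance."""
    def resemblance(tree: Dict[Nodeset, float]) -> float:
        return sum(w for _, w in greedy_combination(point, tree))

    return max(trees, key=lambda label: resemblance(trees[label]))
```

Because only one combination per cluster is built, the cost per point is dominated by sorting the matching nodesets, at the price of possibly missing the combination with the true maximal resemblance.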

Experimental results

Conclusions
MARDL
- Allocates unlabeled data points to appropriate clusters when sampling is used to cluster a very large categorical database.
NIR
- A categorical cluster representative technique.
NNIR
- A more powerful representative than NIR because it considers combinations of attribute values.

Comments
Advantage
- A good method for assigning unclustered data to appropriate trained clusters in sampling-based clustering of categorical data.
- The concept, derived from existing methods (Apriori and information theory), is easy to understand and accept.
- MARDL is independent of the clustering method; any categorical clustering algorithm can be used in this framework.
Drawback
- Constructing the tree for each cluster takes much time, and the tree is a quite complex representation of a cluster.
- Because the importance of a (t+1)-nodeset may be larger than that of a t-nodeset, hybrid pruning takes much time: all candidate (t+1)-nodesets must be computed.
Application
- Classifying unclustered data when sampling is used to cluster a very large categorical database.
