Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Manoranjan.

Presentation transcript:

Feature selection for clustering
Authors: Manoranjan Dash, Kiseok Choi, Peter Scheuermann, Huan Liu (IEEE 2002)
Advisor: Dr. Hsu
Graduate: Yu Cheng Chen
Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology

Intelligent Database Systems Lab N.Y.U.S.T. I. M.
Outline
- Motivation
- Objective
- Introduction
- Properties of Feature Selection
- Distance-based Entropy Measure and Its Efficient Calculation
- Feature Selection Algorithm
- Experimental Evaluation
- Conclusions
- Personal Opinion

Motivation
- Only a few methods have been proposed for feature selection for clustering.
- Most of these methods are ‘wrapper’ techniques, which have two weaknesses:
  - Heavy reliance on clustering algorithms.
  - Lack of clustering criteria to evaluate clustering in different subspaces.

Objective
- Propose a feature selection method that is independent of any clustering algorithm.

Introduction
Why is feature selection important?
- It removes unimportant features.
- Data size is reduced.
- Learning accuracy improves.
Drawbacks of the wrapper method:
- There is no generally accepted criterion for estimating the accuracy of clustering.
- It relies on clustering algorithms.

Introduction
- We propose and evaluate a ‘filter’ method for feature selection.
- It is independent of clustering algorithms.
- It is based on the observation that data with clusters has a very different point-to-point distance histogram from data without clusters.

Properties of Feature Selection
- We show an example using synthetic data in 3-, 2-, and 1-dimensional spaces.
- We note two distinct scenarios:
  1. A single feature defines clusters independently.
  2. Individual features do not define clusters, but correlated features do.

Properties of Feature Selection
Distance histogram:
- It has 100 buckets.
- Distances are normalized into [0, 1].
- The X value is the bucket number and the Y value is the frequency of bucket X.
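
The histogram described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes Euclidean distance and normalization by the largest pairwise distance (the slides only say distances are normalized into [0, 1]), and the names `distance_histogram` and `blob` are hypothetical.

```python
import math
import random

def distance_histogram(points, n_buckets=100):
    """Count pairwise distances per bucket after scaling them into [0, 1]."""
    # All N*(N-1)/2 point-to-point Euclidean distances (assumed metric).
    dists = [math.dist(p, q)
             for i, p in enumerate(points)
             for q in points[i + 1:]]
    d_max = max(dists) or 1.0  # assumption: normalize by the largest distance
    counts = [0] * n_buckets
    for d in dists:
        # A normalized distance of exactly 1.0 goes into the last bucket.
        bucket = min(int(d / d_max * n_buckets), n_buckets - 1)
        counts[bucket] += 1
    return counts

def blob(cx, cy, n):
    """Hypothetical helper: n points spread uniformly around (cx, cy)."""
    return [(cx + random.uniform(-0.5, 0.5), cy + random.uniform(-0.5, 0.5))
            for _ in range(n)]

random.seed(0)
clustered = blob(0, 0, 30) + blob(10, 10, 30)  # two well-separated clusters
hist = distance_histogram(clustered)
```

For this toy data the intra-cluster distances fall into the lowest buckets and the inter-cluster distances into the highest ones, leaving the middle of the histogram empty.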

Properties of Feature Selection
An important distinction between the two histograms:
- The histogram for data without clusters has a bell shape.
- The histogram for data with clusters has a different distribution.
- Typically, if the dataset contains clusters, the majority of intra-cluster distances are smaller than the majority of inter-cluster distances.

Distance-based Entropy Measure and Its Efficient Calculation
Entropy is defined as follows (the equation itself is not captured in this transcript), where X_i is the i-th data point and X_ik is the k-th feature value of the i-th point, with i = 1…N and k = 1…M.

Distance-based Entropy Measure and Its Efficient Calculation
Because we do not know the probability of the points, we use a proxy method to estimate the entropy (the equation is not captured in this transcript), where D_ij is the normalized distance, in the range [0, 1], between X_i and X_j.
- Entropy is 0.0 when the distance is 0.0 or 1.0.
- Entropy is 1.0 when the distance is 0.5, the mean distance.
- The goal is to assign low entropy to intra-cluster and inter-cluster distances, and higher entropy to noisy distances.
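
The proxy formula itself is missing from the transcript. One function that matches the stated properties (0.0 at distances 0.0 and 1.0, 1.0 at distance 0.5) is the base-2 binary entropy of the normalized distance. The sketch below assumes that form; the function names are hypothetical.

```python
import math

def pair_entropy(d):
    """Assumed proxy: base-2 binary entropy of a normalized distance d in [0, 1]."""
    if d <= 0.0 or d >= 1.0:
        return 0.0  # limit of x*log(x) as x -> 0
    return -(d * math.log2(d) + (1.0 - d) * math.log2(1.0 - d))

def dataset_entropy(distances):
    """Sum the per-pair entropy over all normalized pairwise distances."""
    return sum(pair_entropy(d) for d in distances)
```

Under this assumption, very small and very large distances contribute little to the total, while distances near 0.5 contribute the most.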

Distance-based Entropy Measure and Its Efficient Calculation
The figure on this slide shows the entropy-distance relationship. The measure still has two drawbacks:
- The meeting point (μ) of the two sides can be an inter-cluster distance, yet it is assigned the highest entropy.
- Entropy increases rapidly for very small distances.

Distance-based Entropy Measure and Its Efficient Calculation
- For the first drawback, we set the meeting point (μ) so that it separates the intra-cluster and inter-cluster distances more accurately.
- The second drawback can be overcome by incorporating a coefficient (β) in the equation.

Distance-based Entropy Measure and Its Efficient Calculation
- Figure 3(a) shows that increasing the β value decreases the entropy.
- Figure 3(b) shows that varying μ shifts the meeting point of the two sides of the entropy curve.
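
The modified equation is not included in the transcript. As an illustration only, the sketch below assumes a piecewise-exponential curve that rises from 0 at distance 0 to 1 at μ and falls back to 0 at distance 1. This form reproduces both effects described above, but it is an assumption rather than the confirmed formula, and `modified_entropy` is a hypothetical name.

```python
import math

def modified_entropy(d, mu=0.5, beta=10.0):
    """Assumed entropy of a normalized distance d, peaking at the meeting point mu.

    Requires 0 < mu < 1 and beta > 0.
    """
    if d <= mu:
        # Rising side: flat near d = 0, reaching 1.0 at d = mu.
        return (math.exp(beta * d) - 1.0) / (math.exp(beta * mu) - 1.0)
    # Falling side: 1.0 just above mu, dropping to 0.0 at d = 1.
    return (math.exp(beta * (1.0 - d)) - 1.0) / (math.exp(beta * (1.0 - mu)) - 1.0)
```

With this form, a larger β flattens the curve near 0 and 1, so the entropy of a fixed distance away from μ decreases, and changing μ moves the peak.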

Distance-based Entropy Measure and Its Efficient Calculation
- Among the β values we experimented with, β = 10 works well.
- The way to estimate μ is based on the following observation: estimating the range of intra-cluster distances is easier and more accurate than estimating the range of inter-cluster distances, because intra-cluster distances occupy the lowest portion of the complete range of distances.

Distance-based Entropy Measure and Its Efficient Calculation
Using the intra-cluster distance range to estimate μ:

Distance-based Entropy Measure and Its Efficient Calculation
- Figure 4 explains the above procedure.
- A notable difference between the two figures: for data with clusters, the entropy is low for many bins that have a considerably high frequency count.

Distance-based Entropy Measure and Its Efficient Calculation
- The distance calculation required to compute the entropy is O(N^2).
- So far, the goal has been to assign low entropy to intra-cluster and inter-cluster distances, that is, to minimize the entropy for data with clusters.
- A large portion of the distances have very low entropy, and their total contributes little to the overall entropy.

Distance-based Entropy Measure and Its Efficient Calculation
- We consider only the distances whose entropy is higher than a threshold entropy.
- We can find a range of distances, Rc, whose entropy is higher than the threshold.
- The entropy for any distance outside this range can be set to a very low constant value.
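
The thresholding idea can be sketched as follows. The entropy formula is not in the transcript, so this example assumes a piecewise-exponential curve that peaks at μ and is controlled by β. Under that assumption the range Rc has a closed form obtained by inverting each side of the curve. All function names, the threshold value, and the constant `low_constant` are hypothetical.

```python
import math

def entropy(d, mu, beta):
    """Assumed piecewise-exponential entropy (not confirmed by the slides)."""
    if d <= mu:
        return (math.exp(beta * d) - 1.0) / (math.exp(beta * mu) - 1.0)
    return (math.exp(beta * (1.0 - d)) - 1.0) / (math.exp(beta * (1.0 - mu)) - 1.0)

def high_entropy_range(threshold, mu, beta):
    """Return (lo, hi) such that entropy(d) >= threshold exactly for lo <= d <= hi."""
    # Invert the rising side for lo and the falling side for hi.
    lo = math.log(1.0 + threshold * (math.exp(beta * mu) - 1.0)) / beta
    hi = 1.0 - math.log(1.0 + threshold * (math.exp(beta * (1.0 - mu)) - 1.0)) / beta
    return lo, hi

def fast_total_entropy(distances, threshold=0.01, mu=0.5, beta=10.0, low_constant=0.0):
    """Evaluate the entropy only inside Rc; use a constant for all other distances."""
    lo, hi = high_entropy_range(threshold, mu, beta)
    return sum(entropy(d, mu, beta) if lo <= d <= hi else low_constant
               for d in distances)
```

With μ = 0.5, β = 10 and a threshold of 0.01, Rc is roughly [0.09, 0.91], so distances close to 0 or 1 skip the exponential evaluation entirely.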

Feature Selection Algorithm
The process has two main steps:
- Search for the optimal subset, the one whose entropy is minimum.
- Evaluate subsets of features.
Here we use the forward selection method: it first finds the best feature, then, using the already selected features, finds the next best feature, and so on.
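
Forward selection as described above can be sketched generically. The evaluation function is passed in as a parameter, since the exact entropy computation is not reproduced in this transcript. The names `forward_selection`, `subset_entropy`, and the toy scoring function are hypothetical.

```python
def forward_selection(n_features, subset_entropy, max_features=None):
    """Greedily add the feature that gives the lowest-entropy subset at each step.

    subset_entropy: callable taking a list of feature indices and returning a
    score where lower is better (assumed to be the distance-based entropy).
    Returns a list of (feature_index, entropy_after_adding_it) in selection order.
    """
    selected = []
    remaining = list(range(n_features))
    ranking = []
    while remaining and (max_features is None or len(selected) < max_features):
        # Try each remaining feature together with those already selected.
        best = min(remaining, key=lambda f: subset_entropy(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        ranking.append((best, subset_entropy(selected)))
    return ranking

# Toy stand-in for the entropy evaluation: each feature has a fixed cost.
toy_cost = {0: 0.9, 1: 0.1, 2: 0.5}

def toy_entropy(subset):
    return sum(toy_cost[f] for f in subset)

order = [f for f, _ in forward_selection(3, toy_entropy)]
```

In this toy run the features are ranked 1, 2, 0. In practice `subset_entropy` would compute the pairwise distances using only the features in the subset and return their total entropy.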

Experimental Evaluation
- Evaluating a feature selection method for clustering means checking the correctness of the selected features.
- We first evaluate the proposed method on synthetic datasets for which we know the important features.
- Next, we evaluate it on benchmark and real datasets for which the important features are known or can be found by visualization.

Experimental Evaluation
Synthetic Datasets and Real Datasets

Conclusions
- The experimental results show that our method correctly finds the most important feature subsets.

Personal Opinion ……