Unsupervised Feature Selection Using Feature Similarity
Author: Pabitra Mitra (Student Member)
Advisor: Dr. Hsu
Graduate: Ching-Lung Chen
Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology

Outline
- Motivation
- Objective
- Introduction
- Feature Similarity Measure
- Feature Selection Method
- Feature Evaluation Indices
- Experimental Results and Comparisons
- Conclusions
- Personal Opinion
- Review

Motivation
Conventional feature selection methods suffer from high computational complexity on data sets that are large in both dimension and size.

Objective
Propose an unsupervised feature selection algorithm suitable for data sets that are large in both dimension and size.

Introduction 1/3
Sequential floating searches provide better results, though at the cost of higher computational complexity. Existing methods can be broadly classified into two categories:
- Maximization of clustering performance: sequential unsupervised feature selection, maximum entropy, neuro-fuzzy approaches, etc.
- Methods based on feature dependency and relevance: correlation coefficients, measures of statistical redundancy, linear dependence, etc.

Introduction 2/3
We propose an unsupervised algorithm which uses feature dependency/similarity for redundancy reduction but requires no search. A new similarity measure, called the maximal information compression index, is used in clustering; it is compared with the correlation coefficient and the least-square regression error.

Introduction 3/3
The proposed algorithm is geared toward two goals:
- minimizing the information loss, and
- minimizing the redundancy present in the reduced feature subset.
Unlike most conventional algorithms, the proposed one does not search for the best subset, so it can be computed in much less time than many indices used in other supervised and unsupervised feature selection methods.

Feature Similarity Measure
There are two approaches for measuring similarity between two random variables:
1. Nonparametrically test the closeness of the probability distributions of the variables.
2. Measure the amount of functional dependency between the variables.
We discuss two existing linear dependency measures below:
1. Correlation coefficient (ρ)
2. Least square regression error (e)

Feature Similarity Measure: Correlation Coefficient (ρ)
$\rho(x,y) = \frac{\mathrm{cov}(x,y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}$
where var() is the variance of a variable and cov() is the covariance between two variables. The dissimilarity $1-|\rho(x,y)|$ has the following properties:
1. $0 \le 1-|\rho(x,y)| \le 1$.
2. $1-|\rho(x,y)| = 0$ if and only if x and y are linearly related.
3. $1-|\rho(x,y)| = 1-|\rho(y,x)|$ (symmetric).
4. If $u = (x-a)/c$ and $v = (y-b)/d$ for some constants $a,b,c,d$, then $1-|\rho(x,y)| = 1-|\rho(u,v)|$; the measure is invariant to scaling and translation of the variables.
5. The measure is sensitive to rotation of the scatter diagram in the (x,y) plane.

Feature Similarity Measure: Least Square Regression Error (e)
e(x,y) is the mean square error of predicting y from the linear model y = a + bx, where a and b are the regression coefficients obtained by minimizing the mean square error. The coefficients are given by
$b = \frac{\mathrm{cov}(x,y)}{\mathrm{var}(x)}, \qquad a = \bar{y} - b\,\bar{x},$
and the mean square error e(x,y) is given by
$e(x,y) = \mathrm{var}(y)\,\big(1 - \rho(x,y)^2\big).$

e(x,y) has the following properties:
1. $0 \le e(x,y) \le \mathrm{var}(y)$.
2. $e(x,y) = 0$ if and only if x and y are linearly related.
3. $e(x,y) \ne e(y,x)$ (unsymmetric).
4. If $u = x/c$ and $v = y/d$ for some constants $c,d$, then $e(x,y) = d^2\,e(u,v)$; the measure e is sensitive to scaling of the variables (it is, however, invariant to translation).
5. The measure e is sensitive to rotation of the scatter diagram in the (x,y) plane.
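To make these two measures concrete, here is a minimal NumPy sketch (mine, not from the slides) computing the correlation dissimilarity 1 - |ρ| and the regression error e(x,y) directly from the definitions above:

```python
import numpy as np

def correlation_dissimilarity(x, y):
    """1 - |rho(x, y)|: zero iff x and y are linearly related."""
    rho = np.cov(x, y, bias=True)[0, 1] / np.sqrt(np.var(x) * np.var(y))
    return 1.0 - abs(rho)

def regression_error(x, y):
    """Mean square error of predicting y from y = a + b*x.
    Note e(x, y) != e(y, x): the measure is unsymmetric."""
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a = np.mean(y) - b * np.mean(x)
    return np.mean((y - (a + b * x)) ** 2)  # equals var(y) * (1 - rho^2)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + 2.0                        # exactly linearly related
print(correlation_dissimilarity(x, y))   # ~0.0
print(regression_error(x, y))            # ~0.0
```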

Feature Similarity Measure: Maximal Information Compression Index (λ₂)
Let Σ be the covariance matrix of random variables x and y. Define the maximal information compression index λ₂(x,y) as the smallest eigenvalue of Σ:
$2\lambda_2(x,y) = \mathrm{var}(x) + \mathrm{var}(y) - \sqrt{\big(\mathrm{var}(x)+\mathrm{var}(y)\big)^2 - 4\,\mathrm{var}(x)\,\mathrm{var}(y)\,\big(1-\rho(x,y)^2\big)}$
λ₂ = 0 when the features are linearly dependent, and it increases as the amount of dependency decreases.

The corresponding loss of information in reconstructing the pattern is equal to the eigenvalue along the direction normal to the principal component. Hence, λ₂ is the amount of reconstruction error committed if the data is projected to a reduced dimension in the best possible way. It is therefore a measure of the minimum amount of information loss, or equivalently, the maximum amount of information compression.

The significance of λ₂ can also be explained geometrically in terms of linear regression: the value of λ₂ is equal to the sum of the squares of the perpendicular distances of the points (x,y) to the best-fit line $y = \hat{a} + \hat{b}x$, obtained by minimizing the perpendicular (rather than vertical) distances. The coefficients of such a best-fit line are given by
$\hat{b} = \tan\theta, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x},$
where $\theta = \frac{1}{2}\tan^{-1}\!\left(\frac{2\,\mathrm{cov}(x,y)}{\mathrm{var}(x)-\mathrm{var}(y)}\right)$.
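A small sketch following directly from the definition above (the function name is mine): λ₂ is simply the smallest eigenvalue of the 2×2 covariance matrix of the feature pair.

```python
import numpy as np

def max_info_compression_index(x, y):
    """lambda_2(x, y): smallest eigenvalue of the covariance matrix of (x, y).
    Zero iff the two features are linearly dependent."""
    sigma = np.cov(x, y, bias=True)      # 2x2 covariance matrix
    return np.linalg.eigvalsh(sigma)[0]  # eigvalsh sorts eigenvalues ascending

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
print(max_info_compression_index(x, 2 * x + 1))              # ~0: linearly dependent
print(max_info_compression_index(x, rng.normal(size=1000)))  # > 0: independent
```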

λ₂ has the following properties:
1. $0 \le \lambda_2(x,y) \le \frac{1}{2}\big(\mathrm{var}(x)+\mathrm{var}(y)\big)$.
2. $\lambda_2(x,y) = 0$ if and only if x and y are linearly related.
3. $\lambda_2(x,y) = \lambda_2(y,x)$ (symmetric).
4. The measure is invariant to translation of the variables but sensitive to their scaling.
5. Unlike ρ and e, the measure is invariant to rotation of the scatter diagram in the (x,y) plane.

Feature Selection Method
The task of feature selection involves two steps:
1. Partition the original feature set into a number of homogeneous subsets (clusters).
2. Select a representative feature from each such cluster.
The partitioning of the features is based on the k-NN principle:
1. Compute the k nearest features of each feature.
2. Among them, the feature having the most compact subset is selected, and its k neighboring features are discarded.
3. The process is repeated for the remaining features until all of them are either selected or discarded.

While determining the k nearest neighbors of features, we assign a constant error threshold ε, set equal to the distance of the kth nearest neighbor of the feature selected in the first iteration. In subsequent iterations, if this distance is greater than ε, we decrease the value of k.

Notation:
- D: the original number of features; the original feature set is O = {F_i, i = 1, ..., D}.
- S(F_i, F_j): the dissimilarity between features F_i and F_j.
- r_i^k: the dissimilarity between feature F_i and its kth nearest-neighbor feature in R.

(This slide presented the algorithm steps in a figure that did not survive the transcript; a sketch is given below.)
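The following Python sketch reconstructs the procedure from the preceding slides: cluster features by the k-NN principle using λ₂ as the dissimilarity, keep the feature with the most compact neighborhood, discard its k neighbors, and shrink k whenever the kth-neighbor distance exceeds the threshold ε. All function and variable names are mine, and details such as tie-breaking are assumptions.

```python
import numpy as np

def select_features(X, k):
    """Unsupervised feature selection by feature-similarity clustering.
    X: (n_samples, D) data matrix; k: initial neighborhood size.
    Returns the indices of the selected representative features."""
    D = X.shape[1]
    # Pairwise feature dissimilarity: lambda_2 for every feature pair.
    S = np.zeros((D, D))
    for i in range(D):
        for j in range(i + 1, D):
            sigma = np.cov(X[:, i], X[:, j], bias=True)
            S[i, j] = S[j, i] = np.linalg.eigvalsh(sigma)[0]

    remaining, selected, eps = list(range(D)), [], None
    while remaining:
        k = min(k, len(remaining) - 1)
        if k < 1:                            # too few features left to cluster
            selected.extend(remaining)
            break
        sub = S[np.ix_(remaining, remaining)]
        kth = np.sort(sub, axis=1)[:, k]     # kth-NN distance (column 0 is self)
        best = int(np.argmin(kth))           # feature with most compact subset
        if eps is None:
            eps = kth[best]                  # threshold from the first iteration
        if kth[best] > eps:
            k -= 1                           # neighborhood too loose: shrink k
            continue
        selected.append(remaining[best])
        neighbors = set(np.argsort(sub[best])[1:k + 1])  # its k nearest neighbors
        remaining = [f for t, f in enumerate(remaining)
                     if t != best and t not in neighbors]
    return selected

# Example: 3 informative features plus 3 near-copies of feature 0.
rng = np.random.default_rng(2)
base = rng.normal(size=(500, 3))
X = np.hstack([base, base[:, :1] + 0.01 * rng.normal(size=(500, 3))])
print(select_features(X, k=2))
```

Larger k removes more features per step, so k directly controls the size of the reduced set, matching the later remark that k acts as a scale parameter.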

Computational complexity:
- With respect to the dimension D, the method has complexity O(D²).
- Evaluating the similarity measure for a feature pair has complexity O(l), where l is the number of data points; thus, the feature selection scheme has overall complexity O(D²l).
- k acts as a scale parameter which controls the degree of detail in a more direct manner.
- The algorithm does not require the similarity measure to be a metric.

Feature Evaluation Indices
We now describe some evaluation indices.
Indices that need class information:
1. Class separability
2. k-NN classification accuracy
3. Naive Bayes classification accuracy
Indices that do not need class information:
1. Entropy
2. Fuzzy feature evaluation index
3. Representation entropy

Feature Evaluation Indices: Class Separability
$S_w = \sum_j \pi_j\,E\{(x-\mu_j)(x-\mu_j)^T \mid w_j\}$ is the within-class scatter matrix and $S_b = \sum_j \pi_j\,(\mu_j-\bar{\mu})(\mu_j-\bar{\mu})^T$ is the between-class scatter matrix, where $\pi_j$ is the a priori probability that a pattern belongs to class $w_j$, $\mu_j$ is the sample mean vector of class $w_j$, and $\bar{\mu}$ is the overall sample mean. A standard Fisher-style separability index is $\mathrm{trace}(S_w^{-1}S_b)$, with a larger value indicating better-separated classes.
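A sketch of the scatter-matrix computation (the function name is mine, and the Fisher-style trace index is an assumption since the slide's exact formula is not preserved):

```python
import numpy as np

def class_separability(X, y):
    """trace(inv(S_w) @ S_b): larger means better-separated classes."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)            # a priori class probabilities
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w, S_b = np.zeros((d, d)), np.zeros((d, d))
    for c, p in zip(classes, priors):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)            # sample mean vector of class c
        S_w += p * np.cov(Xc, rowvar=False, bias=True)
        diff = (mu - overall_mean).reshape(-1, 1)
        S_b += p * (diff @ diff.T)
    return float(np.trace(np.linalg.solve(S_w, S_b)))
```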

Feature Evaluation Indices: k-NN Classification Accuracy
The k-NN rule is used to evaluate the effectiveness of the reduced feature set for classification. We randomly select 10% of the data as the training set and classify the remaining 90% of the points. Ten such independent runs are performed, and the average accuracy on the test set is reported.
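A sketch of this evaluation protocol, using scikit-learn's k-NN classifier (parameter defaults such as n_neighbors=1 are my assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(X, y, n_runs=10, train_frac=0.1, n_neighbors=1, seed=0):
    """Average k-NN test accuracy over independent random 10%/90% splits."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        idx = rng.permutation(len(y))
        n_train = int(train_frac * len(y))
        train, test = idx[:n_train], idx[n_train:]
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)
        clf.fit(X[train], y[train])
        accs.append(clf.score(X[test], y[test]))
    return float(np.mean(accs))
```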

Feature Evaluation Indices: Naive Bayes Classification Accuracy
The Bayes maximum likelihood classifier, assuming a normal distribution for each class, is used to evaluate classification performance. The mean and covariance of the classes are estimated from a randomly selected 10% training sample, and the remaining 90% is used as the test set.

Feature Evaluation Indices: Entropy
Let x_{p,j} denote the feature value of pattern p along the jth direction, and let d_{pq} be the distance between patterns p and q. The similarity between p and q is given by
$\mathrm{sim}(p,q) = e^{-\alpha d_{pq}}$
where α is a positive constant; a possible value is $\alpha = -\ln 0.5 / \bar{d}$, with $\bar{d}$ the average distance between data points computed over the entire data set. The entropy is
$E = -\sum_{p=1}^{n}\sum_{q=1}^{n}\Big(\mathrm{sim}(p,q)\log \mathrm{sim}(p,q) + \big(1-\mathrm{sim}(p,q)\big)\log\big(1-\mathrm{sim}(p,q)\big)\Big)$
If the data is uniformly distributed in the feature space, the entropy is maximum.
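A sketch of this entropy index (the min-max distance normalization is my assumption, standing in for the slide's missing d_{pq} formula):

```python
import numpy as np
from scipy.spatial.distance import pdist

def data_entropy(X):
    """High when points are uniformly spread, low when tightly clustered."""
    span = X.max(axis=0) - X.min(axis=0)
    d = pdist(X / np.where(span > 0, span, 1.0))  # normalized pairwise distances
    alpha = -np.log(0.5) / d.mean()     # so that sim = 0.5 at the average distance
    sim = np.clip(np.exp(-alpha * d), 1e-12, 1 - 1e-12)
    return float(-np.sum(sim * np.log(sim) + (1 - sim) * np.log(1 - sim)))
```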

Feature Evaluation Indices: Fuzzy Feature Evaluation Index (FFEI)
$\mathrm{FFEI} = \frac{2}{n(n-1)} \sum_p \sum_{q \ne p} \frac{1}{2}\Big[\mu_{pq}^T\big(1-\mu_{pq}^O\big) + \mu_{pq}^O\big(1-\mu_{pq}^T\big)\Big]$
where $\mu_{pq}^O$ and $\mu_{pq}^T$ are the degrees that both patterns p and q belong to the same cluster in the original and transformed feature spaces, respectively. The membership function may be defined as $\mu_{pq} = 1 - d_{pq}/D_{\max}$ if $d_{pq} \le D_{\max}$, and 0 otherwise. The value of FFEI decreases as the intercluster distances increase.

Feature Evaluation Indices: Representation Entropy
Let the eigenvalues of the d×d covariance matrix of a feature set of size d be $\lambda_j$, $j = 1, \ldots, d$, and define $\tilde{\lambda}_j = \lambda_j / \sum_{i=1}^{d}\lambda_i$. The $\tilde{\lambda}_j$ have properties similar to probabilities ($0 \le \tilde{\lambda}_j \le 1$ and $\sum_j \tilde{\lambda}_j = 1$), so we can define the representation entropy
$H_R = -\sum_{j=1}^{d} \tilde{\lambda}_j \log \tilde{\lambda}_j,$
which is equivalent to the amount of redundancy present in that particular representation of the data set (low when the variance is concentrated along a few directions, high when it is evenly spread).
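A direct sketch of this definition:

```python
import numpy as np

def representation_entropy(X):
    """H_R over normalized covariance eigenvalues: near 0 means high
    redundancy; the maximum log(d) means variance spread over all d features."""
    eig = np.linalg.eigvalsh(np.cov(X, rowvar=False, bias=True))
    lam = np.clip(eig, 0.0, None)       # guard tiny negative round-off
    lam = lam / lam.sum()
    lam = lam[lam > 0]                  # treat 0 * log(0) as 0
    return float(-np.sum(lam * np.log(lam)))
```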

Experimental Results and Comparisons
Three categories of real-life public domain data sets are used:
- low-dimensional (D <= 10)
- medium-dimensional (10 < D <= 100)
- high-dimensional (D > 100)
Nine UCI data sets are used:
1. Isolet
2. Multiple Features
3. Arrhythmia
4. Spambase
5. Waveform
6. Ionosphere
7. Forest Cover Type
8. Wisconsin Cancer
9. Iris

The proposed method is compared with four feature selection schemes:
1. Branch and Bound (BB)
2. Sequential Forward Search (SFS)
3. Sequential Floating Forward Search (SFFS)
4. Stepwise Clustering (SWC), using the correlation coefficient
In our experiments, entropy is mainly used as the feature selection criterion with the first three search algorithms.

(Eight slides of experimental result tables and comparison plots followed here; the graphics are not preserved in the transcript.)

Conclusions
- An algorithm for unsupervised feature selection using feature similarity measures is described.
- The algorithm is based on pairwise feature similarity measures, which are fast to compute; unlike other approaches, it does not explicitly optimize either classification or clustering performance.
- A new feature similarity measure, the maximal information compression index, is defined.
- Extensive experiments demonstrate that representation entropy can be used as an index for quantifying both redundancy reduction and information loss in a feature selection method.

Personal Opinion
- We can apply this method in our own feature selection experiments.
- The similarity measure is valid only for numeric features; it is worth considering how to extend it to categorical features.

Review
1. Compute the k nearest features of each feature.
2. Among them, the feature having the most compact subset is selected, and its k neighboring features are discarded.
3. Repeat this process for the remaining features until all of them are either selected or discarded.