Chapter 13 (Prototype Methods and Nearest-Neighbors)


Chapter 13 (Prototype Methods and Nearest-Neighbors), February 24, 2006

Contents: Prototype Methods and Nearest-Neighbors
1. Introduction
2. Prototype Methods
   2.1 K-means Clustering
   2.2 Learning Vector Quantization
   2.3 Gaussian Mixtures
3. k-Nearest-Neighbor Classifiers
   3.1 Example: A Comparative Study
   3.2 Example: k-Nearest-Neighbors and Image Scene Classification
   3.3 Invariant Metrics and Tangent Distance
4. Adaptive Nearest-Neighbor Methods
   4.1 Example
   4.2 Global Dimension Reduction for Nearest-Neighbors
5. Computational Considerations

2. Prototype Methods
Prototypes: a set of representative points in feature space, typically not examples from the training sample.
Prototype methods classify a query point x to the class of the closest prototype.
- "Closest" is usually defined by Euclidean distance in the feature space, after each feature has been standardized to have mean 0 and variance 1 over the training sample.
- The main issue is to figure out how many prototypes to use and where to put them.
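As a minimal sketch (not from the chapter), standardization and nearest-prototype classification might look like this in Python; the prototype array and its labels are assumed to come from one of the methods described below.

    import numpy as np

    def standardize(X, mean=None, std=None):
        # Standardize each feature to mean 0, variance 1 (statistics from the training sample).
        if mean is None:
            mean, std = X.mean(axis=0), X.std(axis=0)
        return (X - mean) / std, mean, std

    def classify_by_prototype(x, prototypes, proto_labels):
        # Assign x to the class of the closest prototype (Euclidean distance).
        d = np.linalg.norm(prototypes - x, axis=1)
        return proto_labels[np.argmin(d)]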

2. Prototype Methods
2.1 K-means Clustering
K-means clustering: a method for finding clusters and cluster centers (R of them) in a set of unlabeled data.
K-means algorithm (see the sketch below):
1. Initialize a set of R centers (typically R randomly chosen observations from the training data).
2. For each center, identify the subset of training points that are closer to it than to any other center.
3. Compute the mean of each feature over the data points in each cluster; this mean vector becomes the new center for that cluster.
Iterate steps 2 and 3 until convergence.
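A compact NumPy sketch of these two alternating steps (illustrative only; variable names and the iteration cap are my own choices):

    import numpy as np

    def kmeans(X, R, n_iter=100, rng=np.random.default_rng(0)):
        # Initialize centers as R randomly chosen training points.
        centers = X[rng.choice(len(X), size=R, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each point goes to its closest center.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = d.argmin(axis=1)
            # Update step: each center becomes the mean of its assigned points.
            new_centers = np.array([X[assign == r].mean(axis=0) if np.any(assign == r)
                                    else centers[r] for r in range(R)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, assign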

2. Prototype Methods
K-means classifier (labeled data):
1. Apply K-means clustering to the training data in each class separately, using R prototypes per class.
2. Assign a class label to each of the K x R prototypes.
3. Classify a new feature vector x to the class of the closest prototype (see the sketch below).
Drawback: for each class, the other classes have no say in the positioning of that class's prototypes. As a result, a number of prototypes end up near the class boundaries, leading to potential misclassification errors for points near those boundaries. The remedy is to use all of the data to position all prototypes, which leads to LVQ. Figure 13.1 (upper panel).
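A hedged sketch of this classifier using scikit-learn's KMeans (R prototypes per class; function names are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_classifier_fit(X, y, R):
        # Run K-means separately within each class and label the resulting prototypes.
        prototypes, proto_labels = [], []
        for c in np.unique(y):
            km = KMeans(n_clusters=R, n_init=10).fit(X[y == c])
            prototypes.append(km.cluster_centers_)
            proto_labels.extend([c] * R)
        return np.vstack(prototypes), np.array(proto_labels)

    def kmeans_classifier_predict(x, prototypes, proto_labels):
        # Classify x to the class of the closest prototype.
        return proto_labels[np.argmin(np.linalg.norm(prototypes - x, axis=1))]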

2. Prototype Methods
2.2 Learning Vector Quantization (LVQ1)
LVQ1 starts from an initial set of labeled prototypes (for example, the K-means prototypes) and then moves them online: each sampled training point attracts its closest prototype if their class labels agree and repels it if they disagree, with a learning rate that is decreased toward zero. Figure 13.1 (lower panel); a sketch follows.
Drawback: LVQ methods are defined by algorithms rather than by optimization of some fixed criterion, so it is difficult to understand their properties.
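A minimal sketch of the LVQ1 update rule just described (the learning-rate schedule, epoch count, and initialization are illustrative choices, not prescribed by the chapter):

    import numpy as np

    def lvq1(X, y, prototypes, proto_labels, epochs=20, lr0=0.1, rng=np.random.default_rng(0)):
        prototypes = np.asarray(prototypes, dtype=float).copy()
        for epoch in range(epochs):
            lr = lr0 * (1 - epoch / epochs)   # decrease the learning rate toward zero
            for i in rng.permutation(len(X)):
                # Find the prototype closest to the sampled training point.
                j = np.argmin(np.linalg.norm(prototypes - X[i], axis=1))
                if proto_labels[j] == y[i]:
                    prototypes[j] += lr * (X[i] - prototypes[j])   # attract: same class
                else:
                    prototypes[j] -= lr * (X[i] - prototypes[j])   # repel: different class
        return prototypes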

2. Prototype Methods
In the lower panel of Figure 13.1, the LVQ algorithm has moved the prototypes away from the decision boundary.

2. Prototype Methods
2.3 Gaussian Mixtures (a soft clustering method)
Each cluster is described by a Gaussian density, which has a centroid and a covariance matrix.
Estimation of the means and covariances (apply the EM algorithm within each class):
1. In the E-step, observations close to the center of a cluster will most likely get weight close to 1 for that cluster and close to 0 for every other cluster.
2. In the M-step, each observation contributes to the weighted means (and covariances) for every cluster.
Classification is based on posterior probabilities: classification rule \hat{G}(x) = \arg\max_k \hat{P}(G = k \mid X = x).
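A hedged sketch of a mixture-based classifier using scikit-learn's GaussianMixture (one R-component mixture per class; R and the function names are assumptions for illustration):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def gmm_classifier_fit(X, y, R):
        # Fit a separate R-component Gaussian mixture within each class and record class priors.
        classes = np.unique(y)
        models = {c: GaussianMixture(n_components=R).fit(X[y == c]) for c in classes}
        priors = {c: np.mean(y == c) for c in classes}
        return classes, models, priors

    def gmm_classifier_predict(X, classes, models, priors):
        # Posterior is proportional to the class prior times the class-conditional mixture density.
        log_post = np.column_stack([np.log(priors[c]) + models[c].score_samples(X)
                                    for c in classes])
        return classes[np.argmax(log_post, axis=1)]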

2. Prototype Methods
Figure 13.2: Although the decision boundaries are roughly similar, those for the mixture model are smoother.

3. k-Nearest-Neighbor Classifiers
No model is fit. Given a query point x_0, we find the k training points x_{(r)}, r = 1, ..., k, closest in distance to x_0, and classify using majority vote among the k neighbors. (For simplicity we assume the features are real-valued and use Euclidean distance in feature space: d_{(r)} = \| x_{(r)} - x_0 \|.)
- Ties are broken at random.
- In 1-nearest-neighbor classification, each training point is a prototype.
- Large k: high bias, low variance (smooth boundaries). Small k: low bias, high variance (wiggly boundaries).
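A minimal k-NN majority-vote sketch in NumPy (illustrative only; in practice scikit-learn's KNeighborsClassifier would normally be used):

    import numpy as np
    from collections import Counter

    def knn_predict(x0, X_train, y_train, k):
        # Find the k training points closest to the query in Euclidean distance.
        d = np.linalg.norm(X_train - x0, axis=1)
        neighbors = y_train[np.argsort(d)[:k]]
        # Majority vote among the k neighbors (Counter breaks ties by first-encountered label,
        # rather than at random as in the text).
        return Counter(neighbors).most_common(1)[0][0]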

3. k-Nearest-Neighbor Classifiers
Figure 13.3: The k-nearest-neighbor decision boundary in the upper panel is fairly smooth compared to the lower panel, where a 1-nearest-neighbor classifier was used.
Figure 13.4: The upper panel shows the misclassification errors as a function of neighborhood size.

3. k-Nearest-Neighbor Classifiers
Test error bound for 1-NN (Cover and Hart, 1967): asymptotically, the error rate of the 1-nearest-neighbor classifier is never more than twice the Bayes rate. At x, let k* be the dominant class and p_k(x) the true conditional probability for class k. Then

    Bayes error = 1 - p_{k*}(x),
    asymptotic 1-NN error = \sum_{k=1}^{K} p_k(x) (1 - p_k(x)) \ge 1 - p_{k*}(x).

The asymptotic 1-NN error rate is that of a random rule: both the classification and the test point are chosen at random with probabilities p_k(x), k = 1, ..., K. For K = 2 this equals 2 p_{k*}(x)(1 - p_{k*}(x)) \le 2(1 - p_{k*}(x)), i.e., at most twice the Bayes error.

3. k-Nearest-Neighbor Classifiers
3.1 Example: A Comparative Study
Comparison of k-NN, K-means, and LVQ on two simulated problems. There are 10 independent features X_j, each uniformly distributed on [0, 1]. The two-class 0-1 target variable is defined as follows:

    problem 1 ("easy"):      Y = I(X_1 > 1/2),
    problem 2 ("difficult"): Y = I( sign\{ \prod_{j=1}^{3} (X_j - 1/2) \} > 0 ).

K-means and LVQ give nearly identical results. For the best choices of their tuning parameters, K-means and LVQ outperform nearest-neighbors on the first problem, and all three methods perform similarly on the second. Notice that the best value of each tuning parameter is clearly situation dependent. The results underline the importance of using an objective, data-based method such as cross-validation to estimate the best value of a tuning parameter (see the sketch below).
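As an illustration of that last point, a hedged scikit-learn sketch of choosing k for a nearest-neighbor classifier by cross-validation (the data X, y and the grid of candidate values are assumed given):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def best_k_by_cv(X, y, k_grid=range(1, 31, 2), n_folds=10):
        # Estimate out-of-sample accuracy for each k by cross-validation and keep the best.
        scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=n_folds).mean()
                  for k in k_grid]
        return list(k_grid)[int(np.argmax(scores))]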

3. k-Nearest-Neighbor Classifiers Figure 13.5

3. k-Nearest-Neighbor Classifiers
3.2 Example: k-Nearest-Neighbors and Image Scene Classification
Figure 13.6

3. k-Nearest-Neighbor Classifiers

3. k-Nearest-Neighbor Classifiers
3.3 Invariant Metrics and Tangent Distance
In some problems, the training features are invariant under certain natural transformations. The nearest-neighbor classifier can exploit such invariances by incorporating them into the metric used to measure the distances between objects.
- A fully invariant metric runs into problems with large transformations, which motivates the use of tangent distance, a local approximation to the invariance manifold.
- "Hints": a simpler way to achieve this invariance is to add to the training set a number of rotated versions of each training image, and then just use a standard nearest-neighbor classifier (see the sketch below).
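A hedged sketch of the "hints" idea: augment the training set with rotated copies of each image and fit a plain 1-nearest-neighbor classifier. The image shape and rotation angles below are assumptions for illustration.

    import numpy as np
    from scipy.ndimage import rotate
    from sklearn.neighbors import KNeighborsClassifier

    def augment_with_rotations(images, labels, angles=(-15, -7.5, 7.5, 15)):
        # images: array of shape (n, 16, 16); add rotated copies of every training image.
        aug_x, aug_y = [images], [labels]
        for a in angles:
            aug_x.append(np.stack([rotate(im, a, reshape=False) for im in images]))
            aug_y.append(labels)
        return np.concatenate(aug_x), np.concatenate(aug_y)

    def fit_hints_1nn(images, labels):
        # Standard 1-NN on the flattened, augmented training set.
        Xa, ya = augment_with_rotations(images, labels)
        return KNeighborsClassifier(n_neighbors=1).fit(Xa.reshape(len(Xa), -1), ya)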

3. k-Nearest-Neighbor Classifiers Figure 13.10 Figure 13.11

4. Adaptive Nearest-Neighbor Methods
The high-dimensional problem: in a high-dimensional feature space, the nearest neighbors of a point can be very far away, causing bias and degrading the performance of the rule.
Median R: consider N data points uniformly distributed in the unit cube [-1/2, 1/2]^p, and let R be the radius of a 1-nearest-neighborhood centered at the origin. Then

    median(R) = v_p^{-1/p} ( 1 - (1/2)^{1/N} )^{1/p},

where v_p r^p is the volume of the sphere of radius r in p dimensions. The median radius quickly approaches 0.5, the distance to the edge of the cube (Figure 13.12). A quick numerical check follows.
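A small hedged check of this formula against simulation (v_p is the usual p-dimensional unit-sphere volume constant; the dimension, sample size, and replication count are illustrative):

    import numpy as np
    from scipy.special import gamma

    def median_radius_formula(p, N):
        # v_p r^p is the volume of a p-dimensional sphere of radius r.
        v_p = np.pi ** (p / 2) / gamma(p / 2 + 1)
        return v_p ** (-1 / p) * (1 - 0.5 ** (1 / N)) ** (1 / p)

    def median_radius_simulated(p, N, reps=2000, rng=np.random.default_rng(0)):
        # Distance from the origin to the nearest of N uniform points in [-1/2, 1/2]^p.
        X = rng.uniform(-0.5, 0.5, size=(reps, N, p))
        return np.median(np.linalg.norm(X, axis=2).min(axis=1))

    print(median_radius_formula(10, 500), median_radius_simulated(10, 500))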

4. Adaptive Nearest-Neighbor Methods

4. Adaptive Nearest-Neighbor Methods
DANN (discriminant adaptive nearest-neighbor) metric:

    D(x, x_0) = (x - x_0)^T \Sigma (x - x_0),

where

    \Sigma = W^{-1/2} [ W^{-1/2} B W^{-1/2} + \epsilon I ] W^{-1/2} = W^{-1/2} [ B^* + \epsilon I ] W^{-1/2}.

Here W is the pooled within-class covariance matrix and B is the between-class covariance matrix, with W and B computed using only the 50 nearest neighbors around x_0; \epsilon (typically 1) rounds the neighborhood from an infinite strip to an ellipsoid. After computation of the metric, it is used in a nearest-neighbor rule at x_0 (see the sketch below).
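A hedged NumPy sketch of computing the DANN metric at a query point; the local W and B estimates below follow the definitions above, but the exact weighting and regularization details are my own simplifications.

    import numpy as np

    def dann_metric(x0, X, y, n_nb=50, eps=1.0):
        # Use the n_nb nearest neighbors (Euclidean) of x0 to estimate W and B locally.
        idx = np.argsort(np.linalg.norm(X - x0, axis=1))[:n_nb]
        Xn, yn = X[idx], y[idx]
        mean_all = Xn.mean(axis=0)
        p = X.shape[1]
        W, B = np.zeros((p, p)), np.zeros((p, p))
        for c in np.unique(yn):
            Xc = Xn[yn == c]
            pi_c = len(Xc) / len(Xn)
            W += pi_c * np.cov(Xc, rowvar=False, bias=True)   # within-class covariance
            d = (Xc.mean(axis=0) - mean_all)[:, None]
            B += pi_c * (d @ d.T)                             # between-class covariance
        # Sigma = W^{-1/2} [ W^{-1/2} B W^{-1/2} + eps*I ] W^{-1/2}
        evals, evecs = np.linalg.eigh(W)
        W_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-10))) @ evecs.T
        B_star = W_inv_sqrt @ B @ W_inv_sqrt
        return W_inv_sqrt @ (B_star + eps * np.eye(p)) @ W_inv_sqrt

    # Distance under the DANN metric: D(x, x0) = (x - x0)^T Sigma (x - x0).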

4. Adaptive Nearest-Neighbor Methods 4.1 Example Figure 13.14 Figure 13.15

4. Adaptive Nearest-Neighbor Methods
4.2 Global Dimension Reduction for Nearest-Neighbors
A global variation of the local DANN dimension reduction (Hastie and Tibshirani). At each training point x_i, the between-centroids sum-of-squares matrix B_i is computed, and these matrices are then averaged over all training points:

    \bar{B} = (1/N) \sum_{i=1}^{N} B_i.

Let e_1, e_2, ..., e_p be the eigenvectors of \bar{B}, ordered from largest to smallest eigenvalue \theta_k. The derivation is based on the fact that the best rank-L approximation to \bar{B}, namely \bar{B}_{[L]} = \sum_{l=1}^{L} \theta_l e_l e_l^T, solves the least squares problem

    \min_{rank(M) = L} \sum_{i=1}^{N} trace[ (B_i - M)^2 ].    (13.11)

Since each B_i contains information on (a) the local discriminant subspace and (b) the strength of discrimination in that subspace, (13.11) can be seen as a way of finding the best approximating subspace of dimension L to a series of N subspaces by weighted least squares.
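A hedged sketch of this averaging-and-eigendecomposition step, assuming the list of per-point matrices B_i has already been computed (for example from DANN-style local estimates as above):

    import numpy as np

    def global_reduction_subspace(B_list, L):
        # Average the between-centroid matrices over all training points.
        B_bar = np.mean(B_list, axis=0)
        # Eigenvectors of B_bar, ordered from largest to smallest eigenvalue.
        evals, evecs = np.linalg.eigh(B_bar)
        order = np.argsort(evals)[::-1]
        # The leading L eigenvectors span the best rank-L approximating subspace, per (13.11).
        return evecs[:, order[:L]]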

5. Computational Considerations
One drawback of nearest-neighbor rules in general is the computational load, both in finding the neighbors and in storing the entire training set. Approaches for reducing this load include:
- reduced computations for tangent distance (Hastie and Simard);
- the multi-edit algorithm (Devijver and Kittler);
- the condensing procedure (Hart).