Lecture Notes for CMPUT 466/551 Nilanjan Ray

Kernel Methods
Lecture Notes for CMPUT 466/551, Nilanjan Ray

Kernel Methods: Key Points
- Essentially a local regression (function estimation/fitting) technique.
- Only the observations (training points) close to the query point are used in the regression computation.
- Each observation receives a weight that decreases as its distance from the query point increases.
- The resulting regression function is smooth.
- All of these features are made possible by a function called a kernel.
- Requires very little training (i.e., few parameters to compute offline from the training set, and little offline computation).
- This kind of regression is known as a memory-based technique, since the entire training set must be available at the time of regression.

One-Dimensional Kernel Smoothers
- We have seen that k-nearest neighbors directly estimates Pr(Y | X = x).
- k-NN assigns equal weight to all points in the neighborhood.
- The resulting average curve is bumpy and discontinuous.
- Rather than giving equal weight, assign weights that decrease smoothly with distance from the target point.

Nadaraya-Watson Kernel-Weighted Average
The N-W kernel-weighted average is
    \hat f(x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i) y_i / \sum_{i=1}^{N} K_\lambda(x_0, x_i),
where K is a kernel function of the form
    K_\lambda(x_0, x) = D( |x - x_0| / h_\lambda(x_0) ).
Typically D (and hence K) is also symmetric about 0.
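
To make the estimator concrete, here is a minimal NumPy sketch of the N-W average. The Gaussian choice of D, the bandwidth value, and the toy data are assumptions made for illustration, not part of the slides.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, lam=0.3):
    """Nadaraya-Watson kernel-weighted average at each query point.

    Uses a Gaussian kernel D(t) = exp(-t^2 / 2) and the metric width h_lambda(x0) = lam.
    """
    t = (x_query[:, None] - x_train[None, :]) / lam   # pairwise scaled distances
    K = np.exp(-0.5 * t**2)                           # kernel weights K_lambda(x0, x_i)
    return (K @ y_train) / K.sum(axis=1)              # weighted average of the y_i

# Toy usage: smooth noisy samples of sin(x).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + 0.3 * rng.standard_normal(100)
x0 = np.linspace(0, 2 * np.pi, 200)
y_hat = nadaraya_watson(x, y, x0)
```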

Some Points About Kernels
- h_λ(x_0) is a width function that also depends on λ.
- For the N-W kernel average, h_λ(x_0) = λ.
- For the k-NN average, h_λ(x_0) = |x_0 - x_[k]|, where x_[k] is the kth closest x_i to x_0.
- λ determines the width of the local neighborhood and the degree of smoothness.
- λ also controls the tradeoff between bias and variance.
- A larger λ gives lower variance but higher bias (why?).
- λ is computed from the training data (how?).

Example Kernel Functions
- Epanechnikov quadratic kernel (used in the N-W method): D(t) = (3/4)(1 - t^2) for |t| <= 1, and 0 otherwise.
- Tri-cube kernel: D(t) = (1 - |t|^3)^3 for |t| <= 1, and 0 otherwise.
- Gaussian kernel: D(t) = \phi(t), the standard normal density.
Kernel characteristics:
- Compact support – vanishes beyond a finite range (Epanechnikov, tri-cube).
- Everywhere differentiable (Gaussian, tri-cube).
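
For reference, a small sketch of the three kernel shapes D(t) as Python functions; the function names are illustrative, and the normalizing constants follow the standard textbook forms.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov quadratic kernel: compact support on |t| <= 1."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def tricube(t):
    """Tri-cube kernel: compact support and differentiable at the boundary."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

def gaussian(t):
    """Gaussian kernel: non-compact support, everywhere differentiable."""
    t = np.asarray(t, dtype=float)
    return np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)

# Compact support check: Epanechnikov and tri-cube vanish beyond |t| = 1, Gaussian does not.
print(epanechnikov(1.5), tricube(1.5), gaussian(1.5))
```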

Local Linear Regression
- In the kernel-weighted average method, the estimated function value has high bias at the boundary of the domain.
- This high bias is a result of the asymmetry of the kernel at the boundary.
- The bias can also be present in the interior when the x values in the training set are not equally spaced.
- Fitting straight lines rather than constants locally helps us remove this bias (why?).

Locally Weighted Linear Regression
At each target point x_0, solve the weighted least-squares problem
    \min_{\alpha(x_0), \beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i) [ y_i - \alpha(x_0) - \beta(x_0) x_i ]^2
and set \hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0) x_0. The least-squares solution is
    \hat f(x_0) = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0) y = \sum_{i=1}^{N} l_i(x_0) y_i,
where b(x)^T = (1, x), B is the N x 2 matrix with ith row b(x_i)^T, and W(x_0) is the N x N diagonal matrix of weights K_\lambda(x_0, x_i).
- Note that the estimate is linear in the y_i.
- The weights l_i(x_0) are sometimes referred to as the equivalent kernel. Ex.
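
A minimal sketch of the matrix formula above for a single query point, using an Epanechnikov kernel; the kernel choice, bandwidth, and toy data are assumptions for illustration. The printed checks correspond to the equivalent-kernel properties used on the next slide.

```python
import numpy as np

def epanechnikov(t):
    t = np.abs(t)
    return np.where(t <= 1, 0.75 * (1 - t**2), 0.0)

def local_linear(x_train, y_train, x0, lam=0.3):
    """Local linear fit at a single query point x0.

    Returns the estimate f_hat(x0) and the equivalent-kernel weights l_i(x0),
    so that f_hat(x0) = sum_i l_i(x0) * y_i (linear in the y_i).
    """
    w = epanechnikov((x_train - x0) / lam)                    # kernel weights K_lambda(x0, x_i)
    B = np.column_stack([np.ones_like(x_train), x_train])     # rows b(x_i)^T = (1, x_i)
    W = np.diag(w)
    b0 = np.array([1.0, x0])
    # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W
    l = b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)
    return l @ y_train, l

# Toy usage near the left boundary; the equivalent kernel sums to 1 and has zero first moment.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 50))
y = np.cos(3 * x) + 0.1 * rng.standard_normal(50)
f0, l = local_linear(x, y, x0=0.05)
print(l.sum(), l @ (x - 0.05))   # approximately 1 and 0
```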

Bias Reduction in Local Linear Regression
- Local linear regression automatically modifies the kernel to correct the bias exactly to first order.
- To see this, write a Taylor series expansion of f(x_i) about x_0 inside E[\hat f(x_0)] = \sum_i l_i(x_0) f(x_i):
    E[\hat f(x_0)] = f(x_0) \sum_i l_i(x_0) + f'(x_0) \sum_i (x_i - x_0) l_i(x_0) + (f''(x_0)/2) \sum_i (x_i - x_0)^2 l_i(x_0) + R.
  For local linear regression, \sum_i l_i(x_0) = 1 and \sum_i (x_i - x_0) l_i(x_0) = 0, so the bias involves only the quadratic and higher-order terms.
- Ex. 6.2 in [HTF].

Local Polynomial Regression
- Why have a polynomial for the local fit? What would be the rationale?
- We will gain on bias; however, we will pay the price in terms of variance (why?).
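
A short, hedged sketch of a degree-d local polynomial fit built on NumPy's polyfit (whose weights multiply the residuals before squaring, hence the square root of the kernel weights); the kernel choice, bandwidth, and default degree are illustrative assumptions.

```python
import numpy as np

def local_polynomial(x_train, y_train, x0, degree=2, lam=0.3):
    """Local polynomial regression estimate f_hat(x0) of a given degree.

    Solves min sum_i K_lambda(x0, x_i) * (y_i - p(x_i))^2 over polynomials p of the
    given degree, then evaluates the fitted polynomial at x0.
    """
    t = np.abs(x_train - x0) / lam
    K = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)   # Epanechnikov weights
    mask = K > 0                                   # keep points inside the window
    # (The window must contain at least degree + 1 points for the fit to be defined.)
    # np.polyfit minimizes sum (w_i * (y_i - p(x_i)))^2, so pass sqrt of the kernel weights.
    coeffs = np.polyfit(x_train[mask], y_train[mask], deg=degree, w=np.sqrt(K[mask]))
    return np.polyval(coeffs, x0)
```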

Bias and Variance Tradeoff
- As the degree of the local polynomial increases, bias decreases and variance increases.
- Local linear fits can help reduce bias significantly at the boundaries at a modest cost in variance.
- Local quadratic fits tend to be most helpful in reducing bias due to curvature in the interior of the domain.
- So, would it be helpful to have a mixture of linear and quadratic local fits?

Local Regression in Higher Dimensions
- 1D local regression extends directly to p dimensions: use a radial kernel
    K_\lambda(x_0, x) = D( ||x - x_0|| / \lambda )
  and fit a hyperplane (or polynomial) locally by weighted least squares.
- Standardize each coordinate before applying the kernel, because the Euclidean (squared) norm is affected by the scaling of the coordinates.
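
A rough sketch of p-dimensional local linear regression with standardized coordinates and a radial Gaussian kernel; the standardization recipe, kernel choice, and bandwidth are assumptions for illustration.

```python
import numpy as np

def local_linear_pd(X, y, x0, lam=1.0):
    """Local linear regression at query point x0 in p dimensions.

    Coordinates are standardized to unit standard deviation before computing the
    radial kernel, so that no single coordinate dominates the Euclidean norm.
    """
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sigma                           # standardized training inputs
    z0 = (x0 - mu) / sigma                         # standardized query point
    dist = np.linalg.norm(Z - z0, axis=1)          # radial distances ||z_i - z0||
    w = np.exp(-0.5 * (dist / lam) ** 2)           # Gaussian radial kernel weights
    B = np.column_stack([np.ones(len(X)), Z])      # local linear basis (1, z_1, ..., z_p)
    b0 = np.concatenate([[1.0], z0])
    WB = B * w[:, None]                            # row-scaled B, i.e. W B
    beta = np.linalg.solve(B.T @ WB, WB.T @ y)     # weighted least-squares coefficients
    return b0 @ beta
```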

Local Regression: Issues in Higher Dimensions
- The boundary poses an even greater problem in higher dimensions, since the fraction of points near the boundary grows with the dimension.
- Many training points are required to reduce the bias; the sample size must increase exponentially in p to match the same performance.
- Local regression becomes much less useful when the dimension goes beyond 2 or 3.
- It is impossible to maintain localness (low bias) and a sizeable sample in each neighborhood (low variance) at the same time.

Combating Dimensions: Structured Kernels
- In high dimensions the input variables (i.e., the x coordinates) can be strongly correlated, and this correlation can be exploited to reduce the effective dimensionality of kernel regression.
- Let A be a positive semidefinite matrix (what does that mean?) and consider a kernel of the form
    K_{\lambda,A}(x_0, x) = D( (x - x_0)^T A (x - x_0) / \lambda ).
- If A = \Sigma^{-1}, the inverse of the covariance matrix of the input variables, then the correlation structure is captured.
- Further, one can keep only a few principal components of A to reduce the dimensionality.
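
A minimal sketch of structured-kernel weights with A taken as the inverse sample covariance; the Gaussian-shaped D and the small regularization term added for numerical stability are assumptions.

```python
import numpy as np

def structured_kernel_weights(X, x0, lam=1.0, eps=1e-6):
    """Weights K_{lambda,A}(x0, x_i) = D( (x_i - x0)^T A (x_i - x0) / lambda )
    with A = inverse of the (regularized) sample covariance and a Gaussian-shaped D."""
    cov = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])   # regularized covariance
    A = np.linalg.inv(cov)
    diff = X - x0
    quad = np.einsum('ij,jk,ik->i', diff, A, diff)             # (x_i - x0)^T A (x_i - x0)
    return np.exp(-0.5 * quad / lam)
```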

Combating Dimensions: Low-Order Additive Models
- ANOVA (analysis of variance) decomposition:
    f(X_1, ..., X_p) = \alpha + \sum_j g_j(X_j) + \sum_{k<l} g_{kl}(X_k, X_l) + ...
- For a low-order (e.g., additive) model, one-dimensional local regression is all that is needed: each g_j is fit by smoothing a partial residual against X_j (backfitting), as sketched below.
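
A rough backfitting sketch for an additive fit, using a 1-D Nadaraya-Watson smoother as the per-coordinate fitter; the smoother choice, bandwidth, iteration count, and centering step are assumptions made so the snippet runs end to end.

```python
import numpy as np

def nw_smooth(x_train, y_train, x_query, lam=0.5):
    """1-D Nadaraya-Watson smoother with a Gaussian kernel (assumed bandwidth lam)."""
    t = (x_query[:, None] - x_train[None, :]) / lam
    w = np.exp(-0.5 * t**2)
    return (w @ y_train) / w.sum(axis=1)

def backfit_additive(X, y, lam=0.5, n_iter=10):
    """Fit f(x) = alpha + sum_j g_j(x_j) by backfitting with 1-D local smoothers."""
    n, p = X.shape
    alpha = y.mean()
    g = np.zeros((n, p))                                  # g[:, j] holds g_j at the training points
    for _ in range(n_iter):
        for k in range(p):
            partial = y - alpha - g.sum(axis=1) + g[:, k] # partial residual for coordinate k
            g[:, k] = nw_smooth(X[:, k], partial, X[:, k], lam)
            g[:, k] -= g[:, k].mean()                     # center each g_j for identifiability
    return alpha, g
```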

Probability Density Function Estimation
- In many classification and regression problems we desperately want to estimate probability densities – recall the instances we have seen.
- So can we not estimate a probability density directly, given some samples drawn from it?
- A local method of density estimation:
    \hat f_X(x_0) = #\{ x_i \in N(x_0) \} / (N \lambda),
  where N(x_0) is a small metric neighborhood of width \lambda around x_0.
- This estimate is typically bumpy and non-smooth (why?).

Smooth PDF Estimation Using Kernels
- Parzen method (kernel density estimation):
    \hat f_X(x_0) = (1 / (N \lambda)) \sum_{i=1}^{N} K_\lambda(x_0, x_i).
- With a Gaussian kernel, K_\lambda(x_0, x) = \phi( |x - x_0| / \lambda ).
- In p dimensions, with the Gaussian product kernel,
    \hat f_X(x_0) = (1 / (N (2 \lambda^2 \pi)^{p/2})) \sum_{i=1}^{N} \exp( -||x_i - x_0||^2 / (2 \lambda^2) ).
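
A minimal sketch of the p-dimensional Gaussian product-kernel (Parzen) density estimate following the formula above; the bandwidth is an assumed input.

```python
import numpy as np

def gaussian_kde(X, x_query, lam=0.5):
    """Parzen / kernel density estimate with a Gaussian product kernel in p dimensions.

    f_hat(x0) = (1 / (N * (2*pi*lam^2)^(p/2))) * sum_i exp(-||x_i - x0||^2 / (2*lam^2))
    """
    N, p = X.shape
    d2 = ((x_query[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared distances
    norm = N * (2 * np.pi * lam**2) ** (p / 2)
    return np.exp(-0.5 * d2 / lam**2).sum(axis=1) / norm
```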

Using Kernel Density Estimates in Classification
- Posterior probability:
    \hat{Pr}(G = j | X = x_0) = \hat\pi_j \hat f_j(x_0) / \sum_k \hat\pi_k \hat f_k(x_0),
  where \hat f_j is the jth class-conditional density and \hat\pi_j is the class prior (sample proportion).
- To estimate this posterior, estimate the class-conditional densities \hat f_j using the Parzen method.
- [Figure: the estimated class-conditional densities and the resulting ratio of posteriors.]
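
A hedged sketch of a classifier built from per-class Parzen estimates and sample-proportion priors, following the posterior formula above; the shared Gaussian bandwidth is an assumption.

```python
import numpy as np

def kde_classify(X_train, y_train, X_query, lam=0.5):
    """Estimate pi_j * f_hat_j(x0) / sum_k pi_k * f_hat_k(x0) with per-class Gaussian KDEs."""
    classes = np.unique(y_train)
    p = X_train.shape[1]
    scores = np.zeros((len(X_query), len(classes)))
    for j, c in enumerate(classes):
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)                               # class prior pi_j
        d2 = ((X_query[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2)
        f_hat = np.exp(-0.5 * d2 / lam**2).sum(axis=1) / (len(Xc) * (2 * np.pi * lam**2) ** (p / 2))
        scores[:, j] = prior * f_hat
    posteriors = scores / scores.sum(axis=1, keepdims=True)
    return classes[np.argmax(posteriors, axis=1)], posteriors
```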

Naive Bayes Classifier
- In Bayesian classification we need to estimate the class-conditional densities f_j(X).
- What if the input space x is multi-dimensional? If we apply kernel density estimates directly, we run into the same problems that we faced with local regression in high dimensions.
- To avoid these difficulties, assume that the class-conditional density factorizes:
    f_j(X) = \prod_{k=1}^{p} f_{jk}(X_k).
  In other words, we are assuming that the features are independent given the class – the Naïve Bayes model.
- Advantages:
  - Each class-conditional density can be estimated one feature at a time (low variance).
  - If some features are continuous and some are discrete, the method handles the situation seamlessly.
  - The Naïve Bayes classifier works surprisingly well for many problems (why?).
- The discriminant function (the log posterior odds) now has the form of a generalized additive model.
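
A minimal Naïve Bayes sketch for continuous features, estimating each one-dimensional class-conditional density with a Gaussian KDE and working in log space; the bandwidth and the small constant guarding log(0) are illustrative assumptions.

```python
import numpy as np

def kde_1d(x_train, x_query, lam=0.5):
    """One-dimensional Gaussian kernel density estimate."""
    t = (x_query[:, None] - x_train[None, :]) / lam
    return np.exp(-0.5 * t**2).sum(axis=1) / (len(x_train) * lam * np.sqrt(2 * np.pi))

def naive_bayes_kde(X_train, y_train, X_query, lam=0.5):
    """Naive Bayes: the class-conditional density factorizes over features,
    each factor estimated by a 1-D KDE; returns predicted class labels."""
    classes = np.unique(y_train)
    log_post = np.zeros((len(X_query), len(classes)))
    for j, c in enumerate(classes):
        Xc = X_train[y_train == c]
        log_post[:, j] = np.log(len(Xc) / len(X_train))          # log prior
        for k in range(X_train.shape[1]):                        # product over features -> sum of logs
            log_post[:, j] += np.log(kde_1d(Xc[:, k], X_query[:, k], lam) + 1e-300)
    return classes[np.argmax(log_post, axis=1)]
```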

Key Points
- Local assumption underlies all of these methods.
- Usually, bandwidth (λ) selection is more important than the choice of kernel function.
- Low bias and low variance are usually not both achievable in high dimensions.
- Little training, but high online computational complexity.
- Use sparingly: only when really required, e.g., in the high-confusion zone.
- Use when the model may not be used again: there is no need for a training phase.