Lecture Notes for E. Alpaydın, Introduction to Machine Learning, 2e, © 2010 The MIT Press (V1.0)
Lecture Slides by Ethem Alpaydın. Chapter 8: Nonparametric Methods
Parametric Estimation
Parametric (single global model)
- Advantage: it reduces the problem of estimating a probability density function (pdf), discriminant, or regression function to estimating the values of a small number of parameters.
- Disadvantage: this assumption does not always hold, and we may incur a large error if it does not.
Semiparametric (small number of local models)
- Mixture model (Chapter 7): the density is written as a disjunction of a small number of parametric models.
Nonparametric Estimation
- Assumption: similar inputs have similar outputs; functions (pdf, discriminant, regression) change smoothly.
- Keep the training data; "let the data speak for itself."
- Given x, find a small number of closest training instances and interpolate from these.
- Nonparametric methods are also called memory-based or instance-based learning algorithms.
Density Estimation
- Given the training set X = {x^t}_t drawn i.i.d. (independent and identically distributed) from p(x).
- Divide the data into bins of size h.
- Histogram estimator (figure 8.1):
  $$\hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{Nh}$$
- Note: in probability theory and statistics, a collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.
Figure 8.1: Histogram estimator.
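A minimal NumPy sketch of the histogram estimator above. The function name, the `origin` parameter (the left edge of the bin grid), and the synthetic data are my own illustrative choices, not from the slides.

```python
import numpy as np

def histogram_estimator(x, data, h, origin=0.0):
    """Histogram density estimate: (# of x^t in the same bin as x) / (N h)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    # Bin index of the query point and of every training point,
    # for bins [origin + m*h, origin + (m+1)*h)
    bin_x = np.floor((x - origin) / h)
    bin_data = np.floor((data - origin) / h)
    return np.sum(bin_data == bin_x) / (n * h)

# Example: density of 1,000 standard-normal samples, estimated at x = 0
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(histogram_estimator(0.0, sample, h=0.5))
```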
Density Estimation
- Given the training set X = {x^t}_t drawn i.i.d. from p(x).
- Naive estimator: x is always at the center of a bin of width h:
  $$\hat{p}(x) = \frac{\#\{x - h/2 < x^t \le x + h/2\}}{Nh}$$
  or, equivalently,
  $$\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right), \qquad w(u) = \begin{cases}1 & \text{if } |u| < 1/2\\ 0 & \text{otherwise}\end{cases}$$
  (figure 8.2)
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) h=0.5 h=1 Naïve estimator: h=2 8 Naive estimator:
Kernel Estimator
- Kernel function, e.g., the Gaussian kernel:
  $$K(u) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{u^2}{2}\right)$$
- Kernel estimator (Parzen windows), figure 8.3:
  $$\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$$
- If K is Gaussian, then $\hat{p}(x)$ will be smooth, having all derivatives.
Figure 8.3: Kernel estimator.
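A minimal sketch of the Parzen-window estimate with a Gaussian kernel, for one-dimensional data; function names are my own.

```python
import numpy as np

def gaussian_kernel(u):
    """K(u) = exp(-u^2 / 2) / sqrt(2*pi)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernel_estimator(x, data, h):
    """Parzen-window estimate: (1 / (N h)) * sum_t K((x - x^t) / h)."""
    data = np.asarray(data, dtype=float)
    u = (x - data) / h
    return gaussian_kernel(u).sum() / (len(data) * h)
```

Because every training point contributes through the smooth kernel, the estimate varies smoothly in x instead of jumping at bin boundaries.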
k-Nearest Neighbor Estimator
- Instead of fixing the bin width h and counting the number of instances, fix the number of neighbors k and check the bin width:
  $$\hat{p}(x) = \frac{k}{2N\,d_k(x)}$$
  where d_k(x) is the distance from x to its k-th closest training instance.
Figure: k-nearest neighbor estimator.
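A sketch of the k-NN density estimate for one-dimensional data; the function name is mine. Note that if x coincides with a training point and k = 1, the distance is zero and the estimate diverges, which reflects the estimator itself rather than the code.

```python
import numpy as np

def knn_density_estimator(x, data, k):
    """k-NN estimate: k / (2 N d_k(x)), d_k(x) = distance to the k-th closest x^t."""
    data = np.asarray(data, dtype=float)
    d_k = np.sort(np.abs(data - x))[k - 1]  # k-th smallest distance to x
    return k / (2.0 * len(data) * d_k)
```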
Generalization to Multivariate Data
- Kernel density estimator:
  $$\hat{p}(x) = \frac{1}{Nh^d}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$$
  with the requirement that $\int K(x)\,dx = 1$.
- Multivariate Gaussian kernel:
  - spheric: $K(u) = \left(\frac{1}{\sqrt{2\pi}}\right)^{d}\exp\!\left(-\frac{\|u\|^2}{2}\right)$
  - ellipsoid: $K(u) = \frac{1}{(2\pi)^{d/2}|S|^{1/2}}\exp\!\left(-\frac{1}{2}u^{T}S^{-1}u\right)$
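A sketch of the multivariate estimate with the spheric Gaussian kernel, assuming the training data is an N-by-d array with one instance per row; the function name is mine.

```python
import numpy as np

def multivariate_kernel_estimator(x, data, h):
    """(1 / (N h^d)) * sum_t K((x - x^t) / h), with a spheric Gaussian kernel."""
    data = np.asarray(data, dtype=float)   # shape (N, d)
    n, d = data.shape
    u = (x - data) / h                     # scaled differences, shape (N, d)
    k = (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * np.sum(u ** 2, axis=1))
    return k.sum() / (n * h ** d)
```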
Nonparametric Classification
- Estimate p(x|C_i) and use Bayes' rule.
- Kernel estimator:
  $$\hat{p}(x|C_i) = \frac{1}{N_i h^d}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r_i^t, \qquad \hat{P}(C_i) = \frac{N_i}{N}$$
- The discriminant function:
  $$g_i(x) = \hat{p}(x|C_i)\,\hat{P}(C_i) = \frac{1}{Nh^d}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r_i^t$$
  where $r_i^t = 1$ if $x^t \in C_i$ and 0 otherwise.
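A sketch of the kernel-based discriminants, assuming integer class labels 0..n_classes-1 and a spheric Gaussian kernel; names and the argmax prediction step are my own framing of the slide's rule.

```python
import numpy as np

def kernel_discriminants(x, data, labels, h, n_classes):
    """Return g_i(x) = (1 / (N h^d)) * sum_t K((x - x^t) / h) * r_i^t for each class i."""
    data = np.asarray(data, dtype=float)   # shape (N, d)
    labels = np.asarray(labels)
    n, d = data.shape
    u = (x - data) / h
    k = (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * np.sum(u ** 2, axis=1))
    return np.array([k[labels == i].sum() for i in range(n_classes)]) / (n * h ** d)

# Classify by picking the class with the largest discriminant:
# predicted = int(np.argmax(kernel_discriminants(x, data, labels, h=1.0, n_classes=2)))
```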
Nonparametric Classification: k-NN Estimator
- For the special case of the k-NN estimator:
  $$\hat{p}(x|C_i) = \frac{k_i}{N_i\,V^{k}(x)}$$
  where
  - k_i: the number of neighbors out of the k nearest that belong to C_i
  - V^k(x): the volume of the d-dimensional hypersphere centered at x, with radius equal to the distance to the k-th nearest neighbor, so that $V^k(x) = c_d\,d_k(x)^d$
  - c_d: the volume of the unit sphere in d dimensions; for example, $c_1 = 2$, $c_2 = \pi$, $c_3 = 4\pi/3$.
Nonparametric Classification: k-NN Estimator
- From
  $$\hat{p}(x) = \frac{k}{N\,V^{k}(x)}, \qquad \hat{P}(C_i) = \frac{N_i}{N}$$
- Then
  $$\hat{P}(C_i|x) = \frac{\hat{p}(x|C_i)\,\hat{P}(C_i)}{\hat{p}(x)} = \frac{k_i}{k}$$
  That is, the k-NN classifier assigns x to the class having the most instances among its k nearest neighbors.
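A minimal sketch of the resulting k-NN classification rule (majority vote among the k nearest neighbors), assuming Euclidean distance on an N-by-d data array; names are my own.

```python
import numpy as np
from collections import Counter

def knn_classify(x, data, labels, k):
    """Predict the class with the largest k_i among the k nearest training instances."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(data - x, axis=1)     # Euclidean distances to x
    nearest = np.argsort(dist)[:k]              # indices of the k closest instances
    votes = Counter(labels[nearest].tolist())   # k_i for each class among the neighbors
    return votes.most_common(1)[0][0]
```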
Condensed Nearest Neighbor
- The time/space complexity of k-NN is O(N).
- Find a subset Z of X that is small and accurate in classifying X (Hart, 1968).
- Error function:
  $$E'(Z|X) = E(X|Z) + \lambda\,|Z|$$
  where E(X|Z) is the error on X when storing only Z, and |Z| is the cardinality of Z.
Condensed Nearest Neighbor
- Incremental algorithm: add an instance to Z only if it is needed, that is, if the current Z misclassifies it (see the sketch below).
- A greedy method and a local search.
- The result depends on the order of the training instances.
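A sketch of the incremental condensing procedure in the spirit of Hart (1968), assuming a 1-NN rule, Euclidean distance, and an N-by-d data array; the random pass order makes the order dependence mentioned above explicit. Names and the choice of seed instance are mine.

```python
import numpy as np

def condensed_nearest_neighbor(data, labels, seed=0):
    """Greedy 1-NN condensing: keep an instance only if the current subset Z
    misclassifies it; repeat passes until a full pass adds nothing."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    z_idx = [0]                                  # start Z with an arbitrary instance
    changed = True
    while changed:
        changed = False
        for i in rng.permutation(len(data)):     # the result depends on this order
            z_data, z_labels = data[z_idx], labels[z_idx]
            nearest = np.argmin(np.linalg.norm(z_data - data[i], axis=1))
            if z_labels[nearest] != labels[i]:   # Z misclassifies x^i, so add it
                z_idx.append(int(i))
                changed = True
    return data[z_idx], labels[z_idx]
```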
Nonparametric Regression
- A nonparametric regression estimator is also called a smoother. Smoothing models:
  - Running mean smoother (regressogram vs. naive estimator)
  - Kernel smoother
  - Running line smoother
- Regressogram (see the figure on the next slide):
  $$\hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\,r^t}{\sum_{t=1}^{N} b(x, x^t)}, \qquad b(x, x^t) = \begin{cases}1 & \text{if } x^t \text{ is in the same bin as } x\\ 0 & \text{otherwise}\end{cases}$$
Figure: Regressogram.
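A sketch of the regressogram for one-dimensional inputs: average the labels r^t of the training points that fall in the same bin as x. The function name and `origin` parameter are my own.

```python
import numpy as np

def regressogram(x, data_x, data_r, h, origin=0.0):
    """g_hat(x) = mean of r^t over the x^t in the same bin of width h as x."""
    data_x = np.asarray(data_x, dtype=float)
    data_r = np.asarray(data_r, dtype=float)
    same_bin = np.floor((data_x - origin) / h) == np.floor((x - origin) / h)
    if not np.any(same_bin):
        return np.nan            # empty bin: the estimate is undefined there
    return data_r[same_bin].mean()
```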
Running Mean Smoother
- Running mean smoother (the regression analogue of the naive estimator), figure 8.8:
  $$\hat{g}(x) = \frac{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right)}, \qquad w(u) = \begin{cases}1 & \text{if } |u| < 1\\ 0 & \text{otherwise}\end{cases}$$
Figure 8.8: Running mean smoother.
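A sketch of the running mean smoother: instead of fixed bins, a symmetric window of half-width h is centered at the query point x. The function name is mine.

```python
import numpy as np

def running_mean_smoother(x, data_x, data_r, h):
    """g_hat(x) = mean of r^t over the training x^t with |x - x^t| < h."""
    data_x = np.asarray(data_x, dtype=float)
    data_r = np.asarray(data_r, dtype=float)
    in_window = np.abs((x - data_x) / h) < 1.0
    if not np.any(in_window):
        return np.nan            # no neighbors in the window
    return data_r[in_window].mean()
```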
Kernel Smoother
- Kernel smoother, figure 8.9:
  $$\hat{g}(x) = \frac{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)}$$
  where K(·) is typically the Gaussian kernel.
Figure 8.9: Kernel smoother.
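A sketch of the kernel smoother with a Gaussian kernel; the normalizing constant of the kernel cancels between numerator and denominator, so an unnormalized Gaussian weight suffices. The function name is mine.

```python
import numpy as np

def kernel_smoother(x, data_x, data_r, h):
    """g_hat(x) = sum_t K((x - x^t)/h) r^t / sum_t K((x - x^t)/h), Gaussian K."""
    data_x = np.asarray(data_x, dtype=float)
    data_r = np.asarray(data_r, dtype=float)
    weights = np.exp(-0.5 * ((x - data_x) / h) ** 2)   # unnormalized Gaussian weights
    return np.sum(weights * data_r) / np.sum(weights)
```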
Running Line Smoother
- Instead of taking an average and giving a constant fit at a point, we can take one more term of the Taylor expansion into account and compute a linear fit:
  $$F(x) = F(a) + \frac{F'(a)}{1!}(x-a) + \frac{F''(a)}{2!}(x-a)^2 + \cdots$$
- Running line smoother (figure 8.10): use the data points in the neighborhood, as defined by h or k, and fit a local regression line.
Figure 8.10: Running line smoother.
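A sketch of the running line smoother, assuming an h-defined neighborhood and an ordinary least-squares line fit inside it; the fallback to a local mean for degenerate windows is my own safeguard, not part of the slide.

```python
import numpy as np

def running_line_smoother(x, data_x, data_r, h):
    """Fit a line by least squares to the pairs (x^t, r^t) with |x - x^t| < h,
    then evaluate the line at x."""
    data_x = np.asarray(data_x, dtype=float)
    data_r = np.asarray(data_r, dtype=float)
    in_window = np.abs((x - data_x) / h) < 1.0
    xs, rs = data_x[in_window], data_r[in_window]
    if xs.size < 2 or np.ptp(xs) == 0.0:
        return rs.mean() if xs.size else np.nan    # too few points for a line
    slope, intercept = np.polyfit(xs, rs, deg=1)   # local linear fit
    return slope * x + intercept
```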
How to Choose k or h?
- When k or h is small: single instances matter; bias is small, variance is large; undersmoothing; high complexity.
- As k or h increases: we average over more instances; variance decreases but bias increases; oversmoothing; low complexity.
- Cross-validation is used to fine-tune k or h (see the sketch below).
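A sketch of choosing h by k-fold cross-validation, using a Gaussian kernel smoother as the model being tuned; the function names, fold count, candidate grid, and synthetic sine data are my own illustrative choices.

```python
import numpy as np

def kernel_smooth(x, train_x, train_r, h):
    """Gaussian kernel smoother used as the model being tuned."""
    w = np.exp(-0.5 * ((x - train_x) / h) ** 2)
    return np.sum(w * train_r) / np.sum(w)

def cv_choose_h(data_x, data_r, candidates, n_folds=5, seed=0):
    """Return the bandwidth h with the lowest mean squared error on held-out folds."""
    data_x = np.asarray(data_x, dtype=float)
    data_r = np.asarray(data_r, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data_x)), n_folds)
    avg_err = []
    for h in candidates:
        errs = []
        for val in folds:
            train = np.setdiff1d(np.arange(len(data_x)), val)
            preds = np.array([kernel_smooth(x, data_x[train], data_r[train], h)
                              for x in data_x[val]])
            errs.append(np.mean((preds - data_r[val]) ** 2))
        avg_err.append(np.mean(errs))
    return candidates[int(np.argmin(avg_err))]

# Example: noisy sine data, candidate bandwidths from very narrow to very wide
rng = np.random.default_rng(1)
X = rng.uniform(0, 2 * np.pi, 200)
R = np.sin(X) + rng.normal(scale=0.3, size=200)
print(cv_choose_h(X, R, candidates=[0.05, 0.1, 0.25, 0.5, 1.0]))
```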
Kernel estimates for various bin lengths for a two-class problem. Plotted are the conditional densities p(x|C_i). It seems that the top one oversmooths and the bottom one undersmooths, but whichever is best depends on where the validation data points are.