
1 Chapter 1: Introduction; Chapter 2: Overview of Supervised Learning. 2006.01.20

2 Supervised learning. Training data set: several features plus an outcome for each observation. Build a learner from the training data set. Predict the unseen outcome of future data from its observed features.

3 An example of supervised learning: email spam. [Diagram: known normal emails and known spam are fed to a learner, which then labels new (unknown) emails as normal or spam.]

4 Input & Output. Input = predictor = independent variable. Output = response = dependent variable.

5 Output Types. Quantitative >> regression. Ex) stock price, temperature, age. Qualitative >> classification. Ex) yes/no.

6 Input Types. Quantitative; qualitative; ordered categorical. Ex) small, medium, big.

7 Terminology. $X$: input; $X_j$: the $j$-th component; $\mathbf{X}$: the input matrix; $x_j$: the $j$-th observed value; $Y$: quantitative output; $\hat{Y}$: prediction; $G$: qualitative output.

8 General model. Given input $X$ and output $Y$, assume $Y = f(X) + \varepsilon$, where the function $f$ is unknown. We want to estimate $f$ from a known data set (the training data).

9 Two simple methods: the linear model (linear regression) and the nearest-neighbor method.

10 Linear model. Given a vector of input features $X = (X_1, \ldots, X_p)$, assume the linear relationship $\hat{Y} = X^T \hat{\beta}$ (with a constant 1 included in $X$ for the intercept). Least squares criterion: choose $\beta$ to minimize $\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2$, which gives $\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$.
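A minimal sketch of the least squares fit above in Python/NumPy; the synthetic data, the "true" coefficients, and the use of `lstsq` rather than an explicit matrix inverse are all illustrative assumptions, not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: N observations of p = 2 features (hypothetical values).
N, p = 100, 2
X = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0])
y = 0.5 + X @ beta_true + rng.normal(scale=0.3, size=N)

# Prepend a column of ones so the intercept is part of beta.
X1 = np.hstack([np.ones((N, 1)), X])

# Least squares: minimize RSS(beta); lstsq solves the normal equations stably.
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Prediction for a new input x0 (leading 1 for the intercept).
x0 = np.array([1.0, 0.2, -0.1])
print(beta_hat, x0 @ beta_hat)
```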

11 Classification example in two dimensions (1). [Figure]

12 Nearest neighbor method. Classify a new point by majority vote among its $k$ nearest neighbors. [Figure: the pictured new point is classified brown for $k = 1$ and green for $k = 3$.]
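A minimal sketch of k-nearest-neighbor classification by majority vote, assuming Euclidean distance; the toy points and the two class labels ("brown"/"green") are illustrative assumptions, chosen so that k = 1 and k = 3 disagree, as in the slide's figure.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy two-class training data in two dimensions (hypothetical values).
X_train = np.array([[0.5, 0.5], [0.0, 0.0], [0.7, 0.6], [0.6, 0.8], [1.0, 1.0]])
y_train = np.array(["brown", "brown", "green", "green", "green"])

x_new = np.array([0.55, 0.55])
print(knn_predict(X_train, y_train, x_new, k=1))  # brown: the single closest point
print(knn_predict(X_train, y_train, x_new, k=3))  # green: 2 of the 3 closest points
```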

13 Classification example in two dimensions (2). [Figure]

14 Linear model vs. k-nearest neighbor. Linear model: #parameters = p; stable, smooth fit; low variance, high bias. k-nearest neighbor: effective #parameters = N/k; unstable, wiggly fit; high variance, low bias. Each method has its own situations for which it works best.

15 Misclassification curves

16 Enhanced Methods. Kernel methods using weights (modifying the distance kernel); locally weighted least squares; expansion of the inputs for arbitrarily complex models; projection pursuit and neural networks.
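A minimal sketch of one of these enhancements, kernel-weighted (locally weighted) least squares at a single target point. The Gaussian kernel, the one-dimensional input, the bandwidth, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def local_linear_fit(x0, X, y, bandwidth=0.5):
    """Fit a line by kernel-weighted least squares around x0; return the fitted value at x0."""
    w = np.exp(-0.5 * ((X - x0) / bandwidth) ** 2)    # Gaussian kernel weights
    A = np.column_stack([np.ones_like(X), X])         # local design matrix [1, x]
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)  # weighted normal equations
    return beta[0] + beta[1] * x0

# Noisy nonlinear data; the local fit tracks the curve near x0 = 1.0.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 3, size=80))
y = np.sin(2 * X) + rng.normal(scale=0.2, size=80)
print(local_linear_fit(1.0, X, y))   # close to sin(2.0) ~ 0.91
```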

17 Statistical decision theory (1). Given input $X \in \mathbb{R}^p$ and output $Y \in \mathbb{R}$ with joint distribution $\Pr(X, Y)$, we look for a prediction function $f(X)$. Squared error loss: $L(Y, f(X)) = (Y - f(X))^2$, with expected prediction error $\mathrm{EPE}(f) = E[(Y - f(X))^2]$. Nearest-neighbor methods directly approximate the minimizer of the EPE.
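A short restatement, as a sketch, of the standard argument that the conditional mean minimizes the EPE under squared error loss (the usual textbook derivation, not something specific to these slides):

```latex
% Condition on X, then minimize pointwise in x.
\begin{align*}
  \mathrm{EPE}(f) &= E\big[(Y - f(X))^2\big]
                   = E_X\, E_{Y \mid X}\big[(Y - f(X))^2 \,\big|\, X\big], \\
  f(x) &= \operatorname*{arg\,min}_{c}\; E_{Y \mid X}\big[(Y - c)^2 \,\big|\, X = x\big]
        = E[\,Y \mid X = x\,] \quad \text{(the regression function).}
\end{align*}
```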

18 Statistical decision theory (2). k-nearest neighbor: estimate the regression function by $\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$; as $N, k \to \infty$ with $k/N \to 0$, $\hat{f}(x) \to E[Y \mid X = x]$. In practice: insufficient samples, and the curse of dimensionality! Linear model: assume $f(x) \approx x^T \beta$ and estimate $\beta$ by least squares. But the true function might not be linear!

19 Statistical decision theory (3). If we instead use the absolute error loss $E\,|\,Y - f(X)\,|$, the solution is the conditional median, $\hat{f}(x) = \mathrm{median}(Y \mid X = x)$. This is more robust, but the $L_1$ criterion is discontinuous in its derivatives.

20 Statistical decision theory (4). $G$: categorical output variable; $L$: loss function. $\mathrm{EPE} = E[L(G, \hat{G}(X))]$. Minimizing this pointwise with 0-1 loss yields the Bayes classifier: $\hat{G}(x) = \operatorname*{arg\,max}_{g} \Pr(g \mid X = x)$, i.e. classify to the most probable class given $X = x$.
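A small sketch of the Bayes classifier for a case where the class posteriors are known exactly: two classes with equal priors and one-dimensional Gaussian class-conditional densities. The class names, means, and variance are illustrative assumptions.

```python
from scipy.stats import norm

# Known class-conditional densities (1-D Gaussians) and equal priors of 0.5.
means = {"orange": -1.0, "blue": 1.0}
sigma = 1.0

def bayes_classify(x):
    """Classify to the class g maximizing the posterior Pr(g | X = x)."""
    posteriors = {g: 0.5 * norm.pdf(x, loc=m, scale=sigma) for g, m in means.items()}
    return max(posteriors, key=posteriors.get)

print(bayes_classify(-0.3))  # "orange": x is closer to the orange mean
print(bayes_classify(0.7))   # "blue"
```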

21 References. Reading group on "Elements of Statistical Learning" – overview.ppt: http://sifaka.cs.uiuc.edu/taotao/stat.html. Welcome to STAT 894 – SupervisedLearningOVERVIEW05.pdf: http://www.stat.ohio-state.edu/~goel/STATLEARN/. The Matrix Cookbook: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf. A First Course in Probability.

22 2.5 Local Methods in High Dimensions. With a reasonably large set of training data, we could seemingly always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging; this intuition breaks down in high dimensions: the curse of dimensionality. To capture 1% of the data to form a local average (with $p = 10$), we must cover 63% of the range of each input variable: the expected edge length of a subcube capturing a fraction $r$ of uniformly distributed data is $e_p(r) = r^{1/p}$, and $e_{10}(0.01) \approx 0.63$. All sample points are also close to an edge of the sample: the median distance from the origin to the closest of $N$ data points uniform in the unit ball is $d(p, N) = \big(1 - (1/2)^{1/N}\big)^{1/p}$.
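A quick numeric check of the two quantities above; the particular values p = 10 and N = 500 follow the textbook's example and are otherwise just illustrative.

```python
def edge_length(r, p):
    """Expected edge length of a subcube capturing fraction r of uniform data in [0, 1]^p."""
    return r ** (1.0 / p)

def median_nearest_distance(p, N):
    """Median distance from the origin to the closest of N points uniform in the unit p-ball."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(edge_length(0.01, p=10))               # ~0.63: need 63% of each axis for 1% of the data
print(median_nearest_distance(p=10, N=500))  # ~0.52: the closest point is past the halfway mark
```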

23 2.5 Local Methods in High Dimensions. Example: 1-NN vs. the linear model. 1-NN: as $p$ increases, the MSE and the bias both tend to 1.0. Linear model: taking the expectation over $x_0$, the expected EPE increases only linearly as a function of $p$; the variance grows with $p$ while the squared bias is 0. By relying on rigid assumptions that happen to hold here, the linear model has no bias at all and negligible variance, while the error of 1-nearest neighbor is substantially larger.

24 2.6 Statistical Models, Supervised Learning and Function Approximation. The goal is finding a useful approximation to the function that underlies the predictive relationship between the inputs and outputs. Supervised learning: the machine learning point of view. Function approximation: the mathematics and statistics point of view.

25 2.7 Structured Regression Models. Nearest-neighbor and other local methods face problems in high dimensions, and may be inappropriate even in low dimensions; hence the need for structured approaches. Difficulty of the problem: there are infinitely many functions that minimize the RSS over a finite training set; a unique solution comes only from placing restrictions on $f$.

26 2.8 Classes of Restricted Estimators. Methods can be categorized by the nature of their restrictions. Roughness penalty and Bayesian methods: penalize functions that vary too rapidly over small regions of input space. Kernel methods and local regression: explicitly specify the nature of the local neighborhood (the kernel function); need adaptation in high dimensions. Basis functions and dictionary methods: a linear expansion of basis functions.
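One standard concrete instance of a roughness penalty is the cubic smoothing spline criterion; the form below is the usual textbook one, with $\lambda$ the smoothing parameter.

```latex
% Penalized residual sum of squares for a one-dimensional smoothing spline:
% lambda = 0 leaves f unrestricted (any interpolant minimizes the RSS),
% large lambda forces f toward very smooth (ultimately linear) functions.
\mathrm{PRSS}(f; \lambda) \;=\; \sum_{i=1}^{N} \big(y_i - f(x_i)\big)^2
  \;+\; \lambda \int \big[f''(x)\big]^2 \, dx
```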

27 2.9 Model Selection and the Bias-Variance Tradeoff. All of these models have a smoothing or complexity parameter to be determined: the multiplier of the penalty term, the width of the kernel, or the number of basis functions.

28 Bias-Variance tradeoff. The expected prediction error decomposes into irreducible error, squared bias, and variance. The irreducible error comes from $\varepsilon$ and is essential: there is no way to reduce it. For the other two terms, reducing one typically increases the other. Tradeoff!

29 Bias-Variance tradeoff in kNN
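For k-nearest-neighbor regression this tradeoff can be written explicitly; the decomposition below is the standard textbook one (training inputs treated as fixed, additive noise of variance $\sigma^2$, and $x_{(\ell)}$ the $\ell$-th nearest training point to $x_0$).

```latex
% Irreducible error + squared bias + variance for the k-NN fit at x_0:
% small k -> low bias but high variance; large k -> the reverse.
\mathrm{EPE}_k(x_0) \;=\; \sigma^2
  \;+\; \Big[\, f(x_0) - \tfrac{1}{k} \sum_{\ell=1}^{k} f\big(x_{(\ell)}\big) \Big]^2
  \;+\; \frac{\sigma^2}{k}
```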

30 Model complexity. [Figure: prediction error for the training and test samples versus model complexity; low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]

