1
Chapter 1: Introduction. Chapter 2: Overview of Supervised Learning. 2006.01.20
2
Supervised learning
- Training data set: several features and an outcome for each observation.
- Build a learner from the training data.
- Predict the unseen outcome of future data from its observed features.
3
An example of supervised learning: email spam
[Diagram: known normal emails and known spam are fed to a learner, which then labels new (unknown) emails as normal or spam.]
4
Input & Output
- Input = predictor = independent variable
- Output = response = dependent variable
5
Output types
- Quantitative >> regression. Ex) stock price, temperature, age
- Qualitative >> classification. Ex) yes/no
6
Input types
- Quantitative
- Qualitative
- Ordered categorical. Ex) small, medium, big
7
Terminology
- X : input variable; X_j : its j-th component
- X (matrix) : the matrix of observed inputs
- x_j : the j-th observed value
- Y : quantitative output; Ŷ : its prediction
- G : qualitative output
8
General model
Given input X and output Y, we want to estimate the (unknown) function f relating them, based on a known data set (the training data).
9
Two simple methods
- Linear model (linear regression)
- Nearest-neighbor method
10
Linear model
Given a vector of input features X = (X_1, ..., X_p), assume the linear relationship
  \hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j.
Least squares criterion: choose \beta to minimize
  RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2.
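A minimal numerical sketch of this least squares fit (not from the slides; the toy data, dimensions, and variable names are illustrative), using NumPy:

```python
import numpy as np

# Toy training data: N observations of p features (illustrative values).
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)

# Prepend a column of ones so beta_hat[0] plays the role of the intercept.
X1 = np.hstack([np.ones((N, 1)), X])

# Least squares: minimize RSS(beta) = ||y - X1 beta||^2.
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Prediction for a new input: y_hat = beta_0 + sum_j x_j * beta_j.
x_new = np.array([1.0, 0.2, -0.1, 0.4])   # leading 1 for the intercept
y_hat = x_new @ beta_hat
```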
11
Classification example in two dimensions (1)
[Figure: two-class scatter plot with the fitted decision boundary.]
12
Nearest-neighbor method
Classify by majority vote among the k nearest neighbors.
[Diagram: a new point is labeled brown when k = 1 but green when k = 3.]
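A hedged sketch of the k-nearest-neighbor vote (not from the slides; the helper name, toy points, and labels are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points
    (Euclidean distance). A minimal sketch; ties are broken arbitrarily."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny illustrative example: two classes in the plane.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(["brown", "brown", "green", "green"])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=1))
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))
```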
13
Classification example in two dimensions (2)
[Figure: the same two-class data with a nearest-neighbor decision boundary.]
14
Linear model vs. k-nearest neighbors
- Linear model: about p parameters; stable, smooth fit; low variance, high bias.
- k-nearest neighbors: effectively N/k parameters; unstable, wiggly fit; high variance, low bias.
Each method has its own situations for which it works best.
15
Misclassification curves
16
Enhanced methods
- Kernel methods using weights; modified distance kernels.
- Locally weighted least squares (sketched below).
- Expansion of the inputs, allowing arbitrarily complex models.
- Projection pursuit and neural networks.
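A minimal one-dimensional sketch of locally weighted least squares, as referenced in the list above (the function name, Gaussian kernel choice, and bandwidth are assumptions for illustration, not from the slides):

```python
import numpy as np

def local_linear_fit(x_train, y_train, x0, bandwidth=0.5):
    """Fit a weighted straight line around the query point x0 and return
    its prediction there. Weights come from a Gaussian kernel on distance."""
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)   # kernel weights
    X = np.column_stack([np.ones_like(x_train), x_train])  # intercept + slope
    W = np.diag(w)
    # Weighted least squares: beta = (X^T W X)^{-1} X^T W y
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    return beta[0] + beta[1] * x0
```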
17
Statistical decision theory (1)
Given input X in R^p and output Y in R with joint distribution Pr(X, Y), we look for a prediction function f(X).
Squared error loss: L(Y, f(X)) = (Y - f(X))^2, so EPE(f) = E[(Y - f(X))^2].
The pointwise minimizer is the conditional expectation f(x) = E[Y | X = x]; nearest-neighbor methods approximate this minimum-EPE solution directly by averaging the y_i near x.
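The standard derivation behind this slide, restated in LaTeX (a sketch of the usual squared-error argument):

```latex
\mathrm{EPE}(f) \;=\; \mathrm{E}\,[Y - f(X)]^{2}
   \;=\; \mathrm{E}_{X}\,\mathrm{E}_{Y\mid X}\!\bigl[(Y - f(X))^{2} \mid X\bigr]
% Minimizing pointwise over the value c = f(x):
f(x) \;=\; \arg\min_{c}\; \mathrm{E}_{Y\mid X}\bigl[(Y - c)^{2} \mid X = x\bigr]
      \;=\; \mathrm{E}\,[\,Y \mid X = x\,]
% The best prediction under squared error is the conditional mean (regression function).
```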
18
Statistical decision theory (2)
- k-nearest neighbors: \hat{f}(x) = Ave(y_i | x_i \in N_k(x)). As N, k \to \infty with k/N \to 0, \hat{f}(x) \to E[Y | X = x]. In practice the sample is often insufficient, and the curse of dimensionality hurts local averaging.
- Linear model: assume f(x) \approx x^T \beta. But the true function might not be linear!
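A short restatement of the linear-model case in LaTeX (standard result; the least-squares line is the sample analogue):

```latex
% Restricting f to be linear, f(x) \approx x^{T}\beta, and minimizing EPE over \beta gives
\beta = \bigl[\mathrm{E}(X X^{T})\bigr]^{-1}\,\mathrm{E}(X Y)
% Least squares replaces these expectations by averages over the training data:
\hat{\beta} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}
```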
19
Statistical decision theory (3)
If we use absolute error loss, L(Y, f(X)) = |Y - f(X)|, the solution is the conditional median, \hat{f}(x) = median(Y | X = x). This is more robust than the conditional mean, but the L1 criterion has discontinuities in its derivatives.
20
Statistical decision theory (4)
G : categorical output variable; L : loss matrix; EPE = E[L(G, \hat{G}(X))].
Minimizing EPE pointwise yields the Bayes classifier.
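The Bayes classifier written out in LaTeX (the standard 0-1 loss form):

```latex
% With 0-1 loss, minimizing EPE = E[L(G, \hat{G}(X))] pointwise over classes gives
\hat{G}(x) \;=\; \arg\max_{g}\; \Pr(G = g \mid X = x)
% i.e. classify to the most probable class given the observed input x.
```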
21
References
- Reading group on "The Elements of Statistical Learning" (overview.ppt): http://sifaka.cs.uiuc.edu/taotao/stat.html
- Welcome to STAT 894 (SupervisedLearningOVERVIEW05.pdf): http://www.stat.ohio-state.edu/~goel/STATLEARN/
- The Matrix Cookbook: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
- A First Course in Probability
22
2.5 Local Methods in High Dimensions
With a reasonably large set of training data we could, in principle, approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging. This intuition breaks down in high dimensions: the curse of dimensionality.
- For inputs uniformly distributed in the p-dimensional unit hypercube, a hypercubical neighborhood capturing a fraction r of the data has expected edge length e_p(r) = r^{1/p}; with p = 10, capturing 1% of the data requires covering about 63% of the range of each input variable.
- All sample points are close to an edge of the sample: for N points uniform in the p-dimensional unit ball, the median distance from the origin to the closest data point is d(p, N) = (1 - (1/2)^{1/N})^{1/p}.
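A small numeric check of these two formulas (a sketch; the function names and the example values p = 10, N = 500 are just for illustration):

```python
def edge_length(r, p):
    """Edge length of a hypercube capturing a fraction r of uniformly
    distributed data in the p-dimensional unit cube: e_p(r) = r**(1/p)."""
    return r ** (1.0 / p)

def median_closest_distance(p, N):
    """Median distance from the origin to the nearest of N points drawn
    uniformly in the p-dimensional unit ball: (1 - 0.5**(1/N))**(1/p)."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(edge_length(0.01, 10))             # ~0.63: 1% of the data needs 63% of each axis
print(median_closest_distance(10, 500))  # ~0.52: the nearest point is over halfway to the boundary
```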
23
2.5 Local Methods in High Dimensions: 1-NN vs. linear regression
- 1-NN: in the example with a nonlinear target, as p increases the MSE (dominated by the squared bias) tends to 1.0.
- Linear model: when the true relationship is linear in x, the expected EPE at a test point x_0 increases only linearly as a function of p (variance \approx \sigma^2 p/N), and the squared bias is 0.
By relying on a rigid assumption that happens to be correct, the linear model has no bias at all and negligible variance, while the error of 1-nearest neighbor is substantially larger.
24
2.6 Statistical Models, Supervised Learning and Function Approximation
Goal: find a useful approximation to the function that underlies the predictive relationship between the inputs and outputs.
- Supervised learning: the machine learning point of view.
- Function approximation: the mathematics and statistics point of view.
25
2.7 Structured Regression Models
Nearest-neighbor and other local methods face problems in high dimensions, and may be inappropriate even in low dimensions; hence the need for structured approaches.
Difficulty of the problem: over a finite training set there are infinitely many functions that minimize the RSS; a unique solution only comes from placing restrictions on f.
26
2.8 Classes of Restricted Estimators
Methods categorized by the nature of the restrictions:
- Roughness penalty and Bayesian methods: penalize functions that vary too rapidly over small regions of input space (the penalized criterion is sketched after this list).
- Kernel methods and local regression: explicitly specify the nature of the local neighborhood via a kernel function; need adaptation in high dimensions.
- Basis functions and dictionary methods: linear expansions in sets of basis functions.
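As referenced in the first item, a sketch of the roughness-penalty criterion (the cubic-smoothing-spline penalty is given as one standard example):

```latex
% Penalized residual sum of squares:
\mathrm{PRSS}(f;\lambda) \;=\; \sum_{i=1}^{N} \bigl(y_i - f(x_i)\bigr)^2 \;+\; \lambda\, J(f)
% J(f) is large for functions that vary too rapidly over small regions of input space,
% e.g. J(f) = \int \{f''(x)\}^{2}\,dx for the one-dimensional cubic smoothing spline,
% and \lambda \ge 0 controls the strength of the penalty.
```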
27
2.9 Model Selection and the Bias-Variance Tradeoff
All of these models have a smoothing or complexity parameter that must be determined:
- the multiplier of the penalty term,
- the width of the kernel, or
- the number of basis functions.
28
Bias-Variance tradeoff
The error contributed by ε is irreducible: no model can remove it. The remaining error splits into squared bias and variance, and reducing one typically increases the other. Tradeoff!
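The standard decomposition behind this slide, written out in LaTeX (assuming the additive-error model Y = f(X) + ε with Var(ε) = σ_ε²):

```latex
\mathrm{Err}(x_0) \;=\; \mathrm{E}\bigl[(Y - \hat{f}(x_0))^{2} \mid X = x_0\bigr]
  \;=\; \sigma_{\varepsilon}^{2}
  \;+\; \bigl[\mathrm{E}\,\hat{f}(x_0) - f(x_0)\bigr]^{2}
  \;+\; \mathrm{E}\bigl[\hat{f}(x_0) - \mathrm{E}\,\hat{f}(x_0)\bigr]^{2}
% = irreducible error + squared bias + variance; only the last two depend on the model.
```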
29
Bias-Variance tradeoff in kNN
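For the k-NN case shown in this figure, the decomposition specializes as below (standard form; training inputs treated as fixed):

```latex
\mathrm{EPE}_k(x_0) \;=\; \sigma^{2}
  \;+\; \Bigl[f(x_0) - \tfrac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Bigr]^{2}
  \;+\; \frac{\sigma^{2}}{k}
% x_{(1)},\dots,x_{(k)} are the k nearest neighbors of x_0.
% Small k: low bias, high variance; large k: higher bias, lower variance.
```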
30
Model complexity
[Figure: training error and test error as functions of model complexity. X-axis: model complexity (low to high); y-axis: prediction error. Low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]