
Regression

We have talked about regression problems before: the problem of estimating the mapping f(x) between an independent variable x and a dependent variable y. We assume we have a dataset D = {(x_i, t_i)}, i = 1, ..., N, from which to estimate this mapping. (Several slides are based on lecture notes for E. Alpaydın, Introduction to Machine Learning, © The MIT Press, 2004.)

In these slides, we will see:
- the expected loss in regression;
- suitable loss functions, e.g. the squared error between the estimate and the actual value, (f(x_i) − t_i)²;
- that the best estimate of f(x), the one that minimizes the expected squared error, is the conditional mean, y(x) = E[t|x];
- the concept of inherent noise;
- simple linear regression as a specific example.
You should understand everything (except hidden slides or slides marked as ADVANCED).

Loss for Regression

Decision Theory for Regression

The loss function for regression, L(t, y(x)), measures the penalty incurred when we predict y(x) and the true target is t. The expected loss is

E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt

Regression

Let's first define the conditional expectation of t given x:

E[t|x] = ∫ t p(t|x) dt
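As a quick numerical check of this definition, here is a minimal sketch (the conditional density used is a made-up example: at one fixed x, p(t|x) is taken to be a Gaussian with mean 2 and unit variance); a grid approximation of the integral should recover that mean:

```python
import numpy as np

# A hypothetical conditional density p(t|x) at one fixed x: Gaussian, mean 2.0, std 1.0.
t = np.linspace(-6.0, 10.0, 2001)
dt = t[1] - t[0]
p_t_given_x = np.exp(-0.5 * (t - 2.0) ** 2) / np.sqrt(2.0 * np.pi)

# E[t|x] = ∫ t p(t|x) dt, approximated as a Riemann sum on the grid.
cond_mean = np.sum(t * p_t_given_x) * dt
print(cond_mean)  # ≈ 2.0
```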

The Squared Loss Function

If we use the squared loss as the loss function,

E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt,

then after some calculations (next slides) we can show that

E[L] = ∫ {y(x) − E[t|x]}² p(x) dx + ∫ var[t|x] p(x) dx,

where var[t|x] = ∫ {E[t|x] − t}² p(t|x) dt.
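A minimal simulation sketch of this decomposition (all data here is synthetic and only for illustration): with t = sin(x) plus Gaussian noise, the expected squared loss of any predictor y(x) should equal its squared distance to E[t|x] = sin(x) plus the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 2.0 * np.pi, n)
noise_std = 0.3
t = np.sin(x) + rng.normal(0.0, noise_std, n)   # E[t|x] = sin(x), var[t|x] = 0.09

def y(x):
    # A deliberately imperfect predictor, for illustration.
    return 0.8 * np.sin(x)

expected_loss = np.mean((y(x) - t) ** 2)
term1 = np.mean((y(x) - np.sin(x)) ** 2)        # ∫ {y(x) − E[t|x]}² p(x) dx
term2 = noise_std ** 2                           # ∫ var[t|x] p(x) dx
print(expected_loss, term1 + term2)              # the two values should be close
```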

Explanation: ADVANCED

Consider the first term inside the loss, ∫∫ {y(x) − E[t|x]}² p(x, t) dt dx. Since {y(x) − E[t|x]}² does not depend on t, it can be moved outside the inner integral over t; the remaining integral ∫ p(x, t) dt equals p(x), since we are summing the probabilities over all possible t. This leaves ∫ {y(x) − E[t|x]}² p(x) dx.

Explanation: ADVANCED

Consider the second (cross) term inside the loss, 2 ∫∫ {y(x) − E[t|x]} {E[t|x] − t} p(x, t) dt dx. This is equal to zero: since {y(x) − E[t|x]} does not depend on t, it can be moved outside the inner integral over t, and the remaining integral over t vanishes (next slide).

Explanation for the last step: ADVANCED

E[t|x] does not vary with different values of t, so it can be moved out of the integral over t:

∫ {E[t|x] − t} p(t|x) dt = E[t|x] − E[t|x] = 0

Notice that you could also see this immediately: the expected value of the deviations of the random variable t from its mean is 0 (first line of the formula).

Explanation: ADVANCED

Consider the third term, ∫∫ {E[t|x] − t}² p(x, t) dt dx. It does not involve y(x) at all; writing p(x, t) = p(t|x) p(x), it equals ∫ var[t|x] p(x) dx.

IMPORTANT RESULTS

Hence we have:

E[L] = ∫ {y(x) − E[t|x]}² p(x) dx + ∫ var[t|x] p(x) dx

The first term is minimized (driven to zero) when we select y(x) = E[t|x]. The second term is independent of y(x) and represents the intrinsic variability of the target; it is called the intrinsic error.

Alternative approach/explanation

Using the squared error as the loss function, we want to choose y(x) to minimize the expected loss:

E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt

Setting the derivative of E[L] with respect to y(x) to zero gives 2 ∫ {y(x) − t} p(x, t) dt = 0. Solving for y(x), we get:

y(x) = ∫ t p(t|x) dt = E[t|x]

Inverse Problems

Linear Regression

Some content in this section is from Milos Hauskrecht.

[Figure: data points (x_i, y_i) in the x–y plane, generated as y_i = h(x_i) + noise.]

We have a regression problem where we would like to estimate the scalar dependent variable y as a linear function of the independent variable x:

y_i = β₀ + β₁ x_i + ε_i

We can put these together for the whole dataset to obtain:

y = Xβ + ε

In vector notation, with the extended input vector x = (1, x₁, ..., x_d)ᵀ (a leading 1 absorbs the intercept):

y = βᵀx + ε, with β = (β₀, β₁, ..., β_d)ᵀ
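A minimal sketch of what the stacked system looks like in code (the data here is synthetic and made up purely for illustration): each row of the design matrix X is one extended input vector, so the whole dataset satisfies y ≈ Xβ.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=50)             # raw 1-D inputs
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, 50)    # synthetic targets: β0 = 2, β1 = 3, plus noise

# Extended inputs: prepend a column of ones so the intercept β0 is absorbed into β.
X = np.column_stack([np.ones_like(x), x])        # shape (50, 2); row i is (1, x_i)
```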

Note the similarity to a linear neuron: the output is a weighted sum of the inputs plus a bias term.


There are many alternative solutions, with different assumptions or slightly different results:
- Ordinary Least Squares minimizes the sum of squared residuals ‖y − Xβ‖² to find β as β̂ = (XᵀX)⁻¹Xᵀy = X⁺y, where y = Xβ + ε and X⁺ is the pseudo-inverse of X.
- Maximum Likelihood estimation
- Gradient descent
- ...
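A minimal sketch of two of these approaches on the synthetic data from the previous snippet (repeated here so the block is self-contained); in practice a library routine such as np.linalg.lstsq would usually be preferred over forming (XᵀX)⁻¹ explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, 50)
X = np.column_stack([np.ones_like(x), x])        # extended inputs (1, x_i)

# Ordinary Least Squares via the pseudo-inverse: β̂ = X⁺ y = (XᵀX)⁻¹ Xᵀ y.
beta_ols = np.linalg.pinv(X) @ y

# The same fit by batch gradient descent on the mean squared residual.
beta_gd = np.zeros(X.shape[1])
lr = 0.1
for _ in range(5000):
    grad = 2.0 * X.T @ (X @ beta_gd - y) / len(y)  # gradient of mean((Xβ − y)²)
    beta_gd -= lr * grad

print(beta_ols, beta_gd)  # both should be close to (2, 3) for this synthetic data
```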


Bias-Variance Decomposition (REST NOT COVERED)

The Bias-Variance Decomposition (1)

Recall the expected squared loss,

E[L] = ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x, t) dx dt,

where h(x) = E[t|x] = ∫ t p(t|x) dt.

We said that the second term corresponds to the noise inherent in the random variable t. What about the first term?

The Bias-Variance Decomposition (2)

Suppose we were given multiple data sets, each of size N. Any particular data set, D, will give a particular function y(x; D). Consider the error in the estimation:

{y(x; D) − h(x)}² = {y(x; D) − E_D[y(x; D)] + E_D[y(x; D)] − h(x)}²
= {y(x; D) − E_D[y(x; D)]}² + {E_D[y(x; D)] − h(x)}² + 2 {y(x; D) − E_D[y(x; D)]} {E_D[y(x; D)] − h(x)}

The Bias-Variance Decomposition (3)

Taking the expectation over D yields (the cross term vanishes):

E_D[{y(x; D) − h(x)}²] = {E_D[y(x; D)] − h(x)}² + E_D[{y(x; D) − E_D[y(x; D)]}²]

The Bias-Variance Decomposition (4)

Thus we can write

expected loss = (bias)² + variance + noise,

where

(bias)² = ∫ {E_D[y(x; D)] − h(x)}² p(x) dx
variance = ∫ E_D[{y(x; D) − E_D[y(x; D)]}²] p(x) dx
noise = ∫∫ {h(x) − t}² p(x, t) dx dt

Bias measures how much the prediction (averaged over all data sets) differs from the desired regression function. Variance measures how much the predictions for individual data sets vary around their average.

There is a trade-off between bias and variance: as we increase model complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data).
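A minimal simulation sketch of this trade-off (all choices here are illustrative assumptions: true function h(x) = sin(2πx), Gaussian noise, polynomial fits of low vs. high degree); bias² and variance are estimated by refitting on many independently drawn data sets:

```python
import numpy as np

rng = np.random.default_rng(0)
h = lambda x: np.sin(2.0 * np.pi * x)            # assumed "true" regression function for this demo
x_grid = np.linspace(0.0, 1.0, 200)
n_datasets, n_points, noise_std = 200, 25, 0.3

for degree in (1, 9):                             # low- vs. high-complexity model
    preds = np.empty((n_datasets, x_grid.size))
    for d in range(n_datasets):
        x = rng.uniform(0.0, 1.0, n_points)
        t = h(x) + rng.normal(0.0, noise_std, n_points)
        coeffs = np.polyfit(x, t, degree)         # least-squares polynomial fit: y(x; D)
        preds[d] = np.polyval(coeffs, x_grid)
    avg = preds.mean(axis=0)                      # E_D[y(x; D)]
    bias2 = np.mean((avg - h(x_grid)) ** 2)       # (bias)², averaged over x
    variance = np.mean(preds.var(axis=0))         # variance, averaged over x
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

With the low-degree model the bias term dominates; with the high-degree model the variance term does, matching the trade-off described above.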

[Figure: individual fits g_i, their average ḡ, and the true function f; the gap between ḡ and f illustrates bias, while the spread of the g_i around ḡ illustrates variance.]