
Machine Learning CUNY Graduate Center Lecture 3: Linear Regression

Today
– Calculus: Lagrange Multipliers
– Linear Regression

Optimization with constraints
What if we want to constrain the parameters of the model? For example: the mean is less than 10.
Find the best likelihood, subject to a constraint.
Two functions:
– An objective function to maximize
– An inequality that must be satisfied

Lagrange Multipliers
Find maxima of f(x, y) subject to a constraint.

General form
Maximizing: f(x)
Subject to: g(x) = 0
Introduce a new variable λ and find a maximum of the Lagrangian:
Λ(x, λ) = f(x) + λ g(x)

Example
Maximizing an objective f(x, y), subject to a constraint g(x, y) = 0: introduce a new variable λ and find a maximum.

Example
Setting the partial derivatives of Λ to zero, we now have 3 equations with 3 unknowns (x, y, and λ).

Example
Eliminate λ, then substitute and solve.
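The slides' equations are not preserved in this transcript. As a stand-in, here is a minimal worked example of the same three steps, using a hypothetical objective f(x, y) = xy and constraint x + y = 10 (not the lecture's original example):

```latex
% Hypothetical stand-in: maximize f(x, y) = xy
% subject to g(x, y) = x + y - 10 = 0.
\begin{aligned}
\Lambda(x, y, \lambda) &= xy + \lambda\,(x + y - 10) \\
\partial\Lambda/\partial x &= y + \lambda = 0 \\
\partial\Lambda/\partial y &= x + \lambda = 0 \\
\partial\Lambda/\partial \lambda &= x + y - 10 = 0
\end{aligned}
% Eliminating lambda from the first two equations gives x = y;
% substituting into the constraint gives 2x = 10, so x = y = 5 and f = 25.
```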

Basics of Linear Regression
Regression algorithm: a supervised technique.
– In one dimension: identify w_0 and w_1 so that y ≈ w_0 + w_1 x
– In D dimensions: identify the weight vector w so that y ≈ w^T x
Given training data x_1, …, x_N and targets t_1, …, t_N.

Graphical Example of Regression
[figures: training data points, with an unseen query point marked “?”]

Definition
In linear regression, we assume that the model that generates the data involves only a linear combination of the input variables:
y(x, w) = w^T x = w_1 x_1 + … + w_D x_D
where w is a vector of weights which define the D parameters of the model.

Evaluation
How can we evaluate the performance of a regression solution?
Error functions (or loss functions):
– Squared error: (t_i − y(x_i))^2
– Linear error: |t_i − y(x_i)|

Regression Error [figure]

Empirical Risk
Empirical risk measures the loss on the data. By minimizing risk on the training data, we optimize the fit with respect to the loss function.
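The slide's formula is not preserved; a standard statement of empirical risk over the N training pairs, assuming the squared loss from the previous slide, is:

```latex
% Empirical risk: average loss over the N training pairs (x_i, t_i),
% here with the squared loss.
R_{\mathrm{emp}}(w) = \frac{1}{N} \sum_{i=1}^{N} \bigl(t_i - y(x_i; w)\bigr)^2
```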

Model Likelihood and Empirical Risk
Two related but distinct ways to look at a model:
1. Model likelihood: “What is the likelihood that a model generated the observed data?”
2. Empirical risk: “How much error does the model have on the training data?”

Model Likelihood
Assuming independent and identically distributed (iid) data: what is the likelihood that a model with some parameters generated the observed sample?

Understanding Model Likelihood
– Substitute in the equation of a Gaussian
– Apply a log function
– Let the log dissolve products into sums

Understanding Model Likelihood
Optimize the weights (Maximum Likelihood Estimation): the log likelihood has the same optimum in w as the empirical risk with a squared loss function.
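The equations these two slides refer to are not preserved; the standard derivation they describe, assuming Gaussian noise with variance σ² around the regression function, runs as follows:

```latex
% Likelihood of iid data under Gaussian noise around the regression function:
p(\mathbf{t} \mid \mathbf{x}, w, \sigma^2)
  = \prod_{i=1}^{N} \mathcal{N}\bigl(t_i \mid y(x_i; w), \sigma^2\bigr)
% Applying the log dissolves the product into a sum:
\ln p = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \bigl(t_i - y(x_i; w)\bigr)^2
        - \frac{N}{2} \ln\bigl(2\pi\sigma^2\bigr)
% The second term does not depend on w, so maximizing the log likelihood
% is equivalent to minimizing the empirical risk with a squared loss.
```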

Maximizing Log Likelihood (1-D)
Find the optimal settings of w.

Maximizing Log Likelihood
– Take the partial derivative with respect to w_0
– Set it to zero
– Separate the sum to isolate w_0

Maximizing Log Likelihood
– Start from the previous partial derivative
– Substitute the expression for w_0 from the previous slide
– Isolate w_1
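The algebra on these slides is not shown in the transcript; for the 1-D model y = w_0 + w_1 x with squared loss, the steps they name lead to the familiar closed form:

```latex
% From d/dw_0 = 0:  sum_i (t_i - w_0 - w_1 x_i) = 0  =>  w_0 = \bar{t} - w_1 \bar{x}
% Substituting w_0 into d/dw_1 = 0 and isolating w_1:
w_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(t_i - \bar{t})}
           {\sum_{i=1}^{N} (x_i - \bar{x})^2},
\qquad
w_0 = \bar{t} - w_1 \bar{x}
```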

Maximizing Log Likelihood
Clean and easy. Or not… Apply some linear algebra.

Likelihood using linear algebra
Represent the linear regression function in terms of vectors: y(x_i, w) = w^T x_i.

Likelihood using linear algebra
Stack the x_i^T vectors into a matrix of data points, X. Stacking the data into a matrix lets the norm operation handle the sum:
R(w) = ||t − Xw||^2

Likelihood in multiple dimensions
This representation of risk has no inherent dimensionality.

Maximum Likelihood Estimation redux
– Decompose the norm: ||t − Xw||^2 = (t − Xw)^T (t − Xw)
– FOIL, linear algebra style: t^T t − 2 w^T X^T t + w^T X^T X w
– Differentiate with respect to w and set to zero
– Combine terms and isolate w: w = (X^T X)^{-1} X^T t
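A minimal NumPy sketch of this closed form (names are illustrative; the lecture provides no code). It solves the normal equations directly rather than forming the explicit inverse:

```python
import numpy as np

def fit_linear_regression(X, t):
    """Closed-form least squares: w = (X^T X)^{-1} X^T t.

    X : (N, D) data matrix, one row per training point
        (prepend a column of ones to model a bias term).
    t : (N,) vector of targets.
    """
    # np.linalg.solve is preferred over an explicit inverse: it solves
    # (X^T X) w = X^T t directly and is more numerically stable.
    return np.linalg.solve(X.T @ X, X.T @ t)

# Toy usage: fit y = w0 + w1*x to a handful of points.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])  # bias column + inputs
w = fit_linear_regression(X, t)
print(w)  # approximately [w0, w1]
```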

Extension to polynomial regression
Polynomial regression is the same as linear regression in D dimensions.

Generate new features
– Standard polynomial with coefficients w: y(x, w) = w_0 + w_1 x + … + w_D x^D
– Risk: the squared error of this polynomial over the training data

Generate new features
Feature trick: to fit a polynomial of degree D, create a D-element vector (x_i, x_i^2, …, x_i^D) from each x_i, then run standard linear regression in D dimensions.
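A sketch of this feature trick (illustrative, not from the slides): expand each scalar x_i into its powers, then reuse the same normal equations as before.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map each scalar x_i to the vector (1, x_i, x_i^2, ..., x_i^degree)."""
    return np.vander(x, N=degree + 1, increasing=True)

# Fitting a cubic is still linear regression: linear in w,
# even though the features are nonlinear in x.
x = np.linspace(-1.0, 1.0, 20)
t = np.sin(3 * x) + 0.1 * np.random.randn(20)  # noisy toy targets
Phi = polynomial_features(x, degree=3)          # (20, 4) design matrix
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)     # same normal equations as before
```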

How is this still linear regression?
The regression is linear in the parameters, despite projecting x_i from one dimension to D dimensions. Now we fit a plane (or hyperplane) to a representation of x_i in a higher-dimensional feature space. This generalizes to any set of functions φ_j(x).

Basis functions as feature extraction
These functions φ_j are called basis functions.
– They define the bases of the feature space
– They allow a linear decomposition to fit any type of function to the data points
Common choices (a Gaussian example is sketched below):
– Polynomial
– Gaussian
– Sigmoids
– Wave functions (sine, etc.)
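The same recipe works for any basis. A minimal sketch of Gaussian basis features, with centers and width chosen here purely for illustration:

```python
import numpy as np

def gaussian_features(x, centers, width=0.5):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 * width^2)), plus a bias column."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    return np.column_stack([np.ones_like(x), phi])

# Illustrative centers spread across the input range.
centers = np.linspace(-1.0, 1.0, 5)
Phi = gaussian_features(np.linspace(-1.0, 1.0, 20), centers)
# Phi now serves as the design matrix in the same normal equations as before.
```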

Training Data vs. Testing Data
Evaluating the performance of a classifier on training data is meaningless: with enough parameters, a model can simply memorize (encode) every training point.
To evaluate performance, the data is divided into training and testing (or evaluation) data (see the sketch below).
– Training data is used to learn model parameters
– Testing data is used to evaluate performance
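A minimal sketch of such a split, using a random permutation; the 20% test fraction and function name are illustrative choices, not from the lecture:

```python
import numpy as np

def train_test_split(X, t, test_fraction=0.2, seed=0):
    """Shuffle the data once, then carve off a fraction for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(t))
    n_test = int(len(t) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], t[train], X[test], t[test]
```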

Overfitting [figures]

Overfitting performance [figure]

Definition of overfitting
When the model describes the noise rather than the signal. How can you tell the difference between overfitting and a bad model?

Possible detection of overfitting
Stability:
– An appropriately fit model is stable under different samples of the training data
– An overfit model generates inconsistent performance
Performance:
– A good model has low test error
– A bad model has high test error

What is the optimal model size?
The best model size is the one that generalizes best to unseen data; approximate this by the testing error. One way to optimize parameters is to minimize testing error (a sketch follows).
– This operation uses testing data as tuning or development data
– It sacrifices training data in favor of parameter optimization
Can we do this without explicit evaluation data?
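A sketch of the procedure just described: fit polynomials of increasing degree on training data and keep the degree with the lowest held-out error. The function name and degree range are illustrative assumptions:

```python
import numpy as np

def select_degree(x_train, t_train, x_test, t_test, max_degree=9):
    """Pick the polynomial degree with the lowest squared error on held-out data."""
    best_degree, best_error = 0, float("inf")
    for d in range(max_degree + 1):
        Phi = np.vander(x_train, N=d + 1, increasing=True)
        w = np.linalg.lstsq(Phi, t_train, rcond=None)[0]   # least-squares fit
        Phi_test = np.vander(x_test, N=d + 1, increasing=True)
        error = np.mean((t_test - Phi_test @ w) ** 2)       # held-out squared error
        if error < best_error:
            best_degree, best_error = d, error
    return best_degree
```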

Context for linear regression
– Simple approach
– Efficient learning
– Extensible
– Regularization provides robust models