Multivariate linear models for regression and classification Outline: 1) multivariate linear regression 2) linear classification (perceptron) 3) logistic regression.

Presentation transcript:

Multivariate linear models for regression and classification Outline: 1) multivariate linear regression 2) linear classification (perceptron) 3) logistic regression

Logistic Regression (lecture 9 on amlbook.com)

Neuron analogy The dot product s = w^T x is a way of combining the attributes into a scalar signal s. How this signal is used defines the hypothesis set.

In logistic regression, the signal becomes the argument of the logistic function θ(s) = e^s / (1 + e^s), which has the properties of a probability: its output lies between 0 and 1.
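A minimal numeric sketch of these two steps (NumPy; the attribute and weight values are hypothetical, and the helper name theta is mine): the signal s = w^T x is computed from the attributes, and the logistic function squashes it into (0, 1).

import numpy as np

def theta(s):
    # Logistic (sigmoid) function: theta(s) = e^s / (1 + e^s) = 1 / (1 + e^(-s))
    return 1.0 / (1.0 + np.exp(-s))

# Illustrative example: x includes a constant coordinate x0 = 1 for the bias weight
x = np.array([1.0, 0.5, -1.2, 3.0])   # attributes of one patient (hypothetical values)
w = np.array([0.1, 0.8, -0.3, 0.05])  # weight vector (hypothetical values)

s = w @ x                    # scalar signal s = w^T x
print(theta(s))              # value in (0, 1), interpretable as a probability
print(theta(s) + theta(-s))  # equals 1, i.e. theta(-s) = 1 - theta(s)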

Application: risk of heart attack. Objective: find w such that the risk score s = w^T x is >> 0 for patients who have had a heart attack (θ(s) ≈ 1) and << 0 for those who have not (θ(s) ≈ 0).

More specifically (see text p. 91): the dataset is drawn from a distribution P(y|x), which is related to the hypothesis h(x) by P(y_n|x_n) = h(x_n) if y_n = +1 and P(y_n|x_n) = 1 - h(x_n) if y_n = -1. The logistic function has the property θ(-s) = 1 - θ(s). Hence both relationships are satisfied by P(y_n|x_n) = θ(y_n w^T x_n). Now use maximum likelihood estimation (MLE) to derive an error function that we minimize to find the optimum w.
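A small sketch of this probability model (NumPy; the values and the helper name theta are mine): for a labelled example (x_n, y_n) with y_n in {+1, -1}, the single expression θ(y_n w^T x_n) gives the probability the model assigns to the observed label.

import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

w = np.array([0.1, 0.8, -0.3])      # hypothetical weights
x_n = np.array([1.0, 0.5, -1.2])    # one example (x0 = 1 is the bias coordinate)
y_n = +1                            # observed label in {+1, -1}

p = theta(y_n * (w @ x_n))          # P(y_n | x_n) = theta(y_n * w^T x_n)
# For y_n = +1 this equals h(x_n) = theta(w^T x_n);
# for y_n = -1 it equals 1 - h(x_n), since theta(-s) = 1 - theta(s).
print(p)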

Recall that MLE is used to estimate the parameters of a probability distribution given a sample X drawn from that distribution. In logistic regression the parameters are the weights w. Likelihood of w given the sample X: l(w|X) = p(X|w) = ∏_n p(x_n|w). Log-likelihood: L(w|X) = log(l(w|X)) = ∑_n log p(x_n|w). In logistic regression, p(x_n|w) = θ(y_n w^T x_n).

Since log is a monotonically increasing function, maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood. The text also normalizes by dividing by N; hence the error function becomes E_in(w) = (1/N) ∑_{n=1}^{N} ln(1 + exp(-y_n w^T x_n)).
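A minimal NumPy sketch of this in-sample error (the data values and the function name cross_entropy_error are mine, for illustration):

import numpy as np

def cross_entropy_error(w, X, y):
    # E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))
    margins = y * (X @ w)                       # y_n * w^T x_n for every example
    return np.mean(np.log1p(np.exp(-margins)))  # log1p improves numerical accuracy

# Tiny illustrative dataset: 3 examples, 3 attributes (first column is the bias x0 = 1)
X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8],
              [1.0, 2.0, 0.1]])
y = np.array([+1, -1, +1])
w = np.zeros(3)
print(cross_entropy_error(w, X, y))   # equals ln(2) ≈ 0.693 when w = 0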

The error function of logistic regression (called the cross-entropy error) has the desired properties. If x_n are the attributes of a person who has had a heart attack, then w^T x_n >> 0 and y_n = +1, so y_n w^T x_n >> 0 and the contribution to E_in(w) is small. If x_n are the attributes of a person who has not had a heart attack, then w^T x_n << 0 and y_n = -1, so again y_n w^T x_n >> 0 and the contribution to E_in(w) is small.

The error function of linear regression allows "one-step" optimization. This is not true for the error function of logistic regression: optimization is iterative, and the method is "steepest descent".

Method of steepest (gradient) descent with a fixed step size η: w(1) = w(0) + η v̂, where v̂ is the unit vector in the direction of the negative gradient (the direction of steepest descent).
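A sketch of one such fixed-step update (NumPy; the gradient vector g and the function name are assumptions for illustration, with the gradient itself computed as on the later slides):

import numpy as np

def fixed_step_update(w0, g, eta):
    # Move a fixed distance eta along the unit vector v_hat = -g / ||g||,
    # i.e. in the direction of steepest descent.
    v_hat = -g / np.linalg.norm(g)
    return w0 + eta * v_hat

w0 = np.zeros(3)
g = np.array([0.2, -0.1, 0.4])   # hypothetical gradient of E_in at w0
print(fixed_step_update(w0, g, eta=0.1))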

Method of steepest (gradient) descent with a fixed learning rate η: w(1) = w(0) + Δw, where Δw = -η ∇E_in(w(0)). The weights change fastest where the gradient is largest. For E_in = cross-entropy, the gradient has a closed analytical form.
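A NumPy sketch of this variant using the analytical gradient of the cross-entropy error (function names and data are mine; the formula is spelled out in the code comment and on the gradient slide below):

import numpy as np

def gradient_Ein(w, X, y):
    # Analytical gradient of the cross-entropy error:
    # grad E_in(w) = -(1/N) * sum_n y_n * x_n / (1 + exp(y_n * w^T x_n))
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))       # one scalar per example
    return (X * coeff[:, None]).mean(axis=0)

# One fixed-learning-rate step: w(1) = w(0) - eta * grad E_in(w(0))
X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8],
              [1.0, 2.0, 0.1]])
y = np.array([+1, -1, +1])
w0, eta = np.zeros(3), 0.1
w1 = w0 - eta * gradient_Ein(w0, X, y)
print(w1)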

Logistic regression algorithm
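Putting the preceding pieces together, a minimal sketch of the full training loop (NumPy; the dataset, learning rate, tolerance, and function names are hypothetical choices, not the text's own code):

import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def gradient_Ein(w, X, y):
    # grad E_in(w) = -(1/N) * sum_n y_n * x_n / (1 + exp(y_n * w^T x_n))
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))
    return (X * coeff[:, None]).mean(axis=0)

def train_logistic_regression(X, y, eta=0.1, tol=1e-6, max_iters=10000):
    # 1) Initialize the weights (w = 0 is a common choice).
    w = np.zeros(X.shape[1])
    # 2) Repeat fixed-learning-rate gradient-descent steps.
    for _ in range(max_iters):
        g = gradient_Ein(w, X, y)
        # 3) Stop when the gradient is (nearly) zero or max_iters is reached.
        if np.linalg.norm(g) < tol:
            break
        w = w - eta * g
    return w

# Hypothetical data: first column is the bias coordinate x0 = 1, labels are +/-1
X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8],
              [1.0, 2.0, 0.1],
              [1.0, -1.5, -0.4]])
y = np.array([+1, -1, +1, -1])

w = train_logistic_regression(X, y)
print("weights:", w)
print("P(y=+1|x):", theta(X @ w))   # estimated per-patient probabilities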

How to compute the gradient of E_in: for the cross-entropy error, ∇E_in(w) = -(1/N) ∑_{n=1}^{N} y_n x_n / (1 + exp(y_n w^T x_n)).

How to know when to stop

Assignment 6: Due