Mathematical Foundations of BME Reza Shadmehr


580.704 Mathematical Foundations of BME, Reza Shadmehr. Logistic regression, iteratively re-weighted least squares.

Logistic regression. In the last lecture we classified by computing a posterior probability. The posterior was calculated by modeling the likelihood and prior for each class: we assumed the class-conditional densities were Gaussian and estimated their parameters (or used a kernel estimate of the density), then applied Bayes rule. In logistic regression, we instead model the posterior directly as a function of the variable x. In practice, when there are k classes to classify, we model the posterior probability of each class given x.
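
The slide's equation is not reproduced in the transcript; a standard way to write the two modeling choices for the two-class case (the multi-class soft-max form appears later in the lecture), with w as the parameter vector, is:

Generative (last lecture):      p(y=1 \mid x) = \frac{p(x \mid y=1)\,P(y=1)}{p(x \mid y=1)\,P(y=1) + p(x \mid y=0)\,P(y=0)}

Discriminative (this lecture):  p(y=1 \mid x) = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)}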

Classification by maximizing the posterior distribution. Suppose we want to classify a person as male or female based on height; in this example we assume that the two class distributions have equal variance. What we have: the probability of height given that the person is female (or male); height is normally distributed in the population of men and in the population of women, with different means and similar variances. What we want: the probability that the person is female, given the observed height. Let y be an indicator variable for being female. Then the conditional distribution of x (the height) becomes:
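
The densities themselves are not reproduced in the transcript; a sketch, with \mu_f, \mu_m, and \sigma as assumed notation for the female mean, the male mean, and the shared standard deviation:

p(x \mid y=1) = N(x;\, \mu_f, \sigma^2), \qquad p(x \mid y=0) = N(x;\, \mu_m, \sigma^2)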

Posterior probability for classification when we have two classes: the probability of the subject being female, given that you have observed height x. Note that here we are assuming that the sigmas are equal for the two distributions. Only under this condition does the x variable in the denominator appear linearly inside the exponential; if the two distributions have different sigmas, then x appears as a squared term.
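
A worked version of that posterior under the equal-variance assumption, again using \mu_f, \mu_m, \sigma and the prior probabilities P(y=1), P(y=0) as notation not shown in the transcript:

p(y=1 \mid x) = \frac{p(x \mid y=1)\,P(y=1)}{p(x \mid y=1)\,P(y=1) + p(x \mid y=0)\,P(y=0)} = \frac{1}{1 + \exp\big(-(w_0 + w_1 x)\big)}

with

w_1 = \frac{\mu_f - \mu_m}{\sigma^2}, \qquad w_0 = \frac{\mu_m^2 - \mu_f^2}{2\sigma^2} + \log\frac{P(y=1)}{P(y=0)}.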

Computing the probability that the subject is female, given that we observed height x: a logistic function. In the denominator, x appears linearly inside the exponential. [Figure: the posterior plotted against height, roughly 120 to 220 cm, rises from 0 to 1 as a sigmoid.] So if we assume that the class-membership densities p(x|y) are normal with equal variance, then the posterior probability is a logistic function of x.

Logistic regression with the assumption of equal variance among the class densities implies a linear decision boundary: the boundary is the set of points where the posterior equals 0.5, i.e. where the argument of the logistic function is zero. [Figure: two classes of points in the plane, one labeled Class 0, separated by a straight-line decision boundary.]

Logistic regression: problem statement. We assume equal variance among the clusters. The goal is to find parameters w that maximize the log-likelihood.
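
The log-likelihood itself is not written out in the transcript; for data pairs (x^{(i)}, y^{(i)}) with y^{(i)} \in \{0, 1\} and model p(y=1 \mid x) = \sigma(w^\top x), it is the standard Bernoulli log-likelihood:

\log L(w) = \sum_{i=1}^{n} \Big[ y^{(i)} \log \sigma(w^\top x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - \sigma(w^\top x^{(i)})\big) \Big]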

Some useful properties of the logistic function
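
The properties referred to here are the standard identities of the logistic function \sigma(a) = 1/(1 + e^{-a}):

1 - \sigma(a) = \sigma(-a), \qquad \frac{d\sigma}{da} = \sigma(a)\,\big(1 - \sigma(a)\big), \qquad \frac{d}{da}\log\sigma(a) = 1 - \sigma(a).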

Online algorithm for logistic regression
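
The update rule is not in the transcript; a minimal sketch of the standard online (stochastic gradient ascent) step for logistic regression, with a hypothetical learning rate eta, is:

import numpy as np

def online_logistic_update(w, x, y, eta=0.1):
    # w: current weight vector; x: one input vector (with a leading 1 for the bias)
    # y: the 0/1 label of this sample; eta: learning rate (hypothetical value)
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))  # predicted probability that y = 1
    return w + eta * (y - p) * x             # gradient-ascent step on the log-likelihood

Each sample nudges w along (y - p) x, the per-sample gradient of the log-likelihood, which has the same LMS-like form as the online update for linear regression.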

Batch algorithm: Iteratively Re-weighted Least Squares
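
A minimal, self-contained sketch of IRLS (Newton-Raphson on the log-likelihood), assuming X already includes a bias column; the function and variable names are mine, not from the slides:

import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    # X: (n, d) design matrix, including a column of ones for the bias
    # y: (n,) vector of 0/1 class labels
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # current predicted probabilities
        r = p * (1.0 - p)                     # IRLS weights: largest where p is near 0.5
        grad = X.T @ (y - p)                  # gradient of the log-likelihood
        hess = X.T @ (X * r[:, None])         # X^T R X (negative Hessian)
        step = np.linalg.solve(hess + 1e-10 * np.eye(d), grad)
        w = w + step                          # Newton-Raphson update
        if np.linalg.norm(step) < tol:
            break
    return w

Each iteration solves a weighted least-squares problem with weights p(1 - p), which is where the name comes from.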

Iteratively Re-weighted Least Squares (IRLS). [Figure: sensitivity to error as a function of the predicted probability; the re-weighting is largest where the prediction is uncertain (near 0.5) and smallest where it is certain (near 0 or 1).]
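
The weight behind this picture, using r_i as an assumed symbol for the IRLS weight of sample i with predicted probability p_i, is

r_i = p_i\,(1 - p_i),

which is maximal (0.25) at p_i = 0.5, where the prediction is uncertain, and approaches zero as p_i approaches 0 or 1, where the prediction is certain.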

Iteratively Re-weighted Least Squares: example. [Figure: a two-class data set in the (x1, x2) plane, with the fitted posterior probability surface ranging from 0 to 1.]

Modeling the posterior when the densities have unequal variance (uni-variate case with two classes)
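
The resulting form is not written out in the transcript; with unequal variances \sigma_0 \ne \sigma_1, the log-odds pick up a quadratic term, so the posterior is a logistic function of a quadratic in x (the coefficients below use my notation):

\log\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = w_0 + w_1 x + w_2 x^2, \qquad w_2 = \frac{1}{2\sigma_0^2} - \frac{1}{2\sigma_1^2},

so that p(y=1 \mid x) = \sigma(w_0 + w_1 x + w_2 x^2).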

Logistic regression with basis functions. By using non-linear bases, we can deal with clusters having unequal variance. [Figure: estimated posterior probability plotted over x for a non-linear basis.]
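
One way to sketch this in code, assuming a simple quadratic basis (the slides do not specify which bases were used):

import numpy as np

def quadratic_basis(x):
    # Map scalar inputs to the non-linear features phi(x) = [1, x, x^2]
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x ** 2])

# With phi(x) in place of x, the same maximum-likelihood machinery applies,
# e.g. using the IRLS sketch above:
#   w = irls_logistic(quadratic_basis(x_train), y_train)
#   p = 1.0 / (1.0 + np.exp(-quadratic_basis(x_new) @ w))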

Logistic function for multiple classes with equal variance. Rather than modeling the posterior directly, let us pick the posterior for one class as our reference and then model the ratio of the posteriors for all other classes with respect to that class. Suppose we have k classes:
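
Taking class k as the reference (the transcript omits the slide's equations; w_j denotes the parameter vector for class j):

\log\frac{p(y=j \mid x)}{p(y=k \mid x)} = w_j^\top x, \quad j = 1, \dots, k-1,

which gives

p(y=j \mid x) = \frac{\exp(w_j^\top x)}{1 + \sum_{m=1}^{k-1} \exp(w_m^\top x)}, \qquad p(y=k \mid x) = \frac{1}{1 + \sum_{m=1}^{k-1} \exp(w_m^\top x)}.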

Logistic function for multiple classes with equal variance: the "soft-max" function.
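
A generic soft-max sketch (the reference-class form on the previous slide corresponds to fixing the reference class's score at zero):

import numpy as np

def softmax(a):
    # a: vector of class scores w_k^T x; returns the vector of posterior probabilities
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())   # subtract the max for numerical stability (result unchanged)
    return e / e.sum()

# Example: softmax([2.0, 1.0, 0.1]) is approximately [0.66, 0.24, 0.10]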

Classification of multiple classes with equal variance. [Figures: class-conditional densities over height and the resulting posterior probabilities for each class, ranging from 0 to 1.]