CSLT ML Summer Seminar (2)


CSLT ML Summer Seminar (2) Linear Model Dong Wang

Content: linear regression, logistic regression, linear Gaussian models, Bayesian models and regularization.

Why linear models? They are easy to train, fast at inference, and robust.

Start from polynomial fitting. Task: given training data {<xi, yi>}, predict y for a given x. Model structure: a polynomial in x, i.e., a linear combination of powers of x. Training objective: minimize the squared error between predictions and targets. Training approach: closed-form solution or numerical optimization. Inference approach: multiplications and sums. The whole problem is formulated as polynomial fitting.
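A minimal sketch of this pipeline in NumPy; the sinusoidal toy data, the polynomial degree and all numbers are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x) (assumed here, in the spirit of PRML Ch. 1).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

# Design matrix of a degree-3 polynomial: columns [1, x, x^2, x^3].
degree = 3
Phi = np.vander(x, degree + 1, increasing=True)

# Closed-form least-squares solution w = (Phi^T Phi)^{-1} Phi^T t.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Inference is just multiplication and sum: y(x) = sum_j w_j * x^j.
x_new = np.array([0.25, 0.5, 0.75])
y_new = np.vander(x_new, degree + 1, increasing=True) @ w
print(y_new)
```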

Move to a probabilistic framework. The essential difficulty of ML tasks resides in uncertainty: uncertainty in the underlying forces, mechanisms, and corruptions. Probabilistic models address this uncertainty through high-level assumptions (distributions). We prefer a probabilistic interpretation because it offers unified understanding, unified design, unified optimization, and unified generalization (extension, modification, etc.).

From polynomial fitting to linear regression. Assume a predictive model that is linear with respect to the parameters w. Assume Gaussian noise, i.e., the residual of the data that the model leaves unexplained. The prediction then becomes probabilistic! Note that a complex feature transform does not change the linearity in w. Prediction is made by maximum likelihood.
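Written out in the usual PRML notation (the noise precision β = 1/σ² and the feature map φ are implicit on the slide), the model and its maximum-likelihood prediction are:

```latex
p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\!\left(t \,\middle|\, \mathbf{w}^\top \boldsymbol{\phi}(x),\ \beta^{-1}\right),
\qquad
y(x) = \mathbf{w}_{\mathrm{ML}}^\top \boldsymbol{\phi}(x).
```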

Maximum likelihood training. Train the parameters to maximize the probability of the training data. Under the Gaussian noise assumption, this is exactly minimization of the mean squared error (MSE)!
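The equivalence in one line, for N training pairs (x_i, t_i); the terms involving β alone do not depend on w:

```latex
\ln p(\mathbf{t} \mid \mathbf{w}, \beta)
= \sum_{i=1}^{N} \ln \mathcal{N}\!\left(t_i \mid \mathbf{w}^\top \boldsymbol{\phi}(x_i),\ \beta^{-1}\right)
= -\frac{\beta}{2} \sum_{i=1}^{N} \left(t_i - \mathbf{w}^\top \boldsymbol{\phi}(x_i)\right)^2
  + \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi ,
```

so maximizing the likelihood over w is the same as minimizing the sum of squared errors, i.e., the MSE.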

Think for one minute before going forward. We assumed a Gaussian distribution on the noise and obtained the polynomial (least-squares) fit; this is a strong argument for using MSE. What if we assume other noise distributions? Alternative assumptions on the parameters and the noise lead to different yet related models, whose cost functions would otherwise seem to have little to do with one another.

Now train the model

Logistic regression: a Bernoulli noise. A binary classification task. Assume the prediction structure y = σ(wᵀx), where σ() is the logistic sigmoid. Assume a Bernoulli distribution on the target t with parameter y. Training then proceeds exactly as in linear regression: maximize the likelihood of the training data!
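A minimal sketch of logistic regression trained by gradient descent on the Bernoulli negative log-likelihood (cross entropy); the toy data, learning rate and iteration count are assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy binary data: two Gaussian blobs (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
X = np.hstack([np.ones((100, 1)), X])          # prepend a bias feature
t = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    y = sigmoid(X @ w)                         # y = sigma(w^T x)
    grad = X.T @ (y - t) / len(t)              # gradient of the cross entropy
    w -= lr * grad

print(w)                                       # learned weights
print(sigmoid(X @ w)[:5])                      # predicted class-1 probabilities
```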

Softmax regression: a multinomial noise. A multi-class classification task. Assume the prediction structure y = σ(Wx), where σ() is the softmax function. Assume a multinomial distribution on the target t with parameter y. Maximizing the likelihood amounts to minimizing the cross entropy.
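The same pattern for K classes, with a softmax output and one-hot targets; this is a sketch, and the data shapes and step size are assumptions:

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)       # subtract row max for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_grad(X, T, W):
    """Gradient of the multinomial negative log-likelihood (cross entropy)."""
    Y = softmax(X @ W)                         # N x K class probabilities
    return X.T @ (Y - T) / len(X)              # D x K gradient

# Example shapes: N=6 samples, D=3 features (incl. bias), K=3 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
T = np.eye(3)[rng.integers(0, 3, size=6)]      # one-hot targets
W = np.zeros((3, 3))
for _ in range(200):
    W -= 0.5 * cross_entropy_grad(X, T, W)
print(softmax(X @ W))
```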

Probit regression: a probit noise. Assume a more implicit noise model, derived from a cumulative distribution function.
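Concretely, with Φ the cumulative distribution function of the standard Gaussian (a common choice; the slide leaves the CDF unspecified):

```latex
p(t = 1 \mid x) = \Phi\!\left(\mathbf{w}^\top \boldsymbol{\phi}(x)\right),
\qquad
\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta \mid 0, 1)\, d\theta .
```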

If the assumption is wrong: see Bishop, PRML, Fig. 4.4. The magenta line is the least-squares (linear regression) decision boundary and the green line is logistic regression; the least-squares boundary is badly distorted by outliers, while logistic regression is robust to them.

Extension to hierarchical structure. Static models: Y = m + wX + ϵ; Y = m + wU + vV + ϵ; Y(i,j) = m + w(i)U + v(i,j)V + ϵ(i,j); Y(i,j) = m + U(w(i) + v(i,j)) + ϵ(i,j). Dynamic model: x(t) = A x(t-1) + ϵ(t), y(t) = B x(t) + v(t), where {ϵ(t)} and {v(t)} are Gaussian processes; see the sketch below.
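A tiny simulation of the dynamic (linear-Gaussian state-space) model; all dimensions, matrices and noise levels are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.9]])         # state transition matrix
B = np.array([[1.0, 0.0]])                     # observation matrix
q, r = 0.1, 0.2                                # std of eps(t) and v(t)

x = np.zeros(2)
observations = []
for t in range(100):
    x = A @ x + q * rng.standard_normal(2)     # x(t) = A x(t-1) + eps(t)
    y = B @ x + r * rng.standard_normal(1)     # y(t) = B x(t) + v(t)
    observations.append(y.item())

print(observations[:5])
```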

Probabilistic PCA. Traditional PCA finds an orthogonal transform that maps the data to a lower-dimensional space, preserving as much of the variance as possible (equivalently, minimizing the reconstruction cost). {xi}, i = 1,..,N, xi ∈ R^D; V = {vi} ∈ R^{M×D}; {yi}, i = 1,..,N, yi = V xi ∈ R^M. The goal is to find a V such that {yi} possess the maximal variance. V is called a loading matrix. Begin with the first basis v1: let v1ᵀv1 = 1 and compute the variance of the projection.
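Classical PCA via the eigendecomposition of the sample covariance; a sketch, with random placeholder data and an assumed target dimension M = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # N=200, D=5

mu = X.mean(axis=0)
S = np.cov(X - mu, rowvar=False)               # D x D sample covariance

eigvals, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
M = 2
V = eigvecs[:, order[:M]].T                    # M x D loading matrix

Y = (X - mu) @ V.T                             # N x M projections with maximal variance
print(Y.var(axis=0), eigvals[order[:M]])       # projected variances ~ top eigenvalues
```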

Probabilistic PCA. Maximizing the projected variance v1ᵀS v1 subject to v1ᵀv1 = 1 shows that v1 is an eigenvector of the data covariance S, and λ1 is the corresponding eigenvalue.
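Spelled out with a Lagrange multiplier for the unit-norm constraint (a standard step, not shown on the slide):

```latex
\max_{\mathbf{v}_1}\ \mathbf{v}_1^\top S\, \mathbf{v}_1 + \lambda_1\!\left(1 - \mathbf{v}_1^\top \mathbf{v}_1\right)
\;\Longrightarrow\;
S\,\mathbf{v}_1 = \lambda_1 \mathbf{v}_1,
\qquad
\mathbf{v}_1^\top S\, \mathbf{v}_1 = \lambda_1 .
```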

Probabilistic PCA. x = Wz + μ + ϵ, with z ~ N(z | 0, I) and ϵ ~ N(ϵ | 0, σ²I).

Probabilistic PCA. The task is: given the training data set {xi}, estimate the parameters W, μ, σ. Since p(z) and p(x|z) are both Gaussian, the marginal p(x) is Gaussian as well. Note that there is redundancy in W: WWᵀ is invariant under any orthogonal transform R (replacing W by WR leaves it unchanged).
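The marginal obtained by integrating out z, which is where the WWᵀ redundancy shows up:

```latex
p(\mathbf{x}) = \mathcal{N}\!\left(\mathbf{x} \,\middle|\, \boldsymbol{\mu},\ C\right),
\qquad
C = W W^\top + \sigma^2 I ,
```

and C is unchanged if W is replaced by WR for any orthogonal R.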

Probabilistic PCA. Estimate W, μ, σ by maximum likelihood (ML). Setting the derivatives with respect to W, μ, σ to zero gives W_ML = U_M (L_M − σ²I)^{1/2} R and σ²_ML = (1/(D−M)) Σ_{j=M+1}^{D} λ_j, where U_M ∈ R^{D×M} is composed of M eigenvectors of the covariance S (any subset gives a stationary point; the top M give the maximum), L_M is the M×M diagonal matrix of the corresponding eigenvalues, and R is an arbitrary orthogonal matrix.
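A compact implementation of the closed-form ML solution and the posterior-mean projection, following the formulas above (choosing R = I); the random data are a placeholder:

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form maximum-likelihood PPCA: returns W, mu, sigma2."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2 = lam[M:].mean()                              # average of the discarded eigenvalues
    W = U[:, :M] @ np.diag(np.sqrt(lam[:M] - sigma2))    # W_ML with R = I
    return W, mu, sigma2

def ppca_project(X, W, mu, sigma2):
    """Posterior mean E[z|x] = M^{-1} W^T (x - mu), with M = W^T W + sigma2 * I."""
    Mmat = W.T @ W + sigma2 * np.eye(W.shape[1])
    return (X - mu) @ W @ np.linalg.inv(Mmat).T

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4)) @ rng.standard_normal((4, 4))
W, mu, sigma2 = ppca_ml(X, M=2)
Z = ppca_project(X, W, mu, sigma2)
print(W.shape, sigma2, Z.shape)
```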

Probabilistic PCA. Posterior probability p(z|x). Projection of x is implemented as the posterior mean E[z|x] = M⁻¹Wᵀ(x − μ), where M = WᵀW + σ²I. If σ → 0, we recover standard PCA.

Probabilistic LDA. Traditional LDA seeks a linear transformation y = wᵀx, y ∈ R, such that the separation of the projections {yi} between the two groups is maximized. The separation is measured by the Fisher discriminant: the ratio of the between-class scatter to the within-class scatter of the projections.

Probabilistic LDA. w is an eigenvector of S_W⁻¹S_B!
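A sketch of Fisher LDA for two classes: the direction can be taken directly as w ∝ S_W⁻¹(m₂ − m₁), which is the leading eigenvector of S_W⁻¹S_B in the two-class case. The toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(60, 2))   # class 1 samples
X2 = rng.normal(loc=[+1.5, 1.0], scale=1.0, size=(60, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)      # within-class scatter
w = np.linalg.solve(Sw, m2 - m1)                            # w proportional to Sw^{-1}(m2 - m1)
w /= np.linalg.norm(w)

y1, y2 = X1 @ w, X2 @ w                                     # 1-D projections of each class
print(m1 @ w, m2 @ w)                                       # well-separated projected means
```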

Probabilistic LDA. x(i,j) = m + A(v(i) + ϵ(i,j)), with v ~ N(0, Ψ) and ϵ ~ N(0, I). Note that a single v is sampled for each person in data generation, as sketched below.
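A generative sketch of this model: one latent identity variable v per person, shared across that person's samples. The dimensions, number of people and samples per person are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, Q = 10, 4                                   # observed dim, latent dim
m = rng.standard_normal(D)                     # global mean
A = 0.5 * rng.standard_normal((D, Q))          # loading matrix
Psi = np.diag(rng.uniform(0.5, 2.0, Q))        # between-person covariance

samples = []
for person in range(5):                        # 5 people (assumed)
    v = rng.multivariate_normal(np.zeros(Q), Psi)   # one v per person
    for j in range(8):                         # 8 samples per person (assumed)
        eps = rng.standard_normal(Q)           # within-person variation, N(0, I)
        x = m + A @ (v + eps)                  # x(i,j) = m + A (v(i) + eps(i,j))
        samples.append(x)

X = np.vstack(samples)                         # 40 x 10 data matrix
print(X.shape)
```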

Training PLDA. A single v is shared by all samples of the same person in model training.

Other famous linear Gaussian models: ICA, NMF, factor analysis.

Other famous (pseudo-)linear (non-Gaussian) models: word2vec, RBM, maximum entropy, perceptron.

Bayesian treatment. Introduce a prior on the parameters to be estimated: a way of integrating knowledge and experience. The estimate of the parameters is then not a point but a distribution. The prediction is a distribution as well (even without counting the target noise). The simplest use of the prior: MAP.

A more specific case. Set a simple Gaussian prior on w. It is a conjugate prior, which ensures that the posterior has the same form as the prior. The resulting estimate amounts to regularization!
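A sketch of the resulting MAP estimate: with prior w ~ N(0, α⁻¹I) and Gaussian noise of precision β, the MAP solution is exactly ridge regression with λ = α/β. The toy data and the values of α, β and the degree are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(25)

degree, alpha, beta = 9, 1e-3, 100.0           # prior and noise precisions (assumed)
Phi = np.vander(x, degree + 1, increasing=True)

lam = alpha / beta
# MAP / ridge solution: w = (lam*I + Phi^T Phi)^{-1} Phi^T t
w_map = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)
w_ml  = np.linalg.lstsq(Phi, t, rcond=None)[0]

print(np.linalg.norm(w_ml), np.linalg.norm(w_map))   # the prior shrinks the weights
```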

PRML, Fig. 3.7 (sequential Bayesian learning of a simple linear model: prior/posterior in parameter space and sampled functions in data space).

Bayesian step 1: from MLE to MAP. The maximum a posteriori (MAP) estimate is obtained by maximizing the posterior of w: p(w|X) ∝ p(X|w) p(w). This differs from the MLE, which maximizes p(X|w) alone and does not use any prior knowledge about w. The result is better generalization.

Bayesian step 2: From point-wise prediction to marginal prediction
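The prediction now marginalizes the model over the posterior, p(t|x, D) = ∫ p(t|x, w) p(w|D) dw; for the Gaussian prior/likelihood pair this is available in closed form. A sketch under assumed α, β and toy data, with formulas as in PRML Ch. 3:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(25)

degree, alpha, beta = 9, 1e-3, 100.0
Phi = np.vander(x, degree + 1, increasing=True)

# Posterior over w: N(w | m_N, S_N)
S_N_inv = alpha * np.eye(degree + 1) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution at a new input:
#   mean = m_N^T phi(x),  variance = 1/beta + phi(x)^T S_N phi(x)
x_new = 0.3
phi = np.vander(np.array([x_new]), degree + 1, increasing=True)[0]
pred_mean = m_N @ phi
pred_var = 1.0 / beta + phi @ S_N @ phi
print(pred_mean, pred_var)
```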

PRML, Fig. 3.8 (predictive distribution of Bayesian linear regression on the sinusoidal data set, with the predictive uncertainty shown as a shaded region).

Why we want Bayesian: it is a way of integrating knowledge; it is sometimes equivalent to imposing regularization, which we know is effective; and it propagates more knowledge from training to test.

Wrap up. Linear models form a large family of models, and we are particularly interested in their probabilistic interpretations. Linear models are simple yet powerful: give them priority when you face a problem. The Bayesian linear model is highly useful, but the Bayesian approach is not limited to linear models.

Q&A