CSLT ML Summer Seminar (2)


CSLT ML Summer Seminar (2) Linear Model Dong Wang

Content: linear regression, logistic regression, linear Gaussian models, Bayesian models and regularization.

Why linear models? They are easy to train, fast at inference, and robust.

Start from polynomial fitting. Task: given training data {<xi, yi>}, predict y for a given x. Model structure: a polynomial in x, i.e., a linear combination of powers of x. Training objective: minimize the squared error between predictions and targets. Training approach: closed-form solution or numerical optimization. Inference approach: multiplications and sums. The whole problem is formulated as polynomial fitting.
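A minimal sketch of this pipeline in NumPy; the sinusoidal toy data, the polynomial degree and all numbers are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x) (assumed here, in the spirit of PRML Ch. 1).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

# Design matrix of a degree-3 polynomial: columns [1, x, x^2, x^3].
degree = 3
Phi = np.vander(x, degree + 1, increasing=True)

# Closed-form least-squares solution w = (Phi^T Phi)^{-1} Phi^T t.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Inference is just multiplication and sum: y(x) = sum_j w_j * x^j.
x_new = np.array([0.25, 0.5, 0.75])
y_new = np.vander(x_new, degree + 1, increasing=True) @ w
print(y_new)
```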

Move to a probabilistic framework. The essential difficulty of ML tasks resides in uncertainty: uncertainty in the underlying forces, mechanisms, and corruptions. Probabilistic models address this uncertainty through high-level assumptions (distributions). We prefer a probabilistic interpretation because it offers unified understanding, unified design, unified optimization, and unified generalization (extension, modification, etc.).

From polynomial fitting to linear regression. Assume a predictive model that is linear with respect to the parameters w. Assume Gaussian noise, i.e., the residual of the data that the model leaves unexplained. The prediction then becomes probabilistic! Note that a complex feature transform does not change the linearity in w. Prediction is made by maximum likelihood.
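Written out in the usual PRML notation (the noise precision β = 1/σ² and the feature map φ are implicit on the slide), the model and its maximum-likelihood prediction are:

```latex
p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\!\left(t \,\middle|\, \mathbf{w}^\top \boldsymbol{\phi}(x),\ \beta^{-1}\right),
\qquad
y(x) = \mathbf{w}_{\mathrm{ML}}^\top \boldsymbol{\phi}(x).
```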

Maximum likelihood training. Train the parameters to maximize the probability of the training data. Under the Gaussian noise assumption, this is exactly minimization of the mean squared error (MSE)!
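The equivalence in one line, for N training pairs (x_i, t_i); the terms involving β alone do not depend on w:

```latex
\ln p(\mathbf{t} \mid \mathbf{w}, \beta)
= \sum_{i=1}^{N} \ln \mathcal{N}\!\left(t_i \mid \mathbf{w}^\top \boldsymbol{\phi}(x_i),\ \beta^{-1}\right)
= -\frac{\beta}{2} \sum_{i=1}^{N} \left(t_i - \mathbf{w}^\top \boldsymbol{\phi}(x_i)\right)^2
  + \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi ,
```

so maximizing the likelihood over w is the same as minimizing the sum of squared errors, i.e., the MSE.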

Think for one minute before going forward. We assumed a Gaussian distribution on the noise and obtained the polynomial (least-squares) fit; this is a strong argument for using MSE. What if we assume other noise distributions? Alternative assumptions on the parameters and the noise lead to different yet related models, whose cost functions would otherwise seem to have little to do with one another.

Now train the model

Logistic regression: a Bernoulli noise. A binary classification task. Assume the prediction structure y = σ(wᵀx), where σ() is the logistic sigmoid. Assume a Bernoulli distribution on the target t with parameter y. Training then proceeds exactly as in linear regression: maximize the likelihood of the training data!
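A minimal sketch of logistic regression trained by gradient descent on the Bernoulli negative log-likelihood (cross entropy); the toy data, learning rate and iteration count are assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy binary data: two Gaussian blobs (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
X = np.hstack([np.ones((100, 1)), X])          # prepend a bias feature
t = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    y = sigmoid(X @ w)                         # y = sigma(w^T x)
    grad = X.T @ (y - t) / len(t)              # gradient of the cross entropy
    w -= lr * grad

print(w)                                       # learned weights
print(sigmoid(X @ w)[:5])                      # predicted class-1 probabilities
```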

Softmax regression: a multinomial noise. A multi-class classification task. Assume the prediction structure y = σ(Wx), where σ() is the softmax function. Assume a multinomial distribution on the target t with parameter y. Maximizing the likelihood amounts to minimizing the cross entropy.
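The same pattern for K classes, with a softmax output and one-hot targets; this is a sketch, and the data shapes and step size are assumptions:

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)       # subtract row max for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_grad(X, T, W):
    """Gradient of the multinomial negative log-likelihood (cross entropy)."""
    Y = softmax(X @ W)                         # N x K class probabilities
    return X.T @ (Y - T) / len(X)              # D x K gradient

# Example shapes: N=6 samples, D=3 features (incl. bias), K=3 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
T = np.eye(3)[rng.integers(0, 3, size=6)]      # one-hot targets
W = np.zeros((3, 3))
for _ in range(200):
    W -= 0.5 * cross_entropy_grad(X, T, W)
print(softmax(X @ W))
```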

Probit regression: a probit noise. Assume a more implicit noise model, derived from a cumulative distribution function.
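Concretely, with Φ the cumulative distribution function of the standard Gaussian (a common choice; the slide leaves the CDF unspecified):

```latex
p(t = 1 \mid x) = \Phi\!\left(\mathbf{w}^\top \boldsymbol{\phi}(x)\right),
\qquad
\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta \mid 0, 1)\, d\theta .
```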

If the assumption is wrong: see Bishop, PRML, Fig. 4.4. The magenta line is the least-squares (linear regression) decision boundary and the green line is logistic regression; the least-squares boundary is badly distorted by outliers, while logistic regression is robust to them.

Extension to hierarchical structure. Static models: Y = m + wX + ϵ; Y = m + wU + vV + ϵ; Y(i,j) = m + w(i)U + v(i,j)V + ϵ(i,j); Y(i,j) = m + U(w(i) + v(i,j)) + ϵ(i,j). Dynamic model: x(t) = A x(t-1) + ϵ(t), y(t) = B x(t) + v(t), where {ϵ(t)} and {v(t)} are Gaussian processes; see the sketch below.
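A tiny simulation of the dynamic (linear-Gaussian state-space) model; all dimensions, matrices and noise levels are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.9]])         # state transition matrix
B = np.array([[1.0, 0.0]])                     # observation matrix
q, r = 0.1, 0.2                                # std of eps(t) and v(t)

x = np.zeros(2)
observations = []
for t in range(100):
    x = A @ x + q * rng.standard_normal(2)     # x(t) = A x(t-1) + eps(t)
    y = B @ x + r * rng.standard_normal(1)     # y(t) = B x(t) + v(t)
    observations.append(y.item())

print(observations[:5])
```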

Probabilistic PCA. Traditional PCA finds an orthogonal transform that maps the data to a lower-dimensional space, preserving as much of the variance as possible (equivalently, minimizing the reconstruction cost). {xi}, i = 1,..,N, xi ∈ R^D; V = {vi} ∈ R^{M×D}; {yi}, i = 1,..,N, yi = V xi ∈ R^M. The goal is to find a V such that {yi} possess the maximal variance. V is called a loading matrix. Begin with the first basis v1: let v1ᵀv1 = 1 and compute the variance of the projection.
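Classical PCA via the eigendecomposition of the sample covariance; a sketch, with random placeholder data and an assumed target dimension M = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # N=200, D=5

mu = X.mean(axis=0)
S = np.cov(X - mu, rowvar=False)               # D x D sample covariance

eigvals, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
M = 2
V = eigvecs[:, order[:M]].T                    # M x D loading matrix

Y = (X - mu) @ V.T                             # N x M projections with maximal variance
print(Y.var(axis=0), eigvals[order[:M]])       # projected variances ~ top eigenvalues
```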

Probabilistic PCA. Maximizing the projected variance v1ᵀS v1 subject to v1ᵀv1 = 1 shows that v1 is an eigenvector of the data covariance S, and λ1 is the corresponding eigenvalue.
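Spelled out with a Lagrange multiplier for the unit-norm constraint (a standard step, not shown on the slide):

```latex
\max_{\mathbf{v}_1}\ \mathbf{v}_1^\top S\, \mathbf{v}_1 + \lambda_1\!\left(1 - \mathbf{v}_1^\top \mathbf{v}_1\right)
\;\Longrightarrow\;
S\,\mathbf{v}_1 = \lambda_1 \mathbf{v}_1,
\qquad
\mathbf{v}_1^\top S\, \mathbf{v}_1 = \lambda_1 .
```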

Probabilistic PCA. x = Wz + μ + ϵ, with z ~ N(z | 0, I) and ϵ ~ N(ϵ | 0, σ²I).

Probabilistic PCA. The task is: given the training data set {xi}, estimate the parameters W, μ, σ. Since p(z) and p(x|z) are both Gaussian, the marginal p(x) is Gaussian as well. Note that there is redundancy in W: WWᵀ is invariant under any orthogonal transform R (replacing W by WR leaves it unchanged).
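The marginal obtained by integrating out z, which is where the WWᵀ redundancy shows up:

```latex
p(\mathbf{x}) = \mathcal{N}\!\left(\mathbf{x} \,\middle|\, \boldsymbol{\mu},\ C\right),
\qquad
C = W W^\top + \sigma^2 I ,
```

and C is unchanged if W is replaced by WR for any orthogonal R.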

Probabilistic PCA. Estimate W, μ, σ by maximum likelihood (ML). Setting the derivatives with respect to W, μ, σ to zero gives W_ML = U_M (L_M − σ²I)^{1/2} R and σ²_ML = (1/(D−M)) Σ_{j=M+1}^{D} λ_j, where U_M ∈ R^{D×M} is composed of M eigenvectors of the covariance S (any subset gives a stationary point; the top M give the maximum), L_M is the M×M diagonal matrix of the corresponding eigenvalues, and R is an arbitrary orthogonal matrix.
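A compact implementation of the closed-form ML solution and the posterior-mean projection, following the formulas above (choosing R = I); the random data are a placeholder:

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form maximum-likelihood PPCA: returns W, mu, sigma2."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2 = lam[M:].mean()                              # average of the discarded eigenvalues
    W = U[:, :M] @ np.diag(np.sqrt(lam[:M] - sigma2))    # W_ML with R = I
    return W, mu, sigma2

def ppca_project(X, W, mu, sigma2):
    """Posterior mean E[z|x] = M^{-1} W^T (x - mu), with M = W^T W + sigma2 * I."""
    Mmat = W.T @ W + sigma2 * np.eye(W.shape[1])
    return (X - mu) @ W @ np.linalg.inv(Mmat).T

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4)) @ rng.standard_normal((4, 4))
W, mu, sigma2 = ppca_ml(X, M=2)
Z = ppca_project(X, W, mu, sigma2)
print(W.shape, sigma2, Z.shape)
```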

Probabilistic PCA. Posterior probability p(z|x). Projection of x is implemented as the posterior mean E[z|x] = M⁻¹Wᵀ(x − μ), where M = WᵀW + σ²I. If σ → 0, we recover standard PCA.

Probabilistic LDA. Traditional LDA seeks a linear transformation y = wᵀx, y ∈ R, such that the separation of the projections {yi} between the two groups is maximized. The separation is measured by the Fisher discriminant: the ratio of the between-class scatter to the within-class scatter of the projections.

Probabilistic LDA. w is an eigenvector of S_W⁻¹S_B!
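A sketch of Fisher LDA for two classes: the direction can be taken directly as w ∝ S_W⁻¹(m₂ − m₁), which is the leading eigenvector of S_W⁻¹S_B in the two-class case. The toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(60, 2))   # class 1 samples
X2 = rng.normal(loc=[+1.5, 1.0], scale=1.0, size=(60, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)      # within-class scatter
w = np.linalg.solve(Sw, m2 - m1)                            # w proportional to Sw^{-1}(m2 - m1)
w /= np.linalg.norm(w)

y1, y2 = X1 @ w, X2 @ w                                     # 1-D projections of each class
print(m1 @ w, m2 @ w)                                       # well-separated projected means
```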

Probabilistic LDA. x(i,j) = m + A(v(i) + ϵ(i,j)), with v ~ N(0, Ψ) and ϵ ~ N(0, I). Note that a single v is sampled for each person in data generation, as sketched below.
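A generative sketch of this model: one latent identity variable v per person, shared across that person's samples. The dimensions, number of people and samples per person are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, Q = 10, 4                                   # observed dim, latent dim
m = rng.standard_normal(D)                     # global mean
A = 0.5 * rng.standard_normal((D, Q))          # loading matrix
Psi = np.diag(rng.uniform(0.5, 2.0, Q))        # between-person covariance

samples = []
for person in range(5):                        # 5 people (assumed)
    v = rng.multivariate_normal(np.zeros(Q), Psi)   # one v per person
    for j in range(8):                         # 8 samples per person (assumed)
        eps = rng.standard_normal(Q)           # within-person variation, N(0, I)
        x = m + A @ (v + eps)                  # x(i,j) = m + A (v(i) + eps(i,j))
        samples.append(x)

X = np.vstack(samples)                         # 40 x 10 data matrix
print(X.shape)
```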

Training PLDA. A single v is shared by all samples of the same person in model training.

Other famous linear Gaussian models: ICA, NMF, factor analysis.

Other famous (pseudo-)linear (non-Gaussian) models: word2vec, RBM, maximum entropy, perceptron.

Bayesian treatment. Introduce a prior on the parameters to be estimated: a way of integrating knowledge and experience. The estimate of the parameters is then not a point but a distribution. The prediction is a distribution as well (even without counting the target noise). The simplest use of the prior: MAP.

A more specific case. Set a simple Gaussian prior on w. It is a conjugate prior, which ensures that the posterior has the same form as the prior. The resulting estimate amounts to regularization!
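A sketch of the resulting MAP estimate: with prior w ~ N(0, α⁻¹I) and Gaussian noise of precision β, the MAP solution is exactly ridge regression with λ = α/β. The toy data and the values of α, β and the degree are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(25)

degree, alpha, beta = 9, 1e-3, 100.0           # prior and noise precisions (assumed)
Phi = np.vander(x, degree + 1, increasing=True)

lam = alpha / beta
# MAP / ridge solution: w = (lam*I + Phi^T Phi)^{-1} Phi^T t
w_map = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)
w_ml  = np.linalg.lstsq(Phi, t, rcond=None)[0]

print(np.linalg.norm(w_ml), np.linalg.norm(w_map))   # the prior shrinks the weights
```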

PRML, Fig. 3.7 (sequential Bayesian learning of a simple linear model: prior/posterior in parameter space and sampled functions in data space).

Bayesian step 1: from MLE to MAP. The maximum a posteriori (MAP) estimate is obtained by maximizing the posterior of w: p(w|X) ∝ p(X|w) p(w). This differs from the MLE, which maximizes p(X|w) alone and does not use any prior knowledge about w. The result is better generalization.

Bayesian step 2: From point-wise prediction to marginal prediction
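The prediction now marginalizes the model over the posterior, p(t|x, D) = ∫ p(t|x, w) p(w|D) dw; for the Gaussian prior/likelihood pair this is available in closed form. A sketch under assumed α, β and toy data, with formulas as in PRML Ch. 3:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(25)

degree, alpha, beta = 9, 1e-3, 100.0
Phi = np.vander(x, degree + 1, increasing=True)

# Posterior over w: N(w | m_N, S_N)
S_N_inv = alpha * np.eye(degree + 1) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution at a new input:
#   mean = m_N^T phi(x),  variance = 1/beta + phi(x)^T S_N phi(x)
x_new = 0.3
phi = np.vander(np.array([x_new]), degree + 1, increasing=True)[0]
pred_mean = m_N @ phi
pred_var = 1.0 / beta + phi @ S_N @ phi
print(pred_mean, pred_var)
```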

PRML, Fig. 3.8 (predictive distribution of Bayesian linear regression on the sinusoidal data set, with the predictive uncertainty shown as a shaded region).

Why we want Bayesian: it is a way of integrating knowledge; it is sometimes equivalent to imposing regularization, which we know is effective; and it propagates more knowledge from training to test.

Wrap up. Linear models form a large family of models, and we are particularly interested in their probabilistic interpretations. Linear models are simple yet powerful: give them priority when you face a problem. The Bayesian linear model is highly useful, but the Bayesian approach is not limited to linear models.

Q&A