Generative versus discriminative classifiers

Presentation transcript:

Machine Learning: Generative versus Discriminative Classifiers. Eric Xing, Lecture 2, August 12, 2010. © Eric Xing @ CMU, 2006-2010

Generative and Discriminative classifiers Goal: Wish to learn f: X → Y, e.g., P(Y|X) Generative: Modeling the joint distribution of all data Discriminative: Modeling only points at the boundary © Eric Xing @ CMU, 2006-2010

Generative vs. Discriminative Classifiers Goal: Wish to learn f: X → Y, e.g., P(Y|X) Generative classifiers (e.g., Naïve Bayes): Assume some functional form for P(X|Y), P(Y) This is a 'generative' model of the data! Estimate parameters of P(X|Y), P(Y) directly from training data Use Bayes rule to calculate P(Y|X=x) Discriminative classifiers (e.g., logistic regression): Directly assume some functional form for P(Y|X) This is a 'discriminative' model of the data! Estimate parameters of P(Y|X) directly from training data © Eric Xing @ CMU, 2006-2010
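
Since the slide equations are images that did not survive in this transcript, here is the standard Bayes-rule step the generative route relies on (my notation, not copied from the slide):

```latex
% Bayes rule: a generative model of P(X|Y) and P(Y) yields the posterior used for classification.
P(Y=k \mid X=x) \;=\; \frac{P(X=x \mid Y=k)\, P(Y=k)}{\sum_{k'} P(X=x \mid Y=k')\, P(Y=k')}
```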

Suppose you know the following … Class-specific Dist.: P(X|Y) Class prior (i.e., "weight"): P(Y) This is a generative model of the data! Bayes classifier: © Eric Xing @ CMU, 2006-2010

Optimal classification Theorem: the Bayes classifier is optimal, i.e., it achieves the minimum possible probability of error. How do we learn a Bayes classifier? Recall density estimation: we need to estimate P(X|y=k) and P(y=k) for all k © Eric Xing @ CMU, 2006-2010

Gaussian Discriminative Analysis learning f: X → Y, where X is a vector of real-valued features, Xn = <Xn,1, …, Xn,m>, and Y is an indicator vector What does that imply about the form of P(Y|X)? The joint probability of a datum and its label is: Given a datum xn, we predict its label using the conditional probability of the label given the datum: © Eric Xing @ CMU, 2006-2010

Conditional Independence X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z Which we often write e.g., Equivalent to: © Eric Xing @ CMU, 2006-2010

Naïve Bayes Classifier When X is a multivariate-Gaussian vector: The joint probability of a datum and its label is: The naïve Bayes simplification: More generally: where p(. | .) is an arbitrary conditional (discrete or continuous) 1-D density © Eric Xing @ CMU, 2006-2010
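
To make the naïve Bayes recipe concrete, here is a minimal Gaussian naïve Bayes sketch in Python (class and variable names are my own, not from the lecture): per-class Gaussians with independent features, and class priors taken from class frequencies.

```python
# Minimal Gaussian naive Bayes sketch (illustrative; not the lecture's code).
import numpy as np

class GaussianNB:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([(y == k).mean() for k in self.classes])              # P(y=k)
        self.mu = np.array([X[y == k].mean(axis=0) for k in self.classes])          # per-class feature means
        self.var = np.array([X[y == k].var(axis=0) + 1e-9 for k in self.classes])   # per-class feature variances
        return self

    def predict(self, X):
        # log P(y=k) + sum_j log N(x_j | mu_kj, var_kj), using the conditional-independence assumption
        log_post = np.log(self.prior) + np.array([
            -0.5 * (np.log(2 * np.pi * self.var[k]) + (X - self.mu[k]) ** 2 / self.var[k]).sum(axis=1)
            for k in range(len(self.classes))
        ]).T
        return self.classes[np.argmax(log_post, axis=1)]
```

Note that fitting only needs per-class counts, means, and variances, which is why the parameter estimates are "uncoupled" in the comparison slides further below.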

The predictive distribution Understanding the predictive distribution Under the naïve Bayes assumption: For two classes (i.e., K=2), and when the two classes have the same variance, the predictive distribution turns out to be a logistic function © Eric Xing @ CMU, 2006-2010
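
The equation itself is an image in the original deck; a standard derivation of the claim (two classes with shared covariance Σ and class priors π1, π2, in my own symbols) gives:

```latex
% Two Gaussian classes with shared covariance: the posterior is a logistic function of a linear score.
P(y=1 \mid x)
= \frac{\pi_1\,\mathcal{N}(x \mid \mu_1, \Sigma)}{\pi_1\,\mathcal{N}(x \mid \mu_1, \Sigma) + \pi_2\,\mathcal{N}(x \mid \mu_2, \Sigma)}
= \frac{1}{1 + e^{-(w^\top x + b)}},
\qquad
w = \Sigma^{-1}(\mu_1 - \mu_2),
\qquad
b = -\tfrac{1}{2}\mu_1^\top \Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_2^\top \Sigma^{-1}\mu_2 + \log\tfrac{\pi_1}{\pi_2}
```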

The decision boundary The predictive distribution The Bayes decision rule: For multiple classes (i.e., K>2), the predictive distribution corresponds to a softmax function © Eric Xing @ CMU, 2006-2010

Generative vs. Discriminative Classifiers Goal: Wish to learn f: X → Y, e.g., P(Y|X) Generative classifiers (e.g., Naïve Bayes): Assume some functional form for P(X|Y), P(Y) This is a 'generative' model of the data! Estimate parameters of P(X|Y), P(Y) directly from training data Use Bayes rule to calculate P(Y|X=x) Discriminative classifiers: Directly assume some functional form for P(Y|X) This is a 'discriminative' model of the data! Estimate parameters of P(Y|X) directly from training data © Eric Xing @ CMU, 2006-2010

Linear Regression The data: Both nodes are observed: X is an input vector Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator) A regression scheme can be used to model p(y|x) directly, rather than p(x,y) © Eric Xing @ CMU, 2006-2010

Linear Regression Assume that Y (target) is a linear function of X (features): e.g.: let's assume a vacuous "feature" X0=1 (this is the intercept term, why?), and define the feature vector to be: then we have the following general representation of the linear function: Our goal is to pick the optimal θ. How? We seek the θ that minimizes the following cost function: © Eric Xing @ CMU, 2006-2010
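
The cost function itself appears as an image in the deck; the standard least-squares objective being referred to (with x_{n,0} = 1 absorbing the intercept) is:

```latex
J(\theta) = \frac{1}{2}\sum_{n=1}^{N}\bigl(y_n - \theta^\top x_n\bigr)^2
```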

The Least-Mean-Square (LMS) method Consider a gradient descent algorithm: Now we have the following descent rule: For a single training point, we have: This is known as the LMS update rule, or the Widrow-Hoff learning rule This is actually a "stochastic", "coordinate" descent algorithm This can be used as an online algorithm © Eric Xing @ CMU, 2006-2010
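
A minimal Python sketch of the single-point LMS (Widrow-Hoff) update described above; `alpha` is the learning rate and all names are placeholders of mine, not the lecture's code:

```python
# Minimal LMS / Widrow-Hoff sketch (illustrative names and defaults).
import numpy as np

def lms_step(theta, x, y, alpha=0.01):
    """One stochastic update on a single point: theta <- theta + alpha * (y - theta^T x) * x."""
    return theta + alpha * (y - theta @ x) * x

def lms_online(X, Y, alpha=0.01, epochs=10):
    """Apply the update online, sweeping through the data point by point."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            theta = lms_step(theta, x, y, alpha)
    return theta
```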

Probabilistic Interpretation of LMS Let us assume that the target variable and the inputs are related by the equation: where ε is an error term of unmodeled effects or random noise Now assume that ε follows a Gaussian N(0,σ), then we have: By independence assumption: © Eric Xing @ CMU, 2006-2010

Probabilistic Interpretation of LMS, cont. Hence the log-likelihood is: Do you recognize the last term? Yes it is: Thus under independence assumption, LMS is equivalent to MLE of θ ! © Eric Xing @ CMU, 2006-2010
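
For reference (the slide's formulas are images), the log-likelihood under the Gaussian noise model, written out, is:

```latex
\ell(\theta)
= \sum_{n=1}^{N} \log \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\Bigl(-\frac{(y_n - \theta^\top x_n)^2}{2\sigma^2}\Bigr)
= N\log\frac{1}{\sqrt{2\pi}\,\sigma}
  \;-\; \frac{1}{\sigma^2}\,\underbrace{\frac{1}{2}\sum_{n=1}^{N}\bigl(y_n - \theta^\top x_n\bigr)^2}_{J(\theta)}
```

so maximizing the log-likelihood in θ is exactly minimizing J(θ).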

Classification and logistic regression © Eric Xing @ CMU, 2006-2010

The logistic function © Eric Xing @ CMU, 2006-2010

Logistic regression (sigmoid classifier) The conditional distribution: a Bernoulli where μ is a logistic function We can use the brute-force gradient method as in linear regression But we can also apply generic laws by observing that p(y|x) is an exponential family function, more specifically, a generalized linear model (see future lectures …) © Eric Xing @ CMU, 2006-2010
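
Written out (the slide shows this as an image; θ and μ as above):

```latex
p(y \mid x) = \mu(x)^{\,y}\bigl(1-\mu(x)\bigr)^{1-y},
\qquad
\mu(x) = \frac{1}{1 + e^{-\theta^\top x}}
```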

Training Logistic Regression: MCLE Estimate parameters θ = <θ0, θ1, …, θm> to maximize the conditional likelihood of training data Training data Data likelihood = Data conditional likelihood = © Eric Xing @ CMU, 2006-2010

Expressing Conditional Log Likelihood Recall the logistic function: and conditional likelihood: a.k.a. cross entropy error function © Eric Xing @ CMU, 2006-2010

Maximizing Conditional Log Likelihood The objective: Good news: l(θ) is a concave function of θ Bad news: no closed-form solution to maximize l(θ) © Eric Xing @ CMU, 2006-2010
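
Although the gradient is shown as an image in the deck, the standard form (with μn = 1/(1+e^{-θᵀxn})) and the resulting gradient-ascent step are:

```latex
\nabla_\theta\, \ell(\theta) = \sum_{n=1}^{N}\bigl(y_n - \mu_n\bigr)\,x_n,
\qquad
\theta \;\leftarrow\; \theta + \alpha \sum_{n=1}^{N}\bigl(y_n - \mu_n\bigr)\,x_n
```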

The Newton’s method Finding a zero of a function © Eric Xing @ CMU, 2006-2010

The Newton’s method (con’d) To maximize the conditional likelihood l(q): since l is convex, we need to find q* where l’(q*)=0 ! So we can perform the following iteration: © Eric Xing @ CMU, 2006-2010

The Newton-Raphson method In logistic regression θ is vector-valued, so we need the following generalization: ∇ is the gradient operator over the function H is known as the Hessian of the function © Eric Xing @ CMU, 2006-2010
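
In symbols (a standard statement of the vector-valued Newton-Raphson step, not copied from the slide):

```latex
\theta \;\leftarrow\; \theta - H^{-1}\,\nabla_\theta\,\ell(\theta),
\qquad
H = \nabla_\theta \nabla_\theta^\top\, \ell(\theta)
```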

Iterative reweighted least squares (IRLS) Recall the least-squares estimate in linear regression: which can also be derived from Newton-Raphson Now for logistic regression: © Eric Xing @ CMU, 2006-2010
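
The IRLS update on the slide is an image; one common way to write it (Bishop's PRML notation, with R the diagonal weight matrix, assumed here rather than taken from the slide) is:

```latex
\theta^{\text{new}} = (X^\top R X)^{-1} X^\top R\, z,
\qquad
R = \mathrm{diag}\bigl(\mu_n(1-\mu_n)\bigr),
\qquad
z = X\theta^{\text{old}} - R^{-1}(\mu - y)
```

Each Newton step therefore solves a weighted least-squares problem, hence the name.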

Generative vs. Discriminative Classifiers Goal: Wish to learn f: X → Y, e.g., P(Y|X) Generative classifiers (e.g., Naïve Bayes): Assume some functional form for P(X|Y), P(Y) This is a 'generative' model of the data! Estimate parameters of P(X|Y), P(Y) directly from training data Use Bayes rule to calculate P(Y|X=x) Discriminative classifiers: Directly assume some functional form for P(Y|X) This is a 'discriminative' model of the data! Estimate parameters of P(Y|X) directly from training data © Eric Xing @ CMU, 2006-2010

Naïve Bayes vs Logistic Regression Consider Y boolean, X continuous, X=<X1 ... Xm> Number of parameters to estimate: NB: LR: Estimation method: NB parameter estimates are uncoupled LR parameter estimates are coupled © Eric Xing @ CMU, 2006-2010

Naïve Bayes vs Logistic Regression Asymptotic comparison (# training examples → infinity) when model assumptions correct NB, LR produce identical classifiers when model assumptions incorrect LR is less biased – does not assume conditional independence therefore expected to outperform NB © Eric Xing @ CMU, 2006-2010

Naïve Bayes vs Logistic Regression Non-asymptotic analysis (see [Ng & Jordan, 2002] ) convergence rate of parameter estimates – how many training examples needed to assure good estimates? NB order log m (where m = # of attributes in X) LR order m NB converges more quickly to its (perhaps less helpful) asymptotic estimates © Eric Xing @ CMU, 2006-2010

Some experiments from UCI data sets © Eric Xing @ CMU, 2006-2010

Robustness The best fit from a quadratic regression But this is probably better … © Eric Xing @ CMU, 2006-2010

Bayesian Parameter Estimation Treat the distribution parameters θ also as a random variable The a posteriori distribution of θ after seeing the data is: This is Bayes rule (example: disease diagnosis) The prior p(.) encodes our prior knowledge about the domain © Eric Xing @ CMU, 2006-2010

Regularized Least Squares and MAP What if XᵀX is not invertible? log likelihood + log prior I) Gaussian prior → Ridge regression Closed form: HW Prior belief that β is Gaussian with zero mean biases the solution toward "small" β © Eric Xing @ CMU, 2006-2010
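
The closed form is left as homework on the slide; purely as a reference sketch (assuming a zero-mean isotropic Gaussian prior with regularization strength `lam`, names mine), the ridge estimate can be computed as:

```python
# Ridge regression sketch: MAP estimate under a zero-mean Gaussian prior (reference only; the slide leaves the derivation as HW).
import numpy as np

def ridge_fit(X, y, lam=1.0):
    d = X.shape[1]
    # (X^T X + lam*I) is invertible for lam > 0 even when X^T X alone is rank-deficient.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```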

Regularized Least Squares and MAP What if XᵀX is not invertible? log likelihood + log prior II) Laplace prior → Lasso Closed form: HW Prior belief that β is Laplace with zero mean biases the solution toward "small" β © Eric Xing @ CMU, 2006-2010

Ridge Regression vs Lasso HOT! Ridge Regression: Lasso: [Figure: level sets of J(β) in the (β1, β2) plane, together with the constant-l2-norm ball (ridge) and the constant-l1-norm ball (lasso).] Lasso (l1 penalty) results in sparse solutions – vectors with more zero coordinates Good for high-dimensional problems – don't have to store all coordinates! © Eric Xing @ CMU, 2006-2010

Case study: predicting gene expression The genetic picture: causal SNPs (e.g., CGTTTCACTGTACAATTT) and a univariate phenotype, i.e., the expression intensity of a gene Define x and y, write down the linear function [Figure: a convergence curve] © Eric Xing @ CMU, 2006-2010

Association Mapping as Regression. Genotype and phenotype (BMI) per individual:
Individual 1: . . C . . . . . T . . C . . . . . . . T . . . . . C . . . . . A . . C . . . . . . . T . . .  →  BMI 2.5
Individual 2: . . G . . . . . A . . G . . . . . . . A . . . . . C . . . . . T . . C . . . . . . . T . . .  →  BMI 4.8
…
Individual N: . . G . . . . . T . . C . . . . . . . T . . . . . G . . . . . T . . G . . . . . . . T . . .  →  BMI 4.7
One SNP is causal; the remaining SNPs are benign. © Eric Xing @ CMU, 2006-2010

Association Mapping as Regression. Genotypes coded numerically, phenotype (BMI) per individual:
Individual 1: . . 0 . . . . . 1 . . 0 . . . . . . . 0 . . .  →  BMI 2.5
Individual 2: . . 1 . . . . . 1 . . 1 . . . . . . . 1 . . .  →  BMI 4.8
…
Individual N: . . 2 . . . . . 2 . . 1 . . . . . . . 0 . . .  →  BMI 4.7
yi = Σj βj xij ; SNPs with large |βj| are relevant © Eric Xing @ CMU, 2006-2010

Experimental setup Asthma dataset: 543 individuals, genotyped at 34 SNPs Diploid data was transformed into 0/1 (for homozygotes) or 2 (for heterozygotes) X = 543x34 matrix, Y = phenotype variable (continuous) A single phenotype was used for regression Implementation details Iterative methods: batch update and online update implemented. For both methods, the step size α is chosen to be a small fixed value (10^-6). This choice is based on the data used for the experiments. Both methods are run to a maximum of 2000 epochs or until the change in training MSE is less than 10^-4 © Eric Xing @ CMU, 2006-2010
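
A rough Python sketch of the two update schemes described above (placeholder data and names of mine; the actual genotype matrix and phenotype vector are not reproduced here):

```python
# Batch vs. online (stochastic) least-squares updates, mirroring the setup above.
import numpy as np

def batch_update(X, y, alpha=1e-6, max_epochs=2000, tol=1e-4):
    theta = np.zeros(X.shape[1])
    prev_mse = np.inf
    for _ in range(max_epochs):
        theta += alpha * X.T @ (y - X @ theta)      # one full-gradient step per epoch
        mse = np.mean((y - X @ theta) ** 2)
        if abs(prev_mse - mse) < tol:               # stop when training MSE stops changing
            break
        prev_mse = mse
    return theta

def online_update(X, y, alpha=1e-6, max_epochs=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        for xn, yn in zip(X, y):                    # N LMS steps per epoch
            theta += alpha * (yn - xn @ theta) * xn
    return theta
```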

Convergence Curves For the batch method, the training MSE is initially large due to uninformed initialization In the online update, the N updates per epoch reduce the MSE to a much smaller value. © Eric Xing @ CMU, 2006-2010

The Learned Coefficients [Figure: the learned regression coefficients] © Eric Xing @ CMU, 2006-2010

Performance vs. Training Size The results from the batch (B) and online (O) updates are almost identical, so the plots coincide. The test MSE from the normal equation is higher than that of B and O for small training sizes; this is probably due to overfitting. In B and O, at most 2000 iterations are allowed, which roughly acts as a mechanism that avoids overfitting. © Eric Xing @ CMU, 2006-2010

Summary Naïve Bayes classifier and logistic regression: what's the assumption, why we use it, how do we learn it Logistic regression: functional form follows from Naïve Bayes assumptions (for Gaussian Naïve Bayes assuming class-independent variance; for discrete-valued Naïve Bayes too), but the training procedure picks parameters without the conditional independence assumption Gradient ascent/descent – general approach when closed-form solutions are unavailable Generative vs. discriminative classifiers – bias vs. variance tradeoff © Eric Xing @ CMU, 2006-2010

Appendix © Eric Xing @ CMU, 2006-2010

Parameter Learning from iid Data Goal: estimate distribution parameters θ from a dataset of N independent, identically distributed (iid), fully observed training cases D = {x1, . . . , xN} Maximum likelihood estimation (MLE) One of the most common estimators With the iid and full-observability assumptions, write L(θ) as the likelihood of the data: pick the setting of parameters most likely to have generated the data we saw: © Eric Xing @ CMU, 2006-2010

Example: Bernoulli model Data: We observed N iid coin tosses: D={1, 0, 1, …, 0} Representation: Binary r.v.: Model: How to write the likelihood of a single observation xi? The likelihood of dataset D = {x1, …, xN}: © Eric Xing @ CMU, 2006-2010

Maximum Likelihood Estimation Objective function: We need to maximize this w.r.t. θ Take derivatives w.r.t. θ Sufficient statistics: the counts of heads and tails are sufficient statistics of the data D Frequency as sample mean © Eric Xing @ CMU, 2006-2010
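
Writing the Bernoulli MLE out explicitly (with n_h heads in N tosses; the slide's own equations are images):

```latex
\ell(\theta) = n_h \log\theta + (N-n_h)\log(1-\theta),
\qquad
\frac{\partial \ell}{\partial \theta} = \frac{n_h}{\theta} - \frac{N-n_h}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta}_{\mathrm{MLE}} = \frac{n_h}{N}
```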

Overfitting Recall that for the Bernoulli distribution, we have What if we tossed too few times, so that we saw zero heads? We have and we will predict that the probability of seeing a head next is zero!!! The rescue: "smoothing" where n' is known as the pseudo-count (imaginary count) But can we make this more formal? © Eric Xing @ CMU, 2006-2010

Bayesian Parameter Estimation Treat the distribution parameters θ also as a random variable The a posteriori distribution of θ after seeing the data is: This is Bayes rule (example: disease diagnosis) The prior p(.) encodes our prior knowledge about the domain © Eric Xing @ CMU, 2006-2010

Frequentist Parameter Estimation Two people with different priors p(θ) will end up with different estimates p(θ|D). Frequentists dislike this "subjectivity". Frequentists think of the parameter as a fixed, unknown constant, not a random variable. Hence they have to come up with different "objective" estimators (ways of computing an estimate from data), instead of using Bayes' rule. These estimators have different properties, such as being "unbiased", "minimum variance", etc. The maximum likelihood estimator is one such estimator. © Eric Xing @ CMU, 2006-2010

Discussion θ or p(θ), this is the problem! Bayesians know it © Eric Xing @ CMU, 2006-2010

Bayesian estimation for Bernoulli Beta distribution: When x is discrete Posterior distribution of θ: Notice the isomorphism of the posterior to the prior, such a prior is called a conjugate prior a and b are hyperparameters (parameters of the prior) and correspond to the number of "virtual" heads/tails (pseudo counts) © Eric Xing @ CMU, 2006-2010

Bayesian estimation for Bernoulli, cont'd Posterior distribution of θ: Maximum a posteriori (MAP) estimation: Posterior mean estimation: Prior strength: A = a+b A can be interpreted as the size of an imaginary data set from which we obtain the pseudo-counts Beta parameters can be understood as pseudo-counts © Eric Xing @ CMU, 2006-2010
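
For reference, the standard forms of the posterior, MAP estimate, and posterior mean under a Beta(a,b) prior with n_h heads and n_t tails (N = n_h + n_t) are:

```latex
p(\theta \mid D) \propto \theta^{\,n_h + a - 1}(1-\theta)^{\,n_t + b - 1}
\;\Longrightarrow\;
\hat{\theta}_{\mathrm{MAP}} = \frac{n_h + a - 1}{N + a + b - 2},
\qquad
\mathbb{E}[\theta \mid D] = \frac{n_h + a}{N + a + b}
```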

Effect of Prior Strength Suppose we have a uniform prior (a=b=1/2), and we observe Weak prior A = 2. Posterior prediction: Strong prior A = 20. Posterior prediction: However, if we have enough data, it washes away the prior. e.g., . Then the estimates under weak and strong prior are and , respectively, both of which are close to 0.2 © Eric Xing @ CMU, 2006-2010

Example 2: Gaussian density Data: We observed N iid real samples: D={-0.1, 10, 1, -5.2, …, 3} Model: Log likelihood: MLE: take derivative and set to zero: © Eric Xing @ CMU, 2006-2010

MLE for a multivariate Gaussian It can be shown that the MLE for µ and Σ is where the scatter matrix is The sufficient statistics are Σn xn and Σn xn xnᵀ. Note that XᵀX = Σn xn xnᵀ may not be full rank (e.g., if N < D), in which case ΣML is not invertible © Eric Xing @ CMU, 2006-2010
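
The MLE formulas referred to above (shown as images on the slide) are, in standard form:

```latex
\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n,
\qquad
\Sigma_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^\top
```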

Bayesian estimation Normal prior: Joint probability: Posterior: (the posterior involves the sample mean) © Eric Xing @ CMU, 2006-2010

Bayesian estimation: unknown µ, known σ The posterior mean is a convex combination of the prior and the MLE, with weights proportional to the relative noise levels. The precision of the posterior, 1/σN², is the precision of the prior, 1/σ0², plus one contribution of data precision, 1/σ², for each observed data point. Sequentially updating the mean: µ* = 0.8 (unknown), (σ²)* = 0.1 (known) Effect of a single data point Uninformative (vague/flat) prior: σ0² → ∞ © Eric Xing @ CMU, 2006-2010
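
In symbols (standard result for a Gaussian likelihood with known σ² and prior N(µ0, σ0²); x̄ denotes the sample mean):

```latex
\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0
      + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\bar{x},
\qquad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
```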