Generative Models Rong Jin

Statistical Inference
Training examples → learning a statistical model p(x; θ) → prediction p(y|x; θ). Example (heights): male heights follow a Gaussian distribution N(μ_1, σ_1), female heights a Gaussian distribution N(μ_2, σ_2); predict Pr(male | 1.67 m) and Pr(female | 1.67 m).

Probabilistic Models for Classification Problems
Apply statistical inference methods: given training examples {(x_i, y_i)}, assume a parametric model, learn the model parameters θ from the training examples using the maximum likelihood approach, and predict the class of a new instance x as y* = argmax_y p(y | x; θ).

Maximum Likelihood Estimation (MLE)
Given training examples, compute the log-likelihood of the data, ℓ(θ) = Σ_i log p(x_i; θ), and find the parameters θ that maximize it. In many cases the log-likelihood has no closed-form maximizer, so MLE requires numerical optimization.
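As a minimal sketch (not from the slides; synthetic data and SciPy are assumed), the snippet below fits a 1-D Gaussian by numerically maximizing the log-likelihood and checks the result against the closed-form MLE, the sample mean and standard deviation.

```python
# A minimal MLE sketch with synthetic heights: maximize the Gaussian
# log-likelihood numerically and compare with the closed-form estimates.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
heights = rng.normal(loc=1.75, scale=0.07, size=100)  # assumed training examples

def neg_log_likelihood(params, x):
    mu, log_sigma = params                 # optimize log(sigma) so sigma stays > 0
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

# Numerical MLE: what we fall back to when no closed form exists.
result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(heights,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# Closed-form MLE for a Gaussian: sample mean and (biased) sample std.
print(mu_hat, sigma_hat)
print(heights.mean(), heights.std())
```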

Generative Models
Most probabilistic models specify a distribution over the inputs (i.e., p(x; θ)), not the conditional distribution p(y | x; θ). Using Bayes' rule: p(y | x; θ) ∝ p(x | y; θ) p(y; θ).

Generative Models (cont'd)
Treatment of p(x | y; θ): let y ∈ Y = {1, 2, …, c} and allocate a separate set of parameters to each class, θ = {θ_1, θ_2, …, θ_c}, so that p(x | y; θ) = p(x; θ_y). Data in different classes have different input patterns.

Generative Models (cont'd)
Parameter space: the class-conditional distribution parameters {θ_1, θ_2, …, θ_c} and the class priors {p(y=1), p(y=2), …, p(y=c)}. Learn the parameters from training examples using MLE: compute the log-likelihood, then search for the optimal parameters by maximizing it.
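A hedged aside (this step is not spelled out on the slide): the log-likelihood of labeled data decouples across the class priors and the per-class parameters, which is why each class can be fit separately.

```latex
\ell(\theta) = \sum_{i=1}^{n} \log\bigl[\, p(y_i)\, p(x_i;\theta_{y_i}) \,\bigr]
             = \sum_{i=1}^{n} \log p(y_i)
             \;+\; \sum_{k=1}^{c} \sum_{i:\, y_i = k} \log p(x_i;\theta_k)
```

Maximizing the first term gives the empirical class frequencies; each remaining inner sum is a separate MLE problem for θ_k.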

Example
Task: predict the gender of an individual from height. Given 100 height examples of women and 100 height examples of men, assume the heights of women and men follow different Gaussian distributions.

Example (cont'd)
Use a Gaussian distribution for each class. Parameter space: Gaussian distribution for males, (μ_m, σ_m); Gaussian distribution for females, (μ_f, σ_f); class priors p_m = p(y = male) and p_f = p(y = female).

Example (cont'd)
Learn a Gaussian generative model: the maximum likelihood estimates of (μ_m, σ_m) and (μ_f, σ_f) are the per-class sample means and standard deviations, and the estimates of p_m and p_f are the empirical class fractions.

Example (cont'd)
Predict the gender of an individual given his/her height h: Pr(male | h) = p_m N(h; μ_m, σ_m) / [ p_m N(h; μ_m, σ_m) + p_f N(h; μ_f, σ_f) ].
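The sketch below puts the last two slides together on synthetic heights (the slide's actual data is not available, so the numbers are assumptions): fit the per-class Gaussians and priors by MLE, then compute the posterior Pr(male | height) by Bayes' rule.

```python
# Fit the gender/height generative model on synthetic data (assumed numbers)
# and compute Pr(male | height) via Bayes' rule.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
male_heights = rng.normal(1.78, 0.07, size=100)
female_heights = rng.normal(1.65, 0.06, size=100)

# MLE: per-class sample means/standard deviations and empirical class priors.
mu_m, sigma_m = male_heights.mean(), male_heights.std()
mu_f, sigma_f = female_heights.mean(), female_heights.std()
p_m = len(male_heights) / (len(male_heights) + len(female_heights))
p_f = 1.0 - p_m

def pr_male(h):
    """Posterior Pr(male | height = h) from the generative model."""
    male_term = p_m * norm.pdf(h, mu_m, sigma_m)
    female_term = p_f * norm.pdf(h, mu_f, sigma_f)
    return male_term / (male_term + female_term)

print(pr_male(1.67))  # the query height used on the earlier slide
```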

Decision Boundary
Decision boundary h*: predict female when h < h*, predict male when h > h*, and choose at random when h = h*. Where is the decision boundary? It depends on the ratio p_m / p_f (the accompanying figures show the boundary for p_f < p_m and for p_f > p_m).

Gaussian Generative Model (II)
Inputs contain multiple features. Example task: predict whether an individual is overweight based on salary and the number of hours spent watching TV. Input: (s: salary, h: hours of TV watching). Output: +1 (overweight), -1 (normal).

Multivariate Gaussian Distribution
p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp( -(1/2) (x - μ)^T Σ^{-1} (x - μ) ), with mean vector μ ∈ R^d and covariance matrix Σ ∈ R^{d×d}.
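As a quick sanity check (my own, not part of the slides), the density above matches SciPy's multivariate normal:

```python
# Verify the density formula against SciPy's multivariate normal.
import numpy as np
from scipy.stats import multivariate_normal

d = 3
rng = np.random.default_rng(4)
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
cov = A @ A.T + np.eye(d)          # a valid (positive definite) covariance matrix
x = rng.normal(size=d)

diff = x - mu
manual = np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / np.sqrt(
    (2 * np.pi) ** d * np.linalg.det(cov))
print(np.isclose(manual, multivariate_normal.pdf(x, mean=mu, cov=cov)))  # True
```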

Properties of the Covariance Matrix
What if the number of data points N < d? The estimated Σ is singular. What is v^T Σ v for any vector v? It is ≥ 0, so Σ is a positive semi-definite matrix. How many distinct elements does Σ have? d(d+1)/2, since Σ is symmetric.

Gaussian Generative Model (II)
Model the joint distribution p(s, h) over salary (s) and hours of TV watching (h).

Multivariate Gaussian Generative Model
Inputs with multiple features: use a multivariate Gaussian distribution for each class, p(x | y = k) = N(x; μ_k, Σ_k).
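A minimal sketch of this model (the 2-D salary/TV-hours data and all numbers below are made up for illustration): estimate a mean vector, covariance matrix, and prior per class, and classify by the largest log prior plus log class-conditional density.

```python
# A multivariate Gaussian generative classifier on made-up 2-D data:
# (salary, TV hours) as features, +1 = overweight, -1 = normal.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
X_pos = rng.multivariate_normal([40_000, 5.0], [[6e7, 0.0], [0.0, 2.0]], size=100)
X_neg = rng.multivariate_normal([55_000, 2.5], [[8e7, 0.0], [0.0, 1.5]], size=100)

classes = {}
for label, X in ((+1, X_pos), (-1, X_neg)):
    classes[label] = {
        "prior": len(X) / (len(X_pos) + len(X_neg)),
        "mean": X.mean(axis=0),
        "cov": np.cov(X, rowvar=False),  # MLE up to the N/(N-1) factor
    }

def predict(x):
    """Pick the class with the largest log prior + log class-conditional density."""
    scores = {
        label: np.log(c["prior"]) + multivariate_normal.logpdf(x, c["mean"], c["cov"])
        for label, c in classes.items()
    }
    return max(scores, key=scores.get)

print(predict([42_000, 4.0]))
```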

Improve the Multivariate Gaussian Model
How could we improve the model's prediction of overweight? Allow multiple modes for each class, and introduce more attributes of individuals: location, occupation, number of children, house, age, …

Problems with Using the Multivariate Gaussian Generative Model
Σ is a d×d matrix containing d(d+1)/2 independent variables: for d = 100 there are 5,050 variables in Σ, and for d = 1000 there are 500,500, a large parameter space. Moreover, Σ can be singular, if N < d or if two features are linearly correlated, in which case Σ^{-1} does not exist.

Problems with Using the Multivariate Gaussian Generative Model (cont'd)
Possible remedies: diagonalize Σ, i.e., make the feature independence assumption (the Naïve Bayes assumption), or smooth the covariance matrix; a sketch of both follows.
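A brief sketch of the two remedies (the shrinkage form of the smoother, blending Σ with a scaled identity, is my assumption; the slide does not specify how to smooth):

```python
import numpy as np

def diagonalize(cov):
    """Keep only the per-feature variances (feature independence / Naive Bayes)."""
    return np.diag(np.diag(cov))

def smooth(cov, lam=0.1):
    """Shrink the covariance toward a scaled identity so it is never singular."""
    d = cov.shape[0]
    return (1 - lam) * cov + lam * (np.trace(cov) / d) * np.eye(d)

# With N = 5 samples of d = 20 features, the sample covariance is singular.
X = np.random.default_rng(3).normal(size=(5, 20))
cov = np.cov(X, rowvar=False)
print(np.linalg.matrix_rank(cov), np.linalg.matrix_rank(smooth(cov)))
```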

Overfitting Issue
A complex model vs. insufficient training data. Example: a classification problem with 100 input features, 5 classes, and 1000 training examples. Total number of parameters for a full Gaussian model: 5 class priors → 5 parameters; 5 means → 500 parameters; 5 covariance matrices → 5 × 5,050 = 25,250 parameters; 25,755 parameters in total, far more than the available training data.

Model Complexity vs. Data

Naïve Bayes Model
In general, for any generative model we have to estimate p(x | y; θ). For x in a high-dimensional space this probability is hard to estimate. In the Naïve Bayes model we approximate it by a product over individual features: p(x | y; θ) ≈ ∏_{j=1}^{d} p(x_j | y; θ).

Text Categorization
Learn to classify text into predefined categories. Input x: a document, represented by a vector of word counts, e.g., {(president, 10), (bush, 2), (election, 5), …}. Output y: whether the document is political or not, +1 for a political document and -1 otherwise.

Text Classification
A generative model for text classification (TC). Parameter space: p(+) and p(-); p(doc | +; θ) and p(doc | -; θ). It is difficult to estimate p(doc | +; θ) and p(doc | -; θ) directly: the typical vocabulary size is ~100,000, so each document is a vector of 100,000 attributes, and there are too many words in a document. Hence, a Naïve Bayes approach.

Text Classification
A Naïve Bayes approach: for a document doc, p(doc | y; θ) ≈ ∏_i p(w_i | y)^{n(w_i, doc)}, where n(w_i, doc) is the number of times word w_i appears in doc.

Text Classification
The original parameter space: p(+) and p(-); p(doc | +; θ) and p(doc | -; θ). The parameter space after the Naïve Bayes simplification: p(+) and p(-); {p(w_1 | +), p(w_2 | +), …, p(w_n | +)} and {p(w_1 | -), p(w_2 | -), …, p(w_n | -)}.

Text Classification
Learning the parameters from training examples: each document d_i comes with a label y_i ∈ {+1, -1}; learn the parameters using maximum likelihood estimation.

Text Classification
The optimal solution that maximizes the likelihood of the training data: p(+) and p(-) are the fractions of positive and negative documents, and p(w | +) is the number of occurrences of word w in the positive documents divided by the total number of word occurrences in the positive documents (likewise for p(w | -)).

Text Classification: An Example (Twenty Newsgroups)
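A toy sketch of the estimates from the previous slides (the four training documents are made up): multinomial Naïve Bayes fit by counting, with prediction done in log space.

```python
# Multinomial Naive Bayes on four made-up training documents,
# fit by the counting estimates above and applied in log space.
from collections import Counter
import math

train = [
    ("president election vote president", +1),
    ("election campaign senate", +1),
    ("movie actor film review", -1),
    ("music concert film", -1),
]

counts = {+1: Counter(), -1: Counter()}
docs_per_class = Counter()
for doc, y in train:
    counts[y].update(doc.split())
    docs_per_class[y] += 1

prior = {y: docs_per_class[y] / len(train) for y in (+1, -1)}
total = {y: sum(counts[y].values()) for y in (+1, -1)}

def log_posterior(doc, y):
    # log p(y) + sum_w n(w, doc) * log p(w | y); an unseen word gets probability
    # zero here, which is exactly the problem raised on the next slide.
    score = math.log(prior[y])
    for w, n in Counter(doc.split()).items():
        p = counts[y][w] / total[y]
        score += n * math.log(p) if p > 0 else float("-inf")
    return score

print(max((+1, -1), key=lambda y: log_posterior("senate vote", y)))  # +1
```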

Text Classification
Any problems with the Naïve Bayes text classifier? Unseen words: if a word w never appears in the training documents, what is the consequence? If w is unseen only in the documents of one class, what is the consequence? (This is related to the overfitting problem.) Any suggestions? One solution is a word-class approach: introduce word classes T = {t_1, t_2, …, t_m}, compute p(t_i | +) and p(t_i | -), and when w has not been seen before, replace p(w | θ) with p(t_i | θ). Another is to introduce a prior on the word probabilities.
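One concrete form of the "prior on word probabilities" idea is Laplace (add-one) smoothing; the choice of this particular prior is my assumption, since the slide does not commit to one.

```python
# counts: a collections.Counter of word counts for one class;
# total: the total number of word occurrences in that class;
# alpha: prior strength (alpha = 1 gives add-one / Laplace smoothing).
def smoothed_word_prob(word, counts, total, vocab_size, alpha=1.0):
    """p(word | class) under a symmetric Dirichlet prior of strength alpha."""
    return (counts[word] + alpha) / (total + alpha * vocab_size)
```

With this estimate an unseen word gets a small but nonzero probability, so a single unseen word no longer drives the class score to minus infinity.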

Naïve Bayes Model
The factorized estimate of p(x | y; θ) is a terrible approximation.

Naïve Bayes Model
Why use the Naïve Bayes model at all? Because we are essentially interested in p(y | x; θ), not p(x | y; θ).

Naïve Bayes Model
The key to the prediction model is not p(x | y; θ) itself but the ratio p(x | y; θ) / p(x | y'; θ). Although the Naïve Bayes model does a poor job of estimating p(x | y; θ), it does a reasonably good job of estimating this ratio.

The Ratio of Likelihood for Binary Classes
Assume that both classes share the same variance (covariance Σ). Then log { [p(x | +; θ) p(+)] / [p(x | -; θ) p(-)] } = w^T x + c with w = Σ^{-1}(μ_+ - μ_-): the Gaussian generative model is a linear model.
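A small check of this claim (my own construction, not from the slides): with a shared covariance, the log-odds computed from the two Gaussian densities equals a linear score w^T x + c.

```python
# With a shared covariance, the Gaussian log-odds reduces to a linear score.
import numpy as np
from scipy.stats import multivariate_normal

mu_pos = np.array([1.0, 2.0])
mu_neg = np.array([-1.0, 0.5])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])   # shared by both classes
p_pos, p_neg = 0.4, 0.6

w = np.linalg.solve(cov, mu_pos - mu_neg)
c = (-0.5 * (mu_pos @ np.linalg.solve(cov, mu_pos)
             - mu_neg @ np.linalg.solve(cov, mu_neg))
     + np.log(p_pos / p_neg))

x = np.array([0.7, -0.2])
log_odds = (np.log(p_pos) + multivariate_normal.logpdf(x, mu_pos, cov)
            - np.log(p_neg) - multivariate_normal.logpdf(x, mu_neg, cov))
print(np.isclose(log_odds, w @ x + c))     # True: the score is linear in x
```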

Linear Decision Boundary
A Gaussian generative model (with shared covariance) amounts to finding a linear decision boundary. Why not estimate the decision boundary directly?