Generative Models Rong Jin
Statistical Inference. Training Examples → Learning a Statistical Model → Prediction. Model the data distribution p(x; θ). Female: Gaussian distribution N(μ1, σ1). Male: Gaussian distribution N(μ2, σ2). Predict Pr(male | 1.67 m) and Pr(female | 1.67 m).
Statistical Inference. Training Examples → Learning a Statistical Model → Prediction. Model the conditional distribution p(y|x; θ). Male: Gaussian distribution N(μ1, σ1). Female: Gaussian distribution N(μ2, σ2). Predict Pr(male | 1.67 m) and Pr(female | 1.67 m).
Probabilistic Models for Classification Problems. Apply statistical inference methods: given training examples {(x1, y1), …, (xn, yn)}, assume a parametric model p(y|x; θ), and learn the model parameters θ from the training examples using the maximum likelihood approach. The class of a new instance x is then predicted as the y that maximizes p(y|x; θ).
Maximum Likelihood Estimation (MLE). Given training examples, compute the log-likelihood of the data, then find the parameters θ* that maximize the log-likelihood. In many cases the log-likelihood has no closed-form maximizer, and MLE requires numerical optimization, as in the sketch below.
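A minimal sketch of such a numerical fit (the data and setup are assumptions, not from the slides): a one-dimensional Gaussian is fitted by minimizing the negative log-likelihood. For a Gaussian the MLE also has a closed form (sample mean and standard deviation), so the optimizer's output can be checked against it.

```python
# Numerical MLE sketch: fit a 1-D Gaussian by maximizing log-likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
heights = rng.normal(loc=1.70, scale=0.08, size=100)  # synthetic data

def neg_log_likelihood(params, x):
    mu, log_sigma = params              # optimize log(sigma) so sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(heights,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # close to the closed-form sample mean / std
```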
Generative Models. Most probability distributions model the data (i.e., p(x; θ)) rather than the conditional distribution (i.e., p(y|x; θ)). Using Bayes' rule, the conditional is obtained from the class-conditional densities and the class priors: p(y|x; θ) ∝ p(x|y; θ) p(y; θ).
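A minimal sketch of this rule for the height example, with assumed Gaussian parameters and priors (illustrative values, not from the slides):

```python
# Bayes' rule sketch: p(y|x) = p(x|y) p(y) / sum_y' p(x|y') p(y').
from scipy.stats import norm

priors = {"male": 0.5, "female": 0.5}                      # assumed priors
likelihoods = {"male": norm(1.78, 0.07), "female": norm(1.65, 0.06)}

def posterior(x):
    joint = {y: likelihoods[y].pdf(x) * priors[y] for y in priors}
    z = sum(joint.values())                                # evidence p(x)
    return {y: joint[y] / z for y in joint}

print(posterior(1.67))  # Pr(male | 1.67 m) and Pr(female | 1.67 m)
```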
Generative Models (cont’d). Treatment of p(x|y; θ): let y ∈ Y = {1, 2, …, c} and allocate a separate set of parameters to each class, θ = {θ1, θ2, …, θc}, so that p(x|y; θ) = p(x; θy). Data in different classes have different input patterns.
Generative Models (cont’d). Parameter space: the parameters of the class-conditional distributions {θ1, θ2, …, θc} and the class priors {p(y=1), p(y=2), …, p(y=c)}. Learn the parameters from training examples using MLE: compute the log-likelihood and search for the optimal parameters by maximizing it.
Example. Task: predict the gender of an individual based on height. Given 100 height examples of women and 100 height examples of men, assume the heights of women and men follow different Gaussian distributions.

Example (cont’d). Parameter space: a Gaussian distribution for males, (μ_m, σ_m); a Gaussian distribution for females, (μ_f, σ_f); and class priors p_m = p(y=male), p_f = p(y=female).
Example (cont’d): learn a Gaussian generative model (see the fitting sketch below).
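A hedged sketch of this fit with synthetic data (the height values and counts are assumptions): for a Gaussian, the MLE is the sample mean and standard deviation, and the class priors are the empirical class fractions.

```python
# MLE fit for the height example with synthetic data.
import numpy as np

rng = np.random.default_rng(1)
male = rng.normal(1.78, 0.07, 100)     # 100 male heights (assumed values)
female = rng.normal(1.65, 0.06, 100)   # 100 female heights

# Closed-form MLE for a 1-D Gaussian: sample mean and (biased) sample std.
mu_m, sigma_m = male.mean(), male.std()
mu_f, sigma_f = female.mean(), female.std()

# Class priors from counts: p_m = n_m / (n_m + n_f).
n = len(male) + len(female)
p_m, p_f = len(male) / n, len(female) / n
print(mu_m, sigma_m, mu_f, sigma_f, p_m, p_f)
```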
Example (cont’d): predict the gender of an individual given his/her height (a prediction sketch follows).
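A minimal prediction sketch, with illustrative parameter values standing in for the fitted ones:

```python
# Bayes classifier for the height example: argmax_y p(h|y) * p(y).
from scipy.stats import norm

mu_m, sigma_m, p_m = 1.78, 0.07, 0.5   # assumed fitted parameters
mu_f, sigma_f, p_f = 1.65, 0.06, 0.5

def predict(h):
    score_m = norm.pdf(h, mu_m, sigma_m) * p_m
    score_f = norm.pdf(h, mu_f, sigma_f) * p_f
    return "male" if score_m > score_f else "female"

print(predict(1.67))
```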
Decision Boundary. The decision boundary is a height h*: predict female when h < h*, predict male when h > h*, and predict randomly when h = h*. Where is the decision boundary? It depends on the ratio p_m / p_f: the class with the larger prior claims more of the input space, so h* shifts as the priors change (the slides illustrate the cases p_f < p_m and p_f > p_m). A numerical sketch of locating h* follows.
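Under the same assumed parameters as above, h* is the height where the two class scores p(h|y) p(y) are equal, which can be found as a root:

```python
# Locate the decision boundary h*: p(h|male) p_m = p(h|female) p_f.
from scipy.optimize import brentq
from scipy.stats import norm

mu_m, sigma_m, p_m = 1.78, 0.07, 0.5   # assumed fitted parameters
mu_f, sigma_f, p_f = 1.65, 0.06, 0.5

def score_gap(h):
    return norm.pdf(h, mu_m, sigma_m) * p_m - norm.pdf(h, mu_f, sigma_f) * p_f

# The gap changes sign between the two means, so a root lies in between.
h_star = brentq(score_gap, mu_f, mu_m)
print(h_star)  # shifts toward mu_f as p_m grows, toward mu_m as p_f grows
```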
Gaussian Generative Model (II). Inputs contain multiple features. Example task: predict whether an individual is overweight based on his/her salary and the number of hours spent watching TV. Input: (s: salary, h: hours of TV watching). Output: +1 (overweight), -1 (normal).
Multi-variate Gaussian Distribution.
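For reference, the standard multivariate Gaussian density in d dimensions, with mean μ and covariance Σ, is

$$
p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big).
$$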
Properties of the Covariance Matrix. What if the number of data points N < d? Then the sample covariance Σ has rank at most N and is singular. For any vector v, vᵀΣv ≥ 0, so Σ is a positive semi-definite matrix. How many distinct elements are in Σ? Since Σ is symmetric, d(d+1)/2 (see the sketch below).
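A small sketch verifying these properties numerically on synthetic data:

```python
# The sample covariance is symmetric PSD, and singular when N < d.
import numpy as np

rng = np.random.default_rng(2)
N, d = 5, 10                          # fewer points than dimensions
X = rng.normal(size=(N, d))
Sigma = np.cov(X, rowvar=False)       # d x d sample covariance

eigvals = np.linalg.eigvalsh(Sigma)
print(eigvals.min() >= -1e-10)        # PSD: no meaningfully negative eigenvalues
print(np.linalg.matrix_rank(Sigma))   # at most N-1 = 4 < d, hence singular
print(d * (d + 1) // 2)               # distinct entries in symmetric Sigma: 55
```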
Gaussian Generative Model (II). Joint distribution p(s, h) for salary (s) and hours of TV watching (h).
Multi-variate Gaussian Generative Model. For input with multiple features, use a multi-variate Gaussian distribution for each class (a fit-and-predict sketch follows).
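A hedged two-class sketch for the overweight example, with assumed synthetic data: each class gets its own mean vector and covariance matrix, and prediction compares p(x|y) p(y).

```python
# Two-class multivariate Gaussian generative model (illustrative data).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
X_pos = rng.normal([60_000, 4.0], [15_000, 1.5], size=(100, 2))  # overweight (+1)
X_neg = rng.normal([50_000, 2.0], [15_000, 1.5], size=(100, 2))  # normal (-1)

def fit(X):
    return X.mean(axis=0), np.cov(X, rowvar=False)

(mu_p, S_p), (mu_n, S_n) = fit(X_pos), fit(X_neg)
p_pos = len(X_pos) / (len(X_pos) + len(X_neg))   # empirical class prior

def predict(x):
    score_p = multivariate_normal.pdf(x, mu_p, S_p) * p_pos
    score_n = multivariate_normal.pdf(x, mu_n, S_n) * (1 - p_pos)
    return +1 if score_p > score_n else -1

print(predict([58_000, 3.5]))
```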
Improve the Multivariate Gaussian Model. How could we improve the model's predictions of overweight? Allow multiple modes for each class, and introduce more attributes of individuals: location, occupation, the number of children, house, age, …
Problems with Using the Multi-variate Gaussian Generative Model. Σ is a matrix of size d×d and contains d(d+1)/2 independent variables. d = 100: the number of variables in Σ is 5,050. d = 1000: the number of variables in Σ is 500,500. This is a large parameter space, and Σ can be singular: if N < d, or if two features are linearly correlated, Σ⁻¹ does not exist.
Problems with Using the Multi-variate Gaussian Generative Model: fixes. One fix is to diagonalize Σ, keeping only the per-feature variances; this is the feature independence assumption (the Naïve Bayes assumption). Another is to smooth the covariance matrix, e.g., Σ ← Σ + λI for a small λ > 0, so that it is well-conditioned and invertible. Both fixes are sketched below.
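A sketch of both fixes, assuming the common shrinkage recipe Σ + λI (an illustrative choice, not necessarily the slides' exact formula):

```python
# Fix 1: diagonalize (Naive Bayes). Fix 2: smooth with lambda * I so the
# matrix is guaranteed invertible even when N < d.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 10))                 # N=5 points in d=10 dims
Sigma = np.cov(X, rowvar=False)              # singular: rank <= 4

Sigma_diag = np.diag(np.diag(Sigma))         # fix 1: keep per-feature variances
Sigma_smooth = Sigma + 0.1 * np.eye(10)      # fix 2: shrinkage, lambda = 0.1

print(np.linalg.cond(Sigma))                 # huge: numerically singular
print(np.linalg.cond(Sigma_smooth))          # finite: inverse now exists
```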
Overfitting Issue. Complex model vs. insufficient training data. Example: consider a classification problem with 100 input features, 5 classes, and 1,000 training examples. The total number of parameters for a full Gaussian model: 5 class priors (5 parameters), 5 mean vectors (500 parameters), and 5 covariance matrices (5 × 5,050 = 25,250 parameters), i.e., 25,755 parameters in all; the training data are insufficient.
Model Complexity vs. Data.
Naïve Bayes Model. In general, for any generative model, we have to estimate p(x|y; θ). For x in a high-dimensional space, this probability is hard to estimate. In the Naïve Bayes model, we approximate it by a product of per-feature conditionals: p(x|y; θ) ≈ ∏_j p(x_j | y; θ).
Text Categorization. Learn to classify text into predefined categories. Input x: a document, represented by a vector of word counts, e.g., {(president, 10), (bush, 2), (election, 5), …}. Output y: whether the document is about politics; +1 for a political document, -1 for a non-political document.
Text Classification. A generative model for text classification (TC). Parameter space: p(+) and p(-); p(doc|+; θ) and p(doc|-; θ). It is difficult to estimate p(doc|+; θ) and p(doc|-; θ) directly: a typical vocabulary size is ~100,000, so each document is a vector of 100,000 attributes, and a document contains too many words. Hence, a Naïve Bayes approach.
Text Classification. A Naïve Bayes approach: for a document doc = (w1, w2, …), approximate p(doc|y; θ) by the product of its word probabilities, ∏_i p(w_i | y).
Text Classification. The original parameter space: p(+) and p(-); p(doc|+; θ), p(doc|-; θ). The parameter space after the Naïve Bayes simplification: p(+) and p(-); {p(w1|+), p(w2|+), …, p(wn|+)}; {p(w1|-), p(w2|-), …, p(wn|-)}.
Text Classification. Learning the parameters from training examples: each document contributes the counts of its words, and the parameters are learned using maximum likelihood estimation.
Text Classification. The optimal solution that maximizes the likelihood of the training data (the standard multinomial MLE) is the relative word frequency within each class: p(w|y) = n(w, y) / Σ_{w'} n(w', y), where n(w, y) counts the occurrences of word w in the training documents of class y.
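A minimal multinomial Naïve Bayes sketch along these lines (the toy documents and the Laplace smoothing are illustrative assumptions, not the slides' exact recipe):

```python
# Multinomial Naive Bayes: smoothed word frequencies per class,
# documents represented as word-count dictionaries.
from collections import Counter
import math

train = [
    (Counter(president=10, bush=2, election=5), +1),
    (Counter(movie=7, actor=3), -1),
]

def fit(docs, alpha=1.0):
    """Return log priors and smoothed log word probabilities per class."""
    vocab = {w for doc, _ in docs for w in doc}
    counts = {+1: Counter(), -1: Counter()}
    n_docs = Counter()
    for doc, y in docs:
        counts[y].update(doc)
        n_docs[y] += 1
    log_prior = {y: math.log(n_docs[y] / len(docs)) for y in counts}
    log_pw = {}
    for y, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)   # Laplace smoothing
        log_pw[y] = {w: math.log((c[w] + alpha) / total) for w in vocab}
    return log_prior, log_pw, vocab

def predict(doc, log_prior, log_pw, vocab):
    scores = {}
    for y in log_prior:
        scores[y] = log_prior[y] + sum(
            n * log_pw[y][w] for w, n in doc.items() if w in vocab)
    return max(scores, key=scores.get)

log_prior, log_pw, vocab = fit(train)
print(predict(Counter(election=3, president=1), log_prior, log_pw, vocab))
```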
Text Classification. An example: the Twenty Newsgroups dataset.
Text Classification. Any problems with the Naïve Bayes text classifier? Unseen words: if word 'w' is unseen in all training documents, what is the consequence? If 'w' is unseen only in the documents of one class, its estimated probability for that class is zero, and a single occurrence of 'w' zeroes out the whole product for that class. This is related to the overfitting problem. Any suggestions? One solution is a word-class approach: introduce word classes T = {t1, t2, …, tm}, compute p(ti|+) and p(ti|-), and when w is unseen, replace p(w|·) with p(ti|·) for the class ti containing w. Another is to introduce a prior on the word probabilities, as the Laplace smoothing in the sketch above does.
Naïve Bayes Model. The independence factorization is a terrible approximation: features such as the words of a document are far from independent given the class.
Naïve Bayes Model. Why use the Naïve Bayes model, then? Because we are essentially interested in p(y|x; θ), not p(x|y; θ).
Naïve Bayes Model. The key quantity for the prediction model is not p(x|y; θ) but the ratio p(x|y; θ) / p(x|y'; θ). Although the Naïve Bayes model does a poor job of estimating p(x|y; θ), it does a reasonably good job of estimating the ratio.
The Ratio of Likelihood for Binary Classes. Assume that both classes share the same variance (covariance matrix). Then the Gaussian generative model is a linear model, as the derivation below shows.
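Under the shared-covariance assumption, the log-odds expand as follows (a standard derivation consistent with the slides' claim):

$$
\log \frac{p(x \mid +)\,p(+)}{p(x \mid -)\,p(-)}
= (\mu_+ - \mu_-)^{\top} \Sigma^{-1} x
\;-\; \frac{1}{2}\left(\mu_+^{\top}\Sigma^{-1}\mu_+ - \mu_-^{\top}\Sigma^{-1}\mu_-\right)
\;+\; \log\frac{p(+)}{p(-)}
\;=\; w^{\top} x + b.
$$

The quadratic term xᵀΣ⁻¹x appears in both class log-densities and cancels, leaving a function linear in x; the decision boundary wᵀx + b = 0 is therefore a hyperplane.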
Linear Decision Boundary. A Gaussian generative model (with shared covariance) is equivalent to finding a linear decision boundary. Why not estimate the decision boundary directly?