Recap: Naïve Bayes classifier

$f(X) = \arg\max_y P(y|X) = \arg\max_y P(X|y)P(y) = \arg\max_y \prod_{i=1}^{|V|} P(x_i|y)\, P(y)$

Here $P(x_i|y)$ is the class-conditional density and $P(y)$ is the class prior.

Number of parameters: $|Y|\times|V| + |Y|-1$ (computationally feasible) vs. $|Y|\times(2^{|V|}-1)$ for the full joint model without the independence assumption.
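To make the decision rule concrete, here is a minimal sketch (not from the slides) of Naïve Bayes scoring in log space; the function nb_predict and the tables log_prior and log_cond are hypothetical names.

```python
from collections import Counter

def nb_predict(doc_tokens, log_prior, log_cond):
    """Naive Bayes decision rule: argmax_y [ log P(y) + sum_i c(x_i) * log P(x_i | y) ].

    log_prior: dict mapping class y -> log P(y)
    log_cond:  dict mapping (y, word) -> log P(word | y)   (the |Y| x |V| table)
    """
    counts = Counter(doc_tokens)
    scores = {}
    for y in log_prior:
        # words missing from the table are simply skipped in this sketch
        scores[y] = log_prior[y] + sum(
            c * log_cond[(y, w)] for w, c in counts.items() if (y, w) in log_cond
        )
    return max(scores, key=scores.get)
```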

Logistic Regression

Hongning Wang
CS@UVa

Today's lecture

Logistic regression model: a discriminative classification model
Two different perspectives to derive the model
Parameter estimation

Review: Bayes risk minimization

Risk: assigning an instance to a wrong class.

$y^* = \arg\max_y P(y|X)$

[Figure: the curves $p(X|y=0)p(y=0)$ and $p(X|y=1)p(y=1)$ plotted over $X$, the optimal Bayes decision boundary separating the $\hat y=0$ and $\hat y=1$ regions, and the false-negative and false-positive areas. We have learned multiple ways to estimate $p(X,y)$.]

Instance-based solution: k nearest neighbors

Approximate the Bayes decision rule in a subset of the data around the testing point.

Instance-based solution: k nearest neighbors

Approximate the Bayes decision rule in a subset of the data around the testing point.

Let $V$ be the volume of the $m$-dimensional ball around $x$ containing its $k$ nearest neighbors. Then $p(x)V = \frac{k}{N}$, i.e., $p(x) = \frac{k}{NV}$, where $N$ is the total number of instances.

Similarly, $p(x|y=1) = \frac{k_1}{N_1 V}$ and $p(y=1) = \frac{N_1}{N}$, where $N_1$ is the total number of instances in class 1 and $k_1$ counts the nearest neighbors coming from class 1.

With Bayes rule: $p(y=1|x) = \frac{\frac{N_1}{N}\cdot\frac{k_1}{N_1 V}}{\frac{k}{NV}} = \frac{k_1}{k}$
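A minimal sketch of this k-NN posterior estimate, assuming dense NumPy feature vectors and Euclidean distance; knn_posterior is a hypothetical helper, not code from the lecture.

```python
import numpy as np

def knn_posterior(x, X_train, y_train, k=5):
    """Estimate p(y=1|x) as k_1 / k: the fraction of the k nearest
    training points (by Euclidean distance) that belong to class 1."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    k1 = np.sum(y_train[nearest] == 1)
    return k1 / k
```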

Generative solution: Naïve Bayes classifier

$y^* = \arg\max_y P(y|X) = \arg\max_y P(X|y)P(y)$   (by Bayes rule)
$\;\;\;\; = \arg\max_y \prod_{i=1}^{|d|} P(x_i|y)\, P(y)$   (by the independence assumption)

[Figure: graphical model with the class variable $y$ as the parent of the features $x_1, x_2, x_3, \ldots, x_v$.]

Estimating parameters

Maximum likelihood estimator:

$P(x_i|y) = \frac{\sum_d \sum_j \delta(x_{d,j}=x_i,\; y_d=y)}{\sum_d \delta(y_d=y)}$

$P(y) = \frac{\sum_d \delta(y_d=y)}{\sum_d 1}$

[Table: a toy corpus (documents D1, D2, D3) with per-word counts over the vocabulary {text, information, identify, mining, mined, is, useful, to, from, apple, delicious} and their class labels Y.]
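A small illustration of these counting estimators, under the assumption that each document is given as a list of tokens; nb_mle and its variable names are hypothetical, and the word probabilities are normalized by the total token count per class.

```python
from collections import Counter, defaultdict

def nb_mle(docs, labels):
    """MLE for Naive Bayes: P(y) and P(x_i|y) are normalized counts,
    i.e., the delta-function sums on the slide."""
    class_count = Counter(labels)            # sum_d delta(y_d = y)
    word_count = defaultdict(Counter)        # class y -> Counter of word occurrences
    for tokens, y in zip(docs, labels):
        word_count[y].update(tokens)
    p_y = {y: c / len(labels) for y, c in class_count.items()}
    p_x_given_y = {
        y: {w: c / sum(cnt.values()) for w, c in cnt.items()}
        for y, cnt in word_count.items()
    }
    return p_y, p_x_given_y
```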

Discriminative vs. generative models

$y = f(x)$

Generative models: all instances are considered for probability density estimation.
Discriminative models: more attention is put on the points near the decision boundary.

Parametric form of the decision boundary in Naïve Bayes

For the binary case:

$f(X) = \mathrm{sgn}\big(\log P(y=1|X) - \log P(y=0|X)\big)$
$\;\;\;\; = \mathrm{sgn}\Big(\log\frac{P(y=1)}{P(y=0)} + \sum_{i=1}^{|d|} c(x_i,d)\,\log\frac{P(x_i|y=1)}{P(x_i|y=0)}\Big)$
$\;\;\;\; = \mathrm{sgn}(w^T \bar X)$

where

$w = \Big(\log\frac{P(y=1)}{P(y=0)},\; \log\frac{P(x_1|y=1)}{P(x_1|y=0)},\; \ldots,\; \log\frac{P(x_v|y=1)}{P(x_v|y=0)}\Big)$
$\bar X = \big(1,\; c(x_1,d),\; \ldots,\; c(x_v,d)\big)$

Linear regression?

Regression for classification?

Linear regression: $y \leftarrow w^T X$

Models the relationship between a scalar dependent variable $y$ and one or more explanatory variables.

Regression for classification?

Linear regression: $y \leftarrow w^T X$

Models the relationship between a scalar dependent variable $y$ and one or more explanatory variables; but $y$ is discrete in a classification problem!

Thresholding the fit gives a classifier: $\hat y = 1$ if $w^T X > 0.5$, and $\hat y = 0$ if $w^T X \le 0.5$.

[Figure: the optimal regression model fit to binary-valued $y$ (vertical axis from 0.00 to 1.00) with the 0.5 threshold. What if we have an outlier?]

Regression for classification?

Logistic regression: $p(y|x) = \sigma(w^T X) = \frac{1}{1+\exp(-w^T X)}$

Directly models the class posterior via the sigmoid function.

[Figure: the sigmoid-shaped curve of $P(y|x)$ (vertical axis from 0.00 to 1.00). What if we have an outlier?]
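A two-line sketch of the logistic posterior, assuming the bias term is folded into w and into a constant column of X; sigmoid and lr_posterior are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    """The logistic (sigmoid) function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def lr_posterior(w, X):
    """p(y=1|x) = sigma(w^T x) for each row x of the design matrix X."""
    return sigmoid(X @ w)
```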

Logistic regression for classification

Why the sigmoid function?

$P(y=1|X) = \frac{P(X|y=1)P(y=1)}{P(X|y=1)P(y=1) + P(X|y=0)P(y=0)} = \frac{1}{1 + \frac{P(X|y=0)P(y=0)}{P(X|y=1)P(y=1)}}$

Assume a binomial prior $P(y=1) = \alpha$ and class-conditional Gaussians with identical variance: $P(X|y=1) = N(\mu_1, \delta^2)$, $P(X|y=0) = N(\mu_0, \delta^2)$.

[Figure: the two Gaussian class-conditional densities over $x$ and the resulting sigmoid-shaped $P(y|x)$.]

Logistic regression for classification

Why the sigmoid function?

$P(y=1|X) = \frac{P(X|y=1)P(y=1)}{P(X|y=1)P(y=1) + P(X|y=0)P(y=0)} = \frac{1}{1 + \frac{P(X|y=0)P(y=0)}{P(X|y=1)P(y=1)}} = \frac{1}{1 + \exp\!\big(-\ln\frac{P(X|y=1)P(y=1)}{P(X|y=0)P(y=0)}\big)}$

Logistic regression for classification

Why the sigmoid function? With $P(x|y) = \frac{1}{\delta\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\delta^2}}$:

$\ln\frac{P(X|y=1)P(y=1)}{P(X|y=0)P(y=0)} = \ln\frac{P(y=1)}{P(y=0)} + \sum_{i=1}^{V} \ln\frac{P(x_i|y=1)}{P(x_i|y=0)}$
$= \ln\frac{\alpha}{1-\alpha} + \sum_{i=1}^{V}\Big(\frac{\mu_{1i}-\mu_{0i}}{\delta_i^2}\, x_i - \frac{\mu_{1i}^2-\mu_{0i}^2}{2\delta_i^2}\Big)$
$= w_0 + \sum_{i=1}^{V} \frac{\mu_{1i}-\mu_{0i}}{\delta_i^2}\, x_i = w_0 + w^T X = \bar w^T \bar X$

Origin of the name: the logit function.

Logistic regression for classification

Why the sigmoid function?

$P(y=1|X) = \frac{P(X|y=1)P(y=1)}{P(X|y=1)P(y=1) + P(X|y=0)P(y=0)} = \frac{1}{1 + \exp\!\big(-\ln\frac{P(X|y=1)P(y=1)}{P(X|y=0)P(y=0)}\big)} = \frac{1}{1+\exp(-w^T X)}$

A Generalized Linear Model. Note: it is still a linear relation among the features!

Logistic regression for classification

For multi-class categorization:

$P(y=k|X) = \frac{\exp(w_k^T X)}{\sum_{j=1}^{K}\exp(w_j^T X)}$, i.e., $P(y=k|X) \propto \exp(w_k^T X)$

Warning: there is redundancy in the model parameters, since $\sum_{j=1}^{K} P(y=j|X) = 1$.

When $K=2$: $P(y=1|X) = \frac{\exp(w_1^T X)}{\exp(w_1^T X) + \exp(w_0^T X)} = \frac{1}{1+\exp(-(w_1-w_0)^T X)}$, which recovers the binary model with $w = w_1 - w_0$.

[Figure: graphical model with $y$ connected to the features $x_1, x_2, x_3, \ldots, x_v$.]
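A minimal sketch of the multi-class (softmax) posterior; softmax_posterior is a hypothetical helper, and the max-subtraction is only for numerical stability, not part of the model.

```python
import numpy as np

def softmax_posterior(W, x):
    """Multi-class logistic regression:
    P(y=k|x) = exp(w_k^T x) / sum_j exp(w_j^T x), with one weight row per class in W."""
    scores = W @ x
    scores = scores - scores.max()   # shift for numerical stability; posterior unchanged
    exps = np.exp(scores)
    return exps / exps.sum()
```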

Logistic regression for classification

Decision boundary for the binary case:

$\hat y = 1$ if $p(y=1|X) > 0.5$, and $\hat y = 0$ otherwise; equivalently, $\hat y = 1$ if $w^T X > 0$, and $\hat y = 0$ otherwise.

This is because $p(y=1|X) = \frac{1}{1+\exp(-w^T X)} > 0.5$ iff $\exp(-w^T X) < 1$ iff $w^T X > 0$.

A linear model!

Logistic regression for classification

Decision boundary in general:

$\hat y = \arg\max_y p(y|X) = \arg\max_y \exp(w_y^T X) = \arg\max_y w_y^T X$

A linear model!

Logistic regression for classification: summary

$P(y=1|X) = \frac{P(X|y=1)P(y=1)}{P(X|y=1)P(y=1) + P(X|y=0)P(y=0)} = \frac{1}{1 + \frac{P(X|y=0)P(y=0)}{P(X|y=1)P(y=1)}}$

with a binomial prior $P(y=1) = \alpha$ and class-conditional Gaussians with identical variance: $P(X|y=1) = N(\mu_1, \delta^2)$, $P(X|y=0) = N(\mu_0, \delta^2)$.

[Figure: the two Gaussian class-conditional densities over $x$ and the resulting sigmoid-shaped $P(y|x)$.]

Recap: logistic regression for classification

Decision boundary for the binary case:

$\hat y = 1$ if $p(y=1|X) > 0.5$, and $\hat y = 0$ otherwise; equivalently, $\hat y = 1$ if $w^T X > 0$, and $\hat y = 0$ otherwise.

This is because $p(y=1|X) = \frac{1}{1+\exp(-w^T X)} > 0.5$ iff $\exp(-w^T X) < 1$ iff $w^T X > 0$.

A linear model!

Recap: logistic regression for classification

Why the sigmoid function? With $P(x|y) = \frac{1}{\delta\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\delta^2}}$:

$\ln\frac{P(X|y=1)P(y=1)}{P(X|y=0)P(y=0)} = \ln\frac{P(y=1)}{P(y=0)} + \sum_{i=1}^{V} \ln\frac{P(x_i|y=1)}{P(x_i|y=0)}$
$= \ln\frac{\alpha}{1-\alpha} + \sum_{i=1}^{V}\Big(\frac{\mu_{1i}-\mu_{0i}}{\delta_i^2}\, x_i - \frac{\mu_{1i}^2-\mu_{0i}^2}{2\delta_i^2}\Big)$
$= w_0 + \sum_{i=1}^{V} \frac{\mu_{1i}-\mu_{0i}}{\delta_i^2}\, x_i = w_0 + w^T X = \bar w^T \bar X$

Origin of the name: the logit function.

Recap: parametric form of the decision boundary in Naïve Bayes

For the binary case:

$f(X) = \mathrm{sgn}\big(\log P(y=1|X) - \log P(y=0|X)\big)$
$\;\;\;\; = \mathrm{sgn}\Big(\log\frac{P(y=1)}{P(y=0)} + \sum_{i=1}^{|d|} c(x_i,d)\,\log\frac{P(x_i|y=1)}{P(x_i|y=0)}\Big)$
$\;\;\;\; = \mathrm{sgn}(w^T \bar X)$

where

$w = \Big(\log\frac{P(y=1)}{P(y=0)},\; \log\frac{P(x_1|y=1)}{P(x_1|y=0)},\; \ldots,\; \log\frac{P(x_v|y=1)}{P(x_v|y=0)}\Big)$
$\bar X = \big(1,\; c(x_1,d),\; \ldots,\; c(x_v,d)\big)$

Linear regression?

A different perspective

Imagine we have the following observation:

Document: "happy", "good", "purchase", "item", "indeed"
Sentiment: positive

That is, p(x="happy", y=1) + p(x="good", y=1) + p(x="purchase", y=1) + p(x="item", y=1) + p(x="indeed", y=1) = 1

Question: find a distribution p(x, y) that satisfies this observation.

Answer 1: p(x="item", y=1) = 1, and all the others 0.
Answer 2: p(x="indeed", y=1) = 0.5, p(x="good", y=1) = 0.5, and all the others 0.

We have too little information to favor either one of them.

Occam's razor

A problem-solving principle: "among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected." (William of Ockham, 1287–1347)

Principle of Insufficient Reason: "when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely." (Pierre-Simon Laplace, 1749–1827)

A different perspective

Imagine we have the following observation:

Document: "happy", "good", "purchase", "item", "indeed"
Sentiment: positive

p(x="happy", y=1) + p(x="good", y=1) + p(x="purchase", y=1) + p(x="item", y=1) + p(x="indeed", y=1) = 1

Question: find a distribution p(x, y) that satisfies this observation.

As a result, a safer choice equally favors every possibility: p(x="·", y=1) = 0.2.

A different perspective

Imagine we have the following observations:

Observation 1: "happy", "good", "purchase", "item", "indeed" are positive.
Observation 2: 30% of the time, "good" and "item" are positive.

p(x="happy", y=1) + p(x="good", y=1) + p(x="purchase", y=1) + p(x="item", y=1) + p(x="indeed", y=1) = 1
p(x="good", y=1) + p(x="item", y=1) = 0.3

Question: find a distribution p(x, y) that satisfies these observations.

Again, a safer choice equally favors every possibility: p(x="good", y=1) = p(x="item", y=1) = 0.15, and all the others 7/30 each.

A different perspective

Imagine we have the following observations:

Observation 1: "happy", "good", "purchase", "item", "indeed" are positive.
Observation 2: 30% of the time, "good" and "item" are positive.
Observation 3: 50% of the time, "good" and "happy" are positive.

p(x="happy", y=1) + p(x="good", y=1) + p(x="purchase", y=1) + p(x="item", y=1) + p(x="indeed", y=1) = 1
p(x="good", y=1) + p(x="item", y=1) = 0.3
p(x="good", y=1) + p(x="happy", y=1) = 0.5

Question: find a distribution p(x, y) that satisfies these observations.

Time to think about: 1) what do we mean by equally/uniformly favoring the models? 2) given all these constraints, how can we find the most preferred model?

Maximum entropy modeling

Entropy is a measure of the uncertainty of random events:

$H(X) = E[I(X)] = -\sum_{x\in X} P(x)\log P(x)$

It is maximized when $P(X)$ is a uniform distribution.

Question 1 is answered; then how about question 2?
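A quick numeric check of this claim on the five-word example above (entropy is a hypothetical helper): the uniform distribution indeed has the largest entropy among the candidate answers.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x); terms with p(x) = 0 are treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([1.0, 0, 0, 0, 0]))    # 0.0   -- all mass on "item"
print(entropy([0.5, 0.5, 0, 0, 0]))  # ~0.69 -- mass split over two words
print(entropy([0.2] * 5))            # ~1.61 -- uniform, the maximum
```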

Represent the constraints

Indicator function, e.g., to express the observation that the word "good" occurs in a positive document:

$f(x,y) = 1$ if $y=1$ and $x=$ "good", and $0$ otherwise

Usually referred to as a feature function.

Represent the constraints

Empirical expectation of a feature function over a corpus:

$E_{\tilde p}[f] = \sum_{x,y} \tilde p(x,y)\, f(x,y)$, where $\tilde p(x,y) = \frac{c(f(x,y))}{N}$, i.e., the frequency of observing $f(x,y)$ in a given collection.

Expectation of a feature function under a given statistical model:

$E_p[f] = \sum_{x,y} \tilde p(x)\, p(y|x)\, f(x,y)$, where $\tilde p(x)$ is the empirical distribution of $x$ in the same collection and $p(y|x)$ is the model's estimate of the conditional distribution.
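A small sketch of the two expectations, assuming the corpus is given as a list of (x, y) pairs and the model as a callable conditional p(y|x); both function names are hypothetical.

```python
def empirical_expectation(pairs, f):
    """E_{p~}[f] = sum_{x,y} p~(x,y) f(x,y), with p~ the empirical frequency
    over the observed (x, y) pairs."""
    return sum(f(x, y) for x, y in pairs) / len(pairs)

def model_expectation(xs, p_y_given_x, labels, f):
    """E_p[f] = sum_{x,y} p~(x) p(y|x) f(x,y); p~(x) is the empirical
    distribution over the observed xs and p_y_given_x(x, y) is the model."""
    return sum(
        p_y_given_x(x, y) * f(x, y) for x in xs for y in labels
    ) / len(xs)
```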

Represent the constraints

When a feature is important, we require our preferred statistical model to accord with it:

$C := \{\, p\in P \mid E_p[f_i] = E_{\tilde p}[f_i],\; \forall i\in\{1,2,\ldots,n\} \,\}$

where $E_p[f_i] = E_{\tilde p}[f_i]$ means $\sum_{x,y} \tilde p(x)\, p(y|x)\, f_i(x,y) = \sum_{x,y} \tilde p(x,y)\, f_i(x,y)$.

We only need to specify $p(y|x)$ in our preferred model! Is question 2 answered?

Represent the constraints

Let's visualize this.

[Figure: four panels illustrating the set of feasible models: (a) no constraint, (b) under-constrained, (c) a feasible constraint, (d) over-constrained.]

How do we deal with these situations?

Maximum entropy principle

To select a model from a set $C$ of allowed probability distributions, choose the model $p^*\in C$ with maximum entropy $H(p)$:

$p^* = \arg\max_{p\in C} H(p)$, where the model is the conditional distribution $p(y|x)$.

Both questions are answered!

Maximum entropy principle

Let's solve this constrained optimization problem with Lagrange multipliers (a strategy for finding the local maxima and minima of a function subject to equality constraints).

Primal: $p^* = \arg\max_{p\in C} H(p)$

Lagrangian: $L(p,\lambda) = H(p) + \sum_i \lambda_i \big(E_p[f_i] - E_{\tilde p}[f_i]\big)$

Maximum entropy principle

Let's solve this constrained optimization problem with Lagrange multipliers.

Lagrangian: $L(p,\lambda) = H(p) + \sum_i \lambda_i \big(E_p[f_i] - E_{\tilde p}[f_i]\big)$

Dual: $\Psi(\lambda) = -\sum_x \tilde p(x)\log Z_\lambda(x) + \sum_i \lambda_i E_{\tilde p}[f_i]$

with $p_\lambda(y|x) = \frac{1}{Z_\lambda(x)}\exp\Big(\sum_i \lambda_i f_i(x,y)\Big)$

Maximum entropy principle

Let's solve this constrained optimization problem with Lagrange multipliers.

Dual: $\Psi(\lambda) = -\sum_x \tilde p(x)\log Z_\lambda(x) + \sum_i \lambda_i E_{\tilde p}[f_i]$, where $Z_\lambda(x) = \sum_y \exp\Big(\sum_i \lambda_i f_i(x,y)\Big)$

Maximum entropy principle

Let's take a closer look at the dual function:

$\Psi(\lambda) = -\sum_x \tilde p(x)\log Z_\lambda(x) + \sum_i \lambda_i E_{\tilde p}[f_i]$, where $Z_\lambda(x) = \sum_y \exp\Big(\sum_i \lambda_i f_i(x,y)\Big)$

Maximum entropy principle

Let's take a closer look at the dual function:

$\Psi(\lambda) = -\sum_x \tilde p(x)\log Z_\lambda(x) + \sum_{x,y} \tilde p(x,y)\sum_i \lambda_i f_i(x,y) = \sum_{x,y} \tilde p(x,y)\log\frac{\exp\big(\sum_i \lambda_i f_i(x,y)\big)}{Z_\lambda(x)} = \sum_{x,y} \tilde p(x,y)\log p_\lambda(y|x)$

This is the (empirical) log-likelihood, so maximizing the dual is exactly the maximum likelihood estimator!

Maximum entropy principle

Primal: maximum entropy, $p^* = \arg\max_{p\in C} H(p)$

Dual: logistic regression, $p_\lambda(y|x) = \frac{1}{Z_\lambda(x)}\exp\Big(\sum_i \lambda_i f_i(x,y)\Big)$, where $Z_\lambda(x) = \sum_y \exp\Big(\sum_i \lambda_i f_i(x,y)\Big)$ and $\lambda^*$ is determined by maximizing $\Psi(\lambda)$.
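A minimal sketch of the dual (log-linear) form p_lambda(y|x), assuming the feature functions are passed in as Python callables; maxent_conditional is a hypothetical name.

```python
import math

def maxent_conditional(x, labels, feats, lam):
    """p_lambda(y|x) = exp(sum_i lam_i f_i(x,y)) / Z_lambda(x).

    feats: list of feature functions f_i(x, y); lam: matching list of weights."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, feats)))
              for y in labels}
    Z = sum(scores.values())                 # Z_lambda(x) normalizes over y
    return {y: s / Z for y, s in scores.items()}
```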

Questions that haven't been answered

Class-conditional density: why should it be Gaussian with equal variance?
Model parameters: what is the relationship between $w$ and $\lambda$? How do we estimate them?

Recap: Occam's razor

A problem-solving principle: "among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected." (William of Ockham, 1287–1347)

Principle of Insufficient Reason: "when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely." (Pierre-Simon Laplace, 1749–1827)

Recap: a different perspective

Imagine we have the following observations:

Observation 1: "happy", "good", "purchase", "item", "indeed" are positive.
Observation 2: 30% of the time, "good" and "item" are positive.
Observation 3: 50% of the time, "good" and "happy" are positive.

p(x="happy", y=1) + p(x="good", y=1) + p(x="purchase", y=1) + p(x="item", y=1) + p(x="indeed", y=1) = 1
p(x="good", y=1) + p(x="item", y=1) = 0.3
p(x="good", y=1) + p(x="happy", y=1) = 0.5

Question: find a distribution p(x, y) that satisfies these observations.

Time to think about: 1) what do we mean by equally/uniformly favoring the models? 2) given all these constraints, how can we find the most preferred model?

Recap: maximum entropy modeling

Entropy is a measure of the uncertainty of random events:

$H(X) = E[I(X)] = -\sum_{x\in X} P(x)\log P(x)$

It is maximized when $P(X)$ is a uniform distribution.

Question 1 is answered; then how about question 2?

Recap: represent the constraints

Empirical expectation of a feature function over a corpus:

$E_{\tilde p}[f] = \sum_{x,y} \tilde p(x,y)\, f(x,y)$, where $\tilde p(x,y) = \frac{c(f(x,y))}{N}$, i.e., the frequency of observing $f(x,y)$ in a given collection.

Expectation of a feature function under a given statistical model:

$E_p[f] = \sum_{x,y} \tilde p(x)\, p(y|x)\, f(x,y)$, where $\tilde p(x)$ is the empirical distribution of $x$ in the same collection and $p(y|x)$ is the model's estimate of the conditional distribution.

Recap: maximum entropy principle

Let's solve this constrained optimization problem with Lagrange multipliers (a strategy for finding the local maxima and minima of a function subject to equality constraints).

Primal: $p^* = \arg\max_{p\in C} H(p)$

Lagrangian: $L(p,\lambda) = H(p) + \sum_i \lambda_i \big(E_p[f_i] - E_{\tilde p}[f_i]\big)$

Recap: maximum entropy principle

Let's take a closer look at the dual function:

$\Psi(\lambda) = -\sum_x \tilde p(x)\log Z_\lambda(x) + \sum_{x,y} \tilde p(x,y)\sum_i \lambda_i f_i(x,y) = \sum_{x,y} \tilde p(x,y)\log\frac{\exp\big(\sum_i \lambda_i f_i(x,y)\big)}{Z_\lambda(x)} = \sum_{x,y} \tilde p(x,y)\log p_\lambda(y|x)$

This is the (empirical) log-likelihood, so maximizing the dual is exactly the maximum likelihood estimator!

Maximum entropy principle

The maximum entropy model subject to the constraints $C$ has a parametric solution $p_{\lambda^*}(y|x)$, where the parameters $\lambda^*$ can be determined by maximizing the likelihood of $p_\lambda(y|x)$ over a training set.

Among distributions with a given variance, the Gaussian maximizes differential entropy. So: features follow a Gaussian distribution -> maximum entropy model -> logistic regression.

Parameter estimation

Maximum likelihood estimation:

$L(w) = \sum_{d\in D}\Big[ y_d \log p(y_d{=}1|X_d) + (1-y_d)\log p(y_d{=}0|X_d) \Big]$

Take the gradient of $L(w)$ with respect to $w$:

$\frac{\partial L(w)}{\partial w} = \sum_{d\in D}\Big[ y_d \frac{\partial \log p(y_d{=}1|X_d)}{\partial w} + (1-y_d)\frac{\partial \log p(y_d{=}0|X_d)}{\partial w} \Big]$

Parameter estimation

Maximum likelihood estimation:

$\frac{\partial \log p(y_d{=}1|X_d)}{\partial w} = -\frac{\partial \log\big(1+\exp(-w^T X_d)\big)}{\partial w} = \frac{\exp(-w^T X_d)}{1+\exp(-w^T X_d)}\, X_d = \big(1 - p(y_d{=}1|X_d)\big) X_d$

$\frac{\partial \log p(y_d{=}0|X_d)}{\partial w} = \big(0 - p(y_d{=}1|X_d)\big) X_d$

Parameter estimation

Maximum likelihood estimation:

$\frac{\partial L(w)}{\partial w} = \sum_{d\in D}\Big[ y_d\big(1 - p(y_d{=}1|X_d)\big) X_d + (1-y_d)\big(0 - p(y_d{=}1|X_d)\big) X_d \Big] = \sum_{d\in D}\big(y_d - p(y{=}1|X_d)\big) X_d$

Good news: a neat form, and $L(w)$ is a concave function of $w$.
Bad news: there is no closed-form solution.

This can be easily generalized to the multi-class case.
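The gradient above in a few lines of NumPy, assuming a design matrix with a leading column of ones so the bias is part of w; log_likelihood_grad is a hypothetical name.

```python
import numpy as np

def log_likelihood_grad(w, X, y):
    """Gradient of the log-likelihood: sum_d (y_d - p(y=1|X_d)) X_d.

    X: (n, V+1) design matrix with a leading column of ones; y: labels in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # p(y=1 | X_d) for every instance
    return X.T @ (y - p)
```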

Gradient-based optimization

Since $L(w)$ is a log-likelihood to be maximized, we follow the gradient uphill (equivalently, run gradient descent on $-L(w)$):

$\nabla L(w) = \Big[\frac{\partial L(w)}{\partial w_0}, \frac{\partial L(w)}{\partial w_1}, \ldots, \frac{\partial L(w)}{\partial w_V}\Big]$

Iterative updating: $w^{(t+1)} = w^{(t)} + \eta^{(t)}\, \nabla L(w^{(t)})$

The step size $\eta^{(t)}$ affects convergence.

Parameter estimation

Stochastic gradient descent/ascent:

while not converged:
  randomly choose $d \in D$
  compute $\nabla L_d(w) = \Big[\frac{\partial L_d(w)}{\partial w_0}, \frac{\partial L_d(w)}{\partial w_1}, \ldots, \frac{\partial L_d(w)}{\partial w_V}\Big]$
  $w^{(t+1)} = w^{(t)} + \eta^{(t)}\, \nabla L_d(w^{(t)})$
  $\eta^{(t+1)} = a\,\eta^{(t)}$   (gradually shrink the step size)
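A sketch of the stochastic update loop, assuming binary labels and a fixed number of passes over the data; lr_sgd, decay, and seed are hypothetical names, and here the step size is shrunk once per pass rather than per update.

```python
import numpy as np

def lr_sgd(X, y, eta=0.1, decay=0.99, epochs=50, seed=0):
    """Stochastic gradient ascent on the log-likelihood L(w): one training
    instance per update, visited in random order each pass."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for d in rng.permutation(len(y)):
            p = 1.0 / (1.0 + np.exp(-X[d] @ w))   # p(y_d = 1 | X_d)
            w += eta * (y[d] - p) * X[d]          # per-instance gradient (y_d - p) X_d
        eta *= decay                               # eta^(t+1) = a * eta^(t)
    return w
```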

Parameter estimation

Batch gradient descent/ascent:

while not converged:
  compute the gradient w.r.t. all training instances, $\nabla L_D(w) = \Big[\frac{\partial L_D(w)}{\partial w_0}, \frac{\partial L_D(w)}{\partial w_1}, \ldots, \frac{\partial L_D(w)}{\partial w_V}\Big]$
  compute the step size $\eta^{(t)}$   (line search is required to ensure sufficient improvement)
  $w^{(t+1)} = w^{(t)} + \eta^{(t)}\, \nabla L_D(w^{(t)})$

This is a first-order method; second-order methods, e.g., quasi-Newton methods and conjugate gradient, provide faster convergence.

Model regularization

Avoid over-fitting: we may not have enough samples to estimate the model parameters of logistic regression well.

Regularization imposes additional constraints on the model parameters, e.g., a sparsity constraint enforces the model to have more zero-valued parameters.

Model regularization

L2-regularized logistic regression: assume the model parameter $w$ is drawn from a Gaussian prior, $w \sim N(0, \sigma^2)$, so that $p(y_d, w \mid X_d) \propto p(y_d \mid X_d, w)\, p(w)$ and

$L(w) = \sum_{d\in D}\Big[ y_d \log p(y_d{=}1|X_d) + (1-y_d)\log p(y_d{=}0|X_d) \Big] - \frac{w^T w}{2\sigma^2}$

The penalty term is the (squared) L2-norm of $w$.
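The corresponding regularized gradient as a sketch, assuming the same design-matrix convention as above; l2_regularized_grad and sigma2 are hypothetical names.

```python
import numpy as np

def l2_regularized_grad(w, X, y, sigma2=1.0):
    """Gradient of the L2-regularized log-likelihood:
    sum_d (y_d - p(y=1|X_d)) X_d  -  w / sigma^2,
    where the Gaussian prior w ~ N(0, sigma^2 I) contributes the -w^T w / (2 sigma^2) term."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (y - p) - w / sigma2
```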

Generative vs. discriminative models

Generative models specify the joint distribution:
- A full probabilistic specification of all the random variables
- Dependence assumptions have to be specified for $p(X|y)$ and $p(y)$
- Flexible; can be used in unsupervised learning

Discriminative models specify the conditional distribution:
- Only explain the target variable
- Arbitrary features can be incorporated for modeling $p(y|X)$
- Need labeled data; only suitable for (semi-)supervised learning

Naïve Bayes vs. logistic regression

Naïve Bayes:
- Conditional independence assumption: $p(X|y) = \prod_i p(x_i|y)$
- Distributional assumption on $p(x_i|y)$
- Number of parameters: $k(V+1)$
- Model estimation: closed-form MLE
- Asymptotic convergence rate: $\epsilon_{NB,n} \le \epsilon_{NB,\infty} + O\big(\sqrt{\log V / n}\big)$

Logistic regression:
- No independence assumption
- Functional form assumption: $p(y|X) \propto \exp(w_y^T X)$
- Number of parameters: $(k-1)(V+1)$
- Model estimation: gradient-based MLE
- Asymptotic convergence rate: $\epsilon_{LR,n} \le \epsilon_{LR,\infty} + O\big(\sqrt{V/n}\big)$, i.e., it needs more training data

Naïve Bayes vs. logistic regression

[Figure: comparison of logistic regression (LR) and Naïve Bayes (NB) on UCI data sets, from "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes", Ng and Jordan, NIPS 2002.]

What you should know

Two different derivations of logistic regression:
- The functional form from Naïve Bayes assumptions: $p(X|y)$ follows an equal-variance Gaussian; the sigmoid function
- The maximum entropy principle: primal/dual optimization
Generalization to the multi-class case
Parameter estimation: gradient-based optimization; regularization
Comparison with Naïve Bayes

Today's reading

Speech and Language Processing, Chapter 6: Hidden Markov and Maximum Entropy Models
- 6.6 Maximum entropy models: background
- 6.7 Maximum entropy modeling