Presentation on theme: "Recap: Naïve Bayes classifier"— Presentation transcript:

1 Recap: Naïve Bayes classifier
$f(X) = \arg\max_y P(y|X) = \arg\max_y P(X|y)\,P(y) = \arg\max_y \prod_{i=1}^{V} P(x_i|y)\,P(y)$, where $P(x_i|y)$ is the class-conditional density and $P(y)$ is the class prior. Number of parameters: $|Y|\times V + |Y| - 1$, versus $|Y|\times(2^V - 1)$ without the conditional independence assumption, which makes the model computationally feasible.

2 Logistic Regression Hongning Wang

3 Today's lecture
Logistic regression model
- A discriminative classification model
- Two different perspectives to derive the model
- Parameter estimation

4 Review: Bayes risk minimization
Risk: assigning an instance to the wrong class. The optimal decision is $y^* = \arg\max_y P(y|X)$, and we have learned multiple ways to estimate this. [Figure: optimal Bayes decision boundary for $p(X,y)$, showing $p(X|y=0)p(y=0)$ and $p(X|y=1)p(y=1)$ over $X$, with the false-negative and false-positive regions on either side of the $\hat{y}=0$ / $\hat{y}=1$ boundary.]

5 Instance-based solution
k nearest neighbors: approximate the Bayes decision rule in a subset of data around the testing point.

6 Instance-based solution
k nearest neighbors: approximate the Bayes decision rule in a subset of data around the testing point. Let $V$ be the volume of the $m$-dimensional ball around $x$ containing the $k$ nearest neighbors of $x$. Then $p(x)V = \frac{k}{N}$, i.e., $p(x) = \frac{k}{NV}$, and likewise $p(x|y=1) = \frac{k_1}{N_1 V}$ and $p(y=1) = \frac{N_1}{N}$, where $N$ is the total number of instances and $N_1$ is the total number of instances in class 1. With Bayes rule: $p(y=1|x) = \frac{N_1}{N}\cdot\frac{k_1}{N_1 V} \big/ \frac{k}{NV} = \frac{k_1}{k}$, i.e., counting the nearest neighbors from class 1.
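A minimal sketch of this counting estimate (the toy data, Euclidean distance, and variable names are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def knn_posterior(x_query, X_train, y_train, k=3):
    """Estimate p(y=1|x) as k1/k: the fraction of the k nearest
    neighbors of x_query that belong to class 1."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    k1 = np.sum(y_train[nearest] == 1)                 # neighbors from class 1
    return k1 / k

# Toy usage: two 2-d clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_posterior(np.array([0.95, 1.0]), X, y, k=3))  # -> 1.0
```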

7 Generative solution
Naïve Bayes classifier: $y^* = \arg\max_y P(y|X) = \arg\max_y P(X|y)P(y)$ (by Bayes rule) $= \arg\max_y \prod_{i=1}^{|d|} P(x_i|y)\,P(y)$ (by the independence assumption). [Graphical model: $y$ generates $x_1, x_2, x_3, \ldots, x_v$.]

8 Estimating parameters
Maximum likelihood estimator: $P(x_i|y) = \frac{\sum_d \sum_j \delta(x_{d,j} = x_i,\ y_d = y)}{\sum_d \delta(y_d = y)}$ and $P(y) = \frac{\sum_d \delta(y_d = y)}{\sum_d 1}$. [Table: term counts of "text", "information", "identify", "mining", "mined", "is", "useful", "to", "from", "apple", "delicious" in documents D1, D2, D3 together with their labels $Y$.]
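A hedged sketch of counting-based maximum likelihood estimation for the class-conditional and prior probabilities, using a multinomial instantiation (token counts normalized within each class); the toy count table, vocabulary, and labels are made up for illustration, and smoothing is omitted:

```python
import numpy as np

# Rows: documents D1..D3; columns: term counts over a toy vocabulary
counts = np.array([[2, 1, 0, 3],   # D1
                   [0, 2, 1, 1],   # D2
                   [1, 0, 4, 0]])  # D3
labels = np.array([1, 1, 0])       # document class labels y_d
vocab = ["text", "mining", "apple", "useful"]

def nb_mle(counts, labels, y):
    """P(x_i|y): token counts in class-y documents, normalized over the class;
    P(y): fraction of documents with label y."""
    rows = counts[labels == y]
    p_x_given_y = rows.sum(axis=0) / rows.sum()   # class-conditional (multinomial)
    p_y = (labels == y).mean()                    # class prior
    return p_x_given_y, p_y

p_x_given_1, p_1 = nb_mle(counts, labels, y=1)
print(dict(zip(vocab, np.round(p_x_given_1, 3))), p_1)
```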

9 Discriminative vs. generative models
$y = f(x)$. In a generative model, all instances are considered for probability density estimation; in a discriminative model, more attention is put on the boundary points.

10 Parametric form of decision boundary in Naïve Bayes
For binary cases: $f(X) = \mathrm{sgn}\big(\log P(y=1|X) - \log P(y=0|X)\big) = \mathrm{sgn}\Big(\log\frac{P(y=1)}{P(y=0)} + \sum_{i} c(x_i,d)\log\frac{P(x_i|y=1)}{P(x_i|y=0)}\Big) = \mathrm{sgn}(w^T X)$, where $w = \Big(\log\frac{P(y=1)}{P(y=0)},\ \log\frac{P(x_1|y=1)}{P(x_1|y=0)},\ \ldots,\ \log\frac{P(x_v|y=1)}{P(x_v|y=0)}\Big)$ and $X = (1,\ c(x_1,d),\ \ldots,\ c(x_v,d))$. Linear regression?

11 Regression for classification?
Linear regression: $y \leftarrow w^T X$, modeling the relationship between a scalar dependent variable $y$ and one or more explanatory variables.

12 Regression for classification?
Linear regression: $y \leftarrow w^T X$ models the relationship between a scalar dependent variable $y$ and one or more explanatory variables, but $y$ is discrete in a classification problem! We could threshold the fit: $\hat{y} = 1$ if $w^T X > 0.5$ and $\hat{y} = 0$ if $w^T X \le 0.5$. [Figure: the optimal regression model fit to binary labels; what if we have an outlier?]

13 Regression for classification?
Logistic regression: $p(y|x) = \sigma(w^T X) = \frac{1}{1+\exp(-w^T X)}$, directly modeling the class posterior with the sigmoid function. [Figure: the sigmoid-shaped $P(y|x)$ curve; what if we have an outlier?]
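A minimal numpy sketch of this model; it assumes the convention used later in the lecture that the feature vector carries a leading constant 1 so the bias is folded into $w$ (the toy numbers are illustrative):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    """p(y=1|x) = sigmoid(w^T x) for each row of X."""
    return sigmoid(X @ w)

# Toy usage: first column of X is the constant bias feature
X = np.array([[1.0, 2.0],
              [1.0, -1.0]])
w = np.array([0.5, 1.0])
print(predict_proba(w, X))  # approx [0.924, 0.378]
```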

14 Logistic regression for classification
Why the sigmoid function? $P(y=1|X) = \frac{P(X|y=1)P(y=1)}{P(X|y=1)P(y=1) + P(X|y=0)P(y=0)} = \frac{1}{1 + \frac{P(X|y=0)P(y=0)}{P(X|y=1)P(y=1)}}$. Assume $P(y=1) = \alpha$ (Binomial) and $P(X|y=1) = N(\mu_1, \sigma^2)$, $P(X|y=0) = N(\mu_0, \sigma^2)$ (Normal with identical variance). [Figure: the resulting sigmoid-shaped $P(y|x)$ curve.]

15 Logistic regression for classification
Why the sigmoid function? $P(y=1|X) = \frac{P(X|y=1)P(y=1)}{P(X|y=1)P(y=1) + P(X|y=0)P(y=0)} = \frac{1}{1 + \frac{P(X|y=0)P(y=0)}{P(X|y=1)P(y=1)}} = \frac{1}{1 + \exp\big(-\ln\frac{P(X|y=1)P(y=1)}{P(X|y=0)P(y=0)}\big)}$

16 Logistic regression for classification
Why the sigmoid function? With $P(x|y) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$:
$\ln\frac{P(X|y=1)P(y=1)}{P(X|y=0)P(y=0)} = \ln\frac{P(y=1)}{P(y=0)} + \sum_{i=1}^{V}\ln\frac{P(x_i|y=1)}{P(x_i|y=0)} = \ln\frac{\alpha}{1-\alpha} + \sum_{i=1}^{V}\Big(\frac{\mu_{1i}-\mu_{0i}}{\sigma_i^2}x_i - \frac{\mu_{1i}^2-\mu_{0i}^2}{2\sigma_i^2}\Big) = w_0 + \sum_{i=1}^{V}\frac{\mu_{1i}-\mu_{0i}}{\sigma_i^2}x_i = w_0 + w^T X$, i.e., a linear function of the (augmented) feature vector. This is the origin of the name: the logit function.

17 Logistic regression for classification
Why the sigmoid function? $P(y=1|X) = \frac{P(X|y=1)P(y=1)}{P(X|y=1)P(y=1) + P(X|y=0)P(y=0)} = \frac{1}{1 + \frac{P(X|y=0)P(y=0)}{P(X|y=1)P(y=1)}} = \frac{1}{1 + \exp\big(-\ln\frac{P(X|y=1)P(y=1)}{P(X|y=0)P(y=0)}\big)} = \frac{1}{1+\exp(-w^T X)}$, a Generalized Linear Model. Note: it is still a linear relation among the features!

18 Logistic regression for classification
For multi-class categorization: $P(y=k|X) = \frac{\exp(w_k^T X)}{\sum_{j=1}^{K}\exp(w_j^T X)}$, i.e., $P(y=k|X) \propto \exp(w_k^T X)$. Warning: there is redundancy in the model parameters, since $\sum_{j=1}^{K} P(y=j|X) = 1$. When $K=2$: $P(y=1|X) = \frac{\exp(w_1^T X)}{\exp(w_1^T X) + \exp(w_0^T X)} = \frac{1}{1+\exp(-(w_1 - w_0)^T X)}$, which recovers the binary case with $w = w_1 - w_0$. [Graphical model: $y$ depends on $x_1, x_2, x_3, \ldots, x_v$.]
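A hedged sketch of this multi-class (softmax) form with one weight vector per class; subtracting the maximum score is only for numerical stability and leaves the probabilities unchanged (the weights and features below are toy values):

```python
import numpy as np

def softmax_proba(W, x):
    """P(y=k|x) = exp(w_k^T x) / sum_j exp(w_j^T x).
    W holds one row of weights per class; x is a single feature vector."""
    scores = W @ x
    scores = scores - scores.max()   # numerical stability only
    expd = np.exp(scores)
    return expd / expd.sum()

# Toy usage: 3 classes, a bias feature plus 2 features
W = np.array([[ 0.1,  1.0, -0.5],
              [ 0.0, -0.2,  0.3],
              [-0.1,  0.5,  0.5]])
x = np.array([1.0, 2.0, 1.0])
print(softmax_proba(W, x), softmax_proba(W, x).sum())  # probabilities summing to 1
```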

19 Logistic regression for classification
Decision boundary for the binary case: $\hat{y} = 1$ if $p(y=1|X) > 0.5$ and $0$ otherwise, which is equivalent to $\hat{y} = 1$ if $w^T X > 0$ and $0$ otherwise, since $p(y=1|X) = \frac{1}{1+\exp(-w^T X)} > 0.5$ iff $\exp(-w^T X) < 1$ iff $w^T X > 0$. A linear model!

20 Logistic regression for classification
Decision boundary in general: $\hat{y} = \arg\max_y p(y|X) = \arg\max_y \exp(w_y^T X) = \arg\max_y w_y^T X$. A linear model!

21 Logistic regression for classification
Summary: $P(y=1|X) = \frac{P(X|y=1)P(y=1)}{P(X|y=1)P(y=1) + P(X|y=0)P(y=0)} = \frac{1}{1 + \frac{P(X|y=0)P(y=0)}{P(X|y=1)P(y=1)}}$, with $P(y=1) = \alpha$ (Binomial) and $P(X|y=1) = N(\mu_1, \sigma^2)$, $P(X|y=0) = N(\mu_0, \sigma^2)$ (Normal with identical variance). [Figure: the resulting sigmoid-shaped $P(y|x)$ curve.]

22 Recap: Logistic regression for classification
Decision boundary for the binary case: $\hat{y} = 1$ if $p(y=1|X) > 0.5$ and $0$ otherwise, which is equivalent to $\hat{y} = 1$ if $w^T X > 0$ and $0$ otherwise, since $p(y=1|X) = \frac{1}{1+\exp(-w^T X)} > 0.5$ iff $\exp(-w^T X) < 1$ iff $w^T X > 0$. A linear model!

23 Recap: Logistic regression for classification
Why the sigmoid function? With $P(x|y) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$:
$\ln\frac{P(X|y=1)P(y=1)}{P(X|y=0)P(y=0)} = \ln\frac{P(y=1)}{P(y=0)} + \sum_{i=1}^{V}\ln\frac{P(x_i|y=1)}{P(x_i|y=0)} = \ln\frac{\alpha}{1-\alpha} + \sum_{i=1}^{V}\Big(\frac{\mu_{1i}-\mu_{0i}}{\sigma_i^2}x_i - \frac{\mu_{1i}^2-\mu_{0i}^2}{2\sigma_i^2}\Big) = w_0 + \sum_{i=1}^{V}\frac{\mu_{1i}-\mu_{0i}}{\sigma_i^2}x_i = w_0 + w^T X$, i.e., a linear function of the (augmented) feature vector. This is the origin of the name: the logit function.

24 Recap: parametric form of decision boundary in Naïve Bayes
For binary cases: $f(X) = \mathrm{sgn}\big(\log P(y=1|X) - \log P(y=0|X)\big) = \mathrm{sgn}\Big(\log\frac{P(y=1)}{P(y=0)} + \sum_{i} c(x_i,d)\log\frac{P(x_i|y=1)}{P(x_i|y=0)}\Big) = \mathrm{sgn}(w^T X)$, where $w = \Big(\log\frac{P(y=1)}{P(y=0)},\ \log\frac{P(x_1|y=1)}{P(x_1|y=0)},\ \ldots,\ \log\frac{P(x_v|y=1)}{P(x_v|y=0)}\Big)$ and $X = (1,\ c(x_1,d),\ \ldots,\ c(x_v,d))$. Linear regression?

25 A different perspective
Imagine we have the following observation: a document containing "happy", "good", "purchase", "item", "indeed" is labeled positive, i.e.,
$p(x=\text{"happy"},y=1) + p(x=\text{"good"},y=1) + p(x=\text{"purchase"},y=1) + p(x=\text{"item"},y=1) + p(x=\text{"indeed"},y=1) = 1$
Question: find a distribution $p(x,y)$ that satisfies this observation.
Answer 1: $p(x=\text{"item"},y=1)=1$, and all the others 0.
Answer 2: $p(x=\text{"indeed"},y=1)=0.5$, $p(x=\text{"good"},y=1)=0.5$, and all the others 0.
We have too little information to favor either one of them.

26 Occam's razor A problem-solving principle
"Among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected." (William of Ockham, 1287–1347)
Principle of Insufficient Reason: "When one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely." (Pierre-Simon Laplace, 1749–1827)

27 A different perspective
Imagine we have the following observation: a document containing "happy", "good", "purchase", "item", "indeed" is labeled positive, i.e.,
$p(x=\text{"happy"},y=1) + p(x=\text{"good"},y=1) + p(x=\text{"purchase"},y=1) + p(x=\text{"item"},y=1) + p(x=\text{"indeed"},y=1) = 1$
Question: find a distribution $p(x,y)$ that satisfies this observation. As a result, a safer choice would be $p(x=\cdot, y=1) = 0.2$, equally favoring every possibility.

28 A different perspective
Imagine we have the following observations: a document containing "happy", "good", "purchase", "item", "indeed" is positive, and 30% of the time "good", "item" are positive, i.e.,
$p(x=\text{"happy"},y=1) + p(x=\text{"good"},y=1) + p(x=\text{"purchase"},y=1) + p(x=\text{"item"},y=1) + p(x=\text{"indeed"},y=1) = 1$
$p(x=\text{"good"},y=1) + p(x=\text{"item"},y=1) = 0.3$
Question: find a distribution $p(x,y)$ that satisfies these observations. Again, a safer choice would be $p(x=\text{"good"},y=1) = p(x=\text{"item"},y=1) = 0.15$ and all the others $7/30$, equally favoring every possibility.

29 A different perspective
Imagine we have the following observations: a document containing "happy", "good", "purchase", "item", "indeed" is positive; 30% of the time "good", "item" are positive; and 50% of the time "good", "happy" are positive, i.e.,
$p(x=\text{"happy"},y=1) + p(x=\text{"good"},y=1) + p(x=\text{"purchase"},y=1) + p(x=\text{"item"},y=1) + p(x=\text{"indeed"},y=1) = 1$
$p(x=\text{"good"},y=1) + p(x=\text{"item"},y=1) = 0.3$
$p(x=\text{"good"},y=1) + p(x=\text{"happy"},y=1) = 0.5$
Question: find a distribution $p(x,y)$ that satisfies these observations. Time to think about: 1) what do we mean by equally/uniformly favoring the models? 2) given all these constraints, how could we find the most preferred model?

30 Maximum entropy modeling
Entropy, a measure of the uncertainty of random events: $H(X) = E[I(X)] = -\sum_{x\in X} P(x)\log P(x)$, which is maximized when $P(X)$ is the uniform distribution. Question 1 is answered; then how about question 2?
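A small sketch computing $H$ for the candidate distributions from the earlier five-word example; the uniform assignment has the largest entropy, matching the principle above (the candidate distributions are the ones from the slides, listed over the five words):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

answer1 = [1.0, 0.0, 0.0, 0.0, 0.0]   # all mass on "item"
answer2 = [0.5, 0.5, 0.0, 0.0, 0.0]   # split between "indeed" and "good"
uniform = [0.2, 0.2, 0.2, 0.2, 0.2]   # the maximum-entropy choice
print(entropy(answer1), entropy(answer2), entropy(uniform))
# 0.0 < 0.693... < 1.609...
```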

31 Represent the constraints
Indicator function, e.g., to express the observation that the word 'good' occurs in a positive document: $f(x,y) = 1$ if $y=1$ and $x = \text{'good'}$, and $0$ otherwise. This is usually referred to as a feature function.

32 Represent the constraints
Empirical expectation of a feature function over a corpus: $E_{\tilde p}[f] = \sum_{x,y}\tilde p(x,y)\,f(x,y)$, where $\tilde p(x,y) = \frac{c(f(x,y))}{N}$, i.e., the frequency of observing $f(x,y)$ in a given collection.
Expectation of the feature function under a given statistical model: $E_p[f] = \sum_{x,y}\tilde p(x)\,p(y|x)\,f(x,y)$, where $\tilde p(x)$ is the empirical distribution of $x$ in the same collection and $p(y|x)$ is the model's estimate of the conditional distribution.
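A minimal sketch of the empirical expectation $E_{\tilde p}[f]$ over a toy labeled collection (the corpus, labels, and the single feature function below are illustrative, not from the lecture):

```python
from collections import Counter

# Toy labeled collection of (word, sentiment) pairs
corpus = [("good", 1), ("item", 1), ("good", 1), ("bad", 0), ("good", 0)]

def f_good_positive(x, y):
    """Indicator feature: the word 'good' occurs with a positive label."""
    return 1.0 if (x == "good" and y == 1) else 0.0

def empirical_expectation(corpus, f):
    """E_{p~}[f] = sum_{x,y} p~(x,y) f(x,y), with p~ the empirical distribution."""
    n = len(corpus)
    counts = Counter(corpus)
    return sum((c / n) * f(x, y) for (x, y), c in counts.items())

print(empirical_expectation(corpus, f_good_positive))  # 2/5 = 0.4
```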

33 Represent the constraints
When a feature is important, we require our preferred statistical model to accord with it: $C := \{\,p\in P \mid E_p[f_i] = E_{\tilde p}[f_i],\ \forall i\in\{1,2,\ldots,n\}\,\}$. Writing out $E_p[f_i] = E_{\tilde p}[f_i]$: $\sum_{x,y}\tilde p(x,y)\,f_i(x,y) = \sum_{x,y}\tilde p(x)\,p(y|x)\,f_i(x,y)$, so $p(y|x)$ is the only thing we need to specify in our preferred model! Is question 2 answered?

34 Represent the constraints
Let's visualize this. [Figure: (a) no constraint, (b) under-constrained, (c) feasible constraint, (d) over-constrained.] How do we deal with these situations?

35 Maximum entropy principle
To select a model from the set $C$ of allowed probability distributions, choose the model $p^*\in C$ with maximum entropy $H(p)$: $p^* = \arg\max_{p\in C} H\big(p(y|x)\big)$. Both questions are answered!

36 Maximum entropy principle
Let's solve this constrained optimization problem with Lagrange multipliers, a strategy for finding the local maxima and minima of a function subject to equality constraints.
Primal: $p^* = \arg\max_{p\in C} H(p)$
Lagrangian: $L(p,\lambda) = H(p) + \sum_i \lambda_i\big(E_p[f_i] - E_{\tilde p}[f_i]\big)$

37 Maximum entropy principle
Let's solve this constrained optimization problem with Lagrange multipliers.
Lagrangian: $L(p,\lambda) = H(p) + \sum_i \lambda_i\big(E_p[f_i] - E_{\tilde p}[f_i]\big)$
Dual: $\Psi(\lambda) = -\sum_x \tilde p(x)\log Z_\lambda(x) + \sum_i \lambda_i E_{\tilde p}[f_i]$, with $p_\lambda(y|x) = \frac{1}{Z_\lambda(x)}\exp\big(\sum_i \lambda_i f_i(x,y)\big)$

38 Maximum entropy principle
Let's solve this constrained optimization problem with Lagrange multipliers.
Dual: $\Psi(\lambda) = -\sum_x \tilde p(x)\log Z_\lambda(x) + \sum_i \lambda_i E_{\tilde p}[f_i]$, where $Z_\lambda(x) = \sum_y \exp\big(\sum_i \lambda_i f_i(x,y)\big)$
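A hedged sketch of the dual's parametric form $p_\lambda(y|x)$ with its per-$x$ normalizer $Z_\lambda(x)$; the feature functions, weights, and label set below are illustrative:

```python
import numpy as np

def maxent_conditional(x, lambdas, features, labels):
    """p_lambda(y|x) = exp(sum_i lambda_i f_i(x,y)) / Z_lambda(x),
    where Z_lambda(x) sums the numerator over all labels y."""
    scores = np.array([sum(l * f(x, y) for l, f in zip(lambdas, features))
                       for y in labels])
    scores = scores - scores.max()          # numerical stability only
    expd = np.exp(scores)
    return dict(zip(labels, expd / expd.sum()))

# Illustrative binary features over (word, label) pairs
features = [lambda x, y: 1.0 if (x == "good" and y == 1) else 0.0,
            lambda x, y: 1.0 if (x == "bad" and y == 0) else 0.0]
lambdas = [1.5, 2.0]
print(maxent_conditional("good", lambdas, features, labels=[0, 1]))
# -> {0: ~0.18, 1: ~0.82}
```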

39 Maximum entropy principle
Let's take a closer look at the dual function: $\Psi(\lambda) = -\sum_x \tilde p(x)\log Z_\lambda(x) + \sum_i \lambda_i E_{\tilde p}[f_i]$, where $Z_\lambda(x) = \sum_y \exp\big(\sum_i \lambda_i f_i(x,y)\big)$

40 Maximum entropy principle
Let's take a closer look at the dual function: $\Psi(\lambda) = -\sum_x \tilde p(x)\log Z_\lambda(x) + \sum_{x,y}\tilde p(x,y)\sum_i \lambda_i f_i(x,y) = \sum_{x,y}\tilde p(x,y)\log\frac{\exp\big(\sum_i\lambda_i f_i(x,y)\big)}{Z_\lambda(x)} = \sum_{x,y}\tilde p(x,y)\log p_\lambda(y|x)$, i.e., the maximum likelihood estimator!

41 Maximum entropy principle
Primal: maximum entropy, $p^* = \arg\max_{p\in C} H(p)$.
Dual: logistic regression, $p_\lambda(y|x) = \frac{1}{Z_\lambda(x)}\exp\big(\sum_i \lambda_i f_i(x,y)\big)$ with $Z_\lambda(x) = \sum_y \exp\big(\sum_i \lambda_i f_i(x,y)\big)$, where $\lambda^*$ is determined by maximizing $\Psi(\lambda)$.

42 Questions haven’t been answered
Class-conditional density: why should it be Gaussian with equal variance?
Model parameters: what is the relationship between $w$ and $\lambda$, and how do we estimate them?

43 Recap: Occam's razor A problem-solving principle
"Among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected." (William of Ockham, 1287–1347)
Principle of Insufficient Reason: "When one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely." (Pierre-Simon Laplace, 1749–1827)

44 Recap: a different perspective
Imagine we have the following observations: a document containing "happy", "good", "purchase", "item", "indeed" is positive; 30% of the time "good", "item" are positive; and 50% of the time "good", "happy" are positive, i.e.,
$p(x=\text{"happy"},y=1) + p(x=\text{"good"},y=1) + p(x=\text{"purchase"},y=1) + p(x=\text{"item"},y=1) + p(x=\text{"indeed"},y=1) = 1$
$p(x=\text{"good"},y=1) + p(x=\text{"item"},y=1) = 0.3$
$p(x=\text{"good"},y=1) + p(x=\text{"happy"},y=1) = 0.5$
Question: find a distribution $p(x,y)$ that satisfies these observations. Time to think about: 1) what do we mean by equally/uniformly favoring the models? 2) given all these constraints, how could we find the most preferred model?

45 Recap: maximum entropy modeling
Entropy, a measure of the uncertainty of random events: $H(X) = E[I(X)] = -\sum_{x\in X} P(x)\log P(x)$, which is maximized when $P(X)$ is the uniform distribution. Question 1 is answered; then how about question 2?

46 Recap: represent the constraints
Empirical expectation of a feature function over a corpus: $E_{\tilde p}[f] = \sum_{x,y}\tilde p(x,y)\,f(x,y)$, where $\tilde p(x,y) = \frac{c(f(x,y))}{N}$, i.e., the frequency of observing $f(x,y)$ in a given collection.
Expectation of the feature function under a given statistical model: $E_p[f] = \sum_{x,y}\tilde p(x)\,p(y|x)\,f(x,y)$, where $\tilde p(x)$ is the empirical distribution of $x$ in the same collection and $p(y|x)$ is the model's estimate of the conditional distribution.

47 Recap: maximum entropy principle
Let's solve this constrained optimization problem with Lagrange multipliers, a strategy for finding the local maxima and minima of a function subject to equality constraints.
Primal: $p^* = \arg\max_{p\in C} H(p)$
Lagrangian: $L(p,\lambda) = H(p) + \sum_i \lambda_i\big(E_p[f_i] - E_{\tilde p}[f_i]\big)$

48 Recap: maximum entropy principle
Let's take a closer look at the dual function: $\Psi(\lambda) = -\sum_x \tilde p(x)\log Z_\lambda(x) + \sum_{x,y}\tilde p(x,y)\sum_i \lambda_i f_i(x,y) = \sum_{x,y}\tilde p(x,y)\log\frac{\exp\big(\sum_i\lambda_i f_i(x,y)\big)}{Z_\lambda(x)} = \sum_{x,y}\tilde p(x,y)\log p_\lambda(y|x)$, i.e., the maximum likelihood estimator!

49 Maximum entropy principle
The maximum entropy model subject to the constraints $C$ has a parametric solution $p_{\lambda^*}(y|x)$, where the parameters $\lambda^*$ can be determined by maximizing the likelihood of $p_\lambda(y|x)$ over a training set. Moreover, for a given variance, differential entropy is maximized by a Gaussian distribution. Hence: features following a Gaussian distribution → maximum entropy model → logistic regression.

50 Parameter estimation Maximum likelihood estimation
$L(w) = \sum_{d\in D}\big[y_d \log p(y_d=1|X_d) + (1-y_d)\log p(y_d=0|X_d)\big]$
Take the gradient of $L(w)$ with respect to $w$:
$\frac{\partial L(w)}{\partial w} = \sum_{d\in D}\Big[y_d \frac{\partial \log p(y_d=1|X_d)}{\partial w} + (1-y_d)\frac{\partial \log p(y_d=0|X_d)}{\partial w}\Big]$

51 Parameter estimation Maximum likelihood estimation
$\frac{\partial \log p(y_d=1|X_d)}{\partial w} = -\frac{\partial \log\big(1+\exp(-w^T X_d)\big)}{\partial w} = \frac{\exp(-w^T X_d)}{1+\exp(-w^T X_d)}X_d = \big(1 - p(y_d=1|X_d)\big)X_d$
$\frac{\partial \log p(y_d=0|X_d)}{\partial w} = \big(0 - p(y_d=1|X_d)\big)X_d$

52 Parameter estimation Maximum likelihood estimation
$L(w) = \sum_{d\in D}\big[y_d \log p(y_d=1|X_d) + (1-y_d)\log p(y_d=0|X_d)\big]$
Take the gradient of $L(w)$ with respect to $w$:
$\frac{\partial L(w)}{\partial w} = \sum_{d\in D}\Big[y_d\big(1-p(y_d=1|X_d)\big)X_d + (1-y_d)\big(0-p(y_d=1|X_d)\big)X_d\Big] = \sum_{d\in D}\big(y_d - p(y=1|X_d)\big)X_d$
Good news: a neat form, and $L(w)$ is concave in $w$. Bad news: no closed-form solution. This can be easily generalized to the multi-class case.
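A minimal sketch of this log-likelihood and its gradient $\sum_d (y_d - p(y=1|X_d))X_d$ in matrix form (it assumes a design matrix whose rows are the augmented feature vectors $X_d$; the small epsilon is only to guard against $\log 0$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """L(w) = sum_d [ y_d log p(y_d=1|X_d) + (1 - y_d) log p(y_d=0|X_d) ]"""
    p = sigmoid(X @ w)
    eps = 1e-12                               # avoid log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient(w, X, y):
    """dL/dw = sum_d (y_d - p(y=1|X_d)) X_d, exactly the form derived above."""
    p = sigmoid(X @ w)
    return X.T @ (y - p)
```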

53 Gradient-based optimization
Gradient descent: $\nabla L(w) = \Big[\frac{\partial L(w)}{\partial w_0}, \frac{\partial L(w)}{\partial w_1}, \ldots, \frac{\partial L(w)}{\partial w_V}\Big]$, with the iterative update $w^{(t+1)} = w^{(t)} - \eta^{(t)}\nabla L(w)$, where the step size $\eta^{(t)}$ affects convergence.

54 Parameter estimation: stochastic gradient descent
while not converged:
  randomly choose $d\in D$
  $\nabla L_d(w) = \Big[\frac{\partial L_d(w)}{\partial w_0}, \frac{\partial L_d(w)}{\partial w_1}, \ldots, \frac{\partial L_d(w)}{\partial w_V}\Big]$
  $w^{(t+1)} = w^{(t)} - \eta^{(t)}\nabla L_d(w)$
  $\eta^{(t+1)} = a\,\eta^{(t)}$ (gradually shrink the step size)
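A hedged sketch of this stochastic update for logistic regression. Here we maximize the log-likelihood, so each step moves along its per-instance gradient $(y_d - p(y=1|X_d))X_d$ (equivalent to descent on the negative log-likelihood); `a` is the decay factor, and a fixed number of passes stands in for a convergence test:

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, a=0.99, epochs=50, seed=0):
    """Stochastic gradient updates, one randomly chosen instance at a time:
    w <- w + eta * (y_d - p(y=1|X_d)) * X_d, shrinking eta after each step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for d in rng.permutation(len(y)):
            p = 1.0 / (1.0 + np.exp(-(X[d] @ w)))
            w = w + eta * (y[d] - p) * X[d]   # per-instance gradient step
            eta = a * eta                     # gradually shrink the step size
    return w
```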

55 Parameter estimation: batch gradient descent
while not converged:
  compute the gradient w.r.t. all training instances: $\nabla L_D(w) = \Big[\frac{\partial L_D(w)}{\partial w_0}, \frac{\partial L_D(w)}{\partial w_1}, \ldots, \frac{\partial L_D(w)}{\partial w_V}\Big]$
  compute the step size $\eta^{(t)}$ (line search is required to ensure sufficient descent)
  $w^{(t+1)} = w^{(t)} - \eta^{(t)}\nabla L_D(w)$
This is a first-order method; second-order methods, e.g., quasi-Newton methods and conjugate gradient, provide faster convergence.
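In practice these updates are often delegated to an off-the-shelf optimizer. A hedged, self-contained sketch using SciPy's L-BFGS (a quasi-Newton method) to minimize the negative log-likelihood with its exact gradient; the function name and setup are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def fit_logistic_lbfgs(X, y):
    """Quasi-Newton (L-BFGS) fit of logistic regression by minimizing
    the negative log-likelihood, supplying its exact gradient."""
    def neg_ll(w):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        eps = 1e-12
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    def neg_grad(w):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        return -(X.T @ (y - p))

    res = minimize(neg_ll, x0=np.zeros(X.shape[1]), jac=neg_grad, method="L-BFGS-B")
    return res.x
```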

56 Model regularization Avoid over-fitting
Avoid over-fitting: we may not have enough samples to estimate the model parameters of logistic regression well. Regularization imposes additional constraints on the model parameters, e.g., a sparsity constraint that forces the model to have more zero parameters.

57 Model regularization L2 regularized logistic regression
Assume the model parameter $w$ is drawn from a Gaussian prior, $w \sim N(0, \sigma^2)$. Then $p(y_d, w|X_d) \propto p(y_d|X_d, w)\,p(w)$, and the objective becomes
$L(w) = \sum_{d\in D}\big[y_d \log p(y_d=1|X_d) + (1-y_d)\log p(y_d=0|X_d)\big] - \frac{w^T w}{2\sigma^2}$
i.e., the log-likelihood penalized by the L2 norm of $w$.
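A minimal sketch of this L2-regularized objective and its gradient; the Gaussian prior contributes the extra $-w/\sigma^2$ term in the gradient (the `sigma2` parameter plays the role of $\sigma^2$):

```python
import numpy as np

def regularized_objective(w, X, y, sigma2=1.0):
    """L(w) - w^T w / (2 sigma^2): log-likelihood plus a Gaussian log-prior."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    eps = 1e-12
    ll = np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return ll - (w @ w) / (2.0 * sigma2)

def regularized_gradient(w, X, y, sigma2=1.0):
    """sum_d (y_d - p(y=1|X_d)) X_d  -  w / sigma^2"""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (y - p) - w / sigma2
```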

58 Generative vs. discriminative models
Generative models specify the joint distribution:
- Full probabilistic specification of all the random variables
- Dependence assumptions have to be specified for $p(X|y)$ and $p(y)$
- Flexible; can also be used in unsupervised learning
Discriminative models specify the conditional distribution:
- Only explain the target variable
- Arbitrary features can be incorporated for modeling $p(y|X)$
- Need labeled data; only suitable for (semi-)supervised learning

59 Naïve Bayes vs. logistic regression
Naïve Bayes:
- Conditional independence: $p(X|y) = \prod_i p(x_i|y)$
- Distribution assumption on $p(x_i|y)$
- # parameters: $k(V+1)$
- Model estimation: closed-form MLE
- Asymptotic convergence rate: $\epsilon_{NB,n} \le \epsilon_{NB,\infty} + O\big(\sqrt{(\log V)/n}\big)$
Logistic regression:
- No independence assumption
- Functional form assumption: $p(y|X) \propto \exp(w_y^T X)$
- # parameters: $(k-1)(V+1)$
- Model estimation: gradient-based MLE
- Asymptotic convergence rate: $\epsilon_{LR,n} \le \epsilon_{LR,\infty} + O\big(\sqrt{V/n}\big)$, so it needs more training data

60 Naïve Bayes vs. logistic regression
[Figure: LR and NB compared on UCI data sets.] From "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes" (Ng & Jordan, NIPS 2002).

61 What you should know Two different derivations of logistic regression
- Functional form from Naïve Bayes assumptions: $p(X|y)$ follows an equal-variance Gaussian; the sigmoid function
- Maximum entropy principle: primal/dual optimization
- Generalization to multi-class
- Parameter estimation: gradient-based optimization; regularization
- Comparison with Naïve Bayes

62 Today’s reading Speech and Language Processing
Chapter 6: Hidden Markov and Maximum Entropy Models; sections 6.6 (Maximum entropy models: background) and 6.7 (Maximum entropy modeling).

