Natural Language Processing COMPSCI 423/723

Natural Language Processing COMPSCI 423/723 Rohit Kate

Classification for NLP: Naïve Bayes Model and Maximum Entropy Model
Some of these slides have been adapted from Raymond Mooney's slides from his NLP and Machine Learning courses at UT Austin.
References:
- Sections 6.6 & 6.7 from the Jurafsky & Martin book; Naïve Bayes portions from the Word Sense Disambiguation chapters in the Jurafsky & Martin and Manning & Schütze books
- Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression by Tom Mitchell, http://www.cs.cmu.edu/~tom/NewChapters.html
- A Maximum Entropy Approach to Natural Language Processing by Adam L. Berger, Stephen A. Della Pietra and Vincent J. Della Pietra, Computational Linguistics, Vol. 22, No. 1 (1996), pp. 39-71

Naïve Bayes Model

Classification in NLP
Several NLP problems can be formulated as classification problems, a few examples:
- Information Extraction: Given an entity, is it a person name or not? Given two protein names, does the sentence say they interact or not?
- Word Sense Disambiguation: I am out of money. I am going to the bank.
- Document Classification: Given a document, which category does it belong to?
- Sentiment Analysis: Given a passage (e.g. a product or movie review), is it saying positive things or negative things?
- Textual Entailment: Given two sentences, can the second sentence be inferred from the first?

Classification
Usually the classification output variable is denoted by Y and the input variables by Xs. For example, for disambiguating the word "bank":
Y: {river bank, money bank, verb bank}
X1: Previous word
X2: Next word
X3: Part-of-speech of previous word
X4: Part-of-speech of next word
The Xs are usually called features in NLP. Coming up with good feature sets for NLP problems is a skill: feature engineering. It requires linguistic insights and a grasp of the theory behind the classification method.

Probabilistic Classification
Often it is useful to know the probabilities of different output values and not just the best output value:
- To have confidence in the output (a 0.9 vs. 0.1 split is more confident than 0.6 vs. 0.4)
- These probabilities may be useful for the next stage of NLP processing
What we want is the conditional probability: P(Y|X1,X2,..,Xn)

Probabilistic Classification
If the joint probability distribution P(Y,X1,X2,..,Xn) is given, then the conditional probability distribution can be easily estimated:
P(Y|X1,X2,..,Xn) = P(Y,X1,X2,..,Xn) / Σy P(Y=y,X1,X2,..,Xn)

Estimating Conditional Probabilities
X1, X2, Y                   P(Y,X1,X2)
Circle, Red, Positive       0.2
Circle, Red, Negative       0.05
Circle, Blue, Positive      0.02
Circle, Blue, Negative      —
Square, Red, Positive       —
Square, Red, Negative       0.3
Square, Blue, Positive      0.01
Square, Blue, Negative      —
Similarly estimate P(Y|X1,X2) for the remaining values.
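As a worked example of the kind of computation the slide refers to, using only the values that appear in the table:
P(Positive | Circle, Red) = P(Circle, Red, Positive) / (P(Circle, Red, Positive) + P(Circle, Red, Negative))
                          = 0.2 / (0.2 + 0.05) = 0.8
P(Negative | Circle, Red) = 0.05 / 0.25 = 0.2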

Estimating Joint Probability Distributions: Not Easy :-(
Assuming Y and all Xi are binary, we need 2^(n+1) - 1 entries (parameters) to specify the joint probability distribution.
This is impossible to estimate accurately from a reasonably-sized training set.
Note that P(Y|X1,X2,..,Xn) requires fewer entries (2^n), why?
But they are still too many for even a small n.

Estimating Joint Probability Distributions
Simplifying assumptions are made about the joint probability distribution to reduce the number of parameters to estimate.
Let the random variables be nodes of a graph; there are two major types of simplifications, represented as:
- Directed probabilistic graphical models. Simplest: Naïve Bayes model. More complex: Hidden Markov Models (HMMs)
- Undirected probabilistic graphical models. Simplest: Maximum entropy model. More complex: Conditional Random Fields (CRFs)

Directed Graphical Models Simplification assumption: Some random variables are conditionally independent of others given values for some other random variables

Conditional Independence
Two random variables A and B are conditionally independent given C if P(A ∩ B | C) = P(A|C) P(B|C).
Rain and Thunder are not independent (knowing there was Rain increases the probability that there was Thunder). But given that there was Lightning (or no Lightning), they are independent:
P(Rain ∩ Thunder | Lightning) = P(Rain | Lightning) P(Thunder | Lightning)
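A small numeric illustration with made-up probabilities (not from the slides): suppose that on days with Lightning, P(Rain | Lightning) = 0.8 and P(Thunder | Lightning) = 0.9. Conditional independence then gives
P(Rain ∩ Thunder | Lightning) = 0.8 * 0.9 = 0.72
whereas the unconditional probabilities P(Rain) and P(Thunder) need not multiply this way, because both events are driven by the same common cause.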

Directed Graphical Models
Also known as Bayesian networks.
Simplification assumption: Some random variables are conditionally independent of others given values for some other random variables.
Simplest directed graphical model: Naïve Bayes.
Naïve Bayes assumption: The features are conditionally independent given the category.

Naïve Bayes Assumption
Features are conditionally independent given the category.
How do we estimate P(Y|X1,X2,..,Xn) from this?
Recall Bayes' theorem, which lets us calculate P(B|A) in terms of P(A|B):
P(B|A) = P(A|B) P(B) / P(A)

Bayes' Theorem
Simple proof from the definition of conditional probability:
P(B|A) = P(A ∩ B) / P(A)        (Def. cond. prob.)
P(A|B) = P(A ∩ B) / P(B)        (Def. cond. prob.)
So P(A ∩ B) = P(A|B) P(B), and substituting into the first line:
P(B|A) = P(A|B) P(B) / P(A)     QED

Naïve Bayes Model
Naïve Bayes assumption: P(X1,..,Xn|Y) = P(X1|Y) P(X2|Y) ... P(Xn|Y)
From Bayes' theorem:
P(Y|X1,..,Xn) = P(Y) P(X1,..,Xn|Y) / P(X1,..,Xn)
Computing the marginal in the denominator (definition of conditional probability):
P(X1,..,Xn) = Σy P(Y=y) P(X1,..,Xn|Y=y)
Applying the Naïve Bayes assumption to numerator and denominator:
P(Y|X1,..,Xn) = P(Y) Πi P(Xi|Y) / Σy P(Y=y) Πi P(Xi|Y=y)

Naïve Bayes Model
We only need to estimate P(Y) and P(Xi|Y) for all i; together with the naïve Bayes assumption, this specifies the entire joint probability distribution.
Assuming Y and all Xis are binary, only 2n + 1 parameters instead of 2^(n+1) - 1 parameters: a dramatic reduction.
(Diagram: the directed graphical model representation, with Y as the single parent node, parameter P(Y), and X1, X2, X3, ..., Xn as its children, with parameters P(X1|Y), P(X2|Y), P(X3|Y), ..., P(Xn|Y).)
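To make the reduction concrete (an illustrative calculation, not from the slides): with n = 30 binary features, the full joint distribution needs 2^31 - 1 ≈ 2.1 billion parameters, while naïve Bayes needs only 2*30 + 1 = 61.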

Naïve Bayes Example
P(Label | Size, Color, Shape)
Probability        positive   negative
P(Y)               0.5        0.5
P(small | Y)       0.4        —
P(medium | Y)      0.1        0.2
P(large | Y)       —          —
P(red | Y)         0.9        0.3
P(blue | Y)        0.05       —
P(green | Y)       —          —
P(square | Y)      —          —
P(triangle | Y)    —          —
P(circle | Y)      0.9        0.3
Test Instance: <medium, red, circle>

Naïve Bayes Example
Probability        positive   negative
P(Y)               0.5        0.5
P(medium | Y)      0.1        0.2
P(red | Y)         0.9        0.3
P(circle | Y)      0.9        0.3
Test Instance: <medium, red, circle>
P(positive | medium, red, circle)
  = P(positive) * P(medium | positive) * P(red | positive) * P(circle | positive) / P(medium, red, circle)
  = 0.5 * 0.1 * 0.9 * 0.9 / P(medium, red, circle)
  = 0.0405 / 0.0495 = 0.8181
P(negative | medium, red, circle)
  = P(negative) * P(medium | negative) * P(red | negative) * P(circle | negative) / P(medium, red, circle)
  = 0.5 * 0.2 * 0.3 * 0.3 / P(medium, red, circle)
  = 0.009 / 0.0495 = 0.1818
(The denominator P(medium, red, circle) = 0.0405 + 0.009 = 0.0495.)
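A minimal Python sketch of this computation, using only the numbers from the slide (the dictionary layout and the predict function are illustrative, not from the course materials):

# Naive Bayes posterior for the test instance <medium, red, circle>,
# using the conditional probabilities given on the slide.
priors = {"positive": 0.5, "negative": 0.5}
likelihoods = {
    "positive": {"medium": 0.1, "red": 0.9, "circle": 0.9},
    "negative": {"medium": 0.2, "red": 0.3, "circle": 0.3},
}

def predict(features):
    # Unnormalized score: P(Y) * product of P(Xi | Y)
    scores = {}
    for label, prior in priors.items():
        score = prior
        for f in features:
            score *= likelihoods[label][f]
        scores[label] = score
    # Normalize by P(X1,..,Xn) = sum of scores over all labels
    z = sum(scores.values())
    return {label: score / z for label, score in scores.items()}

print(predict(["medium", "red", "circle"]))
# {'positive': 0.8181..., 'negative': 0.1818...}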

Estimating Probabilities
Normally, probabilities are estimated based on observed frequencies in the training data. If D contains nk examples in category yk, and nijk of these nk examples have the jth value xij for feature Xi, then:
P(Xi = xij | Y = yk) = nijk / nk
However, estimating such probabilities from small training sets is error-prone. If, due only to chance, a rare feature Xi is always false in the training data, then for every category yk: P(Xi=true | Y=yk) = 0. If Xi=true then occurs in a test example X, the result is that for every yk: P(X | Y=yk) = 0 and hence P(Y=yk | X) = 0.

Probability Estimation Example
Training data:
Ex   Size    Color   Shape      Category
1    small   red     circle     positive
2    large   red     circle     positive
3    —       red     triangle   negative
4    —       blue    circle     negative
Estimated parameters:
Probability        positive   negative
P(Y)               0.5        0.5
P(small | Y)       —          —
P(medium | Y)      0.0        0.0
P(large | Y)       —          —
P(red | Y)         1.0        0.5
P(blue | Y)        —          —
P(green | Y)       —          —
P(square | Y)      —          —
P(triangle | Y)    —          —
P(circle | Y)      1.0        0.5
Test Instance X: <medium, red, circle>
P(positive | X) = 0.5 * 0.0 * 1.0 * 1.0 / P(X) = 0
P(negative | X) = 0.5 * 0.0 * 0.5 * 0.5 / P(X) = 0
Both posteriors come out zero because P(medium | Y) was estimated as 0 for both categories.

Smoothing
To account for estimation from small samples, probability estimates are adjusted or smoothed.
Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a "virtual" sample of size m.
For binary features, p is simply assumed to be 0.5.
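The m-estimate has the following standard form (stated here for reference; it is consistent with the worked numbers on the next slide):
P(Xi = xij | Y = yk) = (nijk + m*p) / (nk + m)
With m = 1 and p = 1/3 (three possible sizes), a count of 0 out of 10 becomes (0 + 1/3) / (10 + 1) ≈ 0.03 instead of 0.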

Laplace Smoothing Example
Assume the training set contains 10 positive examples:
4: small, 0: medium, 6: large
Estimate parameters as follows (with m = 1, p = 1/3):
P(small | positive) = (4 + 1/3) / (10 + 1) = 0.394
P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
P(large | positive) = (6 + 1/3) / (10 + 1) = 0.576
P(small or medium or large | positive) = 1.0

Naïve Bayes Model is a Generative Model
It models the joint probability distribution P(Y,X1,X2,..,Xn) using P(Y) and P(Xi|Y).
The assumed generative process: first generate Y according to P(Y), then generate X1,X2,..,Xn independently according to P(X1|Y), P(X2|Y), .., P(Xn|Y) respectively.

Naïve Bayes Generative Model
(Diagram: a bag of category labels (pos/neg) is sampled first; then, conditioned on the chosen category, a Size, a Color, and a Shape are each drawn independently from the corresponding positive or negative bags.)

Naïve Bayes Inference Problem
(Diagram: given an observed test instance <lg, red, circ> with unknown category (??), infer which category's bags are most likely to have generated it.)

Some Comments on the Naïve Bayes Model
It tends to work well despite the strong (naïve) assumption of conditional independence.
Experiments show it to be quite competitive with other classification methods on standard UCI datasets.
Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases.

Maximum Entropy Model

Maximum Entropy Models
Very popular in NLP. Several ways to look at them:
- Exponential or log-linear classifiers, or multinomial logistic regression
- Assume a parametric form for the conditional distribution
- Maximize the entropy of the distribution given the constraints
They are discriminative models instead of generative: they directly estimate P(Y|X1,..,Xn) instead of going via P(Y,X1,..,Xn).

Linear Regression
Classification: predict a discrete value. Regression: predict a real value.
Linear regression: predict a real value using a linear combination of the inputs:
Y = W0 + W1*X1 + W2*X2 + … + Wn*Xn
The Ws are the weights associated with the features Xs.
Example: price = 16550 - 4900*(# vague adjectives)
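Plugging a value into the example equation (an illustrative calculation): an ad containing 3 vague adjectives would be predicted to have price = 16550 - 4900*3 = 1850.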

Estimating Weights in Linear Regression
Find the Ws that minimize the sum-squared error over the given M training examples:
Σ (j = 1..M)  ( Yj - (W0 + Σi Wi*Xij) )^2
Statistical packages are available that solve this quickly.
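A minimal Python sketch of fitting such weights with an off-the-shelf least-squares solver (the data here is made up purely for illustration):

import numpy as np

# Toy training data: one feature (number of vague adjectives) and a target price.
# These numbers are illustrative only.
X = np.array([[0.0], [1.0], [2.0], [3.0]])        # M x n feature matrix
y = np.array([16000.0, 12000.0, 7000.0, 2500.0])  # M target values

# Add a column of ones so the solver also fits the intercept W0.
X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve min_W ||X_with_bias @ W - y||^2 (the sum-squared error).
W, residuals, rank, _ = np.linalg.lstsq(X_with_bias, y, rcond=None)
print("W0 =", W[0], " W1 =", W[1])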

Logistic Regression
But we are interested in probabilistic classification, that is, in predicting P(Y|X1,..,Xn).
Can we modify linear regression to do that? Nothing constrains its output to lie in [0,1], which is required for a legal probability.
Idea: predict the odds (assume Y is binary) instead of the probability:
P(Y=1|X1,..,Xn) / P(Y=0|X1,..,Xn) = W0 + W1*X1 + … + Wn*Xn

Logistic Regression
But the LHS (the odds) lies between 0 and infinity, while the RHS can be anywhere between -infinity and infinity.
Take the log of the LHS (known as the logit function):
log( P(Y=1|X) / P(Y=0|X) ) = W0 + W1*X1 + … + Wn*Xn
Solving for P(Y=1|X) gives the logistic function:
P(Y=1|X) = 1 / (1 + exp(-(W0 + W1*X1 + … + Wn*Xn)))

Logistic Regression as a Log-Linear Model
Logistic regression is basically a linear model, which is demonstrated by taking logs:
log P(Y=1|X) - log P(Y=0|X) = W0 + W1*X1 + … + Wn*Xn
so the log-odds is a linear function of the features.

Logistic Regression Training
Weights are set during training to maximize the conditional data likelihood:
W ← argmax_W  Π (d in D)  P(Yd | Xd, W)
where D is the set of training examples and Yd and Xid denote, respectively, the values of Y and Xi for example d.
Equivalently, this can be viewed as maximizing the conditional log likelihood (CLL):
W ← argmax_W  Σ (d in D)  ln P(Yd | Xd, W)

Logistic Regression Training
Use standard gradient descent to find the parameters (weights) that optimize the CLL objective function (see the sketch below).
Many other more advanced training methods are available:
- Conjugate gradient
- Generalized Iterative Scaling (GIS)
- Improved Iterative Scaling (IIS)
- Limited-memory quasi-Newton (L-BFGS)
Packages are available that implement these.
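A minimal NumPy sketch of gradient-based training for binary logistic regression (a toy illustration, not the course's reference implementation; the data and learning rate are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    """Maximize the CLL by gradient ascent (equivalently, gradient descent on -CLL).
    X: M x n feature matrix, y: length-M array of 0/1 labels."""
    M, n = X.shape
    Xb = np.hstack([np.ones((M, 1)), X])   # prepend bias feature X0 = 1
    W = np.zeros(n + 1)
    for _ in range(epochs):
        p = sigmoid(Xb @ W)                # P(Y=1 | X, W) for every example
        gradient = Xb.T @ (y - p)          # d CLL / d W
        W += lr * gradient / M             # ascent step on the CLL
    return W

# Toy data: Y = 1 when the single feature is large (illustrative only).
X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([0, 0, 1, 1])
W = train_logistic_regression(X, y)
print("weights:", W)
print("P(Y=1 | x=0.8):", sigmoid(np.array([1.0, 0.8]) @ W))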

Preventing Overfitting in Logistic Regression
To prevent overfitting, one can use regularization (smoothing), penalizing large weights by changing the training objective:
W ← argmax_W  Σ (d in D)  ln P(Yd | Xd, W)  -  (λ/2) Σi Wi^2
where λ is a constant that determines the amount of smoothing.
This can be shown to be equivalent to assuming a Gaussian prior for W with zero mean and a variance related to 1/λ.

Generative vs. Discriminative Models
Generative models are not directly designed to maximize classification performance. They model the complete joint distribution P(Y,X1,...,Xn). But a generative model can also be used to perform any other inference task, e.g. P(X1 | X2,...,Xn, Y). "Jack of all trades, master of none."
Discriminative models are specifically designed and trained to maximize classification performance. They only model the conditional distribution P(Y | X1,...,Xn). By focusing on modeling the conditional distribution, they generally perform better on classification than generative models when given a reasonable amount of training data. "Master of one trade": classification, P(Y|X1,...,Xn).

Multinomial Logistic Regression (Maximum Entropy or MaxEnt)
So far Y was binary; this is the generalization to the case where Y takes multiple values (classes).
Make the weights dependent on the class c: Wci instead of Wi.
A normalization term Z ensures the probabilities sum to 1:
P(Y=c | X1,..,Xn) = exp( Σi Wci*Xi ) / Z   where   Z = Σc' exp( Σi Wc'i*Xi )

Multinomial Logistic Regression (Maximum Entropy or MaxEnt)
In NLP, features usually take binary values.
Introduce indicator functions (with 0 or 1 output) that depend on both the input and the output class.
Calling the input x, the features are fi(c,x), and the model becomes:
P(c | x) = exp( Σi Wi*fi(c,x) ) / Z(x)   where   Z(x) = Σc' exp( Σi Wi*fi(c',x) )

A Small MaxEnt Example
Word Sense Disambiguation:
Y: {river bank, money bank, verb bank}
X: the entire sentence
Features:
f1(river bank, X) = 1 if "river" is in the sentence, 0 otherwise
f2(river bank, X) = 1 if "water" is in the sentence, 0 otherwise
f3(money bank, X) = 1 if "money" is in the sentence, 0 otherwise
f4(money bank, X) = 1 if "deposit" is in the sentence, 0 otherwise
f5(verb bank, X) = 1 if the previous word was "to", 0 otherwise
Obtain examples of feature values and Y from annotated training data.
Compute the weights Wci to maximize the conditional log-likelihood of the training data.
For a test example, predict Y using the MaxEnt equation (a small sketch of this step follows below).
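A minimal Python sketch of the MaxEnt prediction step for this example. The feature definitions follow the slide; the weight values are hypothetical, purely to make the code runnable:

import math

CLASSES = ["river bank", "money bank", "verb bank"]

def features(c, sentence, prev_word):
    """Indicator features f1..f5 from the slide, returned as a list of 0/1 values."""
    words = sentence.lower().replace(".", " ").replace(",", " ").split()
    return [
        1 if c == "river bank" and "river" in words else 0,    # f1
        1 if c == "river bank" and "water" in words else 0,    # f2
        1 if c == "money bank" and "money" in words else 0,    # f3
        1 if c == "money bank" and "deposit" in words else 0,  # f4
        1 if c == "verb bank" and prev_word == "to" else 0,    # f5
    ]

# Hypothetical trained weights, one per feature (not from the course materials).
W = [1.2, 0.8, 1.5, 1.0, 2.0]

def maxent_probs(sentence, prev_word):
    # Unnormalized score for each class: exp( sum_i W_i * f_i(c, x) )
    scores = {c: math.exp(sum(w * f for w, f in zip(W, features(c, sentence, prev_word))))
              for c in CLASSES}
    z = sum(scores.values())               # normalization term Z(x)
    return {c: s / z for c, s in scores.items()}

print(maxent_probs("I am out of money. I am going to the bank.", prev_word="the"))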

Why is it Called the Maximum Entropy Model?
Entropy of a random variable Y:
H(Y) = - Σy P(Y=y) log P(Y=y)
The more uniform the distribution, the higher the entropy.
It can be shown that standard training for logistic regression gives the distribution with maximum entropy that is empirically consistent with the training data.
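For instance (an illustrative calculation with base-2 logs): a fair coin with P(heads) = 0.5 has H = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit, while a biased coin with P(heads) = 0.9 has H = -(0.9 log2 0.9 + 0.1 log2 0.1) ≈ 0.47 bits; the more uniform distribution indeed has the higher entropy.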

Undirected Graphical Models
Also called Markov networks or random fields.
An undirected graph over a set of random variables, where an edge represents a dependency.
The Markov blanket of a node X in a Markov network is the set of its neighbors in the graph (the nodes that have an edge connecting to X).
Every node in a Markov network is conditionally independent of every other node given its Markov blanket.
Simplest Markov network: the MaxEnt model.

Relation with Naïve Bayes
(Diagram: Naïve Bayes drawn as a generative directed model with Y pointing to X1, X2, ..., Xn, side by side with logistic regression drawn as a discriminative/conditional model of Y given X1, X2, ..., Xn.)

Simplification Assumption for MaxEnt
The probability P(Y|X1,..,Xn) can be factored as:
P(Y|X1,..,Xn) = (1/Z(X1,..,Xn)) Πi φi(Y, Xi)
(in the MaxEnt model, each factor has the form φi(Y, Xi) = exp(Wi*fi(Y, Xi)))
Note that there is no product term (factor) that involves two or more Xis.

Naïve Bayes and MaxEnt
Naïve Bayes can be extended to work with continuous inputs X (as logistic regression does).
Both make the conditional independence assumption, but MaxEnt is not rigidly tied to it because it tries to maximize the conditional likelihood of the data even when the data disobeys the assumption.
It has been observed that with scarce training data Naïve Bayes performs better, and with sufficient data MaxEnt performs better.

Classification in General
Several other classifiers are also available: perceptrons, neural networks, support vector machines, k-nearest neighbors, decision trees, ...
Naïve Bayes and MaxEnt are based on probabilities:
- They cannot handle combinations of features unless those combinations are themselves engineered as features
- If the right features are engineered, they work very well
- They are widely used for tasks other than NLP tasks
All this was for classification with a single output variable Y; extensions handle multiple output variables, e.g. sequence labeling with HMMs or CRFs.

HW 2
Write the Naïve Bayes (P(Y|f1,f2,f3,f4,f5)) and MaxEnt (P(Y|X)) equations for the example shown on slide #41 (the word sense disambiguation example for "bank").

References for Next Class
Chapter 5 (part-of-speech tagging) of the Jurafsky & Martin book; Chapter 10 of the Manning & Schütze book.
An Introduction to Conditional Random Fields for Relational Learning by Charles Sutton and Andrew McCallum, book chapter in Introduction to Statistical Relational Learning, edited by Lise Getoor and Ben Taskar, MIT Press, 2006.