Machine Learning & Data Mining CS/CNS/EE 155 Lecture 6: Conditional Random Fields

Previous Lecture
Sequence Prediction
– Input: x = (x_1, …, x_M)
– Predict: y = (y_1, …, y_M)
– Naïve full multiclass: exponential explosion
– Independent multiclass: strong independence assumption
Hidden Markov Models
– Generative model: P(y_i | y_{i-1}), P(x_i | y_i)
– Prediction using Bayes's Rule + Viterbi
– Train using Maximum Likelihood

Outline of Today
Long Prelude:
– Generative vs Discriminative Models
– Naïve Bayes
Conditional Random Fields
– Discriminative version of HMMs

Generative vs Discriminative
Generative Models (e.g., Hidden Markov Models):
– Joint distribution: P(x,y)
– Use Bayes's Rule to predict: argmax_y P(y|x)
– Can generate new samples (x,y)
Discriminative Models (e.g., Conditional Random Fields):
– Conditional distribution: P(y|x)
– Use the model directly to predict: argmax_y P(y|x)
Both are trained via Maximum Likelihood. For discriminative models, the training objective and the prediction goal are the same thing; for generative models there is a mismatch.

Naïve Bayes
Binary (or multiclass) prediction.
Model the joint distribution (generative): P(x,y) = P(y) P(x|y)
"Naïve" independence assumption: P(x|y) = ∏_d P(x_d | y)
Prediction via: argmax_y P(y|x) = argmax_y P(x,y)

Naïve Bayes
Prediction example. Model parameters:

P(y): P(y=-1) = 0.4, P(y=+1) = 0.6

P(x_d=1|y)     y=-1   y=+1
P(x_1=1|y)     0.5    0.7
P(x_2=1|y)     0.9    0.4
P(x_3=1|y)     0.1    0.5

x        P(y=-1|x) ∝ P(x,y=-1)            P(y=+1|x) ∝ P(x,y=+1)            Predict
(1,0,0)  0.4 * 0.5 * 0.1 * 0.9 = 0.018    0.6 * 0.7 * 0.6 * 0.5 = 0.126    y = +1
(0,1,1)  0.4 * 0.5 * 0.9 * 0.1 = 0.018    0.6 * 0.3 * 0.4 * 0.5 = 0.036    y = +1
(0,1,0)  0.4 * 0.5 * 0.9 * 0.9 = 0.162    0.6 * 0.3 * 0.4 * 0.5 = 0.036    y = -1
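A minimal sketch of this prediction computation in Python. The probability tables are the ones from the example above; the code itself is illustrative and not from the lecture:

```python
# Model parameters from the worked example.
p_y = {-1: 0.4, +1: 0.6}                       # P(y)
p_xd_1 = {-1: [0.5, 0.9, 0.1],                 # P(x_d = 1 | y = -1)
          +1: [0.7, 0.4, 0.5]}                 # P(x_d = 1 | y = +1)

def joint(x, y):
    """P(x, y) under the Naive Bayes factorization P(y) * prod_d P(x_d | y)."""
    prob = p_y[y]
    for d, x_d in enumerate(x):
        p1 = p_xd_1[y][d]
        prob *= p1 if x_d == 1 else (1.0 - p1)
    return prob

def predict(x):
    """argmax_y P(y | x) = argmax_y P(x, y) (Bayes's Rule; P(x) cancels)."""
    return max(p_y, key=lambda y: joint(x, y))

for x in [(1, 0, 0), (0, 1, 1), (0, 1, 0)]:
    print(x, joint(x, -1), joint(x, +1), "->", predict(x))
```

Running it reproduces the three rows of the table above.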

Naïve Bayes
Matrix formulation: collect P(y) into a vector and the conditionals P(x_d=1|y) into a matrix.

P(y): P(y=-1) = 0.4, P(y=+1) = 0.6    (sums to 1)

P(x_d=1|y)     y=-1   y=+1
P(x_1=1|y)     0.5    0.7
P(x_2=1|y)     0.9    0.4
P(x_3=1|y)     0.1    0.5
(each conditional distribution P(x_d|y) sums to 1 over the values of x_d)

Naïve Bayes
Train via Maximum Likelihood: estimate P(y) and each P(x_d|y) from the data
– Count frequencies
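A hedged sketch of maximum-likelihood training by counting; the tiny dataset and the function name are made up for illustration:

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, Y):
    """Estimate P(y) and P(x_d = 1 | y) by counting frequencies.

    X: list of binary feature tuples, Y: list of labels.
    """
    n = len(Y)
    label_counts = Counter(Y)
    p_y = {y: c / n for y, c in label_counts.items()}

    ones = defaultdict(lambda: [0] * len(X[0]))   # ones[y][d] = #{examples with x_d = 1 and label y}
    for x, y in zip(X, Y):
        for d, x_d in enumerate(x):
            ones[y][d] += x_d
    p_xd_1 = {y: [ones[y][d] / label_counts[y] for d in range(len(X[0]))]
              for y in label_counts}
    return p_y, p_xd_1

# Tiny made-up dataset.
X = [(1, 0, 0), (0, 1, 1), (0, 1, 0), (1, 1, 0)]
Y = [+1, -1, -1, +1]
print(train_naive_bayes(X, Y))
```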

Naïve Bayes vs HMMs
Naïve Bayes: P(x,y) = P(y) ∏_d P(x_d | y)
Hidden Markov Models: P(x,y) = ∏_i P(y_i | y_{i-1}) P(x_i | y_i)
Both rest on a "naïve" generative independence assumption that factorizes the joint; the HMM's P(y_i | y_{i-1}) plays the role of P(y) and its P(x_i | y_i) plays the role of P(x|y).
HMMs ≈ 1st-order variant of Naïve Bayes! (just one interpretation…)

Summary: Naïve Bayes
"Generative Model" (can sample new data)
Joint model of (x,y): P(x,y) = P(y) ∏_d P(x_d | y)
– "Naïve" independence assumption on each x_d
Use Bayes's Rule for prediction: argmax_y P(y|x) = argmax_y P(x,y)
Maximum Likelihood training:
– Count frequencies

Learn Conditional Probabilities?
It is weird to train Naïve Bayes to maximize the joint likelihood P(x,y) when the goal should be to maximize the conditional likelihood P(y|x). But maximizing P(y|x) under the Naïve Bayes parameterization breaks the independence structure: we can no longer train with simple count statistics. (HMMs suffer the same problem.)
In general, you should maximize the likelihood of the model you define: if you define a joint model P(x,y), then maximize P(x,y) on the training data.

Summary: Generative Models
Joint model of (x,y):
– Compact & easy to train…
– …with independence assumptions
– E.g., Naïve Bayes & HMMs
Maximum Likelihood training: argmax_θ ∏ P(x,y | θ)
(θ is often used to denote all parameters of the model)
Mismatch with the prediction goal:
– It is hard to directly maximize P(y|x)

Discriminative Models
Conditional model: P(y|x)
– Directly models the prediction goal
Maximum Likelihood: argmax_θ ∏ P(y|x, θ)
– Matches the prediction goal: argmax_y P(y|x)
What does P(y|x) look like?

First Try
Model P(y|x) separately for every possible x: a table of P(y=1|x) indexed by (x_1, x_2, …).
Train by counting frequencies.
Exponential in the number of input variables D!
– Need to assume something… what?

Log-Linear Models! (Logistic Regression)
"Log-linear" assumption: P(y|x) = exp(w_y^T x + b_y) / Z(x), where Z(x) = Σ_{y'} exp(w_{y'}^T x + b_{y'})
– Model representation is linear in D
– The most common discriminative probabilistic model
Prediction: argmax_y P(y|x) = argmax_y (w_y^T x + b_y)
Training: maximize the conditional likelihood of the training data; the training objective matches the prediction goal.
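A minimal softmax-form sketch of this model in Python; the dimensions and parameter values are illustrative assumptions, not from the lecture:

```python
import numpy as np

def predict_proba(x, W, b):
    """Multiclass logistic regression: P(y=k | x) = exp(w_k^T x + b_k) / Z(x)."""
    scores = W @ x + b                 # shape (K,): one score per class
    scores -= scores.max()             # numerical stability; the shift cancels in Z(x)
    exps = np.exp(scores)
    return exps / exps.sum()

def predict(x, W, b):
    """argmax_y P(y | x) = argmax_y (w_y^T x + b_y)."""
    return int(np.argmax(W @ x + b))

# Toy example: K = 2 classes, D = 3 features.
W = np.array([[0.5, -1.0, 0.2],
              [-0.3, 0.8, 0.1]])
b = np.array([0.1, -0.1])
x = np.array([1.0, 0.0, 1.0])
print(predict_proba(x, W, b), predict(x, W, b))
```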

Naïve Bayes vs Logistic Regression
Naïve Bayes: P(y) P(x|y)
– Strong independence assumptions
– Super easy to train…
– …but mismatch with prediction
Logistic Regression:
– "Log-linear" assumption; often more flexible than Naïve Bayes
– Harder to train (gradient descent)…
– …but matches prediction

Naïve Bayes vs Logistic Regression
NB has K parameters for P(y) (i.e., A); LR has K parameters for the bias b.
NB has K·D parameters for P(x|y) (i.e., O); LR has K·D parameters for w.
Same number of parameters!
Intuition: both models have the same "capacity". NB spends a lot of capacity on P(x); LR spends all of its capacity on P(y|x).
No model is perfect (especially on a finite training set): NB will trade off P(y|x) against P(x), while LR will fit P(y|x) as well as possible.

Generative vs Discriminative (summary)

Generative: P(x,y), a joint model over x and y; cares about everything.
Discriminative: P(y|x) (when probabilistic), a conditional model; only cares about predicting well.

Generative: Naïve Bayes, HMMs; also Topic Models (later).
Discriminative: Logistic Regression, CRFs; also SVM, Least Squares, etc.

Generative: Max Likelihood.
Discriminative: Max (Conditional) Likelihood (= minimize log loss); can pick any loss based on y (hinge loss, squared loss, etc.).

Generative: always probabilistic.
Discriminative: not necessarily probabilistic; certainly never joint over P(x,y).

Generative: often strong assumptions (keeps training tractable).
Discriminative: more flexible assumptions (focuses the entire model on P(y|x)).

Generative: mismatch between train & predict (requires Bayes's Rule).
Discriminative: trains to optimize the prediction goal.

Generative: can sample anything.
Discriminative: can only sample y given x.

Generative: can handle missing values in x.
Discriminative: cannot handle missing values in x.

Conditional Random Fields

"Log-Linear" 1st-Order Sequential Model
P(y|x) = exp(F(y,x)) / Z(x)
Scoring function: F(y,x) = Σ_{i=1..M} [ u_{y_i, y_{i-1}} + w_{y_i, x_i} ]
– u_{a,b} scores the transition from label b to label a
– w_{a,x_i} scores the input features at position i
– y_0 = special start state
Z(x) = Σ_{y'} exp(F(y',x)), aka the "Partition Function"

Example: x = "Fish Sleep", y = (N,V)

Transition scores u_{a,b}     a=N    a=V
b = N                         -2      1
b = V                          2     -2
b = Start                      1     -1

Input scores w_{a,x}          a=N    a=V
x = Fish                       2      1
x = Sleep                      1      0

F(y,x) = u_{y_1,Start} + w_{y_1,Fish} + u_{y_2,y_1} + w_{y_2,Sleep}

y        exp(F(y,x))
(N,N)    exp(1 + 2 - 2 + 1) = exp(2)
(N,V)    exp(1 + 2 + 1 + 0) = exp(4)
(V,N)    exp(-1 + 1 + 2 + 1) = exp(3)
(V,V)    exp(-1 + 1 - 2 + 0) = exp(-2)

Z(x) = sum of the exp(F(y,x)) column = exp(2) + exp(4) + exp(3) + exp(-2)
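A brute-force check of this example in Python, enumerating all label sequences to compute F(y,x), Z(x), and P(y|x). It is a sketch that hard-codes the parameter values above; the variable names are mine, not the lecture's:

```python
import math
from itertools import product

LABELS = ["N", "V"]
u = {("N", "N"): -2, ("V", "N"): 1,       # u[(a, b)]: transition b -> a
     ("N", "V"): 2,  ("V", "V"): -2,
     ("N", "Start"): 1, ("V", "Start"): -1}
w = {("N", "Fish"): 2, ("V", "Fish"): 1,  # w[(a, word)]: input score
     ("N", "Sleep"): 1, ("V", "Sleep"): 0}

def F(y, x):
    """Unnormalized score: sum of transition + input scores, with y_0 = 'Start'."""
    prev, total = "Start", 0.0
    for y_i, x_i in zip(y, x):
        total += u[(y_i, prev)] + w[(y_i, x_i)]
        prev = y_i
    return total

x = ("Fish", "Sleep")
Z = sum(math.exp(F(y, x)) for y in product(LABELS, repeat=len(x)))
for y in product(LABELS, repeat=len(x)):
    print(y, "F =", F(y, x), "P(y|x) =", math.exp(F(y, x)) / Z)
```

The printed scores match the exp(2), exp(4), exp(3), exp(-2) entries in the table.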

Example (continued): x = "Fish Sleep", y = (N,V)
(Plot: how P(y|x) changes as a single parameter is varied, holding the other parameters fixed.)

Basic Conditional Random Field
Directly models P(y|x) = exp(F(y,x)) / Z(x)
– Discriminative: the CRF spends all of its model capacity on P(y|x), rather than P(x,y)
– Log-linear assumption
– Same number of parameters as an HMM
– A 1st-order sequential version of logistic regression
How to predict? How to train? Extensions?

Predict via Viterbi
argmax_y F(y,x) = argmax_y Σ_i [ u_{y_i, y_{i-1}} + w_{y_i, x_i} ]  (scoring transitions + scoring input features)
– Maintain the best length-k prefix solution ending in each label
– Recursively solve for the length-(k+1) solutions
– Predict using the best length-M solution

Viterbi walkthrough (labels V, D, N; Ŷ_k(T) = best length-k prefix ending in label T):
– Step 1: Ŷ_1(T) is just T itself; store each Ŷ_1(T) and its score F(Ŷ_1(T), x).
– Step 2: for each label T, solve for Ŷ_2(T) by appending T to the best of the stored Ŷ_1 solutions, i.e., maximize over y_1 ∈ {V, D, N}; e.g., Ŷ_2(V) = (N, V). Store each Ŷ_2(T) and F(Ŷ_2(T), x).
– Step 3: likewise build Ŷ_3(T) from the stored Ŷ_2 solutions; e.g., Ŷ_3(V) = (D, N, V). Store each Ŷ_3(T) and F(Ŷ_3(T), x).
– Continue through step M; the prediction is the highest-scoring Ŷ_M(T).
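A sketch of Viterbi for this CRF in Python, using the "Fish Sleep" parameters from the earlier brute-force example (my own variable names, not the lecture's code):

```python
LABELS = ["N", "V"]
u = {("N", "N"): -2, ("V", "N"): 1, ("N", "V"): 2, ("V", "V"): -2,
     ("N", "Start"): 1, ("V", "Start"): -1}
w = {("N", "Fish"): 2, ("V", "Fish"): 1, ("N", "Sleep"): 1, ("V", "Sleep"): 0}

def viterbi(x, labels, u, w):
    """Return (max_y F(y, x), argmax_y) using dynamic programming over prefixes."""
    # best[a] = (score, prefix) of the best length-k prefix ending in label a
    best = {a: (u[(a, "Start")] + w[(a, x[0])], (a,)) for a in labels}
    for x_i in x[1:]:
        new_best = {}
        for a in labels:                      # label at the current position
            score, prefix = max(
                (best[b][0] + u[(a, b)] + w[(a, x_i)], best[b][1] + (a,))
                for b in labels)              # best previous label b
            new_best[a] = (score, prefix)
        best = new_best
    return max(best.values())                 # best (score, label sequence) overall

print(viterbi(("Fish", "Sleep"), LABELS, u, w))   # -> (4.0-ish score, ('N', 'V'))
```

On this example it returns the sequence (N, V) with score 4, matching the exp(4) row of the earlier table.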

Computing P(y|x)
Viterbi doesn't compute P(y|x)
– It just maximizes the numerator F(y,x)
We also need to compute Z(x)
– aka the "Partition Function"

Computing the Partition Function
The naive approach iterates over all y'
– Exponential time: L^M possible y'!
Notation: G_j(a,b) = exp( u_{a,b} + w_{a,x_j} ), the un-normalized score of moving from label b to label a at step j, so that exp(F(y,x)) = ∏_j G_j(y_j, y_{j-1}).

Matrix Semiring
Matrix version of G_j: an (L+1)×(L+1) matrix with entries G_j(a,b) (the extra row/column includes 'Start').
G_{1:2} = G_2 G_1
G_{i:j} = G_j G_{j-1} ⋯ G_{i+1} G_i

Path Counting Interpretation
G_1(a,b):
– L+1 possible start & end locations
– The weight of the path from 'b' to 'a' in step 1
G_{1:2}(a,b), where G_{1:2} = G_2 G_1:
– The total weight of all paths that start in 'b' at the beginning of step 1 and end in 'a' after step 2

Computing the Partition Function
Length-1 (M=1): Z(x) = sum of column 'Start' of G_1
M=2: Z(x) = sum of column 'Start' of G_{1:2} = G_2 G_1
General M: G_{1:M} = G_M G_{M-1} ⋯ G_2 G_1
– Do M computations with (L+1)×(L+1) matrices to compute G_{1:M}
– Z(x) = sum of column 'Start' of G_{1:M}
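A sketch of this computation with NumPy, again using the "Fish Sleep" parameters. The ordering of labels and placing 'Start' as the last row/column are my own conventions:

```python
import numpy as np

labels = ["N", "V"]                      # plus a 'Start' row/column
u = {("N", "N"): -2, ("V", "N"): 1, ("N", "V"): 2, ("V", "V"): -2,
     ("N", "Start"): 1, ("V", "Start"): -1}
w = {("N", "Fish"): 2, ("V", "Fish"): 1, ("N", "Sleep"): 1, ("V", "Sleep"): 0}

def G(j, x):
    """(L+1)x(L+1) matrix with G_j(a,b) = exp(u[a,b] + w[a,x_j]); row a, column b."""
    states = labels + ["Start"]
    M = np.zeros((len(states), len(states)))
    for row, a in enumerate(states):
        for col, b in enumerate(states):
            if a == "Start":
                continue                 # nothing transitions back into 'Start'
            M[row, col] = np.exp(u[(a, b)] + w[(a, x[j])])
    return M

x = ("Fish", "Sleep")
G_1M = np.eye(len(labels) + 1)
for j in range(len(x)):                  # G_{1:M} = G_M ... G_2 G_1
    G_1M = G(j, x) @ G_1M
Z = G_1M[:, -1].sum()                    # sum the 'Start' column
print(Z, np.exp(2) + np.exp(4) + np.exp(3) + np.exp(-2))   # the two should match
```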

Train via Gradient Descent
Similar to Logistic Regression:
– Gradient descent on the negative log likelihood (log loss): -log P(y|x) = -F(y,x) + log Z(x)
(θ is often used to denote all parameters of the model)
The first term is easy to differentiate:
– Recall F(y,x) = Σ_i [ u_{y_i, y_{i-1}} + w_{y_i, x_i} ], which is linear in the parameters
The log partition function log Z(x) is harder to differentiate!

Differentiating the Log Partition Function
Lots of chain rule & algebra: apply the definition of P(y'|x), then marginalize over all y'; the resulting expectations are computed with Forward-Backward!
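A worked version of that chain of steps (standard calculus, consistent with the definitions of F and Z above):

```latex
\begin{align*}
\frac{\partial}{\partial \theta} \log Z(x)
  &= \frac{1}{Z(x)} \sum_{y'} \frac{\partial}{\partial \theta} \exp\!\big(F(y', x)\big)
     && \text{(chain rule)} \\
  &= \sum_{y'} \frac{\exp\!\big(F(y', x)\big)}{Z(x)} \,
     \frac{\partial F(y', x)}{\partial \theta}
     && \text{(chain rule again)} \\
  &= \sum_{y'} P(y' \mid x) \, \frac{\partial F(y', x)}{\partial \theta}
     && \text{(definition of } P(y' \mid x) \text{)} \\
  &= \mathbb{E}_{y' \sim P(\cdot \mid x)}\!\left[ \frac{\partial F(y', x)}{\partial \theta} \right].
\end{align*}
```

Since F(y',x) decomposes over positions, this expectation reduces to per-position (pairwise) marginals such as P(y_i = a, y_{i-1} = b | x), which is exactly what Forward-Backward computes.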

Optimality Condition
Consider one parameter: setting its gradient to zero gives the optimality condition
Frequency counts = conditional expectation on the training data!
– Holds for each component of the model
– Each component is a "log-linear" model and requires gradient descent

Forward-Backward for CRFs
α_j(a): total weight of all length-j prefixes ending in label a (computed forward)
β_j(b): total weight of all completions from label b at step j to the end of the sequence (computed backward)

Path Interpretation
α_j(a) is the total weight of all paths from 'Start' to label a in step j; e.g., α_3(V) = the total weight of paths from 'Start' to 'V' in the 3rd step, where each path's weight is a product of edge weights such as G_1(N,'Start') × G_2(D,N) × …
β just does it backwards.

Matrix Formulation
α_j = G_j α_{j-1}  (e.g., α_2 = G_2 α_1)
β_j = (G_{j+1})^T β_{j+1}  (e.g., β_5 = (G_6)^T β_6)
Use matrices! Fast to compute, easy to implement.
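A sketch of these recursions in NumPy on the "Fish Sleep" example. The indexing conventions are mine, and the position marginals at the end use the standard α·β combination, which the slides leave implicit:

```python
import numpy as np

labels = ["N", "V"]
u = {("N", "N"): -2, ("V", "N"): 1, ("N", "V"): 2, ("V", "V"): -2,
     ("N", "Start"): 1, ("V", "Start"): -1}
w = {("N", "Fish"): 2, ("V", "Fish"): 1, ("N", "Sleep"): 1, ("V", "Sleep"): 0}
x = ("Fish", "Sleep")
L, M = len(labels), len(x)

def G(j):
    """L x L matrix: G_j[a, b] = exp(u[a,b] + w[a, x_j]) for labels a (row), b (col)."""
    return np.array([[np.exp(u[(a, b)] + w[(a, x[j])]) for b in labels]
                     for a in labels])

# Forward: alpha[0] folds in the 'Start' transition; then alpha_j = G_j alpha_{j-1}.
alpha = [np.array([np.exp(u[(a, "Start")] + w[(a, x[0])]) for a in labels])]
for j in range(1, M):
    alpha.append(G(j) @ alpha[-1])

# Backward: beta_M = 1; beta_j = (G_{j+1})^T beta_{j+1}.
beta = [np.ones(L) for _ in range(M)]
for j in range(M - 2, -1, -1):
    beta[j] = G(j + 1).T @ beta[j + 1]

Z = alpha[-1].sum()                       # partition function
for j in range(M):
    marginal = alpha[j] * beta[j] / Z     # P(y_j = a | x) for each label a
    print(f"P(y_{j+1} | x):", dict(zip(labels, marginal)))
```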

Path Interpretation: Forward-Backward vs Viterbi
Forward (and Backward) sums over all paths
– Computes the (un-normalized) expectation of reaching each state
– E.g., the total un-normalized probability of y_3 = Verb over all possible y_{1:2}
Viterbi only keeps the best path
– Computes the best possible path reaching each state
– E.g., the single highest-probability setting of y_{1:3} such that y_3 = Verb

Summary: Training CRFs
Similar optimality condition to HMMs:
– Match the frequency counts of the model components!
– Except HMMs can set the model directly from counts
– CRFs need gradient descent to match the counts
Run Forward-Backward to compute the expectations
– Just like HMMs
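To make the "counts minus expectations" gradient concrete, here is a hedged sketch of one gradient step for the label-to-label transition parameters u; it continues the forward-backward sketch above (reusing its labels, M, alpha, beta, Z, G, and u), and the pairwise marginal formula and learning rate are my own additions, not from the slides. The Start transitions and the input weights w would be handled analogously:

```python
def pairwise_marginal(j, a_idx, b_idx, alpha, beta, Z):
    """P(y_j = a, y_{j-1} = b | x) = alpha_{j-1}(b) * G_j(a,b) * beta_j(a) / Z."""
    return alpha[j - 1][b_idx] * G(j)[a_idx, b_idx] * beta[j][a_idx] / Z

def grad_u(y_true, alpha, beta, Z):
    """Gradient of -log P(y_true | x) w.r.t. each transition parameter u[(a, b)]."""
    grad = {}
    for ai, a in enumerate(labels):
        for bi, b in enumerate(labels):
            count = sum(1 for j in range(1, M)
                        if y_true[j] == a and y_true[j - 1] == b)
            expectation = sum(pairwise_marginal(j, ai, bi, alpha, beta, Z)
                              for j in range(1, M))
            grad[(a, b)] = -count + expectation   # zero when counts match expectations
    return grad

eta = 0.1                                          # learning rate (illustrative)
g = grad_u(("N", "V"), alpha, beta, Z)
for key in g:
    u[key] -= eta * g[key]                         # one gradient-descent step
print(g)
```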

More General CRFs
New: F(y,x) = Σ_j θ^T φ_j(y_j, y_{j-1} | x)
Old: F(y,x) = Σ_j [ u_{y_j, y_{j-1}} + w_{y_j, x_j} ]
Reduction: θ is the "flattened" weight vector containing all of the u and w entries, with φ_j(a,b|x) the corresponding indicator features. We can then extend φ_j(a,b|x) with richer features.

More General CRFs
1st-order sequence CRFs: P(y|x) = exp( Σ_j θ^T φ_j(y_j, y_{j-1} | x) ) / Z(x), with Z(x) summing the numerator over all y'

Example
φ_j(a, b | x) is built from various attributes of x (around position j), stacked once per possible value of the label y_j: the full vector is all 0's except the one sub-vector corresponding to the active label. The basic formulation only had the first part of these features. (A sketch of this construction follows below.)
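A small sketch of this kind of stacked feature construction. The specific attributes are made-up placeholders, and indexing the sub-vector by the current label (rather than the previous one) is my assumption:

```python
import numpy as np

LABELS = ["N", "V", "D"]

def attributes(x, j):
    """Made-up attributes of the input around position j."""
    word = x[j]
    return np.array([1.0,                          # bias
                     float(word.istitle()),        # capitalized?
                     float(word.endswith("s"))])   # simple suffix feature

def phi(j, a, b, x):
    """Stack the attribute vector into the slot for label a; zeros elsewhere.
    (Transition features for the pair (a, b) could be appended the same way.)"""
    att = attributes(x, j)
    vec = np.zeros(len(LABELS) * len(att))
    slot = LABELS.index(a)
    vec[slot * len(att):(slot + 1) * len(att)] = att
    return vec

x = ("Fish", "Sleep")
print(phi(0, "N", "Start", x))   # non-zero only in the 'N' sub-vector
```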

Summary: CRFs
"Log-linear" 1st-order sequence model
– Multiclass logistic regression + 1st-order transition components
– Discriminative version of HMMs
– Predict using Viterbi; train using gradient descent
– Need Forward-Backward to differentiate the partition function

Next Week
Structural SVMs
– Hinge loss for sequence prediction
More general structured prediction
Next recitation:
– Optimizing non-differentiable functions (Lasso)
– Accelerated gradient descent
Homework 2 due in 12 days
– Tuesday, Feb 3rd at 2pm via Moodle