Conditional Random Fields


Conditional Random Fields William W. Cohen CALD

Announcements. Upcoming assignments: Today: Sha & Pereira, Lafferty et al. Mon 2/23: Klein & Manning, Toutanova et al. Wed 2/25: no writeup due. Mon 3/1: no writeup due. Wed 3/3: project proposal due (personnel + 1-2 pages). Spring break week, no class.

Review: motivation for CMMs. Ideally we would like to use many, arbitrary, overlapping features of words: identity of the word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; … [Figure: HMM chain with hidden states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}; example observation features: is "Wisniewski", part of noun phrase, ends in "-ski"]

Motivation for CMMs. (Same feature list and figure as the previous slide.) Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state; a sketch of the resulting per-state distribution follows.
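As a sketch of that idea (my notation, not the slide's), the maxent model replaces the HMM's generative emission with a locally normalized conditional distribution over the next state given the previous state and the current observation:

    P(s_t \mid s_{t-1}, o_t) = \frac{\exp\left(\sum_k \lambda_k\, f_k(s_t, s_{t-1}, o_t)\right)}{\sum_{s'} \exp\left(\sum_k \lambda_k\, f_k(s', s_{t-1}, o_t)\right)}

The features f_k can be arbitrary, overlapping functions of the observation, which is exactly what the feature wish-list above asks for; the per-state denominator is the local normalization that the label-bias discussion below takes issue with.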

Implications of the model. Does this do what we want? Q: does Y[i-1] depend on X[i+1]? Recall the directed-model independence rule: "a node is conditionally independent of its non-descendants given its parents".

Label Bias Problem. Consider this MEMM:
P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
In the training data, label value 2 is the only label value observed after label value 1, therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x. Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri). However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro); per-state normalization does not allow the required expectation.

Label Bias Problem. Consider this MEMM, and enough training data to perfectly model it: Pr(0123|rib) = 1 and Pr(0453|rob) = 1. Then:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1

How important is label bias? It could be avoided in this case by changing the structure. Our models are always wrong – is this "wrongness" a problem? See Klein & Manning's paper for next week.

Another view of label bias [Sha & Pereira] So what’s the alternative?

Review of maxent

Review of maxent/MEMM/CMMs

Details on CMMs

From CMMs to CRFs. Recall why we're unhappy: we don't want local normalization. The new model replaces per-state normalization with a single global normalizer (a sketch of the contrast follows).
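As a rough contrast in the linear-chain setting (my notation, not the slides' formulas): the CMM multiplies locally normalized factors, while the CRF uses one observation-dependent global normalizer Z(x):

    P_{\mathrm{CMM}}(y \mid x) = \prod_t P(y_t \mid y_{t-1}, x_t)

    P_{\mathrm{CRF}}(y \mid x) = \frac{1}{Z(x)} \prod_t \exp\left(\sum_k \lambda_k\, f_k(y_{t-1}, y_t, x, t)\right)

Because nothing forces each position's factor to sum to one on its own, probability mass is no longer conserved at every state, which is what removes the label-bias effect.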

What does the new model look like? What's independent? [Figure: undirected chain with label nodes y1, y2, y3, each connected to its own observation node x1, x2, x3]

What does the new model look like? What's independent now? [Figure: undirected chain over labels y1, y2, y3, all connected to a single observation node x]

Hammersley-Clifford. For positive distributions P(x1, …, xn) with the Markov property Pr(xi | x1, …, xi-1, xi+1, …, xn) = Pr(xi | Neighbors(xi)), and more generally Pr(A | B, S) = Pr(A | S) where A, B are sets of nodes and S is a set that separates A and B: P can be written as a normalized product of "clique potentials". So this is very general: any Markov distribution can be written in this form (modulo nits like "positive distribution"); a sketch of the factorization is below.
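Written out explicitly (standard form, not shown on the slide), the factorization the theorem guarantees is, over the cliques C of the graph G:

    P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{C \in \mathrm{cliques}(G)} \psi_C(x_C), \qquad Z = \sum_{x} \prod_{C} \psi_C(x_C)

with strictly positive clique potentials \psi_C.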

Definition of CRFs. X is a random variable over data sequences to be labeled; Y is a random variable over corresponding label sequences.
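The definition as given in Lafferty, McCallum & Pereira (reconstructed here, since the slide lists only the variables) is: (X, Y) is a conditional random field when, conditioned on X, the label variables Y_v obey the Markov property with respect to the graph G = (V, E):

    p(Y_v \mid X,\, Y_w,\, w \neq v) = p(Y_v \mid X,\, Y_w,\, w \sim v)

where w \sim v means that w and v are neighbors in G.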

Example of CRFs

Graphical comparison among HMMs, MEMMs and CRFs. [Figure: three graphical-model diagrams, one each for HMM, MEMM, and CRF]

Lafferty et al notation. If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is (by the fundamental theorem of random fields) the expression sketched below, where: x is a data sequence; y is a label sequence; v is a vertex from the vertex set V, the set of label random variables; e is an edge from the edge set E over V; the fk and gk are given and fixed (gk is a Boolean vertex feature, fk is a Boolean edge feature); k indexes the features; the λk and μk are parameters to be estimated; y|e is the set of components of y defined by edge e; and y|v is the set of components of y defined by vertex v.
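The formula itself appeared on the slide only as an image; reconstructed from the Lafferty, McCallum & Pereira paper, it reads (up to notation):

    p_\theta(y \mid x) \;\propto\; \exp\left( \sum_{e \in E,\, k} \lambda_k\, f_k(e, y|_e, x) \;+\; \sum_{v \in V,\, k} \mu_k\, g_k(v, y|_v, x) \right)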

Conditional Distribution (cont'd). CRFs use the observation-dependent normalization Z(x) for the conditional distributions, where Z(x) is a normalization over the data sequence x. Learning: Lafferty et al's IIS-based method is rather inefficient; gradient-based methods are faster. The trickiest bit is computing the normalization, which is a sum over exponentially many label vectors y (see the sketch below).
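Written out in the same notation as above (my transcription, not the slide's image), the observation-dependent normalizer is:

    Z(x) = \sum_{y} \exp\left( \sum_{e \in E,\, k} \lambda_k\, f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k\, g_k(v, y|_v, x) \right)

The sum ranges over all label sequences y, which is exponential if done naively; for chain-structured graphs it can be computed in polynomial time by the matrix/forward-backward computation on the following slides.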

CRF learning – from Sha & Pereira

CRF learning – from Sha & Pereira

CRF learning – from Sha & Pereira. Something like forward-backward. Idea: define a matrix of y, y' "affinities" at stage i: Mi[y, y'] = "unnormalized probability" of a transition from y to y' at stage i. Then Mi * Mi+1 gives the "unnormalized probability" of any path through stages i and i+1 (a small numeric sketch follows).
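A minimal sketch of the idea (hypothetical function names and a toy score; not Sha & Pereira's code, and it omits the explicit start/stop states they use): chaining the stage matrices Mi computes the normalizer Z(x) as the forward pass.

    import numpy as np

    def stage_matrices(x, labels, score):
        # One |labels| x |labels| matrix per position i, where
        # M_i[y_prev, y] = exp(score(y_prev, y, x, i)) is the unnormalized
        # "affinity" of a transition from y_prev to y at stage i.
        T, L = len(x), len(labels)
        Ms = np.empty((T, L, L))
        for i in range(T):
            for a, y_prev in enumerate(labels):
                for b, y in enumerate(labels):
                    Ms[i, a, b] = np.exp(score(y_prev, y, x, i))
        return Ms

    def partition_function(Ms):
        # Z(x) = sum over all label paths of the product of stage affinities,
        # computed by chaining the matrix products (the "forward" pass).
        alpha = np.ones(Ms.shape[1])   # uniform start vector (no explicit start state)
        for M in Ms:
            alpha = alpha @ M          # alpha_i = alpha_{i-1} M_i
        return alpha.sum()

    # Toy example: two labels and a score that just favors self-transitions.
    labels = ["name", "nonName"]
    score = lambda y_prev, y, x, i: 1.0 if y_prev == y else 0.0
    Z = partition_function(stage_matrices(["rob", "ate", "rib"], labels, score))
    print(Z)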

[Figure: chain-structured graphs over labels y1, y2, y3 and observation x]

Forward backward ideas. [Figure: trellis over three positions with states "name" and "nonName" at each position; edges between adjacent positions carry weights labeled b, c, d, f, g, h]

CRF learning – from Sha & Pereira

CRF learning – from Sha & Pereira

Sha & Pereira results CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

Sha & Pereira results (times in minutes, 375k examples).

POS tagging. Experiments in Lafferty et al compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging. Each word in a given input sentence must be labeled with one of 45 syntactic tags. They then add a small set of orthographic features: whether a spelling begins with a number or an upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies. oov = out-of-vocabulary (not observed in the training set). A sketch of such feature functions is below.
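As an illustration only (the function and feature names are mine, not from the paper), the orthographic features listed above could be extracted like this:

    SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

    def orthographic_features(word):
        # Binary orthographic features of the kind listed on the slide.
        feats = {
            "starts_with_digit": word[:1].isdigit(),
            "starts_with_upper": word[:1].isupper(),
            "contains_hyphen": "-" in word,
        }
        for suffix in SUFFIXES:
            feats["ends_with_" + suffix] = word.endswith(suffix.lstrip("-"))
        return feats

    print(orthographic_features("Wisniewski"))
    print(orthographic_features("100-acre"))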

POS tagging vs MXPost