Conditional Random Fields


Conditional Random Fields William W. Cohen CALD

Announcements. Upcoming assignments: Today: Sha & Pereira, Lafferty et al. Mon 2/23: Klein & Manning, Toutanova et al. Wed 2/25: no writeup due. Mon 3/1: no writeup due. Wed 3/3: project proposal due (personnel + 1-2 pages). Spring break week, no class.

Review: motivation for CMMs. Ideally we would like to use many, arbitrary, overlapping features of words: identity of the word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; … [Figure: HMM chain with hidden states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}; example observation features: is "Wisniewski", part of noun phrase, ends in "-ski"]

Motivation for CMMs. (Same feature list and figure as the previous slide.) Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state; a sketch of the resulting per-state distribution follows.
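As a sketch of that idea (my notation, not the slide's), the maxent model replaces the HMM's generative emission with a locally normalized conditional distribution over the next state given the previous state and the current observation:

    P(s_t \mid s_{t-1}, o_t) = \frac{\exp\left(\sum_k \lambda_k\, f_k(s_t, s_{t-1}, o_t)\right)}{\sum_{s'} \exp\left(\sum_k \lambda_k\, f_k(s', s_{t-1}, o_t)\right)}

The features f_k can be arbitrary, overlapping functions of the observation, which is exactly what the feature wish-list above asks for; the per-state denominator is the local normalization that the label-bias discussion below takes issue with.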

Implications of the model. Does this do what we want? Q: does Y[i-1] depend on X[i+1]? Recall the directed-model independence rule: "a node is conditionally independent of its non-descendants given its parents".

Label Bias Problem. Consider this MEMM:
P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
In the training data, label value 2 is the only label value observed after label value 1, therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x. Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri). However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro); per-state normalization does not allow the required expectation.

Label Bias Problem. Consider this MEMM, and enough training data to perfectly model it: Pr(0123|rib) = 1 and Pr(0453|rob) = 1. Then:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1

How important is label bias? It could be avoided in this case by changing the structure. Our models are always wrong – is this "wrongness" a problem? See Klein & Manning's paper for next week.

Another view of label bias [Sha & Pereira] So what’s the alternative?

Review of maxent

Review of maxent/MEMM/CMMs

Details on CMMs

From CMMs to CRFs. Recall why we're unhappy: we don't want local normalization. The new model replaces per-state normalization with a single global normalizer (a sketch of the contrast follows).
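As a rough contrast in the linear-chain setting (my notation, not the slides' formulas): the CMM multiplies locally normalized factors, while the CRF uses one observation-dependent global normalizer Z(x):

    P_{\mathrm{CMM}}(y \mid x) = \prod_t P(y_t \mid y_{t-1}, x_t)

    P_{\mathrm{CRF}}(y \mid x) = \frac{1}{Z(x)} \prod_t \exp\left(\sum_k \lambda_k\, f_k(y_{t-1}, y_t, x, t)\right)

Because nothing forces each position's factor to sum to one on its own, probability mass is no longer conserved at every state, which is what removes the label-bias effect.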

What does the new model look like? What's independent? [Figure: undirected chain with label nodes y1, y2, y3, each connected to its own observation node x1, x2, x3]

What does the new model look like? What's independent now? [Figure: undirected chain over labels y1, y2, y3, all connected to a single observation node x]

Hammersley-Clifford. For positive distributions P(x1, …, xn) with the Markov property Pr(xi | x1, …, xi-1, xi+1, …, xn) = Pr(xi | Neighbors(xi)), and more generally Pr(A | B, S) = Pr(A | S) where A, B are sets of nodes and S is a set that separates A and B: P can be written as a normalized product of "clique potentials". So this is very general: any Markov distribution can be written in this form (modulo nits like "positive distribution"); a sketch of the factorization is below.
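Written out explicitly (standard form, not shown on the slide), the factorization the theorem guarantees is, over the cliques C of the graph G:

    P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{C \in \mathrm{cliques}(G)} \psi_C(x_C), \qquad Z = \sum_{x} \prod_{C} \psi_C(x_C)

with strictly positive clique potentials \psi_C.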

Definition of CRFs. X is a random variable over data sequences to be labeled; Y is a random variable over corresponding label sequences.
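The definition as given in Lafferty, McCallum & Pereira (reconstructed here, since the slide lists only the variables) is: (X, Y) is a conditional random field when, conditioned on X, the label variables Y_v obey the Markov property with respect to the graph G = (V, E):

    p(Y_v \mid X,\, Y_w,\, w \neq v) = p(Y_v \mid X,\, Y_w,\, w \sim v)

where w \sim v means that w and v are neighbors in G.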

Example of CRFs

Graphical comparison among HMMs, MEMMs and CRFs. [Figure: three graphical-model diagrams, one each for HMM, MEMM, and CRF]

Lafferty et al notation. If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is (by the fundamental theorem of random fields) the expression sketched below, where: x is a data sequence; y is a label sequence; v is a vertex from the vertex set V, the set of label random variables; e is an edge from the edge set E over V; the fk and gk are given and fixed (gk is a Boolean vertex feature, fk is a Boolean edge feature); k indexes the features; the λk and μk are parameters to be estimated; y|e is the set of components of y defined by edge e; and y|v is the set of components of y defined by vertex v.
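The formula itself appeared on the slide only as an image; reconstructed from the Lafferty, McCallum & Pereira paper, it reads (up to notation):

    p_\theta(y \mid x) \;\propto\; \exp\left( \sum_{e \in E,\, k} \lambda_k\, f_k(e, y|_e, x) \;+\; \sum_{v \in V,\, k} \mu_k\, g_k(v, y|_v, x) \right)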

Conditional Distribution (cont'd). CRFs use the observation-dependent normalization Z(x) for the conditional distributions, where Z(x) is a normalization over the data sequence x. Learning: Lafferty et al's IIS-based method is rather inefficient; gradient-based methods are faster. The trickiest bit is computing the normalization, which is a sum over exponentially many label vectors y (see the sketch below).
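Written out in the same notation as above (my transcription, not the slide's image), the observation-dependent normalizer is:

    Z(x) = \sum_{y} \exp\left( \sum_{e \in E,\, k} \lambda_k\, f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k\, g_k(v, y|_v, x) \right)

The sum ranges over all label sequences y, which is exponential if done naively; for chain-structured graphs it can be computed in polynomial time by the matrix/forward-backward computation on the following slides.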

CRF learning – from Sha & Pereira

CRF learning – from Sha & Pereira

CRF learning – from Sha & Pereira. Something like forward-backward. Idea: define a matrix of y, y' "affinities" at stage i: Mi[y, y'] = "unnormalized probability" of a transition from y to y' at stage i. Then Mi * Mi+1 gives the "unnormalized probability" of any path through stages i and i+1 (a small numeric sketch follows).
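A minimal sketch of the idea (hypothetical function names and a toy score; not Sha & Pereira's code, and it omits the explicit start/stop states they use): chaining the stage matrices Mi computes the normalizer Z(x) as the forward pass.

    import numpy as np

    def stage_matrices(x, labels, score):
        # One |labels| x |labels| matrix per position i, where
        # M_i[y_prev, y] = exp(score(y_prev, y, x, i)) is the unnormalized
        # "affinity" of a transition from y_prev to y at stage i.
        T, L = len(x), len(labels)
        Ms = np.empty((T, L, L))
        for i in range(T):
            for a, y_prev in enumerate(labels):
                for b, y in enumerate(labels):
                    Ms[i, a, b] = np.exp(score(y_prev, y, x, i))
        return Ms

    def partition_function(Ms):
        # Z(x) = sum over all label paths of the product of stage affinities,
        # computed by chaining the matrix products (the "forward" pass).
        alpha = np.ones(Ms.shape[1])   # uniform start vector (no explicit start state)
        for M in Ms:
            alpha = alpha @ M          # alpha_i = alpha_{i-1} M_i
        return alpha.sum()

    # Toy example: two labels and a score that just favors self-transitions.
    labels = ["name", "nonName"]
    score = lambda y_prev, y, x, i: 1.0 if y_prev == y else 0.0
    Z = partition_function(stage_matrices(["rob", "ate", "rib"], labels, score))
    print(Z)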

[Figure: chain-structured graphs over labels y1, y2, y3 and observation x]

Forward backward ideas. [Figure: trellis over three positions with states "name" and "nonName" at each position; edges between adjacent positions carry weights labeled b, c, d, f, g, h]

CRF learning – from Sha & Pereira

CRF learning – from Sha & Pereira

Sha & Pereira results CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

Sha & Pereira results (times in minutes, 375k examples).

POS tagging. Experiments in Lafferty et al compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging. Each word in a given input sentence must be labeled with one of 45 syntactic tags. They then add a small set of orthographic features: whether a spelling begins with a number or an upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies. oov = out-of-vocabulary (not observed in the training set). A sketch of such feature functions is below.
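As an illustration only (the function and feature names are mine, not from the paper), the orthographic features listed above could be extracted like this:

    SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

    def orthographic_features(word):
        # Binary orthographic features of the kind listed on the slide.
        feats = {
            "starts_with_digit": word[:1].isdigit(),
            "starts_with_upper": word[:1].isupper(),
            "contains_hyphen": "-" in word,
        }
        for suffix in SUFFIXES:
            feats["ends_with_" + suffix] = word.endswith(suffix.lstrip("-"))
        return feats

    print(orthographic_features("Wisniewski"))
    print(orthographic_features("100-acre"))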

POS tagging vs MXPost