Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen CALD.


Announcements Confused about what to write up? –Mon 2/9: Ratnaparkhi & Freitag et al –Wed 2/11: Borthwick et al & Mikheev –Mon 2/16: no class (Presidents' Day) –Wed 2/18: Sha & Pereira, Lafferty et al –Mon 2/23: Klein & Manning, Toutanova et al –Wed 2/25: no writeup due –Mon 3/1: no writeup due –Wed 3/3: project proposal due: personnel page –Spring break week, no class

Review of review Multinomial HMMs are a sequential version of naïve Bayes. One way to drop the independence assumption: use a maxent classifier instead of NB, and a conditional model instead of a joint one.
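For concreteness (these equations are not on the slide, just the standard formulation the bullet refers to): an HMM models the joint distribution

$\Pr(x, y) = \prod_{i=1}^{L} \Pr(y_i \mid y_{i-1}) \, \Pr(x_i \mid y_i)$,

where the emission term $\Pr(x_i \mid y_i)$ plays the role of the naïve Bayes model of a token given its tag. The conditional Markov model instead models

$\Pr(y \mid x) = \prod_{i=1}^{L} \Pr(y_i \mid y_{i-1}, x_i)$,

with each local factor a maxent classifier, so arbitrary overlapping features of $x_i$ can be used without independence assumptions.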

From NB to Maxent

Learning: set the alpha parameters to maximize this: the maximum-likelihood model of the data, given that we're using the same functional form as NB. It turns out this is the same as maximizing the entropy of p(y|x) over all distributions consistent with the observed feature expectations.
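The objective referred to above was shown as a slide image; the standard log-linear form it denotes is (reconstructed, not verbatim from the slide):

$\Pr(y \mid x) = \frac{1}{Z(x)} \exp\big(\sum_j \alpha_j f_j(x, y)\big)$, with $Z(x) = \sum_{y'} \exp\big(\sum_j \alpha_j f_j(x, y')\big)$.

Learning chooses the $\alpha_j$ to maximize the conditional log-likelihood $\sum_k \log \Pr(y^{(k)} \mid x^{(k)})$ of the training data; the dual of this problem is maximizing the entropy of $p(y \mid x)$ subject to the constraint that expected feature counts match their empirical counts.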

MaxEnt Comments Functional form is the same as naïve Bayes (a log-linear model). Numerical issues & smoothing are important. All training methods are iterative. Classification performance can be competitive with the state of the art. Optimizes Pr(y|x), not error rate.

What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: identity of word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, … (e.g., "Wisniewski" is part of a noun phrase and ends in "-ski"). [Diagram: HMM states S_t-1, S_t, S_t+1 with observations O_t-1, O_t, O_t+1.]

What is a symbol? Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations. [Same diagram and feature list as above.]

What is a symbol? Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state. [Same diagram and feature list as above.]

What is a symbol? Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history. [Same diagram and feature list as above.]
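A minimal sketch of that idea (the function and feature names are illustrative, not from the slides): extract overlapping features of the current token plus the previous tag, and feed them to a maxent classifier over states.

import re  # not strictly needed here; shown for extending with regex features

def token_features(tokens, i, prev_tag):
    """Overlapping, non-independent features for position i."""
    w = tokens[i]
    feats = {
        "word=" + w.lower(): 1,
        "prev_tag=" + prev_tag: 1,
        "suffix3=" + w[-3:].lower(): 1,
        "is_capitalized": int(w[:1].isupper()),
        "ends_in_ski": int(w.lower().endswith("ski")),
    }
    # any other evidence can be added freely: dictionary membership,
    # formatting (bold, indented, hyperlink anchor), WordNet class, window words, ...
    return feats

# usage: token_features("Mr Wisniewski arrived".split(), 1, "NNP")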

Ratnaparkhi’s MXPOST Sequential learning problem: predict the POS tags of words. Uses the MaxEnt model described above with a rich feature set. To smooth, discard features occurring < 10 times.
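The count-cutoff smoothing amounts to dropping rare features before training; a sketch (threshold of 10 as in the slide, the helper itself is made up, not Ratnaparkhi's code):

from collections import Counter

def prune_rare_features(feature_dicts, min_count=10):
    """Keep only features observed at least min_count times in the training data."""
    counts = Counter(f for fd in feature_dicts for f in fd)
    keep = {f for f, c in counts.items() if c >= min_count}
    return [{f: v for f, v in fd.items() if f in keep} for fd in feature_dicts]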

MXPOST

MXPOST: learning & inference. Learning: GIS (Generalized Iterative Scaling); feature selection by count cutoff.

MXPOST inference (Adwait's trick): during the beam search over tag sequences, consider only the tag extensions suggested by a tag dictionary.
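A rough sketch of that inference idea: beam search over tag sequences, extending each hypothesis only with tags the dictionary allows for the next word (score_fn and tag_dict are placeholders standing in for the trained classifier and the dictionary, not Ratnaparkhi's actual code):

def beam_search(tokens, all_tags, tag_dict, score_fn, beam_size=5):
    """score_fn(tokens, i, prev_tags, tag) -> log Pr(tag | history, tokens);
    tag_dict maps a known word to its allowed tags."""
    beam = [(0.0, [])]  # (log-probability, tag sequence so far)
    for i, w in enumerate(tokens):
        allowed = tag_dict.get(w, all_tags)  # only dictionary-suggested extensions for known words
        candidates = [
            (logp + score_fn(tokens, i, tags, t), tags + [t])
            for logp, tags in beam
            for t in allowed
        ]
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beam[0][1]  # highest-scoring complete tag sequence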

MXPOST results State-of-the-art accuracy (for 1996). The same approach was used successfully for several other sequential classification steps of a stochastic parser (also state of the art), and for NER by Borthwick, Malouf, Collins, Manning, and others.

Alternative inference

Finding the most probable path: the Viterbi algorithm (for HMMs). Define $v_k(i)$ to be the probability of the most probable path accounting for the first $i$ characters of $x$ and ending in state $k$ (i.e., ending with tag $k$). We want to compute $v_{end}(L)$, the probability of the most probable path accounting for all of the sequence and ending in the end state. $v_k(i)$ can be defined recursively, so dynamic programming can find it efficiently.

Finding the most probable path: the Viterbi algorithm for HMMs. Initialization: $v_{begin}(0) = 1$, and $v_k(0) = 0$ for every other state $k$.

The Viterbi algorithm for HMMs. Recursion for emitting states ($i = 1 \dots L$): $v_k(i) = e_k(x_i) \max_j [\, v_j(i-1) \, a_{jk} \,]$, where $e_k(x_i)$ is the emission probability and $a_{jk}$ the transition probability.

The Viterbi algorithm for HMMs and MaxEnt taggers. Recursion for emitting states ($i = 1 \dots L$): replace the emission and transition terms with the conditional model, $v_k(i) = \max_j [\, v_j(i-1) \cdot \Pr(y_i = k \mid y_{i-1} = j, x_i) \,]$, where $j$ ranges over the previous tag and $x_i$ is the $i$-th token.
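A small sketch of this recursion with the maxent local model plugged in (local_prob is a stand-in for the trained classifier conditioned on the previous tag and the current token; it is not code from the lecture):

import math

def viterbi(tokens, tags, local_prob):
    """local_prob(prev_tag, token) -> dict mapping tag -> Pr(tag | prev_tag, token)."""
    # v[k] = log-probability of the best path over tokens[0..i] ending in tag k
    v = {k: math.log(local_prob("<s>", tokens[0]).get(k, 1e-12)) for k in tags}
    backptrs = []
    for x in tokens[1:]:
        new_v, ptr = {}, {}
        for k in tags:
            scores = {j: v[j] + math.log(local_prob(j, x).get(k, 1e-12)) for j in tags}
            best_j = max(scores, key=scores.get)
            new_v[k], ptr[k] = scores[best_j], best_j
        v = new_v
        backptrs.append(ptr)
    # trace back from the best final tag
    best = max(v, key=v.get)
    path = [best]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))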

MEMMs (Freitag & McCallum) Basic difference from ME tagging: –ME tagging: the previous state is a feature of a single MaxEnt classifier –MEMM: build a separate MaxEnt classifier for each state. You can build any HMM architecture you want, e.g., parallel nested HMMs, etc. The data is fragmented: examples where the previous tag is "proper noun" give no information about learning tags when the previous tag is "noun" –Mostly a difference in viewpoint – it is easier to see the parallels to HMMs (see the sketch below)
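To make the "separate classifier per state" point concrete, a schematic sketch (the training interface is hypothetical, not from the paper) of how the MEMM fragments the training data, versus a single-classifier ME tagger:

def train_memm(examples, train_maxent):
    """examples: list of (prev_state, features, state).
    Returns one maxent classifier per previous state."""
    by_prev = {}
    for prev, feats, state in examples:
        by_prev.setdefault(prev, []).append((feats, state))
    # each classifier only ever sees examples sharing its previous state,
    # which is why the data is fragmented compared with ME tagging
    return {prev: train_maxent(data) for prev, data in by_prev.items()}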

MEMM task: FAQ parsing

MEMM features

MEMMs

Borthwick et al: MENE system Much like MXPOST, with some tricks for NER: –4 tags per field: x_start, x_continue, x_end, x_unique –Features: section features; tokens in a window; lexical features of tokens in the window; dictionary features of tokens (is the token a firstName?); external-system features of tokens (is this a NetOwl_company_start? a proteus_person_unique?) –Smooth by discarding low-count features –No history: Viterbi search is used to find the best consistent tag sequence (e.g., no continue without a start)
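The consistency constraint on the four tags per field can be expressed as a simple transition check used to rule out illegal moves during the Viterbi search; a sketch (tag-name conventions follow the slide, the function itself is illustrative):

def legal_transition(prev, cur):
    """Legal moves in the x_start / x_continue / x_end / x_unique scheme (plus 'other')."""
    if prev.endswith("_start") or prev.endswith("_continue"):
        field = prev.rsplit("_", 1)[0]
        # an open field may only be continued or closed
        return cur in (field + "_continue", field + "_end")
    # after x_end, x_unique, or 'other', a new field may start, or a unique/other token may follow
    return cur == "other" or cur.endswith("_start") or cur.endswith("_unique")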

Dictionaries in MENE

MENE results (dry run)

MENE learning curves

Largest U.S. Cable Operator Makes Bid for Walt Disney By ANDREW ROSS SORKIN The Comcast Corporation, the largest cable television operator in the United States, made a $54.1 billion unsolicited takeover bid today for The Walt Disney Company, the storied family entertainment colossus. If successful, Comcast's audacious bid would once again reshape the entertainment landscape, creating a new media behemoth that would combine the power of Comcast's powerful distribution channels to some 21 million subscribers in the nation with Disney's vast library of content and production assets. Those include its ABC television network, ESPN and other cable networks, and the Disney and Miramax movie studios. [Slide callouts mark examples of short names vs. longer names.]

LTG system Another MUC-7 competitor. Hand-coded rules for "easy" cases (amounts, etc.). A process of repeated tagging and "matching" for hard cases: –Sure-fire (high-precision) rules for names where the type is clear ("Phillip Morris, Inc." – "The Walt Disney Company") –Partial matches to the sure-fire rules are filtered by a maxent classifier (candidate filtering) using contextual information, etc. –Higher-recall rules, avoiding conflicts with the partial-match output ("Phillip Morris announced today…" – "Disney's …") –A final partial-match & filter step on titles, with a different learned filter. Exploits discourse/context information.
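A schematic of the cascade described above (the rule sets, candidate generator, and filter are all placeholders passed in as callables; this shows only the control flow, not LTG's code):

def cascade_tagging(doc_tokens, sure_fire, candidate_gen, accept):
    """sure_fire(doc_tokens) -> high-precision entity matches;
    candidate_gen(doc_tokens, entities) -> partial-match candidates (e.g. "Disney's ...");
    accept(candidate, entities, doc_tokens) -> bool, e.g. a learned maxent filter."""
    entities = list(sure_fire(doc_tokens))          # sure-fire pass first
    for cand in candidate_gen(doc_tokens, entities):
        if accept(cand, entities, doc_tokens):      # filter partial matches using context
            entities.append(cand)
    return entities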

LTG Results

[Results chart comparing systems: LTG, IdentiFinder, MENE+Proteus, Manitoba (NB-filtered names), NetOwl – commercial RBS.]