Prototype-Driven Learning for Sequence Models
Aria Haghighi and Dan Klein, University of California, Berkeley
Slides prepared by Andrew Carlson for the Semi-supervised NL Learning Reading Group

Motivation: Learn models with least effort
- Supervised learning requires many labeled examples
- Unsupervised learning requires a carefully designed model, which does not necessarily minimize total effort
- Prototype-driven learning can require less total effort

Prototype-driven learning
- Specify prototypical examples for each target label
- Example: for POS tagging, list the target tags and a few example words for each tag
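
For concreteness, a prototype specification can be as small as a map from each target label to a handful of words. This is a minimal sketch: the tags are real Penn Treebank tags, but apart from "said" (named later in these slides as a prototype for VBD) the word choices are illustrative assumptions, not the paper's actual list.

```python
# Hypothetical prototype list for POS tagging: each target label maps to a
# few canonical words. Only "said" (for VBD) is taken from these slides; the
# other entries are illustrative guesses, not the authors' actual prototypes.
PROTOTYPES = {
    "NN":  ["year", "company", "percent"],  # common nouns (assumed examples)
    "VBD": ["said", "was", "had"],          # past-tense verbs; "said" appears in the slides
    "DT":  ["the", "a", "this"],            # determiners (assumed examples)
    "JJ":  ["new", "other", "last"],        # adjectives (assumed examples)
}

# Invert the map so we can ask which label a prototype word stands for.
PROTO_LABEL = {w: tag for tag, words in PROTOTYPES.items() for w in words}
```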

English POS Prototypes

Arguments for prototype-driven learning
- It is the minimum one would have to provide to a human annotator
- Pedagogical use
- Natural language exhibits proform and prototype effects

General approach
- Link any given word to similar prototypes using distributional similarity
- Encode these prototype links as features in a log-linear generative model, trained to fit unlabeled data
- Example: reported may be linked to said, which is a prototype for the POS tag VBD
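
A minimal sketch of how a prototype link turns into a model feature. The function name and the feature-string format are assumptions made for illustration; in the system these PROTO features sit alongside the BASE features described later in these slides.

```python
def proto_features(word, similar_prototypes):
    """Emit one PROTO node feature per prototype the word is
    distributionally similar to (a sketch, not the paper's exact code)."""
    feats = []
    for proto in similar_prototypes.get(word, []):
        # e.g. "reported" linked to the VBD prototype "said"
        # yields the node feature "PROTO=said".
        feats.append("PROTO=" + proto)
    return feats

# Hypothetical usage:
similar_prototypes = {"reported": ["said"]}
print(proto_features("reported", similar_prototypes))  # ['PROTO=said']
```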

English POS task (Penn Treebank)

Chinese POS task (Penn Chinese Treebank)

Classified Ad Segmentation (Grenager et al. 2005)
- Task: segment classified advertisements into topical sections
- Typical of unsupervised learning on a new domain: Grenager et al. altered their HMM to prefer diagonal transitions, then modified the transition structure to explicitly model boundary tokens

Approach
- For each document x, we would like to predict a sequence of labels y
- Build a generative model and choose parameters θ to maximize the log-likelihood of the observed data D:
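
The objective on the original slide appeared as an image; the reconstruction below uses standard notation and assumes the same symbols as the paper: a sum over unlabeled documents with the hidden label sequences marginalized out.

```latex
% Marginal log-likelihood of the observed documents, with labels y summed out
\mathcal{L}(\theta; \mathcal{D})
  \;=\; \sum_{x \in \mathcal{D}} \log p_\theta(x)
  \;=\; \sum_{x \in \mathcal{D}} \log \sum_{y} p_\theta(x, y)
```

Maximizing this fits the unlabeled data, while the PROTO features tie the hidden labels to the prototypes.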

Markov Random Fields
- Use the MRF model family, the undirected equivalent of HMMs
- The joint probability factors into a normalizer, edge/transition clique potentials, and node/emission clique potentials; each clique potential has a log-linear form (the equation on the original slide is reconstructed below)
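
The missing equation is, in all likelihood, the usual factorization of a chain-structured MRF into clique potentials. The version below uses bigram transitions for readability (the English POS model in these slides actually uses tag trigrams), so treat it as a sketch rather than a verbatim copy of the slide.

```latex
% Chain MRF over a sentence x = (x_1,...,x_n) and labels y = (y_1,...,y_n)
p_\theta(x, y) \;=\; \frac{1}{Z(\theta)}
  \prod_{i=1}^{n} \phi(y_{i-1}, y_i)\,\phi(y_i, x_i),
\qquad
\phi(c) \;=\; \exp\!\big(\theta^\top f(c)\big)
```

Here Z(θ) is the normalizer, φ(y_{i-1}, y_i) the edge/transition potential, φ(y_i, x_i) the node/emission potential, and f(c) the feature vector of clique c.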

English POS trigram tagger

Using distributional similarity and prototypes
- Add a node feature PROTO = z for all prototypes z similar to the word at that node
- For POS tagging, similarity is based on positional context vectors
- For the classified ad task, position is ignored and a wider window is used
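
A rough sketch of the kind of context-vector similarity this slide describes. The window size, the similarity threshold, and the function names are assumptions; the paper additionally reduces the context vectors with an SVD, which is omitted here.

```python
from collections import Counter

import numpy as np


def context_vectors(sentences, vocab, positional=True, window=1):
    """Count context words around each target word. With positional=True the
    offset is part of the context key (as for POS tagging); with False the
    window is treated as a bag, as described for the ad-segmentation task."""
    contexts = {w: Counter() for w in vocab}
    for sent in sentences:
        for i, w in enumerate(sent):
            if w not in contexts:
                continue
            for off in range(-window, window + 1):
                if off == 0 or not (0 <= i + off < len(sent)):
                    continue
                key = (off, sent[i + off]) if positional else sent[i + off]
                contexts[w][key] += 1
    return contexts


def similar_prototypes(word, contexts, prototypes, threshold=0.35):
    """Return prototypes whose context vectors have cosine similarity with
    the word's vector above a (made-up) threshold."""
    def cosine(a, b):
        keys = sorted(set(a) | set(b), key=str)
        va = np.array([float(a.get(k, 0)) for k in keys])
        vb = np.array([float(b.get(k, 0)) for k in keys])
        denom = np.linalg.norm(va) * np.linalg.norm(vb)
        return float(va @ vb / denom) if denom else 0.0

    return [p for p in prototypes
            if cosine(contexts[word], contexts[p]) > threshold]
```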

Prototype list for classified ads

Parameter estimation
- See the paper for details
- Gradient-based optimization (L-BFGS), with forward-backward to compute expectations and Viterbi for decoding
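
For orientation, the gradient that such a gradient-based method needs has the standard "clamped minus free" form for a log-linear model with hidden labels; this is a textbook derivation and an assumption about what the slide's "see the paper" step contains, not a quote from it.

```latex
% Gradient of the marginal log-likelihood for one document x:
% posterior expected features minus fully unclamped expected features
\frac{\partial}{\partial \theta} \log p_\theta(x)
  \;=\; \mathbb{E}_{p_\theta(y \mid x)}\!\big[f(x, y)\big]
  \;-\; \mathbb{E}_{p_\theta(x', y)}\!\big[f(x', y)\big]
```

The first expectation is computed with forward-backward on the observed sentence; the second also sums out the observations, via a similar dynamic program over the chain.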

English POS tagging
- Used the WSJ portion of the Penn Treebank
- Used two data sizes: 48K tokens and 193K tokens

Baseline (BASE) features
- Node features: exact word, character suffixes, init-caps, contains-hyphen, contains-digit
- Edge features: tag trigrams
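
A small sketch of the node-feature templates above; the feature-name strings and the choice of suffix lengths are assumptions made for illustration.

```python
def base_node_features(word, max_suffix=3):
    """BASE node features for one token: exact word, character suffixes,
    and simple orthographic indicators (a sketch of the templates above)."""
    feats = ["WORD=" + word]
    for k in range(1, min(max_suffix, len(word)) + 1):
        feats.append("SUFFIX%d=%s" % (k, word[-k:]))
    if word[:1].isupper():
        feats.append("INIT-CAPS")
    if "-" in word:
        feats.append("CONTAINS-HYPHEN")
    if any(ch.isdigit() for ch in word):
        feats.append("CONTAINS-DIGIT")
    return feats

# Example: base_node_features("Mid-1990s")
# -> ['WORD=Mid-1990s', 'SUFFIX1=s', 'SUFFIX2=0s', 'SUFFIX3=90s',
#     'INIT-CAPS', 'CONTAINS-HYPHEN', 'CONTAINS-DIGIT']
```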

Building the prototype list
- Automatically extracted the prototype list from labeled data
- For each label, selected the three most frequent words that were not given another label more often
- Yes, this does use labeled data. The authors did it this way to give repeatable results and to avoid excessive tuning
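
A sketch of that selection rule, assuming access to a labeled corpus of (word, tag) pairs; the variable names and helper structure are mine.

```python
from collections import Counter, defaultdict


def extract_prototypes(tagged_corpus, per_label=3):
    """For each label, pick the most frequent words whose most frequent
    label is that label (a sketch of the slide's selection rule)."""
    word_label_counts = defaultdict(Counter)
    for word, label in tagged_corpus:
        word_label_counts[word][label] += 1

    prototypes = defaultdict(list)
    # Consider words in order of total frequency, most frequent first.
    by_freq = sorted(word_label_counts,
                     key=lambda w: -sum(word_label_counts[w].values()))
    for word in by_freq:
        best_label, _ = word_label_counts[word].most_common(1)[0]
        if len(prototypes[best_label]) < per_label:
            prototypes[best_label].append(word)
    return dict(prototypes)
```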

English POS Prototypes

Using the prototypes
- Restricting the prototype words to have their respective labels improved performance, but did not help similar non-prototype words
- Solution: add PROTO features to similar words

English POS trigram tagger

English POS results

English POS transition mass (true vs. estimated)

Chinese POS results
- Reduces the error rate from BASE by 35%, but the results are not as good as for English
- Reasons: the task is harder, and there was less data for distributional similarity

Classified ad segmentation
- For distributional similarity, used context vectors but ignored distance and direction
- Added a special BOUNDARY state to handle tokens that indicate transitions
- This is special model tweaking, a deviation from the “least effort” motivation

Classified ad segmentation results
- BASE: 46.4% accuracy
- BASE+PROTO+SIM: 71.5% accuracy
- BASE+PROTO+SIM+BOUND: 72.4% accuracy
- Grenager et al. reported supervised accuracy of 74.4%

Common classified ad confusions

Classified ad transition mass (labeled vs. estimated from BASE features vs. estimated from all features)

Conclusions
- Prototype-driven learning provides a compact and declarative way to specify a target labeling scheme
- Distributional similarity features seem to work well in linking words to prototypes
- Bridges the gap between unsupervised, sequence-free distributional clustering approaches and supervised sequence-model learning