Prototype-Driven Learning for Sequence Models

Presentation transcript:

Prototype-Driven Learning for Sequence Models
Aria Haghighi and Dan Klein
Computer Science Division, University of California, Berkeley
Hi, my name is Aria. I'm presenting "Prototype-Driven Learning for Sequence Models."

Overview
Unlabeled Data + Prototype List → Annotated Data
We use unlabeled data in conjunction with a prototype list: a declarative specification of the target structure that pairs each target label with a few prototype words. The resulting model produces annotated data just like a supervised system. An advantage of this approach is that if we want to alter our annotation scheme, we only need to alter the prototype list in order to change the tagging behavior.

Sequence Modeling Tasks: Information Extraction (Classified Ads)
Two sequence examples appear in this talk; this is the first. Example ad: "Newly remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. Walking distance to shopping, public transportation, schools and park. Paid water and garbage. No dogs allowed."
Prototype List:
FEATURE: kitchen, laundry
LOCATION: near, close
TERMS: paid, utilities
SIZE: large, feet
RESTRICT: cat, smoking

Sequence Modeling Tasks: English POS
The same text ("Newly remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. …") can instead be tagged with parts of speech (NN, VBN, CC, JJ, CD, PUNC, IN, NNS, NNP, RB, DET). In order to learn a POS model, we just change the prototype list.
Prototype List:
NN: president
IN: of
VBD: said
NNS: shares
CC: and
TO: to
NNP: Mr.
PUNC: .
JJ: new
CD: million
DET: the
VBP: are

Generalizing Prototypes
How do we generalize to non-prototypes? Tie each word to its most distributionally similar prototype: in "a witness reported", a ≈ the (DET), witness ≈ president (NN), reported ≈ said (VBD). Of course, POS depends on lots of factors, so the next slide shows how we use this tie only as a noisy hint.

Generalizing Prototypes
y: DET NN VBD
x: a witness reported
What we'd like to do is give a noisy hint of the label and use it as just another feature. How do we make a structured model for unlabeled data?
Features on 'reported': 'reported' ∧ VBD, suffix-2='ed' ∧ VBD, sim='said' ∧ VBD
Weights: 'reported' ∧ VBD = 0.35, suffix-2='ed' ∧ VBD = 0.23, sim='said' ∧ VBD = 0.35
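To make the sim= feature concrete, here is a minimal Python sketch of how such node predicates could be generated. The helper `similarity`, the threshold value, and the exact feature names are illustrative assumptions, not the authors' implementation; the talk later says similarity features are added for the top five most similar prototypes above a threshold.

```python
# Sketch: node predicates for a word, including "sim=<prototype>" hint features.
# similarity(word, proto) is assumed to return the cosine similarity of their
# reduced context vectors (see the distributional similarity slides below).
def node_predicates(word, prototypes, similarity, top_k=5, threshold=0.35):
    feats = [f"word={word}"]
    feats += [f"suffix-{k}={word[-k:]}" for k in (1, 2, 3) if len(word) >= k]
    if any(c.isdigit() for c in word):
        feats.append("contains-digit")
    if "-" in word:
        feats.append("contains-hyphen")
    if word[:1].isupper():
        feats.append("initial-cap")
    # Noisy hint: tie the word to its most distributionally similar prototypes.
    ranked = sorted(prototypes, key=lambda p: similarity(word, p), reverse=True)
    feats += [f"sim={p}" for p in ranked[:top_k] if similarity(word, p) > threshold]
    return feats
```

In the model, each predicate is then conjoined with the candidate label, giving features such as sim='said' ∧ VBD.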

Markov Random Fields for Unlabeled Data Pause. Breathe. Give them time to digest!

Markov Random Fields
x: input sentence — a witness reported
y: hidden labels — DET NN VBD

Markov Random Fields Weights sim=‘the’ Æ DET = 0.75 ‘a’ Æ DET = 1.5 NN VBD a witness reported y: x: Node Features ‘a’ Æ DET sim=‘the’ Æ DET suffix-1=‘a’ Æ DET Weights sim=‘the’ Æ DET = 0.75 ‘a’ Æ DET = 1.5 suffix-1=‘a’ Æ DET = 0.15

Markov Random Fields: Node Features
Node features on 'witness': 'witness' ∧ NN, sim='president' ∧ NN, suffix-2='ss' ∧ NN
Weights: 'witness' ∧ NN = 0.35, sim='president' ∧ NN = 0.35, suffix-2='ss' ∧ NN = 0.23

Markov Random Fields: Node Features
Node features on 'reported': 'reported' ∧ VBD, sim='said' ∧ VBD, suffix-2='ed' ∧ VBD
Weights: 'reported' ∧ VBD = 0.35, sim='said' ∧ VBD = 0.35, suffix-2='ed' ∧ VBD = 0.23

Markov Random Fields Weights DET Æ NN Æ VBD = 1.15 DET NN VBD a witness reported y: x: DET Æ NN Æ VBD Weights DET Æ NN Æ VBD = 1.15

Markov Random Fields: Scoring
We collect all features active on (x, y) — 'a' ∧ DET, suffix-1='a' ∧ DET, 'witness' ∧ NN, suffix-1='s' ∧ NN, suffix-2='ed' ∧ VBD, DET ∧ NN ∧ VBD — and assign a score based on the exponentiated dot product of the weight vector θ with the feature vector: score(x, y) = exp(θᵀ f(x, y)). Explain θ!
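A sketch of this scoring function under the assumptions above: θ is a dict of feature weights, node_predicates is a one-argument function (e.g. a closure over the prototype list and similarity from the earlier sketch), and transitions are label trigrams. It illustrates exp(θᵀ f(x, y)); it is not the authors' code.

```python
import math
from collections import Counter

def score(x, y, theta, node_predicates):
    """score(x, y) = exp(theta . f(x, y)) for words x and a candidate labeling y."""
    f = Counter()
    for i, (word, label) in enumerate(zip(x, y)):
        for pred in node_predicates(word):       # node features: predicate conjoined with label
            f[(pred, label)] += 1
        prev2 = y[i - 2] if i >= 2 else "<s>"    # transition feature: label trigram,
        prev1 = y[i - 1] if i >= 1 else "<s>"    # padded at the left edge
        f[(prev2, prev1, label)] += 1
    return math.exp(sum(theta.get(feat, 0.0) * n for feat, n in f.items()))
```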

Joint Probability Model
We normalize the score to get a joint distribution:
p(x, y) = score(x, y) / Z(θ)
Partition function: Z(θ) = Σ_{x,y} score(x, y)
We'll talk about the partition function in a slide — it is a sum over infinitely many inputs!

Objective Function
Given unlabeled sentences {x1, …, xn}, choose θ to maximize the likelihood of the data, summing out the hidden labels.
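The formula itself did not survive extraction from the slide; as I read the paper, the objective is the marginal log-likelihood of the unlabeled sentences, with the hidden labels summed out:

\mathcal{L}(\theta) = \sum_{i=1}^{n} \log p(x_i; \theta) = \sum_{i=1}^{n} \log \sum_{y} p(x_i, y; \theta)

Its gradient is a difference of feature expectations, \sum_{i=1}^{n} \big( \mathbb{E}_{p(y \mid x_i)}[f(x_i, y)] - \mathbb{E}_{p(x, y)}[f(x, y)] \big), which are the "first" and "second" expectations on the next slide.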

Optimization
The gradient involves two expectations, both computed with forward-backward.
First expectation: over labelings of the observed sentence (e.g. "a witness reported"), via the forward-backward algorithm.
Second expectation: over all inputs and labelings (e.g. "? the of in …"); for a fixed input length, this is computed with the forward-backward algorithm for lattices.
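A minimal sketch of the forward-backward computation behind the first expectation, simplified to a first-order (bigram) chain with unnormalized potentials in probability space; the tagger in the talk actually uses trigram transitions, and a real implementation would work in log space.

```python
import numpy as np

def forward_backward(node_pot, trans_pot):
    """Posterior label marginals and the sentence's total score over labelings.

    node_pot:  (T, K) array, exp(node scores) for each position and label
    trans_pot: (K, K) array, exp(transition scores) between adjacent labels
    """
    T, K = node_pot.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = node_pot[0]
    for t in range(1, T):
        alpha[t] = node_pot[t] * (alpha[t - 1] @ trans_pot)
    for t in range(T - 2, -1, -1):
        beta[t] = trans_pot @ (node_pot[t + 1] * beta[t + 1])
    Z_x = alpha[T - 1].sum()            # sum of score(x, y) over all labelings y
    marginals = alpha * beta / Z_x      # p(y_t = k | x) for each position t
    return marginals, Z_x
```

Expected feature counts (the first expectation) follow by weighting each position's predicates with these marginals; the second expectation reuses the same routine over a lattice in which every position ranges over the whole vocabulary.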

Partition Function
Length lattice: for a fixed input length, compute the sum with lattice forward-backward [Smith & Eisner 05].
Approximation: truncate the sum over lengths to a finite maximum.
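Spelled out (my notation; the slide's figure did not survive extraction), the approximation is:

Z(\theta) = \sum_{n \ge 1} \sum_{x : |x| = n} \sum_{y} \mathrm{score}(x, y) \;\approx\; \sum_{n = 1}^{N_{\max}} Z_n(\theta)

where each fixed-length term Z_n(\theta) is computed by forward-backward over a lattice whose positions range over the entire vocabulary [Smith & Eisner 05], and the infinite sum over lengths is truncated at N_max.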

Experiments

English POS Experiments
Data: 193K tokens (about 8K sentences) of the WSJ portion of the Penn Treebank.
Features [Smith & Eisner 05]: trigram tagger; word type, suffixes up to length 3, contains-hyphen, contains-digit, initial capitalization.

English POS Experiments: BASE
Fully unsupervised baseline: random initialization, greedy label remapping.

English POS Experiments: Prototype List
3 prototypes per tag, automatically extracted by frequency. In a realistic setting we would pick these by hand, but here they are extracted automatically. We fix prototypes to take their label.

English POS Distributional Similarity
Judge a word by the company it keeps: "<s> the president said a downturn is near </s>". Collect context counts at positions -1 and +1 from 40M words of WSJ, e.g. president — the ___ said: 0.6, a ___ reported: 0.3.
Similarity [Schütze 93]: SVD dimensionality reduction, cosine similarity measure.
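A small Python sketch of this similarity pipeline — count the -1/+1 contexts, reduce with an SVD, compare with cosine. The SVD rank and the corpus handling are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from collections import Counter, defaultdict

def context_vectors(sentences, vocab, rank=50):
    """Reduced context vectors from -1/+1 neighbor counts (rank is illustrative)."""
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            w = padded[i]
            if w in vocab:
                counts[w][("-1", padded[i - 1])] += 1   # left neighbor
                counts[w][("+1", padded[i + 1])] += 1   # right neighbor
    contexts = sorted({c for ctxs in counts.values() for c in ctxs})
    col = {c: j for j, c in enumerate(contexts)}
    row = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(contexts)))
    for w, ctxs in counts.items():
        for c, n in ctxs.items():
            M[row[w], col[c]] = n
    U, S, _ = np.linalg.svd(M, full_matrices=False)     # SVD dimensionality reduction
    return row, U[:, :rank] * S[:rank]

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
```

The sim= feature in the earlier sketch would then compare a word's reduced vector against each prototype's vector with cosine.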

English POS Experiments: PROTO+SIM
Add similarity features: the top five most similar prototypes that exceed a threshold. PROTO+SIM reaches 67.8% on non-prototype accuracy.

English POS Transition Counts (figure: target structure vs. learned structure)

Classified Ads Experiments
Data: 100 ads (about 119K tokens) from [Grenager et al. 05].
Features: trigram tagger; word type.

Classified Ads Experiments: BASE
Fully unsupervised baseline: random initialization, greedy label remapping.

Classified Ads Experiments: Prototype List
3 prototypes per tag (33 words in total), automatically extracted by frequency.

Classified Ads Distributional Similarity
Different from English POS: instead of the -1/+1 neighbor contexts used there ("<s> the president said a downturn is near </s>"), we use broader context, more similar to a topic model — e.g. the words surrounding "walking distance to shopping , public transportation".

Classified Ads Experiments: PROTO+SIM
Add similarity features.

Reacting to Observed Errors: Boundary Model
Here's a common error at field boundaries, e.g. "… schools and park . Paid water and …", where LOCATION should end and TERMS begin at the period. We react by augmenting the prototype list with a BOUNDARY field whose prototypes are punctuation: , ; .

Classified Ads Experiments: BOUND
Add the Boundary field.

Information Extraction Transition Counts (figure: target structure vs. learned structure)

Conclusion
Prototype-driven learning: a novel, flexible, weakly supervised learning framework that merges distributional clustering techniques with supervised structured models.

Thanks! Questions?

English POS Experiments: PROTO
Fix prototypes to their tag; no random initialization, no remapping. Accuracy: BASE 41.3, PROTO 68.8; PROTO reaches 47.7% on non-prototype accuracy.

Classified Ads Experiments: PROTO
Fix prototypes to their tag; no random initialization, no remapping. Accuracy: BASE 46.4, PROTO 53.7.

Objective Function
The sum over hidden labels for an observed sentence (e.g. "? witness reported") is computed with the forward-backward algorithm.

Objective Function
The partition function is an infinite sum over all lengths of input; it can be computed exactly under certain conditions.

English POS Distributional Similarity
Collect context counts from the BLLIP corpus:
president — the ___ said: 0.6, a ___ reported: 0.3
downturn — a ___ was: 0.8, a ___ is: 0.2
a — <s> ___ witness: 0.6, said ___ downturn: 0.3
the — <s> ___ president: 0.8, said ___ downturn: 0.2
Similarity [Schütze 93]: SVD dimensionality reduction, cosine between context vectors.