IE With Undirected Models: the saga continues

William W. Cohen, CALD

Announcements
Upcoming assignments:
- Mon 2/23: Toutanova et al.
- Wed 2/25: Klein & Manning; intro to max-margin theory
- Mon 3/1: no writeup due
- Wed 3/3: project proposal due: personnel + 1-2 pages
Spring break week: no class.

Motivation for CMMs
Features of a token can include: identity of the word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, ... For example, the token "Wisniewski" is part of a noun phrase and ends in "-ski".
[Figure: states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}, as in an HMM.]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.
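
A minimal sketch of this idea, assuming scikit-learn is available: train a maxent (logistic regression) classifier for Pr(state | previous state, token features) and decode left to right. The feature names, toy data, and greedy decoding are illustrative simplifications, not the lecture's implementation; in practice a Viterbi-style or beam search replaces greedy decoding.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(tokens, i, prev_state):
    # maxent features: the word itself, a couple of shape features, and the previous state
    w = tokens[i]
    return {
        "word=" + w.lower(): 1,
        "ends_in_ski": int(w.endswith("ski")),
        "is_capitalized": int(w[0].isupper()),
        "prev_state=" + prev_state: 1,
    }

# toy training data: one sentence with per-token states (purely illustrative)
train = [(["William", "Cohen", "visited", "Wisniewski"],
          ["person", "person", "other", "person"])]

X, y = [], []
for tokens, states in train:
    prev = "START"
    for i, s in enumerate(states):
        X.append(token_features(tokens, i, prev))
        y.append(s)
        prev = s

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

def greedy_decode(tokens):
    # label tokens left to right, feeding each predicted state into the next decision
    states, prev = [], "START"
    for i in range(len(tokens)):
        s = clf.predict(vec.transform([token_features(tokens, i, prev)]))[0]
        states.append(s)
        prev = s
    return states

print(greedy_decode(["Klein", "met", "Toutanova"]))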

Implications of the model
Does this do what we want? Q: does Y[i-1] depend on X[i+1]?
Recall: "a node is conditionally independent of its non-descendants given its parents." Since X[i+1] is a non-descendant of Y[i-1], the answer is no: in this directed model, later observations cannot influence earlier labels.

CRF model
[Figure: an undirected chain y1 - y2 - y3 - y4, with each yi also connected to the observation x.]
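
For reference, the standard linear-chain CRF distribution that this figure depicts is

\Pr(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big),

where the f_k are feature functions. Normalization is global, over whole label sequences, unlike the per-state normalization of a CMM/MEMM.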

Dependency Nets

Dependency nets
Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for Pr(X | pa(X)) for each node X.
For the chain above, the local models are Pr(y1|x,y2), Pr(y2|x,y1,y3), Pr(y3|x,y2,y4), Pr(y4|x,y3).
Learning is local, but inference is not, and need not be unidirectional.
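
A minimal sketch (not from the lecture) of why inference is global even though learning is local: given only per-node conditionals Pr(yi | x, neighbors), marginals can be estimated with Gibbs-style sweeps that repeatedly resample each node. The local conditional below is a toy, hand-coded function over binary labels.

import random

random.seed(0)

def local_prob(i, y, x):
    # toy Pr(y[i] = 1 | x, neighbors of y[i]) for binary labels
    left = y[i - 1] if i > 0 else None
    right = y[i + 1] if i < len(y) - 1 else None
    score = 0.2 + 0.5 * x[i]                                             # observation pulls toward label 1
    score += 0.15 * sum(v == 1 for v in (left, right) if v is not None)  # neighbor agreement
    return min(max(score, 0.0), 1.0)

def gibbs_sweeps(x, n_sweeps=200):
    y = [0] * len(x)
    counts = [0] * len(x)
    for sweep in range(n_sweeps):
        for i in range(len(y)):      # resample one node at a time, left to right
            y[i] = 1 if random.random() < local_prob(i, y, x) else 0
        for i in range(len(y)):
            counts[i] += y[i]
    return [c / n_sweeps for c in counts]   # rough estimates of Pr(yi = 1 | x)

print(gibbs_sweeps([1, 1, 0, 0]))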

Toutanova, Klein, Manning, Singer
Dependency nets for POS tagging vs. CMMs; maxent is used for the local conditional models.
Goals: an easy-to-train bidirectional model, and a really good POS tagger.

Toutanova et al.
Don't use Gibbs sampling for inference: instead use a Viterbi variant (which is not guaranteed to produce the most likely sequence).
Example with two variables (a,b) and data D = {11, 11, 11, 12, 21, 33}: the ML state is (1,1), but under the product-of-conditionals score
P(a=1|b=1)P(b=1|a=1) < 1 while P(a=3|b=3)P(b=3|a=3) = 1,
so the dependency net prefers (3,3).
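
A small check of this example (illustrative code, not from the slides): estimate the conditionals from D and compare the product-of-conditionals scores of (1,1) and (3,3).

from collections import Counter

D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]
pairs = Counter(D)
a_counts = Counter(a for a, b in D)
b_counts = Counter(b for a, b in D)

def p_a_given_b(a, b):
    return pairs[(a, b)] / b_counts[b]

def p_b_given_a(b, a):
    return pairs[(a, b)] / a_counts[a]

def dependency_net_score(a, b):
    # product of the two local conditionals
    return p_a_given_b(a, b) * p_b_given_a(b, a)

print(dependency_net_score(1, 1))  # 0.5625 = (3/4)(3/4), even though (1,1) is the ML state
print(dependency_net_score(3, 3))  # 1.0, so the product-of-conditionals score prefers (3,3)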

Results with model
Final test-set results: MXPost: 47.6, 96.4, 86.2; CRF+: 95.7, 76.4.

Klein & Manning: Conditional Structure vs Estimation

Task 1: WSD (Word Sense Disambiguation)
"Bush's election-year ad campaign will begin this summer, with..." (sense 1)
"Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails?" (sense 2)
The class is sense1/sense2; the features are context words.

Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model. Use the conditional (Bayes) rule to predict sense s from context-word observations o. Standard NB training maximizes the "joint likelihood" under the independence assumption.
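
For concreteness, the standard multinomial NB form and the joint-likelihood objective are

\Pr(s, o) = \Pr(s) \prod_{j} \Pr(o_j \mid s),
\qquad
\hat{s}(o) = \arg\max_{s} \Pr(s \mid o) = \arg\max_{s} \Pr(s) \prod_{j} \Pr(o_j \mid s),

and joint-likelihood training maximizes \sum_{i} \log \Pr(s_i, o_i) over the training examples.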

Task 1: WSD (Word Sense Disambiguation)
Model 2: Keep the same functional form, but maximize the conditional likelihood (sound familiar?), or maybe a SenseEval-style score, or maybe even classification accuracy.
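
The conditional-likelihood objective for the same parametric form (a standard formula) is

CL(\theta) = \sum_{i} \log \Pr_\theta(s_i \mid o_i)
           = \sum_{i} \log \frac{\Pr_\theta(s_i) \prod_j \Pr_\theta(o_{ij} \mid s_i)}{\sum_{s'} \Pr_\theta(s') \prod_j \Pr_\theta(o_{ij} \mid s')}.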

Task 1: WSD (Word Sense Disambiguation)
- Optimize JL with standard NB learning.
- Optimize CL and SCL with conjugate gradient; SCL includes a penalty for extreme predictions.
- Also optimize over "non-deficient models" (?), using Lagrange penalties to enforce a "soft" version of the deficiency constraint. (I think this ensures the non-conditional version is a valid probability distribution.)
- Don't even try optimizing accuracy directly.

Task 2: POS Tagging
A sequential problem: replace NB with an HMM model. Standard algorithms maximize the joint likelihood.
Claim: keeping the same model but maximizing conditional likelihood leads to a CRF. Is this true?
The alternative is conditional structure (a CMM).

Using conditional structure vs. maximizing conditional likelihood
A CMM factors Pr(s,o) into Pr(s|o)Pr(o). For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e., the JL estimate equals the CL estimate for Pr(s|o).
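
The one-line justification (standard, spelled out here): because Pr(s|o) and Pr(o) have separate parameters \theta and \phi, the joint log-likelihood decomposes as

\sum_i \log \Pr(s_i, o_i) = \sum_i \log \Pr_\theta(s_i \mid o_i) + \sum_i \log \Pr_\phi(o_i),

so maximizing the joint likelihood over \theta is the same optimization as maximizing the conditional likelihood, and enriching the observation model Pr_\phi(o) has no effect on Pr_\theta(s|o).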

Task 2: POS Tagging
Experiments with a simple feature set:
- For a fixed model, CL is preferred to JL (the CRF beats the HMM).
- For a fixed objective, the HMM structure is preferred to the MEMM/CMM.

Error analysis for POS tagging
Label bias is not the issue: state-state dependencies are weak compared to observation-state dependencies. There is too much emphasis on the observation and not enough on previous states ("observation bias"). Put another way: label bias predicts overprediction of states with few outgoing transitions or, more generally, low entropy...

Error analysis for POS tagging

Background for next week: the last 20 years of learning theory

Milestones in learning theory
Valiant 1984 (CACM): Turing machines and Turing tests; a formal analysis of AI problems. A Chernoff-style bound shows that if error(h) > ε, then Prob(h is consistent with m examples) < δ, for m large enough. So given m examples, we can afford to examine roughly 2^m hypotheses.
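
The calculation behind this claim is the textbook Occam/union-bound argument, sketched here for completeness:

\Pr(h \text{ consistent with } m \text{ examples}) \le (1-\epsilon)^m \le e^{-\epsilon m}
\quad \text{whenever } \mathrm{error}(h) > \epsilon,

so union-bounding over a hypothesis class H,

\Pr\big(\exists h \in H: \mathrm{error}(h) > \epsilon \text{ and } h \text{ consistent}\big) \le |H| \, e^{-\epsilon m} \le \delta
\quad \text{once } m \ge \tfrac{1}{\epsilon}\big(\ln |H| + \ln \tfrac{1}{\delta}\big),

which allows |H| to grow exponentially in m, i.e., on the order of 2^m hypotheses.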

Milestones in learning theory
Haussler (AAAI 1986): pick a small hypothesis from a large set; given m examples, one can learn a hypothesis of size O(m) bits.
Blumer, Ehrenfeucht, Haussler, Warmuth (STOC 1988): generalize the notion of "hypothesis size" to VC-dimension.

More milestones...
Littlestone (MLJ 1988): the Winnow algorithm.
Blum (COLT 1991): learning a "small" hypothesis in many dimensions in the mistake-bound model; mistake bound ~= VC-dimension.
Blum (COLT 1991): learning over infinitely many attributes in the mistake-bound model.
Learning as compression, compression as learning...

More milestones...
Freund & Schapire (1996): boosting. Boosting C4.5, even to extremes, does not overfit the data (!?). How does this reconcile with Occam's razor?
Vapnik's support vector machines: kernel representation of a function; "true" optimization in machine learning; boosting as iterative "margin maximization".

Comments
For bag-of-words text, R^2 = |words in the doc| (the squared norm of an example), so the vocabulary size does not matter.
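
Presumably this refers to margin-based bounds such as the classic perceptron mistake bound, in which only R and the margin appear:

\#\text{mistakes} \le \left(\frac{R}{\gamma}\right)^2,
\qquad R = \max_i \|x_i\|, \quad \gamma = \text{margin of the data},

so for binary bag-of-words features, \|x\|^2 is just the number of distinct words in the document, independent of vocabulary size.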