1 CS546: Machine Learning and Natural Language. Multi-Class and Structured Prediction Problems. Slides from Taskar and Klein are used in this lecture.

2 Outline – Multi-class classification – Structured prediction – Models for structured prediction and classification (example: POS tagging)

3 Multiclass problems – Most of the machinery we discussed so far was focused on binary classification problems, e.g., the SVMs covered so far – However, most problems we encounter in NLP are either: MultiClass (e.g., text categorization) or Structured Prediction (e.g., predicting the syntactic structure of a sentence) – How do we deal with them?

4 Binary linear classification

5 Multiclass classification

6 Perceptron

7 Structured Perceptron Joint feature representation: Algorithm:
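The joint feature representation and the algorithm itself are not reproduced in this transcript. As a rough, minimal sketch of the structured perceptron (the interfaces `phi`, `argmax_y`, and the data format are hypothetical placeholders, not from the slides):

```python
import numpy as np

def structured_perceptron(data, phi, argmax_y, dim, epochs=10):
    """Minimal structured perceptron sketch.

    data     : iterable of (x, y_gold) pairs; y is assumed hashable/comparable,
               e.g. a tuple of tags (an assumption, not from the slides)
    phi      : joint feature map phi(x, y) -> np.ndarray of shape (dim,)
    argmax_y : decoder returning the output y maximizing w . phi(x, y)
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = argmax_y(x, w)          # inference with current weights
            if y_hat != y_gold:             # mistake-driven additive update
                w += phi(x, y_gold) - phi(x, y_hat)
    return w
```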

8 Perceptron

9 Binary Classification Margin

10 Generalize to MultiClass

11 Converting to MultiClass SVM

12 Max margin = Min Norm As before, these are equivalent formulations:
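The two formulations on this slide are not included in the transcript; for the separable multiclass case they are usually written as follows (my reconstruction, using the joint feature map f(x, y)):

```latex
\max_{\|\mathbf{w}\| = 1} \; \gamma
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y) \ge \gamma
\quad \forall i,\; \forall y \ne y_i
```

which is equivalent to

```latex
\min_{\mathbf{w}} \; \tfrac{1}{2}\|\mathbf{w}\|^{2}
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y) \ge 1
\quad \forall i,\; \forall y \ne y_i
```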

13 Problems: Requires separability. What if we have noise in the data? What if we have only a simple, limited feature space?
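The usual remedy, which the "Non-separable case" slides that follow presumably develop, is to add slack variables; a standard soft-margin multiclass objective is:

```latex
\min_{\mathbf{w}, \boldsymbol{\xi}} \; \tfrac{1}{2}\|\mathbf{w}\|^{2} + C \sum_i \xi_i
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y) \ge 1 - \xi_i,
\;\; \xi_i \ge 0,
\quad \forall i,\; \forall y \ne y_i
```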

14 Non-separable case

15 Non-separable case

16 Compare with MaxEnt

17 Loss Comparison

18 So far, we considered multiclass classification with 0-1 losses l(y, y’). What if what we want to predict is: sequences of POS tags, syntactic trees, translations? Multiclass -> Structured
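Formally the model keeps the same linear form, but the argmax now ranges over an exponentially large output space, so decoding requires dynamic programming or other combinatorial inference:

```latex
\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} \mathbf{w}^{\top}\mathbf{f}(x, y),
\qquad |\mathcal{Y}(x)| \ \text{grows exponentially with the size of } x
```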

19 Predicting word alignments

20 Predicting Syntactic Trees

21 Structured Models

22 Parsing

23 Max Margin Markov Networks (M3Ns) Taskar et al, 2003; similar Tsochantaridis et al, 2004

24 Max Margin Markov Networks (M3Ns)
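The M3N objective on these slides is not reproduced in the transcript; the central idea of Taskar et al. (2003) and Tsochantaridis et al. (2004) is to rescale the required margin by the structured loss, roughly:

```latex
\min_{\mathbf{w}, \boldsymbol{\xi}} \; \tfrac{1}{2}\|\mathbf{w}\|^{2} + C \sum_i \xi_i
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y) \ge \ell(y_i, y) - \xi_i
\quad \forall i,\; \forall y \in \mathcal{Y}(x_i)
```

M3Ns additionally exploit the Markov structure of the features and the loss to make this exponential set of constraints tractable.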

25 MultiClass Classification: Solving multiclass with binary learning. A multiclass classifier is a function f : R^d → {1, 2, 3, ..., k}. Decompose into binary problems: not always possible to learn; the binary scores are on different scales; no theoretical justification with respect to the real (multiclass) problem.

26 MultiClass Classification: Learning via One-Versus-All (OvA). Assumption: find v_r, v_b, v_g, v_y ∈ R^n such that – v_r·x > 0 iff y = red – v_b·x > 0 iff y = blue – v_g·x > 0 iff y = green – v_y·x > 0 iff y = yellow. Classifier: f(x) = argmax_i v_i·x. (Figures: individual classifiers; decision regions.) H = R^{kn}
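A minimal sketch of OvA training and prediction, assuming some binary learner `train_binary(X, z)` that returns a weight vector (a hypothetical interface, not from the slides):

```python
import numpy as np

def train_ova(X, y, classes, train_binary):
    """One-versus-all: learn one weight vector per class (class vs. rest)."""
    weights = {}
    for c in classes:
        z = np.where(y == c, 1, -1)        # relabel: current class vs. the rest
        weights[c] = train_binary(X, z)    # any binary linear learner
    return weights

def predict_ova(x, weights):
    """f(x) = argmax_c  v_c . x"""
    return max(weights, key=lambda c: weights[c] @ x)
```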

27 MultiClass Classification: Learning via All-Versus-All (AvA). Assumption: find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^d such that – v_rb·x > 0 if y = red, < 0 if y = blue – v_rg·x > 0 if y = red, < 0 if y = green – ... (for all pairs). (Figures: individual classifiers; decision regions.) H = R^{kkn}. How to classify?

28 Classifying with AvA: decision tree / tournament, or majority vote. Example: 1 red, 2 yellow, 2 green → ? All of these are post-learning heuristics and can behave strangely.
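A sketch of the majority-vote option, assuming `pair_weights[(a, b)]` holds a weight vector trained on class a (positive) versus class b (negative); as the slide's "1 red, 2 yellow, 2 green" example hints, ties still need some tie-breaking rule:

```python
from collections import Counter

def predict_ava_majority(x, pair_weights):
    """pair_weights: {(a, b): v_ab}, where v_ab . x > 0 is a vote for a."""
    votes = Counter()
    for (a, b), v in pair_weights.items():
        votes[a if v @ x > 0 else b] += 1
    return votes.most_common(1)[0][0]      # ties broken arbitrarily here
```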

29 POS Tagging English tags

30 POS Tagging, examples from WSJ From McCallum

31 POS Tagging. Ambiguity: not a trivial task. A useful task: important features for other steps are based on POS tags, e.g., POS tags are used as input to a parser.

32 But still, why so popular? – Historically the first statistical NLP problem – Easy to apply arbitrary classifiers, both as sequence models and as independent classifiers – Can be regarded as a finite-state problem – Easy to evaluate – Annotation is cheaper to obtain than treebanks (for other languages)

33 HMM (reminder)

34 HMM (reminder) - transitions

35 Transition Estimates

36 Emission Estimates
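The estimates themselves are not reproduced in the transcript; slides 35-36 presumably show the standard count-based maximum-likelihood estimates (typically smoothed in practice):

```latex
\hat{P}(t_i \mid t_{i-1}) = \frac{\mathrm{count}(t_{i-1}, t_i)}{\mathrm{count}(t_{i-1})},
\qquad
\hat{P}(w_i \mid t_i) = \frac{\mathrm{count}(t_i, w_i)}{\mathrm{count}(t_i)}
```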

37 MaxEnt (reminder)
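Again, the reminder slide's content is not in the transcript; the MaxEnt (multinomial logistic regression) model is the locally normalized conditional:

```latex
P(y \mid x) = \frac{\exp\big(\mathbf{w}^{\top}\mathbf{f}(x, y)\big)}
                   {\sum_{y'} \exp\big(\mathbf{w}^{\top}\mathbf{f}(x, y')\big)}
```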

38 Decoding: HMM vs MaxEnt

39 Accuracies overview

40 Accuracies overview

41 SVMs for tagging – We can use SVMs in a similar way to MaxEnt (or other classifiers) – We can use a window around the word – % on WSJ
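A toy illustration of the window idea: each word is classified independently using features drawn from a small window around it. The feature names and window size below are illustrative choices, not taken from the slides:

```python
def window_features(words, i, size=2):
    """Features for tagging words[i] from a +/- `size` window (illustrative)."""
    feats = {f"w0={words[i]}", f"suf3={words[i][-3:]}"}
    for k in range(1, size + 1):
        left = words[i - k] if i - k >= 0 else "<s>"
        right = words[i + k] if i + k < len(words) else "</s>"
        feats.add(f"w-{k}={left}")
        feats.add(f"w+{k}={right}")
    return feats
```

These feature sets would then be fed to a multiclass SVM (e.g., via OvA) to predict the tag of words[i].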

42 SVMs for tagging from Jimenez & Marquez

43 No sequence modeling

44 CRFs and other global models

45 CRFs and other global models

46 Compare: CRFs – no local normalization. MEMMs – note: after each step t the remaining probability mass cannot be reduced; it can only be distributed among the possible state transitions. HMMs.
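Written out, the contrast is that MEMMs normalize locally at every step while a CRF normalizes once, globally, over entire label sequences (a sketch of the standard factorizations):

```latex
\text{MEMM:}\;\;
P(\mathbf{y} \mid \mathbf{x}) = \prod_{t}
\frac{\exp\big(\mathbf{w}^{\top}\mathbf{f}(y_{t-1}, y_t, \mathbf{x}, t)\big)}
     {Z(y_{t-1}, \mathbf{x}, t)}
\qquad
\text{CRF:}\;\;
P(\mathbf{y} \mid \mathbf{x}) =
\frac{\exp\big(\sum_{t} \mathbf{w}^{\top}\mathbf{f}(y_{t-1}, y_t, \mathbf{x}, t)\big)}
     {Z(\mathbf{x})}
```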

47 Label Bias (based on a slide from Joe Drish)

48 Label Bias. Recall transition-based parsing – Nivre's algorithm (with beam search). At each step we can observe only local features (limited look-ahead). If we later see that the following word is impossible, we can only distribute the probability uniformly across all (im)possible decisions. If there is only a small number of such decisions, we cannot decrease the probability dramatically: a state with few outgoing transitions must pass on nearly all of its probability mass regardless of the observation, so the observation is effectively ignored. So, label bias is likely to be a serious problem if: there are non-local dependencies, and states have a small number of possible outgoing transitions.

49 POS Tagging Experiments – “+” denotes an extended feature set (hard to integrate into a generative model) – oov = out-of-vocabulary

50 Supervision – So far we considered the supervised case: the training set is labeled – However, we can try to induce word classes without supervision: unsupervised tagging – We will later discuss the EM algorithm – It can also be done in a partly supervised way: – Seed tags – A small labeled dataset – A parallel corpus – ...

51 Why not predict POS tags and parse trees simultaneously? – It is possible and often done this way – Doing tagging internally often benefits parsing accuracy – Unfortunately, parsing models are less robust than taggers (e.g., on non-grammatical sentences or in different domains) – It is also more expensive and does not help...

52 Questions: Why is there no label-bias problem for a generative model (e.g., an HMM)? How would you integrate word features into a generative model (e.g., HMMs for POS tagging)? E.g., if the word has suffixes -ing, -s, -ed, -d, -ment, ... or prefixes post-, de-, ...

53 “CRFs” for more complex structured output problems. So far we considered sequence labeling problems, where the structure of the dependencies is fixed. What if we do not know the structure but would like to have interactions that respect it?

54 “CRFs” for more complex structured output problems Recall, we had the MST algorithm (McDonald and Pereira, 05)
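For context, in the MST view each candidate dependency tree is scored as a sum of independent edge scores, so decoding is a maximum-spanning-tree problem (the first-order, edge-factored model cited on the slide):

```latex
s(x, \mathbf{y}) = \sum_{(h, m) \in \mathbf{y}} \mathbf{w}^{\top}\mathbf{f}(x, h, m),
\qquad
\hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathcal{T}(x)} s(x, \mathbf{y})
```

where T(x) is the set of dependency trees over the sentence x and (h, m) are head-modifier edges.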

55 “CRFs” for more complex structured output problems. Complex inference: e.g., arbitrary 2nd-order dependency parsing models are not tractable in the non-projective case – NP-complete (McDonald & Pereira, EACL 06). More recently, conditional models for constituent parsing: (Finkel et al., ACL 08), (Carreras et al., CoNLL 08), ...

56 Back to MultiClass – Let us review how to decompose a multiclass problem into binary classification problems

57 Summary – Margin-based methods for multiclass classification and structured prediction – CRFs vs HMMs vs MEMMs for POS tagging

58 Conclusions – All approaches use a linear representation – The differences are: – Features – How the weights are learned – Training paradigms: global training (CRF, global perceptron) vs modular training (PMM, MEMM, ...) – The modular approaches are easier to train, but may require additional mechanisms to enforce global constraints.