1 CS546: Machine Learning and Natural Language
Multi-Class and Structured Prediction Problems
Slides from Taskar and Klein are used in this lecture
2 Outline
– Multi-Class classification
– Structured Prediction
– Models for Structured Prediction and Classification: example of POS tagging
3 Multiclass problems
– Most of the machinery we discussed before was focused on binary classification problems, e.g., the SVMs we have covered so far
– However, most problems we encounter in NLP are either:
  – MultiClass: e.g., text categorization
  – Structured Prediction: e.g., predicting the syntactic structure of a sentence
– How do we deal with them?
4 Binary linear classification
5 Multiclass classification
6 Perceptron
7 Structured Perceptron
– Joint feature representation:
– Algorithm:
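The slide body (the joint feature map and the update rule) is not in the transcript; below is a minimal sketch of the structured perceptron, assuming the caller supplies a feature map `phi(x, y)` and a decoder `argmax_decode(x, w)` (e.g., Viterbi for sequences) — both names are placeholders, not from the slides.

```python
def structured_perceptron(train, phi, argmax_decode, epochs=10):
    """Mistake-driven structured perceptron (simplified sketch).

    train         : list of (x, y_gold) pairs
    phi(x, y)     : joint feature map, returns a sparse dict {feature: value}
    argmax_decode : returns argmax_y  w . phi(x, y), e.g. Viterbi for sequences
    """
    w = {}  # sparse weight vector

    def update(feats, scale):
        for f, v in feats.items():
            w[f] = w.get(f, 0.0) + scale * v

    for _ in range(epochs):
        for x, y_gold in train:
            y_hat = argmax_decode(x, w)        # model's current best structure
            if y_hat != y_gold:                # on a mistake ...
                update(phi(x, y_gold), +1.0)   # ... promote gold features
                update(phi(x, y_hat), -1.0)    # ... demote predicted features
    return w
```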
8 Perceptron
9 Binary Classification Margin
10 Generalize to MultiClass
11 Converting to MultiClass SVM
12 Max Margin = Min Norm. As before, these are equivalent formulations:
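As a sketch (standard textbook formulations, not copied from the slide), the two equivalent problems for the multiclass case can be written as:

```latex
% Max margin: fix the norm of w, maximize the margin gamma
\max_{\gamma,\; \mathbf{w} : \|\mathbf{w}\| = 1} \;\gamma
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y) \ge \gamma
\quad \forall i,\; \forall y \ne y_i

% Min norm: fix the margin to 1, minimize the norm of w
\min_{\mathbf{w}} \;\tfrac{1}{2}\|\mathbf{w}\|^{2}
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y) \ge 1
\quad \forall i,\; \forall y \ne y_i
```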
13 Problems:
– Requires separability
– What if we have noise in the data?
– What if we only have a simple, limited feature space?
14 Non-separable case
15 Non-separable case
16 Compare with MaxEnt
17 Loss Comparison
18 So far, we considered multiclass classification with 0-1 losses l(y,y')
What if what we want to predict is:
– sequences of POS tags
– syntactic trees
– translations
Multiclass -> Structured
19 Predicting word alignments
20 Predicting Syntactic Trees
21 Structured Models
22 Parsing
23 Max Margin Markov Networks (M3Ns)
Taskar et al., 2003; a similar formulation in Tsochantaridis et al., 2004
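A sketch of the generic loss-augmented max-margin objective behind these methods (the exact factorizations and training algorithms differ between the two papers; this is the standard slack-variable form, not copied from the slide):

```latex
\min_{\mathbf{w},\, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^{2} + C \sum_{i} \xi_i
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y)
\;\ge\; \ell(y_i, y) - \xi_i
\qquad \forall i,\; \forall y \ne y_i
```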
24 Max Margin Markov Networks (M3Ns)
25 MultiClass Classification
Solving MultiClass with binary learning
– MultiClass classifier: a function f : R^d -> {1, 2, 3, ..., k}
– Decompose into binary problems
  – Not always possible to learn
  – Different scale
  – No theoretical justification
(figure: the real problem vs. its binary decompositions)
26 MultiClass Classification
Learning via One-Versus-All (OvA) Assumption
Find v_r, v_b, v_g, v_y in R^n such that:
– v_r · x > 0 iff y = red
– v_b · x > 0 iff y = blue
– v_g · x > 0 iff y = green
– v_y · x > 0 iff y = yellow
Classifier: f(x) = argmax_i v_i · x
(figure: individual classifiers and the resulting decision regions)
H = R^{kn}
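A minimal sketch of OvA with perceptron-style training of each v_i (the training algorithm and data layout here are illustrative assumptions, not from the slide):

```python
import numpy as np

def train_ova(X, y, num_classes, epochs=10):
    """One-versus-All: learn one weight vector v_i per class by treating
    class i as +1 and every other class as -1 (perceptron updates)."""
    V = np.zeros((num_classes, X.shape[1]))
    for _ in range(epochs):
        for x, label in zip(X, y):
            for i in range(num_classes):
                target = 1.0 if label == i else -1.0
                if target * (V[i] @ x) <= 0:   # mistake on the binary problem for class i
                    V[i] += target * x
    return V

def predict_ova(V, x):
    """Winner-take-all: f(x) = argmax_i v_i . x"""
    return int(np.argmax(V @ x))
```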
27 MultiClass Classification
Learning via All-Versus-All (AvA) Assumption
Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy in R^d such that:
– v_rb · x > 0 if y = red, < 0 if y = blue
– v_rg · x > 0 if y = red, < 0 if y = green
– ... (for all pairs)
(figure: individual classifiers and the resulting decision regions)
H = R^{kkn}
How to classify?
28 Classifying with AvA
– Tree
– Majority vote
– Tournament
(example votes: 1 red, 2 yellow, 2 green -> ?)
All of these are applied post-learning and might cause odd behaviour
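A sketch of AvA training plus majority-vote prediction (pairwise perceptrons are an illustrative choice here; any binary learner could be plugged in):

```python
import numpy as np
from itertools import combinations

def train_ava(X, y, num_classes, epochs=10):
    """All-versus-All: one linear separator v_ij per class pair (i, j),
    trained only on the examples of those two classes."""
    V = {}
    for i, j in combinations(range(num_classes), 2):
        v = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x, label in zip(X, y):
                if label not in (i, j):
                    continue
                target = 1.0 if label == i else -1.0   # class i -> +1, class j -> -1
                if target * (v @ x) <= 0:
                    v += target * x
        V[(i, j)] = v
    return V

def predict_ava_majority(V, x, num_classes):
    """Each pairwise classifier casts one vote; predict the most-voted class."""
    votes = np.zeros(num_classes)
    for (i, j), v in V.items():
        votes[i if v @ x > 0 else j] += 1
    return int(np.argmax(votes))
```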
29 POS Tagging English tags
30 POS Tagging: examples from WSJ (from McCallum)
31 POS Tagging
– Ambiguity: not a trivial task
– Useful: important features for later processing steps are based on POS tags
– E.g., use POS tags as input to a parser
32 But still, why so popular?
– Historically the first statistical NLP problem
– Easy to apply arbitrary classifiers, both as sequence models and as independent classifiers
– Can be regarded as a finite-state problem
– Easy to evaluate
– Annotation is cheaper to obtain than treebanks (for other languages)
33 HMM (reminder)
34 HMM (reminder) - transitions
35 Transition Estimates
36 Emission Estimates
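The bodies of the HMM reminder slides are not in the transcript; as a sketch, the standard factorization and the count-based (maximum-likelihood) transition and emission estimates are:

```latex
P(w_1 \dots w_n,\, t_1 \dots t_n) \;=\; \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)

\hat{P}(t_i \mid t_{i-1}) \;=\; \frac{\mathrm{count}(t_{i-1}, t_i)}{\mathrm{count}(t_{i-1})}
\qquad
\hat{P}(w_i \mid t_i) \;=\; \frac{\mathrm{count}(t_i, w_i)}{\mathrm{count}(t_i)}
```

In practice both estimates are smoothed, especially the emission probabilities for rare and unseen words.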
37 MaxEnt (reminder)
38 Decoding: HMM vs MaxEnt
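For the HMM side of this comparison, decoding means running Viterbi over the whole sequence rather than picking each tag independently; a minimal sketch (the dictionary-based `trans` and `emit` tables are illustrative assumptions, not from the slide):

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """argmax over tag sequences of sum_i log P(t_i|t_{i-1}) + log P(w_i|t_i).

    trans[(t_prev, t)] and emit[(t, w)] are probabilities; missing entries count as 0.
    """
    def lp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = best log-score of a tag sequence for words[:i+1] ending in tag t
    best = [{t: lp(trans.get((start, t), 0)) + lp(emit.get((t, words[0]), 0))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            scores = {tp: best[i - 1][tp] + lp(trans.get((tp, t), 0)) for tp in tags}
            tp_best = max(scores, key=scores.get)
            best[i][t] = scores[tp_best] + lp(emit.get((t, words[i]), 0))
            back[i][t] = tp_best

    # backtrace from the best final tag
    last = max(best[-1], key=best[-1].get)
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))
```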
39 Accuracies overview
40 Accuracies overview
41 SVMs for tagging
– We can use SVMs in a similar way to MaxEnt (or other classifiers)
– We can use a window around the word (see the sketch below)
– 97.16% on WSJ
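A minimal sketch of the "window around the word" idea with an off-the-shelf linear SVM (the concrete features below are illustrative, not the feature set of the tagger cited on the next slide):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def window_features(words, i):
    """Features for tagging position i: the word, a +/-2 word window, and a suffix."""
    def w(j):
        return words[j] if 0 <= j < len(words) else "<PAD>"
    return {
        "w0=" + w(i): 1,
        "w-1=" + w(i - 1): 1,
        "w-2=" + w(i - 2): 1,
        "w+1=" + w(i + 1): 1,
        "w+2=" + w(i + 2): 1,
        "suf3=" + w(i)[-3:]: 1,
    }

def train_window_svm(tagged_sentences):
    """tagged_sentences: list of (words, tags) pairs; one training example per token."""
    X, y = [], []
    for words, tags in tagged_sentences:
        for i, tag in enumerate(tags):
            X.append(window_features(words, i))
            y.append(tag)
    vec = DictVectorizer()
    clf = LinearSVC()               # linear SVM, one-vs-rest over the tag set
    clf.fit(vec.fit_transform(X), y)
    return vec, clf
```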
42 SVMs for tagging (from Jimenez & Marquez)
43 No sequence modeling
44 CRFs and other global models
45 CRFs and other global models
46 Compare
– CRFs: no local normalization
– MEMMs: note that after each step t the remaining probability mass cannot be reduced; it can only be distributed among the possible state transitions
– HMMs
(figure: graphical models over words W and tags T)
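To make the local-vs-global normalization point concrete, here is a sketch of the three standard model forms over an observation sequence x and tag sequence y (standard definitions, not copied from the slide):

```latex
% HMM: generative, models the joint distribution
P(\mathbf{x}, \mathbf{y}) \;=\; \prod_i P(y_i \mid y_{i-1})\, P(x_i \mid y_i)

% MEMM: conditional, normalized locally at every position
P(\mathbf{y} \mid \mathbf{x}) \;=\; \prod_i
  \frac{\exp\{\boldsymbol{\lambda}^{\top}\mathbf{f}(y_i, y_{i-1}, \mathbf{x}, i)\}}
       {Z(y_{i-1}, \mathbf{x}, i)}

% CRF: conditional, one global normalizer over all tag sequences
P(\mathbf{y} \mid \mathbf{x}) \;=\;
  \frac{1}{Z(\mathbf{x})}
  \exp\Big\{\sum_i \boldsymbol{\lambda}^{\top}\mathbf{f}(y_i, y_{i-1}, \mathbf{x}, i)\Big\}
```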
47 Label Bias (based on a slide from Joe Drish)
48 Label Bias
Recall transition-based parsing – Nivre's algorithm (with beam search)
– At each step we can observe only local features (limited look-ahead)
– If we later see that the following word is impossible, we can only redistribute probability across the locally (im)possible decisions
– If there are only a few such decisions, we cannot decrease the probability dramatically
So, label bias is likely to be a serious problem if:
– there are non-local dependencies
– states have a small number of possible outgoing transitions
(In the extreme case, a state with a single outgoing transition must give it probability 1, whatever the observation says.)
49 POS Tagging Experiments
– "+" marks an extended feature set (hard to integrate into a generative model)
– oov = out-of-vocabulary
50 Supervision
– So far we considered the supervised case: the training set is labeled
– However, we can try to induce word classes without supervision – unsupervised tagging
– We will later discuss the EM algorithm
– It can also be done in a partly supervised way:
  – seed tags
  – a small labeled dataset
  – a parallel corpus
  – ...
51 Why not predict POS tags and parse trees simultaneously?
– It is possible and often done this way
– Doing the tagging internally often benefits parsing accuracy
– Unfortunately, parsing models are less robust than taggers (e.g., on non-grammatical sentences or in different domains)
– It is more expensive and does not help...
52 Questions
– Why is there no label-bias problem for a generative model (e.g., an HMM)?
– How would you integrate word features into a generative model (e.g., HMMs for POS tagging)? E.g., if the word has: -ing, -s, -ed, -d, -ment, ..., post-, de-, ...
53 "CRFs" for more complex structured output problems
– We considered sequence labeling problems
– There, the structure of dependencies is fixed
– What if we do not know the structure but would like the interactions to respect it?
54 "CRFs" for more complex structured output problems
Recall the MST algorithm for dependency parsing (McDonald and Pereira, 2005)
55 "CRFs" for more complex structured output problems
– Inference can be complex: e.g., arbitrary 2nd-order (non-projective) dependency parsing models are not tractable – NP-complete (McDonald & Pereira, EACL 06)
– Recently, conditional models for constituent parsing: Finkel et al., ACL 08; Carreras et al., CoNLL 08; ...
56 Back to MultiClass
– Let us review how to decompose a multiclass problem into binary classification problems
57 Summary
– Margin-based methods for multiclass classification and structured prediction
– CRFs vs HMMs vs MEMMs for POS tagging
58 Conclusions
– All approaches use a linear representation
– The differences are:
  – the features
  – how the weights are learned
– Training paradigms:
  – global training (CRF, global perceptron)
  – modular training (PMM, MEMM, ...) – these approaches are easier to train, but may require additional mechanisms to enforce global constraints