1 CS546: Machine Learning and Natural Language
Multi-Class and Structured Prediction Problems
Slides from Taskar and Klein are used in this lecture
2 Outline
– Multi-Class classification
– Structured Prediction
– Models for Structured Prediction and Classification: example of POS tagging
3 Multiclass problems
– Most of the machinery we discussed before was focused on binary classification problems, e.g., the SVMs we have covered so far
– However, most problems we encounter in NLP are either:
  – MultiClass: e.g., text categorization
  – Structured Prediction: e.g., predicting the syntactic structure of a sentence
– How do we deal with them?
4 Binary linear classification
5 Multiclass classification
6 Perceptron
7 Structured Perceptron
– Joint feature representation:
– Algorithm:
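The slide body (the joint feature map and the update rule) is not in the transcript; below is a minimal sketch of the structured perceptron, assuming the caller supplies a feature map `phi(x, y)` and a decoder `argmax_decode(x, w)` (e.g., Viterbi for sequences) — both names are placeholders, not from the slides.

```python
def structured_perceptron(train, phi, argmax_decode, epochs=10):
    """Mistake-driven structured perceptron (simplified sketch).

    train         : list of (x, y_gold) pairs
    phi(x, y)     : joint feature map, returns a sparse dict {feature: value}
    argmax_decode : returns argmax_y  w . phi(x, y), e.g. Viterbi for sequences
    """
    w = {}  # sparse weight vector

    def update(feats, scale):
        for f, v in feats.items():
            w[f] = w.get(f, 0.0) + scale * v

    for _ in range(epochs):
        for x, y_gold in train:
            y_hat = argmax_decode(x, w)        # model's current best structure
            if y_hat != y_gold:                # on a mistake ...
                update(phi(x, y_gold), +1.0)   # ... promote gold features
                update(phi(x, y_hat), -1.0)    # ... demote predicted features
    return w
```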
8 Perceptron
9 Binary Classification Margin
10 Generalize to MultiClass
11 Converting to MultiClass SVM
12 Max Margin = Min Norm. As before, these are equivalent formulations:
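As a sketch (standard textbook formulations, not copied from the slide), the two equivalent problems for the multiclass case can be written as:

```latex
% Max margin: fix the norm of w, maximize the margin gamma
\max_{\gamma,\; \mathbf{w} : \|\mathbf{w}\| = 1} \;\gamma
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y) \ge \gamma
\quad \forall i,\; \forall y \ne y_i

% Min norm: fix the margin to 1, minimize the norm of w
\min_{\mathbf{w}} \;\tfrac{1}{2}\|\mathbf{w}\|^{2}
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y) \ge 1
\quad \forall i,\; \forall y \ne y_i
```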
13 Problems:
– Requires separability
– What if we have noise in the data?
– What if we only have a simple, limited feature space?
14 Non-separable case
15 Non-separable case
16 Compare with MaxEnt
17 Loss Comparison
18 So far, we considered multiclass classification with 0-1 losses l(y,y')
What if what we want to predict is:
– sequences of POS tags
– syntactic trees
– translations
Multiclass -> Structured
19 Predicting word alignments
20 Predicting Syntactic Trees
21 Structured Models
22 Parsing
23 Max Margin Markov Networks (M3Ns)
Taskar et al., 2003; a similar formulation in Tsochantaridis et al., 2004
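A sketch of the generic loss-augmented max-margin objective behind these methods (the exact factorizations and training algorithms differ between the two papers; this is the standard slack-variable form, not copied from the slide):

```latex
\min_{\mathbf{w},\, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^{2} + C \sum_{i} \xi_i
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y)
\;\ge\; \ell(y_i, y) - \xi_i
\qquad \forall i,\; \forall y \ne y_i
```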
24 Max Margin Markov Networks (M3Ns)
25 MultiClass Classification
Solving MultiClass with binary learning
– MultiClass classifier: a function f : R^d -> {1, 2, 3, ..., k}
– Decompose into binary problems
  – Not always possible to learn
  – Different scale
  – No theoretical justification
(figure: the real problem vs. its binary decompositions)
26 MultiClass Classification
Learning via One-Versus-All (OvA) Assumption
Find v_r, v_b, v_g, v_y in R^n such that:
– v_r · x > 0 iff y = red
– v_b · x > 0 iff y = blue
– v_g · x > 0 iff y = green
– v_y · x > 0 iff y = yellow
Classifier: f(x) = argmax_i v_i · x
(figure: individual classifiers and the resulting decision regions)
H = R^{kn}
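A minimal sketch of OvA with perceptron-style training of each v_i (the training algorithm and data layout here are illustrative assumptions, not from the slide):

```python
import numpy as np

def train_ova(X, y, num_classes, epochs=10):
    """One-versus-All: learn one weight vector v_i per class by treating
    class i as +1 and every other class as -1 (perceptron updates)."""
    V = np.zeros((num_classes, X.shape[1]))
    for _ in range(epochs):
        for x, label in zip(X, y):
            for i in range(num_classes):
                target = 1.0 if label == i else -1.0
                if target * (V[i] @ x) <= 0:   # mistake on the binary problem for class i
                    V[i] += target * x
    return V

def predict_ova(V, x):
    """Winner-take-all: f(x) = argmax_i v_i . x"""
    return int(np.argmax(V @ x))
```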
27 MultiClass Classification
Learning via All-Versus-All (AvA) Assumption
Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy in R^d such that:
– v_rb · x > 0 if y = red, < 0 if y = blue
– v_rg · x > 0 if y = red, < 0 if y = green
– ... (for all pairs)
(figure: individual classifiers and the resulting decision regions)
H = R^{kkn}
How to classify?
28 Classifying with AvA
– Tree
– Majority vote
– Tournament
(example votes: 1 red, 2 yellow, 2 green -> ?)
All of these are applied post-learning and might cause odd behaviour
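A sketch of AvA training plus majority-vote prediction (pairwise perceptrons are an illustrative choice here; any binary learner could be plugged in):

```python
import numpy as np
from itertools import combinations

def train_ava(X, y, num_classes, epochs=10):
    """All-versus-All: one linear separator v_ij per class pair (i, j),
    trained only on the examples of those two classes."""
    V = {}
    for i, j in combinations(range(num_classes), 2):
        v = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x, label in zip(X, y):
                if label not in (i, j):
                    continue
                target = 1.0 if label == i else -1.0   # class i -> +1, class j -> -1
                if target * (v @ x) <= 0:
                    v += target * x
        V[(i, j)] = v
    return V

def predict_ava_majority(V, x, num_classes):
    """Each pairwise classifier casts one vote; predict the most-voted class."""
    votes = np.zeros(num_classes)
    for (i, j), v in V.items():
        votes[i if v @ x > 0 else j] += 1
    return int(np.argmax(votes))
```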
29 POS Tagging English tags
30 POS Tagging: examples from WSJ (from McCallum)
31 POS Tagging
– Ambiguity: not a trivial task
– Useful: important features for later processing steps are based on POS tags
– E.g., use POS tags as input to a parser
32 But still, why so popular?
– Historically the first statistical NLP problem
– Easy to apply arbitrary classifiers, both as sequence models and as independent classifiers
– Can be regarded as a finite-state problem
– Easy to evaluate
– Annotation is cheaper to obtain than treebanks (for other languages)
33 HMM (reminder)
34 HMM (reminder) - transitions
35 Transition Estimates
36 Emission Estimates
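The bodies of the HMM reminder slides are not in the transcript; as a sketch, the standard factorization and the count-based (maximum-likelihood) transition and emission estimates are:

```latex
P(w_1 \dots w_n,\, t_1 \dots t_n) \;=\; \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)

\hat{P}(t_i \mid t_{i-1}) \;=\; \frac{\mathrm{count}(t_{i-1}, t_i)}{\mathrm{count}(t_{i-1})}
\qquad
\hat{P}(w_i \mid t_i) \;=\; \frac{\mathrm{count}(t_i, w_i)}{\mathrm{count}(t_i)}
```

In practice both estimates are smoothed, especially the emission probabilities for rare and unseen words.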
37 MaxEnt (reminder)
38 Decoding: HMM vs MaxEnt
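For the HMM side of this comparison, decoding means running Viterbi over the whole sequence rather than picking each tag independently; a minimal sketch (the dictionary-based `trans` and `emit` tables are illustrative assumptions, not from the slide):

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """argmax over tag sequences of sum_i log P(t_i|t_{i-1}) + log P(w_i|t_i).

    trans[(t_prev, t)] and emit[(t, w)] are probabilities; missing entries count as 0.
    """
    def lp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = best log-score of a tag sequence for words[:i+1] ending in tag t
    best = [{t: lp(trans.get((start, t), 0)) + lp(emit.get((t, words[0]), 0))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            scores = {tp: best[i - 1][tp] + lp(trans.get((tp, t), 0)) for tp in tags}
            tp_best = max(scores, key=scores.get)
            best[i][t] = scores[tp_best] + lp(emit.get((t, words[i]), 0))
            back[i][t] = tp_best

    # backtrace from the best final tag
    last = max(best[-1], key=best[-1].get)
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))
```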
39 Accuracies overview
40 Accuracies overview
41 SVMs for tagging
– We can use SVMs in a similar way to MaxEnt (or other classifiers)
– We can use a window around the word (see the sketch below)
– 97.16% on WSJ
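A minimal sketch of the "window around the word" idea with an off-the-shelf linear SVM (the concrete features below are illustrative, not the feature set of the tagger cited on the next slide):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def window_features(words, i):
    """Features for tagging position i: the word, a +/-2 word window, and a suffix."""
    def w(j):
        return words[j] if 0 <= j < len(words) else "<PAD>"
    return {
        "w0=" + w(i): 1,
        "w-1=" + w(i - 1): 1,
        "w-2=" + w(i - 2): 1,
        "w+1=" + w(i + 1): 1,
        "w+2=" + w(i + 2): 1,
        "suf3=" + w(i)[-3:]: 1,
    }

def train_window_svm(tagged_sentences):
    """tagged_sentences: list of (words, tags) pairs; one training example per token."""
    X, y = [], []
    for words, tags in tagged_sentences:
        for i, tag in enumerate(tags):
            X.append(window_features(words, i))
            y.append(tag)
    vec = DictVectorizer()
    clf = LinearSVC()               # linear SVM, one-vs-rest over the tag set
    clf.fit(vec.fit_transform(X), y)
    return vec, clf
```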
42 SVMs for tagging (from Jimenez & Marquez)
43 No sequence modeling
44 CRFs and other global models
45 CRFs and other global models
46 Compare
– CRFs: no local normalization
– MEMMs: note that after each step t the remaining probability mass cannot be reduced; it can only be distributed among the possible state transitions
– HMMs
(figure: graphical models over words W and tags T)
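To make the local-vs-global normalization point concrete, here is a sketch of the three standard model forms over an observation sequence x and tag sequence y (standard definitions, not copied from the slide):

```latex
% HMM: generative, models the joint distribution
P(\mathbf{x}, \mathbf{y}) \;=\; \prod_i P(y_i \mid y_{i-1})\, P(x_i \mid y_i)

% MEMM: conditional, normalized locally at every position
P(\mathbf{y} \mid \mathbf{x}) \;=\; \prod_i
  \frac{\exp\{\boldsymbol{\lambda}^{\top}\mathbf{f}(y_i, y_{i-1}, \mathbf{x}, i)\}}
       {Z(y_{i-1}, \mathbf{x}, i)}

% CRF: conditional, one global normalizer over all tag sequences
P(\mathbf{y} \mid \mathbf{x}) \;=\;
  \frac{1}{Z(\mathbf{x})}
  \exp\Big\{\sum_i \boldsymbol{\lambda}^{\top}\mathbf{f}(y_i, y_{i-1}, \mathbf{x}, i)\Big\}
```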
47 Label Bias (based on a slide from Joe Drish)
48 Label Bias
Recall transition-based parsing – Nivre's algorithm (with beam search)
– At each step we can observe only local features (limited look-ahead)
– If we later see that the following word is impossible, we can only redistribute probability across the locally (im)possible decisions
– If there are only a few such decisions, we cannot decrease the probability dramatically
So, label bias is likely to be a serious problem if:
– there are non-local dependencies
– states have a small number of possible outgoing transitions
(In the extreme case, a state with a single outgoing transition must give it probability 1, whatever the observation says.)
49 POS Tagging Experiments
– "+" marks an extended feature set (hard to integrate into a generative model)
– oov = out-of-vocabulary
50 Supervision
– So far we considered the supervised case: the training set is labeled
– However, we can try to induce word classes without supervision – unsupervised tagging
– We will later discuss the EM algorithm
– It can also be done in a partly supervised way:
  – seed tags
  – a small labeled dataset
  – a parallel corpus
  – ...
51 Why not predict POS tags and parse trees simultaneously?
– It is possible and often done this way
– Doing the tagging internally often benefits parsing accuracy
– Unfortunately, parsing models are less robust than taggers (e.g., on non-grammatical sentences or in different domains)
– It is more expensive and does not help...
52 Questions
– Why is there no label-bias problem for a generative model (e.g., an HMM)?
– How would you integrate word features into a generative model (e.g., HMMs for POS tagging)? E.g., if the word has: -ing, -s, -ed, -d, -ment, ..., post-, de-, ...
53 "CRFs" for more complex structured output problems
– We considered sequence labeling problems
– There, the structure of dependencies is fixed
– What if we do not know the structure but would like the interactions to respect it?
54 "CRFs" for more complex structured output problems
Recall the MST algorithm for dependency parsing (McDonald and Pereira, 2005)
55 "CRFs" for more complex structured output problems
– Inference can be complex: e.g., arbitrary 2nd-order (non-projective) dependency parsing models are not tractable – NP-complete (McDonald & Pereira, EACL 06)
– Recently, conditional models for constituent parsing: Finkel et al., ACL 08; Carreras et al., CoNLL 08; ...
56 Back to MultiClass
– Let us review how to decompose a multiclass problem into binary classification problems
57 Summary
– Margin-based methods for multiclass classification and structured prediction
– CRFs vs HMMs vs MEMMs for POS tagging
58 Conclusions
– All approaches use a linear representation
– The differences are:
  – the features
  – how the weights are learned
– Training paradigms:
  – global training (CRF, global perceptron)
  – modular training (PMM, MEMM, ...) – these approaches are easier to train, but may require additional mechanisms to enforce global constraints