1
Discriminative Methods with Structure
Simon Lacoste-Julien, UC Berkeley
joint work with: Fei Sha, Ben Taskar, Dan Klein, Mike Jordan
March 21, 2008
2
« Discriminative method »
Decision-theoretic framework:
Loss: ℓ(y, ŷ)
Decision function: h_w(x) = argmax_y s_w(x, y)
Risk: R(h) = E[ℓ(y, h(x))]
Contrast function: a tractable surrogate of the risk, minimized on the training data
3
« with structure » on outputs:
Handwriting recognition: input = image of a handwritten word → output = the string "brace" (the output space is huge!)
Machine translation: input = "Ce n'est pas un autre problème de classification." → output = "This is not another classification problem."
4
« with structure » on inputs:
text documents → latent variable model → new representation → classification
5
Structure on outputs: Discriminative Word Alignment project (joint work with Ben Taskar, Dan Klein and Mike Jordan)
6
Word Alignment
x: What is the anticipated cost of collecting fees under the new proposal?
y: En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?
[Figure: alignment matrix between the tokenized English sentence (index j) and the tokenized French sentence (index k)]
Key step in most machine translation systems
7
Overview
- Review of large-margin word alignment [Taskar et al., EMNLP 05]
- Two new extensions to the basic model: fertility features, and first-order interactions via quadratic assignment
- Results on the Hansards dataset
8
Feature-Based Alignment
Features for a candidate edge between English position j and French position k:
- Association: MI = 3.2, Dice = 4.1
- Lexical pair: ID(proposal, proposition) = 1
- Position in sentence: AbsDist = 5, RelDist = 0.3
- Orthography: ExactMatch = 0, Similarity = 0.8
- Resources: PairInDictionary
- Other models: IBM2, IBM4
[Figure: alignment matrix between the English sentence (index j) and the French sentence (index k)]
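As an aside, a minimal sketch of how one of these association features could be computed from sentence-aligned data; the Dice coefficient below is the standard 2·cooc/(count_e + count_f) definition, and the function name and toy bitext are illustrative, not from the talk:

```python
from collections import Counter
from itertools import product

def dice_scores(bitext):
    """Dice association scores for all co-occurring word pairs.

    bitext: iterable of (english_tokens, french_tokens) sentence pairs.
    Dice(e, f) = 2 * cooc(e, f) / (count(e) + count(f)).
    """
    count_e, count_f, cooc = Counter(), Counter(), Counter()
    for e_sent, f_sent in bitext:
        for e in set(e_sent):
            count_e[e] += 1
        for f in set(f_sent):
            count_f[f] += 1
        for e, f in product(set(e_sent), set(f_sent)):
            cooc[e, f] += 1
    return {(e, f): 2.0 * c / (count_e[e] + count_f[f])
            for (e, f), c in cooc.items()}

bitext = [("the cost is high".split(), "le coût est élevé".split())]
print(dice_scores(bitext)[("cost", "coût")])
```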
9
Scoring Whole Alignments
The score of a full alignment decomposes over the edges it selects:
s_w(x, y) = Σ_{j,k} y_{jk} · w·f(x, j, k)
[Figure: alignment matrix between the English sentence (index j) and the French sentence (index k)]
10
Prediction as a Linear Program
maximize Σ_{j,k} s_{jk} y_{jk}
subject to the degree constraints Σ_k y_{jk} ≤ 1 and Σ_j y_{jk} ≤ 1, with the relaxation 0 ≤ y_{jk} ≤ 1
The LP relaxation is still guaranteed to have integral solutions y (the matching constraint matrix is totally unimodular).
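A minimal sketch of this relaxed alignment LP, written with scipy as a stand-in solver (the talk used Mosek); the toy score matrix is illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def align_lp(scores):
    """Solve the relaxed alignment LP: max <scores, y> subject to
    row/column degree constraints and 0 <= y <= 1.
    scores: (m, n) matrix of edge scores w . f(x, j, k)."""
    m, n = scores.shape
    c = -scores.ravel()                       # linprog minimizes
    A = []
    for j in range(m):                        # sum_k y[j,k] <= 1
        row = np.zeros(m * n); row[j*n:(j+1)*n] = 1; A.append(row)
    for k in range(n):                        # sum_j y[j,k] <= 1
        col = np.zeros(m * n); col[k::n] = 1; A.append(col)
    res = linprog(c, A_ub=np.array(A), b_ub=np.ones(m + n), bounds=(0, 1))
    return res.x.reshape(m, n)

scores = np.array([[2.0, 0.1], [0.3, 1.5], [0.2, 0.4]])
print(align_lp(scores).round(2))  # integral: the constraint matrix is totally unimodular
```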
11
Learning w
Supervised training data. Training methods:
- Maximum likelihood/entropy
- Perceptron
- Maximum margin
12
Maximum Likelihood/Entropy
Probabilistic approach: p_w(y | x) = exp(w·f(x, y)) / Z_w(x)
Problem: the denominator Z_w(x) sums over all matchings, and computing it is #P-complete [Valiant 79, Jerrum & Sinclair 93]
→ can't find the maximum-likelihood parameters
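To make the intractability concrete: for perfect matchings the partition function is the permanent of exp(scores), and computing the permanent is exactly the #P-complete problem cited above. A brute-force sketch (illustrative, not a practical method):

```python
import itertools
import numpy as np

def permanent(A):
    """Brute-force matrix permanent: sum over all permutations.
    The partition function over perfect matchings is per(exp(scores)),
    and computing the permanent is #P-complete."""
    n = A.shape[0]
    return sum(np.prod([A[i, p[i]] for i in range(n)])
               for p in itertools.permutations(range(n)))

scores = np.random.rand(6, 6)
Z = permanent(np.exp(scores))   # already 720 terms at n=6; grows as n!
print(Z)
```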
13
(Averaged) Perceptron
Perceptron for structured outputs [Collins 2002]. For each example (x, y*):
Predict: ŷ = argmax_y w·f(x, y)
Update: w ← w + f(x, y*) − f(x, ŷ)
Output averaged parameters: the average of w over all updates
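A minimal runnable sketch of the averaged structured perceptron, assuming the caller supplies `feature_fn` and a `decode` routine (for word alignment, the LP above):

```python
import numpy as np

def averaged_perceptron(examples, feature_fn, decode, n_feats, epochs=5):
    """Structured averaged perceptron [Collins 2002].

    examples:   list of (x, y_true) pairs
    feature_fn: maps (x, y) to a feature vector of length n_feats
    decode:     maps (x, w) to the highest-scoring y
    """
    w = np.zeros(n_feats)
    w_sum = np.zeros(n_feats)
    for _ in range(epochs):
        for x, y_true in examples:
            y_hat = decode(x, w)                               # predict
            w += feature_fn(x, y_true) - feature_fn(x, y_hat)  # update
            w_sum += w
    return w_sum / (epochs * len(examples))                    # averaged parameters
```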
14
Large Margin Estimation
Equivalent min-max formulation [Taskar et al. 04, 05]:
min_w ½‖w‖²  s.t.  w·f(x, y*) ≥ max_y [ w·f(x, y) + ℓ(y*, y) ]
(true score ≥ other score + loss)
The inner maximization is a simple LP.
15
Min-max formulation → QP
By LP duality, the inner maximization can be replaced by its dual, turning the min-max problem into a single QP of polynomial size, solvable with Mosek.
16
Experimental Setup
French-Canadian Hansards corpus:
- Word-level aligned: 200 sentence pairs (training), 37 (validation), 247 (test)
- Sentence-level aligned: 1M sentence pairs, used to generate association-based features and to learn unsupervised IBM models
Learn parameters using the large-margin method.
Evaluate alignment quality using the standard AER (Alignment Error Rate) [similar to F1].
17
Old Results (200 train / 247 test split)
Model                       AER   Prec / Rec
IBM model 4 (intersected)   6.5   98 / 88%
Basic                       8.2   93 / 90%
+ model 4                   5.1   98 / 92%
18
Improving the basic model
We would like to model:
- Fertility: alignments are not necessarily 1-to-1
- First-order interactions: alignments are mostly locally diagonal, so we would like the score of an edge y_{jk} to depend on its neighbors
Strategy: extensions that keep the prediction model an LP
19
Modeling Fertility
Relax the degree constraints so a word may align to several words, paying a learned fertility penalty for each alignment beyond the first.
Example of node feature: for word w, the fraction of its occurrences in the training set with fertility > k.
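A sketch of how that node feature could be computed; the data layout (per-sentence dictionaries mapping word positions to sets of aligned positions) is an assumption for illustration:

```python
from collections import defaultdict

def fertility_feature(alignments, k=1):
    """For each word, the fraction of its training occurrences where it
    aligned to more than k words (fertility > k).
    alignments: list of (english_tokens, {english_index: set_of_french_indices})."""
    occ = defaultdict(int)
    high = defaultdict(int)
    for tokens, align in alignments:
        for j, word in enumerate(tokens):
            occ[word] += 1
            if len(align.get(j, ())) > k:
                high[word] += 1
    return {w: high[w] / occ[w] for w in occ}
```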
20
Fertility Results (200 train / 247 test split)
Model                       AER   Prec / Rec
IBM model 4 (intersected)   6.5   98 / 88%
Basic                       8.2   93 / 90%
+ model 4                   5.1   98 / 92%
+ model 4 + fertility       4.9   96 / 94%
22
Fertility example
[Figure: example alignment; legend: sure alignments, possible alignments, predicted alignments]
23
Modeling First-Order Effects
Restrict the pairwise interactions to three patterns: monotonicity, local inversion, and local fertility.
Want: product terms y_{jk} · y_{lm} for neighboring edges in the objective.
Relaxation: introduce a variable z_{jk,lm} for each product, with the linear constraints z_{jk,lm} ≤ y_{jk} and z_{jk,lm} ≤ y_{lm}.
24
Integer Program
The result is a quadratic assignment problem, NP-complete in general; on real-world sentences (2 to 30 words, ~1k variables) it takes a few seconds using Mosek. Interestingly, on our dataset 80% of examples yield an integer solution when solved via the linear relaxation, and using the relaxation gives the same AER!
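A sketch of the linearized quadratic-assignment relaxation, again using scipy in place of Mosek; only the diagonal-neighbor interaction with a single nonnegative weight `diag_bonus` is modeled here, a simplification of the monotonicity/inversion/fertility patterns in the talk:

```python
import numpy as np
from scipy.optimize import linprog

def qap_relaxation(scores, diag_bonus):
    """LP relaxation of max sum s[j,k] y[j,k] + diag_bonus * sum z[j,k],
    where z[j,k] stands for y[j,k] * y[j+1,k+1] and is linearized as
    z <= y[j,k], z <= y[j+1,k+1] (sufficient because diag_bonus >= 0
    and we maximize)."""
    m, n = scores.shape
    ny, nz = m * n, (m - 1) * (n - 1)
    c = -np.concatenate([scores.ravel(), np.full(nz, diag_bonus)])
    def yi(j, k): return j * n + k
    def zi(j, k): return ny + j * (n - 1) + k
    A, b = [], []
    for j in range(m):                       # degree: sum_k y[j,k] <= 1
        row = np.zeros(ny + nz); row[j*n:(j+1)*n] = 1; A.append(row); b.append(1)
    for k in range(n):                       # degree: sum_j y[j,k] <= 1
        row = np.zeros(ny + nz); row[k:ny:n] = 1; A.append(row); b.append(1)
    for j in range(m - 1):
        for k in range(n - 1):               # z <= both incident y's
            for y_idx in (yi(j, k), yi(j + 1, k + 1)):
                row = np.zeros(ny + nz); row[zi(j, k)] = 1; row[y_idx] = -1
                A.append(row); b.append(0)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=(0, 1))
    return res.x[:ny].reshape(m, n)

scores = np.array([[1.0, 0.1, 0.0], [0.1, 1.0, 0.1], [0.0, 0.1, 1.0]])
print(qap_relaxation(scores, diag_bonus=0.5).round(2))
```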
25
New Results (200 train / 247 test split)
Model                                 AER   Prec / Rec
IBM model 4 (intersected)             6.5   98 / 88%
Basic                                 8.2   93 / 90%
+ model 4                             5.1   98 / 92%
Basic + fertility + qap               6.1   94 / 93%
+ fertility + qap + model 4           4.3   96 / 95%
+ fertility + qap + model 4 + liang   3.8   97 / 96%
29
Fertility + QAP example
31
Conclusions
- Feature-based word alignment with efficient algorithms for supervised learning
- Exploits unsupervised data via features and other models; surprisingly accurate with simple features
- Including the fertility model and first-order interactions gives a 38% AER reduction over intersected Model 4, the lowest published AER on this dataset
- High-recall alignments → promising for MT
32
Structure on inputs: discLDA project (work in progress) (joint work with Fei Sha and Mike Jordan)
33
Unsupervised dimensionality reduction:
text documents → latent variable model → new representation → classification
34
Analogy: PCA vs. FDA
[Figure: two class-labeled point clouds (x and o); the PCA direction follows the axis of maximum variance, while the FDA direction separates the two classes]
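The analogy can be reproduced in a few lines; a sketch using scikit-learn on synthetic two-class data (the data and seed are of course illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two elongated, parallel class clusters: the direction of maximum
# variance (PCA) is nearly orthogonal to the direction that best
# separates the classes (FDA).
rng = np.random.default_rng(0)
cov = [[5.0, 0.0], [0.0, 0.2]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, 100),
               rng.multivariate_normal([0, 1.5], cov, 100)])
y = np.array([0] * 100 + [1] * 100)

print("PCA direction:", PCA(n_components=1).fit(X).components_[0])
print("FDA direction:", LinearDiscriminantAnalysis(n_components=1)
      .fit(X, y).scalings_[:, 0])
```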
35
Goal: supervised dimensionality reduction
text documents → latent variable model with supervised information → new representation → classification
36
Review: the LDA model
For each document, draw topic proportions θ ~ Dirichlet(α); for each token, draw a topic z ~ Mult(θ) and then a word w ~ Mult(Φ_z).
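For reference, a minimal sketch of the LDA generative process just described:

```python
import numpy as np

def lda_generate(alpha, phi, doc_len, rng=np.random.default_rng()):
    """Sample one document from LDA: theta ~ Dirichlet(alpha); for each
    token: z ~ Mult(theta), w ~ Mult(phi[z]).
    phi is a (K, V) topic-word matrix with rows summing to 1."""
    theta = rng.dirichlet(alpha)
    z = rng.choice(len(alpha), size=doc_len, p=theta)
    return [rng.choice(phi.shape[1], p=phi[k]) for k in z]

phi = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])  # 2 topics, 3 words
print(lda_generate(alpha=[1.0, 1.0], phi=phi, doc_len=8))
```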
37
Discriminative version of LDA
Ultimately, we want to learn the model discriminatively → but that objective is high-dimensional and non-convex, hard to optimize!
Instead, we propose to learn a class-dependent linear transformation T^y of the common topic proportions θ.
New generative model: z ~ Mult(T^y θ).
Equivalently, a transformation on the topic matrix Φ.
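And the corresponding sketch for discLDA, where the only change from LDA is the class-dependent transformation T^y applied to the shared θ (taking each T^y to be column-stochastic is an assumption consistent with the transformation-of-θ view):

```python
import numpy as np

def disclda_generate(alpha, phi, T, y, doc_len, rng=np.random.default_rng()):
    """discLDA sketch: the class label y selects a linear map T[y]
    (columns summing to 1) from the shared low-dimensional theta into
    the full topic simplex; tokens are then generated as in LDA."""
    theta = rng.dirichlet(alpha)          # shared, class-independent
    mix = T[y] @ theta                    # class-dependent topic mixture
    z = rng.choice(phi.shape[0], size=doc_len, p=mix)
    return [rng.choice(phi.shape[1], p=phi[k]) for k in z]

phi = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]])  # 3 topics
T = [np.array([[1., 0.], [0., 1.], [0., 0.]]),   # class 0 uses topics 0, 1
     np.array([[0., 0.], [0., 1.], [1., 0.]])]   # class 1 uses topics 2, 1
print(disclda_generate([1.0, 1.0], phi, T, y=1, doc_len=5))
```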
38
Simplex Geometry
[Figure: class-labeled documents (x and o) on the word simplex (w1, w2, w3), and the corresponding picture on the topic simplex]
39
Interpretation 1
Shared topics vs. class-specific topics: the block structure of T lets some topics be shared by all classes while others belong to a single class.
40
Interpretation 2
Equivalent generative model that absorbs T: add a new latent variable u, drawn as u ~ Mult(θ), and then draw z from the u-th column of T^y.
41
Compare with the AT model
[Figure: graphical models of the Author-Topic model [Rosen-Zvi et al. 2004] and of discLDA, side by side]
42
Inference and learning
43
Learning
- For fixed T, learn Φ by sampling (z, u) [Rao-Blackwellized Gibbs sampling]
- For fixed Φ, update T using stochastic gradient ascent on the conditional log-likelihood log p(y | w), in an online fashion:
  - get an approximate gradient using Monte Carlo EM
  - use the Harmonic Mean estimator to estimate the required likelihoods
Currently, results are noisy…
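A sketch of the alternating scheme, with the Gibbs sampling and Monte Carlo EM folded into an abstract `grad_estimator` callable; this is a stand-in skeleton under those assumptions, not the authors' implementation:

```python
import numpy as np

def learn_disclda(docs, labels, T, phi, grad_estimator, n_steps, lr=0.01):
    """Online stochastic-gradient loop for T, as on the slide.
    grad_estimator(doc, label, T, phi) is assumed to return a Monte
    Carlo estimate of the gradient of log p(y | w) with respect to T
    (a stand-in for the Monte Carlo EM procedure)."""
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        i = rng.integers(len(docs))                     # online: one doc at a time
        g = grad_estimator(docs[i], labels[i], T, phi)  # noisy gradient
        T = np.clip(T + lr * g, 0, None)                # ascent step, stay nonnegative
        T /= T.sum(axis=0, keepdims=True)               # keep columns stochastic
    return T
```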
44
Inference (dimensionality reduction)
Given learned T and Φ:
- estimate the per-class document likelihood using the Harmonic Mean estimator
- compute the new representation of the document by marginalizing over y
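A sketch of the Harmonic Mean estimator itself, which approximates the evidence p(w) from posterior samples; it is notoriously high-variance, consistent with the noisy results mentioned above:

```python
import numpy as np

def harmonic_mean_log_evidence(log_liks):
    """Harmonic mean estimate of log p(w):
    p(w) ~= ( (1/S) * sum_s 1 / p(w | z_s) )^{-1} over posterior samples z_s.
    log_liks: array of log p(w | z_s) for S Gibbs samples."""
    log_liks = np.asarray(log_liks)
    # log-mean-exp of -log_liks, computed stably, then negated
    m = (-log_liks).max()
    return -(m + np.log(np.mean(np.exp(-log_liks - m))))
```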
45
Preliminary Experiments
46
20 Newsgroups dataset
- Used a fixed T mixing shared and class-specific topics, hence 110 topics in total
- Get the reduced representation → train a linear SVM on it
- 11k training / 7.5k test documents; vocabulary: 50k words
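A sketch of a fixed block-structured T consistent with the slide's 110 topics; the particular split (10 shared topics plus 5 class-specific topics per class, 20 × 5 + 10 = 110) is an assumption, not stated in the talk:

```python
import numpy as np

def block_T(n_classes, k_shared, k_class):
    """Fixed block-structured transformations: every class keeps the
    k_shared shared topics and gets its own block of k_class
    class-specific topics (n_classes * k_class + k_shared in total).
    The split used here is an assumption for illustration."""
    K = n_classes * k_class + k_shared
    Ts = []
    for y in range(n_classes):
        T = np.zeros((K, k_class + k_shared))
        T[y * k_class:(y + 1) * k_class, :k_class] = np.eye(k_class)
        T[n_classes * k_class:, k_class:] = np.eye(k_shared)
        Ts.append(T)
    return Ts

Ts = block_T(n_classes=20, k_shared=10, k_class=5)
print(Ts[0].shape)   # (110, 15): 20*5 + 10 = 110 topics, as on the slide
```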
47
Classification results
- discLDA + SVM: 20% error
- LDA + SVM: 25% error
- discLDA predictions: 20% error
48
Newsgroup embedding (LDA)
49
Newsgroup embedding (discLDA)
50
using t-SNE (on discLDA); thanks to Laurens van der Maaten [Hinton's group] for the figure!
51
using t-SNE (on LDA); thanks to Laurens van der Maaten [Hinton's group] for the figure!
52
Learned topics
53
Another embedding: NIPS papers vs. Psychology abstracts
[Figure: LDA embedding vs. discLDA embedding]
54
13 scenes dataset [Fei-Fei 2005]
train: 100 images per category; test: 2,558 images
55
Vocabulary (visual words)
56
Topics
57
Conclusion
- A fixed transformation T enables topic sharing and exploration
- We get a reduced representation that preserves predictive power
- Gradient estimates are still noisy; this is work in progress, and we will probably try a variational approach instead