
1 Discriminative Methods with Structure. Simon Lacoste-Julien, UC Berkeley. Joint work with: Fei Sha, Ben Taskar, Dan Klein, Mike Jordan. March 21, 2008.

2 « Discriminative method » Decision-theoretic framework: loss, decision function, risk, contrast function.
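The formulas on this slide did not survive the transcript; the following is a generic decision-theoretic setup in my own notation (an assumption, not the slide's exact equations):

```latex
% Loss, decision function, risk, and an empirical contrast function (surrogate).
\ell(y, \hat y) \quad\text{(loss)}, \qquad
h_w(x) = \arg\max_{y \in \mathcal{Y}} w^\top f(x, y) \quad\text{(decision function)},
\qquad
R(w) = \mathbb{E}_{(x,y)\sim P}\big[\ell\big(y, h_w(x)\big)\big]
\;\approx\; \frac{1}{n}\sum_{i=1}^{n} \tilde\ell(w; x_i, y_i),
```

where the contrast function \(\tilde\ell\) is a tractable surrogate for the loss, e.g. the log-loss or the structured hinge loss used later in the talk.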

3 « with structure » on outputs: Handwriting recognition: input (an image of a handwritten word) → output ("brace"); the output space is huge! Machine translation: "Ce n'est pas un autre problème de classification." → "This is not another classification problem."

4 « with structure » on inputs: text documents → latent variable model → new representation → classification.

5 Structure on outputs: Discriminative Word Alignment project (joint work with Ben Taskar, Dan Klein and Mike Jordan)

6 Word Alignment. x: "What is the anticipated cost of collecting fees under the new proposal?" y: "En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?" [Figure: word-to-word alignment between the tokenized sentences "What is the anticipated cost of collecting fees under the new proposal ?" and "En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?", with positions indexed j and k.] Key step in most machine translation systems.

7 Overview: review of large-margin word alignment [Taskar et al., EMNLP 05]; two new extensions to the basic model: fertility features, and first-order interactions using quadratic assignment; results on the Hansards dataset.

8 Feature-Based Alignment. Features for a candidate link (j, k): Association (MI = 3.2, Dice = 4.1); Lexical pair (ID(proposal, proposition) = 1); Position in sentence (AbsDist = 5, RelDist = 0.3); Orthography (ExactMatch = 0, Similarity = 0.8); Resources (PairInDictionary); Other models (IBM2, IBM4). [Figure: the example sentence pair with the candidate link (j, k) highlighted.]
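A minimal sketch of what such a per-link feature vector might look like; the feature names mirror the slide, but the exact definitions, tables, and helper arguments are illustrative assumptions rather than the paper's implementation:

```python
# Hypothetical per-link features for a candidate alignment link (j, k).
def link_features(src_tokens, tgt_tokens, j, k, mi_table, dice_table, dictionary):
    s, t = src_tokens[j], tgt_tokens[k]
    return {
        "MI": mi_table.get((s, t), 0.0),                  # mutual-information association score
        "Dice": dice_table.get((s, t), 0.0),              # Dice-coefficient association score
        f"ID({s},{t})": 1.0,                              # indicator for this specific word pair
        "AbsDist": abs(j - k),                            # absolute position difference
        "RelDist": abs(j / len(src_tokens) - k / len(tgt_tokens)),  # relative position difference
        "ExactMatch": float(s.lower() == t.lower()),      # orthographic identity
        "PairInDictionary": float((s, t) in dictionary),  # external bilingual dictionary resource
    }

# Example: features for linking English position 3 to French position 3.
feats = link_features("what is the cost".split(), "quel est le coût".split(), 3, 3, {}, {}, set())
```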

9 Scoring Whole Alignments: the score of an alignment is the sum of the scores of its individual links (j, k). [Figure: the example sentence pair with all links scored.]

10 Prediction as a Linear Program. Maximize the total link score subject to degree constraints (each word takes part in at most one link), relaxing each y_jk from {0, 1} to the interval [0, 1]; the relaxation is still guaranteed to have integral solutions. [Figure: the example sentence pair with link variables indexed by j, k.]
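A sketch of this LP relaxation, written with scipy for concreteness; this is an illustrative formulation of the matching LP, not the paper's code (which used Mosek):

```python
# Alignment prediction as an LP: maximize sum_jk scores[j,k] * y_jk
# subject to degree constraints, with y_jk relaxed to [0, 1].
import numpy as np
from scipy.optimize import linprog

def predict_alignment(scores):
    """scores: (m, n) array with scores[j, k] = w . f(x, j, k)."""
    m, n = scores.shape
    c = -scores.ravel()                       # linprog minimizes, so negate the scores
    A, b = [], []
    for j in range(m):                        # each source word in at most one link
        row = np.zeros(m * n)
        row[j * n:(j + 1) * n] = 1.0
        A.append(row); b.append(1.0)
    for k in range(n):                        # each target word in at most one link
        row = np.zeros(m * n)
        row[k::n] = 1.0
        A.append(row); b.append(1.0)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=(0, 1), method="highs")
    return res.x.reshape(m, n)                # integral (0/1) at the optimum for this matching polytope

# y = predict_alignment(np.random.randn(4, 5))
```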

11 Learning w. Supervised training data. Training methods: maximum likelihood/entropy, perceptron, maximum margin.

12 Maximum Likelihood/Entropy. Probabilistic approach: model the conditional probability of an alignment as proportional to its exponentiated score. Problem: the normalizing denominator (the sum over all alignments) is #P-complete [Valiant 79, Jerrum & Sinclair 93], so we can't find the maximum likelihood parameters.

13 (Averaged) Perceptron. Perceptron for structured outputs [Collins 2002]: for each example, predict the highest-scoring alignment under the current weights, then update the weights toward the features of the true alignment and away from those of the prediction; output the averaged parameters.
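A minimal sketch of the averaged structured perceptron; `phi` (the joint feature map) and `predict` (the decoder, e.g. the LP above) are placeholders the caller supplies, not names from the paper:

```python
# Averaged structured perceptron [Collins 2002], generic over the decoder.
import numpy as np

def averaged_perceptron(data, phi, predict, dim, epochs=5):
    """data: list of (x, y_true); phi(x, y) -> np.array of length dim;
    predict(x, w) -> argmax_y of w . phi(x, y)."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    t = 0
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = predict(x, w)                      # decode with current weights
            if y_hat != y_true:
                w += phi(x, y_true) - phi(x, y_hat)    # move toward truth, away from prediction
            w_sum += w                                 # accumulate for averaging
            t += 1
    return w_sum / t                                   # averaged parameters
```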

14 Large Margin Estimation. Require the true alignment's score to beat every other alignment's score by a margin that grows with that alignment's loss; this has an equivalent min-max formulation [Taskar et al. 04, 05] whose inner maximization is a simple LP.

15 Min-max formulation → QP. Replacing the inner LP by its dual (LP duality) turns the saddle-point problem into a single QP of polynomial size! Solved with Mosek.
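The equations on slides 14–15 were lost in the transcript; the following is a hedged reconstruction of the standard min-max large-margin objective in my notation:

```latex
\min_w \;\frac{\lambda}{2}\|w\|^2
  + \sum_i \Big( \max_{y \in \mathcal{Y}(x_i)}
      \big[\, w^\top f(x_i, y) + \ell(y_i, y) \,\big]
      \;-\; w^\top f(x_i, y_i) \Big)
```

The inner maximization is a linear program over the (relaxed) alignment polytope; replacing it by its dual minimization and merging the two minimizations yields a single convex QP of polynomial size.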

16 Experimental Setup. French-Canadian Hansards Corpus. Word-level aligned: 200 sentence pairs (training data), 37 sentence pairs (validation data), 247 sentence pairs (test data). Sentence-level aligned: 1M sentence pairs, used to generate association-based features and to learn unsupervised IBM Models. Learn using large margin. Evaluate alignment quality using the standard AER (Alignment Error Rate) [similar to F1].
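For reference, the standard AER definition (not spelled out on the slide), with S the sure gold links, P ⊇ S the possible gold links, and A the predicted links:

```latex
\mathrm{AER}(A; S, P) \;=\; 1 \;-\; \frac{|A \cap S| + |A \cap P|}{|A| + |S|}
```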

17 Old Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%.

18 Improving the basic model. We would like to model: Fertility: alignments are not necessarily 1-to-1. First-order interactions: alignments are mostly locally diagonal, so we would like to score a link depending on its neighbors. Strategy: extensions that keep the prediction model an LP.

19 Modeling Fertility. Allow a word to take part in more than one link, at the cost of a fertility penalty. Example of a node feature: for word w, the fraction of the time it had fertility > k on the training set.
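One plausible way to fold a fertility penalty into the LP, sketched in my notation as an assumption (the slide's exact construction did not survive the transcript):

```latex
\max_{y,\,z}\;\; \sum_{j,k} s_{jk}\, y_{jk} \;-\; \sum_{j} c_j\, z_j
\quad\text{s.t.}\quad
\sum_k y_{jk} \le 1 + z_j,\qquad
0 \le y_{jk} \le 1,\qquad
0 \le z_j \le F_{\max} - 1,
```

where z_j buys extra fertility for source word j at a feature-based penalty c_j (and symmetrically for target words).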

20 Fertility Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%.

21 Fertility Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%; + Model 4 + fertility: AER 4.9, Prec/Rec 96/94%.

22 Fertility example. [Figure: alignment matrix; legend: sure alignments, possible alignments, predicted alignments.]

23 Modeling First-Order Effects. Restrict the pairwise interactions to monotonicity, local inversion, and local fertility patterns between neighboring links; we want the resulting quadratic objective, and use its relaxation for prediction.

24 Integer program. The quadratic assignment problem is NP-complete; on real-world sentences (2 to 30 words) it takes a few seconds using Mosek (~1k variables). Interestingly, on our dataset 80% of examples yield an integer solution when solved via the linear relaxation, and using the relaxation gives the same AER!
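A hedged reconstruction of the quadratic assignment objective and its standard linearization, in my notation (the exact pairwise score set on the slide is an assumption):

```latex
\max_{y}\;\; \sum_{j,k} s_{jk}\, y_{jk}
  \;+\; \sum_{(j,k),(j',k')} s_{jk,j'k'}\; y_{jk}\, y_{j'k'},
\qquad y_{jk} \in \{0,1\},
```

linearized by introducing \(z_{jk,j'k'}\) with \(z \le y_{jk}\), \(z \le y_{j'k'}\), \(z \ge y_{jk} + y_{j'k'} - 1\), \(0 \le z \le 1\); the linear relaxation of this program is what turned out to be integral on 80% of the examples.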

25 New Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%.

26 New Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%; Basic + fertility + qap: AER 6.1, Prec/Rec 94/93%.

27 New Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%; Basic + fertility + qap: AER 6.1, Prec/Rec 94/93%; + fertility + qap + Model 4: AER 4.3, Prec/Rec 96/95%.

28 New Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%; Basic + fertility + qap: AER 6.1, Prec/Rec 94/93%; + fertility + qap + Model 4: AER 4.3, Prec/Rec 96/95%; + fertility + qap + Model 4 + Liang: AER 3.8, Prec/Rec 97/96%.

29 Fert + qap example

30

31 Conclusions. Feature-based word alignment: efficient algorithms for supervised learning; exploits unsupervised data via features and other models; surprisingly accurate with simple features. Including the fertility model and first-order interactions gives a 38% AER reduction over intersected Model 4 and the lowest published AER on this data set. High-recall alignments → promising for MT.

32 Structure on inputs: discLDA project (work in progress) (joint work with Fei Sha and Mike Jordan)

33 Unsupervised dimensionality reduction: text documents → latent variable model → new representation → classification.

34 Analogy: PCA vs. FDA. [Figure: a scatter of two classes (x's and o's) with the PCA direction and the FDA direction.]

35 Goal: supervised dimensionality reduction: text documents → latent variable model with supervised information → new representation → classification.

36 Review: LDA model
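Slide 36 is a figure of the LDA graphical model; for reference, the standard LDA generative process it depicts is:

```latex
\phi_k \sim \mathrm{Dir}(\beta), \qquad
\theta_d \sim \mathrm{Dir}(\alpha), \qquad
z_{dn} \mid \theta_d \sim \mathrm{Mult}(\theta_d), \qquad
w_{dn} \mid z_{dn} \sim \mathrm{Mult}(\phi_{z_{dn}}).
```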

37 Discriminative version of LDA. Ultimately, we want to learn the topics discriminatively → but that is a high-dimensional non-convex objective, hard to optimize! Instead, we propose to learn a class-dependent linear transformation of the common topic proportions θ. New generative model; equivalently, the transformation can be viewed as acting on the topic parameters.
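A hedged sketch of that class-dependent transformation, in my notation (the slide's exact parameterization did not survive the transcript): each class y owns a linear map T^y acting on the shared topic proportions,

```latex
\theta_d \sim \mathrm{Dir}(\alpha), \qquad
z_{dn} \mid \theta_d, y_d \sim \mathrm{Mult}\!\big(T^{y_d}\theta_d\big), \qquad
w_{dn} \mid z_{dn} \sim \mathrm{Mult}(\phi_{z_{dn}}),
```

so that \(p(w_{dn}\mid\theta_d, y_d) = \Phi\, T^{y_d}\theta_d\); the transformation can equivalently be absorbed into class-specific topic matrices \(\Phi\, T^{y}\).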

38 Simplex Geometry. [Figure: documents of the two classes plotted on the word simplex (w1, w2, w3) and on the topic simplex.]

39 Interpretation 1: shared topics vs. class-specific topics. The transformation splits the topic pool into topics shared across classes and topics specific to each class.

40 Interpretation 2: the generative model induced by T, obtained by adding a new latent variable u.

41 Comparison with the AT model. [Figure: the Author-Topic model [Rosen-Zvi et al. 2004] and discLDA graphical models side by side.]

42 Inference and learning

43 Learning. For fixed T, learn the topics by sampling (z, u) [Rao-Blackwellized Gibbs sampling]. For fixed topics, update T using stochastic gradient ascent on the conditional log-likelihood, in an online fashion: get an approximate gradient using Monte Carlo EM, and use the Harmonic Mean estimator to estimate the required likelihood. Currently, results are noisy…

44 Inference (dimensionality reduction). Given the learned T and topics: estimate the per-class document likelihood using the Harmonic Mean estimator, then compute the expected topic proportions by marginalizing over y to get the new representation of the document.

45 Preliminary Experiments

46 20 Newsgroups dataset. Used a fixed T: shared topics plus a block of class-specific topics per class, hence 110 topics in total. Get the reduced representation → train a linear SVM on it. 11k training documents, 7.5k test documents; vocabulary: 50k.
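A sketch of one fixed, block-structured T consistent with 110 topics for 20 classes; the particular split of 5 class-specific topics per class plus 10 shared topics is an assumption for illustration, not necessarily the one used in the experiments:

```python
# Hypothetical fixed T for discLDA on 20 Newsgroups: the reduced theta has
# (n_specific + n_shared) entries; T(y) copies the class-specific block into
# class y's own slice of the 110 topics and the shared block into the shared slice.
# Only the total of 110 topics comes from the slide; the 5 + 10 split is assumed.
import numpy as np

n_classes, n_specific, n_shared = 20, 5, 10
n_topics = n_classes * n_specific + n_shared          # 110 topics in total

def make_T(y):
    T = np.zeros((n_topics, n_specific + n_shared))
    # class-specific block: class y's rows, first n_specific columns
    T[y * n_specific:(y + 1) * n_specific, :n_specific] = np.eye(n_specific)
    # shared block: last n_shared rows, remaining columns
    T[-n_shared:, n_specific:] = np.eye(n_shared)
    return T                                          # column-stochastic 0/1 matrix

# A document's reduced representation T(y) @ theta lives on the 110-topic simplex.
theta = np.full(n_specific + n_shared, 1.0 / (n_specific + n_shared))
print(make_T(3) @ theta)
```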

47 Classification results: discLDA + SVM: 20% error; LDA + SVM: 25% error; discLDA predictions: 20% error.

48 Newsgroup embedding (LDA)

49 Newsgroup embedding (discLDA)

50 Using t-SNE (on discLDA); thanks to Laurens van der Maaten [Hinton's group] for the figure!

51 Using t-SNE (on LDA); thanks to Laurens van der Maaten [Hinton's group] for the figure!

52 Learned topics

53 Another embedding: NIPS papers vs. Psychology abstracts. [Figure: LDA embedding and discLDA embedding side by side.]

54 13-scenes dataset [Fei-Fei 2005]. Train: 100 images per category; test: 2558 images.

55 Vocabulary (visual words)

56 Topics

57 Conclusion. A fixed transformation T enables topic sharing and exploration; we get a reduced representation which preserves predictive power. Noisy gradient estimates make learning T still a work in progress; we will probably try a variational approach instead.

58

