
1 Discriminative Methods with Structure. Simon Lacoste-Julien, UC Berkeley. Joint work with: Fei Sha, Ben Taskar, Dan Klein, Mike Jordan. March 21, 2008.

2 « Discriminative method » Decision-theoretic framework: loss, decision function, risk, contrast function.
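The formulas on this slide did not survive the transcript; the following is a generic decision-theoretic setup in my own notation (an assumption, not the slide's exact equations):

```latex
% Loss, decision function, risk, and an empirical contrast function (surrogate).
\ell(y, \hat y) \quad\text{(loss)}, \qquad
h_w(x) = \arg\max_{y \in \mathcal{Y}} w^\top f(x, y) \quad\text{(decision function)},
\qquad
R(w) = \mathbb{E}_{(x,y)\sim P}\big[\ell\big(y, h_w(x)\big)\big]
\;\approx\; \frac{1}{n}\sum_{i=1}^{n} \tilde\ell(w; x_i, y_i),
```

where the contrast function \(\tilde\ell\) is a tractable surrogate for the loss, e.g. the log-loss or the structured hinge loss used later in the talk.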

3 « with structure » on outputs: Handwriting recognition: input (an image of a handwritten word) → output ("brace"); the output space is huge! Machine translation: "Ce n'est pas un autre problème de classification." → "This is not another classification problem."

4 « with structure » on inputs: text documents → latent variable model → new representation → classification.

5 Structure on outputs: Discriminative Word Alignment project (joint work with Ben Taskar, Dan Klein and Mike Jordan)

6 Word Alignment. x: "What is the anticipated cost of collecting fees under the new proposal?" y: "En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?" [Figure: word-to-word alignment between the tokenized sentences "What is the anticipated cost of collecting fees under the new proposal ?" and "En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?", with positions indexed j and k.] Key step in most machine translation systems.

7 Overview: review of large-margin word alignment [Taskar et al., EMNLP 05]; two new extensions to the basic model: fertility features, and first-order interactions using quadratic assignment; results on the Hansards dataset.

8 Feature-Based Alignment. Features for a candidate link (j, k): Association (MI = 3.2, Dice = 4.1); Lexical pair (ID(proposal, proposition) = 1); Position in sentence (AbsDist = 5, RelDist = 0.3); Orthography (ExactMatch = 0, Similarity = 0.8); Resources (PairInDictionary); Other models (IBM2, IBM4). [Figure: the example sentence pair with the candidate link (j, k) highlighted.]
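A minimal sketch of what such a per-link feature vector might look like; the feature names mirror the slide, but the exact definitions, tables, and helper arguments are illustrative assumptions rather than the paper's implementation:

```python
# Hypothetical per-link features for a candidate alignment link (j, k).
def link_features(src_tokens, tgt_tokens, j, k, mi_table, dice_table, dictionary):
    s, t = src_tokens[j], tgt_tokens[k]
    return {
        "MI": mi_table.get((s, t), 0.0),                  # mutual-information association score
        "Dice": dice_table.get((s, t), 0.0),              # Dice-coefficient association score
        f"ID({s},{t})": 1.0,                              # indicator for this specific word pair
        "AbsDist": abs(j - k),                            # absolute position difference
        "RelDist": abs(j / len(src_tokens) - k / len(tgt_tokens)),  # relative position difference
        "ExactMatch": float(s.lower() == t.lower()),      # orthographic identity
        "PairInDictionary": float((s, t) in dictionary),  # external bilingual dictionary resource
    }

# Example: features for linking English position 3 to French position 3.
feats = link_features("what is the cost".split(), "quel est le coût".split(), 3, 3, {}, {}, set())
```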

9 Scoring Whole Alignments: the score of an alignment is the sum of the scores of its individual links (j, k). [Figure: the example sentence pair with all links scored.]

10 Prediction as a Linear Program. Maximize the total link score subject to degree constraints (each word takes part in at most one link), relaxing each y_jk from {0, 1} to the interval [0, 1]; the relaxation is still guaranteed to have integral solutions. [Figure: the example sentence pair with link variables indexed by j, k.]
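A sketch of this LP relaxation, written with scipy for concreteness; this is an illustrative formulation of the matching LP, not the paper's code (which used Mosek):

```python
# Alignment prediction as an LP: maximize sum_jk scores[j,k] * y_jk
# subject to degree constraints, with y_jk relaxed to [0, 1].
import numpy as np
from scipy.optimize import linprog

def predict_alignment(scores):
    """scores: (m, n) array with scores[j, k] = w . f(x, j, k)."""
    m, n = scores.shape
    c = -scores.ravel()                       # linprog minimizes, so negate the scores
    A, b = [], []
    for j in range(m):                        # each source word in at most one link
        row = np.zeros(m * n)
        row[j * n:(j + 1) * n] = 1.0
        A.append(row); b.append(1.0)
    for k in range(n):                        # each target word in at most one link
        row = np.zeros(m * n)
        row[k::n] = 1.0
        A.append(row); b.append(1.0)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=(0, 1), method="highs")
    return res.x.reshape(m, n)                # integral (0/1) at the optimum for this matching polytope

# y = predict_alignment(np.random.randn(4, 5))
```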

11 Learning w. Supervised training data. Training methods: maximum likelihood/entropy, perceptron, maximum margin.

12 Maximum Likelihood/Entropy. Probabilistic approach: model the conditional probability of an alignment as proportional to its exponentiated score. Problem: the normalizing denominator (the sum over all alignments) is #P-complete [Valiant 79, Jerrum & Sinclair 93], so we can't find the maximum likelihood parameters.

13 (Averaged) Perceptron. Perceptron for structured outputs [Collins 2002]: for each example, predict the highest-scoring alignment under the current weights, then update the weights toward the features of the true alignment and away from those of the prediction; output the averaged parameters.
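A minimal sketch of the averaged structured perceptron; `phi` (the joint feature map) and `predict` (the decoder, e.g. the LP above) are placeholders the caller supplies, not names from the paper:

```python
# Averaged structured perceptron [Collins 2002], generic over the decoder.
import numpy as np

def averaged_perceptron(data, phi, predict, dim, epochs=5):
    """data: list of (x, y_true); phi(x, y) -> np.array of length dim;
    predict(x, w) -> argmax_y of w . phi(x, y)."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    t = 0
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = predict(x, w)                      # decode with current weights
            if y_hat != y_true:
                w += phi(x, y_true) - phi(x, y_hat)    # move toward truth, away from prediction
            w_sum += w                                 # accumulate for averaging
            t += 1
    return w_sum / t                                   # averaged parameters
```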

14 Large Margin Estimation. Require the true alignment's score to beat every other alignment's score by a margin that grows with that alignment's loss; this has an equivalent min-max formulation [Taskar et al. 04, 05] whose inner maximization is a simple LP.

15 Min-max formulation → QP. Replacing the inner LP by its dual (LP duality) turns the saddle-point problem into a single QP of polynomial size! Solved with Mosek.
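The equations on slides 14–15 were lost in the transcript; the following is a hedged reconstruction of the standard min-max large-margin objective in my notation:

```latex
\min_w \;\frac{\lambda}{2}\|w\|^2
  + \sum_i \Big( \max_{y \in \mathcal{Y}(x_i)}
      \big[\, w^\top f(x_i, y) + \ell(y_i, y) \,\big]
      \;-\; w^\top f(x_i, y_i) \Big)
```

The inner maximization is a linear program over the (relaxed) alignment polytope; replacing it by its dual minimization and merging the two minimizations yields a single convex QP of polynomial size.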

16 Experimental Setup. French-Canadian Hansards Corpus. Word-level aligned: 200 sentence pairs (training data), 37 sentence pairs (validation data), 247 sentence pairs (test data). Sentence-level aligned: 1M sentence pairs, used to generate association-based features and to learn unsupervised IBM Models. Learn using large margin. Evaluate alignment quality using the standard AER (Alignment Error Rate) [similar to F1].
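For reference, the standard AER definition (not spelled out on the slide), with S the sure gold links, P ⊇ S the possible gold links, and A the predicted links:

```latex
\mathrm{AER}(A; S, P) \;=\; 1 \;-\; \frac{|A \cap S| + |A \cap P|}{|A| + |S|}
```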

17 Old Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%.

18 Improving the basic model. We would like to model: Fertility: alignments are not necessarily 1-to-1. First-order interactions: alignments are mostly locally diagonal, so we would like to score a link depending on its neighbors. Strategy: extensions that keep the prediction model an LP.

19 Modeling Fertility. Allow a word to take part in more than one link, at the cost of a fertility penalty. Example of a node feature: for word w, the fraction of the time it had fertility > k on the training set.
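One plausible way to fold a fertility penalty into the LP, sketched in my notation as an assumption (the slide's exact construction did not survive the transcript):

```latex
\max_{y,\,z}\;\; \sum_{j,k} s_{jk}\, y_{jk} \;-\; \sum_{j} c_j\, z_j
\quad\text{s.t.}\quad
\sum_k y_{jk} \le 1 + z_j,\qquad
0 \le y_{jk} \le 1,\qquad
0 \le z_j \le F_{\max} - 1,
```

where z_j buys extra fertility for source word j at a feature-based penalty c_j (and symmetrically for target words).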

20 Fertility Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%.

21 Fertility Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%; + Model 4 + fertility: AER 4.9, Prec/Rec 96/94%.

22 Fertility example. [Figure: alignment matrix; legend: sure alignments, possible alignments, predicted alignments.]

23 Modeling First-Order Effects. Restrict the pairwise interactions to monotonicity, local inversion, and local fertility patterns between neighboring links; we want the resulting quadratic objective, and use its relaxation for prediction.

24 Integer program. The quadratic assignment problem is NP-complete; on real-world sentences (2 to 30 words) it takes a few seconds using Mosek (~1k variables). Interestingly, on our dataset 80% of examples yield an integer solution when solved via the linear relaxation, and using the relaxation gives the same AER!
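A hedged reconstruction of the quadratic assignment objective and its standard linearization, in my notation (the exact pairwise score set on the slide is an assumption):

```latex
\max_{y}\;\; \sum_{j,k} s_{jk}\, y_{jk}
  \;+\; \sum_{(j,k),(j',k')} s_{jk,j'k'}\; y_{jk}\, y_{j'k'},
\qquad y_{jk} \in \{0,1\},
```

linearized by introducing \(z_{jk,j'k'}\) with \(z \le y_{jk}\), \(z \le y_{j'k'}\), \(z \ge y_{jk} + y_{j'k'} - 1\), \(0 \le z \le 1\); the linear relaxation of this program is what turned out to be integral on 80% of the examples.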

25 New Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%.

26 New Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%; Basic + fertility + qap: AER 6.1, Prec/Rec 94/93%.

27 New Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%; Basic + fertility + qap: AER 6.1, Prec/Rec 94/93%; + fertility + qap + Model 4: AER 4.3, Prec/Rec 96/95%.

28 New Results (200 train / 247 test split): IBM Model 4 (intersected): AER 6.5, Prec/Rec 98/88%; Basic: AER 8.2, Prec/Rec 93/90%; + Model 4: AER 5.1, Prec/Rec 98/92%; Basic + fertility + qap: AER 6.1, Prec/Rec 94/93%; + fertility + qap + Model 4: AER 4.3, Prec/Rec 96/95%; + fertility + qap + Model 4 + Liang: AER 3.8, Prec/Rec 97/96%.

29 Fert + qap example

30

31 Conclusions. Feature-based word alignment: efficient algorithms for supervised learning; exploits unsupervised data via features and other models; surprisingly accurate with simple features. Including the fertility model and first-order interactions gives a 38% AER reduction over intersected Model 4 and the lowest published AER on this data set. High-recall alignments → promising for MT.

32 Structure on inputs: discLDA project (work in progress) (joint work with Fei Sha and Mike Jordan)

33 Unsupervised dimensionality reduction: text documents → latent variable model → new representation → classification.

34 Analogy: PCA vs. FDA. [Figure: a scatter of two classes (x's and o's) with the PCA direction and the FDA direction.]

35 Goal: supervised dimensionality reduction: text documents → latent variable model with supervised information → new representation → classification.

36 Review: LDA model
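Slide 36 is a figure of the LDA graphical model; for reference, the standard LDA generative process it depicts is:

```latex
\phi_k \sim \mathrm{Dir}(\beta), \qquad
\theta_d \sim \mathrm{Dir}(\alpha), \qquad
z_{dn} \mid \theta_d \sim \mathrm{Mult}(\theta_d), \qquad
w_{dn} \mid z_{dn} \sim \mathrm{Mult}(\phi_{z_{dn}}).
```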

37 Discriminative version of LDA. Ultimately, we want to learn the topics discriminatively → but that is a high-dimensional non-convex objective, hard to optimize! Instead, we propose to learn a class-dependent linear transformation of the common topic proportions θ. New generative model; equivalently, the transformation can be viewed as acting on the topic parameters.
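A hedged sketch of that class-dependent transformation, in my notation (the slide's exact parameterization did not survive the transcript): each class y owns a linear map T^y acting on the shared topic proportions,

```latex
\theta_d \sim \mathrm{Dir}(\alpha), \qquad
z_{dn} \mid \theta_d, y_d \sim \mathrm{Mult}\!\big(T^{y_d}\theta_d\big), \qquad
w_{dn} \mid z_{dn} \sim \mathrm{Mult}(\phi_{z_{dn}}),
```

so that \(p(w_{dn}\mid\theta_d, y_d) = \Phi\, T^{y_d}\theta_d\); the transformation can equivalently be absorbed into class-specific topic matrices \(\Phi\, T^{y}\).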

38 Simplex Geometry. [Figure: documents of the two classes plotted on the word simplex (w1, w2, w3) and on the topic simplex.]

39 Interpretation 1: shared topics vs. class-specific topics. The transformation splits the topic pool into topics shared across classes and topics specific to each class.

40 Interpretation 2: the generative model induced by T, obtained by adding a new latent variable u.

41 Comparison with the AT model. [Figure: the Author-Topic model [Rosen-Zvi et al. 2004] and discLDA graphical models side by side.]

42 Inference and learning

43 Learning. For fixed T, learn the topics by sampling (z, u) [Rao-Blackwellized Gibbs sampling]. For fixed topics, update T using stochastic gradient ascent on the conditional log-likelihood, in an online fashion: get an approximate gradient using Monte Carlo EM, and use the Harmonic Mean estimator to estimate the required likelihood. Currently, results are noisy…

44 Inference (dimensionality reduction). Given the learned T and topics: estimate the per-class document likelihood using the Harmonic Mean estimator, then compute the expected topic proportions by marginalizing over y to get the new representation of the document.

45 Preliminary Experiments

46 20 Newsgroups dataset. Used a fixed T: shared topics plus a block of class-specific topics per class, hence 110 topics in total. Get the reduced representation → train a linear SVM on it. 11k training documents, 7.5k test documents; vocabulary: 50k.
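A sketch of one fixed, block-structured T consistent with 110 topics for 20 classes; the particular split of 5 class-specific topics per class plus 10 shared topics is an assumption for illustration, not necessarily the one used in the experiments:

```python
# Hypothetical fixed T for discLDA on 20 Newsgroups: the reduced theta has
# (n_specific + n_shared) entries; T(y) copies the class-specific block into
# class y's own slice of the 110 topics and the shared block into the shared slice.
# Only the total of 110 topics comes from the slide; the 5 + 10 split is assumed.
import numpy as np

n_classes, n_specific, n_shared = 20, 5, 10
n_topics = n_classes * n_specific + n_shared          # 110 topics in total

def make_T(y):
    T = np.zeros((n_topics, n_specific + n_shared))
    # class-specific block: class y's rows, first n_specific columns
    T[y * n_specific:(y + 1) * n_specific, :n_specific] = np.eye(n_specific)
    # shared block: last n_shared rows, remaining columns
    T[-n_shared:, n_specific:] = np.eye(n_shared)
    return T                                          # column-stochastic 0/1 matrix

# A document's reduced representation T(y) @ theta lives on the 110-topic simplex.
theta = np.full(n_specific + n_shared, 1.0 / (n_specific + n_shared))
print(make_T(3) @ theta)
```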

47 Classification results: discLDA + SVM: 20% error; LDA + SVM: 25% error; discLDA predictions: 20% error.

48 Newsgroup embedding (LDA)

49 Newsgroup embedding (discLDA)

50 Using t-SNE (on discLDA); thanks to Laurens van der Maaten [Hinton's group] for the figure!

51 Using t-SNE (on LDA); thanks to Laurens van der Maaten [Hinton's group] for the figure!

52 Learned topics

53 Another embedding: NIPS papers vs. Psychology abstracts. [Figure: LDA embedding and discLDA embedding side by side.]

54 13-scenes dataset [Fei-Fei 2005]. Train: 100 images per category; test: 2558 images.

55 Vocabulary (visual words)

56 Topics

57 Conclusion. A fixed transformation T enables topic sharing and exploration; we get a reduced representation which preserves predictive power. Noisy gradient estimates make learning T still a work in progress; we will probably try a variational approach instead.

58

