Discriminative Methods with Structure
Simon Lacoste-Julien, UC Berkeley
March 21, 2008
Joint work with: Fei Sha, Ben Taskar, Dan Klein, Mike Jordan


« Discriminative method » Decision-theoretic framework. Loss: a function measuring the cost of a prediction against the true output. Decision function: a parameterized map from inputs to outputs. Risk: the expected loss under the data distribution. Contrast function: the tractable surrogate objective actually minimized on the training data.
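
In standard notation (a generic sketch, not the slide's exact formulas):

```latex
\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}
  \quad \text{(loss)} \\
h_w(x) = \arg\max_{y \in \mathcal{Y}} \; w^\top f(x, y)
  \quad \text{(decision function)} \\
R(w) = \mathbb{E}_{(x,y) \sim P}\!\left[ \ell\big(y, h_w(x)\big) \right]
  \quad \text{(risk)} \\
\hat{R}(w) = \frac{1}{n} \sum_{i=1}^{n} \phi\big(w; x_i, y_i\big)
  \quad \text{(empirical contrast function, e.g.\ hinge or log loss)}
```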

« with structure » on outputs. Handwriting recognition: the input is a sequence of character images, the output is the word they spell (e.g. "brace"); the space of possible outputs is huge! Machine translation: the input is a sentence and the output is its translation, e.g. "Ce n'est pas un autre problème de classification." → "This is not another classification problem."

« with structure » on inputs: text documents → latent variable model → new representation → classification.

Structure on outputs: Discriminative Word Alignment project (joint work with Ben Taskar, Dan Klein and Mike Jordan)

Word Alignment. Given a sentence pair x, predict the set of alignment links y between its words. Example: "What is the anticipated cost of collecting fees under the new proposal?" / "En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?" Word alignment is a key step in most machine translation systems.

Overview:
- Review of large-margin word alignment [Taskar et al., EMNLP 2005]
- Two new extensions to the basic model: fertility features; first-order interactions using quadratic assignment
- Results on the Hansards dataset

Feature-Based Alignment. Each candidate pair (j, k) of an English word j and a French word k gets a feature vector, e.g.: association scores (MI = 3.2, Dice = 4.1); lexical pair identity (ID(proposal, proposition) = 1); position in sentence (AbsDist = 5, RelDist = 0.3); orthography (ExactMatch = 0, Similarity = 0.8); resources (PairInDictionary); predictions of other models (IBM2, IBM4).
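
A minimal sketch of extracting such pairwise features; the association tables and dictionary are hypothetical inputs, not the talk's actual resources:

```python
from difflib import SequenceMatcher

def pair_features(en_words, fr_words, j, k, mi, dice, dictionary):
    """Feature vector for aligning English word j to French word k.

    mi, dice: dicts mapping (en_word, fr_word) -> association score
    dictionary: set of (en_word, fr_word) translation pairs
    """
    e, f = en_words[j], fr_words[k]
    return {
        # Association scores estimated from sentence-aligned data.
        "MI": mi.get((e, f), 0.0),
        "Dice": dice.get((e, f), 0.0),
        # Lexical pair identity (sparse indicator feature).
        f"ID({e},{f})": 1.0,
        # Position in sentence.
        "AbsDist": abs(j - k),
        "RelDist": abs(j / len(en_words) - k / len(fr_words)),
        # Orthography.
        "ExactMatch": float(e.lower() == f.lower()),
        "Similarity": SequenceMatcher(None, e.lower(), f.lower()).ratio(),
        # External resources.
        "PairInDictionary": float((e, f) in dictionary),
    }
```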

Scoring Whole Alignments. The score of a full alignment is the sum of the scores of its edges: $s(x, y) = \sum_{j,k} \big(w^\top f(x, j, k)\big)\, y_{jk}$, where $y_{jk} \in \{0, 1\}$ indicates whether English word j is aligned to French word k.

Prediction as a Linear Program. Relax the integrality constraint $y_{jk} \in \{0,1\}$ to $0 \le y_{jk} \le 1$, keeping the degree constraints $\sum_k y_{jk} \le 1$ and $\sum_j y_{jk} \le 1$. Because the constraint matrix of the bipartite matching polytope is totally unimodular, the relaxation is still guaranteed to have integral solutions.
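
A minimal sketch of this matching LP with scipy (the talk used Mosek; this is an illustrative substitute), assuming a precomputed score matrix:

```python
import numpy as np
from scipy.optimize import linprog

def predict_alignment(scores):
    """Solve the matching LP: max sum_jk scores[j,k] * y[j,k]
    s.t. each row/column of y sums to at most 1, 0 <= y <= 1."""
    m, n = scores.shape
    c = -scores.ravel()                      # linprog minimizes, so negate
    A_ub, b_ub = [], []
    for j in range(m):                       # degree constraint per English word
        row = np.zeros(m * n)
        row[j * n:(j + 1) * n] = 1.0
        A_ub.append(row); b_ub.append(1.0)
    for k in range(n):                       # degree constraint per French word
        col = np.zeros(m * n)
        col[k::n] = 1.0
        A_ub.append(col); b_ub.append(1.0)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=(0, 1))
    y = res.x.reshape(m, n)
    return y > 0.5                           # integral at optimum for matchings
```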

Learning w. Given supervised training data (sentence pairs with gold alignments), candidate training methods: maximum likelihood/entropy, perceptron, maximum margin.

Maximum Likelihood/Entropy. Probabilistic approach: $p_w(y \mid x) \propto \exp(w^\top f(x, y))$. Problem: the normalizing denominator sums over all matchings, which amounts to computing a matrix permanent and is #P-complete [Valiant 1979; Jerrum & Sinclair 1993]. So we can't find the maximum likelihood parameters.

(Averaged) Perceptron. Perceptron for structured output [Collins 2002]. For each example $(x_i, y_i)$: Predict $\hat{y} = \arg\max_y w^\top f(x_i, y)$ (here, by solving the matching LP). Update $w \leftarrow w + f(x_i, y_i) - f(x_i, \hat{y})$. Output the averaged parameters $\bar{w} = \frac{1}{T}\sum_t w_t$.
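
A compact sketch of the averaged structured perceptron; Phi and predict are illustrative callables (e.g. the feature map and the matching-LP decoder from the previous slides):

```python
import numpy as np

def averaged_perceptron(examples, Phi, predict, dim, epochs=10):
    """Structured perceptron with parameter averaging [Collins 2002].

    examples: list of (x, y_gold) pairs
    Phi:      feature map, Phi(x, y) -> np.ndarray of shape (dim,)
    predict:  argmax decoder, predict(x, w) -> y_hat
    """
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    t = 0
    for _ in range(epochs):
        for x, y_gold in examples:
            y_hat = predict(x, w)                 # decode with current weights
            if not np.array_equal(y_hat, y_gold): # mistake-driven update
                w += Phi(x, y_gold) - Phi(x, y_hat)
            w_sum += w
            t += 1
    return w_sum / t                              # averaged parameters
```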

Large Margin Estimation. Equivalent min-max formulation [Taskar et al. 2004, 2005]: minimize $\frac{1}{2}\|w\|^2$ subject to the true score beating every other score by a margin that grows with the loss, $w^\top f(x_i, y_i) \ge \max_y \big[ w^\top f(x_i, y) + \ell(y_i, y) \big]$. The inner maximization is a simple LP.
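
Written out as the equivalent unconstrained (structured hinge) objective, in assumed notation:

```latex
\min_{w}\; \frac{\lambda}{2}\,\|w\|^2 \;+\; \sum_{i}\Big(
  \max_{y}\big[\, w^\top f(x_i, y) + \ell(y_i, y) \,\big]
  \;-\; w^\top f(x_i, y_i) \Big)
```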

Min-max formulation → QP. Taking the LP dual of the inner (loss-augmented) maximization and plugging it back into the min-max problem yields a single QP of polynomial size, which can be handed to an off-the-shelf solver (Mosek).

Experimental Setup. Canadian Hansards corpus (French-English).
Word-level aligned: 200 sentence pairs (training), 37 sentence pairs (validation), 247 sentence pairs (test).
Sentence-level aligned: 1M sentence pairs, used to generate association-based features and to learn unsupervised IBM models.
Learn using large margin; evaluate alignment quality with the standard AER (Alignment Error Rate) [similar to F1].

Old Results (200 train / 247 test split):

                            AER    Prec / Rec
IBM Model 4 (intersected)   6.5    98 / 88%
Basic                       8.2    93 / 90%
+ model 4                    ?      ? / 92%

Improving the basic model. We would like to model: Fertility: alignments are not necessarily 1-to-1. First-order interactions: alignments are mostly locally diagonal, so we would like the score of an edge $y_{jk}$ to depend on its neighbors. Strategy: extensions that keep the prediction model an LP.

Modeling Fertility. Relax the degree constraints so a word may align to more than one word, and charge a fertility penalty that grows with the degree. Example of a node feature: for word w, the fraction of the time it had fertility > k on the training set.
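
One way to encode this while keeping prediction an LP; a sketch in assumed notation, where $z_{jd}$ pays cost $c_{jd}$ to grant word $j$ fertility $d$ (the talk's exact parameterization may differ):

```latex
\max_{y,\, z}\;\; \sum_{j,k} s_{jk}\, y_{jk} \;-\; \sum_{j}\sum_{d=2}^{D} c_{jd}\, z_{jd}
\quad\text{s.t.}\quad
\sum_{k} y_{jk} \,\le\, 1 + \sum_{d=2}^{D} z_{jd}, \qquad
0 \le y_{jk} \le 1, \quad 0 \le z_{jd} \le 1
```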

Fertility Results (200 train / 247 test split):

                            AER    Prec / Rec
IBM Model 4 (intersected)   6.5    98 / 88%
Basic                       8.2    93 / 90%
+ model 4                    ?      ? / 92%
+ model 4 + fertility       4.9    96 / 94%

Fertility example. [Figure: alignment grid; legend: sure alignments, possible alignments, predicted alignments.]

Modeling First-Order Effects. Add quadratic terms that score pairs of neighboring edges, restricted to three patterns: monotonicity, local inversion, and local fertility. We want terms of the form $y_{jk}\, y_{lm}$ for neighboring pairs; to stay tractable, we use a linear relaxation of these products.
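
The products $y_{jk}\,y_{lm}$ can be handled with the standard linearization of a product of binary variables (a generic construction, assumed rather than quoted from the talk), introducing an auxiliary variable $z_{jk,lm}$:

```latex
z_{jk,lm} \;\le\; y_{jk}, \qquad
z_{jk,lm} \;\le\; y_{lm}, \qquad
z_{jk,lm} \;\ge\; y_{jk} + y_{lm} - 1, \qquad
z_{jk,lm} \;\ge\; 0
```

At integral points these constraints force $z_{jk,lm} = y_{jk}\, y_{lm}$.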

Integer program. The resulting quadratic assignment problem is NP-complete in general, but on real-world sentences (2 to 30 words) the integer program takes only a few seconds with Mosek (~1k variables). Interestingly, on our dataset 80% of the examples yield an integral solution when solved via the linear relaxation, and using the relaxation gives the same AER!

New Results (200 train / 247 test split):

                                      AER    Prec / Rec
IBM Model 4 (intersected)             6.5    98 / 88%
Basic                                 8.2    93 / 90%
+ model 4                              ?      ? / 92%
Basic + fertility + qap               6.1    94 / 93%
+ fertility + qap + model 4            ?      ? / 95%
+ fertility + qap + model 4 + liang   3.8    97 / 96%

Fertility + QAP example. [Figure: predicted alignment.]

Conclusions:
- Feature-based word alignment, with efficient algorithms for supervised learning
- Exploits unsupervised data via features and other models
- Surprisingly accurate with simple features
- Adding the fertility model and first-order interactions gives a 38% AER reduction over intersected Model 4, the lowest published AER on this dataset
- High-recall alignments -> promising for MT

Structure on inputs: the discLDA project (work in progress, joint with Fei Sha and Mike Jordan)

Unsupervised dimensionality reduction: text documents → latent variable model → new representation → classification.

Analogy: PCA vs. FDA. [Figure: two classes of points (x and o); the PCA direction captures maximum variance and ignores the labels, while the FDA direction separates the two classes.]

Goal: supervised dimensionality reduction: text documents → latent variable model with supervised information → new representation → classification.

Review: the LDA model. [Figure: LDA plate diagram.]
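
For reference, the LDA generative process in standard notation (a sketch, not transcribed from the slide):

```latex
\theta_d \sim \mathrm{Dirichlet}(\alpha)
  \quad \text{(topic proportions for document } d\text{)} \\
z_{dn} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d)
  \quad \text{(topic of word } n\text{)} \\
w_{dn} \mid z_{dn}, \Phi \sim \mathrm{Multinomial}(\phi_{z_{dn}})
  \quad \text{(observed word)}
```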

Discriminative version of LDA. Ultimately we would want to learn the topics discriminatively, but that is a high-dimensional non-convex objective, hard to optimize! Instead, we propose to learn a class-dependent linear transformation $T^y$ of the common topic proportions $\theta$. New generative model: $z_{dn} \sim \mathrm{Mult}(T^{y_d}\,\theta_d)$. Equivalently, $T^y$ can be viewed as a transformation of the topic matrix $\Phi$.
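
A sketch of the modified generative step, with dimensions assumed (each $T^y$ is a linear map sending the reduced topic proportions $\theta$ into the full topic simplex):

```latex
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
z_{dn} \mid \theta_d, y_d \sim \mathrm{Multinomial}\!\big(T^{y_d}\,\theta_d\big), \qquad
w_{dn} \mid z_{dn}, \Phi \sim \mathrm{Multinomial}\!\big(\phi_{z_{dn}}\big)
```

Since $T^{y}$ enters only through the product with $\theta$, it can equivalently be folded into the topic matrix $\Phi$.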

Simplex Geometry. [Figure: the topic simplex embedded in the word simplex (vertices w1, w2, w3); the two classes (x and o) occupy different regions of the topic simplex.]

Interpretation 1. With a suitably structured T, the topics split into shared topics (used by every class) and class-specific topics.

Interpretation 2. Fold T into the generative model by adding a new latent variable u: draw $u_{dn} \sim \mathrm{Mult}(\theta_d)$, then draw the topic $z_{dn}$ from the column of $T^{y_d}$ selected by $u_{dn}$.

Comparison with the Author-Topic model [Rosen-Zvi et al. 2004]. [Figure: plate diagrams of the AT model and discLDA side by side.]

Inference and learning

Learning. For fixed T, learn $\Phi$ by sampling (z, u) [Rao-Blackwellized Gibbs sampling]. For fixed $\Phi$, update T by stochastic gradient ascent on the conditional log-likelihood $\log p(y \mid w)$, in an online fashion: get an approximate gradient using Monte Carlo EM, and use the harmonic mean estimator for the required marginal likelihoods. Currently, the results are noisy…
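
A schematic of this alternating scheme; sample_phi and cond_ll_grad are hypothetical callables standing in for the Gibbs sampler and the Monte Carlo EM gradient named above:

```python
import numpy as np

def learn_disclda(docs, labels, T, sample_phi, cond_ll_grad,
                  n_outer=50, n_sgd=100, eta0=0.1):
    """Alternating learning scheme for discLDA (schematic).

    sample_phi:   callable (docs, labels, T) -> Phi, e.g. a
                  Rao-Blackwellized Gibbs sampler over (z, u)
    cond_ll_grad: callable (doc, label, T, Phi) -> gradient of
                  log p(y | w) w.r.t. T, e.g. via Monte Carlo EM
    """
    rng = np.random.default_rng(0)
    phi = None
    for _ in range(n_outer):
        # (1) For fixed T: estimate Phi by sampling (z, u).
        phi = sample_phi(docs, labels, T)
        # (2) For fixed Phi: online stochastic gradient ascent on
        #     the conditional log-likelihood log p(y | w).
        for step in range(n_sgd):
            d = rng.integers(len(docs))
            T += eta0 / (1 + step) * cond_ll_grad(docs[d], labels[d], T, phi)
    return T, phi
```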

Inference (dimensionality reduction). Given the learned T and $\Phi$: estimate the likelihood of the document under each class with the harmonic mean estimator, then compute the expected topic proportions by marginalizing over y, to get the new representation of the document.

Preliminary Experiments

20 Newsgroups dataset. Used a fixed T combining class-specific and shared topics, hence 110 topics in total. 11k training documents, 7.5k test documents, vocabulary of 50k words. Get the reduced representation, then train a linear SVM on it.
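
One plausible way to build such a fixed block-structured T in numpy; the split of 5 class-specific topics per class plus 10 shared (20 × 5 + 10 = 110) is an assumption consistent with the slide's 110-topic total:

```python
import numpy as np

def build_fixed_T(n_classes=20, k_specific=5, k_shared=10):
    """Block-structured transformation: class y uses its own block of
    k_specific topics plus k_shared topics common to all classes.
    Each T[y] maps a reduced theta (k_specific + k_shared entries)
    onto the full set of n_classes * k_specific + k_shared topics."""
    k_total = n_classes * k_specific + k_shared   # 20*5 + 10 = 110 topics
    k_reduced = k_specific + k_shared
    T = np.zeros((n_classes, k_total, k_reduced))
    for y in range(n_classes):
        # Class-specific block: identity on this class's own topics.
        T[y, y * k_specific:(y + 1) * k_specific, :k_specific] = np.eye(k_specific)
        # Shared block: identity on the trailing shared topics.
        T[y, n_classes * k_specific:, k_specific:] = np.eye(k_shared)
    return T
```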

Classification results discLDA + SVM: 20% error LDA + SVM: 25% error discLDA predictions: 20% error

Newsgroup embedding (LDA)

Newsgroup embedding (discLDA)

using t-SNE (on discLDA); thanks to Laurens van der Maaten [Hinton's group] for the figure!

using t-SNE (on LDA); thanks to Laurens van der Maaten [Hinton's group] for the figure!

Learned topics

Another embedding: NIPS papers vs. Psychology abstracts. [Figure: 2D embeddings under LDA and under discLDA.]

13 scenes dataset [Fei-Fei 2005]. Train: 100 images per category; test: 2,558 images.

Vocabulary (visual words)

Topics

Conclusion. The fixed transformation T enables topic sharing and exploration, and yields a reduced representation that preserves predictive power. The noisy gradient estimates are still a work in progress; we will probably try a variational approach instead.