Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction
Ming-Wei Chang and Scott Wen-tau Yih, Microsoft Research

Motivation
- Many NLP tasks are structured: parsing, coreference, chunking, SRL, summarization, machine translation, entity linking, ...
- Inference is required: find the structure with the best score according to the model.
- Goal: a better/faster linear structured learning algorithm, using Structural SVM.
- What can be done for the perceptron?

Two key parts of Structured Prediction
- Common training procedure (algorithm perspective): alternate between an inference step and an update step.
- Perceptron: the inference and update procedures are coupled.
- Inference is expensive, but we use each inference result only once, in a single fixed update step.
(Figure: Inference -> Structure -> Update loop.)

Observations
(Figure: Inference / Update / Structure diagram.)

Observations
- The inference and update procedures can be decoupled if we cache inference results (structures).
- Advantage: better balance between the two (e.g., more updating, less inference).
- Need to do this carefully...
  - We still need inference at test time.
  - Need to control the algorithm so that it converges.
(Figure: Infer / Update loop with a cache of structures.)

Questions
- Can we guarantee the convergence of the algorithm?
- Can we control the cache such that it does not grow too large?
- Is the balanced approach better than the “coupled” one?
Yes!

Contributions
- We propose Dual Coordinate Descent (DCD) algorithms for the L2-loss Structural SVM (most prior work solves the L1-loss SSVM).
- DCD decouples the inference and update procedures: easy to implement, and enables “inference-less” learning.
- Results:
  - Competitive with online learning algorithms; guaranteed to converge.
  - [Optimization] DCD algorithms are faster than cutting plane / SGD.
  - Balance control makes the algorithm converge faster in practice.
- Myth: Structural SVM is slower than the perceptron.

Outline
- Structured SVM: background, dual formulation
- Dual Coordinate Descent algorithm; hybrid-style algorithm
- Experiments
- Other possibilities

Structured Learning
- Candidate output set: for each input x, the model scores every structure in the candidate output set and predicts the best one (sketched below).
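A minimal sketch of the standard linear formulation this slide refers to, using common notation (w for the weight vector, phi for the joint feature map, Y(x) for the candidate output set); the slide's own notation may differ:

    \hat{y} \;=\; \arg\max_{y \in \mathcal{Y}(x)} \; \mathbf{w}^{\top} \phi(x, y)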

The Perceptron Algorithm
- For each example: infer the best structure under the current weights, then update the weights toward the gold structure and away from the prediction.
(Figure: Infer -> Update loop; the update uses the gold structure and the prediction.)
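For contrast with what follows, a minimal structured-perceptron sketch in Python; features(x, y) (joint feature vector) and argmax_inference(w, x) are hypothetical helpers, not code from the talk:

    import numpy as np

    def structured_perceptron(examples, num_features, epochs=5):
        """examples: list of (x, y_gold) pairs; inference runs once per update."""
        w = np.zeros(num_features)
        for _ in range(epochs):
            for x, y_gold in examples:
                y_hat = argmax_inference(w, x)   # expensive inference step
                if y_hat != y_gold:              # coupled: the result is used once, then discarded
                    w += features(x, y_gold) - features(x, y_hat)
        return w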

Structural SVM
- Objective function: regularized loss over all candidate structures (sketched below).
- Distance-augmented argmax: inference that also accounts for the loss, i.e., how wrong a prediction is.
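A reconstruction of the usual L2-loss structural SVM objective and the distance-augmented argmax, with Delta(y_i, y) the structured loss and Delta phi_i(y) = phi(x_i, y_i) - phi(x_i, y); the placement of constants such as C is an assumption and may differ from the slide:

    \min_{\mathbf{w},\,\xi}\;\; \tfrac{1}{2}\lVert \mathbf{w}\rVert^{2} + C \sum_{i} \xi_{i}^{2}
    \quad \text{s.t.} \quad \mathbf{w}^{\top} \Delta\phi_{i}(y) \;\ge\; \Delta(y_{i}, y) - \xi_{i}
    \qquad \forall i,\; \forall y \in \mathcal{Y}(x_{i})

    % Distance-augmented argmax used during training:
    \hat{y}_{i} = \arg\max_{y \in \mathcal{Y}(x_{i})} \; \mathbf{w}^{\top}\phi(x_{i}, y) + \Delta(y_{i}, y)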

Dual formulation
- One dual variable per (example, candidate structure) pair (sketched below).
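A hedged reconstruction of the corresponding dual, with one variable alpha_{i,y} >= 0 per (example, candidate structure) pair; the scaling of the C term follows the primal written above and is an assumption:

    \max_{\alpha \ge 0}\;\;
    \sum_{i,y} \alpha_{i,y}\, \Delta(y_{i}, y)
    \;-\; \tfrac{1}{2}\Bigl\lVert \sum_{i,y} \alpha_{i,y}\, \Delta\phi_{i}(y) \Bigr\rVert^{2}
    \;-\; \frac{1}{4C} \sum_{i} \Bigl( \sum_{y} \alpha_{i,y} \Bigr)^{2}

    % Primal-dual relationship used below:
    \mathbf{w} \;=\; \sum_{i,y} \alpha_{i,y}\, \Delta\phi_{i}(y)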

Outline
- Structured SVM: background, dual formulation
- Dual Coordinate Descent algorithm; hybrid-style algorithm
- Experiments
- Other possibilities

Dual Coordinate Descent algorithm
- Repeatedly pick one dual variable, update it while keeping the others fixed, and keep the weight vector in sync (update rule sketched below).
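A sketch of the single-coordinate step implied by the dual above: fixing all other variables and maximizing over alpha_{i,y} alone gives a closed-form projected update (this is the standard derivation for the L2-loss case; the paper's exact constants may differ):

    \alpha_{i,y} \;\leftarrow\; \max\!\Bigl(0,\;
      \alpha_{i,y} + \frac{\Delta(y_{i}, y) - \mathbf{w}^{\top}\Delta\phi_{i}(y) - \tfrac{1}{2C}\sum_{y'} \alpha_{i,y'}}
                          {\lVert \Delta\phi_{i}(y) \rVert^{2} + \tfrac{1}{2C}} \Bigr)

    % After changing \alpha_{i,y} by \delta, keep the weight vector in sync:
    \mathbf{w} \;\leftarrow\; \mathbf{w} + \delta\, \Delta\phi_{i}(y)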

What is the role of the dual variables?
- Each dual variable corresponds to one candidate structure for one example; the weight vector is a weighted combination of the corresponding feature differences.

Problem: too many structures
- The number of candidate structures (and hence of dual variables) grows exponentially with the size of the input, so we cannot enumerate them all; we maintain only a small working set of active structures.
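A concrete illustration (not from the slides): for sequence labeling with L possible tags over a sentence of length n, the candidate set has

    \lvert \mathcal{Y}(x) \rvert = L^{n}, \qquad \text{e.g. } L = 45,\; n = 20 \;\Rightarrow\; 45^{20} \approx 10^{33},

so dual variables can only be instantiated lazily, for structures actually returned by inference.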

DCD-Light
- Loop: run distance-augmented inference, grow the working set, and update the weight vector (sketched below).
- Things to notice:
  - Uses distance-augmented inference.
  - No averaging (unlike the averaged perceptron).
  - We still update even if the inferred structure is correct.
  - The UpdateAll step (updating over all structures in the working set) is important.
(Figure: Infer -> update weight vector; grow working set.)
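A rough Python sketch of the loop this slide describes; loss_augmented_inference(w, x, y_gold) and dual_update(w, alpha, i, x, y_gold, y, C) are hypothetical helpers (the latter standing in for the closed-form coordinate step sketched earlier), structures are assumed hashable, and the bookkeeping is simplified relative to the paper:

    import numpy as np

    def dcd_light(examples, num_features, C=1.0, epochs=10):
        w = np.zeros(num_features)
        working_set = {i: [] for i in range(len(examples))}   # cached structures per example
        alpha = {}                                             # dual variable per (example, structure)
        for _ in range(epochs):
            for i, (x, y_gold) in enumerate(examples):
                # Inference step: distance-augmented argmax under the current weights.
                y_hat = loss_augmented_inference(w, x, y_gold)
                if y_hat not in working_set[i]:                # grow the working set
                    working_set[i].append(y_hat)
                    alpha[(i, y_hat)] = 0.0
                # UpdateAll: coordinate updates over every cached structure for example i,
                # not only the newly inferred one, even if the new structure is correct.
                for y in working_set[i]:
                    w = dual_update(w, alpha, i, x, y_gold, y, C)
        return w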

DCD-SSVM
- The hybrid: combines DCD-Light with inference-less learning, i.e., between inference passes it makes passes that only update the dual variables of cached structures, without running inference (sketched below).
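A sketch of how the hybrid might interleave the two phases, reusing the hypothetical helpers from the DCD-Light sketch; the number of inference-less rounds per pass (inner_rounds) is an illustrative knob, not a value from the paper:

    import numpy as np

    def dcd_ssvm(examples, num_features, C=1.0, epochs=10, inner_rounds=3):
        w = np.zeros(num_features)
        working_set = {i: [] for i in range(len(examples))}
        alpha = {}
        for _ in range(epochs):
            # Phase 1: inference pass (as in DCD-Light) -- expensive, grows the working sets.
            for i, (x, y_gold) in enumerate(examples):
                y_hat = loss_augmented_inference(w, x, y_gold)
                if y_hat not in working_set[i]:
                    working_set[i].append(y_hat)
                    alpha[(i, y_hat)] = 0.0
                for y in working_set[i]:
                    w = dual_update(w, alpha, i, x, y_gold, y, C)
            # Phase 2: inference-less learning -- cheap, repeatedly re-optimizes the dual
            # variables of the cached structures without calling inference at all.
            for _ in range(inner_rounds):
                for i, (x, y_gold) in enumerate(examples):
                    for y in working_set[i]:
                        w = dual_update(w, alpha, i, x, y_gold, y, C)
        return w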

Convergence Guarantee
- The algorithm is guaranteed to converge; see the paper for the formal statement.

Outline
- Structured SVM: background, dual formulation
- Dual Coordinate Descent algorithm; hybrid-style algorithm
- Experiments
- Other possibilities

Settings
- Data/algorithms: compared against Perceptron, MIRA, SGD, SVM-Struct, and FW-Struct on NER-MUC7, NER-CoNLL, WSJ-POS, and WSJ-DP.
- The parameter C is tuned on the development set.
- We also add caching and example permutation to Perceptron, MIRA, SGD, and FW-Struct; permutation is very important.
- Details in the paper.

Research Questions
- Is “balanced” a better strategy? Compare DCD-Light, DCD-SSVM, and the cutting plane method [Chang et al. 2010].
- How does DCD compare to other SSVM algorithms? Compare to SVM-struct [Joachims et al. 09] and FW-struct [Lacoste-Julien et al. 13].
- How does DCD compare to online learning algorithms? Compare to Perceptron [Collins 02], MIRA [Crammer 05], and SGD.

Compare L2-Loss SSVM algorithms
- All methods use the same inference code.
- [Optimization] DCD algorithms are faster than cutting plane methods (CPD).

Compare to SVM-Struct
- SVM-Struct is implemented in C, DCD in C#.
- Early iterations of SVM-Struct are not very stable.
- Early iterations of our algorithm are already good.

Compare Perceptron, MIRA, SGD
(Table: DCD vs. Perceptron (and MIRA, SGD) on NER-MUC7, NER-CoNLL, POS-WSJ, and DP-WSJ.)

Questions
- Can we guarantee the convergence of the algorithm?
- Can we control the cache such that it does not grow too large?
- Is the balanced approach better than the “coupled” one?
Yes!

Outline
- Structured SVM: background, dual formulation
- Dual Coordinate Descent algorithm; hybrid-style algorithm
- Experiments
- Other possibilities

Parallel DCD is faster than Parallel Perceptron
- With cache-buffering techniques, multi-core DCD can be much faster than multi-core Perceptron [Chang et al. 2013].
(Figure: N workers run inference in parallel while 1 worker runs the updates.)
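A purely schematic sketch of the division of labor the figure suggests (several inference workers feed a shared cache while one worker applies updates); the real multi-core system in [Chang et al. 2013] uses its own cache-buffering design, and loss_augmented_inference / dual_update_all are hypothetical helpers:

    import queue
    import threading

    def parallel_dcd_sketch(examples, w, n_workers=4, rounds=2):
        cache = queue.Queue()                  # shared buffer of (example index, structure)

        def inference_worker(chunk):
            for i, (x, y_gold) in chunk:
                cache.put((i, loss_augmented_inference(w, x, y_gold)))

        for _ in range(rounds):
            items = list(enumerate(examples))
            threads = [threading.Thread(target=inference_worker, args=(items[k::n_workers],))
                       for k in range(n_workers)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
            # Single update worker; for simplicity it runs after the inference threads finish,
            # whereas the real system overlaps updating with inference.
            while not cache.empty():
                i, y_hat = cache.get()
                x, y_gold = examples[i]
                w = dual_update_all(w, i, x, y_gold, y_hat)
        return w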

Conclusion
- We have proposed dual coordinate descent algorithms.
  - [Optimization] DCD algorithms are faster than cutting plane / SGD.
  - They decouple inference and learning.
- There is value in developing Structural SVMs.
  - We can design more elaborate algorithms.
  - Myth: Structural SVM is slower than the perceptron. Not necessarily; more comparisons need to be done.
- The hybrid approach is the best overall strategy.
  - Different strategies are needed for different datasets.
  - Other ways of caching results are possible.
Thanks!