Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou University of Manchester

Log-linear models in NLP
- Maximum entropy models: text classification (Nigam et al., 1999), history-based approaches (Ratnaparkhi, 1998)
- Conditional random fields: part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
- Structured prediction: parsing (Clark and Curran, 2004), semantic role labeling (Toutanova et al., 2005), etc.

Log-linear models
- Log-linear (a.k.a. maximum entropy) model:

      p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_i w_i f_i(x, y) \right), \qquad Z(x) = \sum_{y'} \exp\left( \sum_i w_i f_i(x, y') \right)

  where w_i is a weight, f_i(x, y) is a feature function, and Z(x) is the partition function.
- Training: maximize the conditional likelihood of the training data.

Regularization
- To avoid overfitting to the training data, penalize the weights of the features.
- L1 regularization: most of the weights become zero, producing sparse (compact) models and saving memory and storage.
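For reference, the regularized training objective can be written out as follows (a minimal sketch of the standard formulation; C denotes the regularization strength and N the number of training samples, both names assumed here):

    \mathcal{L}(\mathbf{w}) = \sum_{j=1}^{N} \log p(y_j \mid x_j; \mathbf{w}) - C \sum_i |w_i|

Maximizing this trades fit to the data against the L1 penalty, which is what drives many weights exactly to zero.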

Training log-linear models
- Numerical optimization methods: gradient descent (steepest descent or hill climbing), quasi-Newton methods (e.g. BFGS, OWL-QN), stochastic gradient descent (SGD), etc.
- Training can take several hours (or even days), depending on the complexity of the model, the size of the training data, etc.

Gradient Descent (Hill Climbing) [figure: full-gradient steps on the objective]

Stochastic Gradient Descent (SGD)
- Compute an approximate gradient using one training sample.
[figure: noisy SGD steps on the objective]

Stochastic Gradient Descent (SGD)
- The weight update procedure is very simple (similar to the Perceptron algorithm):

      w_i^{(t+1)} = w_i^{(t)} + \eta_t \frac{\partial}{\partial w_i} \left( \log p(y_t \mid x_t; \mathbf{w}) - \frac{C}{N} \sum_j |w_j| \right)

  where \eta_t is the learning rate.
- The L1 term \sum_j |w_j| is not differentiable at zero.

Using subgradients
- Weight update procedure:

      w_i^{(t+1)} = w_i^{(t)} + \eta_t \frac{\partial \log p(y_t \mid x_t; \mathbf{w})}{\partial w_i} - \eta_t \frac{C}{N} \, \mathrm{sign}\!\left(w_i^{(t)}\right)

  where \mathrm{sign}(0) may be taken as 0 (a valid subgradient of |w_i| at zero).

Using subgradients: problems
- The L1 penalty needs to be applied to all features, including the ones that are not used in the current sample.
- Few weights become zero as a result of training.
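For concreteness, a minimal Python sketch of this naive subgradient update, assuming dense numpy arrays and illustrative names (not the authors' code):

    import numpy as np

    def sgd_l1_subgradient_step(w, grad, eta, C, N):
        """Naive SGD step: apply the L1 subgradient to every weight.

        grad is the gradient of log p(y_t | x_t; w) for the current sample.
        Note that the penalty term touches all features, not only those
        active in the sample, and weights rarely land exactly on zero.
        """
        w += eta * (grad - (C / N) * np.sign(w))
        return w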

Clipping-at-zero approach
- Carpenter (2008)
- A special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)
- Enables lazy updates

Clipping-at-zero approach [figure: the two-step update, with the L1 penalty step clipped at zero]
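A minimal Python sketch of the clipping-at-zero update, again with dense numpy arrays and illustrative names (the published method applies the penalty lazily, only to features active in the current sample):

    import numpy as np

    def sgd_l1_clipping_step(w, grad, eta, C, N):
        """SGD step followed by an L1 penalty step that is clipped at zero.

        If the penalty would push a weight past zero, the weight is set to
        exactly zero instead of crossing it.
        """
        w += eta * grad                    # gradient step on the current sample
        penalty = eta * C / N
        pos = w > 0
        neg = w < 0
        w[pos] = np.maximum(0.0, w[pos] - penalty)
        w[neg] = np.minimum(0.0, w[neg] + penalty)
        return w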

Number of non-zero features

                            Text chunking   Named entity recognition   Part-of-speech tagging
Quasi-Newton                       18,109                     30,710                   50,870
SGD (Naive)                       455,651                  1,032,962                2,142,130
SGD (Clipping-at-zero)             87,792                    279,886                  323,199

Why it does not produce sparse models
- In SGD, the weights are not updated smoothly: each step uses a noisy, single-sample gradient, so a weight that should end up at zero keeps being pushed off it.
- The weight fails to become exactly zero, and the L1 penalty applied along the way is wasted.

Cumulative L1 penalty
- u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t : the absolute value of the total L1 penalty which should have been applied to each weight up to update k.
- q_i^{(k)} = \sum_{t=1}^{k} \left( w_i^{(t+1)} - w_i^{(t+1/2)} \right) : the total L1 penalty which has actually been applied to weight i.

Applying L1 with cumulative penalty
- Penalize each weight according to the difference between u_k and q_i^{(k-1)}:

      w_i^{(k+1/2)} = w_i^{(k)} + \eta_k \frac{\partial \log p(y_k \mid x_k; \mathbf{w})}{\partial w_i}

      \text{if } w_i^{(k+1/2)} > 0:\quad w_i^{(k+1)} = \max\left(0,\; w_i^{(k+1/2)} - (u_k + q_i^{(k-1)})\right)
      \text{if } w_i^{(k+1/2)} < 0:\quad w_i^{(k+1)} = \min\left(0,\; w_i^{(k+1/2)} + (u_k - q_i^{(k-1)})\right)

Implementation 10 lines of code!
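A minimal Python sketch of the update with the cumulative penalty, assuming dense numpy arrays; u and q are the quantities defined two slides earlier, and all names are illustrative rather than the authors' code:

    import numpy as np

    def sgd_l1_cumulative_step(w, grad, q, u, eta, C, N):
        """One SGD step with the cumulative L1 penalty.

        w    : weight vector (updated in place)
        grad : gradient of log p(y_t | x_t; w) for the current sample
        q    : total L1 penalty actually applied to each weight so far
        u    : total L1 penalty that should have been applied so far (scalar)
        """
        u += eta * C / N                   # penalty every weight is owed by now
        w += eta * grad                    # ordinary stochastic gradient step
        w_half = w.copy()
        pos = w_half > 0
        neg = w_half < 0
        # Apply the owed penalty, clipped so that weights never cross zero.
        w[pos] = np.maximum(0.0, w_half[pos] - (u + q[pos]))
        w[neg] = np.minimum(0.0, w_half[neg] + (u - q[neg]))
        q += w - w_half                    # record the penalty actually applied
        return u

In the actual algorithm the penalty step is applied lazily, only to the features that fire in the current sample, which is what keeps each update cheap; the dense version above is just the shortest way to show the bookkeeping.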

Experiments
- Model: conditional random fields (CRFs)
- Baseline: OWL-QN (Andrew and Gao, 2007)
- Tasks:
  - Text chunking (shallow parsing): CoNLL 2000 shared task data; recognize base syntactic phrases (e.g. NP, VP, PP)
  - Named entity recognition: NLPBA 2004 shared task data; recognize names of genes, proteins, etc.
  - Part-of-speech (POS) tagging: WSJ corpus (sections 0-18 for training)

CoNLL 2000 chunking task: objective [figure: objective value during training]

CoNLL 2000 chunking: non-zero features [figure: number of non-zero features during training]

CoNLL 2000 chunking: performance of the produced model

                                Passes     Obj.   # Features   Time (sec)   F-score
OWL-QN                             160   -1.583       18,109          598     93.62
SGD (Naive)                         30   -1.671      455,651        1,117     93.64
SGD (Clipping + Lazy Update)                          87,792          144     93.65
SGD (Cumulative)                         -1.653       28,189          149     93.68
SGD (Cumulative + ED)                    -1.622       23,584          148     93.66

- Training is 4 times faster than OWL-QN.
- The model is 4 times smaller than with the clipping-at-zero approach.
- The objective is also slightly better.

NLPBA 2004 named entity recognition

                                Passes     Obj.   # Features   Time (sec)   F-score
OWL-QN                             160   -2.448       30,710        2,253     71.76
SGD (Naive)                         30   -2.537    1,032,962        4,528     71.20
SGD (Clipping + Lazy Update)             -2.538      279,886          585
SGD (Cumulative)                         -2.479       31,986          631     71.40
SGD (Cumulative + ED)                    -2.443       25,965                  71.63

Part-of-speech tagging on WSJ

                                Passes     Obj.   # Features   Time (sec)   Accuracy
OWL-QN                             124   -1.941       50,870        5,623      97.16
SGD (Naive)                         30   -2.013    2,142,130       18,471      97.18
SGD (Clipping + Lazy Update)                         323,199        1,680
SGD (Cumulative)                         -1.987       62,043        1,777      97.19
SGD (Cumulative + ED)                    -1.954       51,857        1,774      97.17

Discussions
- Convergence
  - Demonstrated empirically.
  - The penalties applied are not i.i.d., which complicates a theoretical analysis.
- Learning rate (see the sketch after this list)
  - The need for tuning can be annoying.
  - Rule of thumb: exponential decay (passes = 30, alpha = 0.85).
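A minimal sketch of the exponential-decay schedule behind that rule of thumb; eta0 (the initial rate) and the exact parameterization are assumptions rather than the authors' settings:

    def learning_rate(eta0, alpha, t, N):
        """Exponential decay: shrink the rate by a factor alpha per pass.

        t is the index of the current update (one per training sample),
        so t / N counts completed passes over the N training samples.
        """
        return eta0 * alpha ** (t / N)

With alpha = 0.85 and 30 passes, the final rate is 0.85 ** 30, i.e. less than 1% of the initial rate.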

Conclusions
- Stochastic gradient descent training for L1-regularized log-linear models.
- Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available.
- 3 to 4 times faster than OWL-QN.
- Extremely easy to implement.