Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou, University of Manchester
Log-linear models in NLP
Maximum entropy models: text classification (Nigam et al., 1999), history-based approaches (Ratnaparkhi, 1998)
Conditional random fields: part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
Structured prediction: parsing (Clark and Curran, 2004), semantic role labeling (Toutanova et al., 2005), etc.
Log-linear models
A log-linear (a.k.a. maximum entropy) model:

    p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_i w_i f_i(x, y) \right), \qquad Z(x) = \sum_{y'} \exp\!\left( \sum_i w_i f_i(x, y') \right)

where w_i is a weight, f_i is a feature function, and Z(x) is the partition function.
Training: maximize the conditional likelihood of the training data.
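As a concrete (and hypothetical) illustration of what such a model computes, here is a minimal Python sketch; the feature function, weights, and labels are invented for the example and are not from the slides.

    import math

    def log_linear_prob(weights, feature_fn, x, y, labels):
        """p(y | x) = exp(sum_i w_i f_i(x, y)) / Z(x)."""
        def score(label):
            return sum(weights.get(name, 0.0) * value
                       for name, value in feature_fn(x, label).items())
        z = sum(math.exp(score(label)) for label in labels)   # partition function Z(x)
        return math.exp(score(y)) / z

    # Toy example: one indicator feature conjoined with the candidate label.
    feature_fn = lambda x, label: {("word=" + x, label): 1.0}
    weights = {("word=bank", "NOUN"): 1.2, ("word=bank", "VERB"): 0.3}
    print(log_linear_prob(weights, feature_fn, "bank", "NOUN", ["NOUN", "VERB"]))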
Regularization
To avoid overfitting to the training data, penalize the weights of the features.
L1 regularization: most of the weights become zero, which produces sparse (compact) models and saves memory and storage (the regularized objective is sketched below).
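The slides do not reproduce the regularized objective; assuming the setup in the paper, the L1-regularized conditional log-likelihood maximized during training is

    L(w) = \sum_{j=1}^{N} \log p(y_j \mid x_j; w) \;-\; C \sum_i |w_i|

where C controls the strength of the penalty and N is the number of training samples.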
Training log-linear models
Numerical optimization methods: gradient descent (steepest descent or hill-climbing), quasi-Newton methods (e.g. BFGS, OWL-QN), stochastic gradient descent (SGD), etc.
Training can take several hours (or even days), depending on the complexity of the model, the size of the training data, etc.
Gradient Descent (Hill Climbing)
Compute the gradient of the objective over the entire training set, then update the weights in that direction.
(Figure: hill-climbing steps on the objective.)
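In symbols (my notation, not shown in this transcript), one full gradient-ascent step on an objective L(w), with learning rate \eta_k, uses the gradient computed over all N training samples:

    w^{(k+1)} = w^{(k)} + \eta_k \, \nabla_w L(w^{(k)})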
Stochastic Gradient Descent (SGD)
Compute an approximate gradient of the objective using one training sample at a time.
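The corresponding SGD step (again my notation) replaces the full gradient with the gradient of a single sample's term, written here without the L1 penalty:

    w^{(k+1)} = w^{(k)} + \eta_k \, \nabla_w \log p(y_{j_k} \mid x_{j_k}; w^{(k)})

where (x_{j_k}, y_{j_k}) is the training sample picked at step k.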
Stochastic Gradient Descent (SGD)
The weight update procedure is very simple (similar to the Perceptron algorithm), but the L1 term is not differentiable at zero. (\eta_k: learning rate.)
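Assuming the paper's convention of spreading the penalty over the N samples, the naive per-sample update would be

    w^{(k+1)} = w^{(k)} + \eta_k \, \nabla_w \!\left( \log p(y_{j_k} \mid x_{j_k}; w) - \frac{C}{N} \sum_i |w_i| \right)\!\Big|_{w = w^{(k)}}

but the term |w_i| is not differentiable at w_i = 0, so this gradient is not defined everywhere.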
Using subgradients
Weight update procedure: replace the gradient of the L1 term with a subgradient (sketched below).
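A subgradient version of the update (my reconstruction; the subgradient of |w_i| is taken to be sign(w_i), with 0 at w_i = 0) is

    w_i^{(k+1)} = w_i^{(k)} + \eta_k \left( \frac{\partial \log p(y_{j_k} \mid x_{j_k}; w^{(k)})}{\partial w_i} - \frac{C}{N}\,\mathrm{sign}\!\left(w_i^{(k)}\right) \right)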
Using subgradients: problems
L1 penalty needs to be applied to all features (including the ones that are not used in the current sample). Few weights become zero as a result of training.
Clipping-at-zero approach
When an update would push a weight past zero, clip it at zero (Carpenter, 2008). This is a special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009), and it enables lazy updates.
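Written out (my reconstruction of the clipping step, with the same C/N scaling as above): first take the unregularized gradient step, then apply the penalty and clip at zero so the weight cannot cross it:

    w_i^{(k+\frac{1}{2})} = w_i^{(k)} + \eta_k \, \frac{\partial \log p(y_{j_k} \mid x_{j_k}; w^{(k)})}{\partial w_i}

    w_i^{(k+1)} = \begin{cases} \max\!\left(0,\; w_i^{(k+\frac{1}{2})} - \eta_k \frac{C}{N}\right) & \text{if } w_i^{(k+\frac{1}{2})} > 0 \\ \min\!\left(0,\; w_i^{(k+\frac{1}{2})} + \eta_k \frac{C}{N}\right) & \text{if } w_i^{(k+\frac{1}{2})} < 0 \end{cases}

As I understand it, penalties for features that do not fire in the current sample can be accumulated and applied in one go the next time the feature is used; this is the lazy update the slide refers to.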
Number of non-zero features

Text chunking:
  Quasi-Newton: 18,109
  SGD (Naive): 455,651
  SGD (Clipping-at-zero): 87,792

Named entity recognition:
  Quasi-Newton: 30,710
  SGD (Naive): 1,032,962
  SGD (Clipping-at-zero): 279,886

Part-of-speech tagging:
  Quasi-Newton: 50,870
  SGD (Naive): 2,142,130
  SGD (Clipping-at-zero): 323,199
Why it does not produce sparse models
In SGD the weights are not updated smoothly: the noisy per-sample gradients keep nudging a weight back and forth, so a weight that should end up at zero often fails to become exactly zero, and the L1 penalty is wasted away.
Cumulative L1 penalty
Keep track, for each weight, of two quantities (u_k and q_i in the paper's notation): the absolute value of the total L1 penalty that should have been applied to the weight so far, and the total L1 penalty that has actually been applied to it.
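In the paper's notation (as I recall it), with learning rate \eta_t at update t, regularization constant C, and N training samples:

    u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t \qquad \text{(total L1 penalty each weight should have received by step } k\text{)}

    q_i^{(k)} = \sum_{t=1}^{k} \left( w_i^{(t+1)} - w_i^{(t+\frac{1}{2})} \right) \qquad \text{(signed L1 penalty actually applied to } w_i \text{ so far)}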
Applying L1 with cumulative penalty
Penalize each weight according to the difference between u_k (the penalty it should have received) and q_i (the penalty it has actually received).
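My reconstruction of the update described here, matching the algorithm in the paper as I remember it: after the usual gradient step w_i^{(k+\frac{1}{2})} = w_i^{(k)} + \eta_k \, \partial \log p(y_{j_k} \mid x_{j_k}; w^{(k)}) / \partial w_i, apply

    \text{if } w_i^{(k+\frac{1}{2})} > 0: \quad w_i^{(k+1)} = \max\!\left(0,\; w_i^{(k+\frac{1}{2})} - \left(u_k + q_i^{(k-1)}\right)\right)
    \text{if } w_i^{(k+\frac{1}{2})} < 0: \quad w_i^{(k+1)} = \min\!\left(0,\; w_i^{(k+\frac{1}{2})} + \left(u_k - q_i^{(k-1)}\right)\right)

and then record the penalty that was actually applied: q_i^{(k)} = q_i^{(k-1)} + \left( w_i^{(k+1)} - w_i^{(k+\frac{1}{2})} \right).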
Implementation: 10 lines of code!
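The actual pseudocode on the slide is not reproduced in this transcript. Below is a minimal Python sketch of the cumulative-penalty update following the equations above; the names (w, q, u, apply_penalty, sgd_l1_step, grad) are mine, not from the slides.

    from collections import defaultdict

    w = defaultdict(float)   # current weights
    q = defaultdict(float)   # signed L1 penalty actually applied to each weight so far
    u = 0.0                  # total L1 penalty every weight should have received so far

    def apply_penalty(i, u):
        """Move w[i] toward zero by the unpaid part of the cumulative L1 penalty."""
        z = w[i]
        if w[i] > 0:
            w[i] = max(0.0, w[i] - (u + q[i]))
        elif w[i] < 0:
            w[i] = min(0.0, w[i] + (u - q[i]))
        q[i] += w[i] - z                      # record the penalty actually applied

    def sgd_l1_step(grad, u, eta, C, N):
        """One SGD update; grad maps feature index -> d/dw_i log p(y|x; w) for one sample."""
        u += eta * C / N                      # penalty owed by every weight after this step
        for i, g in grad.items():
            w[i] += eta * g                   # ordinary stochastic gradient step
            apply_penalty(i, u)               # lazily settle the cumulative L1 penalty
        return u

Usage inside the training loop would look like u = sgd_l1_step(grad_of_current_sample, u, eta, C, N); because the bookkeeping is per feature, only the features that fire in the current sample are touched, which keeps each update cheap.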
Experiments
Model: conditional random fields (CRFs)
Baseline: OWL-QN (Andrew and Gao, 2007)
Tasks:
Text chunking (shallow parsing): CoNLL 2000 shared task data; recognize base syntactic phrases (e.g. NP, VP, PP)
Named entity recognition: NLPBA 2004 shared task data; recognize names of genes, proteins, etc.
Part-of-speech (POS) tagging: WSJ corpus (sections 0-18 for training)
CoNLL 2000 chunking task: objective
CoNLL 2000 chunking: non-zero features
CoNLL 2000 chunking: performance of the produced model

                                Passes   Obj.     # Features   Time (sec)   F-score
  OWL-QN                        160      -1.583   18,109       598          93.62
  SGD (Naive)                   30       -1.671   455,651      1,117        93.64
  SGD (Clipping + Lazy Update)                    87,792       144          93.65
  SGD (Cumulative)                       -1.653   28,189       149          93.68
  SGD (Cumulative + ED)                  -1.622   23,584       148          93.66

Training is 4 times faster than OWL-QN, the model is 4 times smaller than with the clipping-at-zero approach, and the objective is also slightly better.
NLPBA 2004 named entity recognition

                                Passes   Obj.     # Features   Time (sec)   F-score
  OWL-QN                        160      -2.448   30,710       2,253        71.76
  SGD (Naive)                   30       -2.537   1,032,962    4,528        71.20
  SGD (Clipping + Lazy Update)           -2.538   279,886      585
  SGD (Cumulative)                       -2.479   31,986       631          71.40
  SGD (Cumulative + ED)                  -2.443   25,965                    71.63

Part-of-speech tagging on WSJ

                                Passes   Obj.     # Features   Time (sec)   Accuracy
  OWL-QN                        124      -1.941   50,870       5,623        97.16
  SGD (Naive)                   30       -2.013   2,142,130    18,471       97.18
  SGD (Clipping + Lazy Update)                    323,199      1,680
  SGD (Cumulative)                       -1.987   62,043       1,777        97.19
  SGD (Cumulative + ED)                  -1.954   51,857       1,774        97.17
Discussions
Convergence: the penalties applied are not i.i.d., which complicates a formal proof; convergence is demonstrated empirically.
Learning rate: the need for tuning can be annoying. Rule of thumb: exponential decay (passes = 30, alpha = 0.85).
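The schedule itself is not written out in this transcript; as I recall the paper, the exponential-decay learning rate has the form below, with \eta_0 the initial rate, N the number of training samples, k the update count, and \alpha = 0.85 the per-pass decay factor:

    \eta_k = \eta_0 \, \alpha^{k/N}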
Conclusions
Stochastic gradient descent training for L1-regularized log-linear models: force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available. Training is 3 to 4 times faster than OWL-QN, and the method is extremely easy to implement.