Self-training with Products of Latent Variable Grammars
Zhongqiang Huang, Mary Harper, and Slav Petrov
Overview
Motivation and Prior Related Research
Experimental Setup
Results
Analysis
Conclusions
PCFG-LA Parser [Matsuzaki et al. '05; Petrov et al. '06; Petrov & Klein '07]
(figure: parse tree, sentence, parameters, and derivations)
PCFG-LA Parser: Hierarchical Splitting (and Merging)
Each original node (e.g., NP) is split into 2, then 4, then 8, ... latent subcategories (NP1, NP2, ...), with increased model complexity at each step.
The n-th grammar is the grammar trained after n split-merge rounds.
Grammar order selection: use the development set (figure: typical learning curve).
PCFG-LA Properties
Hierarchical Training
◦ Increase the number of latent states hierarchically
Adaptive State Splitting
◦ Goal is to split complex categories more and simple categories less
◦ Idea: split everything, then roll back the splits that are least useful (measured by the loss in likelihood from removing a split; typically 50% are undone)
Parameter Smoothing (pool statistics)
Decoding Methods (max-rule-product)
Coarse-to-Fine Parsing (to speed up decoding)
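The split-merge loop behind these properties can be summarized in a few lines. The following is a minimal sketch, not the Berkeley parser's actual implementation: split_states, run_em, and loss_if_merged are hypothetical callables, and new_splits/merge are assumed methods on a grammar object.

    # Minimal sketch of hierarchical split-merge training with rollback.
    # split_states, run_em, and loss_if_merged are hypothetical callables;
    # grammar.new_splits and grammar.merge are assumed interfaces.
    def split_merge(grammar, treebank, split_states, run_em, loss_if_merged,
                    rounds=7, merge_fraction=0.5):
        grammars = []
        for _ in range(rounds):
            grammar = run_em(split_states(grammar), treebank)   # split every state, re-estimate
            # Adaptive splitting: undo the splits whose removal loses the least likelihood.
            losses = {s: loss_if_merged(grammar, treebank, s) for s in grammar.new_splits}
            undo = sorted(losses, key=losses.get)[:int(merge_fraction * len(losses))]
            grammar = run_em(grammar.merge(undo), treebank)     # re-estimate and smooth after merging
            grammars.append(grammar)                            # this is the n-th grammar
        return grammars                                         # pick the grammar order on the dev set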
Max-Rule Decoding (Single Grammar) [Goodman '98; Matsuzaki et al. '05; Petrov & Klein '07]
(figure: example tree with S, NP, and VP nodes)
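For reference, the quantity being maximized can be written out. Up to notation, and following Petrov & Klein '07 (the slide's own derivation may differ in details), the posterior of an anchored rule A -> B C over the span (i, k, j) of a sentence w_1..w_n is computed from inside/outside scores:

    q(A \to B\,C,\, i, k, j) = \frac{1}{P(w_{1:n})} \sum_{x,y,z}
        P_{OUT}(A_x, i, j)\; P(A_x \to B_y C_z)\; P_{IN}(B_y, i, k)\; P_{IN}(C_z, k, j)

Max-rule decoding then selects T* = \arg\max_T \prod_{r \in T} q(r), summing out the latent annotations x, y, z rather than maximizing over individual derivations.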
Variability [Petrov '10]
(figure: performance of individually trained grammars varies across training runs)
Max-Rule Decoding (Multiple Grammars) [Petrov '10]
(figure: several grammars trained from the treebank are combined at decoding time)
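My reading of [Petrov '10] is that the grammars are combined simply by multiplying their per-rule posteriors before running the same max-rule dynamic program. As a hedged sketch, for n grammars with per-grammar rule posteriors q_i(r):

    q(r) \propto \prod_{i=1}^{n} q_i(r), \qquad
    T^* = \arg\max_T \prod_{r \in T} \prod_{i=1}^{n} q_i(r)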
Product Model Results [Petrov '10]
(results figure)
Motivation for Self-Training
Self-training (ST)
(diagram: train on hand-labeled data, label the unlabeled data, train on the automatically labeled data, select with the dev set)
Self-training (ST)
(diagram: a model trained on the hand-labeled data labels the unlabeled data; a new model is then trained on the hand-labeled plus automatically labeled data)
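The two diagrams boil down to a short loop. A minimal sketch, where train(), parse(), and f_score() are hypothetical stand-ins for the actual PCFG-LA trainer, decoder, and evaluator:

    # Minimal self-training sketch; train(), parse(), and f_score() are
    # hypothetical stand-ins for the real trainer, decoder, and evaluator.
    def self_train(hand_labeled, unlabeled, dev, train, parse, f_score, max_rounds=7):
        base = train(hand_labeled)                      # model trained on the treebank
        auto = [parse(base, s) for s in unlabeled]      # automatically labeled data
        # Retrain on hand-labeled plus automatically labeled data, one grammar per
        # split-merge round, then select the grammar order on the dev set.
        candidates = [train(hand_labeled + auto, rounds=n) for n in range(1, max_rounds + 1)]
        return max(candidates, key=lambda g: f_score(g, dev))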
Self-Training Curve
(figure)
WSJ Self-Training Results [Huang & Harper '09]
(figure: F scores)
Self-Trained Grammar Variability
(figure: self-trained parser)
Self-Trained Grammar Variability
(figure: self-trained grammars at rounds 6 and 7)
Summary
Two issues: variability and over-fitting
Product model: makes use of variability, but over-fitting remains in the individual grammars
Self-training: alleviates over-fitting, but variability remains in the individual grammars
Next step: combine self-training with product models
Experimental Setup
Two genres:
◦ WSJ: sections 2-21 for training, 22 for dev, 23 for test; 176.9K sentences per self-trained grammar
◦ Broadcast News: WSJ + 80% of BN for training, 10% for dev, 10% for test (see paper)
Training scenarios: train 10 models with different seeds and combine using max-rule decoding
◦ Regular: treebank training with up to 7 split-merge iterations
◦ Self-training: three methods with up to 7 split-merge iterations
ST-Reg
(diagram: the round-6 product labels the unlabeled data, giving a single automatically labeled set; new grammars are trained on the hand-labeled plus automatically labeled data; multiple grammars can form a product, and selection is done with the dev set)
ST-Prod
(diagram: the same single automatically labeled set from the round-6 product; the self-trained grammars are combined into a product; question: use more data?)
ST-Prod-Mult
(diagram: the round-6 product produces 10 different automatically labeled sets, one per grammar; the self-trained grammars are combined into a product)
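The three setups differ only in what is self-trained on which automatically labeled data. A rough sketch of my reading of the diagrams, with hypothetical helpers train(), parse_all(), and product(); round6_product stands for the product of the regular grammars after six split-merge rounds:

    # Hypothetical helpers: train(data, seed) trains one PCFG-LA grammar,
    # parse_all(model, sents) returns automatically labeled trees, and
    # product(grammars) forms the max-rule product combination.
    def st_reg(treebank, unlabeled, seeds, round6_product):
        # One automatically labeled set, produced by the round-6 product;
        # every self-trained grammar is trained on that same set.
        auto = parse_all(round6_product, unlabeled)
        return [train(treebank + auto, seed=s) for s in seeds]

    def st_prod(treebank, unlabeled, seeds, round6_product):
        # Same single automatically labeled set; the final model is the
        # product of the self-trained grammars.
        return product(st_reg(treebank, unlabeled, seeds, round6_product))

    def st_prod_mult(treebank, unlabeled_sets, seeds, round6_product):
        # A different automatically labeled set per grammar, each parsed by
        # the round-6 product, then combined as a product.
        return product([train(treebank + parse_all(round6_product, u), seed=s)
                        for s, u in zip(seeds, unlabeled_sets)])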
A Closer Look at Regular Results
(results figures)
A Closer Look at Self-Training Results
(results figures)
Analysis of Rule Variance
To get at the diversity among the learned grammars, we measure the average empirical variance of the log posterior probabilities of the rules across the grammars, over a held-out set S:
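One plausible way to write this measure, assuming q_i(r) denotes the posterior probability of rule occurrence r under the i-th of n grammars and R(S) the rule occurrences in the held-out set S (the slide's exact formula may differ):

    Var(S) = \frac{1}{|R(S)|} \sum_{r \in R(S)} \frac{1}{n-1} \sum_{i=1}^{n}
             \Bigl( \log q_i(r) - \frac{1}{n} \sum_{j=1}^{n} \log q_j(r) \Bigr)^2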
Analysis of Rule Variance
(figure)
English Test Set Results (WSJ Section 23)
Single parser: [Charniak '00], [Petrov et al. '06], [Carreras et al. '08], [Huang & Harper '08], this work
Product: [Petrov '10], this work
Reranker: [Charniak & Johnson '05], [Huang '08], [McClosky et al. '06]
Parser combination: [Sagae & Lavie '06], [Fossum & Knight '09], [Zhang et al. '09]
Broadcast News
(results figure)
Conclusions
Very high parse accuracies can be achieved by combining self-training and product models on newswire and broadcast news parsing tasks.
Two important factors:
1. Accuracy of the model used to parse the unlabeled data
2. Diversity of the individual grammars