Semi-supervised Training of Statistical Parsers
CMSC Natural Language Processing
January 26, 2006
Roadmap
Motivation:
–Resource bottleneck
Co-training
Co-training with different parsers:
–CFG & LTAG
Experiments:
–Initial seed set size
–Parse selection
–Domain porting
Results and discussion
Motivation: Issues
Current statistical parsers:
–Many grammatical models
–Significant progress: F-score ~93%
Issues:
–Trained on ~1M words of the Penn WSJ treebank
  Annotation is a significant investment of time & money
–Portability: single genre – business news
  Later treebanks – smaller, still news
–Training resource bottleneck
Motivation: Approach
Goal:
–Enhance portability and performance without large amounts of additional training data
Observations:
–"Self-training": train a parser on its own output
  Very small improvement (better counts for heads)
  Limited to slightly refining the current model
–Ensemble methods, voting: useful
Approach: Co-training
Co-Training
Co-training (Blum & Mitchell 1998):
–Weakly supervised training technique
  Successful for basic classification
–Materials:
  Small "seed" set of labeled examples
  Large set of unlabeled examples
–Training: evidence from multiple models
  Optimize degree of agreement between models on unlabeled data
  Train several models on seed data
  Run them on unlabeled data
  Use new "reliable" labeled examples to train the others
  Iterate (see the sketch below)
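A minimal sketch of this loop, assuming hypothetical parser objects with train/parse/score methods (all names and the 0.9 threshold are illustrative, not from Blum & Mitchell or Steedman et al.):

# Minimal co-training loop sketch (illustrative names, not the authors' code).
def co_train(parser_a, parser_b, seed, unlabeled, iterations=10, cache_size=30):
    """Each parser labels a small cache of sentences; the examples it scores
    as most reliable are handed to the *other* parser as new training data."""
    labeled_a, labeled_b = list(seed), list(seed)
    for _ in range(iterations):
        parser_a.train(labeled_a)
        parser_b.train(labeled_b)
        # Draw a cache of unlabeled sentences (the cache-manager role).
        cache, unlabeled = unlabeled[:cache_size], unlabeled[cache_size:]
        scored_a = [(s, parser_a.parse(s), parser_a.score(s)) for s in cache]
        scored_b = [(s, parser_b.parse(s), parser_b.score(s)) for s in cache]
        # Keep only "reliable" newly labeled examples (selection heuristics vary; see later slides).
        labeled_b += [(s, tree) for s, tree, score in scored_a if score > 0.9]
        labeled_a += [(s, tree) for s, tree, score in scored_b if score > 0.9]
    return parser_a, parser_b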
Co-training Issues
Challenge: picking reliable novel examples
–No guaranteed, simple approach
–Rely on heuristics:
  Intersection: highly ranked by the other parser, low by self
  Difference: the other parser's score exceeds one's own by some margin
–Possibly employ parser confidence measures
Experimental Structure
Approach (Steedman et al., 2003):
–Focus here: co-training with different parsers
  Also examined reranking, supertaggers & parsers
–Co-train CFG (Collins) & LTAG
–Data: Penn Treebank WSJ, Brown, NA News
Questions:
–How to select reliable novel samples?
–How does labeled seed size affect co-training?
–How effective is co-training within and across genres?
System Architecture
Two "different" parsers:
–"Views" – can differ by feature space
  Here: Collins-CFG & LTAG
–Comparable performance, different formalisms
Cache Manager:
–Draws unlabeled sentences for the parsers to label
–Selects a subset of the newly labeled sentences for the training set
Two Different Parsers
Both train on treebank input:
–Lexicalized, head information percolated
Collins-CFG:
–Lexicalized CFG parser
–"Bi-lexical": each pair of non-terminals leads to a bigram relation between a pair of lexical items
–Ph = head percolation; Pm = modifiers of the head daughter
LTAG:
–Lexicalized TAG parser
–Bigram relations between trees
–Ps = substitution probability; Pa = adjunction probability
The parsers differ in tree creation and in the depth of lexical relations (schematic decomposition below)
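A schematic of how such lexicalized models factor a tree's probability into bilexical events. This is only illustrative: the exact conditioning contexts in the Collins-CFG and LTAG models differ from what is shown here.

% Schematic only; conditioning contexts are simplified.
P_{CFG}(T)  \approx \prod_{\text{local trees}} P_h(H \mid P, w_P) \; \prod_{M} P_m(M, w_M \mid P, H, w_P)
P_{LTAG}(T) \approx \prod_{\text{nodes } \eta} P_s(\tau', w' \mid \tau, w, \eta) \; \prod_{\text{nodes } \eta} P_a(\tau', w' \mid \tau, w, \eta)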
Selecting Labeled Examples
Scoring the parse:
–Ideal – the true score – is impossible
–F-prob: trust the parser; F-norm-prob: normalize by sentence length
–F-entropy: difference between the parse score distribution and uniform
–Baselines: number of parses, sentence length
Selecting (newly labeled) sentences:
–Goal: minimize noise, maximize training utility
–S-base: n highest scores (both parsers use the same set)
–Asymmetric: teacher/student
  S-topn: teacher's top n
  S-intersect: sentences in teacher's top n and student's bottom n
  S-diff: teacher's score exceeds student's by some amount
(A sketch of these heuristics follows below.)
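A sketch of the scoring functions and the S-intersect heuristic, assuming hypothetical parse objects with log_prob and length fields; the names are illustrative, not taken from the paper.

import math

def f_prob(parse):
    return parse.log_prob                          # trust the parser's own score

def f_norm_prob(parse):
    return parse.log_prob / max(parse.length, 1)   # normalize by sentence length

def f_entropy(n_best):
    # Difference between the n-best score distribution and a uniform one;
    # a peaked distribution suggests the parser is confident.
    probs = [math.exp(p.log_prob) for p in n_best]
    total = sum(probs) or 1.0
    probs = [p / total for p in probs]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(len(probs)) - entropy if len(probs) > 1 else 0.0

def s_intersect(teacher_scored, student_scored, n):
    # Keep sentences the teacher ranks in its top n but the student ranks in its
    # bottom n: confidently labeled by the teacher, yet informative for the student.
    top_teacher = {s for s, _ in sorted(teacher_scored, key=lambda x: -x[1])[:n]}
    bottom_student = {s for s, _ in sorted(student_scored, key=lambda x: x[1])[:n]}
    return top_teacher & bottom_student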
Experiments: Initial Seed Size
Typically evaluate after all training; here, also consider the convergence rate:
–Initial rapid growth, tailing off with more data
–Largest improvement comes from the earliest training instances
–Collins-CFG plateaus at 40K (89.3); LTAG is still improving
  Will benefit from additional training
Co-training with 500 vs. 1000 seed instances:
–Less data, greater benefit from co-training
  Enhances coverage
–However, the 500-sentence seed does not reach the level of the 1000-sentence seed
Experiments: Parse Selection
Contrast:
–Select all newly labeled sentences vs. S-intersect (67%)
Co-training experiments (500-sentence seed set):
–LTAG performs better with S-intersect
  Reduces noise; LTAG is sensitive to noisy trees
–Collins-CFG performs better with select-all
  CFG needs to increase coverage, so more samples help
Experiments: Cross-domain
Train on a Brown corpus seed:
–Co-train on WSJ
–Collins-CFG with S-intersect improves 76.6 → 78.3
  Mostly in the first 5 iterations
  Lexicalizing for new-domain vocabulary
Train on a Brown + WSJ seed:
–Co-train on additional WSJ
–Baseline improves to 78.7, co-training to 80
  Gradual improvement; new constructions?
Summary
Semi-supervised parser training via co-training:
–Two different parse formalisms provide different views
  Enhances effectiveness
–Biggest gains with small seed sets
–Cross-domain enhancement
–Selection methods depend on the parse model and the amount of seed data
Findings
–Co-training enhances parsing when trained on small labeled datasets (hundreds of sentences)
–Co-training aids genre porting without labels for the new genre
–Co-training improves further with any labeled data for the new genre
–Selecting reliable samples is crucial; several approaches examined