Semi-supervised Training of Statistical Parsers CMSC 35100 Natural Language Processing January 26, 2006.


1 Semi-supervised Training of Statistical Parsers CMSC 35100 Natural Language Processing January 26, 2006

2 Roadmap
Motivation:
 – Resource bottleneck
Co-training
Co-training with different parsers:
 – CFG & LTAG
Experiments:
 – Initial seed set size
 – Parse selection
 – Domain porting
Results and discussion

3 Motivation: Issues
Current statistical parsers:
 – Many grammatical models
 – Significant progress: F-score ~93%
Issues:
 – Trained on ~1M words of the Penn WSJ treebank
   Annotation is a significant investment of time & money
 – Portability:
   Single genre: business news
   Later treebanks: smaller, still news
 – Training resource bottleneck

4 Motivation: Approach
Goal:
 – Enhance portability and performance without large amounts of additional training data
Observations:
 – "Self-training": train the parser on its own output
   Very small improvement (better counts for heads)
   Limited to slightly refining the current model
 – Ensemble methods, voting: useful
Approach: Co-training

5 Co-Training
Co-training (Blum & Mitchell 1998):
 – Weakly supervised training technique
   Successful for basic classification
 – Materials:
   Small "seed" set of labeled examples
   Large set of unlabeled examples
 – Training: evidence from multiple models
   Optimize the degree of agreement between models on the unlabeled data
   Train several models on the seed data
   Run them on unlabeled data
   Use new "reliable" labeled examples to train the other models
   Iterate (a minimal sketch of this loop follows)
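As a concrete illustration of the loop above, here is a minimal co-training sketch in Python. The parser objects, their train/parse/score methods, and select_reliable are hypothetical placeholders for this sketch, not the actual Collins-CFG or LTAG implementations.

# Minimal co-training sketch; parser objects and their train/parse/score
# methods are hypothetical placeholders, not the actual systems.
def co_train(parser_a, parser_b, seed, unlabeled, iterations=10, batch=30):
    """Each parser repeatedly labels data that is used to retrain the other."""
    labeled_a, labeled_b = list(seed), list(seed)
    parser_a.train(labeled_a)
    parser_b.train(labeled_b)
    for _ in range(iterations):
        if not unlabeled:
            break
        # Draw a small cache of unlabeled sentences.
        cache = [unlabeled.pop() for _ in range(min(batch, len(unlabeled)))]
        # Each parser labels the cache and attaches its own confidence score.
        parsed_a = [(s, parser_a.parse(s), parser_a.score(s)) for s in cache]
        parsed_b = [(s, parser_b.parse(s), parser_b.score(s)) for s in cache]
        # The most "reliable" output of each parser trains the other.
        labeled_b += select_reliable(parsed_a)   # A teaches B
        labeled_a += select_reliable(parsed_b)   # B teaches A
        parser_a.train(labeled_a)
        parser_b.train(labeled_b)
    return parser_a, parser_b

def select_reliable(parsed, top_n=20):
    """Simplest heuristic: keep the teacher's top-n highest-scoring parses."""
    ranked = sorted(parsed, key=lambda t: t[2], reverse=True)
    return [(sentence, tree) for sentence, tree, _ in ranked[:top_n]]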

6 Co-training Issues
Challenge:
 – Picking reliable novel examples
   No guaranteed, simple approach; rely on heuristics
 – Intersection: highly ranked by the other parser, low by self
 – Difference: the other parser's score exceeds self's by some margin
Possibly employ parser confidence measures (see the sketch below)
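A hedged sketch of the intersection and difference heuristics, assuming each newly parsed sentence carries a self-assigned score from the parser that produced it (the teacher) and from the parser that would consume it (the student); the field names and thresholds are illustrative, not taken from the paper.

# Illustrative reliability heuristics; "teacher" is the parser that produced
# the parse, "student" is the parser that would be trained on it.
def intersection(scored, top_n=20, bottom_n=20):
    """Sentences ranked highly by the teacher but low by the student."""
    by_teacher = sorted(scored, key=lambda x: x["teacher_score"], reverse=True)
    by_student = sorted(scored, key=lambda x: x["student_score"])
    top = {x["sentence"] for x in by_teacher[:top_n]}
    bottom = {x["sentence"] for x in by_student[:bottom_n]}
    return [x for x in scored if x["sentence"] in top and x["sentence"] in bottom]

def difference(scored, margin=0.1):
    """Sentences where the teacher's score exceeds the student's by a margin."""
    return [x for x in scored if x["teacher_score"] - x["student_score"] > margin]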

7 Experimental Structure
Approach (Steedman et al., 2003):
 – Focus here: co-training with different parsers
   Also examined reranking, and supertaggers & parsers
 – Co-train CFG (Collins) & LTAG
 – Data: Penn Treebank WSJ, Brown, NA News
Questions:
 – How to select reliable novel samples?
 – How does labeled seed size affect co-training?
 – How effective is co-training within and across genres?

8 System Architecture
Two "different" parsers:
 – "Views" can differ by feature space
   Here: Collins-CFG & LTAG
 – Comparable performance, different formalisms
Cache Manager:
 – Draws unlabeled sentences for the parsers to label
 – Selects a subset of the newly labeled sentences for the training set

9 Two Different Parsers
Both train on treebank input:
 – Lexicalized, head information percolated
Collins-CFG:
 – Lexicalized CFG parser
   "Bi-lexical": each pair of non-terminals leads to a bigram relation between a pair of lexical items
   Ph = head percolation; Pm = modifiers of the head daughter
LTAG:
 – Lexicalized TAG parser
   Bigram relations between trees
   Ps = substitution probability; Pa = adjunction probability
Differ in tree creation and in the depth of lexical relations (a rough sketch follows)
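Both models ultimately rest on bigram statistics over lexical anchors. As a loosely hedged illustration only (the real Collins-CFG and LTAG models add smoothing, distance features, and much richer conditioning), a bare relative-frequency estimate of a head-modifier dependency probability could be computed like this:

# Illustrative only: maximum-likelihood estimate of a bilexical
# head-modifier probability, ignoring smoothing and extra conditioning.
from collections import Counter

head_mod_counts = Counter()   # (head_word, modifier_word) -> count
head_counts = Counter()       # head_word -> count

def observe(head, modifier):
    """Record one head-modifier dependency seen in the treebank."""
    head_mod_counts[(head, modifier)] += 1
    head_counts[head] += 1

def p_modifier_given_head(modifier, head):
    """Relative-frequency estimate of P(modifier | head)."""
    if head_counts[head] == 0:
        return 0.0
    return head_mod_counts[(head, modifier)] / head_counts[head]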

10 Selecting Labeled Examples
Scoring the parse:
 – Ideal (true) score is impossible to compute
   F-prob: trust the parser; F-norm-prob: normalize by sentence length
   F-entropy: difference between the parse score distribution and uniform
 – Baselines: number of parses, sentence length
Selecting (newly labeled) sentences:
 – Goal: minimize noise, maximize training utility
   S-base: n highest scores (both parsers use the same set)
   Asymmetric (teacher/student):
 – S-topn: teacher's top n
 – S-intersect: sentences in the teacher's top n and the student's bottom n
 – S-diff: teacher's score higher than the student's by some amount
(A sketch of the scoring functions follows.)
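One possible reading of the scoring functions above, sketched in Python; the parse probabilities are assumed to come from the parser's own model, and the exact normalizations used in the paper may differ.

import math

def f_prob(parse_prob):
    """F-prob: trust the parser; the score is the probability of its best parse."""
    return parse_prob

def f_norm_prob(parse_prob, sentence_length):
    """F-norm-prob: normalize by length so long sentences are not penalized."""
    return parse_prob ** (1.0 / max(sentence_length, 1))

def f_entropy(nbest_probs):
    """F-entropy: distance of the n-best parse distribution from uniform;
    a peaked (low-entropy) distribution suggests a confident parser."""
    total = sum(nbest_probs)
    if total == 0 or len(nbest_probs) < 2:
        return 0.0
    probs = [p / total for p in nbest_probs if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.log(len(probs)) - entropy  # larger = further from uniform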

11 Experiments: Initial Seed Size
Typically evaluate after all training; here, consider the convergence rate
 – Initial rapid growth, tailing off with more data
 – Largest improvement: 500-1000 instances
   Collins-CFG plateaus at 40K (89.3); LTAG still improving
 – Will benefit from additional training
Co-training with 500 vs. 1000 seed instances:
 – Less data, greater benefit: enhances coverage
 – However, the 500-sentence seed doesn't reach the level of the 1000-sentence seed

12 (figure slide; no transcript text)

13 Experiments: Parse Selection
Contrast:
 – Select-all newly labeled sentences vs. S-intersect (67%)
Co-training experiments (500-sentence seed set):
 – LTAG performs better with S-intersect
   Reduces noise; LTAG is sensitive to noisy trees
 – CFG performs better with select-all
   CFG needs to increase coverage, so it benefits from more samples

14-16 (figure slides; no transcript text)

17 Experiments: Cross-domain
Train on Brown corpus (1000-sentence seed):
 – Co-train on WSJ
 – CFG with S-intersect improves, 76.6 -> 78.3
   Mostly in the first 5 iterations: lexicalizing for new-domain vocabulary
Train on Brown + 100 WSJ seed sentences:
 – Co-train on other WSJ data
 – Baseline improves to 78.7, co-training to 80
   Gradual improvement; learning new constructions?

18 Summary
Semi-supervised parser training via co-training:
 – Two different parse formalisms provide different views
 – Enhances effectiveness
   Biggest gains with small seed sets
   Cross-domain enhancement
 – Selection methods depend on the parse model and the amount of seed data

19 Findings
 – Co-training enhances parsing when trained on small datasets: 500-10,000 sentences
 – Co-training aids genre porting without labels for the new genre
 – Co-training improves further with even a small amount of labeled data for the new genre
 – Sample selection is crucial; several approaches examined

