Report on Semi-supervised Training for Statistical Parsing
Zhang Hao
Brief Introduction
–Why semi-supervised training?
–Co-training framework and applications
–Can parsing fit in this framework? How?
–Conclusion
Why Semi-supervised Training?
A compromise between supervised and unsupervised learning
Pay-offs:
–Minimize the need for labeled data
–Maximize the value of unlabeled data
–Easy portability
Co-training Scenario
Idea: two different students learn from each other, incrementally and mutually improving ("when two walk together, each can be the other's teacher")
Difference (motive) -> mutual learning (optimization) -> agreement (objective)
Task: optimize the objective function of agreement
Heuristic selection is important: what to learn?
[Blum & Mitchell, 98] Co-training Assumptions
Classification problem
Feature redundancy:
–Allows different views of the data
–Each view is sufficient for classification on its own
Conditional independence of the views, given the class
[Blum & Mitchell, 98] Co-training Example
"Course home page" classification (yes/no)
Two views: page content text / anchor text of links pointing to the page (an idealized example: two sides of a coin)
Two naïve Bayes classifiers, one per view: they should agree
[Blum & Mitchell, 98] Co-training Algorithm
Given:
–A set L of labeled training examples
–A set U of unlabeled examples
Create a pool U' of examples by choosing u examples at random from U
Loop for k iterations:
–Use L to train a classifier h1 that considers only the x1 portion of x
–Use L to train a classifier h2 that considers only the x2 portion of x
–Allow h1 to label p positive and n negative examples from U'
–Allow h2 to label p positive and n negative examples from U'
–Add these self-labeled examples to L
–Randomly choose 2p+2n examples from U to replenish U'
Notes: n:p matches the ratio of negative to positive examples; the selected examples are the "most confidently" labeled ones, i.e. heuristic selection
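The loop above can be sketched as a toy Python implementation. The one-view frequency classifier, the binary 1/0 labels, and the confidence measure here are illustrative assumptions, not the naïve Bayes models of the original paper:

```python
import random
from collections import Counter, defaultdict

def train_view(examples, view):
    """Toy one-view classifier: per feature value, store label counts;
    confidence is the empirical frequency of the majority label."""
    counts = defaultdict(Counter)
    for x, y in examples:
        counts[x[view]][y] += 1
    def classify(x):
        c = counts.get(x[view])
        if not c:
            return 1, 0.5            # no evidence: default guess
        label, freq = c.most_common(1)[0]
        return label, freq / sum(c.values())
    return classify

def cotrain(labeled, unlabeled, k=5, u=20, p=1, n=1):
    """Blum & Mitchell-style co-training loop (sketch).  Each example x is a
    pair of views (x1, x2); labels are 1 (positive) / 0 (negative)."""
    L = list(labeled)
    U = list(unlabeled)
    random.shuffle(U)
    pool = [U.pop() for _ in range(min(u, len(U)))]   # the pool U'
    for _ in range(k):
        h1 = train_view(L, view=0)   # classifier over the x1 view
        h2 = train_view(L, view=1)   # classifier over the x2 view
        for h in (h1, h2):
            # each classifier moves its p most confident positives and
            # n most confident negatives from the pool into L
            pos = sorted((x for x in pool if h(x)[0] == 1),
                         key=lambda x: h(x)[1], reverse=True)[:p]
            neg = sorted((x for x in pool if h(x)[0] == 0),
                         key=lambda x: h(x)[1], reverse=True)[:n]
            for x in pos + neg:
                L.append((x, h(x)[0]))
                pool.remove(x)
        # replenish the pool from U
        while len(pool) < u and U:
            pool.append(U.pop())
    return train_view(L, 0), train_view(L, 1)
```

With two perfectly redundant views the two classifiers bootstrap each other from just one labeled example per class.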
Family of Algorithms Related to Co-training [Nigam & Ghani 2000]

Method      | Feature Split (Yes) | Feature Split (No)
Incremental | Co-training         | Self-training
Iterative   | Co-EM               | EM
Parsing as Supertagging and Attaching [Sarkar 2001]
How parsing differs from other NLP applications (WSD, WBPC, TC, NEI):
–A tree vs. a label
–Composite vs. monolithic
–Large vs. small parameter space
LTAG:
–Each word is tagged with a lexicalized elementary tree (supertagging)
–Parsing is a process of substitution and adjoining of elementary trees
–The supertagger finishes a very large part of the job a traditional parser must do
A Glimpse of Supertags
Two Models to Co-train
H1: selects elementary trees based on the previous context (tagging probability model)
H2: computes attachments between trees and returns the best parse (parsing probability model)
[Sarkar 2001] Co-training Algorithm
1. Input: labeled and unlabeled data
2. Update the cache: randomly select sentences from the unlabeled data and refill the cache; if the cache is empty, exit
3. Train models H1 and H2 on the labeled data
4. Apply H1 and H2 to the cache
5. Pick the n most probable outputs from H1 (run through H2) and add them to the labeled data
6. Pick the n most probable outputs from H2 and add them to the labeled data
7. n = n + k; go to step 2
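A minimal sketch of this control flow, assuming hypothetical trainer functions that return scoring models of the form score(sentence) -> (parse, probability); the real H1/H2 are a supertagger and an LTAG parser, and random cache refilling is simplified here:

```python
def sarkar_cotrain(labeled, unlabeled, train_h1, train_h2,
                   n=5, k=5, cache_size=50):
    """Sketch of the iterative co-training control flow above.
    train_h1/train_h2 are hypothetical trainers: labeled data -> scorer."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    cache = []
    while True:
        # step 2: refill the cache from the unlabeled data
        # (random selection simplified to taking from the end)
        while len(cache) < cache_size and unlabeled:
            cache.append(unlabeled.pop())
        if not cache:
            break                      # cache empty: exit
        # step 3: retrain both models on the current labeled set
        h1, h2 = train_h1(labeled), train_h2(labeled)
        # steps 4-6: each model labels the cache; its n most probable
        # outputs move into the labeled set
        for h in (h1, h2):
            best = sorted(cache, key=lambda s: h(s)[1], reverse=True)[:n]
            for s in best:
                labeled.append((s, h(s)[0]))
                cache.remove(s)
        # step 7: grow the batch size
        n += k
    return labeled
```

The growing batch size n means each pass commits more self-labeled material than the last, so the unlabeled pool is consumed at an accelerating rate.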
JHU Summer Workshop 2002 (SW2002) Tasks
–Co-train the Collins CFG parser with Sarkar's LTAG parser
–Co-train re-rankers
–Co-train CCG supertaggers and parsers
Co-training: The Algorithm
Requires:
–Two learners with different views of the task
–A Cache Manager (CM) to interface with the disparate learners
–A small set of labeled seed data and a larger pool of unlabeled data
Pseudo-code:
–Init: train both learners on the labeled seed data
–Loop: the CM picks unlabeled data to add to the cache; both learners label the cache; the CM selects newly labeled data to add to each learner's training set; the learners re-train
Novel Methods: Parse Selection
Want to select training examples for one parser (the student), labeled by the other (the teacher), so as to minimize noise and maximize training utility:
–Top-n: choose the n examples to which the teacher assigned the highest scores
–Difference: choose the examples to which the teacher assigned a higher score than the student did, by some threshold
–Intersection: choose the examples that received high scores from the teacher but low scores from the student
–Disagreement: choose the examples for which the two parsers produced different analyses and the teacher assigned a higher score than the student
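Assuming each parser's output is summarized as a sentence -> confidence score mapping (and, for Disagreement, a sentence -> best parse mapping), which is a simplification of the workshop's actual setup, the four selection methods might look like:

```python
def top_n(teacher, n):
    """Top-n: the n sentences the teacher scored highest."""
    return sorted(teacher, key=teacher.get, reverse=True)[:n]

def difference(teacher, student, threshold):
    """Difference: teacher's score exceeds the student's by a threshold."""
    return [s for s in teacher if teacher[s] - student[s] > threshold]

def intersection(teacher, student, hi, lo):
    """Intersection: high teacher score but low student score
    (hi/lo cutoffs are illustrative parameters)."""
    return [s for s in teacher if teacher[s] >= hi and student[s] <= lo]

def disagreement(teacher, student, teacher_parse, student_parse):
    """Disagreement: the parses differ and the teacher is more confident."""
    return [s for s in teacher
            if teacher_parse[s] != student_parse[s]
            and teacher[s] > student[s]]
```

Note the common intuition: Top-n trusts teacher confidence alone, while the other three target sentences where the student stands to learn something it does not already know.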
Effect of Parse Selection
CFG-LTAG Co-training
Re-rankers Co-training
What is re-ranking?
–A re-ranker reorders the output of an n-best (probabilistic) parser based on features of the parse
–While parsers use local features to make decisions, re-rankers can use features that span the entire tree
–Instead of co-training parsers, co-train different re-rankers
Re-rankers Co-training
Motivation: why re-rankers?
–Speed: parse the data once, re-rank it many times
–Objective function: the lower runtime of re-rankers allows us to explicitly maximize agreement between parses
Re-rankers Co-training
Motivation: why re-rankers?
–Accuracy: re-rankers can improve the performance of existing parsers; Collins '00 reports a 13 percent reduction in error rate from re-ranking
–Task closer to classification: a re-ranker can be seen as a binary classifier (either a parse is the best one for a sentence or it isn't); this is the original domain co-training was intended for
Re-rankers Co-training
Still experimental, with much to be explored; remember that a re-ranker is easier to develop
–Re-ranker 1: log-linear model
–Re-ranker 2: linear perceptron model
Room for improvement:
–Current best parser: 89.7
–Oracle that picks the best parse from the top 50: 95+
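A minimal sketch of the second model type, a linear perceptron re-ranker over n-best lists; the sparse feature-dict format and training interface are illustrative assumptions, not the workshop's actual models:

```python
def perceptron_rerank_train(nbest_lists, epochs=10):
    """Train a sparse perceptron to rank the gold parse first.
    nbest_lists: list of (candidates, gold_index) pairs, where each
    candidate parse is a {feature_name: value} dict (hypothetical format)."""
    w = {}
    def score(feats):
        # dot product of the weight vector with a sparse feature vector
        return sum(w.get(f, 0.0) * v for f, v in feats.items())
    for _ in range(epochs):
        for candidates, gold in nbest_lists:
            best = max(range(len(candidates)),
                       key=lambda i: score(candidates[i]))
            if best != gold:
                # perceptron update: promote the gold parse's features,
                # demote the wrongly top-ranked parse's features
                for f, v in candidates[gold].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in candidates[best].items():
                    w[f] = w.get(f, 0.0) - v
    return w, score
```

At re-ranking time one simply returns the candidate with the highest score, which is what makes the model cheap to re-run many times over the same parsed data.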
JHU SW2002 Conclusions
–The largest experimental study to date on the use of unlabeled data for improving parser performance
–Co-training enhances performance for parsers and taggers trained on small amounts (500–10,000 sentences) of labeled data
–Co-training can port parsers trained on one genre to another without any new human-labeled data at all, improving on the state of the art for this task
–Even tiny amounts of human-labeled data for the target genre enhance porting via co-training
–New methods for parse selection have been developed and play a crucial role
How to Improve Our Parser?
Similar setting: limited labeled data (Penn CTB) and a large amount of unlabeled data from a somewhat different domain (PKU People's Daily)
To try:
–Re-rankers' development cycle is much shorter, so they are worth trying; many ML techniques may be utilized
–Re-rankers' agreement is still an open question