In Search of a More Probable Parse: Experiments with DOP* and the Penn Chinese Treebank Aaron Meyers Linguistics 490 Winter 2009
Syntax 101 Given a sentence, produce a syntax tree (parse) Example: ‘Mary likes books’ Software which does this is known as a parser
Grammars Context-Free Grammar (CFG) ▫Simple rules describing potential configurations ▫From example: S → NP VP NP → Mary VP → V NP V → likes NP → books Problem: ambiguity (a grammar may license many parses for a single sentence)
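The slide's toy CFG can be run through a minimal CKY recognizer; this is an illustrative sketch (the grammar is hard-coded, and the names `lexical`, `binary`, and `cky` are my own), not part of the experiment's software.

```python
# Toy CKY recognizer for the slide's CFG.
lexical = {"Mary": {"NP"}, "likes": {"V"}, "books": {"NP"}}   # lexical rules
binary = {("NP", "VP"): "S", ("V", "NP"): "VP"}               # S -> NP VP, VP -> V NP

def cky(words):
    n = len(words)
    # chart[i][j]: set of categories covering words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] |= lexical.get(w, set())
    for span in range(2, n + 1):              # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # try every split point
                for a in chart[i][k]:
                    for b in chart[k][j]:
                        if (a, b) in binary:
                            chart[i][j].add(binary[(a, b)])
    return chart[0][n]

print(cky("Mary likes books".split()))  # {'S'}: the sentence parses
```

A sentence the grammar rejects, e.g. `cky("Mary likes".split())`, yields an empty set for the full span.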
Tree Substitution Grammar (TSG) Incorporates larger tree fragments Substitution operator (◦) combines fragments Context-free grammar is a trivial TSG [diagram: fragment ◦ fragment = complete parse tree]
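The substitution operator can be sketched in a few lines; this is an illustrative encoding of my own (trees as `(label, children)` tuples, childless nonterminal nodes as open substitution sites), not the representation used by dopdis.

```python
# Sketch of TSG substitution (the ◦ operator): leftmost open site whose
# label matches the fragment's root label gets replaced by the fragment.
frag_s = ("S", [("NP", []),
                ("VP", [("V", [("likes", [])]), ("NP", [])])])
frag_mary = ("NP", [("Mary", [])])
frag_books = ("NP", [("books", [])])

def substitute(tree, fragment):
    """Return (new_tree, done): fill leftmost open site matching fragment root."""
    label, children = tree
    if not children and label == fragment[0]:
        return fragment, True
    new_children, done = [], False
    for child in children:
        if done:
            new_children.append(child)        # only substitute once
        else:
            new_child, done = substitute(child, fragment)
            new_children.append(new_child)
    return ((label, new_children) if done else tree), done

def leaves(tree):
    label, children = tree
    return [label] if not children else [w for c in children for w in leaves(c)]

t, _ = substitute(frag_s, frag_mary)    # fills the subject NP site
t, _ = substitute(t, frag_books)        # fills the object NP site
print(leaves(t))  # ['Mary', 'likes', 'books']
```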
Treebanks Database of sentences and corresponding syntax trees ▫Trees are hand-annotated Penn Treebanks among most commonly used Grammars can be created automatically from a treebank (training) ▫Extract rules (CFG) or fragments (TSG) directly from trees
Learning Grammar from Treebank Many rules or fragments will occur repeatedly ▫Incorporate frequencies into grammar ▫Probabilistic Context-Free Grammar (PCFG), Stochastic Tree Substitution Grammar (STSG) Data-Oriented Parsing (DOP) model ▫DOP1 (1992): Type of STSG ▫Describes how to extract fragments from a treebank for inclusion in grammar (model) ▫Generally limit fragments to a certain max depth
Penn Chinese Treebank Latest version 6.0 (2007) ▫Xinhua newswire (7339 sentences) ▫Sinorama news magazine (7106 sentences) ▫Hong Kong news (519 sentences) ▫ACE Chinese broadcast news (9246 sentences)
Penn Chinese Treebank and DOP Previous experiments (2004) with Penn Chinese Treebank and DOP1 ▫1473 trees selected from Xinhua newswire ▫Fragment depth limited to three levels or less
An improved DOP model: DOP* Challenges with DOP1 model ▫Computationally inefficient (exponential increase in number of fragments extracted) ▫Statistically inconsistent A new estimator: DOP* (2005) ▫Limits fragment extraction by estimating optimal fragments using subsets of training corpus Linear rather than exponential increase in fragments ▫Statistically consistent (accuracy increases as size of training corpus increases)
Research Question & Hypothesis Will a DOP* parser applied to the Penn Chinese Treebank show significant improvement in accuracy for a model incorporating fragments up to depth five compared to a model incorporating only fragments up to depth three? Hypothesis: Yes, accuracy will significantly increase ▫Deeper fragments allow parser to capture non-local dependencies in syntax usage/preference
Selecting training and testing data Subset of Xinhua newswire (2402 sentences) ▫Includes only IP trees (no headlines or fragments) Excluded sentences of average or greater length Remaining 1402 sentences divided three times into random training/test splits ▫Each test split has 140 sentences ▫Other 1262 sentences used for training
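The splitting procedure above can be sketched as follows; the seeds and sentence ids are illustrative placeholders, not the ones used in the experiment.

```python
import random

# Sketch: shuffle 1402 sentence ids into a disjoint 140-sentence test set
# and a 1262-sentence training set, repeated for three random splits.
def make_split(ids, test_size, seed):
    rng = random.Random(seed)                 # seeded for reproducibility
    shuffled = ids[:]
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]   # (train, test)

ids = list(range(1402))
splits = [make_split(ids, 140, seed) for seed in (1, 2, 3)]
train, test = splits[0]
print(len(train), len(test))  # 1262 140
```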
Preparing the trees Penn Treebank converted to dopdis format Chinese characters converted to alphanumeric codes Standard tree normalizations ▫Removed empty nodes ▫Removed A over A and X over A unaries ▫Stripped functional tags Original: (IP (NP-PN-SBJ (NR 上海 ) (NR 浦东 )) (VP … Converted: (ip,[(np,[(nr,[(hmeiahodpp_,[])]),(nr,[(hodoohmejc_,[])])]),(vp, …
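The conversion step can be sketched as below. The s-expression parsing and functional-tag stripping follow the slide; the terminal encoding shown (hex codepoints) is an illustrative stand-in, as the actual alphanumeric codes used in the experiment differ.

```python
import re

# Sketch: Penn-style bracketed tree -> dopdis-style "(label,[...])" format.
def parse_sexpr(s):
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0
    def walk():
        nonlocal pos
        assert tokens[pos] == "("; pos += 1
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(walk())
            else:
                children.append((tokens[pos], [])); pos += 1   # terminal leaf
        pos += 1
        return (label, children)
    return walk()

def encode(word):
    """Hypothetical terminal encoding: hex codepoints for non-ASCII words."""
    if word.isascii():
        return word
    return "".join(f"x{ord(ch):04x}" for ch in word) + "_"

def to_dopdis(tree):
    label, children = tree
    name = encode(label.split("-")[0].lower())   # strip functional tags
    if not children:
        return f"({name},[])"
    return f"({name},[{','.join(to_dopdis(c) for c in children)}])"

print(to_dopdis(parse_sexpr("(IP (NP-SBJ (NR 上海)) (VP (VV 开发)))")))
```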
Training & testing the parser DOP* parser is created by training a model with the training trees The parser is then tested by processing the test sentences ▫Parse trees returned by parser are compared with original parse trees from treebank Standard evaluation metrics computed: labeled recall, labeled precision, and f-score (harmonic mean of precision and recall) Repeated for each depth level and test/training split
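The evaluation metrics can be sketched as PARSEVAL-style bracket scoring; this is a simplified illustration in which every nonterminal node counts as a labeled bracket (real evaluations typically exclude preterminals and punctuation), with all names my own.

```python
from collections import Counter

# Sketch: score a test tree against a gold tree by matched labeled spans.
def spans(tree, start=0):
    """Return (end, [(label, start, end), ...]) for all nonterminal nodes."""
    label, children = tree
    if not children:                   # terminal occupies one word position
        return start + 1, []
    end, out = start, []
    for child in children:
        end, sub = spans(child, end)
        out += sub
    out.append((label, start, end))
    return end, out

def score(gold, test):
    g, t = spans(gold)[1], spans(test)[1]
    match = sum((Counter(g) & Counter(t)).values())   # matched brackets
    recall, precision = match / len(g), match / len(t)
    f = 2 * precision * recall / (precision + recall)
    return recall, precision, f

gold = ("S", [("NP", [("Mary", [])]),
              ("VP", [("V", [("likes", [])]), ("NP", [("books", [])])])])
flat = ("S", [("NP", [("Mary", [])]),             # test parse missing the VP
              ("V", [("likes", [])]), ("NP", [("books", [])])])
print(score(gold, flat))  # recall 0.8, precision 1.0, f-score ~0.889
```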
Parsing Results
  Depth | Labeled Recall | Labeled Precision | F-score
  …     | …%             | 58.14%            | 58.57%
  …     | …%             | 67.42%            | 69.47%
  …     | …%             | 67.80%            | 69.96%
Other interesting statistics
  Depth | # Fragments Extracted | Total Training Time (hours) | Total Testing Time (hours) | Seconds / Sentence
  …     | 16,…                  | …                           | …                          | …
  …     | …                     | …                           | …                          | …
  …     | …                     | …                           | …                          | …
▫Training time at depth-3 and depth-5 is similar, even though depth-5 has a much higher fragment count
▫Testing time at depth-5, though, is ten times higher than testing time at depth-3!
Conclusion Parsing results for the other two test/training splits still to be obtained; if they are similar: Increasing fragment extraction depth from three to five does not significantly improve accuracy for a DOP* parser over the Penn Chinese Treebank ▫Statistical significance still to be determined ▫Any practical benefit is negated by the increased parsing time
Future Work Increase size of training corpus ▫DOP* estimation consistency: accuracy should increase as larger training corpus used Perform experiment with DOP1 model ▫Accuracy obtained with DOP* lower than previous experiments using DOP1 (Hearne & Way 2004) Qualitative analysis ▫What constructions are captured more accurately?
Future Work Perform experiments with other corpora ▫Other sections of Chinese Treebank ▫Other treebanks: Penn Arabic Treebank, … Increase capacity and stability of dopdis system ▫Encountered various failures on larger runs, crashing after as long as 36 hours ▫Efficiency could be increased by larger memory support (64-bit architecture) and by storing and indexing fragments in a relational database system