Portability, Parallelism and Efficiency in Parsing
Dan Bikel
University of Pennsylvania
March 11th, 2002
Slide 1
Parsing: Where are we now?
Pounding away at Penn Treebank, §23
  –Collins (1999): LR 88.0, LP 88.3
  –Charniak (2000): LR 89.6, LP 89.5
  –Collins (2000): LR 89.6, LP 89.9
Henderson & Brill (1999) on §22: LR 90.1, LP 92.4
Room to grow: new domains, better performance
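For readers unfamiliar with the LR/LP figures on this slide: labeled recall and labeled precision are bracket-matching scores. The following is a minimal sketch, assuming each tree is given as a list of (label, start, end) brackets. The function name and the toy brackets are illustrative and do not come from the talk, and real evaluation tools apply extra conventions (for example, for punctuation) that are omitted here.

```python
from collections import Counter

def labeled_precision_recall(gold_brackets, test_brackets):
    """Return (LP, LR) as percentages.

    Each argument is an iterable of (label, start, end) tuples.
    Counters are used so that duplicate brackets are matched at most
    as many times as they occur on both sides.
    """
    gold = Counter(gold_brackets)
    test = Counter(test_brackets)
    matched = sum((gold & test).values())  # multiset intersection
    lp = 100.0 * matched / sum(test.values()) if test else 0.0
    lr = 100.0 * matched / sum(gold.values()) if gold else 0.0
    return lp, lr

# Toy example (hypothetical brackets, not real Treebank data).
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
test = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5), ("NP", 4, 5)]
lp, lr = labeled_precision_recall(gold, test)
print(f"LP {lp:.1f}, LR {lr:.1f}")
```

Three of the five proposed brackets match three of the four gold brackets, so precision and recall differ even on this small example.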
Slide 2
The Right Architecture for Parallel Parsing
[Diagram; component labels: CKY Client 1, CKY Client 2, CKY Client N; Switchboard; Object server; DecoderServer 1, DecoderServer N, each with a ModelCollection; Language package]
Slide 3
Architecture for Parallel Parsing II
Highly parallel, multi-threaded
  –New cluster about to come on-line; poised to take advantage
Fully fault-tolerant
Significant flexibility: layers of abstraction
Optimized for speed
Highly portable for new domains, including new languages
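The combination of parallelism and fault tolerance described on this slide can be illustrated with a small work-distribution sketch. It assumes, purely for illustration, that a central object hands sentences to parsing clients and re-queues a sentence when a client fails on it. The class `Switchboard` below is a hypothetical stand-in that borrows a name from the architecture diagram; its methods, the retry limit, and `toy_parse` are inventions for this example, not the engine's actual API.

```python
import queue
import threading

class Switchboard:
    """Hypothetical coordinator: hands out sentences, collects parses."""

    def __init__(self, sentences, max_attempts=3):
        self._pending = queue.Queue()
        for idx, sent in enumerate(sentences):
            self._pending.put((idx, sent, 1))  # (index, sentence, attempt number)
        self._results = {}
        self._lock = threading.Lock()
        self._max_attempts = max_attempts

    def next_sentence(self):
        """Return the next job, or None when no work is queued."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None

    def report_success(self, idx, parse):
        with self._lock:
            self._results[idx] = parse

    def report_failure(self, idx, sent, attempt):
        """Re-queue a failed sentence until the retry limit is reached."""
        if attempt < self._max_attempts:
            self._pending.put((idx, sent, attempt + 1))
        else:
            with self._lock:
                self._results[idx] = None  # give up on this sentence

    def results(self, n):
        with self._lock:
            return [self._results.get(i) for i in range(n)]

def client_loop(switchboard, parse_fn):
    """One parsing client: pull work until the queue is empty."""
    while True:
        job = switchboard.next_sentence()
        if job is None:
            return
        idx, sent, attempt = job
        try:
            parse = parse_fn(sent, attempt)
        except Exception:
            switchboard.report_failure(idx, sent, attempt)
        else:
            switchboard.report_success(idx, parse)

def run(sentences, parse_fn, n_clients=4):
    sb = Switchboard(sentences)
    threads = [threading.Thread(target=client_loop, args=(sb, parse_fn))
               for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sb.results(len(sentences))

def toy_parse(sentence, attempt):
    """Stand-in for a decoder; fails once on 'flaky' input to exercise retries."""
    if "flaky" in sentence and attempt == 1:
        raise RuntimeError("simulated decoder failure")
    return "(S " + sentence + ")"

results = run(["a b", "flaky c", "d e f"], toy_parse, n_clients=2)
print(results)
```

A client that reports a failure keeps looping, so a re-queued sentence is always picked up by some client before all of them exit. The sketch shows only the coordination pattern; a real deployment across a cluster would use separate processes or machines rather than threads.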
Slide 4
Layer of Abstraction: Probability Structure
[Diagram; node labels: P(t_h, w_h), H(t_h, w_h), M_i(t_i, w_i), M_{i-1}(t_{i-1}, w_{i-1}); dependency labels: Collins, BBN]
Slide 5
Plug-’n’-play Probability Models
New engine capable of implementing a wide variety of models, including Collins, BBN
Have meticulously replicated Collins’ model and performance
  –Cleaned up probabilistic “oddities”
  –Code is thoroughly documented
  –Will release to public
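One way to picture a pluggable probability structure is to separate the estimator from the function that chooses the conditioning context. The sketch below assumes two hypothetical structures for generating a modifier label: one conditions on the parent and head labels only, the other also on the previously generated modifier. These are simplified illustrations, not the published Collins or BBN models, and every name here is invented for the example. Estimates are plain relative frequencies with no smoothing.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class ModifierEvent:
    """One observed modifier-generation event (hypothetical representation)."""
    parent: str
    head: str
    prev_modifier: str
    modifier: str

def head_only_context(e):
    """Condition the modifier on the parent and head labels only."""
    return (e.parent, e.head)

def prev_modifier_context(e):
    """Additionally condition on the previously generated modifier."""
    return (e.parent, e.head, e.prev_modifier)

class ModifierModel:
    """Relative-frequency estimator with a pluggable conditioning context."""

    def __init__(self, context_fn):
        self.context_fn = context_fn
        self.joint = Counter()    # counts of (context, modifier)
        self.context = Counter()  # counts of context alone

    def train(self, events):
        for e in events:
            ctx = self.context_fn(e)
            self.joint[(ctx, e.modifier)] += 1
            self.context[ctx] += 1

    def prob(self, e):
        ctx = self.context_fn(e)
        total = self.context[ctx]
        return self.joint[(ctx, e.modifier)] / total if total else 0.0

# Toy training events (hypothetical, not drawn from a treebank).
events = [
    ModifierEvent("VP", "VBD", "+START+", "NP"),
    ModifierEvent("VP", "VBD", "NP", "PP"),
    ModifierEvent("VP", "VBD", "NP", "+STOP+"),
    ModifierEvent("VP", "VBD", "+START+", "+STOP+"),
]
query = ModifierEvent("VP", "VBD", "NP", "PP")

for context_fn in (head_only_context, prev_modifier_context):
    model = ModifierModel(context_fn)
    model.train(events)
    print(context_fn.__name__, model.prob(query))
```

Swapping the context function changes the model's independence assumptions without touching the estimator or the training loop.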
Slide 6
Fast Portability to New Data Sets
Parsers operate over augmented tree space, T⁺
Generative models define joint probability P(S, T, T⁺)
Chiang & Bikel (2002, in submission) provide
  –New, portable syntax for augmenting tree nodes
  –Method for reestimating parser models in the augmented space such that P(S, T) is maximized
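The relationship between the joint model and the quantity being maximized can be written out explicitly. This is a sketch under the assumption that every augmented tree determines exactly one unaugmented tree; the projection symbol \pi, the parameter symbol \theta, and the treebank symbol \mathcal{D} are notation introduced here, not taken from the slides.

```latex
% Marginal probability of a sentence/tree pair: sum over all
% augmented trees that project onto the observed tree T.
P(S, T) \;=\; \sum_{T^{+} \,:\, \pi(T^{+}) = T} P(S, T, T^{+})

% Reestimation objective over a treebank \mathcal{D} of (S, T) pairs,
% with the augmentations T^{+} treated as hidden.
\hat{\theta} \;=\; \arg\max_{\theta} \prod_{(S, T) \in \mathcal{D}}
    \; \sum_{T^{+} \,:\, \pi(T^{+}) = T} P_{\theta}(S, T, T^{+})
```

Because the augmentations are unobserved in the treebank, the sum inside the product is what distinguishes this objective from ordinary supervised estimation on T⁺.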
Slide 7
Rapid Portability to New Languages with High Accuracy
Bikel & Chiang (2000) described porting two parsing models developed for English to Chinese
  –BBN: LR 69.0, LP 74.8 (≤ 40 words)
  –Chiang: LR 76.8, LP 77.8 (≤ 40 words)
New engine designed from ground up for multi-lingual processing: language package
  –Original design goal for new parsing engine: develop new language packages in 1–2 weeks
Developed Chinese language package for new engine in one and a half days
Compared to other known Chinese parsers on the CTB, recall is equivalent and precision is significantly superior
  –LR 77.0, LP 81.6 (≤ 40 words)
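The idea of a language package can be sketched as a small bundle of language- and treebank-specific data behind a shared interface, so that the engine itself stays language-neutral. The slides do not say what a language package contains; the sketch below assumes it holds head-finding rules and a punctuation tag set. The class names, method names, and rule tables are hypothetical toys, far smaller than anything a real parser would use.

```python
class LanguagePackage:
    """Hypothetical base class: language-specific data behind one interface."""

    # parent label -> (scan direction, labels in priority order)
    head_rules = {}
    punctuation_tags = frozenset()

    def is_punctuation(self, tag):
        return tag in self.punctuation_tags

    def head_child_index(self, parent, children):
        """Pick the head among a non-empty list of child labels.

        For each label in priority order, scan the children in the rule's
        direction and return the first match; otherwise fall back to the
        first child in scan order.
        """
        direction, priorities = self.head_rules.get(parent, ("left", []))
        indices = list(range(len(children)))
        if direction == "right":
            indices.reverse()
        for label in priorities:
            for i in indices:
                if children[i] == label:
                    return i
        return indices[0]

class ToyEnglish(LanguagePackage):
    # Toy rules over Penn Treebank-style labels (illustrative only).
    head_rules = {
        "VP": ("left", ["VBD", "VB", "VP"]),
        "NP": ("right", ["NN", "NNS", "NP"]),
    }
    punctuation_tags = frozenset({",", ".", ":", "``", "''"})

class ToyChinese(LanguagePackage):
    # Toy rules over Chinese Treebank-style labels (illustrative only).
    head_rules = {
        "VP": ("left", ["VV", "VP"]),
        "NP": ("right", ["NN", "NR", "NP"]),
    }
    punctuation_tags = frozenset({"PU"})

for package in (ToyEnglish(), ToyChinese()):
    name = type(package).__name__
    print(name, package.head_child_index("NP", ["NP", "NN"]))
```

Under this assumption, porting to a new language means writing one more subclass with its own tables, which is consistent with the short development times reported on the slide.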
Slide 8
What’s in store…
Incorporating richer lexical information into parsing/language processing, specifically…
Incorporating word sense information into a parsing model, building on both
  –previous work extending BBN parsing model to include word sense
  –recent work with David Chiang, viewing word sense as yet another component of “hidden” data in a Treebank
Slide 9 FIN