1
Breaking the Resource Bottleneck for Multilingual Parsing
Rebecca Hwa, Philip Resnik and Amy Weinberg
University of Maryland
2
The Treebank Bottleneck
High-quality parsers need training examples with hand-annotated syntactic information
Annotation is labor intensive and time consuming
There is no sizable treebank for most languages other than English
[[S [NP-SBJ Ford Motor Co. ] [VP acquired [NP [NP 5 % ] [PP of [NP [NP the shares] [PP-LOC in [NP Jaguar PLC]]]]]]]. ]
3
State of the Art Parsing
Language                        Treebank            Size                         Parser Performance
English                         Penn Treebank       1M words, 40k sentences      ~90%
Chinese                         Chinese Treebank    100K words, 4k sentences     ~75%
Others (e.g., Hindi, Arabic)    ???
4
Research Questions
How can we induce a non-English language treebank quickly and automatically?
–Bootstrap from available English resources
–Project syntactic dependency relationships across bilingual sentences
How good is the resulting treebank?
–Can we use it to train a new parser?
–How can we improve its quality?
5
Roadmap
Overview of the framework
–Direct projection algorithm
Problematic cases
–Post projection transformation
Remaining challenges
–Filtering
Experiment
–Direct evaluation of the projected trees
–Evaluation of a Chinese parser trained on the induced treebank
Future Work
6
Overview of Our Framework
[Diagram: a bilingual corpus (English, Chinese) is processed by an English dependency parser and a word alignment model; Projection, Transformation and Filtering produce a projected Chinese dependency treebank; the treebank is used to train a dependency parser, which outputs dependency trees for unseen Chinese sentences]
7
Necessary Resources: 1. Bilingual Sentences
English: The Chinese side expressed satisfaction regarding this subject
Chinese: 中国 方面 对 此 表示 满意
8
Necessary Resources: 2. English (Dependency) Parser
English: The Chinese side expressed satisfaction regarding this subject
Chinese: 中国 方面 对 此 表示 满意
[Figure: English dependency tree; link labels: subj, obj, adj, det, mod]
9
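Necessary Resources: 3. Word Alignment
English: The Chinese side expressed satisfaction regarding this subject
Chinese: 中国 方面 对 此 表示 满意
[Figure: word-alignment links between the English and Chinese words; English link labels: subj, obj, adj, det, mod]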
10
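Projected Chinese Dependency Tree
English: The Chinese side expressed satisfaction regarding this subject
Chinese: 中国 方面 对 此 表示 满意
[Figure: English link labels: subj, obj, adj, det, mod; projected Chinese link labels: obj, subj, adj, mod]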
11
Direct Projection Algorithm
If there is a syntactic relationship between two English words, then the same syntactic relationship also exists between their corresponding Chinese words
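The rule above can be sketched in Python for the simplest, one-to-one case. This is a minimal illustration, not the authors' implementation: the function name `project_dependencies`, the (head, dependent, label) triple layout, and the dict-based alignment are all assumptions made for the example. Dependencies whose head or dependent has no aligned Chinese word are simply skipped here.

```python
from typing import Dict, List, Tuple

# (head index, dependent index, label); this layout is an assumption
Dependency = Tuple[int, int, str]


def project_dependencies(
    english_deps: List[Dependency],
    alignment: Dict[int, int],
) -> List[Dependency]:
    """Copy each English dependency onto the aligned Chinese words.

    Only one-to-one alignments are handled: a dependency is projected
    when both its head and its dependent have an aligned Chinese word.
    """
    projected = []
    for head, dep, label in english_deps:
        if head in alignment and dep in alignment:
            projected.append((alignment[head], alignment[dep], label))
    return projected


# Hypothetical toy input: three English words, word 1 heads words 0 and 2,
# and the Chinese side has the last two words in the opposite order.
english_deps = [(1, 0, "subj"), (1, 2, "obj")]
alignment = {0: 0, 1: 2, 2: 1}
print(project_dependencies(english_deps, alignment))
```

The labels carry over unchanged; only the word indices are remapped through the alignment.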
12
Problematic Case: Unaligned English
[Figure: English: regarding this subject (det, mod); Chinese: 对 此]
13
Problematic Case: Unaligned English
[Figure: English: regarding this subject (det, mod); Chinese: 对 此, with an added empty node *e* (det, mod)]
14
Problematic Case: many-to-1
[Figure: English: regarding this subject (det, mod); Chinese: 对 此]
15
Problematic Case: many-to-1
[Figure: English: regarding this subject (det, mod); Chinese: 对 此]
16
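Problematic Case: Unaligned Chinese
[Figure: English: The Chinese expressed; Chinese: 中国 方面 表示; labels: subj, det; empty node *e*]
17
Problematic Case: Unaligned Chinese
[Figure: English: The Chinese expressed; Chinese: 中国 方面 表示; labels: subj, subj, det; empty node *e*]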
18
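Problematic Case: 1-to-many
[Figure: English: The Chinese expressed; Chinese: 中国 方面 表示; labels: subj, det; empty node *e*]
19
Problematic Case: 1-to-many
[Figure: English: The Chinese expressed; Chinese: 中国 方面 表示; labels: subj, subj, det, mac; added nodes *M* and *e*]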
20
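Output of the Direct Projection Algorithm
English: The Chinese expressed satisfaction regarding this subject
Chinese: 中国 方面 对 此 表示 满意
[Figure: English link labels: subj, obj, det, mod; projected Chinese link labels: obj, subj, mod, det, mac; added nodes *M* and *e*]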
21
Post Projection Transformation
Handles One-to-Many mapping
–Select head based on (projected) part-of-speech categories
Handles some Unaligned-Chinese cases
–Only addressing closed-class words: function words (e.g., aspectual, measure words) and easily enumerable lexical categories (e.g., $, RMB, yen)
Removes empty nodes introduced by the Unaligned-English cases by promoting their head child
22
Remaining Challenges
Handling divergences
Incorporating unaligned foreign words into the projected tree
Removing crossing dependencies
[Figure nodes: A, B over a, b; C, D over d, c]
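To make the crossing-dependency problem concrete, here is a small Python sketch that counts pairs of links whose spans cross in a projected tree. The function name `count_crossing_links` and the (head position, dependent position) representation are assumptions for illustration; the slides do not specify how crossings are detected.

```python
from itertools import combinations
from typing import List, Tuple


def count_crossing_links(links: List[Tuple[int, int]]) -> int:
    """Count pairs of dependency links whose spans cross.

    Each link is (head position, dependent position) in the sentence.
    Two links cross when exactly one endpoint of one link lies strictly
    inside the span of the other.
    """
    # Normalize every link to a (left, right) span.
    spans = [tuple(sorted(link)) for link in links]
    crossings = 0
    for (a1, b1), (a2, b2) in combinations(spans, 2):
        if a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1:
            crossings += 1
    return crossings
```

Nested links such as (0, 3) and (1, 2) do not count as crossing, and neither do links that only share an endpoint.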
23
Filtering
Projected treebank is noisy
–Mistakes introduced by the projection algorithm
–Mistakes introduced by component errors
Use aggressive filtering techniques to remove the worst projected trees
–Filter out a sentence pair if many English words are unaligned
–Filter out a sentence pair if many Chinese words are aligned to the same English word
–Filter out a sentence pair if many of the projected links cause crossing dependencies
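The three filters above might be combined roughly as follows. This is a hedged sketch: the slides say only "many" and give no cutoffs, so the three threshold values, the `SentencePair` record, and the `keep_pair` function are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SentencePair:
    """Hypothetical record of what the filters need to know about a pair."""
    num_english_words: int
    alignment: List[Tuple[int, int]]  # (english index, chinese index) links
    num_crossing_links: int           # projected links involved in crossings
    num_projected_links: int


def keep_pair(
    pair: SentencePair,
    max_unaligned_ratio: float = 0.3,  # hypothetical threshold
    max_fanout: int = 3,               # hypothetical threshold
    max_crossing_ratio: float = 0.2,   # hypothetical threshold
) -> bool:
    """Return False if any of the three filters rejects the sentence pair."""
    # Filter 1: too many English words without any alignment link.
    aligned_english = {e for e, _ in pair.alignment}
    unaligned = pair.num_english_words - len(aligned_english)
    if unaligned > max_unaligned_ratio * pair.num_english_words:
        return False
    # Filter 2: too many Chinese words aligned to one English word.
    fanout = Counter(e for e, _ in pair.alignment)
    if fanout and max(fanout.values()) > max_fanout:
        return False
    # Filter 3: too many projected links cause crossing dependencies.
    if pair.num_crossing_links > max_crossing_ratio * pair.num_projected_links:
        return False
    return True
```

In practice the thresholds would be tuned to trade treebank size against noise; tighter values keep fewer but cleaner trees.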
24
Experiments
Direct evaluation of the projection framework
–Compare the (pre-filtered) projected trees against a human-annotated gold standard
Evaluation of the projected treebank
–Use the (post-filtered) treebank to train a Chinese parser
–Test the parser on unseen sentences and compare the output to a human-annotated gold standard
25
Direct Evaluation
Bilingual data: 88 Chinese Treebank sentences with their English translations
Apply projection and transformation under idealized conditions
–Given human-corrected English parse trees and hand-drawn word-alignments
Apply projection and transformation under realistic conditions
–English parse trees generated from Collins parser (trained on Penn Treebank)
–Word-alignments generated from IBM MT Model (trained on ~56K Hong Kong News bilingual sentences)
26
Direct Evaluation Results
Condition                                  Accuracy*
Ideal                                      67%
English parses from the Collins parser     62%
Word-alignments from the IBM MT Model      39%
*Accuracy = f-score based on unlabeled precision & recall
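The accuracy measure in the footnote can be sketched as below. The slide does not say how precision and recall are weighted, so this example assumes the balanced F-score (harmonic mean); the function name `unlabeled_f_score` and the set-of-links representation are also assumptions.

```python
from typing import Set, Tuple

# (head index, dependent index); dependency labels are ignored
Link = Tuple[int, int]


def unlabeled_f_score(predicted: Set[Link], gold: Set[Link]) -> float:
    """Balanced F-score over unlabeled dependency links."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)  # share of predicted links that are right
    recall = correct / len(gold)          # share of gold links that were found
    return 2 * precision * recall / (precision + recall)
```

Precision and recall can differ here because a projected tree may leave some words unattached, so the number of predicted links need not equal the number of gold links.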
27
Evaluating Trained Parser
Bilingual data: 56K sentence pairs from the Hong Kong News parallel corpus
Apply the DPA (using the Collins Parser and IBM MT Model) to create a projected Chinese treebank
Filter out badly-aligned sentence pairs to reduce noise
Train a Chinese parser with the (filtered) projected treebank
Test the Chinese parser on an unseen test set (88 Chinese Treebank sentences)
28
Parser Evaluation Results
Method                        Training Corpus      Corpus Size    Parser Accuracy
Modify Prev (baseline)        -                    -              13.5
Modify Next (baseline)        -                    -              35.7
Stat. Parser                  HKNews (Filtered)    5284           42.3
Stat. Parser (upper bound)    Chinese Treebank     3870           75.6
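The two baselines are not defined on the slide. A plausible reading, assumed here, is that "Modify Next" attaches every word to the word that follows it and "Modify Prev" attaches every word to the word before it. Under that assumption they need no training data and can be sketched in a few lines; the function names are illustrative.

```python
from typing import List, Tuple

Link = Tuple[int, int]  # (head index, dependent index)


def modify_next(num_words: int) -> List[Link]:
    """Assumed baseline: each word depends on the next word.

    The last word has no following word and is left as the root.
    """
    return [(i + 1, i) for i in range(num_words - 1)]


def modify_prev(num_words: int) -> List[Link]:
    """Assumed baseline: each word depends on the previous word.

    The first word has no preceding word and is left as the root.
    """
    return [(i - 1, i) for i in range(1, num_words)]
```

Baselines of this kind give a floor for the accuracy column: a trained parser is only useful if it beats the better of the two.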
29
Conclusion
We have presented a framework for acquiring Chinese dependency treebanks by bootstrapping from existing linguistic resources
Although the projected trees may have an accuracy rate of nearly 70% in principle, reducing noise caused by word-alignment errors is still a major challenge
A parser trained on the induced treebank can outperform some baselines
30
Future Work
Obtain a larger parallel corpus
Reduce error rates of the word-alignment models
Develop more sophisticated techniques to filter out noise in the induced treebank
Improve the projection algorithm to handle unaligned words and inconsistent trees
31
Reserve slides
32
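DPA Case 1: One-to-One
[Figure nodes: A, B; a, b]
33
DPA Case 2: Many-to-One
[Figure nodes: A1, A2, A3, B, C; a, b, c]
34
DPA Case 3: One-to-Many
[Figure nodes: A, B; a1, a2, a3, *a*, b]
35
DPA Case 4: Many-to-Many
[Figure nodes: A1, A2, A3, B, C; a1, a2, *a*, b, c]
36
DPA Case 5: Unaligned English Word
[Figure nodes: A, B, C; a, c]
37
DPA Case 6: Unaligned Foreign Word
[Figure nodes: A, C; a, b, c]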