Breaking the Resource Bottleneck for Multilingual Parsing
Rebecca Hwa, Philip Resnik, and Amy Weinberg, University of Maryland

The Treebank Bottleneck
High-quality parsers need training examples with hand-annotated syntactic information
Annotation is labor-intensive and time-consuming
There is no sizable treebank for most languages other than English
Example annotation: [ [S [NP-SBJ Ford Motor Co.] [VP acquired [NP [NP 5 %] [PP of [NP [NP the shares] [PP-LOC in [NP Jaguar PLC]]]]]] .] ]

State of the Art Parsing

Language                       Treebank           Size                       Parser Performance
English                        Penn Treebank      1M words, 40k sentences    ~90%
Chinese                        Chinese Treebank   100K words, 4k sentences   ~75%
Others (e.g., Hindi, Arabic)   ???

Research Questions
How can we induce a non-English-language treebank quickly and automatically?
– Bootstrap from available English resources
– Project syntactic dependency relationships across bilingual sentences
How good is the resulting treebank?
– Can we use it to train a new parser?
– How can we improve its quality?

Roadmap
Overview of the framework
– Direct projection algorithm
Problematic cases
– Post-projection transformation
Remaining challenges
– Filtering
Experiments
– Direct evaluation of the projected trees
– Evaluation of a Chinese parser trained on the induced treebank
Future Work

Overview of Our Framework
[Diagram: a bilingual English-Chinese corpus is parsed on the English side by an English dependency parser and word-aligned by a word alignment model; Projection, Transformation, and Filtering then yield a projected Chinese dependency treebank, which is used to train a Chinese dependency parser that produces dependency trees for unseen Chinese sentences]
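The flow in the diagram can be summarized as a composition of stages. This is a purely illustrative sketch, not the authors' code: every component (parser, aligner, projector, transformer, filter) is an injected placeholder function, so only the plumbing is shown.

```python
def build_projected_treebank(corpus, parse_en, align, project, transform, keep):
    """Compose the framework's stages over a bilingual corpus.

    corpus:    iterable of (english_sentence, chinese_sentence) pairs
    parse_en:  English dependency parser (placeholder)
    align:     word alignment model (placeholder)
    project:   direct projection algorithm (placeholder)
    transform: post-projection transformation (placeholder)
    keep:      filtering predicate that rejects noisy pairs (placeholder)
    """
    treebank = []
    for en_sent, zh_sent in corpus:
        en_tree = parse_en(en_sent)          # English dependency parse
        links = align(en_sent, zh_sent)      # word alignments
        zh_tree = transform(project(en_tree, links, zh_sent))
        if keep(en_sent, zh_sent, links, zh_tree):   # aggressive filtering
            treebank.append((zh_sent, zh_tree))
    return treebank
```

The resulting treebank is what trains the Chinese parser in the final stage of the diagram.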

Necessary Resources: 1. Bilingual Sentences
[Example] English: "The Chinese side expressed satisfaction regarding this"; Chinese: 中国方面对此表示满意

Necessary Resources: 2. English (Dependency) Parser
[Diagram: the English sentence is parsed into a dependency tree with subj, obj, adj, det, and mod relations]

Necessary Resources: 3. Word Alignment
[Diagram: each English word is linked to its Chinese counterpart in the sentence pair]

Projected Chinese Dependency Tree
[Diagram: the subj, obj, adj, and mod relations from the English tree are carried over to the aligned Chinese words, yielding a Chinese dependency tree]

Direct Projection Algorithm
If there is a syntactic relationship between two English words, then the same syntactic relationship also exists between their corresponding Chinese words.

Problematic Case: Unaligned English
[Diagram: the fragment "regarding this" aligns to 对此, but one English word carrying det and mod arcs has no aligned Chinese word]

Problematic Case: Unaligned English
[Diagram: the unaligned English word is represented by an empty node *e* in the projected Chinese tree, keeping its det and mod arcs]

Problematic Case: many-to-1
[Diagram: the two English words "regarding" and "this" both align to the single Chinese token 对此]

Problematic Case: many-to-1
[Diagram: the relations of the multiple aligned English words collapse onto the single Chinese token 对此]

Problematic Case: Unaligned Chinese
[Diagram: the Chinese word 方面 has no English counterpart and is initially left out of the projected tree]

Problematic Case: Unaligned Chinese
[Diagram: the unaligned Chinese word 方面 is attached to the projected tree, here via a subj arc]

Problematic Case: 1-to-many
[Diagram: a single English word aligns to the two Chinese words 中国 and 方面]

Problematic Case: 1-to-many
[Diagram: a new node *M* is introduced to head the multiple aligned Chinese words, attaching them with mac arcs]

Output of the Direct Projection Algorithm
[Diagram: the full projected Chinese dependency tree for 中国方面对此表示满意, including the introduced *M* and *e* nodes with their mac, subj, obj, det, and mod arcs]

Post-Projection Transformation
Handles one-to-many mappings
– Select the head based on (projected) part-of-speech categories
Handles some unaligned-Chinese cases
– Only addresses closed-class words: functional words (e.g., aspectual markers, measure words) and easily enumerable lexical categories (e.g., $, RMB, yen)
Removes empty nodes introduced by the unaligned-English cases by promoting the head child
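The empty-node removal step can be sketched as follows. This is an illustrative reconstruction (the dict representation, node ids, and function name are assumptions): a designated head child is promoted into the empty node's place, and the empty node's remaining children re-attach to that child.

```python
def remove_empty_node(heads, empty, head_child):
    """Remove an empty node by promoting one designated head child.

    heads:      dict mapping node id -> head id (-1 for the root)
    empty:      id of the empty node *e* to remove
    head_child: the child of `empty` chosen for promotion
    """
    new_heads = {}
    for node, head in heads.items():
        if node == empty:
            continue                         # drop the empty node itself
        if node == head_child:
            new_heads[node] = heads[empty]   # promoted child inherits the head
        elif head == empty:
            new_heads[node] = head_child     # siblings attach to promoted child
        else:
            new_heads[node] = head           # everything else is unchanged
    return new_heads

# Toy tree: *e* (node 9) heads nodes 0 and 1, and itself depends on node 2.
heads = {0: 9, 1: 9, 9: 2, 2: -1}
print(remove_empty_node(heads, empty=9, head_child=1))  # {0: 1, 1: 2, 2: -1}
```

Which child counts as the head child would be decided by the part-of-speech heuristics mentioned above.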

Remaining Challenges
Handling divergences
Incorporating unaligned foreign words into the projected tree
Removing crossing dependencies
[Diagram: English links A–B and C–D project to Chinese links a–b and d–c that cross each other]

Filtering
The projected treebank is noisy
– Mistakes introduced by the projection algorithm
– Mistakes introduced by component errors
Use aggressive filtering techniques to remove the worst projected trees
– Filter out a sentence pair if many English words are unaligned
– Filter out a sentence pair if many Chinese words are aligned to the same English word
– Filter out a sentence pair if many of the projected links cause crossing dependencies
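The three heuristics can be sketched as follows. The thresholds are illustrative assumptions (the slide only says "many"), and the crossing check is the standard non-projectivity test over link spans; none of this is the authors' actual code.

```python
from collections import Counter

def has_crossing(heads):
    """True if any two dependency links cross (a non-projective tree).
    heads: dict mapping word index -> head index (-1 for the root)."""
    links = [tuple(sorted((dep, head))) for dep, head in heads.items()
             if head != -1]
    for a1, b1 in links:
        for a2, b2 in links:
            if a1 < a2 < b1 < b2:   # link 2 starts inside link 1, ends outside
                return True
    return False

def keep_pair(en_len, alignment, zh_heads,
              max_unaligned=0.3, max_fanout=3, allow_crossing=False):
    """Aggressive filtering: reject a sentence pair if too many English
    words are unaligned, too many Chinese words share one English word,
    or the projected tree contains crossing dependencies.
    alignment: list of (english_index, chinese_index) links."""
    aligned_en = {e for e, _ in alignment}
    if (en_len - len(aligned_en)) / en_len > max_unaligned:
        return False                           # too many unaligned English words
    fanout = Counter(e for e, _ in alignment)
    if fanout and max(fanout.values()) > max_fanout:
        return False                           # extreme many-to-one alignment
    if not allow_crossing and has_crossing(zh_heads):
        return False                           # crossing projected links
    return True
```

In the experiments, only pairs that survive all three checks would contribute trees to the training corpus.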

Experiments
Direct evaluation of the projection framework
– Compare the (pre-filtering) projected trees against a human-annotated gold standard
Evaluation of the projected treebank
– Use the (post-filtering) treebank to train a Chinese parser
– Test the parser on unseen sentences and compare the output to a human-annotated gold standard

Direct Evaluation
Bilingual data: 88 Chinese Treebank sentences with their English translations
Apply projection and transformation under idealized conditions
– Given human-corrected English parse trees and hand-drawn word alignments
Apply projection and transformation under realistic conditions
– English parse trees generated by the Collins parser (trained on the Penn Treebank)
– Word alignments generated by the IBM MT model (trained on ~56K Hong Kong News bilingual sentences)

Direct Evaluation Results

Condition                                 Accuracy*
Ideal                                     67%
English parses from the Collins parser    62%
Word alignments from the IBM MT model     39%

*Accuracy = f-score based on unlabeled precision & recall
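The footnoted metric can be made concrete with a short sketch: unlabeled precision and recall are computed over (dependent, head) link sets, and the reported accuracy is their harmonic mean (f-score). The dict representation is an assumption for illustration.

```python
def unlabeled_f1(gold_heads, pred_heads):
    """F-score over unlabeled dependency links, where each link is a
    (dependent, head) index pair and the root link is excluded.
    gold_heads / pred_heads: dict word index -> head index (-1 = root)."""
    gold = {(d, h) for d, h in gold_heads.items() if h != -1}
    pred = {(d, h) for d, h in pred_heads.items() if h != -1}
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one of two predicted links matches the gold tree.
gold = {0: 1, 1: -1, 2: 1}
pred = {0: 1, 1: -1, 2: 0}
print(unlabeled_f1(gold, pred))  # 0.5
```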

Evaluating the Trained Parser
Bilingual data: 56K sentence pairs from the Hong Kong News parallel corpus
Apply the DPA (using the Collins parser and the IBM MT model) to create a projected Chinese treebank
Filter out badly aligned sentence pairs to reduce noise
Train a Chinese parser on the (filtered) projected treebank
Test the Chinese parser on an unseen test set (88 Chinese Treebank sentences)

Parser Evaluation Results MethodTraining Corpus Corpus SizeParser Accuracy Modify Prev (baseline) Modify Next (baseline) Stat. ParserHKNews (Filtered) Stat. Parser (upper bound) Chinese Treebank

Conclusion
We have presented a framework for acquiring Chinese dependency treebanks by bootstrapping from existing linguistic resources
Although the projected trees can in principle reach an accuracy of nearly 70%, reducing the noise caused by word-alignment errors remains a major challenge
A parser trained on the induced treebank can outperform some baselines

Future Work
Obtain a larger parallel corpus
Reduce the error rates of the word-alignment models
Develop more sophisticated techniques to filter out noise in the induced treebank
Improve the projection algorithm to handle unaligned words and inconsistent trees

Reserve slides

DPA Case 1: One-to-One
[Diagram: the English link A–B projects directly to the Chinese link a–b]

DPA Case 2: Many-to-One
[Diagram: the English words A1, A2, and A3 all align to the single Chinese word a; B aligns to b and C to c]

DPA Case 3: One-to-Many
[Diagram: the single English word A aligns to the Chinese words a1, a2, and a3; a new node *a* is created to head them]

DPA Case 4: Many-to-Many
[Diagram: the English words A1, A2, and A3 align to the Chinese words a1 and a2, which are grouped under a new node *a*; B aligns to b and C to c]

DPA Case 5: Unaligned English Word
[Diagram: the English word B has no Chinese counterpart; an empty node stands in for it in the projected tree]

DPA Case 6: Unaligned Foreign Word
[Diagram: one Chinese word has no English counterpart and is initially left unattached in the projected tree]