Morphosyntactic correspondence: a progress report on bitext parsing Alexander Fraser, Renjing Wang, Hinrich Schütze Institute for NLP University of Stuttgart.


Morphosyntactic correspondence: a progress report on bitext parsing Alexander Fraser, Renjing Wang, Hinrich Schütze Institute for NLP University of Stuttgart INFuture2009: Digital Resources and Knowledge Sharing, Nov 4th 2009, Zagreb

Outline  The Institute for Natural Language Processing at the University of Stuttgart  Bitext parsing  Using morphosyntactic correspondence

IfNLP Stuttgart  The Institute for Natural Language Processing (IfNLP/IMS) at the University of Stuttgart  Dogil (Phonetics and Speech)  Large department  Kuhn/Rohrer (LFG syntax and semantics)  Cahill (LFG generation)  Heid (Terminology extraction, morphology)  Padó (Semantics, lexical semantics)  Schütze (Statistical NLP and Information Retrieval)  More on next slide

IfNLP – Statistical NLP Group  Hinrich Schütze (director since 2004)  Bernd Möbius – Speech recognition and synthesis  Helmut Schmid – Parsing, morphology (known for TreeTagger, BitPar)  Sabine Schulte im Walde – NLP and cognitive modeling of lexical semantics  Michael Walsh – Speech, exemplar theoretic syntax  Alex Fraser – Statistical machine translation, parsing, cross-lingual information retrieval  General department areas of research  New statistical NLP models and methods  Semi-supervised and active learning  Cognitive/linguistic representation models  Applied to: NLP, retrieval, MT, speech, e-learning, …

IfNLP - Partnerships  Partnerships  Stuttgart: large projects with linguistics, computer science, EE signal processing, high performance computing  Germany: Darmstadt, Tübingen, DSPIN/CLARIN consortium (UIMA-based German processing)  International: large French-led European project (6 universities, 4 industrial partners), collaborations on South African languages, Edinburgh, CLARIN  Industrial: various projects with publishers (many focusing on terminology)

Outline  The Institute for Natural Language Processing at the University of Stuttgart  Bitext parsing  Using morphosyntactic correspondence

What is bitext parsing?  Bitext: a text and its translation  Sentences and their translations are aligned  Sometimes called a parallel corpus  Syntactic parsing: automatically find the syntactic structure of a sentence (syntactic parse)  Bitext parsing: automatically find the syntactic structure of the parallel sentences in a bitext  We will use the complementarity of the syntax of the two languages to obtain improved parses

Motivation for bitext parsing  Many advances in syntactic parsing come from better modeling  But the overall bottleneck is the size of the treebank  Our research asks a different question:  Where can we (cheaply) obtain additional information, which helps to supplement the treebank?  A new information source for resolving ambiguity is a translation  The human translator understands the sentence and disambiguates for us!  Our research goal was to build large databases of improved parses to help establish preferences for difficult phenomena like PP-attachment

Clause attachment ambiguity  Parse 1: high attachment (wrong)  Parse 2: low attachment (correct)

Not ambiguous in German  Number agreement disambiguates  FRAU (woman) and HATTE (had) agree  Unambiguous low attachment

Parse reranking of bitext  Goal: improve English parsing accuracy  Parse English sentence, obtain list of 100 best parse candidates  Parse German sentence, obtain single best parse  Determine the correspondence of German to English words using a word alignment  Calculate syntactic divergence of each English parse candidate and the projection of the German parse  Choose probable English parse candidate with low syntactic divergence
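The reranking loop above can be sketched as follows; the candidate parses, scores, and divergence weight are invented for illustration (the real system uses BitPar 100-best lists and the log-linear model described on the next slide):

```python
# Toy sketch of the reranking step: choose the English candidate parse that
# is probable under the baseline parser AND has low syntactic divergence from
# the projected German parse. All names and numbers are illustrative.

def rerank(candidates, divergence_weight=1.0):
    """candidates: list of (parse, baseline_log_prob, divergence)."""
    def score(cand):
        _parse, log_prob, divergence = cand
        # Higher baseline probability is better; higher divergence is worse.
        return log_prob - divergence_weight * divergence
    return max(candidates, key=score)[0]

# A hypothetical 2-best list: the likelier parse diverges from the German
# parse, so the agreeing parse wins after reranking.
candidates = [
    ("(S high-attachment)", -10.2, 3.0),
    ("(S low-attachment)", -10.5, 0.5),
]
best_parse = rerank(candidates)
```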

Measuring syntactic divergence P(e | g) = exp(∑_m λ_m h_m(g, e, a)) / ∑_{e′} exp(∑_m λ_m h_m(g, e′, a))  Define features to capture different (overlapping) aspects of syntactic divergence. Functions of:  Candidate English parse e  German parse g  Word alignment a  Combine in log-linear model  Discriminatively train λ parameters to maximize parsing accuracy on a training set (minimum error rate training)
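As a minimal sketch, the normalized model can be computed directly; the feature values and weights below are made up, and real training sets λ discriminatively (e.g., by MERT) rather than by hand:

```python
import math

def loglinear_probs(feature_vectors, lambdas):
    """Compute P(e | g) for each candidate parse e from its feature values
    h_m(g, e, a), normalizing over all candidates (the sum over e' above)."""
    scores = [sum(l * h for l, h in zip(lambdas, hs)) for hs in feature_vectors]
    z = sum(math.exp(s) for s in scores)  # partition over candidate parses e'
    return [math.exp(s) / z for s in scores]

# Two candidate parses, two features; values and weights are illustrative.
probs = loglinear_probs([[1.0, 0.0], [0.0, 2.0]], lambdas=[0.5, 0.5])
```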

Rich bitext projection features  Defined 36 features by looking at common English parsing errors  No monolingual features, except baseline parser probability  General features  Is there a probable label correspondence between German and the hypothesized English parse?  How expected is the size of each constituent in the hypothesized English parse given the German parse?  Specific features  Are coordinations realized identically?  Is the NP structure the same?  Mix of probabilistic and heuristic features

Training  Use BitPar syntactic forest parser  English BitPar trained on Penn Treebank  German BitPar trained on Tiger Treebank  Probabilistic feature functions built using large parallel text (Europarl)  Weights on feature functions (lambda vector) trained on portion of the Penn Treebank together with its translation into German  Minimum error rate training using F score

Reranking English parses  Difficult task  German is difficult to parse  Our knowledge source, the German parser, is out-of-domain (poor performance)  Baseline English parser we are trying to improve is in-domain (good performance)  Test set has long sentences  Result: 0.70% F1 improvement on test data (statistically significant)

New results  Reranking German parses  We needed German gold standard parses (and English translations)  Sebastian Padó has made a small parallel treebank for Europarl available  No engineering on German yet  We are using the same syntactic divergence features which were designed to improve English parsing  There are German-specific ambiguities which could be modeled, such as subject-object ambiguity (e.g., Die Maus jagt die Katze, “the mouse chases the cat” or “the cat chases the mouse”)  But it is an easier task because the parser we are trying to improve is weaker (German is hard to parse, Europarl is out of domain)  Currently a 2.3% F1 improvement; we think this can be further improved

Summary: bitext parsing  I showed you an approach for bitext parsing  Reranking the parses of English to minimize syntactic divergence with an automatically generated German parse  I then showed our first results for reranking German parses using a single English parse  The approach we used for this kind of morphosyntactic correspondence is more general than just parse reranking  Machine translation involves morphosyntactic correspondence  And this is where we are interested in looking at Croatian

Outline  The Institute for Natural Language Processing at the University of Stuttgart  Bitext parsing  Using morphosyntactic correspondence

Morphosyntactic processing  I am co-PI of a new IfNLP project funded by the DFG (German Science Foundation)  Project: morphosyntactic modeling for statistical machine translation (SMT)  SMT research, up until recently, has been dominated by translation into English  English expresses a lot of information through word order, very little through inflection  Approaches to translating morphologically rich languages to English are preprocessing-based

Present: linguistic preprocessing  Linguistic preprocessing for SMT (stat. machine translation)  From: freer syntax, morphologically rich language  To: rigid syntax, morphologically poor language  Existing examples: German to English, Czech to English

Present: linguistic preprocessing  How this works  Produce morphosyntactic analysis of German (or Czech)  Reorder words in the German/Czech sentence to be in English order  Reduce morphological inflection (for instance, remove case marking, remove all agreement on adjectives, etc)  For Czech: insert pseudo-words (e.g. indicate PRO-drop pronouns)  Use statistics on this “simplified” German or Czech to map directly to English using SMT
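The steps above can be illustrated with a toy pipeline; the reordering rule and the tiny lemma lexicon are invented for this example, whereas a real system derives them from a parser and a morphological analyzer:

```python
# Toy illustration of the preprocessing steps: reorder the German sentence
# into English-like word order, then reduce inflected forms to lemmas.
# The rule and lexicon below are invented, not a real system's.

AUXILIARIES = {"hat", "hatte"}
PARTICIPLES = {"gesehen"}
LEMMAS = {"hatte": "haben", "gesehen": "sehen"}

def reorder_verb_final(tokens):
    """Move a clause-final participle directly after the auxiliary."""
    if tokens and tokens[-1] in PARTICIPLES:
        for i, tok in enumerate(tokens):
            if tok in AUXILIARIES:
                return tokens[:i + 1] + [tokens[-1]] + tokens[i + 1:-1]
    return tokens

def reduce_inflection(tokens):
    """Replace each inflected form with its lemma, if known."""
    return [LEMMAS.get(tok, tok) for tok in tokens]

# "die Frau hatte die Katze gesehen" (the woman had seen the cat)
src = ["die", "frau", "hatte", "die", "katze", "gesehen"]
simplified = reduce_inflection(reorder_verb_final(src))
```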

Present: linguistic preprocessing  How well does this work?  German to English SMT with linguistic preprocessing (Stuttgart system)  Results from 2008 ACL workshop on machine translation (extensive human evaluation)  The only system limited to the organizers' data that was competitive with:  The best of 5 rule-based MT systems  The Saarbrücken hybrid rule-based/SMT system  Google Translate, which does not use linguistic preprocessing but does use vastly more data

Future: modeling  What about translating from English to German or to Slavic languages?  Problem: morphological generation is more difficult  It is easy to reduce multiple inflections to one (for instance, stemming)  Harder to learn to generate the right inflection

Future: modeling  Current work on morphological generation  Work at Charles University in Prague on Czech  Tectogrammatical representation is not (yet) competitive with simple statistics (little explicit knowledge of morphology or syntax)  Best English to German SMT systems also use little or no morphological knowledge  And they are much worse than rule-based English to German systems  Challenge: using morphosyntactic knowledge with statistical approaches requires more than just linguistic preprocessing: morphosyntactic modeling

Morphosyntactic correspondence  In fact, all multilingual problems involve morphosyntactic correspondence:  If we have a source parse tree, and source text, and we would like a target text, this is machine translation  If we have a source parse tree, source text and target text, and we would like a target parse, this is bitext parsing  If we would like to know which word in the target text is a translation of a particular word in the source text and we use morphosyntactic analysis, this is syntactic word alignment  The same thinking can be used for cross-lingual information retrieval  Very relevant when one of the languages is morphologically rich

Conclusion  I introduced the IfNLP Stuttgart  I presented a new approach to improving parsing using morphosyntactic correspondence: bitext parsing  I discussed the general challenge of using morphosyntactic correspondence, focusing on statistical machine translation  The biggest challenge is translating into freer word order, morphologically rich languages (e.g., German and particularly Slavic languages)  We are interested in the challenge of building systems to translate to Croatian  To do this: we need partners who are working on Croatian analysis!  We also request that you think about multilingual applications when producing Croatian NLP resources  The type of approach I showed for bitext parsing is useful for other multilingual applications

Thank you!

Statistical Approach  Using statistical models  Create many alternatives, called hypotheses  Give a score to each hypothesis  Find the hypothesis with the best score through search  Disadvantages  Difficulties handling structurally rich models (math and computation)  Need data to train the model parameters  Difficult to understand the decision process made by the system  Advantages  Avoid hard decisions  Speed can be traded against quality; not all-or-nothing  Works better in the presence of unexpected input  Learns automatically as more data becomes available Modified from Vogel
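The generate/score/search loop above can be sketched as a generic beam search, where the beam size is the speed/quality trade-off knob; the toy problem and functions are illustrative only, not part of any system described here:

```python
import heapq

def beam_search(expand, score, start, beam_size=5, steps=3):
    """Generic hypothesis search: generate alternatives, score them, and keep
    only the beam_size best at each step. A larger beam is slower but loses
    fewer good hypotheses -- the speed/quality trade-off noted above."""
    beam = [start]
    for _ in range(steps):
        candidates = [new for hyp in beam for new in expand(hyp)]
        beam = heapq.nlargest(beam_size, candidates, key=score)
    return max(beam, key=score)

# Toy problem (illustrative only): build the 3-digit string with the largest
# digit sum.
best_hyp = beam_search(
    expand=lambda h: [h + d for d in "0123456789"],
    score=lambda h: sum(int(c) for c in h),
    start="",
)
```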

Morphosyntactic knowledge  We use morphological analyzers and treebanks, combined in parsing models learned from the treebanks  English models have little morphological analysis (suffix analysis to determine POS for unknown words)  The German syntactic parser BitPar (Schmid) uses SMOR (Stuttgart Morphological Analyzer)  Given an inflected form, SMOR returns possible fine-grained POS tags  E.g., for nouns/adjectives: POS, case, gender, number, definiteness  BitPar puts the possible analyses in the chart, and disambiguates  Slavic languages require even more morphological knowledge than German
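A toy stand-in for this kind of analyzer interface; the entries and the tag format below are invented and do not match SMOR's actual output:

```python
# Toy stand-in for a morphological analyzer in the style of SMOR: map an
# inflected form to all candidate fine-grained analyses. The lexicon entries
# and the (POS, case, gender, number) tag format are invented for this sketch.

ANALYSES = {
    "frauen": [
        ("NN", "Nom", "Fem", "Pl"),
        ("NN", "Acc", "Fem", "Pl"),
        ("NN", "Dat", "Fem", "Pl"),
        ("NN", "Gen", "Fem", "Pl"),
    ],
}

def analyze(form):
    """Return every candidate analysis; a parser like BitPar would place all
    of them in its chart and let syntactic context disambiguate."""
    return ANALYSES.get(form.lower(), [("UNKNOWN", None, None, None)])

tags = analyze("Frauen")
```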

Transferring syntactic knowledge  Need knowledge source!  English syntactic parser  About 90% bracketing accuracy  Mapping  Requires bitext  Work discussed here uses German/English Europarl (European Parliament Proceedings)  Resource for Croatian: Acquis Communautaire  Automatically generated word alignment

Additional details in the paper  Formalization of bitext parsing as a parse reranking task  Definitions of bitext feature functions  Analysis of feature functions through feature selection  Comparison of MERT (minimum error rate training) with SVM-Rank