Deriving Paraphrases for Highly Inflected Languages from Comparable Documents. Kfir Bar, Nachum Dershowitz, Tel Aviv University, Israel.



Paraphrases. Phrase level: “I exposed my secret about my personal life” / “I spilled the beans and told Jacky I loved her”. Sentence level: “China did not change its policy toward Taiwan” / “Beijing’s policy toward Taiwan remains unchanged”.

Motivation: the MT coverage problem. [Figure: covered n-grams as a function of parallel corpus size, Arabic]

Related work on paraphrasing: – continuing our previous work on Arabic synonyms (Bar and Dershowitz, AMTA, 2010) – using a parallel corpus (Callison-Burch et al., 2006) – using a monolingual corpus (Marton et al., 2009) – using comparable documents (Wang and Callison-Burch, 2011)

Why Arabic? Being a Semitic language, Arabic is highly inflected. Example: وتدرسها = “and she learns it”, a single word combining a conjunction, a root and pattern, and a direct object.

Extracting paraphrases. Inspired by “Extracting Paraphrases from a Parallel Corpus”, Regina Barzilay and Kathleen R. McKeown (2001). Working on Arabic comparable documents.

Preparing the corpus: using Arabic Gigaword, we automatically paired documents that – were published on the same day – maximize the cosine similarity over their lemma-frequency vectors. [Figure: AFP and XIN documents linked by maximum cosine similarity]
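The pairing step could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `(doc_id, date, lemmas)` tuple layout, the function names `cosine` and `pair_documents`, and the toy documents and dates are all assumptions, and the slides do not say whether a minimum similarity threshold was applied.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse lemma-frequency vectors."""
    dot = sum(count * b[lemma] for lemma, count in a.items() if lemma in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_documents(source_a, source_b):
    """For each document in source_a, pick the same-day document in
    source_b with the highest cosine similarity.

    Each document is a (doc_id, date, lemmas) tuple (hypothetical layout).
    Returns a list of (id_a, id_b, similarity) tuples.
    """
    pairs = []
    for id_a, date_a, lemmas_a in source_a:
        vec_a = Counter(lemmas_a)
        best = None
        for id_b, date_b, lemmas_b in source_b:
            if date_b != date_a:  # only same-day documents are candidates
                continue
            sim = cosine(vec_a, Counter(lemmas_b))
            if best is None or sim > best[1]:
                best = (id_b, sim)
        if best is not None:
            pairs.append((id_a, best[0], best[1]))
    return pairs


# Toy data; identifiers, dates and lemmas are made up for illustration.
afp = [("afp1", "day-1", ["earthquake", "hit", "taiwan"])]
xin = [
    ("xin1", "day-1", ["earthquake", "occur", "taiwan"]),
    ("xin2", "day-1", ["policy", "taiwan", "china"]),
    ("xin3", "day-2", ["earthquake", "hit", "taiwan"]),
]
print(pair_documents(afp, xin))
```

In the toy data, `xin3` has the same lemmas as `afp1` but is skipped because it falls on a different day, so the same-day constraint acts before the similarity ranking.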

Preparing the corpus: 690 document pairs. Manual evaluation by two Arabic speakers: – randomly selected 120 document pairs – question: “Do both documents discuss the same event?”

Preprocessing AMIRAN [Diab et al. – to appear] is a tool for finding context-sensitive morpho-syntactic information – Segmentation – Diacritized lemma – Stem – Full part-of-speech tag – Base-phrase tag – Named-entity-recognition (NER) tag

Extracting paraphrases: co-training technique. [Diagram: extracting pairs of phrases; co-training (context, phrase) iterations; paraphrases; alignment]

Extracting pairs of phrases. Phrases: – contain at least one non-function word – do not break a base phrase in the middle. Example sentences: “A magnitude 6.0 earthquake on the Richter scale occurred at 11:24 a.m.” / “A strong undersea earthquake hit eastern Taiwan Wednesday”
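One way to read the two phrase constraints is sketched below. It assumes each sentence has already been split into base-phrase chunks and that every token carries a function-word flag; the data layout, the function name `candidate_phrases`, and the `max_words` cap are hypothetical and do not come from the slides. Pairing every phrase of one document with every phrase of the other is likewise only an assumption about how candidate pairs are formed.

```python
from itertools import product


def candidate_phrases(chunks, max_words=4):
    """Enumerate phrases made of whole, consecutive base-phrase chunks.

    chunks: list of chunks; each chunk is a list of (word, is_function_word).
    A phrase is kept only if it contains at least one non-function word.
    max_words is a hypothetical cap, not taken from the slides.
    """
    phrases = []
    for start in range(len(chunks)):
        words, has_content = [], False
        for chunk in chunks[start:]:
            # Always add a whole chunk, so no base phrase is split.
            words.extend(word for word, _ in chunk)
            if len(words) > max_words:
                break
            has_content = has_content or any(not func for _, func in chunk)
            if has_content:
                phrases.append(" ".join(words))
    return phrases


# Toy chunked sentences; the chunking and flags are made up.
doc_a = [
    [("a", True), ("strong", False), ("earthquake", False)],
    [("hit", False)],
    [("eastern", False), ("Taiwan", False)],
]
doc_b = [[("the", True), ("quake", False)], [("occurred", False)]]

phrases_a = candidate_phrases(doc_a)
phrases_b = candidate_phrases(doc_b)
candidate_pairs = list(product(phrases_a, phrases_b))
print(len(candidate_pairs), "candidate pairs")
```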

Co-training. Example pair (Buckwalter transliteration): (1) dEA xAfyyr swlAnA Almnsq Al>Ely llsyAsp AlxArjyp wAl>mnyp (2) dEA xAfyyr swlAnA Almmvl Al>ElY llsyAsp AlxArjyp fy. Inner = the phrase; outer = its context.

Extracting paraphrases: we maintain two sets of instances, unlabeled and labeled. Labeled instances are either positive (paraphrases) or negative (NOT paraphrases).

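Single iteration (instances move from Unlabeled to Labeled): 1. training the outer classifier; 2. using the outer classifier; 3. training the inner classifier; 4. using the inner classifier. [Diagram also shows deterministic labeling, the loop to the next iteration, and the resulting paraphrases]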
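A single iteration might look roughly like the sketch below. It is one plausible reading of the diagram, not the authors' implementation: the `Instance` layout, all function names, and the position of the deterministic-labeling step are assumptions, and a toy lookup classifier stands in for the SVMs used in the talk.

```python
from collections import namedtuple

# Hypothetical instance layout: a hashable signature for the context (outer)
# and one for the phrase pair itself (inner).
Instance = namedtuple("Instance", ["context", "phrase"])


def train_lookup(field):
    """Toy trainer standing in for an SVM: the resulting classifier says
    True when it has seen the same `field` value on a positive instance,
    and abstains (None) otherwise."""
    def train(labeled):
        seen_positive = {
            getattr(inst, field) for inst, is_para in labeled.items() if is_para
        }

        def classify(inst):
            return True if getattr(inst, field) in seen_positive else None

        return classify
    return train


def cotrain_iteration(labeled, unlabeled, train_outer, train_inner, deterministic):
    """Run one iteration; mutates and returns (labeled, unlabeled)."""
    # Steps 1-2: train and apply the outer classifier; steps 3-4: the inner one.
    for train in (train_outer, train_inner):
        classify = train(labeled)
        for inst in list(unlabeled):
            verdict = classify(inst)
            if verdict is not None:  # a decision was made: move to labeled
                labeled[inst] = verdict
                unlabeled.discard(inst)
    # Deterministic labeling of whatever is still unlabeled.
    for inst in list(unlabeled):
        verdict = deterministic(inst)
        if verdict is not None:
            labeled[inst] = verdict
            unlabeled.discard(inst)
    return labeled, unlabeled


labeled = {Instance("ctx1", "p1"): True}
unlabeled = {
    Instance("ctx1", "p2"),
    Instance("ctx2", "p2"),
    Instance("ctx3", "p3"),
}
labeled, unlabeled = cotrain_iteration(
    labeled,
    unlabeled,
    train_outer=train_lookup("context"),
    train_inner=train_lookup("phrase"),
    deterministic=lambda inst: None,  # placeholder rule that always abstains
)
print(len(labeled), "labeled;", len(unlabeled), "still unlabeled")
```

The toy run shows why the two views help each other: the outer classifier labels the `p2` pair because it shares a context with a known paraphrase, and the inner classifier, retrained on the enlarged set, then accepts `p2` in a context it had never seen.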

Deterministic labeling of potential paraphrases: labeling similar phrases as positive.
Example document 1: “A strong undersea earthquake hit eastern Taiwan Wednesday, and there are no immediate reports of damage or casualties, according to reports from Taipei. The earthquake registering 6.0 on the Richter scale struck at 11:24 a.m. local time (0324 GMT), was about 76 km southeast of Hualien on the eastern coast, at a depth of 4 km, Taiwan's Central Weather Bureau said in a statement.”
Example document 2: “A magnitude 6.0 earthquake on the Richter scale occurred at 11:24 a.m. Wednesday in the waters off Hualian, eastern Taiwan, with no immediate reports of casualties or property damage, the Central Weather Bureau (CWB) said. The quake's epicenter was 76 kilometers southeast of Hualien, according to the CWB.”

Deterministic labeling of potential paraphrases: negative examples are also labeled – in the first iteration (single words): word pairs that do not have similar gloss values – negative labeling is not used in subsequent iterations

The outer (context) classifier
Features (feature: description):
– lemma, POS, NER, BP: of each context word
– gloss-match rate: left and right
– lemma-match rate: left and right
Using SVM, quadratic kernel
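The slide names an SVM with a quadratic kernel but does not give its exact form. A common choice is the degree-2 polynomial kernel K(x, y) = (x·y + c)^2; the sketch below assumes that form, and the constant `coef0` and the toy feature vectors are illustrative only.

```python
def quadratic_kernel(x, y, coef0=1.0):
    """Degree-2 polynomial kernel: (x . y + coef0) ** 2.

    coef0 is an assumed constant; the slides do not specify it.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** 2


# Two hypothetical binary encodings of context features.
u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 0.0]
k = quadratic_kernel(u, v)
print(k)
```

With binary features, expanding the square produces products of feature pairs, so such a kernel lets the classifier weigh conjunctions (for instance a POS tag co-occurring with an NER tag) without enumerating them as explicit features.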

The inner (phrase) classifier
Features (feature: description):
– POS, NER, BP: of each phrase word
– morphological features (Boolean): conjunction, possessive, determiner, prepositions: of each phrase word
– length: number of words
– n-gram score: 2-4 grams

Experiments & results Arabic – 240 document pairs (165K words) – 5 iterations

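Experiments & results (Arabic)
Column headings on the slide: negative pairs, positive pairs, unique paraphrases, unlabeled pairs (the transcript preserves three values per row). A value in parentheses is the increase over the previous row; “…” marks digits missing from the transcript.
Initialization: 22,885,104; 66,317; 19,480
After iteration 1: 23,799,787; (+1,726) 68,043; 3,166,935
After iteration 2: 24,759,791; (+3,757) 71,…; …,790,574
After iteration 3: 25,349,489; (+2,623) 74,…; …,198,253
After iteration 4: 26,221,889; (+451) 74,…; …,557,931
After iteration 5: 26,900,833; (+101) 74,…; …,987
Total unique paraphrases: 1,773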

Evaluation: 2 native speakers. Pairs are provided with their context. 4 labels: – paraphrases – entailment (e.g. a magnitude 6.0 earthquake → the quiver) – related (e.g. San Diego ~ Los Angeles) – wrong (e.g. a poor and little-developed province ≠ its resource-rich northwestern province)

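Manual evaluation (Arabic): table with columns Length, Evaluated, Paraphrases, Entailment, Related, Wrong, Precision, one row per phrase length plus a Total row. The cell values are missing from the transcript.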

Inner classifier, morphological features
Experiment: extracted pairs, precision
– Outer+Inner: 653, 68%
– Outer: 7405, 23%
– Outer+Inner+no-morph-features: 211, 62%
Tested on 40 document pairs; evaluation of 200 pairs

Conclusions
– We will try to better understand the effect of the morphological features on Arabic
– Utilize the paraphrases for improving an Arabic-English translation system
Summary table columns: corpus size; extracted document pairs; pairs used in paraphrasing; words used in inference; unique paraphrases; precision (“…” marks values missing from the transcript)
Arabic: ~20,000,…; …; …; …,369; 1,773; 66%
English: ~1,000,…; …; …; …; …; …%

Thank you

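Manual evaluation (English): table with columns Length, Evaluated, Paraphrases, Entailment, Related, Wrong, Precision, one row per phrase length plus a Total row. The cell values are missing from the transcript.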

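Experiments & results (English)
Column headings on the slide: negative pairs, positive pairs, unique paraphrases, unlabeled pairs (the transcript preserves three values per row). A value in parentheses is the increase over the previous row; “…” marks digits missing from the transcript.
Initialization: 876,947; 32,972; 3,597
After iteration 1: 960,840; (+868) 33,840; 86,648
After iteration 2: 1,058,970; (+1,633) 35,…; …,648
After iteration 3: 1,109,746; (+1,194) 36,…; …,332
After iteration 4: 1,127,643; (+339) 37,006; 946,677
After iteration 5: 1,128,475; (+52) 37,058; 241,490
Total unique paraphrases: 525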

Co-training. Example pair (English): (1) “was 76 kilometers southeast of Hualien according to the” (2) “about 76 km southeast of Hualien on the eastern”. Inner = the phrase; outer = its context.
