Learning Paraphrases in Hebrew: Article Overview and Initial Work
Gabi Stanovsky



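Definitions
Paraphrase – “phrases, sentences, or longer natural language expressions that convey almost the same information”
Textual Entailment – “pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true” (Androutsopoulos and Malakasiotis, 2010)
Paraphrase example:
– אזרח ומאבטח השתלטו על אדם שרצה לשדוד סניף דואר בב"ש (YNET) [A civilian and a security guard overpowered a man who wanted to rob a post office branch in Be'er Sheva]
– אזרח ומאבטח השתלטו על שודד בסניף בנק הדואר המרכזי בב"ש (NRG) [A civilian and a security guard overpowered a robber at the central Postal Bank branch in Be'er Sheva]
Textual entailment example:
– בשנת 1999 זכתה קבוצת פנאתינייקוס בגביע אירופה [In 1999 the Panathinaikos team won the European Cup]
– בשנת 1999 עבר קטש לקבוצת פנאתינייקוס היוונית, ואף זכה להניף את גביע אירופה איתה (ויקיפדיה) [In 1999 Katash moved to the Greek team Panathinaikos, and even got to lift the European Cup with it (Wikipedia)]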

Tasks
– Paraphrase Extraction – Extract paraphrases occurring within text.
– Paraphrase Identification – Determine whether two given sentences are paraphrases.
– Paraphrase Generation – Generate paraphrases of a given input sentence.

Common Stages in Learning Paraphrases
– Obtain a monolingual corpus.
– Align paragraphs and sentences.
– Learn paraphrases.
– Apply the learned rules to solve NLP tasks.

Research Questions
– Are there specific properties of the Hebrew language that allow paraphrasing?
– Which datasets can be used to collect and identify a database of paraphrases in Hebrew?
– Could approaches taken on other languages (especially English) be applied to Hebrew?
– How could paraphrases in Hebrew be learned (encoded) in order to help in NLP tasks?

Applications
– Article summarization
– Textual entailment
– Thesaurus
– Enrich automatic generation of text
– Machine translation

Previous Work In Other Languages
Alignment - Gale and Church: They hypothesized that, when looking at paraphrases, each character in the source sentence gives rise to a certain (language-dependent) number of characters in the target language.

Previous Work In Other Languages
Alignment - Gale and Church: This model, combined with empirical results from their test corpus, yielded a fairly simple alignment algorithm that looks only at the lengths of the input sentences.

Previous Work In Other Languages
Alignment - Gale and Church: Only a fixed set of alignment types was allowed: 1-1, 1-0, 0-1, 2-1, 1-2 and 2-2 (number of source sentences to number of target sentences).
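The length-based idea above can be sketched as a small dynamic program. This is a simplified illustration, not the original algorithm: the bead priors, the variance constant, and the Gaussian-style `length_cost` are assumptions chosen for the sketch, and the names `align_by_length`, `BEAD_PRIORS` and `VARIANCE` are hypothetical.

```python
import math

# Assumed prior probability of each bead type (source sentences, target sentences).
# Illustrative values for this sketch; they are not taken from the slides.
BEAD_PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099,
    (0, 1): 0.0099,
    (2, 1): 0.089,
    (1, 2): 0.089,
    (2, 2): 0.011,
}
VARIANCE = 6.8  # assumed variance of the length difference per character


def length_cost(len_src, len_tgt):
    """Penalty that grows with the mismatch between two character counts."""
    mean = max((len_src + len_tgt) / 2.0, 1.0)
    delta = (len_tgt - len_src) / math.sqrt(mean * VARIANCE)
    return 0.5 * delta * delta


def align_by_length(src, tgt):
    """Return beads as (source indices, target indices), using lengths only."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in BEAD_PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] - math.log(prior) + length_cost(len_src, len_tgt)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the cheapest bead sequence.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]
```

Because only character counts enter the cost, the sketch never looks at the words themselves, which is what makes this family of aligners cheap to run on a new language.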

Previous Work In Other Languages
Paraphrase Identification (Barzilay, McKeown, 2001):
– Their dataset consists of multiple English translations of foreign books.
– Assumption: different translators will introduce paraphrases when translating the same source text.

Previous Work In Other Languages
Paraphrase Identification (Barzilay, McKeown, 2001):
– They continued by applying an iterative model for extracting paraphrase rules from aligned sentences.
– They created rules of two types: contextual rules and morpho-syntactic rules. The two are co-trained on the aligned corpus, and lexical paraphrases are extracted.

Previous Work In Other Languages
Contextual Rules:
– left1 = (VB0 TO1), right1 = (PRP$2,): “Tried to console her”
– left2 = (VB0 TO1), right2 = (PRP$2,): “Tried to comfort her”
Morpho-Syntactic Rules:
– VB0 TO1 VB1 PRP1: “used to love her”
– VB0 TO1 VB2 NN1 IN PRP1: “used to feel affection for her”
Lexical Paraphrases:
– (love, feel affection for)
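To make the role of shared context concrete, here is a minimal sketch that proposes a lexical paraphrase candidate whenever two aligned sentences agree on their left and right context. It is a deliberate simplification: it compares identical words rather than POS-tag contexts, and it does no co-training. The function name `candidate_paraphrase` is hypothetical.

```python
def candidate_paraphrase(sent_a, sent_b):
    """Return the differing middle parts of two aligned sentences, or None.

    A candidate is proposed only when the sentences share at least one word
    of left context and one word of right context.
    """
    a, b = sent_a.lower().split(), sent_b.lower().split()
    shortest = min(len(a), len(b))

    # Length of the common prefix (left context).
    prefix = 0
    while prefix < shortest and a[prefix] == b[prefix]:
        prefix += 1

    # Length of the common suffix (right context), not overlapping the prefix.
    suffix = 0
    while suffix < shortest - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1

    middle_a = a[prefix:len(a) - suffix]
    middle_b = b[prefix:len(b) - suffix]
    if prefix == 0 or suffix == 0 or not middle_a or not middle_b:
        return None
    return " ".join(middle_a), " ".join(middle_b)


print(candidate_paraphrase("Tried to console her", "Tried to comfort her"))
print(candidate_paraphrase("used to love her", "used to feel affection for her"))
```

On the slide's own examples this yields the pairs (console, comfort) and (love, feel affection for); real data would need the POS-based rules to filter out spurious candidates.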

Previous Work In Other Languages
Generation – Microsoft:
– The Microsoft NLP team created a system that produces paraphrases of an input English sentence.
– Their system automatically gathered a large training set from news sites, on which they performed sentence alignment.
– They used statistical learning tools on this dataset to learn generation lattices.

Previous Work In Other Languages
Generation – Malakasiotis and Androutsopoulos (Generate and rank):
– They created a method for ranking paraphrase candidates that gives weight to grammaticality, meaning preservation, and diversity of the paraphrases.
– They used this ranking component to create a new paraphrase generator. The generator creates many paraphrase candidates by using other available paraphrasing techniques.
– It then uses the ranking component to rank these candidates and returns the most likely ones.
– They published their dataset of paraphrase pairs with hand-tagged judgment ranks.
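The generate-and-rank flow can be illustrated with a toy ranker. This sketch assumes each candidate already carries three scores in [0, 1] and combines them with a plain weighted sum; the scores, the weights, and the names `Candidate` and `rank_candidates` are all hypothetical, and the published system's actual ranking model is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    grammaticality: float        # assumed score in [0, 1]
    meaning_preservation: float  # assumed score in [0, 1]
    diversity: float             # assumed score in [0, 1]


def rank_candidates(candidates, weights=(1.0, 1.0, 1.0), top_k=3):
    """Sort candidates by a weighted sum of the three scores, best first."""
    w_gram, w_meaning, w_div = weights

    def score(c):
        return (w_gram * c.grammaticality
                + w_meaning * c.meaning_preservation
                + w_div * c.diversity)

    return sorted(candidates, key=score, reverse=True)[:top_k]
```

Raising the diversity weight favours candidates that differ more from the input, at the risk of drifting in meaning, which is why the three criteria are weighed together.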

Previous Work In Other Languages
Extraction - Hashimoto et al (2011):
– Their work concentrates on the extraction of Japanese paraphrases from the web.
– They scan the web for what they call a "definition sentence": a sentence that describes a term.
– To identify such sentences, they parse them and match them against a sentential template: a certain order of part-of-speech tags to which, by their hypothesis, a definition sentence should adhere.
– Following this, they paired mined sentences that share the same subject, on the assumption that such a pair is likely to contain paraphrases. Using this method they report obtaining a large collection of 300K paraphrases with an estimated precision of ~94%.
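The pairing step can be sketched as grouping definition sentences by the term they define and emitting every pair within a group. This assumes definition sentences have already been identified and labelled with their term; the input format and the name `pair_definitions` are hypothetical, and the template-matching step is not shown.

```python
from collections import defaultdict
from itertools import combinations


def pair_definitions(definitions):
    """Pair up definition sentences that describe the same term.

    `definitions` is an iterable of (term, sentence) tuples.
    Returns a list of (term, sentence_a, sentence_b) candidate pairs.
    """
    by_term = defaultdict(list)
    for term, sentence in definitions:
        by_term[term].append(sentence)

    pairs = []
    for term, sentences in by_term.items():
        # Every unordered pair of sentences about the same term is a candidate.
        for a, b in combinations(sentences, 2):
            pairs.append((term, a, b))
    return pairs
```

Each returned pair is only a candidate: a later step would still have to locate the paraphrased phrases inside the two sentences.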

Previous Work In Hebrew
(Ordan, Wintner, 2011):
– They developed a medium-scale WordNet for Hebrew, consisting of ~5300 groups of synonymous lexical items (synsets).
– Their approach was to form the WordNet by aligning English and Hebrew expressions and to infer relations from the available English WordNet onto the Hebrew WordNet they created.
– They state that this method (called MultiWordNet) is preferable to building the WordNet from scratch, since the Hebrew language is poor in computational linguistic resources. The lack of monolingual dictionaries in Hebrew is given as an example of such a resource.

Initial Work
Data Mining
– Leading news sites will, with high probability, report on the same event within the same day.
– Collect hourly news headlines: our assumption is that finding paraphrases within a day’s mining is a simple task.
– Full story – richer examples?

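Initial Work
Data Mining – Examples
Synonym:
– סתיו שפיר נפגעה קל מאוד בהפגנה ליד בית שר האוצר [Stav Shaffir was very lightly hurt at a demonstration near the Finance Minister's home]
– סתיו שפיר נפצעה קל מאוד בהפגנה מול בית שר האוצר [Stav Shaffir was very lightly injured at a demonstration in front of the Finance Minister's home]
The bad:
– הולנד: להטיל סנקציות על הבנק המרכזי באיראן [The Netherlands: impose sanctions on Iran's central bank]
– צרפת: להטיל סנקציות בהיקף חסר תקדים על איראן [France: impose sanctions of unprecedented scope on Iran]
The good:
– השר שלום: מפגן האחדות הפלסטיני מחסל שיחות ישירות עם הרשות [Minister Shalom: the Palestinian show of unity kills direct talks with the Authority]
– השר שלום: מפגן האחדות הפלסטיני סותם הגולל על מומ ישיר [Minister Shalom: the Palestinian show of unity seals the fate of direct negotiations]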

Initial Work
Headlines Alignment
A baseline alignment method was created:
– For every two headlines collected on the same day, compute the probability of alignment as (2 * #common words) / (#total words).
– For each news headline in a news source, align it with a headline in another source for which the probability is over a certain threshold.
Produces fairly good results.
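The baseline score above is straightforward to write down. The following is a minimal sketch, assuming whitespace tokenization, a threshold of 0.5, and that each headline is matched to its single best-scoring counterpart; none of these choices is stated on the slide, and the function names are hypothetical.

```python
from collections import Counter


def overlap_score(headline_a, headline_b):
    """(2 * #common words) / (#total words) over whitespace tokens."""
    words_a, words_b = headline_a.split(), headline_b.split()
    total = len(words_a) + len(words_b)
    if total == 0:
        return 0.0
    # Multiset intersection counts a repeated word once per matching occurrence.
    common = sum((Counter(words_a) & Counter(words_b)).values())
    return 2 * common / total


def align_headlines(source_a, source_b, threshold=0.5):
    """Pair each headline in source_a with its best match in source_b."""
    pairs = []
    for headline in source_a:
        best = max(source_b, key=lambda h: overlap_score(headline, h), default=None)
        if best is not None and overlap_score(headline, best) >= threshold:
            pairs.append((headline, best))
    return pairs


# Usage on the "synonym" example headlines from the data-mining slide.
a = "סתיו שפיר נפגעה קל מאוד בהפגנה ליד בית שר האוצר"
b = "סתיו שפיר נפצעה קל מאוד בהפגנה מול בית שר האוצר"
print(overlap_score(a, b))
```

The score is 1.0 for identical headlines and 0.0 for headlines with no word in common, so the threshold directly trades recall for precision.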

Initial Work
Full Stories Alignment
Testing a dynamic programming approach (which gives weight to identical words) for aligning full stories seems to yield some interesting results.
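One way such a dynamic program could look is sketched below. It is an assumption-laden illustration rather than the method actually tested: it aligns segments monotonically (no crossing links), scores a link by the number of identical words, and requires at least `min_shared` shared words. The names `shared_words`, `align_segments` and `min_shared` are hypothetical.

```python
def shared_words(segment_a, segment_b):
    """Number of distinct words that appear in both segments."""
    return len(set(segment_a.split()) & set(segment_b.split()))


def align_segments(left, right, min_shared=2):
    """Monotonic alignment of two segment lists that maximises shared words.

    Returns a list of (left_index, right_index) links.
    """
    n, m = len(left), len(right)
    # best[i][j] = best total score when aligning left[i:] with right[j:].
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            s = shared_words(left[i], right[j])
            link = best[i + 1][j + 1] + s if s >= min_shared else -1
            best[i][j] = max(link, best[i + 1][j], best[i][j + 1])

    # Trace the optimal path forward from the first segments.
    links = []
    i = j = 0
    while i < n and j < m:
        s = shared_words(left[i], right[j])
        if s >= min_shared and best[i][j] == best[i + 1][j + 1] + s:
            links.append((i, j))
            i += 1
            j += 1
        elif best[i][j] == best[i + 1][j]:
            i += 1
        else:
            j += 1
    return links
```

A monotonic aligner of this kind cannot link a quotation that one story places at the start and the other at the end, which is one limitation a better alignment method would need to address.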

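Initial Work
Full Stories Alignment
First story:
חכ זהבה גלאון ממרצ תקפה את ראש הממשלה, בנימין נתניהו. במהלך דיון בכנסת בעקבות חתימות של חכים: אם תנסה להרוס את הדמוקרטיה, תקבל התקוממות עממית. הפרת את שבועת האמונים שלך לאזרחי המדינה ולחוקיה כאשר התחלת בקמפיין לחיסול הדמוקרטיה במדינת ישראל. דמוקרטיה לא נבחנת רק בשלטון הרוב. אלא גם בכיבוד זכויות האדם של המיעוט ואתה הפרת את שבועת האמונים שלך.
[MK Zehava Galon of Meretz attacked the Prime Minister, Benjamin Netanyahu, during a Knesset debate following MKs' signatures: "If you try to destroy democracy, you will get a popular uprising. You violated your oath of allegiance to the citizens of the state and to its laws when you began a campaign to eliminate democracy in the State of Israel. Democracy is not tested only by majority rule, but also by respect for the human rights of the minority, and you violated your oath of allegiance."]
Second story:
חברת הכנסת זהבה גלאון ממרצ טענה כי ראש הממשלה, בנימין נתניהו, הפר את שבועת האמונים שלו לאזרחי המדינה בכך שהחל בקמפיין לחיסול הדמוקרטיה במדינת ישראל: דמוקרטיה לא נבחנת רק בשלטון הרוב, אלא גם בכיבוד זכויות האדם של המיעוט. אתה הפרת את שבועת האמונים שלך, כשהחלטת לחסל את המיעוט ולפגוע בזכויות היסוד שלו. אם תנסה להרוס את הדמוקרטיה, תקבל התקוממות עממית, הכריזה.
[Knesset Member Zehava Galon of Meretz claimed that the Prime Minister, Benjamin Netanyahu, violated his oath of allegiance to the citizens of the state by beginning a campaign to eliminate democracy in the State of Israel: "Democracy is not tested only by majority rule, but also by respect for the human rights of the minority. You violated your oath of allegiance when you decided to eliminate the minority and harm its basic rights. If you try to destroy democracy, you will get a popular uprising," she declared.]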

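Initial Work
Full Stories Alignment
First story, segmented:
1. חכ זהבה גלאון ממרצ תקפה את ראש הממשלה, [MK Zehava Galon of Meretz attacked the Prime Minister,]
2. בנימין נתניהו. [Benjamin Netanyahu.]
3. במהלך דיון בכנסת בעקבות חתימות של חכים: [During a Knesset debate following MKs' signatures:]
4. אם תנסה להרוס את הדמוקרטיה, [If you try to destroy democracy,]
5. תקבל התקוממות עממית. [you will get a popular uprising.]
6. הפרת את שבועת האמונים שלך לאזרחי המדינה ולחוקיה כאשר התחלת בקמפיין לחיסול הדמוקרטיה במדינת ישראל. [You violated your oath of allegiance to the citizens of the state and to its laws when you began a campaign to eliminate democracy in the State of Israel.]
7. דמוקרטיה לא נבחנת רק בשלטון הרוב. [Democracy is not tested only by majority rule.]
8. אלא גם בכיבוד זכויות האדם של המיעוט ואתה הפרת את שבועת האמונים שלך [But also by respect for the human rights of the minority, and you violated your oath of allegiance]
Second story, segmented:
חברת הכנסת זהבה גלאון ממרצ טענה כי ראש הממשלה, [Knesset Member Zehava Galon of Meretz claimed that the Prime Minister,]
2. בנימין נתניהו, הפר את שבועת האמונים שלו לאזרחי המדינה בכך שהחל בקמפיין לחיסול הדמוקרטיה במדינת ישראל: [Benjamin Netanyahu, violated his oath of allegiance to the citizens of the state by beginning a campaign to eliminate democracy in the State of Israel:]
7. דמוקרטיה לא נבחנת רק בשלטון הרוב, [Democracy is not tested only by majority rule,]
8. אלא גם בכיבוד זכויות האדם של המיעוט. אתה הפרת את שבועת האמונים שלך, [but also by respect for the human rights of the minority. You violated your oath of allegiance,]
9. כשהחלטת לחסל את המיעוט ולפגוע בזכויות היסוד שלו. [when you decided to eliminate the minority and harm its basic rights.]
10. אם תנסה להרוס את הדמוקרטיה, [If you try to destroy democracy,]
11. תקבל התקוממות עממית, [you will get a popular uprising,]
12. הכריזה. [she declared.]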

Future Work Plan
– Align full stories using a baseline method (7.12).
– Provide a better alignment method:
  – Using a tagger in order to exploit POS knowledge (14.12).
  – Giving weight to proper nouns (e.g. names) and named entities (21.12):
    “The Cassini spacecraft, which is en route to Saturn, is about to make a close pass of the ringed planet's mysterious moon Phoebe” vs. “On its way to an extended mission at Saturn, the Cassini probe on Friday makes its closest rendezvous with Saturn's dark moon Phoebe.” (C. Quirk, C. Brockett and W. Dolan (Microsoft Research), 2004)
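The planned weighting of proper nouns could be folded into the word-overlap score roughly as follows. This is only a sketch of one possible design: it assumes the set of proper nouns is supplied by an external tagger, uses an arbitrary weight of 3.0, and lowercases tokens after stripping basic punctuation. The names `weighted_overlap`, `proper_nouns` and `weight` are hypothetical.

```python
def weighted_overlap(sent_a, sent_b, proper_nouns, weight=3.0):
    """Overlap score in which proper nouns count `weight` times as much.

    `proper_nouns` is a set of lowercase tokens assumed to come from a tagger.
    """
    def tokens(sentence):
        return [w.strip(".,\"'").lower() for w in sentence.split()]

    def w(token):
        return weight if token in proper_nouns else 1.0

    a, b = tokens(sent_a), tokens(sent_b)
    total = sum(w(t) for t in a) + sum(w(t) for t in b)
    if total == 0:
        return 0.0
    common = set(a) & set(b)
    return 2 * sum(w(t) for t in common) / total


s1 = ("The Cassini spacecraft, which is en route to Saturn, is about to make "
      "a close pass of the ringed planet's mysterious moon Phoebe")
s2 = ("On its way to an extended mission at Saturn, the Cassini probe on Friday "
      "makes its closest rendezvous with Saturn's dark moon Phoebe.")
names = {"cassini", "saturn", "phoebe"}
print(weighted_overlap(s1, s2, names), weighted_overlap(s1, s2, set()))
```

With the names weighted, two sentences that share few ordinary words but the same entities score higher than under the plain word-overlap baseline.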

Future Work Plan:
– Publish the alignments dataset and estimate its precision rate (28.12).
– Try to incorporate LDA in the system to get better results (7.1).
– Try to formulate a method for extracting synonyms from this dataset (14.1).
– Explore ways of learning and encoding paragraph rules from the aligned dataset (21.1).