Evaluation of a Cross-lingual Romanian-English Multi-document Summariser Constantin Orasan and Oana Andreea Chiorean Research Group in Computational Linguistics.

Presentation transcript:

Evaluation of a Cross-lingual Romanian-English Multi-document Summariser Constantin Orasan and Oana Andreea Chiorean Research Group in Computational Linguistics University of Wolverhampton Wolverhampton, United Kingdom

Structure
1. Introduction
2. The summarisation method used
3. Evaluation of Romanian summaries
4. Evaluation of English summaries
5. Conclusions & future work

Introduction
Automatic summarisation (AS) offers a way to access large amounts of information by giving the reader its gist
Machine translation (MT) offers a way to access information in a language the reader does not know
AS + MT = cross-lingual multi-document summarisation
We investigate whether cross-lingual multi-document summarisation offers English speakers a good way to access Romanian information

The summarisation method
The method used to produce summaries is Maximal Marginal Relevance (MMR) (Goldstein et al., 2000)
The method works on clusters of related documents linked to a user topic
ht://dig was used to extract these clusters
The first 50 snippets returned were used, each of up to 10,000 characters

Maximal Marginal Relevance (MMR)
Chosen because it requires very few language-dependent tools: a sentence splitter, a tokenizer and a stoplist
The formula used has two components:
  One maximises the similarity to the user topic
  One minimises the redundant information in the summary
A factor λ controls the influence of each of these components
The summary is built in an iterative process
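The iterative process on this slide could be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the bag-of-words cosine similarity, the function names, and the convention that λ weights relevance while (1 - λ) weights redundancy are assumptions taken from the usual MMR formulation, and whitespace splitting stands in for a real tokenizer.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(sentences, topic, lam=0.6, max_chars=2000):
    """Greedy MMR selection (hypothetical helper, assumed scoring).

    At each step, pick the sentence that maximises
    lam * sim(sentence, topic) - (1 - lam) * max sim(sentence, already selected),
    stopping when the next pick would exceed the character budget.
    """
    vectors = [Counter(s.lower().split()) for s in sentences]
    topic_vec = Counter(topic.lower().split())
    selected = []                       # indices of chosen sentences, in order
    remaining = list(range(len(sentences)))
    length = 0

    def score(i):
        relevance = cosine(vectors[i], topic_vec)
        redundancy = max(
            (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
        )
        return lam * relevance - (1 - lam) * redundancy

    while remaining:
        best = max(remaining, key=score)
        if length + len(sentences[best]) > max_chars:
            break
        selected.append(best)
        remaining.remove(best)
        length += len(sentences[best])
    return [sentences[i] for i in selected]
```

With a low λ the redundancy term dominates, so an exact duplicate of an already selected sentence is passed over in favour of a less relevant but novel one.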

Evaluation
For both Romanian and English summaries a task-based method was used
A corpus of Romanian articles published between 2001 and 2005 was built
5 topics were selected:
  ARDAF wants to pay to stop Petrovschi scandal
  Basescu forms the government with UDMR and PUR
  American bases in Romania
  Flat-tax rate from 1st of January 2005
  Romanian journalists kidnapped in Iraq

Evaluation (II)
Multiple-choice questions were produced manually, without looking at the generated summaries
Judges had to answer the questions on the basis of the summaries given, not their own knowledge of the events
The quality of a summary was measured by the number of correctly answered questions
An “I don’t know” option was added so that judges would not try to guess the answer
Coherence was marked on a scale from 1 to 5
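The scoring rule above amounts to counting correct answers, with “I don’t know” never counting as correct. A small sketch follows; the question ids, option labels and the function name are hypothetical placeholders, not data from the study.

```python
DONT_KNOW = "I don't know"


def score_summary(answers, answer_key):
    """Count the questions a judge answered correctly for one summary.

    `answers` maps question id -> option chosen by the judge;
    `answer_key` maps question id -> correct option.
    Unanswered questions and "I don't know" both score zero.
    """
    return sum(
        1 for question, correct in answer_key.items()
        if answers.get(question) == correct
    )


# Illustrative (made-up) data: one correct answer, one abstention, one error.
answer_key = {"q1": "A", "q2": "B", "q3": "A"}
judge_answers = {"q1": "A", "q2": DONT_KNOW, "q3": "C"}
print(score_summary(judge_answers, answer_key))
```

Keeping “I don’t know” as an ordinary option means abstentions can still be counted separately from wrong answers if that distinction is needed later.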

Examples of questions
With which parties is Basescu hoping to achieve a parliamentary majority?
  PUR and UDMR
  PSD and PRM
  PUR and PSD
  UDMR and PSD
  I don’t know
Is NATO interested in establishing military bases in Romania?
  Yes
  No
  I don’t know

Evaluation of Romanian summaries
Summaries evaluated:
  Baseline: the first sentence of each retrieved article, added until the desired length limit was reached
  “Perfect summaries”: human-produced summaries
  4 versions of MMR: λ = 0.5 or 0.6, truncation to 5 or 6 characters
Summaries of about 2000 characters, including whitespace
60 judges; each summary was evaluated by 10 different people
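One plausible reading of “truncation to 5 and 6 characters” is that each word is cut to its first few characters as a cheap substitute for stemming before similarities are computed. The sketch below assumes that reading; the function name, the tiny stoplist and whitespace tokenisation are illustrative assumptions only.

```python
from itertools import product


def normalise(sentence, stoplist, prefix_len=6):
    """Lowercase, drop stoplist words, and truncate each token to a prefix.

    Truncation acts as a crude, language-independent stand-in for stemming
    (an assumed interpretation of the slide, not a confirmed detail).
    """
    return [
        token[:prefix_len]
        for token in sentence.lower().split()
        if token not in stoplist
    ]


# Hypothetical two-word stoplist, for illustration only.
stoplist = {"unor", "in"}
print(normalise("Instalarea unor baze americane in Romania", stoplist, 6))

# The four MMR configurations evaluated: every pairing of lambda and prefix length.
configurations = list(product((0.5, 0.6), (5, 6)))
print(configurations)
```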

Evaluation results
[Chart legend: 1 – Human summaries; 2 – Baseline; 3 to 6 – MMR]
The best results for automatic summaries: truncation to 6 chars, λ = 0.6, stoplist, TF*IDF

Evaluation of English summaries
The summaries produced by the best method were automatically translated using eTranslator
The questions and their answers were manually translated
29 judges answered 414 questions

Evaluation of English summaries
On all the topics the number of correctly answered questions decreases
Attempts to identify whether one category of questions (Yes/No questions, questions whose answer is a number) could be answered better than others did not reveal any pattern
Feedback from the judges indicated that even though they could locate the answer to a question, in many cases they could not understand the whole summary because of the poor translation

Poor translation
Source: In momentul de fata, Belu cistiga aproximativ de euro pe luna, in timp ce Bitang de euro.
Translation: In the girlish moment, Belu gains about of euro on month, in while Bitang of euro.
Source: Ministrul Mircea Pascu a declarat ieri ca instalarea unor baze americane in Romania este o consecinta a statului nostru de viitor membru al NATO, aflat la granita Aliantei.
Translation: The minister Mircea stated Pascu yesterday as the the of a installation american bases in Romania is stood our of future limb of NATO, finded out to boundary Aliantei.

Poor translation
Judges:
  said “The meaning of the texts seemed almost graspable, but just beyond my mental powers.”
  compared the texts with a certain character’s speech from ‘The Fast Show’, a British comedy programme
It seems that the summaries contain important information, but it is highly unlikely that anyone will discover it, because readers will give up after one or two sentences

Conclusions & future work
Cross-lingual Romanian-English multi-document summarisation may be an option if the quality of the translation engine improves (try the Romanian to English Google translation engine?)
Develop a summarisation method which “guesses” how easy it is to translate a sentence
Translate and evaluate more summaries
Try this approach for other pairs of languages which have better translation engines

Source: In momentul de fata, Belu cistiga aproximativ de euro pe luna, in timp ce Bitang de euro.
Translation: Today, Belu cistiga approximately 1,000 euros per month, while Bitang euros.
Source: Ministrul Mircea Pascu a declarat ieri ca instalarea unor baze americane in Romania este o consecinta a statului nostru de viitor membru al NATO, aflat la granita Aliantei.
Translation: Minister Mircea Pascu said yesterday that the installation of American bases in Romania is a consequence of state our future member of NATO at the Alliance of the border.