1 Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization
Kathleen McKeown, Rebecca Passonneau, David Elson, Ani Nenkova, Julia Hirschberg
Department of Computer Science, Columbia University
3 Status of Multi-Document Summarization
- Robust: many existing systems (e.g., DUC 2004)
- Extensive quantitative evaluation (intrinsic): DUC 2001 – DUC 2005
  - Comparison of system summary content against human models
- Do system-generated summaries help end-users make better use of the news?
4 Extrinsic Evaluation
- Task-based evaluation of single-document summarization using IR
  - TIPSTER-II, Brandow et al., Mani et al., Mochizuki & Okumura
  - Other factors can determine the result (Jing et al.)
- Evaluation of evaluation metrics using a task similar to ours
  - Amigo et al.
5 Task Evaluation
- Hypothesis: multi-document summaries enable users to find information efficiently
- Task: fact-gathering given a topic and questions
  - Resembles an intelligence analyst's task
- Compared 4 parallel news browsing systems
  - Level 1: Source documents only
  - Level 2: One-sentence multi-document summaries (e.g., Google News) linked to documents
  - Level 3: Newsblaster multi-document summaries linked to documents
  - Level 4: Human-written multi-document summaries linked to documents
6 Results Preview
- Quality of facts gathered significantly better: Newsblaster vs. documents alone
- User satisfaction higher: Newsblaster and human summaries vs. documents and 1-sentence summaries
- Summaries contributed important facts: Newsblaster and human summaries vs. 1-sentence summaries
- Full multi-document summarization more powerful than documents alone or single-sentence summarization
7 Outline
- Study design and execution
- Scoring
- Results
8 Evaluation Goals
- Do summaries help users find the information needed to perform a fact-gathering task?
- Do users use information from the summary in gathering their facts?
- Do summaries increase user satisfaction with the online news system?
- Do users create better fact sets with an online news system that includes summaries than with one that does not?
- How does the type of summary (i.e., 1-sentence, system-generated, human-generated) affect quality of task output and user satisfaction?
9 Experimental Design
- Subjects performed four 30-minute fact-gathering scenarios
- Prompt: topic description plus three questions
- Given a web page as sole resource
  - Space in which to compose the response
  - Instructed to cut and paste from summary or article
  - Four event clusters per page
    - Two centrally relevant, two less relevant
    - 10 documents per cluster on average
- Complete survey after each scenario
10 Prompt
The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the "road map for peace," a diplomatic effort sponsored by the United States, Russia, the E.U. and the U.N., has suffered setbacks. However, unofficial negotiators have developed a plan known as the Geneva Accord for finding a permanent solution to the conflict.
- Who participated in the negotiations that produced the Geneva Accord?
- Apart from direct participants, who supported the Geneva Accord preparations and how?
- What has the response been to the Geneva Accord by the Palestinians and Israelis?
11 Experimental Design
- Subjects performed four 30-minute fact-gathering scenarios
- Prompt: topic description plus three questions
- Produced a report containing a list of facts
- Given a web page as sole resource
  - Space in which to compose the response
  - Instructed to cut and paste from summary or article and make a citation
  - Four event clusters per page
    - Two centrally relevant, two less relevant
    - 10 documents per cluster on average
- Complete survey after each scenario
12 Level 1: Documents only, no summary
13 Level 2: 1-sentence summary for each event cluster, 1-sentence summary for each article
14 Full multi-document summaries
- Neither humans nor systems had access to the prompt
- Level 3: Generated by Newsblaster for each event cluster
- Level 4: Human-written summary for each event cluster
  - Summary writers hired to write summaries
    - English or Journalism students with high verbal SAT scores
15 Levels 3 and 4: full summary for each event cluster
17 Study Execution
- 45 subjects with varied backgrounds
  - 73% students (BS, BA, journalism, law)
  - Native speakers of English
  - Paid, with promise of a monetary prize for the best report
- 3 studies, controlling for scenario and level order; ~11 subjects per scenario/level
18 Results – What was Measured
- Report content across summary conditions: levels 1-4
- User satisfaction per summary condition, based on user surveys
- Source of report content (summary or article), by counting fact citations
19 Scoring Report Content
- Compare subject reports against a gold standard
- Used the Pyramid method [HLT2004]
  - Avoids postulating an ideal exhaustive report
  - Predicts multiple equally good reports
  - Provides a metric for comparison
- Gold standard for report x = pyramid of facts constructed from all reports except x
  - Relative importance of facts determined by report writers
  - 34 reports per pyramid on average -> very stable
20 Pyramid representation
- Tiers of differentially weighted facts
  - Top: few facts, high weight
  - Bottom: many facts, low weight
- Report facts that don't appear in the pyramid have weight 0
- Duplicate report facts get weight 0
[Figure: pyramid diagram with tiers labeled from W=34 at the top down to W=1 at the bottom]
21 Ideally informative report Does not include a fact from a lower tier unless all facts from higher tiers are included as well
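To make the pyramid scoring concrete, the following is a minimal Python sketch of the leave-one-out scoring described on the preceding slides, assuming each report has already been annotated as a set of fact identifiers. The function names, the toy reports, and the normalization against an ideally informative report of the same size are illustrative assumptions, not the study's actual scoring code.

from collections import Counter

def build_pyramid(other_reports):
    """Weight each fact by the number of other subjects' reports that contain it.
    other_reports: list of sets of fact identifiers (all reports except the one scored)."""
    weights = Counter()
    for facts in other_reports:
        for fact in facts:  # a set, so each report contributes at most once per fact
            weights[fact] += 1
    return weights  # tiers of the pyramid correspond to distinct weight values

def pyramid_score(report_facts, pyramid):
    """Observed weight of the report's facts, normalized by the weight an
    ideally informative report with the same number of facts would achieve."""
    unique_facts = set(report_facts)  # duplicate report facts get weight 0
    observed = sum(pyramid.get(f, 0) for f in unique_facts)
    # Ideal report of the same size: the highest-weighted facts in the pyramid.
    ideal = sum(sorted(pyramid.values(), reverse=True)[:len(unique_facts)])
    return observed / ideal if ideal else 0.0

# Toy leave-one-out example: score report r1 against a pyramid built from the others.
all_reports = {"r1": {"f1", "f2"}, "r2": {"f1", "f3"}, "r3": {"f1", "f2", "f4"}}
pyramid = build_pyramid([facts for name, facts in all_reports.items() if name != "r1"])
print(pyramid_score(all_reports["r1"], pyramid))  # 1.0 for this toy example

Because the denominator is the best score attainable with the same number of facts, an ideally informative report as defined above scores 1.0 regardless of its length.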
27 Report Length
- Wide variation in length impacts scores
- We restricted report length to less than 1 standard deviation above the mean by truncating question answers
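A small sketch of the length control just described, assuming report length is measured in words and each report is a list of per-question answer strings. The cutoff of one standard deviation above the mean follows the slide; the per-answer truncation policy and the helper names are assumptions made for illustration.

import statistics

def length_cutoff(word_counts):
    """Maximum allowed report length: mean word count plus one standard deviation."""
    return statistics.mean(word_counts) + statistics.stdev(word_counts)

def truncate_report(answers, max_words):
    """Keep words from the start of each answer until the word budget is used up."""
    kept, budget = [], int(max_words)
    for answer in answers:
        words = answer.split()
        kept.append(" ".join(words[:budget]))
        budget = max(0, budget - len(words))
    return kept

# Illustrative usage with made-up word counts (not study data).
cutoff = length_cutoff([240, 310, 190, 420])
print(truncate_report(["short answer to question one", "a much longer answer to question two"], cutoff))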
28 Results - Content
  Summary Level           Pyramid Score
  Level 1 (docs only)     .3354
  Level 2 (1 sentence)    .3757
  Level 3 (Newsblaster)   .4269
  Level 4 (Human)         .4027
- Report quality improves from level 1 to level 3
- (One scenario was dropped from the results as it was problematic for subjects)
29 Statistical Analysis
- ANOVA shows summary level is a marginally significant factor
- Bonferroni method applied to determine differences between summary levels
  - Difference between Newsblaster and documents-only significant (p = .05)
  - Differences between Newsblaster and 1-sentence or human summaries not significant
- ANOVA shows that scenario, question, and subject are also significant factors
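For readers who want to reproduce this style of analysis, here is a hedged sketch using SciPy: a one-way ANOVA over pyramid scores grouped by summary level, followed by Bonferroni-corrected pairwise t-tests. The scores below are illustrative placeholders, not the study's data, and the sketch omits the scenario, question, and subject factors that the full analysis also modeled.

from itertools import combinations
from scipy import stats

# Pyramid scores grouped by summary level (illustrative numbers only).
scores = {
    "level1_docs":        [0.31, 0.35, 0.33, 0.36],
    "level2_1sentence":   [0.36, 0.39, 0.37, 0.38],
    "level3_newsblaster": [0.41, 0.44, 0.42, 0.43],
    "level4_human":       [0.39, 0.41, 0.40, 0.42],
}

# One-way ANOVA: is summary level a significant factor overall?
f_stat, p_value = stats.f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Bonferroni correction: multiply each pairwise p-value by the number of comparisons.
pairs = list(combinations(scores, 2))
for a, b in pairs:
    t, p = stats.ttest_ind(scores[a], scores[b])
    corrected = min(1.0, p * len(pairs))
    print(f"{a} vs {b}: corrected p = {corrected:.4f}")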
30 Results - User Satisfaction
- 6 questions in the exit survey required a response on a 1-5 scale
- Average satisfaction increases by summary type
  [Table: average rating for Level 1, Level 2, Level 3, Level 4]
31 With full summaries, users read less
- Question A: What best describes your experience reading source articles?
  - Scale: 1 = I read a LOT more than I needed to; 5 = I only read those articles I needed to read
  [Table: average response for Levels 1-4]
32 With summaries, easier to write the report and tended to have more time
- Question B: How difficult do you think it was to write the report?
  - Scale: 1 = Very difficult; 5 = Very easy
- Question C: Do you feel you had enough time to write the report?
  - Scale: 1 = I needed more time; 5 = I had more than enough time
  [Table: average responses for Levels 1-4]
33 Usefulness improves with summary quality; human summaries help best with time
- Question D: What best describes your experience using article summaries? (n/a for Level 1)
  - Scale: 1 = They had nothing useful to say; 5 = Everything I needed to know
- Question E: Did you feel that the automatic summaries saved you time, wasted time, or had no impact on your time budget? (n/a for Level 1)
  - Scale: 1 = Summaries wasted time; 5 = Summaries saved me time
  [Table: average responses for Levels 2-4]
34 Multiple Choice Survey Questions
  Question                            Level 2   Level 3   Level 4
  1. Which was most helpful?
     Source articles helped most      64%       48%       29%
     Equally helpful                  32%       29%       -
     Summaries helped most            5%        24%       43%
  2. How did you budget your time?
     Most searching, some writing     55%       48%       67%
     Half searching, half writing     39%       29%       19%
     Most writing, some searching     7%        24%       14%
35 Citation Patterns
- Report writers were significantly more likely to extract facts from summaries with Newsblaster and human summaries
                              Level 2   Level 3   Level 4
  Citations from summaries    8%        17%       27%
36 What we Learned
- With summaries, a significant increase in report quality
- We hypothesized summaries would reduce reading time
  - As summary quality increases, users draw facts from the summary significantly more often, with no decrease in report quality
  - Users claim they read fewer full documents with level 3 and 4 summaries
- Full multi-document summarization better than 1-sentence summaries
  - Almost 5 times the proportion of subjects using Newsblaster summaries say summaries are helpful, compared to subjects using 1-sentence summaries
37 Need for Follow-on Studies
- Why no significant increase in report quality from level 2 to level 3?
  - Interface differences
    - Level 2 had a summary for each article; level 3 did not
    - Level 3 required extra clicks to see the list of articles
- Studies to investigate controlling report length
- Studies to investigate the impact of scenario and question
41 Conclusions
- Do summaries help? Yes
- Our task-based, extrinsic evaluation yielded significant conclusions
  - Full multi-document summarization (Newsblaster, human summaries) helps users perform better at fact-gathering than documents only
  - Users are more satisfied with full multi-document summarization than with Google News-style 1-sentence summaries