John Frazier and Jonathan Perrier


Text Summarization
John Frazier and Jonathan Perrier

Natural Language Problem Statement
Given some piece of text, we want to create an accurate summary using as little time and as few resources as possible.

Formal Language Problem Statement
Let $N = \{n_1, n_2, \ldots, n_m\}$ be a text document, with $n_1, n_2, \ldots, n_m$ each being a sentence of the document in the order it appears.
Let $w_{r,s} \in N$ be the words of the document, where $r$ is the index of the sentence and $s$ is the position of the word within that sentence.
Given a document $N$, extract a subset $S = \{n_i, \ldots, n_j\}$ with $S \subseteq N$, and score summary quality with the ROUGE-1 metric.

Algorithm 1: LexRank
- Create a graph with one vertex per sentence
- Create edges between sentences weighted by IDF-modified cosine similarity
- Apply PageRank to the graph; a sentence's PageRank weight is based on the number of references other sentences make to it
- Return sentences ranked by their PageRank scores
- O(n²) complexity
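The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the similarity threshold, damping factor, iteration count, and whitespace tokenization are all assumptions.

```python
import math
from collections import Counter

def idf_cosine(a, b, idf):
    """IDF-modified cosine similarity between two tokenized sentences."""
    ca, cb = Counter(a), Counter(b)
    num = sum(ca[w] * cb[w] * idf[w] ** 2 for w in ca if w in cb)
    norm_a = math.sqrt(sum((ca[w] * idf[w]) ** 2 for w in ca))
    norm_b = math.sqrt(sum((cb[w] * idf[w]) ** 2 for w in cb))
    return num / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lexrank(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Rank sentence indices by PageRank over a similarity graph."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / df[w]) + 1 for w in df}
    # One vertex per sentence; an edge wherever similarity exceeds the
    # threshold -- building this adjacency matrix is the O(n^2) step.
    adj = [[1.0 if i != j and
            idf_cosine(tokenized[i], tokenized[j], idf) > threshold
            else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):  # power iteration for PageRank
        scores = [(1 - damping) / n
                  + damping * sum(scores[j] * adj[j][i] / max(sum(adj[j]), 1.0)
                                  for j in range(n))
                  for i in range(n)]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

In practice a graph library's PageRank routine would replace the hand-rolled power iteration; it is spelled out here only to show where the sentence reference counts enter the ranking.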

Algorithm 2: Luhn's Algorithm
- Give every sentence a weight based on a significance factor:
  $\text{Significance Factor} = \frac{\#\text{ significant words in cluster}}{\#\text{ words in cluster}}$
- Cluster size is determined by treating the first and last occurrence of a significant word in a sentence as the beginning and end of a span, and counting all words within that span
- A word is significant if fewer than 4 insignificant words fall between each repetition of it
- Return the sentences with the highest significance factor
- O(w) complexity, where w is the number of words in the document
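A sketch of this scoring step, under stated assumptions: whitespace tokenization, a small illustrative stop-word list, and a frequency cutoff of 2 for deciding which words are significant (none of these choices are specified by the slide; note also that Luhn's original paper squares the numerator of the ratio, while the plain ratio from the slide is used here).

```python
from collections import Counter

# Illustrative stop-word list (an assumption; any standard list could be used)
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "on", "at"}

def significance_factor(tokens, significant):
    """Best cluster score: (# significant words in span) / (# words in span)."""
    positions = [i for i, w in enumerate(tokens) if w in significant]
    if not positions:
        return 0.0
    best, start, prev, count = 0.0, positions[0], positions[0], 1
    for p in positions[1:]:
        if p - prev <= 5:          # at most 4 insignificant words in between
            prev, count = p, count + 1
        else:                      # gap too large: close the current cluster
            best = max(best, count / (prev - start + 1))
            start, prev, count = p, p, 1
    return max(best, count / (prev - start + 1))

def luhn_summarize(sentences, top_k=2, min_freq=2):
    tokenized = [s.lower().split() for s in sentences]
    freq = Counter(w for toks in tokenized for w in toks if w not in STOPWORDS)
    significant = {w for w, c in freq.items() if c >= min_freq}
    ranked = sorted(range(len(sentences)),
                    key=lambda i: significance_factor(tokenized[i], significant),
                    reverse=True)
    return [sentences[i] for i in ranked[:top_k]]
```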

Algorithm 3: Brute Force/Naïve
- Weight each word in the document by counting the total number of times it is repeated in the document
- Weight each sentence by the average weight of its words:
  $\text{Sentence Weight} = \frac{1}{s}\sum_{i=1}^{s} w_{r,i}$
- Return the sentences with the highest weight
- O(w) complexity
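The naïve scorer fits in a few lines. This sketch assumes whitespace tokenization and no stop-word removal, matching the algorithm as described.

```python
from collections import Counter

def naive_summarize(sentences, top_k=2):
    """Rank sentences by the average document-wide frequency of their words."""
    tokenized = [s.lower().split() for s in sentences]
    # Word weight = total number of occurrences across the whole document
    weight = Counter(w for toks in tokenized for w in toks)
    def sentence_weight(toks):
        return sum(weight[w] for w in toks) / len(toks) if toks else 0.0
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_weight(tokenized[i]), reverse=True)
    return [sentences[i] for i in ranked[:top_k]]
```

Because common words dominate these counts, sentences heavy in stop-words rank artificially high, which is the weakness discussed on the Questions slide.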

Evaluation Method: ROUGE-1 Score
- Run the ROUGE-1 metric on each generated summary against a "gold standard" summary written by us, computing recall and precision:
  $\text{Recall} = \frac{\#\text{ unigrams occurring in both model and gold summaries}}{\#\text{ unigrams in gold summary}}$
  $\text{Precision} = \frac{\#\text{ unigrams occurring in both model and gold summaries}}{\#\text{ unigrams in model summary}}$
- Each individual word is a unigram
- We want to maximize both recall and precision
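The two ratios can be computed directly from token counts. A minimal sketch, assuming whitespace tokenization and clipped (multiset) unigram overlap:

```python
from collections import Counter

def rouge1(model_summary, gold_summary):
    """Return (recall, precision) for unigram overlap between two summaries."""
    m = Counter(model_summary.lower().split())
    g = Counter(gold_summary.lower().split())
    overlap = sum(min(m[w], g[w]) for w in m)   # unigrams occurring in both
    recall = overlap / sum(g.values()) if g else 0.0
    precision = overlap / sum(m.values()) if m else 0.0
    return recall, precision
```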

Evaluation: Determining Loop Value
When collecting data, each algorithm was run in a loop to reduce the impact of program overhead and timer imprecision.

Evaluation: Algorithm Run Times

System Name     Avg. Recall   Avg. Precision
LexRank 50%     0.44444       0.50847
Luhn 50%        0.43704       0.35758
Naïve 50%       0.25926       0.47945
LexRank 100%    0.63704       0.32453
Luhn 100%       0.6           0.24179
Naïve 100%      0.48148       0.38012
LexRank 200%    0.71852       0.20815
Luhn 200%       0.71111       0.16901
Naïve 200%      0.65926       0.19911

System Name     Avg. Recall   Avg. Precision
LexRank 50%     0.20096       0.58333
Luhn 50%        0.22967       0.48485
Naïve 50%       0.2488        0.56522
Luhn 100%       0.41148       0.48315
LexRank 100%    0.4067        0.47753
Naïve 100%      0.45455       0.53371
LexRank 200%    0.58373       0.37888
Luhn 200%       0.57895       0.3467
Naïve 200%      0.5933        0.35028

System Name     Avg. Recall   Avg. Precision
LexRank 50%     0.23243       0.34677
Luhn 50%        0.25946       0.2069
Naïve 50%       0.08108       0.28302
LexRank 100%    0.41081       0.27839
Luhn 100%       0.40541       0.20776
Naïve 100%      0.26486       0.28324
LexRank 200%    0.57838       0.18838
Luhn 200%       0.56757       0.16006
Naïve 200%      0.53514       0.24627

Questions
Q: Why should we have pre-processed the text to remove stop-words in our naïve algorithm?
A: Because stop-words are so common and carry little meaning, they skew sentence weights in favor of stop-words rather than more meaningful words.
Q: What is the difference between extractive and abstractive summarization?
A: Extraction pulls full sentences directly from the text, while abstraction uses machine learning to condense the text in a heuristic manner.
Q: What is the difference between recall and precision?
A: Recall is the ratio of shared unigrams to the number of unigrams in the gold-standard summary; precision is the ratio of shared unigrams to the number of unigrams in the model summary.

Questions Continued
Q: What does PageRank do in the LexRank summary?
A: PageRank determines sentence weight by measuring the number of sentences that reference a given sentence.
Q: Why does Luhn's Algorithm have only O(w) complexity?
A: Because it only counts repetition within each sentence rather than comparing sentences against the document as a whole.