
Timestamped Graphs: Evolutionary Models of Text for Multi-document Summarization
Ziheng Lin and Min-Yen Kan
Department of Computer Science, National University of Singapore, Singapore

Summarization
Traditionally, heuristics for extractive summarization:
– Cue/stigma phrases
– Sentence position (relative to document, section, paragraph)
– Sentence length
– TF×IDF and TF scores
– Similarity (with title, context, query)
With the advent of machine learning, heuristic weights for the different features are tuned by supervised learning.
In the last few years, graphical representations of text have shed new light on the summarization problem.

Prestige as sentence selection
One motivation for using graphical methods was to model the problem as finding the prestige of nodes in a social network:
– PageRank used random walks to smooth the effect of non-local context
– HITS and SALSA modelled hubs and authorities
In summarization, this led to TextRank and LexRank.
Contrast with previous graphical approaches (Salton et al., 1994).
Did we leave anything out of our representation for summarization? Yes: the notion of an evolving network.

Social networks change!
Naturally evolving networks (Dorogovtsev and Mendes, 2001):
– Citation networks: new papers can cite old ones, but the old network is static
– The Web: new pages are added, each with an old page connecting it to the web graph; old pages may update their links

Evolutionary models for summarization
Writers and readers often follow conventional rhetorical styles; articles are not written or read in an arbitrary way.
Consider the evolution of texts using a very simplistic model:
– Writers write from the first sentence of a text onwards
– Readers read from the first sentence of a text onwards
A simple model: sentences get added to the graph incrementally.

Timestamped Graph Construction
Approach:
– These assumptions suggest iteratively adding sentences to the graph in chronological order.
– At each iteration, consider which edges to add to the graph.
– Single document: simple and straightforward; add the 1st sentence, then the 2nd, and so forth, until the last sentence is added.
– Multi-document: treat the cluster as multiple instances of single documents that evolve in parallel; i.e., add the 1st sentences of all documents, then all 2nd sentences, and so forth.
This doesn't really model the chronological ordering between articles; we fix this later (with the skew degree).
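As a concrete illustration of this construction, here is a minimal sketch in Python. It assumes a networkx graph, a generic similarity callable, and one edge per new node per timestep; the function and variable names are illustrative, not the paper's implementation.

```python
import networkx as nx

def build_tsg(docs, similarity, edges_per_step=1):
    """docs: list of documents, each a list of sentence strings in reading order."""
    g = nx.DiGraph()
    max_len = max(len(d) for d in docs)
    for t in range(max_len):
        # Timestep t introduces the t-th sentence of every document in parallel.
        new_nodes = []
        for d_idx, doc in enumerate(docs):
            if t < len(doc):
                node = (d_idx, t)               # (document index, sentence index)
                g.add_node(node, text=doc[t])
                new_nodes.append(node)
        # Each new node links to its most similar nodes already in the graph.
        for node in new_nodes:
            candidates = [v for v in g.nodes if v != node and not g.has_edge(node, v)]
            candidates.sort(key=lambda v: similarity(g.nodes[node]["text"],
                                                     g.nodes[v]["text"]),
                            reverse=True)
            for v in candidates[:edges_per_step]:
                g.add_edge(node, v,
                           weight=similarity(g.nodes[node]["text"], g.nodes[v]["text"]))
    return g
```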

Timestamped Graph Construction
Model:
– Documents as columns: d_i = document i
– Sentences as rows: s_j = the j-th sentence of a document

Timestamped Graph Construction
A multi-document example (diagram: documents doc1, doc2, doc3 as columns; sentences sent1, sent2, sent3 as rows).

An example TSG (figure): DUC 2007 cluster D0703A-A.

Timestamped Graph Construction
The graphs shown so far are just one instance of TSGs; let's generalize and formalize them in terms of edge properties, node properties, and an input text transformation function.
Def: A timestamped graph algorithm tsg(M) is a 9-tuple (d, e, u, f, σ, t, i, s, τ) that specifies a resulting algorithm that takes as input a set of texts M and outputs a graph G.

Edge properties (d, e, u, f)
– Edge direction (d): forward, backward, or undirected
– Edge number (e): number of edges to instantiate per timestep
– Edge weight (u): weighted or unweighted edges
– Inter-document factor (f): penalty factor for links between documents in multi-document sets

Node properties (σ, t, i, s)
– Vertex selection function σ(u, G): one strategy is, among those nodes not yet connected to u in G, to choose the one with the highest similarity to u. Similarity functions: Jaccard, cosine, concept links (Ye et al., 2005).
– Text unit type (t): most extractive algorithms use sentences as the elementary unit
– Node increment factor (i): how many nodes get added at each timestep
– Skew degree (s): models how nodes in multi-document graphs are added; the skew degree is how many iterations to wait before adding the 1st sentence of the next document. Let's illustrate this…
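Below is a small sketch of one such vertex selection function, using the max-cosine strategy and the node/text layout from the earlier construction sketch. The term-frequency cosine is only a stand-in for whichever similarity measure the system actually uses.

```python
import math
from collections import Counter

def cosine(a, b):
    """Term-frequency cosine between two sentences (illustrative similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values())) *
            math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def sigma(u, graph):
    """Among nodes not yet connected to u, return the one most similar to u."""
    candidates = [v for v in graph.nodes if v != u and not graph.has_edge(u, v)]
    if not candidates:
        return None
    return max(candidates,
               key=lambda v: cosine(graph.nodes[u]["text"], graph.nodes[v]["text"]))
```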

Skew Degree Examples
Assume time(d1) < time(d2) < time(d3) < time(d4).
(Diagrams: graphs over documents d1–d4 skewed by 1, skewed by 2, and freely skewed.)
Freely skewed = only add a new document when it would be linked to by some node under the vertex selection function σ.
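One way to read the fixed-skew variants as code: with skew degree s, document j (in chronological order, 0-based) begins contributing its sentences s·j timesteps after the first document. This scheduling sketch is an illustration under that assumption, not the paper's code.

```python
def sentences_at_timestep(t, doc_lengths, skew=1):
    """Return the (document index, sentence index) pairs introduced at timestep t,
    with documents ordered so that time(d1) < time(d2) < ... ."""
    added = []
    for j, length in enumerate(doc_lengths):
        sent_idx = t - skew * j          # document j is delayed by skew * j timesteps
        if 0 <= sent_idx < length:
            added.append((j, sent_idx))
    return added

# Example with three 3-sentence documents, skewed by 1:
# t=0 -> [(0, 0)]
# t=1 -> [(0, 1), (1, 0)]
# t=2 -> [(0, 2), (1, 1), (2, 0)]
```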

Input text transformation function (τ)
– Document segmentation function (τ): a problem observed in some clusters is that some documents in a multi-document cluster are very long. They take many timesteps to introduce all of their sentences, causing too many edges to be drawn.
– τ segments long documents into several sub-documents (diagram: d5 split into d5a and d5b).
– The solution is a bit of a hack – we hope to investigate it more in current and future work.

Timestamped Graph Construction
Representations: we can model a number of different algorithms using this 9-tuple formalism (d, e, u, f, σ, t, i, s, τ):
– The given toy example: (f, 1, 0, 1, max-cosine-based, sentence, 1, 0, null)
– LexRank graphs: (u, N, 1, 1, cosine-based, sentence, Lmax, 0, null), where N = total number of sentences in the cluster and Lmax = the maximum document length; i.e., all sentences are added to the graph in one timestep, each connected to all others, with cosine scores as edge weights.
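To make the formalism concrete, here is a hedged sketch of the 9-tuple as a configuration object, with the two parameterizations above written out. The field names and the encodings (unweighted = False, null transform = None) are my own assumptions, not the paper's notation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TSGConfig:
    direction: str                 # d: "forward", "backward", or "undirected"
    edges_per_step: int            # e: number of edges instantiated per timestep
    weighted: bool                 # u: weighted (True) or unweighted (False) edges
    inter_doc_factor: float        # f: penalty factor for cross-document links
    vertex_selection: str          # sigma: e.g. "max-cosine", "jaccard", "concept-links"
    text_unit: str                 # t: elementary text unit, usually "sentence"
    node_increment: int            # i: nodes added per timestep
    skew_degree: int               # s: delay before the next document starts
    transform: Optional[Callable]  # tau: input segmentation function, or None

# Toy example from the talk: (f, 1, 0, 1, max-cosine-based, sentence, 1, 0, null)
toy_example = TSGConfig("forward", 1, False, 1.0, "max-cosine", "sentence", 1, 0, None)

# LexRank as a degenerate TSG: (u, N, 1, 1, cosine-based, sentence, Lmax, 0, null),
# i.e. all sentences added in one timestep, fully connected, cosine edge weights.
def lexrank_config(n_sentences: int, max_doc_length: int) -> TSGConfig:
    return TSGConfig("undirected", n_sentences, True, 1.0, "cosine",
                     "sentence", max_doc_length, 0, None)
```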

TSG-based summarization
– Methodology
– Evaluation
– Analysis

System Overview
Sentence splitting:
– Detect and mark sentence boundaries
– Annotate each sentence with its document ID and sentence number
– E.g., a document ID prefixed XIE marks an article from Xinhua News (dated 4 March 1998 in this example), and sentence number 14 marks the 14th sentence of that document
Graph construction:
– Construct the TSG in this phase
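A minimal sketch of this preprocessing step, assuming NLTK's sentence tokenizer as a stand-in for whatever splitter the system actually used:

```python
from nltk.tokenize import sent_tokenize

def annotate_sentences(doc_id, text):
    """Split a document into sentences and tag each with its
    document ID and 1-based sentence number."""
    return [{"doc_id": doc_id, "sent_num": i + 1, "text": s}
            for i, s in enumerate(sent_tokenize(text))]
```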

System Overview
Sentence ranking:
– Apply a topic-sensitive random walk on the graph to redistribute the weights of the nodes
Sentence extraction:
– Extract the top-ranked sentences
– Two different modified MMR re-rankers are used, depending on whether it is the main or the update task

Evaluation
Datasets: DUC 2005, 2006, and 2007. Each dataset contains 50 or 45 clusters; each cluster contains a query and 25 documents.
Evaluation tool: ROUGE, an n-gram based automatic evaluation.
We evaluate several parameters:
– Do different e values affect the summarization process?
– How do topic-sensitivity and edge weighting perform in running PageRank?
– How does skewing the graph affect the information flow in the graph?

Evaluation on the number of edges (e)
We tried different e values. Optimal performance: e = 2.
At e = 1, the graph is too loosely connected and not suitable for PageRank, giving very low performance.
At e = N, we obtain a LexRank system.
(Figure: ROUGE performance across values of e.)

Evaluation (other edge parameters)
PageRank: generic vs. topic-sensitive. Edge weight (u): unweighted vs. weighted.
(Table: ROUGE-1 and ROUGE-2 for each combination of topic-sensitivity and edge weighting.)
Optimal performance: topic-sensitive PageRank with weighted edges.

Evaluation on skew degree (s)
Different skew degrees: s = 0, 1, and 2. Optimal performance: s = 1.
s = 2 introduces a delay interval that is too large.
We still need to try freely skewed graphs.
(Table: ROUGE-1 and ROUGE-2 for each skew degree.)

Holistic Evaluation in DUC
We participated in DUC 2007 with an extractive TSG-based system.
Main task: 12th for ROUGE-2, 10th for ROUGE-SU4 among 32 systems.
Update task: 3rd for ROUGE-2, 4th for ROUGE-SU4 among 24 systems.
Used a modified version of maximal marginal relevance to penalize links to previously read articles – an extension of the inter-document factor (f).
The TSG formalism is better tailored to deal with update / incremental text tasks.
A new method that may be competitive with current approaches – other top-scoring systems may do sentence compression, not just extraction.

Conclusion
Proposed a timestamped graph model for text understanding and summarization:
– Adds sentences to the graph one at a time
– Parameterized model with nine variables
– Canonicalizes the representation for several graph-based summarization algorithms
Future work:
– The freely skewed model
– Empirical and theoretical properties of TSGs (e.g., in-degree distribution)

Backup Slides
(25-minute talk total; 26 Apr 2007, 11:50–12:15)

Differences for main and update task processing
Main task:
1. Construct a TSG for the input cluster
2. Run topic-sensitive PageRank on the TSG
3. Apply the first modified version of MMR to extract sentences
Update task:
Cluster A:
– Construct a TSG for cluster A
– Run topic-sensitive PageRank on the TSG
– Apply the second modified version of MMR to extract sentences
Cluster B:
– Construct a TSG for clusters A and B
– Run topic-sensitive PageRank on the TSG; only retain sentences from B
– Apply the second modified version of MMR to extract sentences
Cluster C:
– Construct a TSG for clusters A, B and C
– Run topic-sensitive PageRank on the TSG; only retain sentences from C
– Apply the second modified version of MMR to extract sentences

Sentence Ranking
Once a timestamped graph is built, we want to compute a prestige score for each node.
PageRank: an iterative method that lets the weights of the nodes redistribute until stability is reached.
Similarities as edges → weighted edges; query → topic-sensitive.
(The ranking formula combines a topic-sensitive portion, based on each sentence's similarity to the query Q, with the standard random-walk term over the weighted edges.)
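A hedged sketch of the topic-sensitive random walk as a power iteration: the teleport distribution is built from each sentence's similarity to the query Q instead of being uniform. The damping value, convergence test, and exact placement of the damping factor follow standard PageRank conventions and are not taken from the paper.

```python
import numpy as np

def topic_sensitive_pagerank(W, query_sim, damping=0.85, tol=1e-6, max_iter=100):
    """W[i, j]: weight of the edge from sentence i to sentence j.
    query_sim[i]: similarity of sentence i to the query Q."""
    W = np.asarray(W, dtype=float)
    query_sim = np.asarray(query_sim, dtype=float)
    n = W.shape[0]
    # Row-normalise edge weights into transition probabilities.
    row_sums = W.sum(axis=1, keepdims=True)
    T = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    # Topic-sensitive teleport distribution from query similarities.
    q = query_sim / query_sim.sum() if query_sim.sum() > 0 else np.full(n, 1.0 / n)
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # (1 - d) * topic-sensitive portion  +  d * standard random-walk term
        new_p = (1 - damping) * q + damping * (T.T @ p)
        if np.abs(new_p - p).sum() < tol:
            break
        p = new_p
    return p
```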

Sentence Extraction – Main task
Original MMR: integrates a penalty based on the maximal similarity between the candidate item and any one already-selected item.
Ye et al. (2005) introduced a modified MMR: it integrates a penalty based on the total similarity of the candidate sentence to all selected sentences.
Here Score(s) = the PageRank score of s and S = the set of selected sentences; the penalty sums the candidate's similarity to all previously selected sentences.
This re-ranker is used in the main task.

Sentence Extraction – Update task
The update task assumes readers have already read the previous cluster(s), which implies we should not select sentences that carry information redundant with the previous cluster(s).
We propose a modified MMR for the update task:
– consider the total similarity of the candidate sentence with all selected sentences and with sentences in the previously-read cluster(s)
– P contains some top-ranked sentences from the previous cluster(s); the penalty also includes the candidate's overlap with P
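The two re-rankers can be sketched together, since the update-task variant only adds a penalty term over sentences from the previously-read clusters. The interpolation weight lam and the plain summed penalty follow the slides' description, but the exact weighting in the paper may differ.

```python
def modified_mmr(candidates, score, sim, k, lam=0.7, previous=()):
    """Greedy re-ranking: pick k sentences, trading off PageRank score against
    total similarity to already-selected sentences (main task) and, if given,
    to top-ranked sentences P from previously-read clusters (update task)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(s):
            penalty = (sum(sim(s, t) for t in selected) +
                       sum(sim(s, p) for p in previous))
            return lam * score[s] - (1 - lam) * penalty
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```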

References
Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004.
S.N. Dorogovtsev and J.F.F. Mendes. 2001. Evolution of networks. Submitted to Advances in Physics.
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).
Jon M. Kleinberg. 1998. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen Kan. 2005. NUS at DUC 2005: Understanding documents via concept links. In Proceedings of DUC 2005.