LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
Presented by Nick Janus
Background
- Text summarization: intuitively, what sentences do we want?
- Extractive summarization
  - Chooses a subset of the original document's sentences
  - Better results
- Abstractive summarization
  - Complicated: requires semantic inference and language generation
  - Often uses extractive summarization as a pre-processor
- This presentation is a mix of both!
Problem Statement
- Multi-document text summarization: the documents mostly share an unknown topic
- A cluster of documents is represented by a network; nodes near the center are more salient to the topic
- How are edges defined? How is centrality computed?
- Clustering documents by topic is often noisy
Degree Centrality
- Top-degree nodes are the most important
- Use a bag-of-words model with N words: each sentence/node is encoded as an N-dimensional vector, giving a cosine covariance matrix
- Cosine similarity is used to calculate edge weights
- A threshold is used to eliminate insignificant relationships
- Results in an undirected graph
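As a rough illustration of this step, here is a minimal Python sketch: it scores sentence pairs with an idf-modified cosine (following the paper's definition) and counts neighbors above a threshold. The function names, default threshold, and tokenization are illustrative assumptions, not part of the original slides.

```python
import math
from collections import Counter

def idf_modified_cosine(x, y, idf):
    """idf-modified cosine similarity between two tokenized sentences."""
    tf_x, tf_y = Counter(x), Counter(y)
    num = sum(tf_x[w] * tf_y[w] * idf.get(w, 0.0) ** 2 for w in tf_x if w in tf_y)
    den_x = math.sqrt(sum((tf_x[w] * idf.get(w, 0.0)) ** 2 for w in tf_x))
    den_y = math.sqrt(sum((tf_y[w] * idf.get(w, 0.0)) ** 2 for w in tf_y))
    return num / (den_x * den_y) if den_x and den_y else 0.0

def degree_centrality(sentences, idf, threshold=0.1):
    """Degree of each sentence: how many other sentences are similar to it
    above the threshold (insignificant edges are dropped)."""
    degrees = [0] * len(sentences)
    for i, s_i in enumerate(sentences):
        for j, s_j in enumerate(sentences):
            if i != j and idf_modified_cosine(s_i, s_j, idf) > threshold:
                degrees[i] += 1
    return degrees
```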
Degree Centrality Example
- The threshold has a considerable impact on graph structure and ranking.
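A tiny self-contained example of that sensitivity: thresholding the same similarity matrix at a few values changes both the number of edges and the degree ranking. The matrix values and thresholds below are invented for illustration.

```python
import numpy as np

# Made-up symmetric cosine-similarity matrix for 4 sentences.
sim = np.array([
    [1.00, 0.45, 0.02, 0.30],
    [0.45, 1.00, 0.15, 0.08],
    [0.02, 0.15, 1.00, 0.12],
    [0.30, 0.08, 0.12, 1.00],
])

for t in (0.1, 0.2, 0.3):
    adj = (sim > t) & ~np.eye(len(sim), dtype=bool)  # drop self-loops
    print(f"threshold={t}: edges={adj.sum() // 2}, degrees={adj.sum(axis=0)}")
# threshold=0.1 keeps 4 edges; threshold=0.3 keeps only 1.
```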
LexRank with threshold
- So far, all nodes hold equal votes
- More important sentences should have greater centrality p(u), which they pass on to the sentences they vote for
- The sentence vectors, row-normalized, form a stochastic matrix (a Markov chain over sentences)
- For the chain to converge to a stationary distribution, it must be irreducible and aperiodic
- A damping factor d guarantees this, giving us PageRank:
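The formula on this slide did not survive extraction; a reconstruction of the threshold-based LexRank score, as given in the paper for N sentences and damping factor d, is roughly:

\[
p(u) = \frac{d}{N} + (1 - d)\sum_{v \in \operatorname{adj}(u)} \frac{p(v)}{\deg(v)}
\]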
LexRank with threshold (cont.)
- So how do we calculate LexRank? Algorithm redux:
  1. Get the same cosine covariance matrix as before
  2. Binarize the values of the matrix with the threshold
  3. Normalize each value by its node's degree
  4. Apply the power method until the matrix converges
- The power method returns an eigenvector which contains the scores of all the sentences
- The transition kernel [dU + (1 − d)B] of the resulting Markov chain is a mixture of two kernels U and B, where U is a square matrix with all elements equal to 1/N. A random walker on this Markov chain chooses one of the adjacent states of the current state with probability 1 − d, or jumps to any state in the graph, including the current state, with probability d.
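A minimal power-method sketch of the steps above; the damping constant, tolerance, and function name are assumptions chosen for illustration.

```python
import numpy as np

def lexrank_with_threshold(sim, threshold=0.1, d=0.15, tol=1e-6):
    """Threshold-based LexRank scores via the power method.

    sim: square cosine similarity matrix between sentences.
    Returns the stationary distribution, one score per sentence.
    """
    n = len(sim)
    # 1) Binarize the similarity matrix with the threshold
    #    (self-loops kept so no row is all zeros).
    adj = (sim > threshold).astype(float)
    np.fill_diagonal(adj, 1.0)
    # 2) Normalize each row by the node's degree -> row-stochastic matrix B.
    B = adj / adj.sum(axis=1, keepdims=True)
    # 3) Mix with the uniform kernel U = 1/n so the chain is
    #    irreducible and aperiodic: M = dU + (1 - d)B.
    M = d / n + (1 - d) * B
    # 4) Power method: iterate p <- M^T p until it stops changing.
    p = np.full(n, 1.0 / n)
    while True:
        p_next = M.T @ p
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
```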
Continuous LexRank
- What's the problem with LexRank? The threshold discretizes the similarity values, throwing out information about how strongly sentences are related
- Instead, keep the raw cosine similarity values and normalize them to form a stochastic matrix:
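The normalized formula on this slide is also missing; a reconstruction of the continuous LexRank score from the paper, where the threshold is dropped and edges keep their cosine weights, is roughly:

\[
p(u) = \frac{d}{N} + (1 - d)\sum_{v \in \operatorname{adj}(u)}
\frac{\cos(u, v)}{\sum_{z \in \operatorname{adj}(v)} \cos(z, v)}\, p(v)
\]

where cos denotes the idf-modified cosine from before.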
Performance
- Baseline: centroid-based summarizer, which compares sentences with a centroid meta-sentence containing high-idf-scoring words from the document
- Evaluation setting:
  - Implemented with the MEAD summarization toolkit
  - Document Understanding Conference (DUC) data sets: model summaries and document clusters
  - ROUGE metric: measures unigram co-occurrence
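For intuition about the metric, here is a simplified sketch of ROUGE-1 recall (unigram co-occurrence between a candidate summary and a model summary); real ROUGE handles stemming, stopwords, and multiple references, which are omitted here.

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """Fraction of reference unigrams that also occur in the candidate,
    with counts clipped to the candidate's counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[w]) for w, count in ref.items())
    return overlap / sum(ref.values()) if ref else 0.0

print(rouge_1_recall("the cat sat on the mat", "the cat lay on a mat"))  # ~0.67
```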
DUC Data Sets (curated sets)
Noisy Data Sets (17% noisy documents added)
Summing Up
- Centrality methods may be more resilient when dealing with noisy document sets
- Multi-document case
- Difficult evaluation
- Better performance
- Relation to PageRank