Content Selection: Topics, Graphs, & Supervision

Content Selection: Topics, Graphs, & Supervision Ling573 Systems & Applications April 12, 2016

Roadmap: MEAD: classic end-to-end system; cues to content extraction; Bayesian topic models; graph-based approaches: random walks; supervised selection: term ranking with rich features

MEAD Exemplar centroid-based summarization system (Radev et al., 2000, 2001, 2004): tf-idf similarity measures; multi-document summarizer; publicly available summarization implementation (no warranty); solid performance in DUC evaluations; a standard non-trivial evaluation baseline

Main Ideas Select sentences central to the cluster: cluster-based relative utility, a measure of sentence relevance to the cluster. Select distinct representatives from equivalence classes: cross-sentence information subsumption. A sentence that includes another's information content is said to subsume it, e.g. A) John fed Spot; B) John gave food to Spot and water to the plants: I(B) subsumes I(A). If two sentences mutually subsume each other, they form an equivalence class.

Centroid-based Models Assume clusters of topically related documents, provided by automatic or manual clustering. Centroid: a "pseudo-document of terms with Count * IDF above some threshold". Intuition: centroid terms are indicative of the topic. Count: average # of term occurrences in the cluster; IDF computed over a larger side corpus (e.g. the full AQUAINT collection).

MEAD Content Selection Input: sentence-segmented cluster documents (n sentences) and a compression rate r (e.g. 20%). Output: an n * r sentence summary. Select the highest-scoring sentences based on: centroid score, position score, first-sentence overlap (plus redundancy handling).

Score Computation Score(s_i) = w_c*C_i + w_p*P_i + w_f*F_i. C_i = Σ_w C_{w,i}: sum over the centroid values of the words in sentence i. P_i = ((n-i+1)/n) * C_max: positional score, where C_max is the score of the highest-scoring sentence in the document, scaled by distance from the beginning of the document. F_i = S_1 · S_i: overlap with the first sentence, a TF-based inner product of the sentence with the first sentence of its document. Alternate weighting schemes have been assessed; different papers report different optima.
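
As a concrete illustration, a minimal Python sketch of this scoring scheme. The function name, the default weights, and the use of per-type (rather than per-token) centroid sums are our assumptions; the papers explore several weight settings.

```python
from collections import Counter

def mead_scores(sentences, centroid, w_c=1.0, w_p=1.0, w_f=1.0):
    """Score the sentences of one document with the MEAD-style linear combination.

    sentences: list of token lists, in document order
    centroid:  dict mapping term -> centroid value (Count * IDF above threshold)
    The weights are placeholders; different papers report different optima.
    """
    n = len(sentences)
    # C_i: sum of centroid values of the words in sentence i
    c = [sum(centroid.get(w, 0.0) for w in set(sent)) for sent in sentences]
    c_max = max(c) if c else 0.0
    # P_i: positional score, scaled by distance from the beginning of the document
    p = [((n - i + 1) / n) * c_max for i in range(1, n + 1)]
    # F_i: TF-based inner product of each sentence with the first sentence
    first_tf = Counter(sentences[0])
    f = [sum(first_tf[w] * tf for w, tf in Counter(sent).items()) for sent in sentences]
    return [w_c * ci + w_p * pi + w_f * fi for ci, pi, fi in zip(c, p, f)]
```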

Managing Redundancy Alternative redundancy approaches: Redundancy_max: exclude sentences whose cosine overlap with already-selected sentences exceeds a threshold. Redundancy penalty: subtract a penalty from the computed score, R_s = 2 * (# overlapping words) / (# words in the sentence pair), weighted by the highest-scoring sentence in the set.
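
A tiny sketch of the word-overlap penalty described above (tokenization, and counting word types rather than tokens, are our assumptions):

```python
def redundancy_penalty(sent_a, sent_b):
    """R_s = 2 * (# overlapping words) / (# words in the sentence pair)."""
    overlap = set(sent_a) & set(sent_b)
    return 2 * len(overlap) / (len(sent_a) + len(sent_b))
```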

System and Evaluation Information ordering: chronological by document date. Information realization: pure extraction, no sentence revision. Participated in DUC 2001 and 2003; among the top-5 scoring systems, varying by task and evaluation measure. A solid, straightforward system; publicly available, and it will compute/output its weights.

Bayesian Topic Models Perspective: a generative story for document topics. Multiple models of word probability/topics: general English, the input document set, and individual documents. Select the summary which minimizes KL divergence between the document set and the summary, KL(P_D || P_S), often by greedily selecting sentences; global models also exist.
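
A minimal sketch of KL-driven greedy sentence selection in this spirit, assuming unigram distributions and simple add-k smoothing on the summary side (the smoothing, function names, and stopping criterion are our choices, not taken from the slide):

```python
import math
from collections import Counter

def kl_divergence(p_doc, summary_counts, vocab_size, k=1e-3):
    """KL(P_D || P_S) with add-k smoothing on the summary distribution."""
    total = sum(summary_counts.values()) + k * vocab_size
    return sum(p * math.log(p / ((summary_counts[w] + k) / total))
               for w, p in p_doc.items())

def greedy_kl_summary(sentences, max_sents=5):
    """Greedily add the sentence (a token list) that most reduces KL(P_D || P_S)."""
    doc_counts = Counter(w for s in sentences for w in s)
    n_tokens = sum(doc_counts.values())
    p_doc = {w: c / n_tokens for w, c in doc_counts.items()}
    summary, summ_counts = [], Counter()
    candidates = list(sentences)
    while candidates and len(summary) < max_sents:
        best = min(candidates,
                   key=lambda s: kl_divergence(p_doc, summ_counts + Counter(s),
                                               len(doc_counts)))
        candidates.remove(best)
        summary.append(best)
        summ_counts += Counter(best)
    return summary
```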

Graph-Based Models LexRank (Erkan & Radev, 2004). Key ideas: a graph-based model of sentence saliency, drawing on PageRank and HITS (hubs & authorities); contrasts with straight term-weighting models. Good performance: beats the tf*idf centroid.

Graph View Centroid approach: a central pseudo-document of key words in the cluster. Graph-based approach: sentences (or other units) in the cluster link to each other; a sentence is salient if it is similar to many others, i.e. more central or relevant to the cluster, while low similarity with most others means it is not central.

Constructing a Graph Nodes: sentences. Edges: a measure of similarity between sentences. How do we compute similarity between nodes? Here: tf*idf cosine (other schemes possible). How do we compute overall sentence saliency? Degree centrality or LexRank.

Example Graph

Degree Centrality Centrality: # of neighbors in the graph, where edge(a,b) exists if cosine_sim(a,b) >= threshold. Threshold = 0: fully connected, hence uninformative. Threshold = 0.1 or 0.2: some filtering, can be useful. Threshold >= 0.3: only two connected pairs in the example, also uninformative.
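
A sketch of thresholded degree centrality over a tf*idf cosine graph; the use of scikit-learn and the default threshold value are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def degree_centrality(sentences, threshold=0.2):
    """Degree of each sentence in the thresholded cosine-similarity graph.

    sentences: list of raw sentence strings; edge(a, b) iff cosine_sim(a, b) >= threshold.
    """
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    # Count neighbors above threshold, excluding the self-similarity of 1.0.
    return [(row >= threshold).sum() - 1 for row in sim]
```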

LexRank Degree centrality: 1 edge, 1 vote. Possibly problematic: e.g. with an erroneous document in the cluster, some of its sentences may still score high. LexRank idea: a node can have a high(er) score via high-scoring neighbors. Same idea as PageRank and HITS: a page is ranked high because it is pointed to by high-ranking pages.

Power Method Input: adjacency matrix M. Initialize p_0 (uniform); t = 0. Repeat: t = t + 1; p_t = M^T p_{t-1}; until convergence. Return p_t.
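
A direct NumPy rendering of this procedure (the convergence tolerance and iteration cap are our additions):

```python
import numpy as np

def power_method(M, eps=1e-6, max_iter=1000):
    """Iterate p_t = M^T p_{t-1} from a uniform start until convergence."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        p_next = M.T @ p
        if np.linalg.norm(p_next - p, 1) < eps:
            return p_next
        p = p_next
    return p
```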

LexRank Can think of the matrix X as the transition matrix of a Markov chain, i.e. X(i,j) is the probability of a transition from state i to state j. It will converge to a stationary distribution r, given certain properties (aperiodic, irreducible): the probability of ending up in each state via a random walk. Can be computed iteratively to convergence via the power method, which computes the dominant eigenvector; hence "Lexical PageRank" → "LexRank".
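
Putting the pieces together, a sketch of thresholded LexRank: build a row-stochastic matrix from the thresholded similarity graph, mix in a uniform jump so the chain is irreducible and aperiodic (the damping value is an assumption, not from the slides), and reuse power_method from the sketch above:

```python
import numpy as np

def lexrank(sim, threshold=0.1, damping=0.15):
    """LexRank-style scores from a sentence cosine-similarity matrix `sim`."""
    n = sim.shape[0]
    adj = (sim >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)
    row_sums = adj.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0              # avoid division by zero for isolated nodes
    X = damping / n + (1 - damping) * (adj / row_sums)
    X = X / X.sum(axis=1, keepdims=True)       # renormalize rows (isolated nodes -> uniform)
    return power_method(X)
```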

LexRank Score Example For earlier graph:

Continuous LexRank Basic LexRank ignores similarity scores, except for the initial thresholding of the adjacency matrix. Could instead use the similarity weights directly (rather than binary degree).
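
The continuous variant only changes how the transition matrix is built; a sketch, again reusing power_method and an assumed damping value:

```python
import numpy as np

def continuous_lexrank(sim, damping=0.15):
    """Continuous LexRank: keep the raw cosine weights instead of thresholding."""
    W = sim.astype(float).copy()
    np.fill_diagonal(W, 0.0)
    row_sums = W.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    X = damping / W.shape[0] + (1 - damping) * (W / row_sums)
    return power_method(X / X.sum(axis=1, keepdims=True))
```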

Advantages vs Centroid Captures information subsumption: highly ranked sentences have the greatest overlap with their graph neighbors, so those sentences are promoted. Reduces the impact of spurious high-IDF terms: rare terms get very high weight (IDF dominating TF), which leads to selection of sentences with high-IDF terms; this effect is minimized in LexRank.

Example Results Beat the official DUC 2004 entrants; all versions beat the baselines and the centroid method. Continuous LexRank > LexRank > degree, with variability across systems/tasks. Now a common baseline and component.

Supervised Word Selection RegSumm: Improving the Estimation of Word Importance for News Multi-Document Summarization (Hong & Nenkova, ’14) Key ideas: Supervised method for word selection Diverse, rich feature set: unsupervised measures, POS, NER, position, etc Identification of common “important” words via side corpus of news articles and human summaries

Basic Approach Learn keyword importance; contrasts with unsupervised selection and with learning over sentences. Train a regression over a large number of possible features, with supervision over words: did a document word appear in a summary or not? Greedy sentence selection: take the highest-scoring sentences by average word weight; do not add a sentence if it has >= 0.5 cosine similarity with any currently selected sentence.
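
A sketch of the greedy selection step, assuming word weights already predicted by the regression; whitespace tokenization and the tf*idf representation used for the cosine check are our assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def greedy_select(sentences, word_weights, max_sents=5, max_sim=0.5):
    """Pick sentences by average predicted word weight, skipping near-duplicates."""
    tokenized = [s.split() for s in sentences]
    avg = [sum(word_weights.get(w, 0.0) for w in toks) / max(len(toks), 1)
           for toks in tokenized]
    order = sorted(range(len(sentences)), key=lambda i: avg[i], reverse=True)
    vecs = TfidfVectorizer().fit_transform(sentences)
    chosen = []
    for i in order:
        if len(chosen) == max_sents:
            break
        # Do not add a sentence too similar to any already-selected one.
        if all(cosine_similarity(vecs[i], vecs[j])[0, 0] < max_sim for j in chosen):
            chosen.append(i)
    return [sentences[i] for i in sorted(chosen)]
```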

Features I Unsupervised measures, used as binary features given some threshold: Word probability, count(w)/N, computed over the input cluster. Log-likelihood ratio, with Gigaword as the background corpus. Markov Random Walk (MRW): a graph-based approach similar to LexRank, with words as nodes, edges weighted by the # of syntactic dependencies between words within sentences, and weights computed via the PageRank algorithm.

Features II "Global" word importance. Question: are there words which are intrinsically likely to show up in (news) summaries? Approach: build language models on the NYT corpus of articles + summaries, one model on the articles and one on the summaries. Measures: Pr_A(w), Pr_A(w) - Pr_G(w), Pr_A(w)/Pr_G(w), KL(A||G) = Pr_A(w) * ln(Pr_A(w)/Pr_G(w)), KL(G||A) = Pr_G(w) * ln(Pr_G(w)/Pr_A(w)). Binary features: top-k or bottom-k words under each measure.
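
A sketch of the per-word measures, given the two unigram models as dictionaries; the epsilon floor for unseen words is our addition, and the A/G subscripts simply follow the slide's notation:

```python
import math

def global_word_measures(w, pr_a, pr_g, eps=1e-12):
    """Ratio, difference, and pointwise KL terms for one word under the two LMs."""
    a = max(pr_a.get(w, 0.0), eps)
    g = max(pr_g.get(w, 0.0), eps)
    return {
        "pr_a": a,
        "diff": a - g,
        "ratio": a / g,
        "kl_a_g": a * math.log(a / g),
        "kl_g_a": g * math.log(g / a),
    }
```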

Features III Adaptations of common features: Word position as a proportion of the document [0,1]: earliest first, latest last, average, average first. Word type: POS and NER; emphasizes NNS, NN, capitalization; ORG, PERS, LOC. MPQA and LIWC features. MPQA: sentiment and subjectivity terms; is strong sentiment likely to signal summary content? Not generally. LIWC: words for 64 categories; positive indicators include death, anger, money; negative indicators include pronouns, negations, function words, swear words, adverbs, etc.

Assessment: Words Select the N highest-ranked keywords via regression; compute F-measure over the words in the human summaries, against sets G_i, where i is the # of summaries in which a word appears.

Assessment: Summaries Compare summarization with ROUGE-1, -2, and -4, against both basic systems and state-of-the-art systems.

CLASSY "Clustering, Linguistics and Statistics for Summarization Yield" (Conroy et al., 2000-2011). Highlights: a high-performing system, often ranked 1st in DUC/TAC and a commonly used comparison; a topic-signature-type system (LLR); HMM-based content selection; redundancy handling.

Using LLR for Weighting Compute a weight for all cluster terms: weight(w_i) = 1 if -2 log λ > 10, 0 otherwise. Use that to compute sentence weights. How do we use the weights? One option: directly rank sentences for extraction. LLR-based systems historically perform well, generally better than tf*idf.
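
A sketch of the LLR cutoff weighting, using Dunning's binomial likelihood-ratio statistic (the helper names are ours; the combinatorial terms cancel in the ratio and are omitted):

```python
import math

def _binom_log_l(k, n, p):
    """Binomial log-likelihood, with the probability clipped away from 0 and 1."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr_weight(k_cluster, n_cluster, k_background, n_background, cutoff=10.0):
    """weight(w) = 1 if -2 log lambda > 10, else 0, for one term.

    k_cluster/n_cluster: term count and total token count in the cluster;
    k_background/n_background: the same in the background corpus.
    """
    p1 = k_cluster / n_cluster
    p2 = k_background / n_background
    p = (k_cluster + k_background) / (n_cluster + n_background)
    llr = 2 * (_binom_log_l(k_cluster, n_cluster, p1)
               + _binom_log_l(k_background, n_background, p2)
               - _binom_log_l(k_cluster, n_cluster, p)
               - _binom_log_l(k_background, n_background, p))
    return 1.0 if llr > cutoff else 0.0
```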

HMM Sentence Selection CLASSY strategy: use LLR as a feature in an HMM. How does an HMM map to summarization? Key idea: two classes of states, summary and non-summary. Feature(s): log(#signature terms + 1) (length, position, etc. were also tried), computed over lower-cased, white-space tokenized (a-z), stopword-filtered text. Topology: shown as a diagram on the slide. Select the sentences with the highest posterior probability of being in a "summary" state.
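
A sketch of the observation feature for one sentence, following the preprocessing described above; the regex tokenizer and argument names are our choices:

```python
import math
import re

def hmm_feature(sentence, signature_terms, stopwords):
    """log(# signature terms + 1) over lower-cased, a-z, stopword-filtered tokens."""
    tokens = [t for t in re.findall(r"[a-z]+", sentence.lower()) if t not in stopwords]
    return math.log(sum(1 for t in tokens if t in signature_terms) + 1)
```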

Matrix-based Selection Redundancy-minimizing selection: create a term x sentence matrix in which a term's weight is nonzero if it occurs in the sentence. Loop: select the highest-scoring sentence based on its Euclidean norm, then subtract those components from the remaining sentences, until enough sentences are selected. Effect: selects highly ranked but different sentences; relatively insensitive to weighting schemes.
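
A sketch of this loop as Gram-Schmidt-style pivoting over the term x sentence matrix (the stop-after-k criterion and variable names are ours):

```python
import numpy as np

def matrix_select(term_sentence, k):
    """Pick k column indices: largest Euclidean norm first, then remove that direction."""
    A = term_sentence.astype(float).copy()
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(A, axis=0)
        norms[chosen] = -1.0                   # never re-pick an already selected sentence
        j = int(np.argmax(norms))
        if norms[j] <= 0:
            break                              # nothing informative left
        chosen.append(j)
        q = A[:, j] / norms[j]                 # unit vector for the chosen sentence
        A -= np.outer(q, q @ A)                # subtract its component from all sentences
    return chosen
```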

Combining Approaches Both the HMM and the matrix method select sentences; they can be combined to further improve. Approach: use the HMM method to compute sentence scores (rather than just weight-based scores), which incorporates context information and prior states. Loop: select the highest-scoring sentence, update the matrix scores, exclude sentences whose matrix scores are too low, until enough sentences are found.

Other Linguistic Processing Sentence manipulation (before selection): remove uninteresting phrases based on POS tagging, e.g. gerund clauses, restricted relative clauses/appositives, attributions, and lead adverbs. Coreference handling (Serif system): create coreference chains, then replace all mentions with the longest mention (by # of capitalized words); used only for sentence selection.

Outcomes HMM and Matrix methods: both effective, better combined. Linguistic pre-processing improves results: best ROUGE-1 and ROUGE-2 in DUC. Coreference handling improves further: best ROUGE-3 and ROUGE-4, 2nd in ROUGE-2.