Finding Story Chains in Newswire Articles
Xianshu Zhu and Tim Oates, University of Maryland, Baltimore County
Outline: Introduction and Motivation; Related Work; Random walk-based algorithm for finding story chains; Experimental Results; Conclusion and Future Work
Introduction
Story chain: a set of events linked together. Story: a set of events. Event: a news article. Example: an overview of the O.J. Simpson trial.
Introduction: Information overload!
Introduction
Conventional search engines display unstructured search results, ranked by relevance using keyword-based methods, PageRank, etc. None of these ranking algorithms helps organize search results by the evolution of a story.
Introduction
Limitation of unstructured search results: they miss the big picture on complex stories.
Example: Hurricane Katrina
Too much information; it is hard to know what to look at.
Introduction
Limitation of unstructured search results: search engines have little support for complex queries, and it is hard to find hidden relationships between two events.
Example: How does the O.J. Simpson trial relate to racial issues?
Motivation
Can we build a search tool that automatically finds story chains given a start and an end article as input? Can we design an algorithm that shows how two events are correlated by finding a chain of events that coherently connects them?
A Good Story Chain
Relevance: the articles on the chain should be relevant to the events connecting the two input articles. Coherence: transitions between nodes on the chain should be smooth, with no concept jumping or jittering.
A Good Story Chain
Low redundancy.
A Good Story Chain
Example: 1994/6/12, Nicole Brown and Ronald Goldman are stabbed to death; 1994/6/17, Simpson is arrested for murder.
Related Work
Group articles into events according to the similarity of their contents and time stamps: Nallapati et al., "Event threading within news topics", CIKM'04; Mei et al., "Discovering evolutionary theme patterns from text", KDD'05. Little effort went into presenting the data in a meaningful and coherent manner.
Related Work
Shahaf et al., "Connecting the dots between news articles" (KDD'10). Chain coherence is determined by the weakest link on the chain, and the chain is found by solving a Linear Programming (LP) problem whose objective is to maximize the strength of the weakest link. Drawbacks: (1) not efficient: O(|D|) random walks are needed to calculate word importance, and the LP has O(|D|^2 |W|) variables; (2) it does not address the redundancy issue.
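To make the weakest-link idea concrete, here is a minimal sketch that scores a candidate chain by the minimum strength over consecutive links; the cosine-similarity link measure and the function names are illustrative assumptions, not the LP formulation of Shahaf et al.

```python
import numpy as np

def link_strength(doc_a, doc_b):
    """Illustrative link measure: cosine similarity of two document vectors."""
    denom = float(np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
    return float(doc_a @ doc_b) / denom if denom > 0 else 0.0

def chain_coherence(chain):
    """Coherence of a chain = strength of its weakest (minimum) link."""
    return min(link_strength(a, b) for a, b in zip(chain, chain[1:]))
```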
Our Approach: Random walk-based algorithm with pruning
Efficient and low-redundancy. Starting from the two endpoint articles s and t, the algorithm repeatedly (1) prunes the least relevant articles, (2) inserts an article into the chain, and (3) prunes redundant articles. Stopping rule: no more articles can be added to the chain that increase link strength.
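A minimal sketch of this outer loop, assuming a `link_strength(a, b)` similarity, a `relevance` score, and an `is_redundant` test are supplied; all helper names and the exact insertion rule shown here are assumptions made for illustration, not the precise procedure from the slides.

```python
def build_story_chain(s, t, candidates, link_strength, relevance, is_redundant):
    """Grow a chain from s to t: prune irrelevant candidates, repeatedly insert
    the article that most strengthens its link, and drop redundant articles."""
    # Step 1: prune the least relevant articles up front.
    pool = [d for d in candidates if relevance(d, s, t) > 0.0]
    chain = [s, t]
    while True:
        best_gain, best = 0.0, None
        for d in pool:
            for i in range(len(chain) - 1):
                a, b = chain[i], chain[i + 1]
                old = link_strength(a, b)
                new = min(link_strength(a, d), link_strength(d, b))
                if new - old > best_gain:
                    best_gain, best = new - old, (i + 1, d)
        if best is None:
            # Stopping rule: no article increases the strength of any link.
            return chain
        pos, d = best
        chain.insert(pos, d)        # Step 2: insert the best article.
        pool.remove(d)
        # Step 3: prune articles made redundant by earlier parts of the chain.
        chain = [x for k, x in enumerate(chain)
                 if k in (0, len(chain) - 1) or not is_redundant(x, chain[:k])]
```

The following slides describe each of the three steps and the link strength measure in more detail.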
Compute link strength
Problem with word-based similarity: picking documents similar to s and t works well when s and t are close, but not for complex chains. A random walk approach can help find relationships between documents that have no overlapping words. Example: d1 and d2 are related, both being about the Katrina natural disaster; d3 is also about the tornado and the damage but does not contain any words from d1. (a) d1 and d3 should be correlated; (b) a random walk starting from d1 can find the relationship. A second example pair: "Nicole Brown and Ronald Goldman are stabbed to death" and "Jury panel selected: eight black, one white, one Hispanic, two mixed race."
Compute link strength
Document-word bipartite graph: a document dj that can frequently be reached from di is highly related to di. A random walk on the bipartite graph is used as the measure of document similarity. (Figure: documents d1-d3 and words w1-w4 with edge weights normalized into transition probabilities, e.g. 0.2/(0.2+0.2) = 0.5.)
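A minimal sketch of this document-word random walk, using a toy weight matrix (the numbers below are illustrative, not the exact values from the slide's figure): row-normalizing gives document-to-word transition probabilities, column-normalizing gives word-to-document ones, and one document-to-document step of the walk multiplies the two.

```python
import numpy as np

# Toy document-word edge weights (rows d1..d3, columns w1..w4).
W = np.array([
    [0.2, 0.7, 0.0, 0.0],   # d1
    [0.0, 0.1, 0.4, 0.0],   # d2
    [0.0, 0.0, 0.6, 0.8],   # d3
])

P_dw = W / W.sum(axis=1, keepdims=True)        # doc -> word transitions
P_wd = (W / W.sum(axis=0, keepdims=True)).T    # word -> doc transitions

# Probability of reaching d_j from d_i in one doc -> word -> doc step:
# documents that are frequently reached are treated as highly related.
P_dd = P_dw @ P_wd
print(np.round(P_dd, 3))
```

Note that a single step still gives d1 and d3 zero similarity because they share no words; taking more steps (e.g. `P_dd @ P_dd`) lets the walk bridge them through d2, which is the behaviour the slide motivates.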
Step 1: Prune least relevant articles
Apriori principle: any subset of a frequent itemset must be frequent.
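The slide only names this step and the Apriori principle; as a heavily hedged sketch, one plausible pruning rule keeps an article only if the random walk reaches it from at least one endpoint with non-negligible probability. The `reach_prob` function, the threshold, and the exact criterion below are assumptions, not taken from the slides.

```python
def prune_least_relevant(candidates, s, t, reach_prob, threshold=0.01):
    """Illustrative pruning rule: keep articles that the random walk reaches
    from either endpoint with probability above a threshold. In the spirit of
    Apriori-style pruning, articles failing this weak test are assumed unable
    to appear on a strong chain and are discarded before any chain is built."""
    return [d for d in candidates
            if reach_prob(s, d) >= threshold or reach_prob(t, d) >= threshold]
```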
Step 2: Insert an article into the chain
Goal: find the best article to add to a link so as to improve the strength of that link.
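Isolating just the selection rule from the sketch above: for a single link (a, b), one reasonable reading of "improve the strength of that link" is to pick the candidate that maximizes the weaker of the two new links. The function and argument names are assumptions.

```python
def best_insertion(a, b, candidates, link_strength):
    """For the link (a, b), return the candidate d that maximizes the strength
    of the weaker of the two new links (a, d) and (d, b), or None if empty."""
    return max(candidates,
               key=lambda d: min(link_strength(a, d), link_strength(d, b)),
               default=None)
```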
Step 3: Prune redundant articles
Step 3: Prune redundant articles
A random walk started from a document node is more likely to reach articles that are in the same time bin and close in content. A hierarchical method is used to construct the transition matrix. (Figure: documents d1-d3 and words w1-w4 grouped into time bins t1 and t2, with edge weights on the bipartite graph.)
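The slide names the hierarchical construction without giving details; purely as an illustrative guess, the sketch below builds a transition matrix over document, word, and time-bin nodes in which each document splits its outgoing probability between its words and its time bin, so walks tend to stay within a bin. The split parameter `beta` and the uniform bin-to-document transitions are assumptions.

```python
import numpy as np

def hierarchical_transition_matrix(doc_word, doc_bin, beta=0.5):
    """Row-stochastic matrix over [documents | words | time bins].
    Each document sends probability (1 - beta) to its words and beta to its
    time bin; words transition back to documents; each time bin transitions
    uniformly to the documents it contains. Purely an illustrative guess at
    the slide's 'hierarchical' construction."""
    doc_bin = np.asarray(doc_bin)
    n_docs, n_words = doc_word.shape
    n_bins = int(doc_bin.max()) + 1
    n = n_docs + n_words + n_bins
    P = np.zeros((n, n))
    dw = doc_word / doc_word.sum(axis=1, keepdims=True)        # doc -> word
    wd = (doc_word / doc_word.sum(axis=0, keepdims=True)).T    # word -> doc
    P[:n_docs, n_docs:n_docs + n_words] = (1 - beta) * dw
    P[n_docs:n_docs + n_words, :n_docs] = wd
    for d, b in enumerate(doc_bin):
        P[d, n_docs + n_words + b] = beta                      # doc -> its bin
    for b in range(n_bins):
        members = np.flatnonzero(doc_bin == b)
        P[n_docs + n_words + b, members] = 1.0 / len(members)  # bin -> its docs
    return P
```

A walk on such a matrix that starts at a document tends to return to documents in the same time bin (via the bin node) or with similar vocabulary (via the word nodes), which matches the behaviour the slide describes.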
Step 3: Prune redundant articles
Extended random walk formula (a random walk with restart): p^(k+1) = (1 − c) · P^T · p^(k) + c · v, where v is a vector indicating which nodes the random walk will jump to after a restart.
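A minimal sketch of the random walk with restart that this formula describes; the function name, the restart probability `c`, and the convergence parameters are illustrative assumptions.

```python
import numpy as np

def random_walk_with_restart(P, v, c=0.15, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - c) * P.T @ p + c * v until the change is tiny.
    P is a row-stochastic transition matrix over all graph nodes; v is the
    restart distribution (the nodes the walk jumps back to after a restart)."""
    p = v.astype(float).copy()
    for _ in range(max_iter):
        p_next = (1 - c) * P.T @ p + c * v
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p
```

Restarting concentrates probability mass around the nodes selected by v, so choosing v (for example, the documents of a given time bin) is presumably what the slide means by extending the standard walk.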
Experimental Results
Goal: can our algorithm produce good story chains efficiently? How do different pruning methods affect story chain construction and the efficiency of the algorithm?
Evaluation method: Amazon's Mechanical Turk (MTurk); workers rate each story chain from best (5) to worst (1) on relevance, coherence, coverage, and redundancy.
Experimental Results
News stories: the O.J. Simpson trial, Hurricane Katrina, and the 2011 earthquake in Japan.
Data sources: the "North American News Text" corpus from the LDC (Linguistic Data Consortium), the "New York Times Annotated Corpus" from the LDC, and news articles crawled from multiple sources (New York Times, Washington Post, Reuters, CNN, etc.).
Experimental Results
Articles containing the keyword "O.J. Simpson". Start: "O.J. Simpson's Ex-Wife Found Dead in Double Homicide" (06/13/94). End: "O.J. Simpson Jury Reaches Verdict" (10/02/95).
Experimental Results
Experimental Results: random walk without any pruning.
Experimental Results: with two pruning methods.
Experimental Results
Conclusion and Future Work
A random walk-based algorithm for finding story chains. Experiments show that the algorithm can generate coherent story chains with no redundancy and with low computational complexity. Future work: detect and find story chains with different branches.
Questions? Thank you!