Finding Story Chains in Newswire Articles Xianshu Zhu, Tim Oates University of Maryland, Baltimore County 11/15/2018
Outline: Introduction and Motivation; Related Work; Random walk-based algorithm for finding story chains; Experimental Results; Conclusion and Future Work
Introduction: Event: a news article. Story: a set of events. Story chain: a set of events linked together. [Figure: overview of the O.J. Simpson trial]
Introduction: Information overload!
Introduction: Conventional search engines display unstructured search results. Results are ranked by relevance using keyword-based ranking methods, PageRank, etc. None of these ranking algorithms helps organize the search results by the evolution of the story.
Introduction: Limitation of unstructured search results: missing the big picture on complex stories.
Example: Hurricane Katrina. There is too much information, and the reader does not know what to look at.
Introduction: Limitations of unstructured search results: search engines have little support for complex queries, and it is hard to find hidden relationships between two events.
Example: How does the O.J. Simpson trial relate to racial issues?
Motivation: Can we build a search tool that automatically finds story chains given start and end articles as input? That is, an algorithm that shows how two events are correlated by finding a chain of events that coherently connects them?
A Good Story Chain: Relevance: the articles on the chain should be relevant to the events connecting the two articles. Coherence: the transitions between nodes on the chain should be smooth, with no concept jumping or jittering.
A Good Story Chain: Low redundancy
A Good Story Chain: Example: 1994/6/12: Nicole Brown and Ronald Goldman are stabbed to death. 1994/6/17: Simpson is arrested for murder.
Related Work: Group articles into events according to the similarity of their contents and time stamps. Nallapati et al., “Event threading within news topics”, CIKM’04. Mei et al., “Discovering evolutionary theme patterns from text”, KDD’05. Little effort went into presenting the data in a meaningful and coherent manner.
Related Work: Shahaf et al., “Connecting the dots between news articles”, KDD’10. Chain coherence is determined by the weakest link on the chain. The story chain is found by solving a Linear Programming (LP) problem. Objective: maximize the strength of the weakest link. Drawbacks: (1) Not efficient: O(|D|) random walks are needed to calculate word importance, and the LP has O(|D|^2 |W|) variables. (2) Does not address the redundancy issue.
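The weakest-link notion of coherence can be illustrated with a small sketch. It is not the LP formulation of Shahaf et al.; it only scores a few hand-written candidate chains by their weakest link and keeps the best one. The function name `chain_coherence` and all link strengths are hypothetical.

```python
from typing import Dict, List, Tuple


def chain_coherence(chain: List[str],
                    strength: Dict[Tuple[str, str], float]) -> float:
    """Coherence of a chain = strength of its weakest link."""
    links = list(zip(chain, chain[1:]))
    if not links:
        raise ValueError("a chain needs at least two articles")
    return min(strength[link] for link in links)


# Hypothetical link strengths between articles (illustrative values only).
strength = {
    ("s", "a"): 0.6, ("a", "t"): 0.4,
    ("s", "b"): 0.9, ("b", "t"): 0.1,
}

candidates = [["s", "a", "t"], ["s", "b", "t"]]

# Max-min choice: prefer the chain whose weakest link is strongest.
best = max(candidates, key=lambda c: chain_coherence(c, strength))
print(best)
```

The chain through `b` has one very strong link but is penalized for its weak second link, which is the behavior the max-min objective is meant to capture.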
Our Approach: Random walk-based algorithm with pruning. Efficient. Low redundancy. Steps: prune least relevant articles; insert an article into the chain; prune redundant articles. [Figure: nodes s, t]
Our Approach: Random walk-based algorithm with pruning. Efficient. Low redundancy. Steps: prune least relevant articles; insert an article into the chain; prune redundant articles. [Figure: nodes A, s, t]
Our Approach: Random walk-based algorithm with pruning. Efficient. Low redundancy. Steps: prune least relevant articles; insert an article into the chain; prune redundant articles. Stopping rule: stop when no article can be added to the chain that increases link strength. [Figure: nodes A, B, C, s, t]
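The three steps and the stopping rule can be sketched as a greedy loop. This is a minimal illustration under several assumptions that are not stated on the slides: relevance is taken as the sum of an article's link strengths to s and t, an inserted article improves a link when the weaker of the two new links is stronger than the old link, and redundancy is a fixed threshold on link strength. The names `link`, `build_chain`, `keep` and `redundant`, and all numbers, are hypothetical.

```python
from typing import Dict, FrozenSet, List

Sim = Dict[FrozenSet[str], float]


def link(sim: Sim, a: str, b: str) -> float:
    """Hypothetical symmetric link strength; unknown pairs count as 0."""
    return sim.get(frozenset((a, b)), 0.0)


def build_chain(s: str, t: str, articles: List[str], sim: Sim,
                keep: int = 3, redundant: float = 0.95) -> List[str]:
    # Step 1 (assumed rule): keep only the `keep` articles most related to s and t.
    pool = sorted(articles,
                  key=lambda d: link(sim, s, d) + link(sim, d, t),
                  reverse=True)[:keep]
    chain = [s, t]
    while True:
        best_gain, best = 0.0, None
        # Step 2: look for the insertion that most improves some link.
        for i, (u, v) in enumerate(zip(chain, chain[1:])):
            for a in pool:
                gain = min(link(sim, u, a), link(sim, a, v)) - link(sim, u, v)
                if gain > best_gain:
                    best_gain, best = gain, (i + 1, a)
        # Stopping rule: no insertion increases link strength.
        if best is None:
            return chain
        pos, a = best
        chain.insert(pos, a)
        # Step 3 (assumed rule): drop the inserted article and its near-duplicates.
        pool = [d for d in pool if d != a and link(sim, d, a) < redundant]


pairs = {
    ("s", "t"): 0.1, ("s", "a"): 0.6, ("a", "t"): 0.3,
    ("s", "b"): 0.2, ("a", "b"): 0.5, ("b", "t"): 0.6,
    ("s", "c"): 0.6, ("c", "t"): 0.25, ("a", "c"): 0.97,
    ("s", "x"): 0.05, ("x", "t"): 0.05,
}
sim = {frozenset(p): v for p, v in pairs.items()}
chain = build_chain("s", "t", ["a", "b", "c", "x"], sim)
print(chain)
```

In this toy data `x` is pruned as least relevant, `c` is pruned as a near-duplicate of `a`, and the loop stops once no remaining article strengthens any link.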
Compute link strength: Problem with word-based similarity: picking documents similar to s and t works well when s and t are close, but not for complex chains. Example articles: “Nicole Brown and Ronald Goldman are stabbed to death” and “Jury panel selected: eight black, one white, one hispanic, two mixed race”. A random walk approach can help find relationships between documents that have no overlapping words. Example: d1 and d2 are related: both are about the Katrina natural disaster. d2 is also about tornado and damage, but these do not include any words in d1. (a) d1 and d3 should be correlated; (b) a random walk starting from d1 can find the relationship. [Figure: documents d1, d2, d3 and the words Katrina, hurricane, tornado, storm, damage]
Compute link strength: Document-word bipartite graph. A random walk is used as a measure of document similarity: a document dj that is frequently reached from di is highly related to di. [Figure: bipartite graph of documents d1-d3 and words w1-w4 with weighted edges. Transition probabilities are edge weights normalized at each node, e.g. .2/(.2+.2) = .5, .4/.4 = 1, .7/(.7+.6+.8) = .33]
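A random walk on a document-word bipartite graph can be sketched as follows. This is a minimal illustration, not the authors' implementation: the edge weights, the restart probability of 0.15, the fixed number of iterations, and the name `random_walk_scores` are all assumptions. Transition probabilities are edge weights normalized at each node, and the walk restarts at the start document.

```python
from collections import defaultdict
from typing import Dict


def random_walk_scores(doc_words: Dict[str, Dict[str, float]],
                       start: str,
                       restart: float = 0.15,
                       steps: int = 50) -> Dict[str, float]:
    """Approximate visiting probabilities of documents for a walk restarting at `start`."""
    # Build the reverse (word -> document) edges of the bipartite graph.
    word_docs: Dict[str, Dict[str, float]] = defaultdict(dict)
    for doc, words in doc_words.items():
        for word, weight in words.items():
            word_docs[word][doc] = weight
    # Document and word names are assumed to be distinct.
    graph = {**doc_words, **word_docs}

    prob = {node: 0.0 for node in graph}
    prob[start] = 1.0
    for _ in range(steps):
        nxt = {node: 0.0 for node in graph}
        for node, mass in prob.items():
            total = sum(graph[node].values())
            if total == 0:
                continue
            for neighbor, weight in graph[node].items():
                # Move along an edge with probability proportional to its weight.
                nxt[neighbor] += (1 - restart) * mass * weight / total
        nxt[start] += restart  # jump back to the start document
        prob = nxt
    return {doc: prob[doc] for doc in doc_words}


# Hypothetical weights: d1 and d3 share no words but are connected through d2.
doc_words = {
    "d1": {"katrina": 1.0, "hurricane": 1.0},
    "d2": {"hurricane": 1.0, "storm": 1.0, "tornado": 1.0},
    "d3": {"tornado": 1.0, "damage": 1.0},
}
scores = random_walk_scores(doc_words, "d1")
print(scores)
```

Although `d1` and `d3` have no word in common, `d3` still receives a nonzero score because the walk reaches it through `d2`, which is the effect that word-overlap similarity misses.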
Step 1: Prune least relevant articles. Apriori principle: any subset of a frequent itemset must be frequent.
Step 2: Insert an article into the chain. Goal: find the best article to add to a link so as to improve the strength of that link.
Step 3: Prune redundant articles
Step 3: Prune redundant articles. A random walk starting from a document node is more likely to reach articles that are in the same time bin and close in content. A hierarchical method is used to construct the transition matrix. [Figure: graph with time bins t1 and t2, documents d1-d3, and words w1-w4, with weighted edges]
Step 3: Prune redundant articles. Extended random walk formula: [formula missing], where [symbol missing] is a vector indicating which nodes the random walk will jump to after a restart. [Figure: graph with time bins t1 and t2, documents d1-d3, and words w1-w4, with weighted edges]
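For orientation, a commonly used form of a random walk with restart is given below. It is the standard textbook update, offered as an assumption about the general shape of the formula rather than as the authors' exact extension. All symbols are introduced here for illustration: P is a row-stochastic transition matrix over the time, document and word nodes, r^(k) is the vector of visiting probabilities after k steps, c is the restart probability, and e is the restart vector (non-negative, summing to 1) indicating which nodes the walk jumps to after a restart.

```latex
\mathbf{r}^{(k+1)} = (1 - c)\, P^{\top} \mathbf{r}^{(k)} + c\, \mathbf{e}
```

With e concentrated on a single document, the stationary vector ranks the other nodes by how easily they are reached from that document.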
Experimental Results: Goal: Can our algorithm produce good story chains efficiently? How do different pruning methods affect story chain construction and the efficiency of the algorithm? Evaluation method: Amazon’s Mechanical Turk (MTurk). Workers rate each story chain from best (5) to worst (1) on relevance, coherence, coverage and redundancy.
Experimental Results: News stories: the O.J. Simpson trial; Hurricane Katrina; the 2011 earthquake in Japan. Data sources: “North American News Text” from LDC (Linguistic Data Consortium); “New York Times Annotated Corpus” from LDC; news articles crawled from multiple sources (New York Times, Washington Post, Reuters, CNN, etc.).
Experimental Results: Over 10,000 articles contain the keyword “O.J. Simpson”. Start: O.J. Simpson’s Ex-Wife Found Dead in Double Homicide (06/13/94). End: O.J. Simpson Jury Reaches Verdict (10/02/95).
Experimental Results
Experimental Results: Random walk without any pruning
Experimental Results: With two pruning methods
Experimental Results
Conclusion and Future Work: A random walk-based algorithm for finding story chains. Experiments show that the algorithm can generate coherent story chains with no redundancy and with low computational complexity. Future work: detect and find story chains with different branches.
Questions? Thank you!