Finding Story Chains in Newswire Articles

Presentation transcript:

Finding Story Chains in Newswire Articles Xianshu Zhu, Tim Oates University of Maryland, Baltimore County 11/15/2018

Outline: Introduction and Motivation; Related Work; Random walk-based story chain finding algorithm; Experimental Results; Conclusion and Future Work.

Introduction: Story chain: a set of events linked together. Story: a set of events. Event: a news article. Example: overview of the O.J. Simpson trial.

Introduction: Information overload!

Introduction: Conventional search engines display unstructured search results, ranked by relevance (keyword-based ranking methods, PageRank, etc.). None of these ranking algorithms helps organize the search results by the evolution of the story.

Introduction: Limitation of unstructured search results: the big picture on complex stories is missing.

Example: Hurricane Katrina. There is too much information, and the user does not know what to look at.

Introduction: Limitations of unstructured search results: search engines have little support for complex queries, and it is hard to find hidden relationships between two events.

Example: How does the O.J. Simpson trial relate to racial problems?

Motivation: Can we build a search tool that automatically finds story chains given start and end articles as input? That is, an algorithm that finds out how two events are correlated by finding a chain of events that coherently connects them.

A Good Story Chain: Relevance: the articles on the chain should be relevant to the events connecting the two articles. Coherence: the transition between nodes on the chain should be smooth, with no concept jumping or jittering.

A Good Story Chain: Low redundancy.

A Good Story Chain: 1994/6/12: Nicole Brown and Ronald Goldman are stabbed to death. 1994/6/17: Simpson arrested for murder.

Related Work: Prior work groups articles into events according to the similarity of their contents and time stamps: Nallapati et al., “Event threading within news topics”, CIKM’04; Mei et al., “Discovering evolutionary theme patterns from text”, KDD’05. Little effort went into presenting the data in a meaningful and coherent manner.

Related Work: Shahaf et al., “Connecting the dots between news articles” (KDD’10). Chain coherence is determined by the weakest link on the chain. The story chain is found by solving a Linear Programming (LP) problem whose objective is to maximize the strength of the weakest link. Drawbacks: (1) Not efficient: O(|D|) random walks are needed to calculate word importance, and the LP has O(|D|^2 |W|) variables. (2) Does not address the redundancy issue.
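The weakest-link notion of coherence can be illustrated with a minimal sketch. It only shows the objective (chain strength equals the minimum link strength), not the LP formulation of Shahaf et al.; the function name and the example link strengths are hypothetical.

```python
def chain_strength(link_strengths):
    """Strength of a chain under the weakest-link view: the minimum link strength."""
    if not link_strengths:
        raise ValueError("a chain needs at least one link")
    return min(link_strengths)


# Hypothetical link strengths for two candidate chains.
chain_a = [0.9, 0.2, 0.8]  # strong on average, but one weak transition
chain_b = [0.5, 0.5, 0.5]  # uniformly moderate transitions

# Maximizing the weakest link prefers chain_b over chain_a.
best = max([chain_a, chain_b], key=chain_strength)
```

Under this objective a single weak transition dominates the score, which is why a chain with a high average strength can still lose.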

Our approach: A random walk-based algorithm with pruning. It is efficient and produces chains with low redundancy. Steps: (1) prune the least relevant articles; (2) insert an article into the chain; (3) prune redundant articles. Stopping rule: stop when no more articles can be added to the chain that increase link strength. [Figure: the chain between s and t as articles A, B, and C are added.]
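The insertion step and the stopping rule can be sketched as a greedy loop. This is one possible reading, not the authors' implementation: both pruning steps are omitted, `build_chain` and `toy_strength` are hypothetical names, and the toy strength function (articles as points on a line) stands in for the random walk-based link strength.

```python
def build_chain(s, t, candidates, link_strength):
    """Greedily insert articles between s and t while some link gets stronger."""
    chain = [s, t]
    remaining = set(candidates)
    while remaining:
        best = None  # (gain, insert position, article)
        for i in range(len(chain) - 1):
            a, b = chain[i], chain[i + 1]
            old = link_strength(a, b)
            for c in sorted(remaining):  # sorted for deterministic behavior
                # Inserting c replaces link (a, b) with (a, c) and (c, b);
                # the weaker of the two new links must beat the old link.
                new = min(link_strength(a, c), link_strength(c, b))
                gain = new - old
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, i + 1, c)
        if best is None:
            break  # stopping rule: no insertion increases link strength
        _, position, article = best
        chain.insert(position, article)
        remaining.remove(article)
    return chain


def toy_strength(a, b):
    """Hypothetical stand-in: closer points are more strongly linked."""
    return 1.0 / (1.0 + abs(a - b))


chain = build_chain(0, 10, [5, 2, 20], toy_strength)
```

In this toy run the article at 20 is never inserted, because placing it on any link would create a weaker link than the one it replaces.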

Compute link strength: Problem with word-based similarity: picking documents similar to s and t works well when s and t are close, but not for complex chains (for example, from “Nicole Brown and Ronald Goldman are stabbed to death” to “Jury panel selected: eight black, one white, one Hispanic, two mixed race”). A random walk approach can help find relationships between documents that have no overlapping words. [Figure: documents d1, d2, d3 linked to the words Katrina, hurricane, tornado, storm, damage.] d1 and d2 are related: both are about the Katrina natural disaster. d3 is also about tornado and damage, but does not contain any words in d1. (a) d1 and d3 should be correlated. (b) A random walk starting from d1 can find the relationship.

Compute link strength: Document-word bipartite graph. A random walk is used as a measure of document similarity: a document dj that can be frequently reached from di is highly related to di. Transition probabilities are obtained by normalizing the edge weights at each node, e.g. .2/(.2+.2) = .5, .4/.4 = 1, .7/(.7+.6+.8) = .33. [Figure: bipartite graph with documents d1 to d3, words w1 to w4, and weighted edges.]
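A random walk over a document-word bipartite graph can be sketched in a few lines of plain Python. The sketch below uses a random walk with restart; the restart probability, the iteration count, the function name, and the toy documents (loosely following the Katrina example, plus an unrelated d4) are all assumptions made for illustration.

```python
def random_walk_with_restart(doc_words, start, restart=0.15, iterations=100):
    """Return, for every document, the probability of being visited from `start`."""
    # Build an undirected weighted bipartite graph: documents <-> words.
    neighbors = {}
    for doc, words in doc_words.items():
        for word, weight in words.items():
            neighbors.setdefault(("doc", doc), {})[("word", word)] = weight
            neighbors.setdefault(("word", word), {})[("doc", doc)] = weight

    start_node = ("doc", start)
    prob = {node: 0.0 for node in neighbors}
    prob[start_node] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in neighbors}
        for node, mass in prob.items():
            if mass == 0.0:
                continue
            total = sum(neighbors[node].values())
            for nb, weight in neighbors[node].items():
                # Follow an edge with probability proportional to its weight.
                nxt[nb] += (1 - restart) * mass * weight / total
        nxt[start_node] += restart  # jump back to the start document
        prob = nxt
    return {name: p for (kind, name), p in prob.items() if kind == "doc"}


# Hypothetical toy corpus: d1 and d3 share no words, d4 is unrelated.
docs = {
    "d1": {"katrina": 1.0, "hurricane": 1.0},
    "d2": {"hurricane": 1.0, "tornado": 1.0, "storm": 1.0},
    "d3": {"tornado": 1.0, "storm": 1.0, "damage": 1.0},
    "d4": {"election": 1.0, "vote": 1.0},
}
scores = random_walk_with_restart(docs, "d1")
```

Here d3 receives a nonzero score even though it shares no words with d1, because the walk reaches it through d2, while the disconnected d4 scores zero.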

Step 1: Prune least relevant articles. Apriori principle: any subset of a frequent itemset must be frequent.

Step 2: Insert an article into the chain. Goal: find the best article to add to a link so as to improve the strength of that link.

Step 3: Prune redundant articles.

Step 3: Prune redundant articles. A random walk that starts from a document node will be more likely to reach articles that are in the same time bin and close in content. A hierarchical method is used to construct the transition matrix. [Figure: time nodes t1 and t2 on a time axis, linked to documents d1 to d3, which are linked to words w1 to w4 by weighted edges.]

Step 3: Prune redundant articles. Extended random walk formula: [formula missing], where [symbol missing] is a vector indicating which nodes the random walk will jump to after a restart. [Figure: time nodes t1 and t2, documents d1 to d3, and words w1 to w4 with weighted edges.]
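For reference, the standard random walk with restart recurrence has the following form. Whether the slide's extended formula matches it exactly is an assumption; the symbols here (visiting-probability vector r, column-normalized transition matrix W-tilde, restart probability c, restart vector e) are chosen for illustration.

```latex
\mathbf{r}^{(k+1)} = (1 - c)\,\tilde{W}\,\mathbf{r}^{(k)} + c\,\mathbf{e}
```

In this form, e plays the role of the vector that indicates which nodes the walk jumps to after a restart, and the iteration is repeated until r stops changing.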

Experimental Results: Goals: Can our algorithm produce good story chains efficiently? How do different pruning methods affect story chain construction and the efficiency of the algorithm? Evaluation method: Amazon’s Mechanical Turk (MTurk). Workers rate the story chains from best (5) to worst (1) by relevance, coherence, coverage, and redundancy.

Experimental results: News stories: the O.J. Simpson trial; Hurricane Katrina; the 2011 earthquake in Japan. Data sources: “North American News Text” from LDC (Linguistic Data Consortium); “New York Times Annotated Corpus” from LDC; news articles crawled from multiple sources (NYTimes, Washington Post, Reuters, CNN, etc.).

Experimental results: Over 10,000 articles contain the keyword “O.J. Simpson”. Start: O.J. Simpson’s Ex-Wife Found Dead in Double Homicide (06/13/94). End: O.J. Simpson Jury Reaches Verdict (10/02/95).

Experimental Results

Experimental results: Random walk without any pruning.

Experimental results: With two pruning methods.

Experimental Results

Conclusion and Future Work: We presented a random walk-based algorithm for finding story chains. Experiments show that the algorithm can generate coherent story chains with no redundancy and with low computational complexity. Future work: detect and find story chains with different branches.

Questions? Thank you!