On Burstiness-Aware Search for Document Sequences
Theodoros Lappas, Benjamin Arai, Manolis Platakis, Dimitrios Kotsakos, Dimitrios Gunopulos
SIGKDD 2009

Presentation transcript:


Outline
• The Problem: how to effectively search through large document sequences (e.g. newspapers)
• Previous Work
• Using Bursty Terms to Identify Events
• Modeling Burstiness Using Discrepancy Theory
• Our Search Framework
• Experiments

The Problem
• Given a large sequence of documents (e.g. a daily newspaper) and a query of terms, find documents that discuss major events relevant to the query.
• Consider the San Francisco Call: a daily newspaper from the 1900s.
• We are given the query
• Two candidate events, relevant to the query:
  – The disastrous fire of 1903 in the Iroquois Theater in Chicago
  – A disastrous performance given by an actor in a local theater
• Clearly the first event is far more influential: articles on this event should be ranked higher!

Previous Work
• Burstiness has been explored in different domains:
  – Burst Detection - Kleinberg 2002
  – Stream Clustering - He et al.
  – Graph Evolution - Kumar et al.
  – Event Detection - Fung et al.
• Nothing on burstiness-aware search:
  – Standard Information Retrieval techniques do not consider the underlying events discussed in the collection.
  – Event Detection techniques do not consider user input.

Burstiness
• Bursty periods: periods of “unusually” high frequency
• Unusual?
  – Deviating from an expected baseline.
• Major events are discussed in numerous articles for an extended timeframe.
• The event’s keywords exhibit high-frequency bursts during the timeframe.
• Figure: Frequency of the term “earthquake”, as it appeared in the SF Call, ( ).

Modeling Burstiness using Discrepancy Theory
• Discrepancy: used to express and quantify the deviation from the norm.
• In our case: find intervals on the timeline where the observed frequency differs the most from the expected frequency.
• Maximal Interval: one that does not include, and is not included in, an interval of higher score.
• MAX-1: linear-time algorithm for Maximal Interval extraction.
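To make the idea of a highest-discrepancy interval concrete, here is a minimal sketch. It is not the MAX-1 algorithm from the talk, whose details the slides do not give. It assumes that the score of an interval is the sum of (observed - expected) over its time steps, and it returns only the single best interval using a standard linear-time maximum-sum scan. The function name and the score definition are illustrative assumptions.

```python
from typing import List, Tuple


def burstiest_interval(observed: List[float],
                       expected: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, score) for the contiguous interval (inclusive
    indices) that maximizes sum(observed[i] - expected[i]).

    Assumption: discrepancy is scored as signed excess over the baseline.
    This is a stand-in for illustration, not the MAX-1 algorithm itself.
    """
    if not observed or len(observed) != len(expected):
        raise ValueError("sequences must be non-empty and of equal length")

    best_start = best_end = 0
    best_score = observed[0] - expected[0]
    cur_start = 0
    cur_score = 0.0

    for i, (obs, exp) in enumerate(zip(observed, expected)):
        # A running total that is not positive cannot help a later interval,
        # so start a fresh candidate interval at position i.
        if cur_score <= 0:
            cur_start = i
            cur_score = 0.0
        cur_score += obs - exp
        if cur_score > best_score:
            best_start, best_end, best_score = cur_start, i, cur_score

    return best_start, best_end, best_score
```

Extracting all maximal intervals, as the slide describes, would need more bookkeeping than this single-interval scan.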

Baseline - Discussion
• Baseline can be dynamic:
  – Frequency sequence(s) from previous year(s)
  – Time Series Decomposition to extract Seasonal, Trend and Irregular components
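One simple way to derive a baseline from previous years is sketched below. The slide only says that earlier frequency sequences can serve as the baseline; averaging them position by position is an assumption made here for illustration, and the function name is hypothetical.

```python
from typing import List, Sequence


def baseline_from_previous_years(history: Sequence[Sequence[float]]) -> List[float]:
    """Average several aligned yearly frequency sequences into one baseline.

    history holds one sequence per previous year; all sequences must cover
    the same positions (e.g. the same days of the year).
    Assumption: a plain per-position mean is an adequate expected frequency.
    """
    if not history:
        raise ValueError("at least one previous year is required")
    length = len(history[0])
    if any(len(year) != length for year in history):
        raise ValueError("all yearly sequences must have the same length")
    return [sum(year[i] for year in history) / len(history)
            for i in range(length)]
```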

A Diagram of our framework

Phase 1 : Preprocessing
• Input: a raw document sequence
• Output: the set of terms to be monitored
• Preprocessing methods:
  – Stemming, synonym matching, etc.
  – Stopword removal
  – Frequency pruning for rare words
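A minimal sketch of two of these preprocessing steps, stopword removal and frequency pruning, is given below. Stemming and synonym matching are left out. The stopword list, the tokenizer, and the `min_doc_count` threshold are all illustrative assumptions, not values from the talk.

```python
import re
from collections import Counter
from typing import Iterable, Set

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "and", "to", "is"})


def terms_to_monitor(documents: Iterable[str], min_doc_count: int = 2) -> Set[str]:
    """Return the terms that survive stopword removal and rare-word pruning.

    A term is kept if it appears in at least min_doc_count documents
    (an assumed pruning rule).
    """
    doc_freq = Counter()
    for text in documents:
        # Lowercase alphabetic tokens, counted once per document.
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        doc_freq.update(tokens - STOPWORDS)
    return {term for term, count in doc_freq.items() if count >= min_doc_count}
```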

Phase 2 – Retrieval of Bursty Intervals
• Input: a term
• Output: set of non-overlapping intervals + their burstiness scores
1) Create the frequency sequence for the term.
2) Extract bursty intervals using the MAX-1 algorithm.
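Step 1 can be sketched as follows. This assumes daily granularity and that a term's frequency on a day is the number of that day's documents containing it; both choices, along with the function name and whitespace tokenization, are assumptions for illustration.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Iterable, List, Tuple


def frequency_sequence(term: str,
                       dated_docs: Iterable[Tuple[date, str]],
                       start: date,
                       end: date) -> List[int]:
    """Count, for each day from start to end inclusive, how many documents
    published that day contain the term (simple whitespace tokenization)."""
    counts = Counter()
    for day, text in dated_docs:
        if start <= day <= end and term in text.lower().split():
            counts[day] += 1
    n_days = (end - start).days + 1
    # Days with no matching document get a frequency of 0.
    return [counts[start + timedelta(days=i)] for i in range(n_days)]
```

The resulting list is the kind of sequence a burst-extraction step would take as input.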

Phase 3 – Interval Indexing
• Input: set of bursty intervals for each term
• Output: an index of intervals
  – Simple, easily updatable structure
  – Needs to support multi-term queries

Inverted Interval Index
• Up Next: Query Evaluation
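The slides do not spell out the layout of the inverted interval index, so the following is only a guess at a minimal structure with the stated properties: a mapping from each term to its bursty intervals. Keeping each list sorted by descending score is an assumption, made because threshold-style top-k algorithms typically read lists in score order. All class and field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class BurstyInterval(NamedTuple):
    start: int    # first time step of the burst
    end: int      # last time step of the burst (inclusive)
    score: float  # burstiness score


class InvertedIntervalIndex:
    """Maps each term to its bursty intervals, highest score first."""

    def __init__(self) -> None:
        self._lists: Dict[str, List[BurstyInterval]] = defaultdict(list)

    def add(self, term: str, interval: BurstyInterval) -> None:
        # Re-sorting on every insert keeps the sketch short; it is not tuned.
        self._lists[term].append(interval)
        self._lists[term].sort(key=lambda iv: iv.score, reverse=True)

    def intervals(self, term: str) -> List[BurstyInterval]:
        # .get avoids creating empty entries for unknown terms.
        return list(self._lists.get(term, []))
```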

Phase 4 : Top-k Evaluation for Multi-Term Queries
• Customized version of the Threshold Algorithm (TA) for top-k evaluation.
• Standard version:
  – Terms-to-Documents
  – Each document either appears in a term’s list or not
• Our version (TA*):
  – Terms-to-Intervals
  – A bursty interval of a term t1 may overlap multiple intervals of a term t2.
• Up Next: Experiments
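The overlap issue can be illustrated with a brute-force sketch. This is not TA*: it enumerates every combination of one interval per query term, with no early termination, and it assumes that the score of an overlapping time span is the sum of the contributing intervals' scores. The function name and the scoring rule are assumptions.

```python
from heapq import nlargest
from itertools import product
from typing import List, Sequence, Tuple

Interval = Tuple[int, int, float]  # (start, end, score), end inclusive


def top_k_overlaps(interval_lists: Sequence[Sequence[Interval]],
                   k: int) -> List[Interval]:
    """Return up to k time spans in which one interval from every query
    term overlaps, ranked by summed score (assumed scoring rule)."""
    if not interval_lists:
        return []
    candidates = []
    for combo in product(*interval_lists):
        start = max(iv[0] for iv in combo)
        end = min(iv[1] for iv in combo)
        if start <= end:  # the chosen intervals share at least one time step
            candidates.append((start, end, sum(iv[2] for iv in combo)))
    return nlargest(k, candidates, key=lambda c: c[2])
```

In this sketch a single long interval of one term can pair with several shorter intervals of another term, which is the situation that distinguishes terms-to-intervals lists from terms-to-documents lists.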

Empirical Evaluation
• San Francisco Call: a daily newspaper with publication dates between
• ~400,000 articles
• List of Major Events from (from Wikipedia) + query for each event.

Major Events List

Experiment 1 - Query Expansion
1) Submit the respective query for each event in the Major Events List.
2) Get the top interval.
3) Report the 10 terms that appear in the most document titles within the interval.
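Step 3 can be sketched as a document-frequency count over titles. The sketch assumes the titles of the documents falling inside the top interval have already been collected, and it uses lowercase whitespace tokens without stemming or stopword filtering; these simplifications and the function name are assumptions.

```python
from collections import Counter
from typing import Iterable, List


def top_title_terms(titles: Iterable[str], n: int = 10) -> List[str]:
    """Return the n terms that occur in the largest number of titles."""
    counts = Counter()
    for title in titles:
        # Count each term at most once per title.
        counts.update(set(title.lower().split()))
    return [term for term, _ in counts.most_common(n)]
```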

Example 1
• Event: King Umberto I of Italy is assassinated by Italian-born anarchist Gaetano Bressi.
• Query: “king assassination”
• Reported terms: Umberto, july, state, anarchist, italy, unit, Rome, Bressi, general, police

Example 2
• Event: Louis Bleriot is the first man to fly across the English Channel in an aircraft.
• Query: “English channel”
• Reported terms: flight, july, miles, cross, aviator, attempt, return, Bleriot, condition, machine

Experiment 2 – Burst Detection
1) Submit the respective query for each event in the Major Events List.
2) Get the top reported interval.
3) Compare with the actual event date.
• We use MAX-1 and MAX-2 to extract bursty intervals.
• MAX-2:
  – Re-run MAX-1 on each interval
  – Obtain a nested structure

Examples
• Event: A fire at the Iroquois Theater in Chicago kills 600.
• Query:
  ACTUAL    MAX-1             MAX-2
  Dec       Dec - 20 Aug      31 Dec - 26 Jan
• Event: A fire aboard the steamboat General Slocum in New York City’s East River kills 1,021.
• Query:
  ACTUAL    MAX-1             MAX-2
  Jun       May - 4 Sep       16 Jun - 20 Jun

Conclusion
• The first efficient end-to-end framework for burstiness-aware search in document sequences.
• Future Work:
  – Evaluate on even larger corpora
  – Evaluate on more types of text

Thank you!!!