1
On Burstiness-Aware Search for Document Sequences
Theodoros Lappas, Benjamin Arai, Manolis Platakis, Dimitrios Kotsakos, Dimitrios Gunopulos
SIGKDD 2009
2
Outline
The Problem: How to effectively search through large document sequences (e.g. newspapers)
Previous Work
Using Bursty Terms to Identify Events
Modeling Burstiness using Discrepancy Theory
Our Search Framework
Experiments
3
The Problem
Given a large sequence of documents (e.g. a daily newspaper) and a query of terms, find documents that discuss major events relevant to the query.
Consider the San Francisco Call: a daily 1900s newspaper.
We are given a query, with two candidate events relevant to it:
– The disastrous fire of 1903 in the Iroquois Theater in Chicago
– A disastrous performance given by an actor in a local theater
Clearly the first event is far more influential: articles on this event should be ranked higher!
4
Previous Work
Burstiness explored in different domains:
– Burst Detection - Kleinberg 2002
– Stream Clustering - He et al. 2007
– Graph Evolution - Kumar et al. 2003
– Event Detection - Fung et al. 2005
Nothing on burstiness-aware search:
– Standard Information Retrieval techniques do not consider the underlying events discussed in the collection.
– Event Detection techniques do not consider user input.
5
Burstiness
Bursty periods: periods of “unusually” high frequency.
Unusual? Deviating from an expected baseline.
Major events are discussed in numerous articles for an extended timeframe.
The event’s keywords exhibit high-frequency bursts during the timeframe.
Figure: Frequency of the term “earthquake”, as it appeared in the SF Call (1908 - 1909).
6
Modeling Burstiness using Discrepancy Theory
Discrepancy: used to express and quantify the deviation from the norm.
In our case: find the intervals on the timeline where the observed frequency differs the most from the expected frequency.
Maximal Interval: one that does not include, and is not included in, an interval of higher score.
MAX-1: Linear-Time Algorithm for Maximal Interval Extraction.
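
As a rough illustration of the discrepancy idea, the sketch below scores each time step by the difference between observed and expected frequency, then finds the single highest-scoring contiguous interval with Kadane's maximum-subarray algorithm. This is not MAX-1 itself, which the slide does not spell out and which extracts all maximal intervals rather than one. The function name `max_discrepancy_interval`, the additive score, and the sample numbers are assumptions made for illustration.

```python
from typing import List, Tuple


def max_discrepancy_interval(observed: List[float],
                             expected: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, score) for the contiguous interval (inclusive
    indices) whose summed (observed - expected) is largest.

    Kadane's algorithm, linear time. Assumed scoring: plain difference.
    """
    if len(observed) != len(expected) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")

    best_start = best_end = cur_start = 0
    best = cur = observed[0] - expected[0]
    for i in range(1, len(observed)):
        d = observed[i] - expected[i]
        if cur <= 0:
            # A non-positive running sum cannot help; restart here.
            cur, cur_start = d, i
        else:
            cur += d
        if cur > best:
            best, best_start, best_end = cur, cur_start, i
    return best_start, best_end, best


# Made-up daily counts against a flat expected baseline of 3.
observed = [2, 1, 3, 9, 12, 8, 2, 1]
expected = [3, 3, 3, 3, 3, 3, 3, 3]
print(max_discrepancy_interval(observed, expected))
```

Time steps where the term appears less often than expected contribute negative scores, so the interval stops growing once the burst subsides.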
7
Baseline - Discussion
Baseline can be dynamic:
– Frequency sequence(s) from previous year(s)
– Time Series Decomposition to extract Seasonal, Trend and Irregular Components
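
One simple way to read the first option is to average the frequency sequences of earlier years position by position and treat the result as the expected frequency. The sketch below does exactly that; the averaging rule, the function names, and the toy numbers are assumptions, since the slide does not say how previous years are combined.

```python
from typing import List, Sequence


def baseline_from_previous_years(history: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of earlier years' frequency sequences.

    `history` holds one sequence per previous year, all aligned to the
    same calendar positions (an assumption of this sketch).
    """
    if not history:
        raise ValueError("need at least one previous year")
    length = len(history[0])
    if any(len(year) != length for year in history):
        raise ValueError("all yearly sequences must have the same length")
    return [sum(values) / len(history) for values in zip(*history)]


def discrepancy(observed: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Per-step deviation of the observed frequency from the baseline."""
    return [o - b for o, b in zip(observed, baseline)]


# Two made-up previous years and one made-up current year.
baseline = baseline_from_previous_years([[2, 4, 6], [4, 4, 2]])
print(baseline)
print(discrepancy([3, 10, 4], baseline))
```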
8
A Diagram of our Framework
9
Phase 1: Preprocessing
Input: A raw document sequence
Output: The set of terms to be monitored
Preprocessing Methods:
– Stemming, synonym matching, etc.
– Stopword removal
– Frequency pruning for rare words
10
Phase 2 – Retrieval of Bursty Intervals
Input: A term
Output: Set of non-overlapping intervals + their burstiness scores
1) Create the frequency sequence for the term.
2) Extract bursty intervals using the MAX-1 algorithm.
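
Step 1 can be sketched as follows, assuming that "frequency" means the number of documents per day that contain the term and that documents arrive as `(date, text)` pairs with whitespace tokenization. Both are assumptions of this sketch, as are the function name `frequency_sequence` and the toy documents.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Iterable, List, Tuple


def frequency_sequence(docs: Iterable[Tuple[date, str]], term: str,
                       start: date, end: date) -> List[int]:
    """Daily count of documents containing `term`, from `start` to `end`
    inclusive. Days with no matching document get a count of 0."""
    counts: Counter = Counter()
    for day, text in docs:
        if start <= day <= end and term in text.lower().split():
            counts[day] += 1
    n_days = (end - start).days + 1
    return [counts[start + timedelta(days=i)] for i in range(n_days)]


# Made-up documents, for illustration only.
docs = [
    (date(1900, 1, 1), "Earthquake reported near the coast"),
    (date(1900, 1, 1), "City counts earthquake damage"),
    (date(1900, 1, 2), "Harbor traffic resumes"),
    (date(1900, 1, 3), "Aid arrives after earthquake"),
]
print(frequency_sequence(docs, "earthquake", date(1900, 1, 1), date(1900, 1, 4)))
```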
11
Phase 3 – Interval Indexing
Input: Set of bursty intervals for each term
Output: An Index of Intervals
– Simple, easily updatable structure
– Needs to support multi-term queries
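
A minimal sketch of such an index, assuming it maps each term to a list of its bursty intervals kept in descending score order (a layout that suits threshold-style top-k evaluation). The class and field names are hypothetical, and the actual structure used in the talk may differ.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class BurstyInterval(NamedTuple):
    start: int    # first time step of the interval (inclusive)
    end: int      # last time step of the interval (inclusive)
    score: float  # burstiness score of the interval


class IntervalIndex:
    """Term -> bursty intervals, sorted by descending score (assumed layout)."""

    def __init__(self) -> None:
        self._postings: Dict[str, List[BurstyInterval]] = defaultdict(list)

    def add(self, term: str, interval: BurstyInterval) -> None:
        """Insert one interval and keep the term's list ordered by score."""
        postings = self._postings[term]
        postings.append(interval)
        postings.sort(key=lambda iv: iv.score, reverse=True)

    def lookup(self, term: str) -> List[BurstyInterval]:
        """Return a copy of the term's intervals; empty list if unknown."""
        return list(self._postings.get(term, []))


index = IntervalIndex()
index.add("fire", BurstyInterval(40, 45, 2.0))
index.add("fire", BurstyInterval(10, 20, 5.0))
print(index.lookup("fire"))
```

Re-sorting on every insert keeps the sketch short; a real index would more likely insert in order or rebuild lists in batches.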
12
Inverted Interval Index
Up Next: Query Evaluation
13
Phase 4: Top-k Evaluation for Multi-Term Queries
Customized version of the Threshold Algorithm (TA) for top-k evaluation.
Standard Version:
– Terms-to-Documents
– Each document either appears in a term’s list or not
Our Version (TA*):
– Terms-to-Intervals
– A bursty interval of a term t1 may overlap multiple intervals of a term t2.
Up Next: Experiments
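
The terms-to-intervals complication can be seen in a brute-force sketch: take one interval per query term, keep the combinations that share a common overlap, and rank the overlaps. This is not TA or TA*; it enumerates every combination and has none of the early termination that a threshold algorithm provides. Scoring an overlap by the sum of its intervals' scores is an assumption, as are the function name and the toy postings.

```python
import heapq
from itertools import product
from typing import Dict, List, Tuple

Interval = Tuple[int, int, float]  # (start, end, score), inclusive bounds


def top_k_overlaps(postings: Dict[str, List[Interval]],
                   query: List[str], k: int) -> List[Tuple[float, int, int]]:
    """Brute-force top-k: return up to k (score, start, end) overlaps,
    best first, where every query term is bursty over [start, end]."""
    if not query:
        return []
    lists = [postings.get(term, []) for term in query]
    candidates = []
    for combo in product(*lists):  # one interval per query term
        start = max(iv[0] for iv in combo)
        end = min(iv[1] for iv in combo)
        if start <= end:  # the chosen intervals share a common span
            candidates.append((sum(iv[2] for iv in combo), start, end))
    return heapq.nlargest(k, candidates)


# Made-up postings: one long "fire" burst overlaps two "theater" bursts.
postings = {
    "fire": [(10, 50, 5.0)],
    "theater": [(15, 30, 4.0), (42, 50, 1.0), (60, 70, 9.0)],
}
print(top_k_overlaps(postings, ["fire", "theater"], k=2))
```

In the toy data the single `fire` interval yields two separate candidates, which is the situation a document-based list never has to handle.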
14
Empirical Evaluation
San Francisco Call: a daily newspaper with publication dates between 1900 and 1909; ~400,000 articles.
List of Major Events from 1900-1909 (from Wikipedia), plus a query for each event.
15
Major Events List
16
Experiment 1 - Query Expansion
1) Submit the respective query for each event in the Major Events List.
2) Get the top interval.
3) Report the 10 terms that appear in the most document titles within the interval.
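
Step 3 can be sketched as a count of how many titles each term occurs in, restricted to titles dated inside the top interval. Representing days as integers, tokenizing on whitespace, and the optional stopword set are assumptions of this sketch; the titles are made up.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Tuple


def top_title_terms(titles: Iterable[Tuple[int, str]], start: int, end: int,
                    n: int = 10,
                    stopwords: FrozenSet[str] = frozenset()) -> List[str]:
    """Return the n terms that occur in the most titles dated within
    [start, end]. Each title counts at most once per term."""
    counts: Counter = Counter()
    for day, title in titles:
        if start <= day <= end:
            counts.update(set(title.lower().split()) - stopwords)
    return [term for term, _ in counts.most_common(n)]


# Made-up (day, title) pairs; day 9 lies outside the interval [1, 3].
titles = [
    (1, "Bridge collapse"),
    (2, "Bridge inquiry opens"),
    (3, "Bridge inquiry hears engineers"),
    (3, "Inquiry adjourned"),
    (9, "Weather report"),
]
print(top_title_terms(titles, start=1, end=3, n=2))
```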
17
Example 1
Event: King Umberto I of Italy is assassinated by Italian-born anarchist Gaetano Bressi.
Query: “king assassination”
Reported terms: Umberto, july, state, anarchist, italy, unit, Rome, Bressi, general, police
18
Example 2
Event: Louis Bleriot is the first man to fly across the English Channel in an aircraft.
Query: “English channel”
Reported terms: flight, july, miles, cross, aviator, attempt, return, Bleriot, condition, machine
19
Experiment 2 – Burst Detection
1) Submit the respective query for each event in the Major Events List.
2) Get the top reported interval.
3) Compare with the actual event date.
We use MAX-1 and MAX-2 to extract bursty intervals.
MAX-2:
– Re-run MAX-1 on each interval
– Obtain nested structure
20
Examples
Event: A fire at the Iroquois Theater in Chicago kills 600.
Query:
ACTUAL         MAX-1              MAX-2
Dec 30 1903    22 Dec - 20 Aug    31 Dec - 26 Jan
Event: A fire aboard the steamboat General Slocum in New York City’s East River kills 1,021.
Query:
ACTUAL         MAX-1              MAX-2
Jun 15 1904    14 May - 4 Sep     16 Jun - 20 Jun
21
Conclusion
The first efficient end-to-end framework for burstiness-aware search in document sequences.
Future Work:
– Evaluate on even larger corpora
– Evaluate on more types of text
22
Thank you!!!