On Burstiness-Aware Search for Document Sequences
Theodoros Lappas, Benjamin Arai, Manolis Platakis, Dimitrios Kotsakos, Dimitrios Gunopulos
SIGKDD 2009
Outline
- The Problem: how to effectively search through large document sequences (e.g. newspapers)
- Previous Work
- Using Bursty Terms to Identify Events
- Modeling Burstiness using Discrepancy Theory
- Our Search Framework
- Experiments
The Problem
- Given a large sequence of documents (e.g. a daily newspaper) and a query of terms, find documents that discuss major events relevant to the query.
- Consider the San Francisco Call, a daily newspaper from the 1900s.
- For a given query, two candidate events are relevant:
  1) The disastrous fire of 1903 in the Iroquois Theater in Chicago
  2) A disastrous performance given by an actor in a local theater
- Clearly the first event is far more influential: articles on this event should be ranked higher.
Previous Work
- Burstiness has been explored in different domains:
  - Burst Detection: Kleinberg 2002
  - Stream Clustering: He et al.
  - Graph Evolution: Kumar et al.
  - Event Detection: Fung et al.
- Nothing on burstiness-aware search:
  - Standard Information Retrieval techniques do not consider the underlying events discussed in the collection.
  - Event Detection techniques do not consider user input.
Burstiness
- Bursty periods: periods of “unusually” high frequency.
- Unusual? Deviating from an expected baseline.
- Major events are discussed in numerous articles over an extended timeframe.
- The event’s keywords exhibit high-frequency bursts during that timeframe.
- Figure: frequency of the term “earthquake” as it appeared in the SF Call.
Modeling Burstiness using Discrepancy Theory
- Discrepancy: used to express and quantify the deviation from the norm.
- In our case: find intervals on the timeline where the observed frequency differs the most from the expected frequency.
- Maximal Interval: one that neither includes nor is included in an interval of higher score.
- MAX-1: linear-time algorithm for maximal interval extraction.
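A minimal Python sketch of maximal-interval extraction, assuming the score of each timestamp is simply observed minus expected frequency (the slides do not give the exact discrepancy formula). It follows the Ruzzo-Tompa procedure for finding all maximal scoring subsequences, written with a plain backward scan instead of the pointer bookkeeping needed for a strict linear-time bound. It is not claimed to be the authors' MAX-1 implementation; the names discrepancy_scores and maximal_intervals are illustrative.

```python
def discrepancy_scores(observed, expected):
    """Per-timestamp score: observed minus expected frequency (an assumption)."""
    return [o - e for o, e in zip(observed, expected)]


def maximal_intervals(scores):
    """Return all maximal-scoring intervals as (start, end, score) tuples.

    Indices are inclusive. Each candidate is stored as [start, end, L, R],
    where L is the cumulative score before `start` and R is the cumulative
    score through `end`, so the interval score is R - L.
    """
    candidates = []
    total = 0
    for i, s in enumerate(scores):
        left = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start or end an interval
        cur = [i, i, left, total]
        while True:
            # Scan right to left for the nearest candidate with a smaller L.
            j = len(candidates) - 1
            while j >= 0 and candidates[j][2] >= cur[2]:
                j -= 1
            if j < 0 or candidates[j][3] >= cur[3]:
                candidates.append(cur)
                break
            # Merge candidates j.. into cur and re-check the merged interval.
            cur = [candidates[j][0], cur[1], candidates[j][2], cur[3]]
            del candidates[j:]
    return [(a, b, r - l) for a, b, l, r in candidates]


if __name__ == "__main__":
    observed = [2, 1, 9, 12, 8, 1, 2]   # toy daily counts, not real data
    expected = [3] * len(observed)      # flat baseline, for illustration only
    print(maximal_intervals(discrepancy_scores(observed, expected)))
```

On the toy input the three high-count days form a single bursty interval; with several separated bursts the function returns one non-overlapping interval per burst.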
Baseline - Discussion
- The baseline can be dynamic:
  - frequency sequence(s) from previous year(s)
  - Time Series Decomposition to extract Seasonal, Trend and Irregular components
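One simple way to build a dynamic baseline from earlier years is to average the frequencies position by position. The sketch below assumes the yearly sequences are already aligned and of equal length; baseline_from_history is a hypothetical helper, not something named in the slides, and it does not cover the time-series decomposition option.

```python
def baseline_from_history(previous_years):
    """Average aligned frequency sequences from earlier years.

    previous_years: list of equal-length lists, one per earlier year.
    Returns the expected frequency for each position on the timeline.
    """
    if not previous_years:
        raise ValueError("need at least one earlier sequence")
    length = len(previous_years[0])
    if any(len(year) != length for year in previous_years):
        raise ValueError("sequences must be aligned and of equal length")
    n = len(previous_years)
    return [sum(values) / n for values in zip(*previous_years)]
```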
Diagram of our framework
Phase 1: Preprocessing
- Input: a raw document sequence
- Output: the set of terms to be monitored
- Preprocessing methods:
  - Stemming, synonym matching, etc.
  - Stopword removal
  - Frequency pruning for rare words
Phase 2: Retrieval of Bursty Intervals
- Input: a term
- Output: set of non-overlapping intervals and their burstiness scores
1) Create the frequency sequence for the term.
2) Extract bursty intervals using the MAX-1 algorithm.
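Step 1 can be sketched as follows, assuming each document is a (date, tokens) pair and that frequency means the number of documents per day that contain the term. Both the input layout and the function name frequency_sequence are assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def frequency_sequence(docs, term, start, end):
    """Count, for each day in [start, end], the documents containing `term`.

    docs: iterable of (datetime.date, list_of_tokens) pairs.
    Returns one count per day, in chronological order.
    """
    counts = Counter()
    for day, tokens in docs:
        if start <= day <= end and term in tokens:
            counts[day] += 1
    n_days = (end - start).days + 1
    # Counter returns 0 for days on which the term never appeared.
    return [counts[start + timedelta(days=i)] for i in range(n_days)]
```

The resulting sequence, minus a baseline, is the kind of input that an interval extraction step such as MAX-1 would consume.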
Phase 3: Interval Indexing
- Input: set of bursty intervals for each term
- Output: an index of intervals
- Simple, easily updatable structure
- Needs to support multi-term queries
Inverted Interval Index
Up Next: Query Evaluation
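A minimal sketch of what an inverted interval index could look like: a map from each term to its bursty intervals, returned in descending score order. The class name IntervalIndex, its methods, and the (score, start, end) tuple layout are assumptions for illustration; the slides do not specify the actual structure.

```python
from collections import defaultdict


class IntervalIndex:
    """Maps each term to its bursty intervals (hypothetical layout)."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, term, start, end, score):
        """Register one bursty interval for a term; updates are append-only."""
        self._postings[term].append((score, start, end))

    def lookup(self, term):
        """Return the term's intervals as (score, start, end), best first."""
        return sorted(self._postings.get(term, []), reverse=True)
```

Keeping each list ordered by score is what a threshold-style top-k evaluation would need in order to scan the best intervals of every query term first.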
Phase 4: Top-k Evaluation for Multi-Term Queries
- Customized version of the Threshold Algorithm (TA) for top-k evaluation.
- Standard version:
  - Terms-to-Documents
  - Each document either appears in a term’s list or not.
- Our version (TA*):
  - Terms-to-Intervals
  - A bursty interval of a term t1 may overlap multiple intervals of a term t2.
Up Next: Experiments
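To make the overlap issue concrete, here is a brute-force sketch that intersects the bursty intervals of all query terms and keeps the k best overlaps. It is not TA*: it enumerates every combination instead of stopping early at a threshold. Summing the per-term scores, the (score, start, end) layout, and the name top_k_overlaps are all assumptions for illustration.

```python
import heapq
from itertools import product


def top_k_overlaps(postings, query_terms, k):
    """Brute-force top-k overlapping intervals for a multi-term query.

    postings: dict mapping term -> list of (score, start, end), end inclusive.
    Returns up to k tuples (combined_score, start, end), best first.
    """
    if not query_terms:
        return []
    lists = [postings.get(term, []) for term in query_terms]
    results = []
    for combo in product(*lists):  # one interval per query term
        start = max(interval[1] for interval in combo)
        end = min(interval[2] for interval in combo)
        if start <= end:  # all chosen intervals share a common span
            score = sum(interval[0] for interval in combo)
            results.append((score, start, end))
    return heapq.nlargest(k, results)
```

Because one interval of t1 can intersect several intervals of t2, the same interval may contribute to more than one result here, which is exactly the case a document-oriented TA does not have to handle.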
Empirical Evaluation
- San Francisco Call: a daily newspaper, ~400,000 articles
- List of major events (from Wikipedia), plus a query for each event
Major Events List
Experiment 1 - Query Expansion
1) Submit the respective query for each event in the Major Events List.
2) Get the top interval.
3) Report the 10 terms that appear in the most document titles within the interval.
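Step 3 can be sketched in a few lines, assuming documents are (timestamp, title) pairs and titles are tokenized by whitespace. The function name expansion_terms, the optional stopword set, and the tokenization are illustrative assumptions rather than details from the slides.

```python
from collections import Counter


def expansion_terms(docs, interval, stopwords=frozenset(), n=10):
    """Return the n terms found in the most titles inside `interval`.

    docs: iterable of (timestamp, title) pairs.
    interval: (start, end), both inclusive, comparable with the timestamps.
    """
    start, end = interval
    counts = Counter()
    for timestamp, title in docs:
        if start <= timestamp <= end:
            # A set counts each term at most once per title.
            counts.update(set(title.lower().split()) - stopwords)
    return [term for term, _ in counts.most_common(n)]
```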
Example 1
- Event: King Umberto I of Italy is assassinated by Italian-born anarchist Gaetano Bressi.
- Query: “king assassination”
- Reported terms: Umberto, july, state, anarchist, italy, unit, Rome, Bressi, general, police
Example 2
- Event: Louis Bleriot is the first man to fly across the English Channel in an aircraft.
- Query: “English channel”
- Reported terms: flight, july, miles, cross, aviator, attempt, return, Bleriot, condition, machine
Experiment 2 - Burst Detection
1) Submit the respective query for each event in the Major Events List.
2) Get the top reported interval.
3) Compare with the actual event date.
We use MAX-1 and MAX-2 to extract bursty intervals.
MAX-2:
- Re-run MAX-1 on each interval
- Obtain a nested structure
Examples
Event: A fire at the Iroquois Theater in Chicago kills 600.
Query:
ACTUAL    MAX-1           MAX-2
Dec       Dec - 20 Aug    31 Dec - 26 Jan
Event: A fire aboard the steamboat General Slocum in New York City’s East River kills 1,021.
Query:
ACTUAL    MAX-1           MAX-2
Jun       May - 4 Sep     16 Jun - 20 Jun
Conclusion
- The first efficient end-to-end framework for burstiness-aware search in document sequences.
- Future work:
  - Evaluate on even larger corpora
  - Evaluate on more types of text
Thank you!