On Burstiness-Aware Search for Document Sequences Theodoros Lappas Benjamin Arai Manolis Platakis Dimitrios Kotsakos Dimitrios Gunopulos SIGKDD 2009.


1 On Burstiness-Aware Search for Document Sequences
Theodoros Lappas, Benjamin Arai, Manolis Platakis, Dimitrios Kotsakos, Dimitrios Gunopulos
SIGKDD 2009

2 Outline
• The Problem: how to effectively search through large document sequences (e.g. newspapers)
• Previous Work
• Using Bursty Terms to Identify Events
• Modeling Burstiness using Discrepancy Theory
• Our Search Framework
• Experiments

3 The Problem
• Given a large sequence of documents (e.g. a daily newspaper) and a query of terms, find documents that discuss major events relevant to the query.
• Consider the San Francisco Call: a daily newspaper from the 1900s.
• We are given the query
• Two candidate events, relevant to the query:
  – The disastrous fire of 1903 in the Iroquois Theater in Chicago
  – A disastrous performance given by an actor in a local theater
• Clearly the first event is far more influential: articles on this event should be ranked higher!

4 Previous Work
• Burstiness has been explored in different domains:
  – Burst Detection - Kleinberg 2002
  – Stream Clustering - He et al. 2007
  – Graph Evolution - Kumar et al. 2003
  – Event Detection - Fung et al. 2005
• Nothing on burstiness-aware search:
  – Standard Information Retrieval techniques do not consider the underlying events discussed in the collection.
  – Event Detection techniques do not consider user input.

5 Burstiness
• Bursty periods: periods of “unusually” high frequency
  – Unusual? Deviating from an expected baseline.
• Major events are discussed in numerous articles for an extended timeframe.
• The event’s keywords exhibit high-frequency bursts during the timeframe.
[Figure: frequency of the term “earthquake” as it appeared in the SF Call (1908 - 1909).]

6 Modeling Burstiness using Discrepancy Theory
• Discrepancy: used to express and quantify the deviation from the norm
• In our case: find intervals on the timeline where the observed frequency differs the most from the expected frequency
• Maximal Interval: one that does not include, and is not included in, an interval of higher score.
• MAX-1: linear-time algorithm for maximal interval extraction.
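As a minimal sketch of the discrepancy idea on this slide, the Python below finds the single interval whose observed frequency exceeds the expected baseline by the largest total amount. It uses a standard maximum-sum scan (Kadane's algorithm), not MAX-1, which the slides name but do not spell out. The function name, the score (sum of observed minus expected) and the sample numbers are assumptions made for illustration.

```python
def burstiest_interval(observed, expected):
    """Return (start, end, score) for the contiguous interval [start, end]
    that maximizes sum(observed[i] - expected[i]).

    Illustrative stand-in only: this is Kadane's scan, not the MAX-1 algorithm.
    """
    if len(observed) != len(expected) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    best_score = float("-inf")
    best_start = best_end = 0
    running = 0.0
    run_start = 0
    for i, (obs, exp) in enumerate(zip(observed, expected)):
        # A non-positive running sum cannot help a later interval: restart here.
        if running <= 0:
            running = 0.0
            run_start = i
        running += obs - exp
        if running > best_score:
            best_score = running
            best_start, best_end = run_start, i
    return best_start, best_end, best_score


# Made-up daily counts of one term against a flat baseline of 3 per day.
observed = [2, 1, 3, 9, 12, 8, 2, 1]
expected = [3, 3, 3, 3, 3, 3, 3, 3]
print(burstiest_interval(observed, expected))
```

This returns only the top interval; extracting every maximal interval, as the slide describes, needs additional bookkeeping that is not shown here.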

7 Baseline - Discussion
• The baseline can be dynamic:
  – frequency sequence(s) from previous year(s)
  – time series decomposition to extract seasonal, trend and irregular components
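One simple reading of the first option is sketched below: the expected frequency at each time step is the average of the counts at the same step in earlier years. Averaging is an assumption made here for illustration; the slide only says that previous years' frequency sequences can serve as the baseline. The function name is hypothetical.

```python
def baseline_from_previous_years(previous_years):
    """Average the frequency at each time step across earlier years.

    `previous_years` is a list of equal-length frequency sequences,
    one per year. Averaging is an illustrative assumption.
    """
    if not previous_years:
        raise ValueError("need at least one previous year")
    length = len(previous_years[0])
    if any(len(year) != length for year in previous_years):
        raise ValueError("all yearly sequences must have the same length")
    return [sum(values) / len(previous_years) for values in zip(*previous_years)]


# Two made-up years with three time steps each.
print(baseline_from_previous_years([[2, 4, 6], [4, 4, 0]]))
```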

8 A Diagram of our framework

9 Phase 1: Preprocessing
• Input: a raw document sequence
• Output: the set of terms to be monitored
• Preprocessing methods:
  – Stemming, synonym matching, etc.
  – Stopword removal
  – Frequency pruning for rare words
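A rough sketch of this phase, covering only stopword removal and frequency pruning, might look as follows. Stemming and synonym matching are left out. The stopword list, the `min_doc_freq` cutoff and the function name are all assumptions for illustration, not values from the presentation.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the work).
STOPWORDS = {"the", "a", "of", "in", "and", "is", "at"}


def terms_to_monitor(documents, min_doc_freq=2):
    """Return the terms that survive stopword removal and rare-word pruning."""
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency).
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        doc_freq.update(tokens - STOPWORDS)
    # Prune terms that appear in fewer than `min_doc_freq` documents.
    return {term for term, df in doc_freq.items() if df >= min_doc_freq}


docs = [
    "The fire at the theater",
    "Theater fire kills many",
    "A quiet day in the city",
]
print(sorted(terms_to_monitor(docs)))
```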

10 Phase 2 – Retrieval of Bursty Intervals
• Input: a term
• Output: set of non-overlapping intervals + their burstiness scores
1) Create the frequency sequence for the term.
2) Extract bursty intervals using the MAX-1 algorithm.
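Step 1 can be sketched as below, assuming daily granularity and documents given as (date, text) pairs. Both are assumptions: the slides do not state the time unit or the input format. The sample headlines are made up.

```python
from collections import Counter
from datetime import date, timedelta


def frequency_sequence(term, dated_documents, start, end):
    """Count, for each day in [start, end], the documents containing `term`."""
    counts = Counter(
        day for day, text in dated_documents
        if term in text.lower().split()
    )
    n_days = (end - start).days + 1
    # Counter returns 0 for days on which the term never appeared.
    return [counts[start + timedelta(days=i)] for i in range(n_days)]


# Made-up headlines for illustration.
docs = [
    (date(1906, 4, 18), "Earthquake shakes city"),
    (date(1906, 4, 18), "earthquake damage"),
    (date(1906, 4, 19), "weather report"),
    (date(1906, 4, 20), "relief for earthquake victims"),
]
print(frequency_sequence("earthquake", docs, date(1906, 4, 17), date(1906, 4, 20)))
```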

11 Phase 3 – Interval Indexing
• Input: set of bursty intervals for each term
• Output: an index of intervals
• Simple, easily updatable structure
• Need to support multi-term queries

12 Inverted Interval Index
• Up Next: Query Evaluation
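The diagram for this slide did not survive in the transcript. One plausible shape for such an index, sketched under that caveat, is a mapping from each term to its bursty intervals, kept sorted by descending burstiness score. The class name, method names and tuple layout are hypothetical.

```python
from collections import defaultdict


class IntervalIndex:
    """Maps each term to its bursty intervals, best score first.

    A hypothetical layout; the actual structure is not given in the transcript.
    """

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, term, start, end, score):
        """Insert an interval and keep the term's list sorted by score."""
        self._postings[term].append((start, end, score))
        self._postings[term].sort(key=lambda iv: iv[2], reverse=True)

    def intervals(self, term):
        """Return a copy of the term's intervals (empty if the term is unknown)."""
        return list(self._postings.get(term, []))


index = IntervalIndex()
index.add("fire", 10, 20, 3.5)
index.add("fire", 40, 45, 7.0)
print(index.intervals("fire"))
```

Keeping each list sorted by score is what lets a top-k procedure read the most promising intervals first.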

13 Phase 4: Top-k Evaluation for Multi-Term Queries
• Customized version of the Threshold Algorithm (TA) for top-k evaluation.
• Standard version:
  – Terms-to-Documents
  – Each document either appears in a term’s list or not
• Our version (TA*):
  – Terms-to-Intervals
  – A bursty interval of a term t1 may overlap multiple intervals of a term t2.
• Up Next: Experiments
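For reference, the standard terms-to-documents version can be sketched as follows. This is the textbook Threshold Algorithm with sum as the aggregation function and non-negative scores, not TA*; handling overlapping intervals is not attempted here. The function name, input layout and scores are assumptions for illustration.

```python
import heapq


def threshold_algorithm(lists, k):
    """Top-k items by summed score (standard TA, not the TA* variant).

    `lists` maps each query term to (item, score) pairs sorted by descending,
    non-negative score; an item absent from a list scores 0 for that term.
    """
    lookup = {term: dict(postings) for term, postings in lists.items()}
    totals = {}
    max_depth = max((len(p) for p in lists.values()), default=0)
    for depth in range(max_depth):
        threshold = 0.0
        for term, postings in lists.items():
            if depth < len(postings):
                item, score = postings[depth]  # sorted access
                threshold += score
                if item not in totals:
                    # Random access: fetch the item's score in every list.
                    totals[item] = sum(lookup[t].get(item, 0.0) for t in lists)
        top = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        # No unseen item can score above the threshold, so stop early.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])


lists = {
    "theater": [("d1", 9), ("d2", 5), ("d3", 1)],
    "fire": [("d2", 8), ("d1", 7), ("d4", 2)],
}
print(threshold_algorithm(lists, k=2))
```

In this made-up example the scan stops after the second row of each list, so d3 and d4 are never scored.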

14 Empirical Evaluation
• San Francisco Call: a daily newspaper with publication dates between 1900 and 1909; ~400,000 articles
• List of major events from 1900-1909 (from Wikipedia) + a query for each event.

15 Major Events List

16 Experiment 1 - Query Expansion
1) Submit the respective query for each event in the Major Events List.
2) Get the top interval.
3) Report the 10 terms that appear in the most document titles within the interval.
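Step 3 can be sketched as a document-frequency count over the titles that fall inside the top interval. Dropping the query's own terms from the report is an assumption, as are the function name and the made-up titles.

```python
from collections import Counter


def expansion_terms(titles, query_terms, n=10):
    """Return the n terms found in the most titles, excluding the query terms."""
    query = {t.lower() for t in query_terms}
    title_counts = Counter()
    for title in titles:
        # Count each term at most once per title.
        title_counts.update(set(title.lower().split()) - query)
    return [term for term, _ in title_counts.most_common(n)]


# Made-up titles standing in for the documents inside the top interval.
titles = [
    "King Umberto shot",
    "Anarchist kills Umberto",
    "Umberto funeral in Rome",
    "Anarchist arrested",
]
print(expansion_terms(titles, ["king"], n=2))
```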

17 Example 1
Event: King Umberto I of Italy is assassinated by Italian-born anarchist Gaetano Bressi.
Query: “king assassination”
Terms: Umberto, july, state, anarchist, italy, unit, Rome, Bressi, general, police

18 Example 2
Event: Louis Bleriot is the first man to fly across the English Channel in an aircraft.
Query: “English channel”
Terms: flight, july, miles, cross, aviator, attempt, return, Bleriot, condition, machine

19 Experiment 2 – Burst Detection
1) Submit the respective query for each event in the Major Events List.
2) Get the top reported interval.
3) Compare with the actual event date.
• We use MAX-1 and MAX-2 to extract bursty intervals.
• MAX-2:
  – Re-run MAX-1 on each interval
  – Obtain a nested structure

20 Examples
• Event: A fire at the Iroquois Theater in Chicago kills 600.
• Query:
  ACTUAL         MAX-1             MAX-2
  Dec 30 1903    22 Dec - 20 Aug   31 Dec - 26 Jan
• Event: A fire aboard the steamboat General Slocum in New York City’s East River kills 1,021.
• Query:
  ACTUAL         MAX-1             MAX-2
  Jun 15 1904    14 May - 4 Sep    16 Jun - 20 Jun

21 Conclusion
• The first efficient end-to-end framework for burstiness-aware search in document sequences.
• Future work:
  – Evaluate on even larger corpora
  – Evaluate on more types of text

22 Thank you!!!

