Event Intensity Tracking in Weblog Collections Viet Ha-Thuc, Yelena Mejova, Christopher Harris, Dr Padmini Srinivasan ICWSM 2009 Data Challenge Workshop.


1 Event Intensity Tracking in Weblog Collections Viet Ha-Thuc, Yelena Mejova, Christopher Harris, Dr Padmini Srinivasan ICWSM 2009 Data Challenge Workshop Presented by: Yelena Mejova 1

2 Outline Motivation: Topic Tracking Explore the weblog collection Event tracking approach Related work Results Event tracking Sub-event tracking Future directions 2

3 Outline Motivation: Topic Tracking Explore the weblog collection Event tracking approach Related work Results Event tracking Sub-event tracking Future directions 3

4 Event Tracking People talk: - What - When - How much 4

5 Outline Motivation: Topic Tracking Explore the weblog collection Event tracking approach Related work Results Event tracking Sub-event tracking Future directions 5

6 Data Set Published by Spinn3r.com 44 million blog posts August 1, 2008 – October 1, 2008 No comments 6

7 Data Set Languages 7

8 Data Set Document Length 8

9 Data Set Document Distribution by Date 9

10 Data Set Popular Categories 10

11 Data Set Our subset: – 1 million documents (4% of all English posts) – English only – Inlink threshold of 400 11
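The subset selection on this slide can be sketched as a simple filter. This is only an illustration: the `Post` record, its field names, and the reading of "inlink threshold of 400" as "at least 400 inlinks" are assumptions, not details given in the slides.

```python
from dataclasses import dataclass


@dataclass
class Post:
    # Hypothetical record; the real Spinn3r fields are not shown in the slides.
    lang: str
    inlinks: int
    text: str


def select_subset(posts, min_inlinks=400, lang="en"):
    """Keep posts in the given language with at least `min_inlinks` inlinks.

    Assumption: the threshold is inclusive (>=).
    """
    return [p for p in posts if p.lang == lang and p.inlinks >= min_inlinks]


posts = [
    Post("en", 1200, "olympics opening ceremony"),
    Post("en", 12, "my cat"),
    Post("fr", 900, "jeux olympiques"),
]
subset = select_subset(posts)
```

Here only the first post survives: the second falls below the threshold and the third is not English.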

12 Outline Motivation: Topic Tracking Explore the weblog collection Event tracking approach Related work Results Event tracking Sub-event tracking Future directions 12

13 Tracking Approach Phase I: Estimate relevance-based topic models Phase II: Estimate topical intensity training docs docs topic models topic models topic models topic models docs 13

14 Relevance-based Topical Model 14

15 Relevance-based Topical Model (plate diagram, legend)
– b: background topic (e.g., common English words)
– e_k: event topic
– t_o(d): other document-specific topic
– w: observed word token
– N: training document
– D: training documents for an event
– K: all training sets for all K events

16 Relevance-based Topical Model: Inference
– Given a training set for each event considered
– b: all documents
– e_k: event training documents, not the rest
– t_o(d): one document, not the rest

17 Estimating intensities
– From a subset (slice)
– Window: w = 5 days
– $\mathrm{Intensity}(e_i, t) = \sum_{d \in [t,\, t+w]} \log p(d \mid e_i)$
– $\log p(d \mid e_i)$: log-likelihood of a document given an event
– $d \in [t, t+w]$: documents at a particular window in time
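The intensity formula above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a unigram topic model, so that log p(d|e_i) is the sum of log P(w|e_i) over the document's tokens, and it assumes a small `floor` probability for words missing from the topic. The function names and the `(date, tokens)` document layout are hypothetical.

```python
import math
from collections import Counter
from datetime import date, timedelta


def doc_log_likelihood(tokens, topic_probs, floor=1e-6):
    """log p(d | e) under a unigram model: sum of count(w) * log P(w | e).

    Assumption: words absent from the topic get probability `floor`.
    """
    counts = Counter(tokens)
    return sum(n * math.log(topic_probs.get(w, floor)) for w, n in counts.items())


def intensity(docs, topic_probs, start, window_days=5, floor=1e-6):
    """Sum of document log-likelihoods for documents dated in [start, start + w]."""
    end = start + timedelta(days=window_days)
    return sum(
        doc_log_likelihood(tokens, topic_probs, floor)
        for day, tokens in docs
        if start <= day <= end
    )


# Toy topic and documents (made-up probabilities, for illustration only).
topic = {"olymp": 0.5, "beij": 0.25}
docs = [
    (date(2008, 8, 8), ["olymp", "beij"]),
    (date(2008, 8, 20), ["olymp"]),  # outside the 5-day window starting Aug 8
]
value = intensity(docs, topic, date(2008, 8, 8))
```

With the toy data, only the first document falls inside the window, so `value` equals log(0.5) + log(0.25).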

18 Outline Motivation: Topic Tracking Explore the weblog collection Event tracking approach Related work Results Event tracking Sub-event tracking Future directions 18

19 Related Work Topic Evolution Extraction Zhou et al 2006, Mei & Zhai 2005 Topic Detection and Tracking Allan 2002, Allan et al 1998 Blog Mining Attardi & Simi 2006, Aschenbrenner & Miksch 2005, Kumar et al 2003, Glance, Hurst, Tomokiyo 2004 Relevance Modeling Robertson & Sparck-Jones 1988, Lavrenko & Croft 2001 19

20 Outline Motivation: Topic Tracking Explore the weblog collection Event tracking approach Related work Results Event tracking Sub-event tracking Future directions 20

21 Event Tracking: News Events
– Sources: wikipedia.org + news sites
– Training subsets: retrieved using Lucene [2]
– Events:
  – US Presidential Election
  – Economic Financial Crisis
  – Hurricane Tropical Storms
  – US Open Tennis
  – Russia Georgia Conflict
  – Beijing Olympics
  – China Milk Powder Scandal
  – Thai Political Crisis
  – Delhi India Bomb Blast
  – Pakistan Impeachment

22 Event Tracking: Topic Estimation

Beijing Olympics
word        P(w|BO)
olymp       0.075
beij        0.071
phelp       0.043
china       0.041
game        0.040
gold        0.023
august      0.021
michael     0.021

US Presidential Election
word        P(w|USPE)
obama       0.064
mccain      0.050
palin       0.041
democrat    0.034
republican  0.030
clinton     0.019
biden       0.018
convent     0.017

23 Event Tracking (intensity plots, annotations)
– Running mate announcements, National Conventions
– Olympics: Aug 8-24
– Phelps’ Eighth Medal: Aug 17
– Impeachment launched: Aug 7
– Formal impeachment charges: Aug 17
– Musharraf’s formal resignation: Aug 18
– Several Hurricanes

24 Event Tracking Are the spikes due to sampling process? Topic Latency – How long does it take for discussion to start? What is the effect of topic interference? – Ex: Beijing Olympics China / China Milk Scandal What kinds of subtopics contribute to the main topics? 24

25 Outline Motivation: Topic Tracking Explore the weblog collection Event tracking approach Related work Results Event tracking Sub-event tracking Future directions 25

26 Sub-Event Tracking
– Training set: event-specific
– Topics:
WAS                  NOW
common English       common English + common topical
event                sub-event
other doc-specific   (unchanged)

27 Sub-Event Tracking 27

28 Sub-Event Tracking: Sub-topic Estimation

Democratic Convention
word        P(w|DC)
obama       0.041
dnc         0.040
democrat    0.038
clinton     0.034
biden       0.034
denver      0.027
barack      0.021
hillari     0.012

Republican Convention
word        P(w|RC)
palin       0.073
republican  0.063
mccain      0.050
sarah       0.029
rnc         0.025
song        0.009
paul        0.009
gop         0.009

29 Sub-Event Tracking Democratic Convention: August 25 - 28 Republican Convention: September 1 - 4 29

30 Sub-Event Tracking (hurricane intensity plots, annotations)
– Named: August 15; Landfall: August 18
– Named: August 25; Landfall: September 1
– Named: September 1; Landfall: September 13

31 Sub-Event Tracking
– Deeper hierarchies
– Re-define sub-topics: opinion, locale, other demographics
– Hierarchy diagram labels: Financial Crisis; Federal Reserve Bailout; AIG; Goldman Sachs; Taxpayer Reaction; Congressional Reaction; Conflicts of Interest; Taxpayer Reaction; Financial Market Reaction

32 Conclusions Topic modeling – Excluding non-relevant background and document-specific terms Topic tracking – Closely corresponds with real world – Hierarchical Scalability 32

33 Outline Motivation: Topic Tracking Explore the weblog collection Event tracking approach Related work Results Event tracking Sub-event tracking Future directions 33

34 Future Directions Baseline – standard ad hoc retrieval approaches? Evaluation – gold standard? Dynamic Topic Tracking – moving time window Community Dynamics Topical Sentiment Analysis 34

35 Thank You 35

36 Works Cited
[1] Blei, D., Ng, A., Jordan, M. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 2003.
[2] Apache Lucene. http://lucene.apache.org/java/docs/

