Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential Prediction Markets Hyun Duk Kim, ChengXiang (“Cheng”) Zhai (UIUC)


1 Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential Prediction Markets
Hyun Duk Kim, ChengXiang (“Cheng”) Zhai (UIUC)
Thomas A. Rietz (Univ. of Iowa)
Daniel Diermeier (Northwestern Univ.)
Meichun Hsu, Malu Castellanos, and Carlos Ceja (HP Labs)

2 Text Mining for Understanding Time Series
[Figure: Dow Jones Industrial Average over time. Source: Yahoo Finance]
What might have caused the stock market crash? Sept 11 attack!
Any clues in the companion news stream?

3 Analysis of Presidential Prediction Markets
[Figure: price for a candidate over time]
What might have caused the sudden drop of price for this candidate? Tax cut?
What “mattered” in this election?
Any clues in the companion news stream?

4 Joint Analysis of Text and Time Series to Discover “Causal Topics”
Input:
– Time series
– Text data produced in a similar time period (text stream)
Output:
– Topics whose coverage in the text stream has strong correlations with the time series (“causal” topics), e.g., tax cut, gun control, …

5 Related Work
Topic modeling (e.g., [Hofmann 99], [Blei et al. 03], …)
– Extracts topics from text data and reveals their patterns
– No consideration of time series → topics extracted may not be correlated with the time series
Stream data mining (e.g., [Agrawal 02])
– Clustering & categorization of time series data
– No topics being generated for text data
Temporal text retrieval and prediction (e.g., [Efron 10], [Smith 10])
– Incorporates a time factor in retrieval or text-based prediction
– No topics being generated
New problem: discover causal topics from text streams with time series data for supervision

6 Background: Topic Models
Topic = multinomial distribution over words (unigram language model)
Text is assumed to be a sample of words drawn from a mixture of multiple (unknown) topics
Parameter estimation and Bayesian inference “reveal”
– All the unknown topics in a text collection
– The coverage of each topic in each document
A prior can be imposed to bias the inference of both topics and topic coverage

7 Document as a Sample of Mixed Topics
Topic θ1: government 0.3, response 0.2, ...
Topic θ2: city 0.2, new 0.1, orleans 0.05, ...
…
Topic θk: donate 0.1, relief 0.05, help 0.02, ...
Background: is 0.05, the 0.04, a 0.03, ...
Sample document (brackets mark spans attributed to different topics): [ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] … [ Over seventy countries pledged monetary donations or other assistance ]. …
Generative topic model: inference/estimation of topics; a prior can be added on them
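The generative view on this slide can be illustrated with a minimal sketch. It assumes a toy setting: only the words listed on the slide are kept, their listed probabilities are used as unnormalized weights (the slide elides the rest of each distribution), and the topic names, the `COVERAGE` values, and the function `sample_document` are hypothetical choices made for illustration.

```python
import random

# Word weights copied from the slide; each real distribution has more words
# (elided on the slide), so these are treated as unnormalized weights.
TOPICS = {
    "theta_1": {"government": 0.3, "response": 0.2},
    "theta_2": {"city": 0.2, "new": 0.1, "orleans": 0.05},
    "theta_k": {"donate": 0.1, "relief": 0.05, "help": 0.02},
    "background": {"is": 0.05, "the": 0.04, "a": 0.03},
}

# Hypothetical topic coverage for one document (not from the slides).
COVERAGE = {"theta_1": 0.4, "theta_2": 0.2, "theta_k": 0.1, "background": 0.3}


def sample_document(topics, coverage, length, rng):
    """Draw each word by first picking a topic, then a word from that topic."""
    names = list(coverage)
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=[coverage[n] for n in names])[0]
        dist = topics[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words


if __name__ == "__main__":
    print(" ".join(sample_document(TOPICS, COVERAGE, 20, random.Random(0))))
```

Topic model inference runs this process in reverse: given only the words, it estimates the topic word distributions and the per-document coverage.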

8 When a topic model is applied to a text stream …
[Figure: topics laid out along a time axis]
Topic θ1: government 0.3, response 0.2, ...
Topic θ2: city 0.2, new 0.1, orleans 0.05, ...
…
Topic θk: donate 0.1, relief 0.05, help 0.02, ...
Background: is 0.05, the 0.04, a 0.03, ...

9 New Text Mining Framework: Iterative Causal Topic Modeling
[Figure: framework loop. Topic modeling over the text stream (Sep 2001, Oct 2001, …) produces Topics 1 to 4; the non-text time series identifies causal topics; zooming into the word level yields causal words with signed impact (Topic 1: W1 +, W2 -, W3 +, W4 -, W5 …); splitting words by sign gives Topic 1-1 (W1 +, W3 +) and Topic 1-2 (W2 -, W4 -); the result is fed back as a prior to topic modeling.]

10 Iterative Causal Topic Modeling Framework
[Figure: the framework diagram from slide 9 (text stream and non-text time series, topic modeling, causal topics, zoom into word level, causal words, split words, feedback as prior)]
General framework for any topic modeling method and any causality measure
Naturally incorporates the non-text time series in the process
Topic level + word level → efficiency + granularity

11 Heuristic Optimization of Causality + Coherence

12 Causality Measures
Pearson correlation
– Basic correlation
Granger test
– For two time series x (topic) and y (stock), with time lag p
– Auto-regression of y on its own lagged values plus lagged values of x
– Significance test of whether the lagged x terms should be retained or not
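Both measures can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the Granger part is the textbook F-test that compares an auto-regression of y on its own p lagged values against one that also includes p lagged values of x, and it returns only the F statistic (turning it into a p-value needs an F-distribution CDF, which is left out). The function names and the synthetic demo series are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w


def rss(rows, target):
    """Residual sum of squares of the least-squares fit target ~ rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    w = solve(xtx, xty)
    return sum((t - sum(wi * ri for wi, ri in zip(w, r))) ** 2
               for r, t in zip(rows, target))


def granger_f(x, y, p):
    """F statistic for 'lagged x helps predict y' with lag p."""
    times = range(p, len(y))
    target = [y[t] for t in times]
    restricted = [[1.0] + [y[t - i] for i in range(1, p + 1)] for t in times]
    full = [row + [x[t - i] for i in range(1, p + 1)]
            for row, t in zip(restricted, times)]
    rss_r, rss_u = rss(restricted, target), rss(full, target)
    dof = len(target) - (2 * p + 1)
    return ((rss_r - rss_u) / p) / (rss_u / dof)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic series: y follows x with a one-step delay plus small noise.
    x = [rng.gauss(0, 1) for _ in range(200)]
    y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]
    print("Pearson(x, y):", pearson(x, y))
    print("Granger F (x -> y, p=1):", granger_f(x, y, 1))
```

A large F statistic means the lagged x terms reduce the prediction error of y substantially, which is the sense in which a topic is called “causal” here. In practice a statistics library would normally replace the hand-written regression.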

13 Feedback Prior Generation
Word impact per topic:
Topic  Word        Impact  Significance (%)
1      Social      +       99
1      Security    +       96
1      Gun         -       98
1      Control     -       96
5      September   -       99
5      Airline     -       99
5      Terrorism   -       97
5      … (5 more words)
5      Attack      -       96
5      Good        +       96
Generated prior:
Topic  Word        Prob
1      Social      0.8
1      security    0.2
2      Gun         0.75
2      Control     0.25
3      September   0.1
3      Airline     0.1
3      Terrorism   0.075
3      … (5 more)
3      Attack      0.05
3      Good        0.0
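The first two prior topics on this slide can be reproduced with a simple rule, sketched below. The rule is an assumption inferred from the numbers, not something the slide states: words are split by impact sign, each word is weighted by its significance minus a 95% threshold, and the weights are normalized within each group. The function name `feedback_priors` and the threshold default are hypothetical. The third prior topic is not reproduced here, since five of its words are elided on the slide and the word with the opposite sign (Good) is assigned probability 0.0 rather than split off.

```python
def feedback_priors(word_impacts, threshold=95):
    """Split one topic's significant words by impact sign and build priors.

    word_impacts: list of (word, sign, significance_percent) tuples, where
    sign is "+" or "-". Returns one {word: probability} dict per non-empty
    sign group, positive group first.
    """
    groups = {"+": {}, "-": {}}
    for word, sign, significance in word_impacts:
        if significance > threshold:
            # Assumed weighting: margin of significance above the threshold.
            groups[sign][word] = significance - threshold
    priors = []
    for sign in ("+", "-"):
        total = sum(groups[sign].values())
        if total > 0:
            priors.append({w: v / total for w, v in groups[sign].items()})
    return priors


if __name__ == "__main__":
    topic_1 = [
        ("social", "+", 99),
        ("security", "+", 96),
        ("gun", "-", 98),
        ("control", "-", 96),
    ]
    for prior in feedback_priors(topic_1):
        print(prior)
```

For the slide's Topic 1 this yields social 0.8 and security 0.2 in one prior topic, and gun 0.75 and control 0.25 in the other, matching the generated prior table.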

14 Experiment Design 1: Stock Market Analysis
Time: June 2000 – Dec. 2011
Text data
– New York Times
Time series
– American Airlines stock (AAMRQ)
– Apple stock (AAPL)
Question: are there any “causal topics” that explain the fluctuation of the two companies' stocks?

15 Experiment Design 2: 2000 Presidential election campaign

16 Measuring Topic Quality
Causality confidence of a topic
– Based on the p-value of the causality test (Granger, Pearson) for the topic
Topic purity
– Consistency in the direction of the “causal” relation with the time series (“are all words in the topic positively correlated with the time series?”)
– Based on the entropy of the distribution of positive/negative words

17 Topic Purity
Topic  Word        Impact  Significance (%)
1      Social      +       99
1      Security    +       96
1      Gun         -       98
1      Control     -       96
5      September   -       99
5      Airline     -       99
5      Terrorism   -       97
5      Attack      -       96
5      Good        +       96
[Figure: entropy H(T) plotted against P(T=“pos”), both axes from 0 to 1.0]
P(T=“pos”) = P(T=“neg”) = 1/2 → highest entropy → lowest purity (0)
P(T=“pos”) = 1/5, P(T=“neg”) = 4/5 → lower entropy → higher purity
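The entropy-based purity can be sketched as follows. The exact normalization is an assumption: the slides only say that purity is based on the entropy of the positive/negative word distribution and that a 1/2 vs. 1/2 split gives the lowest purity (0), so this sketch uses purity = 1 - H(T) with a base-2 logarithm, which satisfies both statements. The function name `topic_purity` is hypothetical.

```python
import math


def topic_purity(n_pos, n_neg):
    """Purity of a topic from counts of positively/negatively correlated words.

    Assumed definition: 1 minus the binary entropy (base 2) of the sign
    distribution, so an even split scores 0 and a one-sided topic scores 1.
    """
    total = n_pos + n_neg
    if total == 0:
        raise ValueError("topic has no significant words")
    entropy = 0.0
    for count in (n_pos, n_neg):
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    return 1.0 - entropy


if __name__ == "__main__":
    # Topic 1 on the slide: 2 positive and 2 negative words.
    print("Topic 1 purity:", topic_purity(2, 2))
    # Topic 5 on the slide: 1 positive and 4 negative words.
    print("Topic 5 purity:", topic_purity(1, 4))
```

Under this definition the slide's Topic 1 scores 0, while Topic 5 scores higher because four of its five words point in the same direction.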

18 Sample Result 1: Topics discovered for AAMRQ vs. AAPL
Significant topic lists for the two external time series (top three words per topic):
AAMRQ                        AAPL
russia russian putin         paid notice st
europe european germany      russia russian europe
bush gore presidential       olympic games olympics
police court judge           she her ms
airlines airport air         oil ford prices
united trade terrorism       black fashion blacks
food foods cheese            computer technology software
nets scott basketball        internet com web
tennis williams open         football giants jets
awards gay boy               japan japanese plane
moss minnesota chechnya      …
AAMRQ: airline, terrorism topics
AAPL: IT industry topics
→ Topics discovered depend on the external time series

19 Effect of Iterations on Causality Confidence & Purity

20 Different Feedback Strength (µ)
[Figure: plots against iteration (Iter) for µ=10, µ=50, µ=100, µ=500, µ=1000]
Significant improvement in confidence and in the number of significant topics by feedback
– Clear benefit of feedback
Large µ guarantees topic purity improvement

21 Sample Result 2: Major Topics in 2000 Presidential Election
Top three words in significant topics:
tax cut 1
screen pataki guiliani
enthusiasm door symbolic
oil energy prices
news w top
pres al vice
love tucker presented
partial abortion privatization
court supreme abortion
gun control nra
Revealed several important issues
– E.g., tax cut, abortion, gun control, oil energy
– Such topics are also cited in political science literature [Pomper `01] and Wikipedia [Link]

22 Additional Results: http://sifaka.cs.uiuc.edu/~hkim277/InCaToMi/demo/2000_Presidential_Election/dashboard/Dashboard.html

23 Conclusions & Future Work
Meaningful topics can be extracted from a text stream by using time series for supervision
Such “causal” topics provide potential explanations for changes in the time series data
Preliminary experiment results on 2000 presidential prediction markets are promising
Future work (discussion)
– Issues related to topic models (e.g., local maxima, # of topics, interpretation of topics)
– Issues related to causality analysis (e.g., “local” causality)
– Unified analysis model
– System to support online interactive analysis of causal topics (time series can be derived from text too)

24 Thank You! Questions/Comments?

