Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential Prediction Markets Hyun Duk Kim, ChengXiang (“Cheng”) Zhai (UIUC)

Presentation transcript:

Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential Prediction Markets Hyun Duk Kim, ChengXiang (“Cheng”) Zhai (UIUC) Thomas A. Rietz (Univ. of Iowa) Daniel Diermeier (Northwestern Univ.) Meichun Hsu, Malu Castellanos, and Carlos Ceja (HP Labs) 1

Text Mining for Understanding Time Series [Figure: Dow Jones Industrial Average over time; source: Yahoo Finance] What might have caused the stock market crash? Any clues in the companion news stream? Sept 11 attack! 2

Analysis of Presidential Prediction Markets [Figure: prediction market price over time] What might have caused the sudden drop in price for this candidate? What “mattered” in this election? Any clues in the companion news stream? Tax cut? 3

Joint Analysis of Text and Time Series to Discover “Causal Topics”
Input:
– Time series
– Text data produced in a similar time period (text stream)
Output:
– Topics whose coverage in the text stream has strong correlations with the time series (“causal” topics), e.g., tax cut, gun control, … 4
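As a rough illustration of what “coverage in the text stream” could look like as data, the sketch below turns per-document topic proportions into one value per topic per date by averaging over the documents published on that date. The function name `topic_coverage_series`, the input layout, and the choice of averaging are assumptions made for illustration; the slides do not say how coverage is aggregated.

```python
from collections import defaultdict


def topic_coverage_series(docs, num_topics):
    """Average topic proportions per date (hypothetical helper).

    docs: iterable of (date_string, proportions), where proportions is a
          list of length num_topics.
    Returns {date_string: [average coverage of each topic]} in date order.
    """
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for date, proportions in docs:
        for j in range(num_topics):
            sums[date][j] += proportions[j]
        counts[date] += 1
    return {d: [s / counts[d] for s in sums[d]] for d in sorted(sums)}


# Toy input: two documents on the first day, one on the second.
docs = [
    ("2000-09-01", [0.75, 0.25]),
    ("2000-09-01", [0.25, 0.75]),
    ("2000-09-02", [1.0, 0.0]),
]
series = topic_coverage_series(docs, num_topics=2)
print(series)
```

Each topic's column of this output is a time series that can then be compared with the external (non-text) series.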

Related Work
Topic modeling (e.g., [Hofmann 99], [Blei et al. 03], …)
– Extract topics from text data and reveal their patterns
– No consideration of time series, so the extracted topics may not be correlated with the time series
Stream data mining (e.g., [Agrawal 02])
– Clustering and categorization of time series data
– No topics generated for text data
Temporal text retrieval and prediction (e.g., [Efron 10], [Smith 10])
– Incorporate the time factor in retrieval or text-based prediction
– No topics generated
New problem: discover causal topics from text streams, using time series data for supervision 5

Background: Topic Models
Topic = multinomial distribution over words (unigram language model)
Text is assumed to be a sample of words drawn from a mixture of multiple (unknown) topics
Parameter estimation and Bayesian inference “reveal”
– All the unknown topics in a text collection
– The coverage of each topic in each document
A prior can be imposed to bias the inference of both topics and topic coverage 6
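The mixture assumption can be made concrete with a small sketch: the probability of a word in a document is the coverage-weighted sum of its probability under each topic. The function name `mixture_word_prob` and all numbers below are toy values chosen for illustration, not figures from the slides.

```python
def mixture_word_prob(word, coverage, topics):
    """p(word | doc) = sum_j coverage[j] * p(word | topic j).

    coverage: list of topic weights for one document (sums to 1).
    topics:   list of dicts mapping word -> probability (each sums to 1).
    """
    return sum(pi * topic.get(word, 0.0) for pi, topic in zip(coverage, topics))


# Two toy topics (hypothetical values) and one document's topic coverage.
topics = [
    {"government": 0.5, "response": 0.5},
    {"city": 0.5, "orleans": 0.5},
]
coverage = [0.75, 0.25]

for w in ("government", "response", "city", "orleans"):
    print(w, mixture_word_prob(w, coverage, topics))
```

Estimation runs this relationship in reverse: given only the observed words, it infers both the topic word distributions and the per-document coverage.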

Document as a Sample of Mixed Topics
Topic 1: government 0.3; response
Topic 2: city 0.2; new 0.1; orleans
…
Topic k: donate 0.1; relief 0.05; help
Background: is 0.05; the 0.04; a
Document: [ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] …[ Over seventy countries pledged monetary donations or other assistance]. …
Generative topic model; inference/estimation of topics; a prior can be added on them 7

When a topic model is applied to a text stream
[Figure: time axis with the topics below]
Topic 1: government 0.3; response
Topic 2: city 0.2; new 0.1; orleans
…
Topic k: donate 0.1; relief 0.05; help
Background: is 0.05; the 0.04; a

New Text Mining Framework: Iterative Causal Topic Modeling
[Diagram: Text Stream (Sep 2001, Oct 2001, …) → Topic Modeling → Topic 1, Topic 2, Topic 3, Topic 4 → (with Non-text Time Series) Causal Topics → Zoom into Word Level → Causal Words (Topic 1: W1 +, W2 --, W3 +, W4 --, W5 …) → Split Words (Topic 1-1: W1 +, W3 +; Topic 1-2: W2 --, W4 --) → Feedback as Prior → Topic Modeling] 9

Iterative Causal Topic Modeling Framework
[Diagram: same components as the previous slide: Non-text Time Series, Text Stream, Topic Modeling, Causal Topics, Zoom into Word Level, Causal Words, Split Words, Feedback as Prior]
– General framework for any topic modeling method and any causality measure
– Naturally incorporates non-text time series in the process
– Topic level + word level → efficiency + granularity 10

Heuristic Optimization of Causality + Coherence 11

Causality Measures
Pearson correlation
– Basic correlation
Granger test
– For two time series x (topic) and y (stock), with time lag p
– [Equation: auto-regression of y on its own lagged values plus lagged values of x]
– Significance test of whether the lagged x terms should be retained or not 12
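A minimal sketch of the two measures, assuming the textbook form of the Granger test: fit an auto-regression of y on its own p lags, fit a second regression that also includes p lags of x, and compare the residual sums of squares with an F statistic. The function names `pearson` and `granger_f` are hypothetical, NumPy is assumed to be available, and the slides do not show which implementation the authors used.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    return float(np.corrcoef(x, y)[0, 1])


def _rss(design, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)


def granger_f(x, y, p=1):
    """F statistic for 'lagged x helps predict y' with lag p.

    Restricted model:   y_t ~ const + y_{t-1..t-p}
    Unrestricted model: y_t ~ const + y_{t-1..t-p} + x_{t-1..t-p}
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y) - p                       # usable observations
    target = y[p:]

    def lags(s):
        # Column i holds s_{t-i} aligned with target y_t.
        return np.column_stack([s[p - i:len(s) - i] for i in range(1, p + 1)])

    restricted = np.hstack([np.ones((n, 1)), lags(y)])
    unrestricted = np.hstack([restricted, lags(x)])
    rss_r = _rss(restricted, target)
    rss_u = _rss(unrestricted, target)
    dof = n - (2 * p + 1)                # residual degrees of freedom
    return ((rss_r - rss_u) / p) / (rss_u / dof)


# Synthetic check: y follows x with a one-step delay, so x should help predict y.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.zeros(200)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=199)
print(granger_f(x, y, p=1), granger_f(y, x, p=1))
```

A p-value would follow from the F distribution with (p, n - 2p - 1) degrees of freedom; that step is left out here to keep the sketch dependent on NumPy only.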

Feedback Prior Generation

Topic  Word              Impact  Significance (%)
1      Social            +       99
       Security          +       96
       Gun               -       98
       Control           -       96
5      September         -       99
       Airline           -       99
       Terrorism         -       97
       … (5 more words)
       Attack            -       96
       Good

Topic  Word              Prob
1      Social            0.8
       security          0.2
2      Gun               0.75
       Control
       September         0.1
       Airline           0.1
       Terrorism         0.075
       … (5 more)
       Attack            0.05
       Good              0.0
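The Topic 1 rows can be reproduced with a simple rule: split the significant words by the sign of their impact, then weight each word by how far its significance exceeds 95% and normalize within each group. This rule, the 95% threshold, and the name `split_and_build_prior` are assumptions inferred from those rows only; the sketch does not model whatever additional handling produces the values shown for the other topic.

```python
def split_and_build_prior(words, threshold=95.0):
    """Split a topic's words by impact sign and build one prior per sign.

    words: list of (word, sign, significance_percent), sign is "+" or "-".
    Returns a list of {word: prior probability} dicts, positive group first.
    Assumed weighting: (significance - threshold), normalized per group.
    """
    groups = {"+": [], "-": []}
    for word, sign, significance in words:
        if significance > threshold:
            groups[sign].append((word, significance - threshold))

    priors = []
    for sign in ("+", "-"):
        total = sum(weight for _, weight in groups[sign])
        if total > 0:
            priors.append({word: weight / total for word, weight in groups[sign]})
    return priors


topic_1 = [
    ("Social", "+", 99),
    ("Security", "+", 96),
    ("Gun", "-", 98),
    ("Control", "-", 96),
]
print(split_and_build_prior(topic_1))
# [{'Social': 0.8, 'Security': 0.2}, {'Gun': 0.75, 'Control': 0.25}]
```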

Experiment Design 1: Stock Market Analysis
Time: June 2000 – Dec
Text data
– New York Times
Time series
– American Airlines stock (AAMRQ)
– Apple stock (AAPL)
Question: are there any “causal topics” that explain fluctuations in the stocks of the two companies? 14

Experiment Design 2: 2000 Presidential Election Campaign 15

Measuring Topic Quality
Causality confidence of a topic
– Based on the p-value of the causality test (Granger, Pearson) for the topic
Topic purity
– Consistency in the direction of the “causal” relation with the time series (“are all words in the topic positively correlated with the time series?”)
– Based on the entropy of the distribution of positive/negative words 16

Topic Purity

Topic  Word       Impact  Significance (%)
1      Social     +       99
       Security   +       96
       Gun        -       98
       Control    -       96
5      September  -       99
       Airline    -       99
       Terrorism  -       97
       Attack     -       96
       Good       +       96

[Plot: entropy H(T) against P(T=“pos”)]
P(T=“pos”) = P(T=“neg”) = 1/2 → highest entropy → lowest purity (0)
P(T=“pos”) = 1/5, P(T=“neg”) = 4/5 → lower entropy → higher purity 17
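The two cases on this slide can be checked numerically. The sketch below assumes purity is defined as 1 minus the base-2 entropy of the positive/negative split, which matches the stated endpoint (purity 0 at a 1/2 vs. 1/2 split); the exact scaling used by the authors is not given in the slides, and `topic_purity` is a hypothetical name.

```python
import math


def topic_purity(signs):
    """Purity of a topic from the impact signs ("+" or "-") of its words.

    Assumed definition: 1 - H(T), with H the base-2 entropy of the
    positive/negative distribution, so purity ranges from 0 to 1.
    """
    p_pos = sum(1 for s in signs if s == "+") / len(signs)
    entropy = -sum(p * math.log2(p) for p in (p_pos, 1.0 - p_pos) if p > 0)
    return 1.0 - entropy


print(topic_purity(["+", "+", "-", "-"]))       # Topic 1: even split
print(topic_purity(["-", "-", "-", "-", "+"]))  # Topic 5: 1/5 positive
```

Under this assumed definition the even split in Topic 1 gives purity 0, while the 1/5 vs. 4/5 split in Topic 5 gives a purity of roughly 0.28.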

Sample Result 1: Topics discovered for AAMRQ vs. AAPL
Significant topic list for two different external time series:
AAMRQ: russia russian putin; europe european germany; bush gore presidential; police court judge; airlines airport air; united trade terrorism; food foods cheese; nets scott basketball; tennis williams open; awards gay boy; moss minnesota chechnya
AAPL: paid notice st; russia russian europe; olympic games olympics; she her ms; oil ford prices; black fashion blacks; computer technology software; internet com web; football giants jets; japan japanese plane; …
– AAMRQ: airline, terrorism topic
– AAPL: IT industry topic
→ Topics discovered depend on the external time series 18

Effect of Iterations on Causality Confidence & Purity 19

Different Feedback Strength (µ)
[Plots: results by iteration (Iter) for µ=10, µ=50, µ=100, µ=500, µ=1000]
Feedback significantly improves confidence and the number of significant topics
– Clear benefit of feedback
Large µ guarantees topic purity improvement 20
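One way to build intuition for the feedback strength µ is to treat the prior as pseudo-counts added to the observed word counts, as is common with Dirichlet priors in topic models. Whether the authors' update takes exactly this form is not shown in the slides, so the formula, the function name `map_word_probs`, and the toy counts below are all assumptions.

```python
def map_word_probs(counts, prior, mu):
    """Word distribution after adding mu * prior as pseudo-counts.

    p(w) = (count(w) + mu * prior(w)) / (total count + mu)
    counts: {word: observed count}; prior: {word: prior probability}.
    """
    total = sum(counts.values()) + mu
    vocab = set(counts) | set(prior)
    return {w: (counts.get(w, 0) + mu * prior.get(w, 0.0)) / total for w in vocab}


# Toy data: the observed counts favor "tax", the prior favors "gun".
counts = {"gun": 2, "tax": 8}
prior = {"gun": 0.75, "control": 0.25}

for mu in (0, 10, 100, 1000):
    print(mu, round(map_word_probs(counts, prior, mu)["gun"], 3))
```

As µ grows, the estimate for "gun" moves from its data-only value (0.2) toward its prior value (0.75), which is the sense in which a larger µ makes the fed-back prior dominate.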

Sample Result 2: Major Topics in 2000 Presidential Election
Top three words in significant topics:
tax cut 1
screen pataki guiliani
enthusiasm door symbolic
oil energy prices
news w top
pres al vice
love tucker presented
partial abortion privatization
court supreme abortion
gun control nra
Revealed several important issues
– E.g., tax cut, abortion, gun control, oil energy
– Such topics are also cited in political science literature [Pomper ’01] and Wikipedia 21

Additional Results: Election/dashboard/Dashboard.html 22

Conclusions & Future Work
Meaningful topics can be extracted from a text stream by using time series for supervision
Such “causal” topics provide potential explanations for changes in the time series data
Preliminary experimental results on the 2000 presidential prediction markets are promising
Future work (discussion)
– Issues related to topic models (e.g., local maxima, number of topics, interpretation of topics)
– Issues related to causality analysis (e.g., “local” causality)
– Unified analysis model
– System to support online interactive analysis of causal topics (time series can be derived from text too) 23

Thank You! Questions/Comments? 24