Inferring User Political Preferences from Streaming Communications Svitlana Volkova 1, Glen Coppersmith 2 and Benjamin Van Durme 1,2 1 Center for Language and Speech Processing 2 Human Language Technology Center of Excellence ACL 2014, Baltimore
Motivation Personalized, diverse and timely data Can reveal user interests, preferences and opinions DemographicsPro – WolframAlpha Analytics –
Applications Large-scale passive polling and real-time live polling Online advertising Healthcare analytics Personalized recommendation systems and search
User Attribute Prediction
Political Preference: Rao et al., 2010; Conover et al., 2011; Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013
Gender: Garera and Yarowsky, 2009; Rao et al., 2010; Burger et al., 2011; Van Durme, 2012; Zamal et al., 2012; Bergsma and Van Durme, 2013
Age: Rao et al., 2010; Zamal et al., 2012; Cohen and Ruths, 2013; Nguyen et al., 2011, 2013
…
Existing Approaches Tweets as a document: ~1K tweets per user* Does an average Twitter user produce thousands of tweets? *Rao et al., 2010; Conover et al., 2011; Pennacchiotti and Popescu, 2011a; Burger et al., 2011; Zamal et al., 2012; Nguyen et al., 2013
How Active are Twitter Users?
Real-World Predictions Not active users: no or limited content Average Twitter users Median = 10 tweets per day Active users 1,000+ tweets Private users: no content 10% 50% 20%
Our Approach
1. Take advantage of user local neighborhoods (real-world batch predictions)
2. Incremental dynamic real-time predictions (streaming predictions)
Attributed Social Network User Local Neighborhoods a.k.a. Social Circles
Twitter Network Data Code, data and trained models for gender, age, political preference prediction
Twitter Social Graph
I. Candidate-Centric: 1,031 users of interest
II. Geo-Centric: 270 users
III. Politically Active*: 371 users
neighbors of each type per user
~50K nodes, ~60K edges
What types of neighbors lead to the best attribute prediction for a given user?
*Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013
Experiments
Log-linear binary unigram models: (I) Users vs. (II) Neighbors and (III) Both
Evaluate the relative utility of different neighborhood types:
– varying neighborhood size n = [1, 2, 5, 10] and content amount t = [5, 10, 15, 25, 50, 100, 200]
– 10-fold cross-validation with 100 random restarts for every n and t parameter combination
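To make the experimental setup concrete, here is a minimal sketch of a log-linear (logistic regression) classifier over binary unigram features, with a helper that pools t tweets from each of n sampled neighbors into one document. The slides do not state the tokenizer, optimizer, or regularization, so whitespace tokenization, plain SGD, and the L2 weight are assumptions; all function names and the toy tweets are hypothetical.

```python
import math
import random
from itertools import product

def binary_unigrams(tweets):
    """Binary unigram features: the set of lowercased tokens that occur at all."""
    return {tok for tweet in tweets for tok in tweet.lower().split()}

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(examples, epochs=50, lr=0.5, l2=1e-3, seed=0):
    """Fit weights by SGD. examples: list of (feature_set, label in {0, 1}).
    Assumption: SGD with L2 shrinkage; the slides do not name an optimizer."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for feats, label in data:
            z = bias + sum(weights.get(f, 0.0) for f in feats)
            err = label - sigmoid(z)
            bias += lr * err
            for f in feats:
                w = weights.get(f, 0.0)
                weights[f] = w + lr * (err - l2 * w)
    return weights, bias

def predict_proba(model, feats):
    """Probability of label 1 for a set of binary unigram features."""
    weights, bias = model
    return sigmoid(bias + sum(weights.get(f, 0.0) for f in feats))

def neighborhood_features(neighbor_tweets, n, t, rng):
    """Sample up to n neighbors and up to t tweets from each, pooled as one document."""
    chosen = rng.sample(list(neighbor_tweets), min(n, len(neighbor_tweets)))
    pooled = []
    for neighbor in chosen:
        tweets = neighbor_tweets[neighbor]
        pooled.extend(rng.sample(tweets, min(t, len(tweets))))
    return binary_unigrams(pooled)

# The (n, t) combinations listed on the slide.
GRID = list(product([1, 2, 5, 10], [5, 10, 15, 25, 50, 100, 200]))

# Toy, made-up training data purely for illustration.
toy = [
    (binary_unigrams(["lower taxes now", "small government"]), 1),
    (binary_unigrams(["expand healthcare", "climate action now"]), 0),
]
model = train_logistic(toy)
```

In a fuller experiment, each (n, t) pair in `GRID` would be evaluated with 10-fold cross-validation, resampling neighbors and tweets on every restart.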
Neighborhood Comparison [Figure: accuracy vs. tweets per neighbor, with panels for 1 neighbor and 10 neighbors]
Optimizing Twitter API Calls Cand-Centric Graph: Friend Circle
Summary: Batch Real-World Predictions with Limited User Data
– More data is better
– How to get it? More neighbors per user > additional content from the existing neighbors
– What kind of data? Follower, retweet
– Real-world predictions: users who recently joined Twitter, or no or limited access to user tweets (no or very limited content!)
Our Approach
1. Take advantage of user local neighborhoods
2. Incremental dynamic real-time predictions: streaming predictions
Iterative Bayesian Predictions [Figure: predictions updated over time as tweets arrive]
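One common way to write an iterative Bayesian prediction is as a sequential Bayes update, sketched here under the assumption that tweets are conditionally independent given the attribute. The symbols are introduced for illustration: a is the attribute value, t_k the k-th tweet in the stream, and P_0(a) the prior. The slides do not give the exact formula used.

```latex
P_k(a) \;=\; P(a \mid t_1, \dots, t_k)
\;=\; \frac{P(t_k \mid a)\, P_{k-1}(a)}{\sum_{a'} P(t_k \mid a')\, P_{k-1}(a')},
\qquad P_0(a) = P(a)
```

Each new tweet rescales the previous belief by its likelihood and renormalizes, so the posterior after k tweets serves as the prior for tweet k+1.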
Cand-Centric Graph: Belief Updates [Figure: belief updates over time]
Cand-Centric Graph: Prediction Time [Figure: prediction time for 100 users at 75% and 95% confidence, user stream vs. user-neighbor stream]
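The idea of predicting as soon as a confidence level such as 75% or 95% is reached can be sketched as a streaming loop that updates a belief after every tweet and stops at the threshold. This is an illustrative sketch, not the authors' implementation: the function names, the labels "D" and "R", and the fixed per-tweet likelihoods are hypothetical, and in practice the likelihoods would come from a trained model.

```python
def update_belief(prior, likelihoods):
    """One Bayes step: prior maps label -> P(label); likelihoods maps label -> P(tweet | label)."""
    unnorm = {a: prior[a] * likelihoods[a] for a in prior}
    total = sum(unnorm.values())  # assumed non-zero for this sketch
    return {a: v / total for a, v in unnorm.items()}

def stream_predict(tweet_likelihoods, prior, threshold=0.75):
    """Consume per-tweet likelihoods until some label reaches the threshold.

    Returns (label or None, number of tweets consumed, final belief).
    """
    belief = dict(prior)
    seen = 0
    for seen, lik in enumerate(tweet_likelihoods, start=1):
        belief = update_belief(belief, lik)
        label, p = max(belief.items(), key=lambda kv: kv[1])
        if p >= threshold:
            return label, seen, belief
    return None, seen, belief

# Hypothetical stream: every tweet is slightly more likely under label "D".
prior = {"D": 0.5, "R": 0.5}
stream = [{"D": 0.6, "R": 0.4}] * 10

label_75, seen_75, _ = stream_predict(stream, prior, threshold=0.75)
label_95, seen_95, _ = stream_predict(stream, prior, threshold=0.95)
```

With this toy stream the stricter threshold needs more tweets before committing, which is the trade-off between confidence and prediction time that the slide plots.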
Batch vs. Online Performance
Summary
– Neighborhood content is useful*
– Neighborhoods constructed from friends, user mentions and retweets are most effective
– Signal is distributed in the neighborhood
– Streaming models > batch models
*Pennacchiotti and Popescu, 2011a, 2011b; Conover et al., 2011a, 2011b; Golbeck et al., 2011; Zamal et al., 2012
Thank you! Labeled Twitter network data for gender, age, political preference prediction: Code and pre-trained models available upon request: