Turning Down the Noise in the Blogosphere Khalid El-Arini, Gaurav Veda, Dafna Shahaf, Carlos Guestrin
Millions of blog posts published every day Some stories become disproportionately popular Hard to find information you care about
Our Goal: Coverage Turn down the noise in the blogosphere Select a small set of posts that covers the most important stories January 17, 2009
Our Goal: Personalization Tailor post selection to user tastes. Before: posts selected without personalization. "But, I like sports! I want articles like:" After: posts selected after personalization based on Zidane’s feedback.
Main Contributions Formalize notion of covering the blogosphere Near-optimal solution for post selection Learn a personalized coverage function No-regret algorithm for learning user preferences using limited feedback Evaluate on real blog data Conduct user studies and compare against:
Approach Overview Blogosphere … Feature Extraction Coverage Function Post Selection
Document Features Low level: words, noun phrases, named entities, e.g., Obama, China, peanut butter. High level: e.g., topics from a topic model, where a topic is a probability distribution over words. Example topics: Inauguration Topic, National Security Topic.
Coverage For a document $d$ and feature $f$: $\mathrm{cover}_d(f)$ = amount by which $d$ covers $f$. For a set of posts $A$ and feature $f$: $\mathrm{cover}_A(f)$ = amount by which $A$ covers $f$. [Figure: posts linked to the features they cover.]
Simple Coverage: MAX-COVER Find k posts that cover the most features. $\mathrm{cover}_A(f) = 1$ if at least one post in $A$ contains feature $f$. Example post text: "… at George Mason University in Fairfax, Va." Problems with MAX-COVER: Feature Significance in Document; Feature Significance in Corpus.
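As an illustration of the MAX-COVER baseline described on this slide, here is a minimal sketch of the standard greedy heuristic for maximum coverage. The slide does not say how MAX-COVER would be solved, so the greedy choice, the function name `greedy_max_cover`, and the toy posts and features are all assumptions made for illustration.

```python
def greedy_max_cover(posts, k):
    """Greedily pick up to k posts, each time adding the post that
    contributes the most features not yet covered.

    posts: dict mapping a post id to the set of features it contains
           (hypothetical toy representation).
    """
    covered = set()
    chosen = []
    remaining = dict(posts)
    for _ in range(k):
        # Post with the largest number of still-uncovered features.
        best = max(remaining, key=lambda p: len(remaining[p] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Toy data (made up): binary "contains feature" sets, as in MAX-COVER.
posts = {
    "a": {"obama", "china"},
    "b": {"obama", "inauguration", "washington"},
    "c": {"china", "trade"},
}
chosen, covered = greedy_max_cover(posts, k=2)
print(chosen, sorted(covered))
```

Note that a post counts as covering "washington" here even if it mentions the word only in passing, which is the first problem the slide raises.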
Feature Significance in Document Solution: Define a probabilistic coverage function $\mathrm{cover}_d(f) = P(\text{feature } f \mid \text{post } d)$; e.g., with topics as features, $\mathrm{cover}_d(f) \equiv P(\text{post } d \text{ is about topic } f)$. Example: a post that is not really about Washington has $\mathrm{cover}_d(\text{Washington}) = 0.01$.
Feature Significance in Corpus Some features are more important. Example features: Barack Obama, Carlos Guestrin. Want to cover the important features. Solution: Associate a weight $w_f$ with each feature $f$, e.g., frequency of feature in corpus. Cover an important feature using multiple posts.
Incremental Coverage $\mathrm{cover}_A(f)$ = probability at least one post in set $A$ covers feature $f$. Example posts: 1. Obama: Tight noose on Bin Laden as good as capture 2. What Obama’s win means for China. With individual coverages 0.5 and 0.4, coverage by both posts = $1 - P(\text{neither post covers } f) = 1 - (1 - 0.5)(1 - 0.4) = 0.7$. Coverage by a single post < 0.7 < sum of the two individual coverages: there is a gain due to covering using multiple posts, but with diminishing returns.
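The arithmetic on this slide can be checked with a short sketch. It assumes, as the slide's product form implies, that posts cover a feature independently; the function name `cover_set` is made up for illustration.

```python
def cover_set(post_covers):
    """Probability that at least one post covers the feature,
    given each post's individual coverage probability and
    assuming the posts cover the feature independently."""
    miss = 1.0
    for p in post_covers:
        miss *= 1.0 - p  # probability that this post also fails to cover it
    return 1.0 - miss


both = cover_set([0.5, 0.4])          # about 0.7, the slide's example
gain_second = both - cover_set([0.5])  # about 0.2: less than the 0.4 the second post offers alone
print(both, gain_second)
```

The second post adds only about 0.2 on top of the first, even though it would contribute 0.4 on its own, which is the diminishing-returns effect.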
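Post Selection Optimization Want to select a set of posts $A$ that maximizes $F(A) = \sum_{f \in U} w_f \, \mathrm{cover}_A(f)$, where $U$ is the feature set, $w_f$ are the weights on features, and $\mathrm{cover}_A(f)$ is the probability that set $A$ covers feature $f$ (the symbols $F$ and $U$ are labels chosen here for the objective and the feature set). This function is submodular. Exact maximization is NP-hard. The greedy algorithm leads to a (1 – 1/e) ≈ 63% approximation, i.e., a near-optimal solution. We use CELF (Leskovec et al., 2007).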
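A minimal sketch of greedy selection with lazy gain evaluation, the idea that CELF exploits, is given below. It assumes the objective is the weighted sum over features of the probability that the selected set covers each feature, with independent per-post coverage. The names `objective` and `lazy_greedy` and the toy numbers are illustrative, and this is not the authors' implementation.

```python
import heapq


def objective(selected, cover, weights):
    """Sum over features f of w_f * (1 - prod over selected d of (1 - cover_d(f)))."""
    total = 0.0
    for f, w in weights.items():
        miss = 1.0
        for d in selected:
            miss *= 1.0 - cover[d].get(f, 0.0)
        total += w * (1.0 - miss)
    return total


def lazy_greedy(cover, weights, k):
    """Greedy selection of up to k posts; marginal gains are recomputed
    only when a stale entry reaches the top of the heap. This is valid
    because submodularity means gains can only shrink as the set grows."""
    selected, current = [], 0.0
    # Heap entries: (-gain, post, size of the selected set when gain was computed).
    heap = [(-objective([d], cover, weights), d, 0) for d in cover]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, d, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            selected.append(d)  # gain is up to date, so d is the greedy choice
            current -= neg_gain
        else:
            gain = objective(selected + [d], cover, weights) - current
            heapq.heappush(heap, (-gain, d, len(selected)))
    return selected, current


# Toy data (made up): per-post coverage probabilities and feature weights.
weights = {"obama": 3.0, "china": 1.0, "sports": 1.0}
cover = {
    "p1": {"obama": 0.5},
    "p2": {"obama": 0.4, "china": 0.6},
    "p3": {"sports": 0.8},
}
selected, value = lazy_greedy(cover, weights, k=2)
print(selected, value)
```

In this toy run the gain of `p3` is never recomputed after the first pick, which is where the lazy variant saves work over plain greedy.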
Approach Overview Blogosphere Feature Extraction Coverage Function Post Selection Submodular function optimization
Evaluating Coverage Evaluate on real blog data from Spinn3r: 2-week period in January, ~200K posts per day (after pre-processing). Two variants of our algorithm: TDN+LDA, with high-level features (Latent Dirichlet Allocation topics); TDN+NE, with low-level features. User study involving 27 subjects to evaluate Topicality & Redundancy.
Topicality User Study Reference Stories: … Post for evaluation: Downed jet lifted from ice-laden Hudson River NEW YORK (AP) - The airliner that was piloted to a safe emergency landing in the Hudson… Is this post topical? i.e., is it related to any of the major stories of the day?
Results: Topicality [Figure: topicality by method; higher is better. TDN+LDA: LDA topics as features. TDN+NE: named entities and common noun phrases as features.] We do as well as Yahoo! and Google.
Evaluation: Redundancy 1. Israel unilaterally halts fire as rockets persist 2. Downed jet lifted from ice-laden Hudson River 3. Israeli-trained Gaza doctor loses three daughters and niece to IDF tank shell. Is this post redundant with respect to any of the previous posts?
Results: Redundancy [Figure: redundancy by method, including TDN+LDA and TDN+NE; lower is better.] Google performs poorly. We do as well as Yahoo!
Results: Coverage [Figure: two panels, Topicality (higher is better) and Redundancy (lower is better), each including TDN+LDA and TDN+NE.] Google: good topicality, high redundancy. Yahoo!: performs well on both, but uses rich features (CTR, search trends, user voting, etc.). We do as well as Yahoo! using only post content.
Results: January 22, 2009
Personalization People have varied interests Our Goal: Learn a personalized coverage function using limited user feedback Barack Obama Britney Spears
Approach Overview Blogosphere Feature Extraction Post Selection Personalization Coverage Function Personalized coverage Fn. Pers. Post Selection
Modeling User Preferences $\pi_f$ represents user preference for feature $f$. Want to learn preference $\pi$ over the features. [Figure: bar charts of $\pi_1, \dots, \pi_5$ for a sports fan and for a politico.] Annotations: user preference ($\pi_f$); importance of feature in corpus ($w_f$).
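One plausible way to write the personalized objective these annotations describe is sketched below. It assumes that the per-feature user preference, written $\pi_f$, simply rescales the corpus weight $w_f$ inside the weighted-coverage sum. The symbols $F_\pi$ and $U$ (the feature set) are labels chosen here, and the slide itself does not show this formula.

```latex
% Assumed form: user preference \pi_f multiplies the corpus weight w_f.
F_\pi(A) \;=\; \sum_{f \in U} \pi_f \, w_f \, \mathrm{cover}_A(f)
```

Under this reading, setting every $\pi_f$ to the same constant gives back the unpersonalized objective up to scaling, so a sports fan and a politico differ only in how $\pi$ reweights the features.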
Learning User Preferences Before any feedback After 1 day of personalization After 2 days of personalization Multiplicative Weights Update
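The slide names a multiplicative weights update but does not give the rule, so the following is only a generic sketch of that family of updates, not the authors' algorithm. It assumes ratings of +1 (like) or -1 (dislike), an exponential update with a made-up learning rate `eta`, and normalization of the preference vector (called `pi` here) after each round.

```python
import math


def update_preferences(pi, rated_posts, eta=0.5):
    """One multiplicative-weights-style update of per-feature preferences.

    pi:          dict feature -> current preference weight
    rated_posts: list of (cover_d, rating), where cover_d maps features to
                 coverage probabilities and rating is +1 or -1 (assumed encoding)
    eta:         learning rate (hypothetical value)
    """
    updated = {}
    for f, weight in pi.items():
        # Features heavily covered by liked posts get a positive signal,
        # features covered by disliked posts a negative one.
        signal = sum(rating * cover_d.get(f, 0.0) for cover_d, rating in rated_posts)
        updated[f] = weight * math.exp(eta * signal)
    total = sum(updated.values())
    return {f: v / total for f, v in updated.items()}


# Toy example (made up): a user likes a sports post and dislikes a politics post.
pi = {"sports": 0.5, "politics": 0.5}
feedback = [({"sports": 0.9}, +1), ({"politics": 0.8}, -1)]
new_pi = update_preferences(pi, feedback)
print(new_pi)
```

Repeating the update over successive days of feedback shifts weight toward the features of liked posts, which is the qualitative behaviour the slide's before/after panels describe.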
No-Regret Learning Given the user ratings in advance, compare the learned $\pi$ with the optimal fixed $\pi$. Theorem: For TDN, avg(optimal fixed $\pi$) – avg($\pi$ learned using TDN) → 0, i.e., we achieve no-regret.
Approach Overview Blogosphere Feature Extraction Pers. Post Selection Personalization Submodular function optimization User feedback Online learning Personalized coverage fn.
Simulating a Sports Fan Simulated user likes all posts from Fan House (a sports blog). [Figure: personalization ratio vs. days of sports personalization for Dead Spin (Sports Blog), Fan House (Sports Blog), and Huffington Post (Politics Blog), with an unpersonalized reference.] Personalization ratio = Personalized objective / Unpersonalized objective.
Personalizing for India Like all posts about India Dislike everything else After 5 epochs: 1. India keeps up pressure on Pakistan over Mumbai After 10 epochs: 1. Pakistan’s shift alarms the U.S. 3. India among 20 most dangerous places in world After 15 epochs: 1. 26/11 effect: Pak delegation gets cold vibes 3. Pakistan flaunts its all weather ties with China 4. Benjamin Button gets 13 Oscar nominations [mentions Slumdog Millionaire] 8. Miliband was not off-message, he toed the UK line on Kashmir
Personalization User Study Blogosphere … (1) Generate personalized posts; obtain user ratings. (2) Generate posts without using feedback; obtain user ratings. …
Personalization Evaluation [Figure: Unpersonalized vs. Personalized; higher is better.] Users like personalized posts more than unpersonalized posts.
Summary Formalized covering the blogosphere Near-optimal optimization algorithm Learned a personalized coverage function No-regret learning algorithm Evaluated on real blog data Coverage: using only post content, we perform as well as other techniques that use richer features Successfully tailor post selection to user preferences