Turning Down the Noise in the Blogosphere Khalid El-Arini, Gaurav Veda, Dafna Shahaf, Carlos Guestrin.

Presentation transcript:

Millions of blog posts published every day Some stories become disproportionately popular Hard to find information you care about

Our Goal: Coverage Turn down the noise in the blogosphere Select a small set of posts that covers the most important stories January 17, 2009

Our Goal: Personalization. Tailor post selection to user tastes. Posts selected without personalization: "But, I like sports! I want articles like:" After personalization based on Zidane’s feedback

Main Contributions Formalize notion of covering the blogosphere Near-optimal solution for post selection Learn a personalized coverage function No-regret algorithm for learning user preferences using limited feedback Evaluate on real blog data Conduct user studies and compare against:

Approach Overview Blogosphere … Feature Extraction Coverage Function Post Selection

Document Features. Low level: words, noun phrases, named entities (e.g., Obama, China, peanut butter). High level: e.g., topics from a topic model, where a topic is a probability distribution over words (examples: Inauguration Topic, National Security Topic).

Coverage. For a single post: cover_d(f) = amount by which document d covers feature f. For a set of posts: cover_A(f) = amount by which set A covers feature f.

Simple Coverage: MAX-COVER. Find k posts that cover the most features: cover_A(f) = 1 if at least one post in A contains feature f. Example snippet: … at George Mason University in Fairfax, Va. Problems with MAX-COVER: feature significance in document; feature significance in corpus.

Feature Significance in Document. Solution: define a probabilistic coverage function cover_d(f) = P(feature f | post d); e.g., with topics as features, cover_d(f) ≡ P(post d is about topic f). Example: for a post that is not really about Washington, cover_d(Washington) = 0.01.

Feature Significance in Corpus. Some features are more important. Example features: Barack Obama, Carlos Guestrin. Want to cover the important features. Solution: associate a weight w_f with each feature f, e.g., frequency of feature in corpus. Cover an important feature using multiple posts.

Incremental Coverage. cover_A(f) = probability at least one post in set A covers feature f. Example with two posts (1. Obama: Tight noose on Bin Laden as good as capture; 2. What Obama’s win means for China): cover_A(f) = 1 – P(neither post covers f) = 1 – (1 – 0.5)(1 – 0.4) = 0.7. Each post alone covers f by less than 0.7, and 0.7 is less than the sum of the two individual coverages (0.5 + 0.4 = 0.9): there is a gain due to covering f using multiple posts, but with diminishing returns.
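
The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming that posts cover a feature independently (which is what the product form implies); the function name `cover_set` is hypothetical and not taken from the talk.

```python
from math import prod

def cover_set(per_post_probs):
    """Probability that at least one post covers the feature.

    per_post_probs: iterable of cover_d(f) values, one per post in the set.
    Assumes posts cover the feature independently (an assumption of this sketch).
    """
    return 1.0 - prod(1.0 - p for p in per_post_probs)

# The two-post example from the slide: 1 - (1 - 0.5)(1 - 0.4)
both = cover_set([0.5, 0.4])

# Diminishing returns: the second post adds less once the first is selected.
gain_alone = cover_set([0.4]) - cover_set([])
gain_after_first = cover_set([0.5, 0.4]) - cover_set([0.5])

print(round(both, 2), round(gain_alone, 2), round(gain_after_first, 2))
```

The second post is worth 0.4 on its own but only 0.2 once the first post is already in the set, which is the diminishing-returns behavior the slide points to.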

Post Selection Optimization. Want to select a set of posts A that maximizes F(A) = Σ_f w_f · cover_A(f), where the sum runs over the feature set, w_f are the weights on features, and cover_A(f) is the probability that set A covers feature f. This function is submodular. Exact maximization is NP-hard. Greedy algorithm leads to a (1 – 1/e) ~ 63% approximation, i.e., a near-optimal solution. We use CELF (Leskovec et al 2007).
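
The greedy selection step can be sketched as follows. This is plain greedy, not CELF (CELF speeds up the same selection with lazy evaluation of marginal gains), and it assumes the objective is the weighted sum of per-feature set coverage with independent posts. The post names, features, and numbers are made-up toy data.

```python
from math import prod

def objective(selected, cover, weights):
    """Weighted coverage: sum over features of w_f * P(some selected post covers f)."""
    total = 0.0
    for f, w in weights.items():
        miss = prod(1.0 - cover[d].get(f, 0.0) for d in selected)
        total += w * (1.0 - miss)
    return total

def greedy_select(cover, weights, k):
    """Pick k posts, each time adding the post with the largest marginal gain."""
    selected = []
    remaining = set(cover)
    for _ in range(min(k, len(cover))):
        base = objective(selected, cover, weights)
        # sorted() makes tie-breaking deterministic
        best = max(
            sorted(remaining),
            key=lambda d: objective(selected + [d], cover, weights) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): cover[d][f] plays the role of cover_d(f).
cover = {
    "post_a": {"obama": 0.5, "china": 0.1},
    "post_b": {"obama": 0.4, "china": 0.4},
    "post_c": {"sports": 0.9},
}
weights = {"obama": 0.5, "china": 0.3, "sports": 0.2}

chosen = greedy_select(cover, weights, k=2)
print(chosen)
```

In this toy run the second pick is the sports post rather than a second politics post: once "obama" and "china" are partly covered, the remaining gain from covering them again is smaller than the gain from a new feature.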

Approach Overview Blogosphere Feature Extraction Coverage Function Post Selection Submodular function optimization

Evaluating Coverage. Evaluate on real blog data from Spinn3r: 2 week period in January, ~200K posts per day (after pre-processing). Two variants of our algorithm: TDN+LDA, with high level features (Latent Dirichlet Allocation topics), and TDN+NE, with low level features. User study involving 27 subjects to evaluate topicality & redundancy.

Topicality User Study. Reference stories … Post for evaluation: Downed jet lifted from ice-laden Hudson River. NEW YORK (AP) - The airliner that was piloted to a safe emergency landing in the Hudson… Is this post topical? i.e., is it related to any of the major stories of the day?

Results: Topicality. [Chart: topicality by method; higher is better] TDN+LDA: LDA topics as features. TDN+NE: named entities and common noun phrases as features. We do as well as Yahoo! and Google.

Evaluation: Redundancy. 1. Israel unilaterally halts fire as rockets persist 2. Downed jet lifted from ice-laden Hudson River 3. Israeli-trained Gaza doctor loses three daughters and niece to IDF tank shell. Is this post redundant with respect to any of the previous posts?

Results: Redundancy. [Chart: redundancy by method, including TDN+LDA and TDN+NE; lower is better] Google performs poorly. We do as well as Yahoo!

Results: Coverage. [Charts: topicality (higher is better) and redundancy (lower is better), including TDN+LDA and TDN+NE] Google: good topicality, high redundancy. Yahoo!: performs well on both, but uses rich features (CTR, search trends, user voting, etc.). We do as well as Yahoo! using only post content.

Results: January 22, 2009

Personalization People have varied interests Our Goal: Learn a personalized coverage function using limited user feedback Barack Obama Britney Spears

Approach Overview Blogosphere Feature Extraction Post Selection Personalization Coverage Function Personalized coverage Fn. Pers. Post Selection

Modeling User Preferences. π_f represents user preference for feature f. Want to learn preference π over the features. [Figure: example preference vectors π_1 … π_5 for a sports fan and for a politico] In the personalized coverage function, each feature is weighted by both the user preference π_f and the importance of the feature in the corpus w_f.

Learning User Preferences Before any feedback After 1 day of personalization After 2 days of personalization Multiplicative Weights Update
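
The slide names a multiplicative weights update but the transcript does not give the exact rule. The sketch below is one common form of such an update, written under stated assumptions: ratings are +1 (like) or -1 (dislike), a feature's weight is scaled by an exponential factor proportional to how strongly the rated post covers it, and the weights are renormalized to sum to 1. The function name, the learning rate `eta`, and the toy numbers are all hypothetical and not the authors' exact method.

```python
import math

def mw_update(pi, post_cover, rating, eta=0.5):
    """One multiplicative-weights step on the preference vector.

    pi:         dict feature -> current preference weight (sums to 1)
    post_cover: dict feature -> coverage of that feature by the rated post
    rating:     +1 for a liked post, -1 for a disliked post (assumed encoding)
    eta:        learning rate (assumed value)
    """
    updated = {
        f: p * math.exp(eta * rating * post_cover.get(f, 0.0))
        for f, p in pi.items()
    }
    total = sum(updated.values())
    return {f: v / total for f, v in updated.items()}

# Start with no preference, then like a sports post twice.
pi = {"sports": 0.5, "politics": 0.5}
sports_post = {"sports": 0.9}

after_one = mw_update(pi, sports_post, rating=+1)
after_two = mw_update(after_one, sports_post, rating=+1)
print(after_one, after_two)
```

Features that the liked post covers gain weight relative to the rest, and repeated feedback compounds, which matches the before / after 1 day / after 2 days progression shown on the slide.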

No-Regret Learning. Given the user ratings in advance, compare the learned π with the optimal fixed π. Theorem: For TDN, avg(optimal fixed π) – avg(π learned using TDN) → 0, i.e., we achieve no-regret.

Approach Overview Blogosphere Feature Extraction Pers. Post Selection Personalization Submodular function optimization User feedback Online learning Personalized coverage fn.

Simulating a Sports Fan: likes all posts from Fan House (a sports blog). [Chart: personalization ratio vs. days of sports personalization for Dead Spin (Sports Blog), Fan House (Sports Blog) and Huffington Post (Politics Blog), with the unpersonalized level shown for reference] Personalization Ratio = Personalized Objective / Unpersonalized Objective
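
The personalization ratio can be illustrated with a small sketch. It assumes that the personalized objective multiplies each feature's corpus weight by a learned user preference and that the unpersonalized objective uses the corpus weights alone; any normalization used in the actual experiments is not given in the transcript. All names and numbers below are hypothetical.

```python
from math import prod

def weighted_coverage(posts, cover, feature_weights):
    """Sum over features of weight * P(at least one post covers the feature)."""
    total = 0.0
    for f, w in feature_weights.items():
        miss = prod(1.0 - cover[d].get(f, 0.0) for d in posts)
        total += w * (1.0 - miss)
    return total

def personalization_ratio(posts, cover, corpus_weights, preferences):
    """Personalized objective divided by unpersonalized objective for a set of posts.

    Assumes the unpersonalized objective is nonzero for the given posts.
    """
    personalized = {
        f: preferences.get(f, 0.0) * w for f, w in corpus_weights.items()
    }
    return (
        weighted_coverage(posts, cover, personalized)
        / weighted_coverage(posts, cover, corpus_weights)
    )

# Toy data: one post from a sports blog, one from a politics blog.
cover = {
    "sports_blog_post": {"sports": 0.9},
    "politics_blog_post": {"politics": 0.8},
}
corpus_weights = {"sports": 0.4, "politics": 0.6}
sports_fan = {"sports": 2.0, "politics": 0.5}  # assumed learned preferences

sports_ratio = personalization_ratio(["sports_blog_post"], cover, corpus_weights, sports_fan)
politics_ratio = personalization_ratio(["politics_blog_post"], cover, corpus_weights, sports_fan)
print(sports_ratio, politics_ratio)
```

Under these assumptions a ratio above 1 means the user's preferences favor that blog's posts relative to the unpersonalized objective, and a ratio below 1 means they are down-weighted.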

Personalizing for India Like all posts about India Dislike everything else After 5 epochs: 1. India keeps up pressure on Pakistan over Mumbai After 10 epochs: 1. Pakistan’s shift alarms the U.S. 3. India among 20 most dangerous places in world After 15 epochs: 1. 26/11 effect: Pak delegation gets cold vibes 3. Pakistan flaunts its all weather ties with China 4. Benjamin Button gets 13 Oscar nominations [mentions Slumdog Millionaire] 8. Miliband was not off-message, he toed the UK line on Kashmir

Personalization User Study. Blogosphere … Personalized condition: generate personalized posts, obtain user ratings. Unpersonalized condition: generate posts without using feedback, obtain user ratings.

Personalization Evaluation. [Chart: unpersonalized vs. personalized; higher is better] Users like personalized posts more than unpersonalized posts.

Summary Formalized covering the blogosphere Near-optimal optimization algorithm Learned a personalized coverage function No-regret learning algorithm Evaluated on real blog data Coverage: using only post content, we perform as well as other techniques that use richer features Successfully tailor post selection to user preferences