Representing Documents Through Their Readers

Presentation transcript:

Representing Documents Through Their Readers. Khalid El-Arini, Min Xu, Emily B. Fox, Carlos Guestrin.

overloaded by news: more than a million news articles and blog posts are generated every hour (Spinn3r statistic, www.spinn3r.com).

a news recommendation engine: given a corpus and a user, documents are given a vector representation (bag of words, LDA topics, etc.) [El-Arini+ KDD 2009] [Li+ WWW 2010] [Morales+ WSDM 2012]

an observation: most common representations don't naturally line up with user interests. Fine-grained representations are too specific, while high-level topics (e.g., from LDA) are semantically vague and can be inconsistent over time.

goal: improve recommendation performance through a more natural document representation.

an opportunity: news is now social. In 2012, The Guardian announced that more readers visit its site via Facebook than via Google search.

badges: the terms with which users publicly describe themselves in their Twitter profiles.

our approach: a document representation based on how readers publicly describe themselves.

From many such tweets, we learn that someone who identifies with the badge music reads articles with words like these [figure: top words associated with the music badge].

Given: a training set of tweeted news articles from a specific period of time (3 million articles in our experiments).
1. Learn a badge dictionary from the training set.
2. Use the badge dictionary to encode new documents (words → badges).

advantages
- Interpretable: clear labels that correspond to user interests
- Higher-level than words
- Semantically consistent over time (e.g., politics)

Given: a training set of tweeted news articles from a specific period of time (3 million articles in our experiments).
1. Learn a badge dictionary from the training set.
2. Use the badge dictionary to encode new documents (words → badges).

learning the dictionary
Training data (for time period t): pairs consisting of a bag-of-words representation of a tweeted document and the badges identified in the Twitter profile of its tweeter (e.g., an article about Fleetwood Mac and Stevie Nicks, tweeted by a user whose profile yields the badges linux, music, gig, cycling).
Model: the word vector of a document (V x 1) is approximated by the product of a sparse, non-negative dictionary B (V x K) and the badge indicator vector of its tweeter (K x 1).
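
Spelled out, one plausible form of the dictionary-learning objective sketched on this slide (a reconstruction from the stated dimensions; the paper's exact regularizer may differ) is:

```latex
\min_{B \geq 0} \; \sum_{d=1}^{D} \left\lVert y_d - B\, b_d \right\rVert_2^2 \;+\; \lambda \sum_{v=1}^{V} \sum_{k=1}^{K} B_{vk}
```

Here y_d (V x 1) is the bag-of-words vector of document d, b_d (K x 1) indicates the badges in the tweeter's profile, and B (V x K) is the dictionary; since B is constrained non-negative, the sparsity penalty reduces to the sum of its entries.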

learning the dictionary: optimization. Efficiently solved via projected stochastic gradient descent, which lets us operate on streaming data.
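
A minimal sketch of that streaming optimization, assuming the reconstructed objective above; the function name, learning-rate schedule, and initialization are illustrative, not from the paper:

```python
import numpy as np

def learn_badge_dictionary(stream, V, K, lam=0.1, lr=0.01):
    """Projected SGD for min_{B>=0} sum_d ||y_d - B b_d||^2 + lam * sum(B).

    stream yields (y, b) pairs: y is a (V,) bag-of-words count vector,
    b is a (K,) binary badge indicator for the tweeter.
    """
    B = np.abs(np.random.randn(V, K)) * 0.01       # start non-negative
    for y, b in stream:
        residual = y - B @ b                       # (V,) reconstruction error
        grad = -2.0 * np.outer(residual, b) + lam  # loss gradient + L1 subgradient
        B -= lr * grad                             # stochastic gradient step
        np.maximum(B, 0.0, out=B)                  # project onto non-negative orthant
    return B
```

Because each update touches only one (document, badges) pair, the dictionary can be refreshed as new tweets arrive, matching the streaming setting the slide describes.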

examining B: top words in the learned dictionary columns for badges such as music, Biden, soccer, Labour, and tennis (September 2012). [figure omitted]

badges over time: dictionary columns for the same badges (e.g., music, Biden) learned from September 2010 versus September 2012 data. [figure omitted]

Given: a training set of tweeted news articles from a specific period of time (3 million articles in our experiments).
1. Learn a badge dictionary from the training set.
2. Use the badge dictionary to encode new documents (words → badges).

coding the documents
Can we just re-use our objective, but fix B? Problem case: two articles about Barack Obama playing basketball. The lasso problem arbitrarily codes one as {Obama, sports} and the other as {politics, basketball}; there is no incentive to pick both "Obama" and "politics" (or both "sports" and "basketball"), since they cover similar words. This leads to extremely related articles being coded as totally dissimilar. How do we fix this?

a better coding
The problem occurs because vanilla lasso ignores relationships between badges. Idea: use badge co-occurrence statistics from Twitter. Define a weight w_{s,t} that is high for badges s and t that co-occur often in Twitter profiles, and penalize differences between their codes with the graph-guided fused lasso [Kim, Sohn, Xing 2009].
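
A sketch of that coding step, using cvxpy as a convenient solver (the paper's own optimizer may differ); `edges` and `weights` are assumed to come from badge co-occurrence counts on Twitter, and the non-negativity of the code is an assumption consistent with the non-negative dictionary:

```python
import cvxpy as cp
import numpy as np

def encode_document(y, B, edges, weights, lam=0.1, gamma=0.1):
    """Graph-guided fused lasso coding of word vector y in badge dictionary B.

    edges: list of (s, t) badge index pairs; weights[i] = w_{s,t}, high for
    badges that often co-occur in Twitter profiles, so related badges are
    pushed toward similar code values instead of arbitrary lasso choices.
    """
    K = B.shape[1]
    a = cp.Variable(K, nonneg=True)                # badge code (assumed non-negative)
    fit = cp.sum_squares(y - B @ a)                # reconstruction error
    sparsity = lam * cp.norm1(a)                   # vanilla lasso penalty
    fusion = gamma * cp.sum(
        cp.multiply(np.asarray(weights),
                    cp.hstack([cp.abs(a[s] - a[t]) for s, t in edges])))
    cp.Problem(cp.Minimize(fit + sparsity + fusion)).solve()
    return a.value
```

With a high weight between, say, "Obama" and "politics", the fusion term makes it cheaper to activate both badges than to pick one arbitrarily, which is exactly the failure mode described on the previous slide.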

recap
1. Learn a badge dictionary from the training set.
2. Use the badge dictionary to encode new documents (words → badges).

experimental results: case study on political columnists; user study; offline metrics.

coding columnists: we downloaded articles from July 2012 for fourteen prominent political columnists (e.g., Nicholas Kristof, Maureen Dowd) and coded the articles with the badge dictionary learned from that same month.

a spectrum of pundits: limiting the badges to progressive and TCOT ("top conservatives on Twitter"), can we predict the political alignments of each columnist's likely readers? [figure: columnists ordered along an axis toward "more conservative"]
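
One way to turn those two badge codes into the lean score the slide plots; this is purely illustrative, and the paper's exact scoring may differ:

```python
import numpy as np

def political_lean(article_codes, tcot_idx, progressive_idx):
    """Average (TCOT - progressive) badge weight over a columnist's articles.

    article_codes: (n_articles, K) matrix of badge codes. A positive score
    suggests a more conservative likely readership, a negative score a more
    progressive one.
    """
    codes = np.asarray(article_codes)
    return float(np.mean(codes[:, tcot_idx] - codes[:, progressive_idx]))
```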

experimental results
The user study shows that badges are a better document representation than LDA topics or tf-idf when recommending news articles across time. Offline analysis shows that badges are more thematically coherent than LDA topics.

user study
The fundamental question: which representation best captures user preferences over time? Study on Amazon Mechanical Turk with 112 users. Steps:
1. Show users 20 random articles from the Guardian, from time period 1, and obtain ratings.
2. Pick a random representation (tf-idf, LDA, badges).
3. Represent user preferences as the mean of the liked articles.
4. Use probabilistic max-cover* to select 10 related articles from a second time period (a simplified sketch of steps 3 and 4 follows below).
* [El-Arini+ KDD 2009]
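
A simplified stand-in for steps 3 and 4: the real system uses probabilistic max-cover [El-Arini+ KDD 2009]; here is a plain greedy coverage heuristic over badge features, with all names and the [0, 1] scaling assumed for illustration:

```python
import numpy as np

def recommend(candidates, liked, n=10):
    """Greedily pick n articles that cover the user's badge preferences.

    candidates: (D, K) badge codes for period-2 articles, scaled to [0, 1].
    liked: (L, K) badge codes of the articles the user liked in period 1.
    """
    pref = np.asarray(liked).mean(axis=0)          # step 3: mean of liked articles
    cand = np.clip(np.asarray(candidates), 0.0, 1.0)
    uncovered = np.ones_like(pref)                 # P(badge k not yet covered)
    chosen = []
    for _ in range(n):
        # marginal gain: preference mass each article would newly cover
        gains = (cand * uncovered * pref).sum(axis=1)
        gains[chosen] = -np.inf                    # never pick an article twice
        best = int(np.argmax(gains))
        chosen.append(best)
        uncovered *= 1.0 - cand[best]              # update coverage probabilities
    return chosen
```

The diminishing `uncovered` term rewards sets of articles that jointly span the user's interests rather than ten near-duplicates of the single best match.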

user study: results. [figure omitted; chart annotation indicates direction of "better"]

summary
- Novel document representation based on user attributes and sharing behavior
- Interpretable and consistent over time
- Case studies provide insight into journalism and politics
- Improved recommendation of news articles to real users