Contextual Advertising by Combining Relevance with Click Feedback
Deepak Agarwal, joint work with Deepayan Chakrabarti & Vanja Josifovski
Yahoo! Research
WWW'08, Beijing, China, 24th April 2008

Outline
- Motivating application and challenges: contextual advertising
- Semantic versus predictive models: pros and cons
- Our approach: blend semantic with predictive
- Model description: logistic regression, feature selection; model structure amenable to fast scoring at run time
- Experimental results
- Ongoing work

Outline 1: Motivating Application, Background and Challenges

Motivating Application
Problem: match ads to queries.
- Sponsored search: the query is a short piece of text input by the user; user intent is better expressed and less noisy
- Contextual advertising: the query is a webpage; generally long, noisy, user intent less clear — a harder matching problem

Challenges
- Serve ads to maximize revenue (CTR): serve the most relevant ads in a given context; user feedback arrives as clicks in different contexts
- Automation is a must for profitability: billions of opportunities, millions of ads; high volume and low marginal cost make this a lucrative business
- Automation through algorithms/models: accuracy requires scalable procedures over massive data; the model structure must allow scoring ads under strict latency requirements (a few ms)

Classical Approach: Semantic
Serve shoe ads on shoe pages.
- Models: information retrieval; get relevant docs (ads) for a query (webpage)
- Simple vector space model: q = (t_1, w_1; …; t_n, w_n), a = (s_1, v_1; …; s_m, v_m)
  Cos(q, a) = Σ_{t ∈ q ∩ a} w_t · v_t / (|q| |a|)
- The weights w, v are tf-idf: frequency is rewarded within the doc and penalized across the corpus
- Higher score → more relevance
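A minimal sketch of the vector space scoring above, assuming whitespace tokenization and a smoothed idf (both assumptions; the deck does not specify the exact weighting variant):

```python
import math
from collections import Counter

def tfidf_vector(text, idf):
    """tf-idf weights for one document: reward in-doc frequency, penalize corpus frequency."""
    tf = Counter(text.lower().split())
    return {t: f * idf.get(t, 0.0) for t, f in tf.items()}

def cosine(q, a):
    """Cos(q, a) = sum of shared-term weight products, normalized by vector lengths."""
    dot = sum(q[t] * a[t] for t in set(q) & set(a))
    nq = math.sqrt(sum(w * w for w in q.values()))
    na = math.sqrt(sum(w * w for w in a.values()))
    return dot / (nq * na) if nq and na else 0.0

# Toy corpus to derive idf (smoothed; an assumption, not necessarily the system's formula).
docs = ["running shoes sale", "golf clubs sale", "luxury watch rolex"]
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))
idf = {t: math.log((1 + N) / (1 + n)) + 1 for t, n in df.items()}

page = tfidf_vector("best running shoes for marathon training", idf)
ad = tfidf_vector("running shoes sale buy now", idf)
print(cosine(page, ad))  # higher score -> more relevant ad for this page
```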

Semantic: Pros & Cons
Pros:
- Training is simple and scalable: vocabulary (stop-word removal, stemming) and corpus statistics
- Serving with low latency: evaluates millions of candidate ads in a few ms via clever algorithms (Broder et al.)
Cons:
- Does not always capture context
- Ignores clicks — active user feedback is available; can we use it?

Predictive Approach: Clicks
A new, challenging research area: learn from historic clicks on ads, an indicator of overall relevance.
- Rank ads by CTR = P(click | ad, context)
- Estimating CTR is a difficult statistical problem: high dimensionality and sparseness (too many combinations)
- Reduce (page, ad) → (page features, ad features)
- Bias-variance tradeoff when selecting features: coarse features are stable but less precise; fine features have high variance

Statistical Challenges (contd.)
- Retrospective data is biased: if we never showed ads with the word "Rolex" on pages with the word "Golf", how will we learn this match?
- What is irrelevant? Labeling negatives is hard: some users never click on ads no matter what
- Good models may be complex: scalability while training (grid computing helps); not all models are index friendly at serving time, and quick evaluation at serve time improves the system

When Semantic Meets Predictive
- Semantic provides domain knowledge: feature selection driven by semantic knowledge
- Predictive "enhances" semantic: "correction" terms adjust the semantic score to match click feedback; fall back on semantic when the click signal is weak
- Model is scalable to train (grid computing) and fast to evaluate at run time: faster evaluation → more candidates evaluated at serve time (accuracy versus coverage)

Outline 2: Modeling Approach

Predictive Regression Model
- Region-specific splitting for page and ad: page "regions" are title, headers, boldface text, metadata, etc.; ad "regions" are title, body, etc.
- Features: words, phrases, and classes in different regions; a word match in the title is more important than one in the body
- Illustration here: word features in title regions; the extension to multiple regions and multiple feature types is routine (experiments to appear in a future version)

Logistic Regression: Word Features
Model clicks/non-clicks with logistic regression (training & test data: events with clicks only).
  y_{pa} ~ Ber(π_{pa}),  where π_{pa} is the CTR of ad a on page p
  logit(π_{pa}) = μ + Σ_w α_w M_{p,w} + Σ_w β_w M_{a,w} + Σ_w γ_w I_{p,a,w}
- α_w: main effect for the page (overall popularity)
- β_w: main effect for the ad (overall popularity)
- γ_w: interaction effect (words shared by page and ad)
- Gaussian priors on the model parameters penalize sparse features

Feature weights "correct" relevance
  M_{p,w} = tf_{p,w} · 1(w ∈ p)
  M_{a,w} = tf_{a,w} · 1(w ∈ a)
  I_{p,a,w} = tf_{p,w} · tf_{a,w} · 1(w ∈ p) · 1(w ∈ a)
So IR-based term-frequency measures are taken into account.
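A minimal sketch of the feature construction and model above, assuming a fixed selected-word list and using scikit-learn's L2-penalized logistic regression as a stand-in for the Gaussian priors (both are assumptions; the deck trains a custom Newton-Raphson on Hadoop, described later):

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

WORDS = ["shoes", "running", "golf", "sale"]  # hypothetical selected word list

def features(page_text, ad_text):
    """Per selected word w: M_{p,w}, M_{a,w}, and interaction I_{p,a,w}."""
    tf_p = Counter(page_text.lower().split())
    tf_a = Counter(ad_text.lower().split())
    m_p = [tf_p[w] for w in WORDS]             # page main-effect features
    m_a = [tf_a[w] for w in WORDS]             # ad main-effect features
    i_pa = [tf_p[w] * tf_a[w] for w in WORDS]  # interaction: word on both sides
    return m_p + m_a + i_pa

# Toy click/no-click events: (page, ad, clicked)
events = [
    ("running shoes marathon", "running shoes sale", 1),
    ("running shoes marathon", "golf clubs sale", 0),
    ("golf swing tips", "golf clubs sale", 1),
    ("golf swing tips", "running shoes sale", 0),
]
X = np.array([features(p, a) for p, a, _ in events])
y = np.array([c for _, _, c in events])

# L2 penalty ~ zero-mean Gaussian prior on the coefficients (an assumption;
# the deck does not specify the prior variance).
model = LogisticRegression(C=1.0).fit(X, y)
print(model.predict_proba(X)[:, 1])  # estimated CTRs
```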

How to Select Words?
- Nearly 110k words overall in our training data, after stop-word removal and stemming
- Learning parameters for every word would be expensive and would overfit
- We use simple feature selection strategies: select the top k words

Word Selection: Data Based
- Define an interaction measure for each word: higher values for words with higher-than-expected CTR when they occur on both page and ad (a sketch follows)
- Remove words served or clicked too few times, for robustness
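The deck gives the intuition but not the formula; a minimal sketch of one plausible instantiation (the lift-style ratio and the traffic cutoff are assumptions):

```python
from collections import defaultdict

def interaction_measure(events, min_impressions=100):
    """Per-word CTR lift on events where the word appears on BOTH page and ad.

    One plausible reading of the deck's 'higher-than-expected CTR' measure
    (the exact formula is an assumption): observed clicks divided by clicks
    expected under the overall CTR, with a minimum-traffic cutoff.
    events: iterable of (page_text, ad_text, clicked).
    """
    total_clicks = sum(c for _, _, c in events)
    overall_ctr = total_clicks / len(events)
    impr, clicks = defaultdict(int), defaultdict(int)
    for page, ad, clicked in events:
        shared = set(page.lower().split()) & set(ad.lower().split())
        for w in shared:
            impr[w] += 1
            clicks[w] += clicked
    return {
        w: (clicks[w] / impr[w]) / overall_ctr
        for w in impr
        if impr[w] >= min_impressions  # robustness: drop rarely served words
    }

# scores = interaction_measure(events)
# top_k = sorted(scores, key=scores.get, reverse=True)[:1000]
```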

Word Selection: Relevance Based
- Compute the average tf-idf score of each word, separately over pages and over ads; higher values imply higher relevance
- Variant 1: rank by the geometric mean of the page-side and ad-side tf-idf scores
- Variant 2: rank by page-side and by ad-side tf-idf separately, then take the union
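A minimal sketch of the geometric-mean variant, assuming avg_tfidf_page and avg_tfidf_ad are dicts of per-word average tf-idf scores (the names are hypothetical):

```python
import math

def relevance_rank(avg_tfidf_page, avg_tfidf_ad, k=1000):
    """Rank words by the geometric mean of page-side and ad-side average tf-idf."""
    common = set(avg_tfidf_page) & set(avg_tfidf_ad)
    score = {w: math.sqrt(avg_tfidf_page[w] * avg_tfidf_ad[w]) for w in common}
    return sorted(score, key=score.get, reverse=True)[:k]
```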

Best Word Selection Scheme
- Two methods: data based and relevance based; we picked the top 1000 words by each measure
- Data-based selection gives better results
[Figure: precision-recall curves comparing the two word-selection schemes]

Semantic Similarity Score
- Word features have low coverage; use semantic similarity as a fallback mechanism
- How to map the cosine score onto the logit scale? Create score bins (100 points per bin) and plot mean cosine score vs. logit(CTR)
- The relationship is quadratic
[Figure: binned cosine score vs. logit(CTR), showing a quadratic fit]
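A minimal sketch of the binning-and-fit step, assuming NumPy arrays of per-event cosine scores and click labels (np.polyfit as the curve fitter is an assumption; the deck only states the relationship is quadratic):

```python
import numpy as np

def fit_cosine_to_logit(cos_scores, clicks, bin_size=100):
    """Bin events by cosine score, compute logit(CTR) per bin, fit a quadratic."""
    order = np.argsort(cos_scores)
    xs, ys = [], []
    for start in range(0, len(order), bin_size):
        idx = order[start:start + bin_size]
        ctr = clicks[idx].mean()
        if 0 < ctr < 1:  # logit undefined at 0 or 1; skip degenerate bins
            xs.append(cos_scores[idx].mean())
            ys.append(np.log(ctr / (1 - ctr)))
    return np.polyfit(xs, ys, deg=2)  # coefficients of the quadratic map

# coefs = fit_cosine_to_logit(cos, y); then either use [cos, cos**2] as
# features, or np.polyval(coefs, cos) as a prior-log-odds offset (next slide).
```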

Incorporating Similarity
The quadratic relationship is used in two ways:
- Put cosine and cosine² in as features
- Add it as an offset: the prior log-odds
Both give similar results.

Scalable Training
Fast implementation of training: a Hadoop implementation of logistic regression.
[Diagram: data → random splits → iterative Newton-Raphson per split → per-split mean and variance estimates → combine estimates → learned model parameters]
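A minimal sketch of the final combine step, assuming each split yields per-coefficient mean and variance estimates; precision (inverse-variance) weighting is one natural choice, not necessarily the deck's exact rule:

```python
import numpy as np

def combine_split_estimates(means, variances):
    """Combine per-split coefficient estimates by inverse-variance weighting.

    means, variances: lists of arrays, one pair per random data split
    (an assumed combination rule; the deck only says 'combine estimates').
    """
    precisions = [1.0 / v for v in np.asarray(variances)]
    weighted = sum(p * m for p, m in zip(precisions, np.asarray(means)))
    return weighted / sum(precisions)

# Each mapper fits logistic regression by Newton-Raphson on its split; the
# driver then combines: beta = combine_split_estimates(split_means, split_vars)
```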

Outline 3: Fast Evaluation at Serve Time

Efficient Score Evaluation
- Problem: for a page visit, select the top-n ads using the scoring formula. Why hard: only a few ms, and too many ads to evaluate
- Rich IR literature on this problem: efficient solutions for vector space models through "posting lists"; the interaction terms in our regression model are motivated by this
- Document-at-a-time (DAAT) strategy: posting lists hold sorted doc IDs for each query term; evaluate each doc containing at least one query term, one at a time, stopping prematurely if it is clear the doc can't make the top n
- Our system is sparse with few correlations; efficiency comes through approximations

Efficient evaluation through a two-stage procedure (Broder et al.)
- Maintain a heap of the current top-n ads; let θ = the minimum score in the heap
- Approximate first stage: with indicators x_i for term i present in the doc and per-term upper bounds U_i, score a doc fully only if x_1·U_1 + x_2·U_2 + x_3·U_3 + x_4·U_4 > θ
- The WAND iterator traverses posting lists very efficiently by skipping unnecessary docs; efficiency depends on the upper bounds for the terms
[Figure: posting-list cursors over doc IDs; e.g., at CurrDoc = 1 the doc is scored when U_1 + U_2 + U_3 > θ but skipped when U_1 + U_2 ≤ θ]

Efficient Testing: WAND
Example: query = "Red Ball". Word posting lists: "Red" → Ad 1, Ad 5, Ad 8; "Ball" → Ad 7, Ad 8, Ad 9. Cursors skip ahead past ads that cannot contain both terms; candidate result = Ad 8.
More generally, queries are weighted: compute upper bounds on the score to decide skips (a sketch follows).
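A simplified sketch of the classic WAND iterator (Broder et al.), not the paper's production system; all names are hypothetical, and score_fn is assumed to satisfy the WAND contract that a doc's score never exceeds the sum of its terms' upper bounds:

```python
import heapq

def wand_top_n(query_terms, postings, max_term_score, score_fn, n=3):
    """Simplified WAND: skip docs whose best-case score can't beat the heap minimum.

    postings: word -> sorted list of doc ids; max_term_score: word -> upper
    bound U_w on that word's score contribution; score_fn(doc, words) -> full score.
    """
    cursors = {w: 0 for w in query_terms}  # position in each posting list
    heap = []                              # min-heap of (score, doc)
    theta = float("-inf")
    while True:
        # Terms ordered by the doc id under each cursor (exhausted lists drop out).
        live = sorted((postings[w][cursors[w]], w) for w in cursors
                      if cursors[w] < len(postings[w]))
        # Pivot: first doc where accumulated upper bounds exceed theta.
        acc, pivot_doc = 0.0, None
        for doc, w in live:
            acc += max_term_score[w]
            if acc > theta:
                pivot_doc = doc
                break
        if pivot_doc is None:
            return sorted(heap, reverse=True)  # no remaining doc can beat theta
        if live[0][0] == pivot_doc:
            # All cursors up to the pivot sit on the pivot doc: fully score it.
            words = [w for doc, w in live if doc == pivot_doc]
            s = score_fn(pivot_doc, words)
            if len(heap) < n:
                heapq.heappush(heap, (s, pivot_doc))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, pivot_doc))
            theta = heap[0][0] if len(heap) == n else float("-inf")
            for doc, w in live:
                if doc == pivot_doc:
                    cursors[w] += 1  # advance past the scored doc
        else:
            # Skip: jump cursors before the pivot directly to the pivot doc.
            for doc, w in live:
                if doc >= pivot_doc:
                    break
                p = postings[w]
                while cursors[w] < len(p) and p[cursors[w]] < pivot_doc:
                    cursors[w] += 1

# Toy run matching the slide (hypothetical scores): counts matching terms.
postings = {"red": [1, 5, 8], "ball": [7, 8, 9]}
U = {"red": 1.0, "ball": 1.0}
score = lambda doc, words: float(len(words))
print(wand_top_n(["red", "ball"], postings, U, score, n=2))  # Ad 8 ranks first
```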

Efficiency of the Procedure
- Efficiency comes from document skipping, so upper bounds must be computable quickly
- The match scoring formula should not use arbitrary features ("word X in query AND word Y in ad"): such pairwise ("cross-product") checks may get costly — large posting lists, too many evaluations
- Upper bounds are crucial to performance: too large → false positives; too small → false negatives. We use upper bounds recommended in the literature; a more efficient implementation is the subject of future research

System Architecture: Scoring at Serve Time
Fast implementation of testing, by building the posting lists offline:
- The ad main effects are used in the (static) ordering of ads within posting lists
- The interaction effects are used to modify the (static) idf table of words
- The page main effect plays no role in ad serving, since the page is given
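A minimal sketch of how the learned parameters might be folded into static index structures, per the bullets above; the structure and all names (beta_ad, gamma_word, the additive idf adjustment) are assumptions:

```python
def build_index(ads, beta_ad, gamma_word, idf):
    """Fold learned parameters into static index structures (offline step).

    ads: ad_id -> list of words; beta_ad: ad main-effect score per ad;
    gamma_word: interaction coefficient per word; idf: base word weights.
    All hypothetical names; the deck does not give the exact data layout.
    """
    # Interaction effects modify the static per-word weight table.
    word_weight = {w: idf[w] + gamma_word.get(w, 0.0) for w in idf}
    # Ad main effects determine a static ordering of ads in each posting list.
    postings = {}
    for ad_id, words in ads.items():
        for w in set(words):
            postings.setdefault(w, []).append(ad_id)
    for w in postings:
        postings[w].sort(key=lambda a: beta_ad.get(a, 0.0), reverse=True)
    return word_weight, postings
```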

Outline 4: Experiments and Results; Summary and Ongoing Work

Experiments
[Figure: precision-recall curves; 25% lift in precision at 10% recall, in the low-recall region]
- Computed precision-recall over several data splits; the results are statistically significant

Experiments (contd.)
- Increasing the number of words from 1000 to 3400 led to only marginal improvement: diminishing returns; the system already performs close to its limit without needing more training
- Changing the training time period changes the word list, so we update our posting lists periodically

Summary
- Matching ads to pages is a challenging problem
- We provide an approach that blends semantic similarity and predictive models in a scalable fashion
- Our approach is index friendly
- Experimental results on a large-scale system show significant improvement; since we fall back on the semantic score, we can only improve on the relevance models

Ongoing Work
- Changes in training data change the word set; working on more robust word feature selection (e.g., clustering words)
- Efficient indexing strategies through better upper-bound estimates for WAND
- Expanding feature sets to include neighborhoods of words in posting lists: a balance between accuracy and WAND efficiency
- Isotonic regression on the cosine similarity