1
Contextual Advertising by Combining Relevance with Click Feedback
Deepak Agarwal, joint work with Deepayan Chakrabarti & Vanja Josifovski
Yahoo! Research
WWW'08, Beijing, China, 24th April 2008
2
Outline
- Motivating application, challenges
- Contextual advertising: semantic versus predictive models; pros and cons
- Our approach: blend semantic with predictive
- Model description: logistic regression, feature selection
- Model structure amenable to fast scoring at run time
- Experimental results
- Ongoing work
3
Outline 1: Motivating Application, Background and Challenges
4
Motivating Application
Problem: match ads to queries.
- Sponsored search: the query is a short piece of text input by the user. User intent is better expressed; less noisy.
- Contextual advertising: the query is a webpage. Generally long and noisy, with user intent less clear. A harder matching problem.
5
Challenges
- Serve ads to maximize revenue (CTR): serve the most relevant ads in a given context, using user feedback in the form of clicks in different contexts.
- Automation is a must for profitability: billions of opportunities, millions of ads. High volume and low marginal cost → a lucrative business. Automation through algorithms/models.
- Accuracy: massive data; scalable procedures.
- Structure of models: scoring ads under strict latency requirements (~a few ms).
6
Classical Approach: Semantic
Serve shoe ads on shoe pages.
Models: information retrieval. Get relevant docs (ads) for a query (webpage).
Simple vector space model: q = (t_1, w_1; …; t_n, w_n); a = (t_1, v_1; …; t_m, v_m)
Cos(q, a) = Σ_{t ∈ q ∩ a} w_t · v_t / (|q| |a|)
The w's and v's are tf-idf weights: reward frequency in the document, penalize frequency in the corpus. Higher score → more relevance.
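As a hedged illustration (not from the paper), the vector-space score above might be computed as follows; the word lists, document frequencies, and function names are invented for the example:

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """tf-idf weights: reward frequency in the document, penalize frequency in the corpus."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(num_docs / (1 + doc_freq.get(t, 0))) for t in tf}

def cosine(q, a):
    """Cos(q, a) = sum of w_t * v_t over shared terms t, normalized by vector lengths."""
    dot = sum(q[t] * a[t] for t in set(q) & set(a))
    norm = math.sqrt(sum(w * w for w in q.values())) * math.sqrt(sum(v * v for v in a.values()))
    return dot / norm if norm else 0.0

# Toy usage: score one ad against one page.
doc_freq = {"shoes": 40, "running": 25, "sale": 60}  # hypothetical corpus counts
page = tfidf_vector("running shoes for trail running".split(), doc_freq, num_docs=1000)
ad = tfidf_vector("running shoes sale".split(), doc_freq, num_docs=1000)
print(cosine(page, ad))
```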
7
Semantic: Pros & Cons
Pros:
- Training is simple and scalable: vocabulary (stop-words, stemming) and corpus statistics.
- Serving with low latency: evaluates millions of candidate ads in a few ms, via clever algorithms (Broder et al.).
Cons:
- Does not always capture context.
Could clicks do better? They are active user feedback. Can we use them?
8
Predictive Approach: Clicks
A new and challenging research area: learn from historic clicks on ads, an indicator of overall relevance. Rank ads by CTR = P(Click | Ad, Context).
Estimating CTR is a difficult statistical problem:
- High dimensionality and sparseness (too many combinations): map (Page, Ad) to (Page features, Ad features).
- Bias-variance tradeoff when selecting features: coarse features are stable but less precise; fine features have high variance.
9
Statistical Challenges (contd.)
- Retrospective data is biased: if we never showed ads containing the word "Rolex" on pages containing the word "Golf", how will we learn this match?
- What is irrelevant? Labeling negatives is hard: some users never click on ads, no matter what.
- Good models may be complex: scalability while training (grid computing helps).
- Serving: not all models are index friendly; quick evaluation at serve time improves the system.
10
When Semantic meets Predictive
- Semantic provides domain knowledge: feature selection driven by semantic knowledge.
- Predictive "enhances" semantic: "correction" terms adjust the semantic score to match click feedback; fall back on semantic when the click signal is weak.
- The model is scalable to train (grid computing) and fast to evaluate at run time. Faster → more candidates evaluated at serve time: accuracy versus coverage.
11
Outline 2: Modeling Approach
12
Predictive Regression Model
Region-specific splitting for page and ad:
- Page "regions": title, headers, boldface text, metadata, etc.
- Ad "regions": title, body, etc.
Features: words, phrases, and classes in different regions. Word matches in the title are more important than in the body.
Illustration here: word features and title regions. Extension to multiple regions and multiple feature types is routine; experiments to appear in a future version.
13
Logistic Regression: Word Features
Model clicks/non-clicks with logistic regression. Training and test data: events with clicks only.
y_ij ~ Ber(p_ij), where p_ij is the CTR of ad j on page i:
logit(p_ij) = μ + Σ_w α_w M_{p,w} + Σ_w β_w M_{a,w} + Σ_w γ_w I_{p,a,w}
- Main effect for page (overall popularity)
- Main effect for ad (overall popularity)
- Interaction effect (words shared by page and ad)
Gaussian priors on the model parameters penalize sparse features.
14
Feature weights "correct" relevance
M_{p,w} = tf_{p,w} · 1(w ∈ p)
M_{a,w} = tf_{a,w} · 1(w ∈ a)
I_{p,a,w} = tf_{p,w} · tf_{a,w} · 1(w ∈ p) · 1(w ∈ a)
So IR-based term-frequency measures are taken into account.
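A minimal sketch of how these features might be assembled and fit, assuming scikit-learn's LogisticRegression as a stand-in for the paper's Hadoop trainer (its L2 penalty plays the role of the Gaussian prior); the vocabulary and events are toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

VOCAB = ["shoes", "golf", "rolex"]  # illustrative top-k word list

def features(page_tf, ad_tf):
    """Main effects M_{p,w}, M_{a,w} and interactions I_{p,a,w} for each selected word."""
    m_p = [page_tf.get(w, 0) for w in VOCAB]
    m_a = [ad_tf.get(w, 0) for w in VOCAB]
    i_pa = [page_tf.get(w, 0) * ad_tf.get(w, 0) for w in VOCAB]
    return m_p + m_a + i_pa

# Toy training set: (page term frequencies, ad term frequencies, click).
events = [
    ({"shoes": 2}, {"shoes": 1}, 1),
    ({"golf": 3}, {"rolex": 1}, 1),
    ({"shoes": 1}, {"golf": 2}, 0),
    ({"golf": 1}, {"shoes": 1}, 0),
]
X = np.array([features(p, a) for p, a, _ in events])
y = np.array([c for _, _, c in events])

# L2-regularized logistic regression ~ Gaussian prior on the weights.
model = LogisticRegression(C=1.0).fit(X, y)
print(model.coef_)
```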
15
How to select words?
There are nearly 110k words in our training data, even after stop-word removal and stemming. Learning parameters for each word would be expensive and would overfit. We use simple feature selection strategies: select the top k words.
16
Word Selection: data based
Define an interaction measure for each word: higher values for words which have higher-than-expected CTR when they occur on both page and ad. Remove words served or clicked too few times, for robustness.
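The slide does not give the exact measure; one plausible instantiation, as a hedged sketch, is the ratio of observed to expected CTR over events where the word occurs on both page and ad:

```python
from collections import defaultdict

def interaction_scores(events, overall_ctr, min_views=100):
    """Score each word by observed/expected CTR when it occurs on both page and ad.

    events: iterable of (page_words, ad_words, clicked) tuples, clicked in {0, 1}.
    Words with fewer than min_views co-occurring impressions are dropped for robustness.
    """
    views = defaultdict(int)
    clicks = defaultdict(int)
    for page_words, ad_words, clicked in events:
        for w in set(page_words) & set(ad_words):
            views[w] += 1
            clicks[w] += clicked
    return {
        w: (clicks[w] / views[w]) / overall_ctr
        for w in views
        if views[w] >= min_views
    }
```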
17
Word selection (contd.): relevance based
Compute the average tf-idf score of each word over pages and over ads; higher values imply higher relevance. Two variants: rank words by the geometric mean of their tf-idf on pages and on ads, or rank by tf-idf on pages and on ads separately and take the union.
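A small sketch of the geometric-mean variant, assuming precomputed average tf-idf dictionaries (the names are illustrative):

```python
import math

def top_words_by_relevance(avg_tfidf_page, avg_tfidf_ad, k=1000):
    """Rank words by the geometric mean of their average tf-idf on pages and on ads."""
    common = set(avg_tfidf_page) & set(avg_tfidf_ad)
    return sorted(common,
                  key=lambda w: math.sqrt(avg_tfidf_page[w] * avg_tfidf_ad[w]),
                  reverse=True)[:k]
```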
18
Best word selection scheme
Two word selection methods: data based and relevance based. We picked the top 1000 words by each measure. Data-based methods give better results.
[Precision-recall plot comparing the word-selection schemes]
19
Semantic similarity score
Word features have low coverage; we need a fallback mechanism to semantic similarity. How should the cosine score map onto the logit scale? Create score bins (100 points per bin) and plot each bin's mean cosine score against logit(CTR): the relationship is roughly quadratic.
[Plot: logit(p_ij) versus cosine score, with quadratic fit]
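A hedged sketch of the binning-and-fitting step, assuming NumPy arrays of per-event cosine scores and 0/1 click indicators (the paper's exact procedure may differ in details):

```python
import numpy as np

def fit_logit_vs_cosine(cosine_scores, clicks, bin_size=100):
    """Bin events by cosine score (bin_size events per bin), then fit
    logit(CTR) as a quadratic in each bin's mean cosine score."""
    order = np.argsort(cosine_scores)
    xs, ys = [], []
    for start in range(0, len(order) - bin_size + 1, bin_size):
        idx = order[start:start + bin_size]
        ctr = np.clip(clicks[idx].mean(), 1e-6, 1 - 1e-6)  # guard against logit(0), logit(1)
        xs.append(cosine_scores[idx].mean())
        ys.append(np.log(ctr / (1 - ctr)))
    return np.polyfit(xs, ys, deg=2)  # quadratic coefficients, highest degree first
```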
20
Incorporating similarity
The quadratic relationship is used in two ways:
- Put cosine and cosine² in as features.
- Add the fitted quadratic as an offset: prior log-odds.
Both give similar results.
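Both variants might look like this, using statsmodels' GLM (which supports a fixed offset) on synthetic data; everything here is illustrative rather than the paper's implementation:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
cos = rng.uniform(0, 1, n)                       # cosine similarity per event
X = rng.poisson(1.0, (n, 3)).astype(float)       # toy word-interaction features
logits = -3 + 2 * cos + 1.5 * cos**2 + X @ np.array([0.3, -0.2, 0.1])
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))   # synthetic clicks

# Variant 1: cosine and cosine^2 enter as ordinary features.
X1 = sm.add_constant(np.column_stack([X, cos, cos**2]))
m1 = sm.GLM(y, X1, family=sm.families.Binomial()).fit()

# Variant 2: a pre-fit quadratic q(cos) enters as a fixed offset (prior log-odds),
# so the regression only learns word-level corrections on top of it.
q = np.poly1d([1.5, 2.0, -3.0])                  # e.g. coefficients from the binned fit
m2 = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial(), offset=q(cos)).fit()
print(m1.params, m2.params, sep="\n")
```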
21
Scalable Training
Training: a Hadoop implementation of logistic regression. The data is randomly split; each partition is fit by iterative Newton-Raphson, producing mean and variance estimates of the parameters; the per-partition estimates are then combined into the learned model parameters.
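The slide does not spell out the combination rule; a common way to merge independent Gaussian estimates, offered here as a hedged sketch, is precision weighting:

```python
import numpy as np

def combine_partition_estimates(means, variances):
    """Combine per-partition estimates of each coefficient by precision weighting.

    means, variances: arrays of shape (num_partitions, num_params).
    (The paper's exact combination rule may differ; this is one standard choice.)
    """
    means = np.asarray(means)
    prec = 1.0 / np.asarray(variances)           # precision = inverse variance
    combined_var = 1.0 / prec.sum(axis=0)
    combined_mean = combined_var * (prec * means).sum(axis=0)
    return combined_mean, combined_var
```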
22
Outline 3: Fast Evaluation at Serve Time
23
Efficient Score Evaluation
Problem: for a page visit, select the top-n ads using the scoring formula. Why is this hard? Only a few ms are available, and there are too many ads to evaluate.
There is a rich IR literature on this problem, with efficient solutions for vector space models through "posting lists"; the interaction terms in our regression model are motivated by this.
Document-at-a-time (DAAT) strategy: posting lists are sorted doc IDs for each query term; evaluate each doc containing at least one query term, one at a time, and stop prematurely if it is clear the doc can't make it into the top n. The system is sparse with few correlations; efficiency comes through approximations.
24
Efficient evaluation through a two-stage procedure (Broder et al.): maintain a heap of the current top-n ads, with threshold θ = the minimum score in the heap. With per-term upper bounds U_1, …, U_4 and term indicators x_1, …, x_4, the approximate test is x_1·U_1 + x_2·U_2 + x_3·U_3 + x_4·U_4 > θ: a doc is fully scored only if the summed upper bounds of its matching terms can beat θ (e.g., U_1 + U_2 + U_3 > θ, while a doc matching only the first two terms with U_1 + U_2 ≤ θ is skipped). The WAND iterator traverses the posting lists very efficiently by skipping unnecessary docs; efficiency depends on the upper bounds for the terms.
25
Efficient Testing: WAND
Example: query = "Red Ball". Word posting lists: "Red" → Ad 1, Ad 5, Ad 8; "Ball" → Ad 7, Ad 8, Ad 9. The cursors skip over docs that cannot reach the threshold, landing on Ad 8, the first ad present in both lists: candidate result = Ad 8. More generally, queries are weighted; we compute upper bounds on the score to drive the skips.
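A minimal, hedged sketch of the WAND skipping logic; posting lists are plain sorted doc-ID lists, and a production implementation would also maintain the top-n heap that keeps raising θ:

```python
def wand_candidates(posting_lists, upper_bounds, theta):
    """Yield doc IDs whose summed term upper bounds can exceed theta;
    all other docs are skipped without full scoring.

    posting_lists: {term: sorted list of doc IDs}; upper_bounds: {term: float}.
    """
    cursors = {t: 0 for t in posting_lists}
    while True:
        # Terms ordered by the doc their cursor currently points at.
        live = sorted(
            (t for t in cursors if cursors[t] < len(posting_lists[t])),
            key=lambda t: posting_lists[t][cursors[t]],
        )
        if not live:
            return
        # Find the pivot: first prefix of terms whose upper bounds could beat theta.
        acc, pivot = 0.0, None
        for i, t in enumerate(live):
            acc += upper_bounds[t]
            if acc > theta:
                pivot = i
                break
        if pivot is None:
            return  # even all remaining terms together cannot beat theta
        pivot_doc = posting_lists[live[pivot]][cursors[live[pivot]]]
        first_doc = posting_lists[live[0]][cursors[live[0]]]
        if first_doc == pivot_doc:
            yield pivot_doc  # candidate for full scoring
            for t in live:
                if posting_lists[t][cursors[t]] == pivot_doc:
                    cursors[t] += 1
        else:
            # Skip cursors before the pivot directly to pivot_doc.
            for t in live[:pivot]:
                while (cursors[t] < len(posting_lists[t])
                       and posting_lists[t][cursors[t]] < pivot_doc):
                    cursors[t] += 1

# Slide example: only Ad 8 appears in both lists.
lists = {"red": [1, 5, 8], "ball": [7, 8, 9]}
print(list(wand_candidates(lists, {"red": 1.0, "ball": 1.0}, theta=1.5)))  # [8]
```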
26
Efficiency of the procedure
Efficiency comes through document skipping, so we must be able to compute upper bounds quickly. The match scoring formula should not use arbitrary features ("word X in query AND word Y in ad"): such pairwise ("cross-product") checks can get costly, with large posting lists and too many evaluations.
Upper bounds are crucial to performance: too large → false positives (docs scored needlessly); too small → false negatives (docs missed). We use the upper bounds recommended in the literature; a more efficient implementation is the subject of future research.
27
System Architecture: scoring at serve time
Building the posting lists:
- The main effect for ads is used in the ordering of ads in the posting lists (static).
- The interaction effect is used to modify the idf table of words (static).
- The main effect for pages does not play a role in ad serving (the page is given).
28
Outline 4: Experiments and Results, Summary and Ongoing Work
29
Experiments
[Precision-recall plot] 25% lift in precision at 10% recall.
30
Experiments
[Precision-recall plot, low-recall region] 25% lift in precision at 10% recall. We computed precision-recall curves for several data splits; the results are statistically significant.
31
Experiments
Increasing the number of words from 1000 to 3400 led to only marginal improvement: diminishing returns; the system already performs close to its limit without needing more training. Changing the training time period changes the word list, so we update our posting lists periodically.
32
Summary
Matching ads to pages is a challenging problem. We provide an approach that blends semantic similarity and predictive models in a scalable fashion; our approach is index friendly. Experimental results on a large-scale system show significant improvement. Because the model falls back to semantic relevance when the click signal is weak, it can only improve on the relevance-only models.
33
Ongoing Work
- Changes in the training data change the word set; we are working on more robust word feature selection, e.g., clustering words.
- Efficient indexing strategies through better upper-bound estimates for WAND.
- Expanding feature sets to include neighborhoods of words in posting lists: a balance between accuracy and WAND efficiency.
- Isotonic regression on cosine similarity.