1
Contextual Advertising by Combining Relevance with Click Feedback
Deepak Agarwal, joint work with Deepayan Chakrabarti & Vanja Josifovski
Yahoo! Research
WWW'08, Beijing, China, 24th April 2008
2
Outline
- Motivating application, challenges
- Contextual advertising: semantic versus predictive models; pros and cons
- Our approach: blend semantic with predictive
- Model description: logistic regression, feature selection
- Model structure amenable to fast scoring at run time
- Experimental results
- Ongoing work
3
Outline 1: Motivating Application, Background and Challenges
4
Motivating Application
Problem: match ads to queries.
- Sponsored search: the query is a short piece of text input by the user. User intent is better expressed; less noisy.
- Contextual advertising: the query is a webpage. Generally long and noisy, with user intent less clear. A harder matching problem.
5
Challenges
- Serve ads to maximize revenue (CTR): serve the most relevant ads in a given context, using user feedback in the form of clicks in different contexts.
- Automation is a must for profitability: billions of opportunities, millions of ads. High volume and low marginal cost → a lucrative business. Automation through algorithms/models.
- Accuracy: massive data; scalable procedures.
- Structure of models: scoring ads under strict latency requirements (~a few ms).
6
Classical Approach: Semantic
Serve shoe ads on shoe pages.
Models: information retrieval. Get relevant docs (ads) for a query (webpage).
Simple vector space model: q = (t_1, w_1; …; t_n, w_n); a = (t_1, v_1; …; t_m, v_m)
Cos(q, a) = Σ_{t ∈ q ∩ a} w_t · v_t / (|q| |a|)
The w's and v's are tf-idf weights: reward frequency in the document, penalize frequency in the corpus. Higher score → more relevance.
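As a hedged illustration (not from the paper), the vector-space score above might be computed as follows; the word lists, document frequencies, and function names are invented for the example:

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """tf-idf weights: reward frequency in the document, penalize frequency in the corpus."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(num_docs / (1 + doc_freq.get(t, 0))) for t in tf}

def cosine(q, a):
    """Cos(q, a) = sum of w_t * v_t over shared terms t, normalized by vector lengths."""
    dot = sum(q[t] * a[t] for t in set(q) & set(a))
    norm = math.sqrt(sum(w * w for w in q.values())) * math.sqrt(sum(v * v for v in a.values()))
    return dot / norm if norm else 0.0

# Toy usage: score one ad against one page.
doc_freq = {"shoes": 40, "running": 25, "sale": 60}  # hypothetical corpus counts
page = tfidf_vector("running shoes for trail running".split(), doc_freq, num_docs=1000)
ad = tfidf_vector("running shoes sale".split(), doc_freq, num_docs=1000)
print(cosine(page, ad))
```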
7
Semantic: Pros & Cons
Pros:
- Training is simple and scalable: vocabulary (stop-words, stemming) and corpus statistics.
- Serving with low latency: evaluates millions of candidate ads in a few ms, via clever algorithms (Broder et al.).
Cons:
- Does not always capture context.
Could clicks do better? They are active user feedback. Can we use them?
8
Predictive Approach: Clicks
A new and challenging research area: learn from historic clicks on ads, an indicator of overall relevance. Rank ads by CTR = P(Click | Ad, Context).
Estimating CTR is a difficult statistical problem:
- High dimensionality and sparseness (too many combinations): map (Page, Ad) to (Page features, Ad features).
- Bias-variance tradeoff when selecting features: coarse features are stable but less precise; fine features have high variance.
9
Statistical Challenges (contd.)
- Retrospective data is biased: if we never showed ads containing the word "Rolex" on pages containing the word "Golf", how will we learn this match?
- What is irrelevant? Labeling negatives is hard: some users never click on ads, no matter what.
- Good models may be complex: scalability while training (grid computing helps).
- Serving: not all models are index friendly; quick evaluation at serve time improves the system.
10
When Semantic meets Predictive
- Semantic provides domain knowledge: feature selection driven by semantic knowledge.
- Predictive "enhances" semantic: "correction" terms adjust the semantic score to match click feedback; fall back on semantic when the click signal is weak.
- The model is scalable to train (grid computing) and fast to evaluate at run time. Faster → more candidates evaluated at serve time: accuracy versus coverage.
11
Outline 2: Modeling Approach
12
Predictive Regression Model
Region-specific splitting for page and ad:
- Page "regions": title, headers, boldface text, metadata, etc.
- Ad "regions": title, body, etc.
Features: words, phrases, and classes in different regions. Word matches in the title are more important than in the body.
Illustration here: word features and title regions. Extension to multiple regions and multiple feature types is routine; experiments to appear in a future version.
13
Logistic Regression: Word Features
Model clicks/non-clicks with logistic regression. Training and test data: events with clicks only.
y_ij ~ Ber(p_ij), where p_ij is the CTR of ad j on page i:
logit(p_ij) = μ + Σ_w α_w M_{p,w} + Σ_w β_w M_{a,w} + Σ_w γ_w I_{p,a,w}
- Main effect for page (overall popularity)
- Main effect for ad (overall popularity)
- Interaction effect (words shared by page and ad)
Gaussian priors on the model parameters penalize sparse features.
14
Feature weights "correct" relevance
M_{p,w} = tf_{p,w} · 1(w ∈ p)
M_{a,w} = tf_{a,w} · 1(w ∈ a)
I_{p,a,w} = tf_{p,w} · tf_{a,w} · 1(w ∈ p) · 1(w ∈ a)
So IR-based term-frequency measures are taken into account.
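A minimal sketch of how these features might be assembled and fit, assuming scikit-learn's LogisticRegression as a stand-in for the paper's Hadoop trainer (its L2 penalty plays the role of the Gaussian prior); the vocabulary and events are toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

VOCAB = ["shoes", "golf", "rolex"]  # illustrative top-k word list

def features(page_tf, ad_tf):
    """Main effects M_{p,w}, M_{a,w} and interactions I_{p,a,w} for each selected word."""
    m_p = [page_tf.get(w, 0) for w in VOCAB]
    m_a = [ad_tf.get(w, 0) for w in VOCAB]
    i_pa = [page_tf.get(w, 0) * ad_tf.get(w, 0) for w in VOCAB]
    return m_p + m_a + i_pa

# Toy training set: (page term frequencies, ad term frequencies, click).
events = [
    ({"shoes": 2}, {"shoes": 1}, 1),
    ({"golf": 3}, {"rolex": 1}, 1),
    ({"shoes": 1}, {"golf": 2}, 0),
    ({"golf": 1}, {"shoes": 1}, 0),
]
X = np.array([features(p, a) for p, a, _ in events])
y = np.array([c for _, _, c in events])

# L2-regularized logistic regression ~ Gaussian prior on the weights.
model = LogisticRegression(C=1.0).fit(X, y)
print(model.coef_)
```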
15
How to select words?
There are nearly 110k words in our training data, even after stop-word removal and stemming. Learning parameters for each word would be expensive and would overfit. We use simple feature selection strategies: select the top k words.
16
Word Selection: data based
Define an interaction measure for each word: higher values for words which have higher-than-expected CTR when they occur on both page and ad. Remove words served or clicked too few times, for robustness.
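The slide does not give the exact measure; one plausible instantiation, as a hedged sketch, is the ratio of observed to expected CTR over events where the word occurs on both page and ad:

```python
from collections import defaultdict

def interaction_scores(events, overall_ctr, min_views=100):
    """Score each word by observed/expected CTR when it occurs on both page and ad.

    events: iterable of (page_words, ad_words, clicked) tuples, clicked in {0, 1}.
    Words with fewer than min_views co-occurring impressions are dropped for robustness.
    """
    views = defaultdict(int)
    clicks = defaultdict(int)
    for page_words, ad_words, clicked in events:
        for w in set(page_words) & set(ad_words):
            views[w] += 1
            clicks[w] += clicked
    return {
        w: (clicks[w] / views[w]) / overall_ctr
        for w in views
        if views[w] >= min_views
    }
```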
17
Word selection (contd.): relevance based
Compute the average tf-idf score of each word over pages and over ads; higher values imply higher relevance. Two variants: rank words by the geometric mean of their tf-idf on pages and on ads, or rank by tf-idf on pages and on ads separately and take the union.
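A small sketch of the geometric-mean variant, assuming precomputed average tf-idf dictionaries (the names are illustrative):

```python
import math

def top_words_by_relevance(avg_tfidf_page, avg_tfidf_ad, k=1000):
    """Rank words by the geometric mean of their average tf-idf on pages and on ads."""
    common = set(avg_tfidf_page) & set(avg_tfidf_ad)
    return sorted(common,
                  key=lambda w: math.sqrt(avg_tfidf_page[w] * avg_tfidf_ad[w]),
                  reverse=True)[:k]
```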
18
Best word selection scheme
Two word selection methods: data based and relevance based. We picked the top 1000 words by each measure. Data-based methods give better results.
[Precision-recall plot comparing the word-selection schemes]
19
Semantic similarity score
Word features have low coverage; we need a fallback mechanism to semantic similarity. How should the cosine score map onto the logit scale? Create score bins (100 points per bin) and plot each bin's mean cosine score against logit(CTR): the relationship is roughly quadratic.
[Plot: logit(p_ij) versus cosine score, with quadratic fit]
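A hedged sketch of the binning-and-fitting step, assuming NumPy arrays of per-event cosine scores and 0/1 click indicators (the paper's exact procedure may differ in details):

```python
import numpy as np

def fit_logit_vs_cosine(cosine_scores, clicks, bin_size=100):
    """Bin events by cosine score (bin_size events per bin), then fit
    logit(CTR) as a quadratic in each bin's mean cosine score."""
    order = np.argsort(cosine_scores)
    xs, ys = [], []
    for start in range(0, len(order) - bin_size + 1, bin_size):
        idx = order[start:start + bin_size]
        ctr = np.clip(clicks[idx].mean(), 1e-6, 1 - 1e-6)  # guard against logit(0), logit(1)
        xs.append(cosine_scores[idx].mean())
        ys.append(np.log(ctr / (1 - ctr)))
    return np.polyfit(xs, ys, deg=2)  # quadratic coefficients, highest degree first
```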
20
Incorporating similarity
The quadratic relationship is used in two ways:
- Put cosine and cosine² in as features.
- Add the fitted quadratic as an offset: prior log-odds.
Both give similar results.
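Both variants might look like this, using statsmodels' GLM (which supports a fixed offset) on synthetic data; everything here is illustrative rather than the paper's implementation:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
cos = rng.uniform(0, 1, n)                       # cosine similarity per event
X = rng.poisson(1.0, (n, 3)).astype(float)       # toy word-interaction features
logits = -3 + 2 * cos + 1.5 * cos**2 + X @ np.array([0.3, -0.2, 0.1])
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))   # synthetic clicks

# Variant 1: cosine and cosine^2 enter as ordinary features.
X1 = sm.add_constant(np.column_stack([X, cos, cos**2]))
m1 = sm.GLM(y, X1, family=sm.families.Binomial()).fit()

# Variant 2: a pre-fit quadratic q(cos) enters as a fixed offset (prior log-odds),
# so the regression only learns word-level corrections on top of it.
q = np.poly1d([1.5, 2.0, -3.0])                  # e.g. coefficients from the binned fit
m2 = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial(), offset=q(cos)).fit()
print(m1.params, m2.params, sep="\n")
```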
21
Scalable Training
Training: a Hadoop implementation of logistic regression. The data is randomly split; each partition is fit by iterative Newton-Raphson, producing mean and variance estimates of the parameters; the per-partition estimates are then combined into the learned model parameters.
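The slide does not spell out the combination rule; a common way to merge independent Gaussian estimates, offered here as a hedged sketch, is precision weighting:

```python
import numpy as np

def combine_partition_estimates(means, variances):
    """Combine per-partition estimates of each coefficient by precision weighting.

    means, variances: arrays of shape (num_partitions, num_params).
    (The paper's exact combination rule may differ; this is one standard choice.)
    """
    means = np.asarray(means)
    prec = 1.0 / np.asarray(variances)           # precision = inverse variance
    combined_var = 1.0 / prec.sum(axis=0)
    combined_mean = combined_var * (prec * means).sum(axis=0)
    return combined_mean, combined_var
```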
22
Outline 3: Fast Evaluation at Serve Time
23
Efficient Score Evaluation
Problem: for a page visit, select the top-n ads using the scoring formula. Why is this hard? Only a few ms are available, and there are too many ads to evaluate.
There is a rich IR literature on this problem, with efficient solutions for vector space models through "posting lists"; the interaction terms in our regression model are motivated by this.
Document-at-a-time (DAAT) strategy: posting lists are sorted doc IDs for each query term; evaluate each doc containing at least one query term, one at a time, and stop prematurely if it is clear the doc can't make it into the top n. The system is sparse with few correlations; efficiency comes through approximations.
24
Efficient evaluation through a two-stage procedure (Broder et al.): maintain a heap of the current top-n ads, with threshold θ = the minimum score in the heap. With per-term upper bounds U_1, …, U_4 and term indicators x_1, …, x_4, the approximate test is x_1·U_1 + x_2·U_2 + x_3·U_3 + x_4·U_4 > θ: a doc is fully scored only if the summed upper bounds of its matching terms can beat θ (e.g., U_1 + U_2 + U_3 > θ, while a doc matching only the first two terms with U_1 + U_2 ≤ θ is skipped). The WAND iterator traverses the posting lists very efficiently by skipping unnecessary docs; efficiency depends on the upper bounds for the terms.
25
Efficient Testing: WAND
Example: query = "Red Ball". Word posting lists: "Red" → Ad 1, Ad 5, Ad 8; "Ball" → Ad 7, Ad 8, Ad 9. The cursors skip over docs that cannot reach the threshold, landing on Ad 8, the first ad present in both lists: candidate result = Ad 8. More generally, queries are weighted; we compute upper bounds on the score to drive the skips.
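A minimal, hedged sketch of the WAND skipping logic; posting lists are plain sorted doc-ID lists, and a production implementation would also maintain the top-n heap that keeps raising θ:

```python
def wand_candidates(posting_lists, upper_bounds, theta):
    """Yield doc IDs whose summed term upper bounds can exceed theta;
    all other docs are skipped without full scoring.

    posting_lists: {term: sorted list of doc IDs}; upper_bounds: {term: float}.
    """
    cursors = {t: 0 for t in posting_lists}
    while True:
        # Terms ordered by the doc their cursor currently points at.
        live = sorted(
            (t for t in cursors if cursors[t] < len(posting_lists[t])),
            key=lambda t: posting_lists[t][cursors[t]],
        )
        if not live:
            return
        # Find the pivot: first prefix of terms whose upper bounds could beat theta.
        acc, pivot = 0.0, None
        for i, t in enumerate(live):
            acc += upper_bounds[t]
            if acc > theta:
                pivot = i
                break
        if pivot is None:
            return  # even all remaining terms together cannot beat theta
        pivot_doc = posting_lists[live[pivot]][cursors[live[pivot]]]
        first_doc = posting_lists[live[0]][cursors[live[0]]]
        if first_doc == pivot_doc:
            yield pivot_doc  # candidate for full scoring
            for t in live:
                if posting_lists[t][cursors[t]] == pivot_doc:
                    cursors[t] += 1
        else:
            # Skip cursors before the pivot directly to pivot_doc.
            for t in live[:pivot]:
                while (cursors[t] < len(posting_lists[t])
                       and posting_lists[t][cursors[t]] < pivot_doc):
                    cursors[t] += 1

# Slide example: only Ad 8 appears in both lists.
lists = {"red": [1, 5, 8], "ball": [7, 8, 9]}
print(list(wand_candidates(lists, {"red": 1.0, "ball": 1.0}, theta=1.5)))  # [8]
```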
26
Efficiency of the procedure
Efficiency comes through document skipping, so we must be able to compute upper bounds quickly. The match scoring formula should not use arbitrary features ("word X in query AND word Y in ad"): such pairwise ("cross-product") checks can get costly, with large posting lists and too many evaluations.
Upper bounds are crucial to performance: too large → false positives (docs scored needlessly); too small → false negatives (docs missed). We use the upper bounds recommended in the literature; a more efficient implementation is the subject of future research.
27
System Architecture: scoring at serve time
Building the posting lists:
- The main effect for ads is used in the ordering of ads in the posting lists (static).
- The interaction effect is used to modify the idf table of words (static).
- The main effect for pages does not play a role in ad serving (the page is given).
28
Outline 4: Experiments and Results, Summary and Ongoing Work
29
Experiments
[Precision-recall plot] 25% lift in precision at 10% recall.
30
Experiments
[Precision-recall plot, low-recall region] 25% lift in precision at 10% recall. We computed precision-recall curves for several data splits; the results are statistically significant.
31
Experiments
Increasing the number of words from 1000 to 3400 led to only marginal improvement: diminishing returns; the system already performs close to its limit without needing more training. Changing the training time period changes the word list, so we update our posting lists periodically.
32
Summary
Matching ads to pages is a challenging problem. We provide an approach that blends semantic similarity and predictive models in a scalable fashion; our approach is index friendly. Experimental results on a large-scale system show significant improvement. Because the model falls back to semantic relevance when the click signal is weak, it can only improve on the relevance-only models.
33
Ongoing Work
- Changes in the training data change the word set; we are working on more robust word feature selection, e.g., clustering words.
- Efficient indexing strategies through better upper-bound estimates for WAND.
- Expanding feature sets to include neighborhoods of words in posting lists: a balance between accuracy and WAND efficiency.
- Isotonic regression on cosine similarity.