Challenges in Computational Advertising. Deepayan Chakrabarti (deepay@yahoo-inc.com)
Online Advertising Overview. Advertisers submit ads to an ad network, which picks ads to show to the user alongside content from a content provider. Examples: Yahoo, Google, MSN, RightMedia, ...
Advertising Settings: Display, Content Match, and Sponsored Search. In each setting, the ad network must pick ads.
Display: graphical display ads, mostly for brand awareness. Revenue is based on the number of impressions, not clicks.
Content Match: text ads, matched to the content of the page.
Content Match challenges: the user intent is unclear, the "query" (the webpage) is long and noisy, and revenue depends on the number of clicks.
Sponsored Search: ads matched to the user's search query.
This presentation. 1) Content Match [KDD 2007]: how can we estimate the click-through rate (CTR) of an ad on a page, with ~10^6 ads and ~10^9 pages (CTR for ad j on page i)?
This presentation. 1) Estimating CTR for Content Match [KDD '07]. 2) Traffic Shaping for Display Advertising [EC '12]: the user clicks one of several alternate article summaries and display ads are shown on the article page; we recommend articles (not ads) that need high CTR on their summaries, while preferring articles on which under-delivering ads can be shown.
This presentation. 3) Theoretical underpinnings [COLT '10 best student paper]: represent relationships as a graph, so that recommendation = link prediction (e.g., the goal of suggesting friends). Many useful heuristics exist; why do these heuristics work?
Estimating CTR for Content Match. Contextual advertising shows an ad on a webpage (an "impression"); revenue is generated if a user clicks. Problem: estimate the click-through rate (CTR) of an ad on a page, with ~10^6 ads and ~10^9 pages (CTR for ad j on page i).
Estimating CTR for Content Match. Why not use the MLE (observed clicks c over impressions N)? 1. Few (page, ad) pairs have N > 0. 2. Very few have c > 0 as well. 3. The MLE does not differentiate between 0/10 and 0/100. We have additional information: hierarchies.
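A minimal sketch of these failure modes, with hypothetical counts: the MLE assigns the same estimate (zero) to a region with 10 impressions and one with 100, and gives no estimate at all when N = 0.

```python
# Toy illustration of the MLE's failure on sparse click data.
# The counts below are hypothetical, not from the talk's dataset.
def mle_ctr(clicks, impressions):
    """MLE of CTR: observed clicks / observed impressions."""
    return clicks / impressions if impressions > 0 else None

print(mle_ctr(0, 10), mle_ctr(0, 100))  # 0.0 0.0 -- indistinguishable
print(mle_ctr(0, 0))                    # None -- most (page, ad) pairs look like this
```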
Estimating CTR for Content Match. Use an existing, well-understood hierarchy, and categorize ads and webpages to leaves of the hierarchy. CTR estimates of siblings are correlated, and the hierarchy allows us to aggregate data: coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions.
Estimating CTR for Content Match. Region hierarchy: a region = (page node, ad node), and the region hierarchy is the cross-product of the page hierarchy and the ad hierarchy, with level 0 at the root.
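The cross-product construction can be sketched directly; the hierarchies and node names below are hypothetical.

```python
from itertools import product

# Toy page and ad hierarchies, keyed by level (0 = root); names are made up.
page_tree = {0: ["page-root"], 1: ["news", "sports"],
             2: ["politics", "tech", "soccer", "tennis"]}
ad_tree = {0: ["ad-root"], 1: ["auto", "finance"],
           2: ["cars", "parts", "loans", "insurance"]}

# A region at level i is a (page node, ad node) pair at that level, so the
# region hierarchy is the per-level cross-product of the two hierarchies.
regions = {lvl: list(product(page_tree[lvl], ad_tree[lvl])) for lvl in page_tree}

print(len(regions[0]), len(regions[1]), len(regions[2]))  # 1 4 16
```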
Estimating CTR for Content Match. Our approach: data transformation, then the model, then model fitting.
Data Transformation. Problem: raw per-region click rates are noisy, and their variance depends on the mean. Solution: the Freeman-Tukey transform, y_r = sqrt(c_r/N_r) + sqrt((c_r+1)/N_r), which differentiates regions with 0 clicks and stabilizes variance: Var(y_r) is roughly proportional to 1/N_r.
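A sketch of the transform as read from the slide's labels (the exact form, y_r = sqrt(c/N) + sqrt((c+1)/N) with N * Var(y_r) roughly constant, is a reconstruction and should be treated as an assumption): zero-click regions with different impression counts now get different values, and a small simulation checks the variance stabilization.

```python
import math
import random

def freeman_tukey(c, n):
    # Reconstructed form of the Freeman-Tukey transform for proportions.
    return math.sqrt(c / n) + math.sqrt((c + 1) / n)

# Unlike the MLE, 0/10 and 0/100 are now differentiated:
print(freeman_tukey(0, 10), freeman_tukey(0, 100))

# Variance stabilization: N * Var(y_r) stays roughly constant across N.
random.seed(0)
def n_times_var(p, n, trials=2000):
    ys = [freeman_tukey(sum(random.random() < p for _ in range(n)), n)
          for _ in range(trials)]
    mean = sum(ys) / trials
    return n * sum((y - mean) ** 2 for y in ys) / trials

print(round(n_times_var(0.05, 200), 2), round(n_times_var(0.05, 800), 2))
```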
Model. Goal: smoothing across siblings in the hierarchy [Huang & Cressie, 2000]. 1. Each region r has a latent state S_r. 2. The observable y_r is independent of the hierarchy given S_r. 3. S_r is drawn from its parent's state S_pa(r).
Model (state-space form): each region's observation y_r is a noisy version of its latent state S_r, with observation variance V_r; and S_r is drawn from its parent's state S_pa(r), with covariates u_r, regression coefficients β_r, and state-transition variance W_r (likewise for the parent, with V_pa(r), W_pa(r), β_pa(r), u_pa(r)).
Model. However, learning W_r, V_r, and β_r for each region is clearly infeasible. Assumptions: all regions at the same level ℓ share the same W(ℓ) and β(ℓ), and V_r = V/N_r for some constant V, since the Freeman-Tukey transform gives Var(y_r) proportional to 1/N_r.
Model. Implications: W determines the degree of smoothing. Large W: S_r varies greatly from S_pa(r), each region learns its own S_r, and there is no smoothing. W near 0: all S_r are identical, a regression model on the features u_r is learnt, and smoothing is maximal.
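The role of W can be seen in a one-node Gaussian shrinkage sketch (an illustration of the smoothing knob, not the paper's full tree recursion; all numbers are hypothetical).

```python
def posterior_state(y_r, s_parent, V_r, W):
    # Precision-weighted combination of the region's own (transformed) data
    # (y_r, observation variance V_r) and its parent's state (s_parent,
    # prior variance W).
    precision = 1.0 / V_r + 1.0 / W
    return (y_r / V_r + s_parent / W) / precision

# Large W: S_r is free to differ from its parent -- no smoothing.
print(posterior_state(0.3, 0.1, 0.05, 1e9))   # ~0.3: keeps its own estimate
# W near 0: S_r collapses onto the parent -- maximum smoothing.
print(posterior_state(0.3, 0.1, 0.05, 1e-9))  # ~0.1: the parent's state
```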
Model. Implications: W determines the degree of smoothing, and Var(S_r) increases from root to leaf, giving better estimates at coarser resolutions.
Model. Implications: correlations among siblings at level ℓ depend only on the level of their least common ancestor, so siblings with a deeper common ancestor are more strongly correlated.
Estimating CTR for Content Match. Our approach: data transformation (Freeman-Tukey), model (tree-structured Markov chain), model fitting.
Model Fitting. Fitting uses a Kalman filtering algorithm. Filtering: recursively aggregate data from leaves to root. Smoothing: propagate information from root to leaves. Complexity is linear in the number of regions, in both time and space. The Kalman filter requires knowledge of β, V, and W, so EM is wrapped around the Kalman filter.
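The leaves-to-root / root-to-leaves structure can be sketched with a simple empirical-Bayes-style two-pass routine (this is not the paper's exact Kalman recursions; the tree, counts, and shrinkage strength are hypothetical).

```python
# Linear-time two-pass sketch over a hierarchy: aggregate data leaves-to-root,
# then push smoothed estimates root-to-leaves.
tree = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1", "B2"]}
clicks = {"A1": 0, "A2": 3, "B1": 1, "B2": 0}
imps = {"A1": 50, "A2": 1000, "B1": 400, "B2": 20}
smoothed = {}

def upward(node):
    # "Filtering" pass: aggregate clicks and impressions from leaves to root.
    for child in tree.get(node, []):
        c, n = upward(child)
        clicks[node] = clicks.get(node, 0) + c
        imps[node] = imps.get(node, 0) + n
    return clicks[node], imps[node]

def downward(node, parent_rate, strength=100.0):
    # "Smoothing" pass: shrink each node's MLE toward its parent's smoothed
    # rate; nodes with little data borrow more from the parent.
    rate = (clicks[node] + strength * parent_rate) / (imps[node] + strength)
    smoothed[node] = rate
    for child in tree.get(node, []):
        downward(child, rate)

c, n = upward("root")
downward("root", c / n)
# Zero-click leaves now get small but distinct positive estimates:
print(round(smoothed["A1"], 5), round(smoothed["B2"], 5))
```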
Experiments. 503M impressions; a 7-level hierarchy, of which the top 3 levels were used. Zero clicks in 76% of regions at level 2 and in 95% of regions at level 3. Full dataset DFULL, and a 2/3 sample DSAMPLE.
Experiments. Estimate CTRs for all regions R at level 3 with zero clicks in DSAMPLE. Some of these regions (call them R>0) get clicks in DFULL. A good model should predict higher CTRs for R>0 than for the other regions in R.
Experiments. We compared 4 models: TS, our tree-structured model; LM (level-mean), with each level smoothed independently; NS (no smoothing), with CTR proportional to 1/N_r; and Random, which, assuming |R>0| is given, randomly predicts the membership of R>0 out of R.
Experiments. [Figure: lift curves for TS, Random, LM, and NS.]
Experiments. The MLE is 0 everywhere, since 0 clicks were observed. What about the estimated CTRs? [Plots of estimated CTR vs. impressions for NS and TS: TS shows variability inherited from coarser resolutions and is close to the MLE for large N.]
Estimating CTR for Content Match. We presented a method to estimate the rates of extremely rare events at multiple resolutions under severe sparsity. Key points: a tree-structured generative model and extremely fast parameter fitting.
Traffic Shaping. 1) Estimating CTR for Content Match [KDD '07]. 2) Traffic Shaping for Display Advertising [EC '12]. 3) Theoretical underpinnings [COLT '10 best student paper].
Traffic Shaping. Which article summary should be picked from the article pool? The one with the highest expected CTR. Which ad should be displayed? The ad that minimizes underdelivery.
Underdelivery. Advertisers are guaranteed some impressions (say, 1M) over some time period (say, 2 months), only to users matching their specs, only when they visit certain types of pages, and only on certain positions on the page. An underdelivering ad is one that is likely to miss its guarantee.
Underdelivery. How can underdelivery be computed? It requires user traffic forecasts, and it depends on the other ads in the system. Form a bipartite graph between forecasted impressions ℓ = (user, article, position), each with supply s_ℓ, and the ad inventory, with demand d_j for each ad j; an ad-serving system will try to minimize under-delivery on this graph.
Traffic Shaping. Which article summary should be picked? The one with the highest expected CTR. Which ad should be displayed? The ad that minimizes underdelivery. Goal: combine the two. Bias the article summary selection to reduce under-delivery with an insignificant drop in CTR, and do this in real time.
Outline: formulation as an optimization problem; real-time solution; empirical results.
Formulation. Nodes: k (user); i (user, article); ℓ (user, article, position), a "fully qualified impression"; and j (ads). Edges carry supply s_k, CTR c_ki, traffic-shaping fraction w_ki, demand d_j, and ad-delivery fraction φ_ℓj. Goal: infer the traffic-shaping fractions w_ki.
Formulation. Full traffic-shaping graph: all forecasted user traffic crossed with all available articles, whether users arrive at the homepage or directly on an article page. Goal: infer w_ki, but we are forced to infer φ_ℓj as well.
Formulation. Underdelivery of ad j = max(0, d_j minus the total user traffic flowing to j, accounting for CTR loss), where the traffic reaching j is built from terms of the form s_k · w_ki · c_ki · φ_ℓj; the demand constraints require this flow to satisfy the demand d_j.
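As a concrete (hypothetical) instance of the underdelivery term: the traffic reaching an ad is the product of the factors along each supply path, and underdelivery is the shortfall against demand. All numbers below are made up.

```python
def underdelivery(demand_j, delivered_j):
    # max(0, d_j - total traffic flowing to j)
    return max(0.0, demand_j - delivered_j)

# Two supply paths can show adA; each path contributes s_k * w_ki * c_ki,
# scaled by the ad-delivery fraction phi for that path.
s_times_w_times_c = {"ell1": 100.0, "ell2": 50.0}
phi = {"ell1": 0.5, "ell2": 0.25}

delivered_to_A = sum(s_times_w_times_c[l] * phi[l] for l in phi)  # 50 + 12.5
print(underdelivery(200.0, delivered_to_A))  # 137.5: adA misses its demand
print(underdelivery(50.0, delivered_to_A))   # 0.0: demand fully met
```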
Formulation. Constraints: bounds on the traffic-shaping fractions; shape only the available traffic; satisfy the demand constraints; and keep the ad-delivery fractions valid.
Key Transformation. Define z_ℓj = the fraction of supply that is shown ad j, assuming the user always clicks the article. This allows a reformulation solely in terms of the new variables z_ℓj.
Formulation. The resulting convex program can be solved optimally.
Formulation. But we have another problem: at runtime, we must shape every incoming user without looking at the entire graph. Solution: periodically solve the convex problem offline, store a cache derived from this solution, and reconstruct the optimal solution for each user at runtime using only the cache.
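The offline/online split can be sketched as follows. The per-ad dual prices alpha_j, CTRs, and the additive scoring rule are all hypothetical illustrations of "reconstruct from a cached solution", not the paper's exact KKT-based reconstruction.

```python
# Offline: solve the convex shaping problem periodically and cache a small
# summary, here hypothetical per-ad dual prices alpha_j (high alpha = the ad
# is under-delivering). Online: score candidate articles from the cache alone.
cached_alpha = {"adA": 0.8, "adB": 0.1}            # cached duals (made up)
eligible_ads = {"art1": ["adB"], "art2": ["adA"]}  # ads servable per article
ctr = {"art1": 0.050, "art2": 0.048}               # expected summary CTRs

def pick_article(user_articles):
    def score(a):
        # High-CTR articles are preferred, but articles that can serve
        # under-delivering (high-dual) ads get a boost.
        return ctr[a] + 0.01 * sum(cached_alpha[j] for j in eligible_ads[a])
    return max(user_articles, key=score)

# art2 has slightly lower CTR but can serve the under-delivering adA:
print(pick_article(["art1", "art2"]))  # art2
```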
Outline: formulation as an optimization problem; real-time solution; empirical results.
Real-time solution. Cache the dual variables of the offline solution and reconstruct the primal solution from them; all constraints can be expressed as constraints on σ_ℓ.
Real-time solution, from the KKT conditions: (1) the shape of Σ_j z_ℓj as a function of σ_ℓ depends on the cached duals α_j; (2) σ_ℓ = 0 unless Σ_j z_ℓj = max_ℓ Σ_j z_ℓj; (3) Σ_ℓ σ_ℓ is constant for all i connected to k. Algorithm: initialize σ_ℓ = 0; compute Σ_j z_ℓj from (1); if the constraints (with bounds L_i and U_i) are unsatisfied, increase σ_ℓ while satisfying (2) and (3); repeat; finally, extract w_ki from z_ℓj.
Results. Data: historical traffic logs from April 2011, with 25K user nodes, total supply weight > 50B impressions, and 100K ads. We compare our model to a scheme that picks articles to maximize expected CTR and picks ads to display via a separate greedy method.
Lift in impressions. [Figure: lift in impressions delivered to underperforming ads vs. the fraction of traffic that is not shaped.] Nearly a threefold improvement via traffic shaping.
Average CTR. [Figure: average CTR, as a percentage of the maximum CTR, vs. the fraction of traffic that is not shaped.] The CTR drop is < 10%.
Comparison with other methods. [Figure.]
Summary. A 3x underdelivery reduction with a < 10% CTR drop (or a 2.6x reduction with a 4% CTR drop), and the runtime application needs only a small cache.
Traffic Shaping. 1) Estimating CTR for Content Match [KDD '07]. 2) Traffic Shaping for Display Advertising [EC '12]. 3) Theoretical underpinnings [COLT '10 best student paper].
Link Prediction. Which pair of nodes {i, j} should be connected? Goal: recommend a movie (e.g., a graph of users such as Alice, Bob, and Charlie, and the movies they have watched).
Link Prediction. Which pair of nodes {i, j} should be connected? Goal: suggest friends.
Previous Empirical Studies.* Link-prediction accuracy increases in the order: Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths. (*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007.) How do we justify these observations, especially if the graph is sparse?
Link Prediction: Generative Model. Model: 1. Nodes are uniformly distributed points in a unit-volume latent space. 2. This space has a distance metric. 3. Points close to each other are likely to be connected in the graph, via a logistic distance function (Raftery et al., 2002).
Link Prediction: Generative Model. The link probability decreases from 1 toward 0 with distance, passing 1/2 at radius r; α determines the steepness of the logistic. Link prediction ≈ finding the nearest neighbor who is not currently linked to the node, which is equivalent to inferring distances in the latent space.
Common Neighbors. Pr_2(i, j) = Pr(common neighbor | d_ij): a product of two logistic probabilities, integrated over a volume determined by d_ij.
Common Neighbors. Let OPT be the node closest to i, and MAX the node with the most common neighbors with i. Theorem: with high probability, d_OPT ≤ d_MAX ≤ d_OPT + 2[ε/V(1)]^{1/D}. Hence link prediction by common neighbors is asymptotically optimal.
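The claim can be checked empirically on a random geometric graph, a simple stand-in for the latent-space model (parameters n and radius are arbitrary, and the hard-threshold connection rule is a simplification of the logistic link function).

```python
import math
import random

# Random geometric graph: n uniform points in the unit square, linked when
# their latent distance is below a fixed radius.
random.seed(42)
n, radius = 400, 0.15
pts = [(random.random(), random.random()) for _ in range(n)]

def dist(a, b):
    return math.hypot(pts[a][0] - pts[b][0], pts[a][1] - pts[b][1])

adj = [{j for j in range(n) if j != i and dist(i, j) < radius} for i in range(n)]

# For node i, compare OPT (closest non-neighbor in latent space) with MAX
# (the non-neighbor sharing the most common neighbors with i).
i = 0
candidates = [j for j in range(n) if j != i and j not in adj[i]]
cn = {j: len(adj[i] & adj[j]) for j in candidates}
max_cn = max(candidates, key=cn.get)
opt = min(candidates, key=lambda j: dist(i, j))

# Any node sharing a common neighbor with i is within 2 * radius of i, so MAX
# lands near OPT, as in the theorem's bound.
print(dist(i, opt), dist(i, max_cn))
```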
Common Neighbors: Distinct Radii. Node k has radius r_k, and i → k if d_ik ≤ r_k (a directed graph); r_k captures the popularity of node k. "Weighted" common neighbors: predict the (i, j) pairs with the highest Σ_r w(r)·η(r), where w(r) is the weight for nodes of radius r and η(r) is the number of common neighbors of radius r.
Type-2 common neighbors: r is close to the maximum radius, the range where real-world graphs generally fall. Here both the presence and the absence of a common neighbor are very informative, and a 1/r weighting corresponds to Adamic/Adar.
ℓ-hop Paths. Common neighbors are 2-hop paths. For longer paths the bounds are weaker: for ℓ' ≥ ℓ we need η_ℓ' >> η_ℓ to obtain similar bounds, which justifies the exponentially decaying weight given to longer paths by the Katz measure.
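The Katz weighting can be sketched on a tiny graph: path counts of each length are summed with geometrically decaying weights β^ℓ, so longer paths must be far more numerous to carry the same evidence. The graph and β below are illustrative.

```python
# Katz score: sum_l beta^l * (number of walks of length l from i to j).
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
n = len(adj)
A = [[1 if j in adj[i] else 0 for j in range(n)] for i in range(n)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz(i, j, beta=0.1, max_len=4):
    score, walks = 0.0, A       # walks[i][j] = number of length-l walks
    for l in range(1, max_len + 1):
        score += (beta ** l) * walks[i][j]
        walks = matmul(walks, A)
    return score

# Node 3 is two hops from node 0; node 4 is three hops away and scores lower:
print(katz(0, 3), katz(0, 4))
```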
Summary. Three key ingredients: 1. Closer points are likelier to be linked (small-world models: Watts & Strogatz, 1998; Kleinberg, 2001). 2. The triangle inequality holds (necessary to extend to ℓ-hop paths). 3. Points are spread uniformly at random (otherwise properties would depend on location as well as distance).
Summary. For the accuracy ordering Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths:* the number of paths matters, not their length; for large dense graphs, common neighbors are enough; differentiating between different degrees is important; and in sparse graphs, paths of length 3 or more help in prediction. (*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007.)
Conclusions. We discussed three problems. 1. Estimating CTR for Content Match: combat sparsity by hierarchical smoothing. 2. Traffic Shaping for Display Advertising: joint optimization of CTR and underdelivery reduction, with optimal traffic shaping at runtime using cached duals. 3. Theoretical underpinnings: a latent-space model in which link prediction ≈ finding nearest neighbors.
Other Work. Web Search: finding quicklinks, titles for quicklinks, incorporating tweets into search results, website clustering, webpage segmentation, template detection, finding hidden query aspects. Computational Advertising: combining IR with click feedback, multi-armed bandits using hierarchies, online learning under finite ad lifetimes. Graph Mining: epidemic thresholds, non-parametric prediction in dynamic graphs, graph sampling, graph generation models, community detection.