Challenges in Computational Advertising. Deepayan Chakrabarti (deepay@yahoo-inc.com)
Online Advertising Overview. Advertisers submit ads to an ad network, which picks ads to show to the user alongside content from a content provider. Examples: Yahoo, Google, MSN, RightMedia, ...
Advertising Settings: Display, Content Match, and Sponsored Search. In each setting, the ad network must pick ads.
Display: graphical display ads, mostly for brand awareness. Revenue is based on the number of impressions, not clicks.
Content Match: text ads, matched to the content of the page.
Content Match challenges: the user intent is unclear, the "query" (the webpage) is long and noisy, and revenue depends on the number of clicks.
Sponsored Search: ads matched to the user's search query.
This presentation. 1) Content Match [KDD 2007]: how can we estimate the click-through rate (CTR) of an ad on a page, with ~10^6 ads and ~10^9 pages (CTR for ad j on page i)?
This presentation. 1) Estimating CTR for Content Match [KDD '07]. 2) Traffic Shaping for Display Advertising [EC '12]: the user clicks one of several alternate article summaries and display ads are shown on the article page; we recommend articles (not ads) that need high CTR on their summaries, while preferring articles on which under-delivering ads can be shown.
This presentation. 3) Theoretical underpinnings [COLT '10 best student paper]: represent relationships as a graph, so that recommendation = link prediction (e.g., the goal of suggesting friends). Many useful heuristics exist; why do these heuristics work?
Estimating CTR for Content Match. Contextual advertising shows an ad on a webpage (an "impression"); revenue is generated if a user clicks. Problem: estimate the click-through rate (CTR) of an ad on a page, with ~10^6 ads and ~10^9 pages (CTR for ad j on page i).
Estimating CTR for Content Match. Why not use the MLE (observed clicks c over impressions N)? 1. Few (page, ad) pairs have N > 0. 2. Very few have c > 0 as well. 3. The MLE does not differentiate between 0/10 and 0/100. We have additional information: hierarchies.
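A minimal sketch of these failure modes, with hypothetical counts: the MLE assigns the same estimate (zero) to a region with 10 impressions and one with 100, and gives no estimate at all when N = 0.

```python
# Toy illustration of the MLE's failure on sparse click data.
# The counts below are hypothetical, not from the talk's dataset.
def mle_ctr(clicks, impressions):
    """MLE of CTR: observed clicks / observed impressions."""
    return clicks / impressions if impressions > 0 else None

print(mle_ctr(0, 10), mle_ctr(0, 100))  # 0.0 0.0 -- indistinguishable
print(mle_ctr(0, 0))                    # None -- most (page, ad) pairs look like this
```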
Estimating CTR for Content Match. Use an existing, well-understood hierarchy, and categorize ads and webpages to leaves of the hierarchy. CTR estimates of siblings are correlated, and the hierarchy allows us to aggregate data: coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions.
Estimating CTR for Content Match. Region hierarchy: a region = (page node, ad node), and the region hierarchy is the cross-product of the page hierarchy and the ad hierarchy, with level 0 at the root.
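The cross-product construction can be sketched directly; the hierarchies and node names below are hypothetical.

```python
from itertools import product

# Toy page and ad hierarchies, keyed by level (0 = root); names are made up.
page_tree = {0: ["page-root"], 1: ["news", "sports"],
             2: ["politics", "tech", "soccer", "tennis"]}
ad_tree = {0: ["ad-root"], 1: ["auto", "finance"],
           2: ["cars", "parts", "loans", "insurance"]}

# A region at level i is a (page node, ad node) pair at that level, so the
# region hierarchy is the per-level cross-product of the two hierarchies.
regions = {lvl: list(product(page_tree[lvl], ad_tree[lvl])) for lvl in page_tree}

print(len(regions[0]), len(regions[1]), len(regions[2]))  # 1 4 16
```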
Estimating CTR for Content Match. Our approach: data transformation, then the model, then model fitting.
Data Transformation. Problem: raw per-region click rates are noisy, and their variance depends on the mean. Solution: the Freeman-Tukey transform, y_r = sqrt(c_r/N_r) + sqrt((c_r+1)/N_r), which differentiates regions with 0 clicks and stabilizes variance: Var(y_r) is roughly proportional to 1/N_r.
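A sketch of the transform as read from the slide's labels (the exact form, y_r = sqrt(c/N) + sqrt((c+1)/N) with N * Var(y_r) roughly constant, is a reconstruction and should be treated as an assumption): zero-click regions with different impression counts now get different values, and a small simulation checks the variance stabilization.

```python
import math
import random

def freeman_tukey(c, n):
    # Reconstructed form of the Freeman-Tukey transform for proportions.
    return math.sqrt(c / n) + math.sqrt((c + 1) / n)

# Unlike the MLE, 0/10 and 0/100 are now differentiated:
print(freeman_tukey(0, 10), freeman_tukey(0, 100))

# Variance stabilization: N * Var(y_r) stays roughly constant across N.
random.seed(0)
def n_times_var(p, n, trials=2000):
    ys = [freeman_tukey(sum(random.random() < p for _ in range(n)), n)
          for _ in range(trials)]
    mean = sum(ys) / trials
    return n * sum((y - mean) ** 2 for y in ys) / trials

print(round(n_times_var(0.05, 200), 2), round(n_times_var(0.05, 800), 2))
```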
Model. Goal: smoothing across siblings in the hierarchy [Huang & Cressie, 2000]. 1. Each region r has a latent state S_r. 2. The observable y_r is independent of the hierarchy given S_r. 3. S_r is drawn from its parent's state S_pa(r).
Model (state-space form): each region's observation y_r is a noisy version of its latent state S_r, with observation variance V_r; and S_r is drawn from its parent's state S_pa(r), with covariates u_r, regression coefficients β_r, and state-transition variance W_r (likewise for the parent, with V_pa(r), W_pa(r), β_pa(r), u_pa(r)).
Model. However, learning W_r, V_r, and β_r for each region is clearly infeasible. Assumptions: all regions at the same level ℓ share the same W(ℓ) and β(ℓ), and V_r = V/N_r for some constant V, since the Freeman-Tukey transform gives Var(y_r) proportional to 1/N_r.
Model. Implications: W determines the degree of smoothing. Large W: S_r varies greatly from S_pa(r), each region learns its own S_r, and there is no smoothing. W near 0: all S_r are identical, a regression model on the features u_r is learnt, and smoothing is maximal.
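The role of W can be seen in a one-node Gaussian shrinkage sketch (an illustration of the smoothing knob, not the paper's full tree recursion; all numbers are hypothetical).

```python
def posterior_state(y_r, s_parent, V_r, W):
    # Precision-weighted combination of the region's own (transformed) data
    # (y_r, observation variance V_r) and its parent's state (s_parent,
    # prior variance W).
    precision = 1.0 / V_r + 1.0 / W
    return (y_r / V_r + s_parent / W) / precision

# Large W: S_r is free to differ from its parent -- no smoothing.
print(posterior_state(0.3, 0.1, 0.05, 1e9))   # ~0.3: keeps its own estimate
# W near 0: S_r collapses onto the parent -- maximum smoothing.
print(posterior_state(0.3, 0.1, 0.05, 1e-9))  # ~0.1: the parent's state
```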
Model. Implications: W determines the degree of smoothing, and Var(S_r) increases from root to leaf, giving better estimates at coarser resolutions.
Model. Implications: correlations among siblings at level ℓ depend only on the level of their least common ancestor, so siblings with a deeper common ancestor are more strongly correlated.
Estimating CTR for Content Match. Our approach: data transformation (Freeman-Tukey), model (tree-structured Markov chain), model fitting.
Model Fitting. Fitting uses a Kalman filtering algorithm. Filtering: recursively aggregate data from leaves to root. Smoothing: propagate information from root to leaves. Complexity is linear in the number of regions, in both time and space. The Kalman filter requires knowledge of β, V, and W, so EM is wrapped around the Kalman filter.
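The leaves-to-root / root-to-leaves structure can be sketched with a simple empirical-Bayes-style two-pass routine (this is not the paper's exact Kalman recursions; the tree, counts, and shrinkage strength are hypothetical).

```python
# Linear-time two-pass sketch over a hierarchy: aggregate data leaves-to-root,
# then push smoothed estimates root-to-leaves.
tree = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1", "B2"]}
clicks = {"A1": 0, "A2": 3, "B1": 1, "B2": 0}
imps = {"A1": 50, "A2": 1000, "B1": 400, "B2": 20}
smoothed = {}

def upward(node):
    # "Filtering" pass: aggregate clicks and impressions from leaves to root.
    for child in tree.get(node, []):
        c, n = upward(child)
        clicks[node] = clicks.get(node, 0) + c
        imps[node] = imps.get(node, 0) + n
    return clicks[node], imps[node]

def downward(node, parent_rate, strength=100.0):
    # "Smoothing" pass: shrink each node's MLE toward its parent's smoothed
    # rate; nodes with little data borrow more from the parent.
    rate = (clicks[node] + strength * parent_rate) / (imps[node] + strength)
    smoothed[node] = rate
    for child in tree.get(node, []):
        downward(child, rate)

c, n = upward("root")
downward("root", c / n)
# Zero-click leaves now get small but distinct positive estimates:
print(round(smoothed["A1"], 5), round(smoothed["B2"], 5))
```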
Experiments. 503M impressions; a 7-level hierarchy, of which the top 3 levels were used. Zero clicks in 76% of regions at level 2 and in 95% of regions at level 3. Full dataset DFULL, and a 2/3 sample DSAMPLE.
Experiments. Estimate CTRs for all regions R at level 3 with zero clicks in DSAMPLE. Some of these regions (call them R>0) get clicks in DFULL. A good model should predict higher CTRs for R>0 than for the other regions in R.
Experiments. We compared 4 models: TS, our tree-structured model; LM (level-mean), with each level smoothed independently; NS (no smoothing), with CTR proportional to 1/N_r; and Random, which, assuming |R>0| is given, randomly predicts the membership of R>0 out of R.
Experiments. [Figure: lift curves for TS, Random, LM, and NS.]
Experiments. The MLE is 0 everywhere, since 0 clicks were observed. What about the estimated CTRs? [Plots of estimated CTR vs. impressions for NS and TS: TS shows variability inherited from coarser resolutions and is close to the MLE for large N.]
Estimating CTR for Content Match. We presented a method to estimate the rates of extremely rare events at multiple resolutions under severe sparsity. Key points: a tree-structured generative model and extremely fast parameter fitting.
Traffic Shaping. 1) Estimating CTR for Content Match [KDD '07]. 2) Traffic Shaping for Display Advertising [EC '12]. 3) Theoretical underpinnings [COLT '10 best student paper].
Traffic Shaping. Which article summary should be picked from the article pool? The one with the highest expected CTR. Which ad should be displayed? The ad that minimizes underdelivery.
Underdelivery. Advertisers are guaranteed some impressions (say, 1M) over some time period (say, 2 months), only to users matching their specs, only when they visit certain types of pages, and only on certain positions on the page. An underdelivering ad is one that is likely to miss its guarantee.
Underdelivery. How can underdelivery be computed? It requires user traffic forecasts, and it depends on the other ads in the system. Form a bipartite graph between forecasted impressions ℓ = (user, article, position), each with supply s_ℓ, and the ad inventory, with demand d_j for each ad j; an ad-serving system will try to minimize under-delivery on this graph.
Traffic Shaping. Which article summary should be picked? The one with the highest expected CTR. Which ad should be displayed? The ad that minimizes underdelivery. Goal: combine the two. Bias the article summary selection to reduce under-delivery with an insignificant drop in CTR, and do this in real time.
Outline: formulation as an optimization problem; real-time solution; empirical results.
Formulation. Nodes: k (user); i (user, article); ℓ (user, article, position), a "fully qualified impression"; and j (ads). Edges carry supply s_k, CTR c_ki, traffic-shaping fraction w_ki, demand d_j, and ad-delivery fraction φ_ℓj. Goal: infer the traffic-shaping fractions w_ki.
Formulation. Full traffic-shaping graph: all forecasted user traffic crossed with all available articles, whether users arrive at the homepage or directly on an article page. Goal: infer w_ki, but we are forced to infer φ_ℓj as well.
Formulation. Underdelivery of ad j = max(0, d_j minus the total user traffic flowing to j, accounting for CTR loss), where the traffic reaching j is built from terms of the form s_k · w_ki · c_ki · φ_ℓj; the demand constraints require this flow to satisfy the demand d_j.
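As a concrete (hypothetical) instance of the underdelivery term: the traffic reaching an ad is the product of the factors along each supply path, and underdelivery is the shortfall against demand. All numbers below are made up.

```python
def underdelivery(demand_j, delivered_j):
    # max(0, d_j - total traffic flowing to j)
    return max(0.0, demand_j - delivered_j)

# Two supply paths can show adA; each path contributes s_k * w_ki * c_ki,
# scaled by the ad-delivery fraction phi for that path.
s_times_w_times_c = {"ell1": 100.0, "ell2": 50.0}
phi = {"ell1": 0.5, "ell2": 0.25}

delivered_to_A = sum(s_times_w_times_c[l] * phi[l] for l in phi)  # 50 + 12.5
print(underdelivery(200.0, delivered_to_A))  # 137.5: adA misses its demand
print(underdelivery(50.0, delivered_to_A))   # 0.0: demand fully met
```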
Formulation. Constraints: bounds on the traffic-shaping fractions; shape only the available traffic; satisfy the demand constraints; and keep the ad-delivery fractions valid.
Key Transformation. Define z_ℓj = the fraction of supply that is shown ad j, assuming the user always clicks the article. This allows a reformulation solely in terms of the new variables z_ℓj.
Formulation. The resulting convex program can be solved optimally.
Formulation. But we have another problem: at runtime, we must shape every incoming user without looking at the entire graph. Solution: periodically solve the convex problem offline, store a cache derived from this solution, and reconstruct the optimal solution for each user at runtime using only the cache.
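The offline/online split can be sketched as follows. The per-ad dual prices alpha_j, CTRs, and the additive scoring rule are all hypothetical illustrations of "reconstruct from a cached solution", not the paper's exact KKT-based reconstruction.

```python
# Offline: solve the convex shaping problem periodically and cache a small
# summary, here hypothetical per-ad dual prices alpha_j (high alpha = the ad
# is under-delivering). Online: score candidate articles from the cache alone.
cached_alpha = {"adA": 0.8, "adB": 0.1}            # cached duals (made up)
eligible_ads = {"art1": ["adB"], "art2": ["adA"]}  # ads servable per article
ctr = {"art1": 0.050, "art2": 0.048}               # expected summary CTRs

def pick_article(user_articles):
    def score(a):
        # High-CTR articles are preferred, but articles that can serve
        # under-delivering (high-dual) ads get a boost.
        return ctr[a] + 0.01 * sum(cached_alpha[j] for j in eligible_ads[a])
    return max(user_articles, key=score)

# art2 has slightly lower CTR but can serve the under-delivering adA:
print(pick_article(["art1", "art2"]))  # art2
```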
Outline: formulation as an optimization problem; real-time solution; empirical results.
Real-time solution. Cache the dual variables of the offline solution and reconstruct the primal solution from them; all constraints can be expressed as constraints on σ_ℓ.
Real-time solution, from the KKT conditions: (1) the shape of Σ_j z_ℓj as a function of σ_ℓ depends on the cached duals α_j; (2) σ_ℓ = 0 unless Σ_j z_ℓj = max_ℓ Σ_j z_ℓj; (3) Σ_ℓ σ_ℓ is constant for all i connected to k. Algorithm: initialize σ_ℓ = 0; compute Σ_j z_ℓj from (1); if the constraints (with bounds L_i and U_i) are unsatisfied, increase σ_ℓ while satisfying (2) and (3); repeat; finally, extract w_ki from z_ℓj.
Results. Data: historical traffic logs from April 2011, with 25K user nodes, total supply weight > 50B impressions, and 100K ads. We compare our model to a scheme that picks articles to maximize expected CTR and picks ads to display via a separate greedy method.
Lift in impressions. [Figure: lift in impressions delivered to underperforming ads vs. the fraction of traffic that is not shaped.] Nearly a threefold improvement via traffic shaping.
Average CTR. [Figure: average CTR, as a percentage of the maximum CTR, vs. the fraction of traffic that is not shaped.] The CTR drop is < 10%.
Comparison with other methods. [Figure.]
Summary. A 3x underdelivery reduction with a < 10% CTR drop (or a 2.6x reduction with a 4% CTR drop), and the runtime application needs only a small cache.
Traffic Shaping. 1) Estimating CTR for Content Match [KDD '07]. 2) Traffic Shaping for Display Advertising [EC '12]. 3) Theoretical underpinnings [COLT '10 best student paper].
Link Prediction. Which pair of nodes {i, j} should be connected? Goal: recommend a movie (e.g., a graph of users such as Alice, Bob, and Charlie, and the movies they have watched).
Link Prediction. Which pair of nodes {i, j} should be connected? Goal: suggest friends.
Previous Empirical Studies.* Link-prediction accuracy increases in the order: Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths. (*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007.) How do we justify these observations, especially if the graph is sparse?
Link Prediction: Generative Model. Model: 1. Nodes are uniformly distributed points in a unit-volume latent space. 2. This space has a distance metric. 3. Points close to each other are likely to be connected in the graph, via a logistic distance function (Raftery et al., 2002).
Link Prediction: Generative Model. The link probability decreases from 1 toward 0 with distance, passing 1/2 at radius r; α determines the steepness of the logistic. Link prediction ≈ finding the nearest neighbor who is not currently linked to the node, which is equivalent to inferring distances in the latent space.
Common Neighbors. Pr_2(i, j) = Pr(common neighbor | d_ij): a product of two logistic probabilities, integrated over a volume determined by d_ij.
Common Neighbors. Let OPT be the node closest to i, and MAX the node with the most common neighbors with i. Theorem: with high probability, d_OPT ≤ d_MAX ≤ d_OPT + 2[ε/V(1)]^{1/D}. Hence link prediction by common neighbors is asymptotically optimal.
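The claim can be checked empirically on a random geometric graph, a simple stand-in for the latent-space model (parameters n and radius are arbitrary, and the hard-threshold connection rule is a simplification of the logistic link function).

```python
import math
import random

# Random geometric graph: n uniform points in the unit square, linked when
# their latent distance is below a fixed radius.
random.seed(42)
n, radius = 400, 0.15
pts = [(random.random(), random.random()) for _ in range(n)]

def dist(a, b):
    return math.hypot(pts[a][0] - pts[b][0], pts[a][1] - pts[b][1])

adj = [{j for j in range(n) if j != i and dist(i, j) < radius} for i in range(n)]

# For node i, compare OPT (closest non-neighbor in latent space) with MAX
# (the non-neighbor sharing the most common neighbors with i).
i = 0
candidates = [j for j in range(n) if j != i and j not in adj[i]]
cn = {j: len(adj[i] & adj[j]) for j in candidates}
max_cn = max(candidates, key=cn.get)
opt = min(candidates, key=lambda j: dist(i, j))

# Any node sharing a common neighbor with i is within 2 * radius of i, so MAX
# lands near OPT, as in the theorem's bound.
print(dist(i, opt), dist(i, max_cn))
```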
Common Neighbors: Distinct Radii. Node k has radius r_k, and i → k if d_ik ≤ r_k (a directed graph); r_k captures the popularity of node k. "Weighted" common neighbors: predict the (i, j) pairs with the highest Σ_r w(r)·η(r), where w(r) is the weight for nodes of radius r and η(r) is the number of common neighbors of radius r.
Type-2 common neighbors: r is close to the maximum radius, the range where real-world graphs generally fall. Here both the presence and the absence of a common neighbor are very informative, and a 1/r weighting corresponds to Adamic/Adar.
ℓ-hop Paths. Common neighbors are 2-hop paths. For longer paths the bounds are weaker: for ℓ' ≥ ℓ we need η_ℓ' >> η_ℓ to obtain similar bounds, which justifies the exponentially decaying weight given to longer paths by the Katz measure.
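The Katz weighting can be sketched on a tiny graph: path counts of each length are summed with geometrically decaying weights β^ℓ, so longer paths must be far more numerous to carry the same evidence. The graph and β below are illustrative.

```python
# Katz score: sum_l beta^l * (number of walks of length l from i to j).
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
n = len(adj)
A = [[1 if j in adj[i] else 0 for j in range(n)] for i in range(n)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz(i, j, beta=0.1, max_len=4):
    score, walks = 0.0, A       # walks[i][j] = number of length-l walks
    for l in range(1, max_len + 1):
        score += (beta ** l) * walks[i][j]
        walks = matmul(walks, A)
    return score

# Node 3 is two hops from node 0; node 4 is three hops away and scores lower:
print(katz(0, 3), katz(0, 4))
```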
Summary. Three key ingredients: 1. Closer points are likelier to be linked (small-world models: Watts & Strogatz, 1998; Kleinberg, 2001). 2. The triangle inequality holds (necessary to extend to ℓ-hop paths). 3. Points are spread uniformly at random (otherwise properties would depend on location as well as distance).
Summary. For the accuracy ordering Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths:* the number of paths matters, not their length; for large dense graphs, common neighbors are enough; differentiating between different degrees is important; and in sparse graphs, paths of length 3 or more help in prediction. (*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007.)
Conclusions. We discussed three problems. 1. Estimating CTR for Content Match: combat sparsity by hierarchical smoothing. 2. Traffic Shaping for Display Advertising: joint optimization of CTR and underdelivery reduction, with optimal traffic shaping at runtime using cached duals. 3. Theoretical underpinnings: a latent-space model in which link prediction ≈ finding nearest neighbors.
Other Work. Web Search: finding quicklinks, titles for quicklinks, incorporating tweets into search results, website clustering, webpage segmentation, template detection, finding hidden query aspects. Computational Advertising: combining IR with click feedback, multi-armed bandits using hierarchies, online learning under finite ad lifetimes. Graph Mining: epidemic thresholds, non-parametric prediction in dynamic graphs, graph sampling, graph generation models, community detection.