Presentation on theme: "Online Advertising Multi-billion dollar industry, high growth"— Presentation transcript:

1 Statistical Challenges in Online Advertising Deepak Agarwal Deepayan Chakrabarti (Yahoo! Research)

2 Online Advertising Multi-billion dollar industry, high growth
$9.7B spent online in 2006 (a 17% increase), out of roughly $150B total advertising spend Why will this continue? Broadband is cheap and ubiquitous "Getting things done" is easier on the internet Advertisers are shifting dollars online Why does it work? Massive scale, automated, low marginal cost Key: monetize more and better, "learn from data" New discipline: "Computational Advertising"
Speaker notes: A young but multi-billion dollar industry, as evident from the phenomenal success of companies like Google, Yahoo!, and MSN; it continues to grow at a rapid rate. Broadband is cheap and ubiquitous; people spend more time on the internet since search engines and other services make it easier to get things done online; content on the WWW is growing by leaps and bounds. This has caught the eye of the advertising industry, which is shifting advertising dollars to the internet relative to other media such as television, radio, and newspapers. Why does the business work, and work so well? Many bright minds dismissed its potential a few years ago. I will describe it abstractly now; hopefully it will make much more sense by the end of the talk. It is an extremely large-scale system: several billion transactions are conducted every day in an almost fully automated fashion. Only a small fraction monetize, but that is enough to make it a lucrative business. The key to success is monetizing more and better through automated learning from the massive amounts of data constantly flowing into the system. This has given rise to a new academic discipline called "computational advertising."

3 What is “Computational Advertising”?
New scientific sub-discipline, at the intersection of Large scale search and text analysis Information retrieval Statistical modeling Machine learning Optimization Microeconomics Multi-disciplinary; composed of several key components

4 Online advertising: 6000 ft Overview
[Diagram: the Ad Network picks ads from Advertisers and places them alongside the Content Provider's content, which is shown to the User.] Examples: Yahoo!, Google, MSN, RightMedia, …
Speaker notes: This only shows one scenario, that of content match. Let's add sponsored search (replace Content with Query) and have a new slide for display advertising. This also does not show the revenue model (shall we add it here or later?).

5 Outline Background on online advertising The Fundamental Problem
Sponsored Search, Content Match, Display, Unified marketplace The Fundamental Problem Statistical sub-problems: Description Existing methods Challenges

6 Online Advertising Different flavors Revenue Models Misc. Ad exchanges
Advertising Setting CPM CPC CPA Display Content Match Sponsored Search

7 Revenue Models Ad Network Advertisers Pick ads Ads Content User
CPM CPC CPA Cost Per iMpression Ad Network Pick ads Ads Advertisers Content User $$ $ Content Provider

8 Revenue Models Ad Network Advertisers Pick ads Ads Content User
CPM CPC CPA Cost Per Click Ad Network click Pick ads Ads Advertisers Content User $$ $ Content Provider

9 Advertiser landing page
Revenue Models Advertiser landing page Cost Per Action CPM CPC CPA Ad Network click Pick ads Ads Advertisers Content User $$ $ Content Provider

10 Revenue Models CPM CPC CPA Depends on auction mechanism
Example: Suppose we show an ad N times on the same spot Under CPM: Revenue = N * CPM Under CPC: Revenue = N * CTR * CPC CPM CPC CPA Depends on auction mechanism Click-through Rate (probability of a click given an impression)

11 Auction Mechanism Revenue depends on type of auction
Generalized First-price: CPC = bid on clicked ad Generalized Second-price: CPC = bid of ad below clicked ad (or the reserve price) CPC could be modified by additional factors [Optimal Auction Design in a Multi-Unit Environment: The Case of Sponsored Search Auctions] by Edelman+/2006 [Internet Advertising and the Generalized Second Price Auction…] by Edelman+/2006
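The generalized second-price rule above can be sketched in a few lines. This is a toy illustration, not the slides' auction system: the bids and reserve price are made up, and real auctions also fold in quality factors (the "additional factors" mentioned above).

```python
# Generalized second-price (GSP) sketch: the advertiser in slot i pays
# the bid of the ad ranked just below it, or the reserve price if no
# ad is below. Bids and reserve are illustrative numbers.

def gsp_prices(bids, reserve):
    ranked = sorted(bids, reverse=True)      # rank ads by bid
    prices = []
    for i in range(len(ranked)):
        below = ranked[i + 1] if i + 1 < len(ranked) else reserve
        prices.append(max(below, reserve))   # never below the reserve
    return ranked, prices

ranked, prices = gsp_prices([0.50, 1.20, 0.80], reserve=0.10)
# ranked = [1.20, 0.80, 0.50]; prices = [0.80, 0.50, 0.10]
```

Note that each winner pays strictly less than (or equal to) its own bid, which is what distinguishes GSP from the generalized first-price rule.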

12 Revenue Models CPM CPC CPA
Example: Suppose we show an ad N times on the same spot Under CPM: Revenue = N * CPM Under CPC: Revenue = N * CTR * CPC Under CPA: Revenue = N * CTR * Conv. Rate * CPA CPM CPC CPA Conversion Rate (probability of a user conversion on the advertiser’s landing page given a click)
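The three revenue formulas above can be checked with a small worked example. All numbers here are illustrative; also note that in practice CPM is quoted per 1000 impressions, whereas the slide (and this sketch) treats it as a per-impression price.

```python
# Expected revenue from N impressions of the same spot under each
# revenue model, following the formulas on the slide.

def revenue_cpm(n, cpm):                    # pay per impression shown
    return n * cpm

def revenue_cpc(n, ctr, cpc):               # pay only when clicked
    return n * ctr * cpc

def revenue_cpa(n, ctr, conv_rate, cpa):    # pay only on conversion
    return n * ctr * conv_rate * cpa

N = 1_000_000
r_cpm = revenue_cpm(N, cpm=0.002)                            # 2000.0
r_cpc = revenue_cpc(N, ctr=0.002, cpc=1.5)                   # 3000.0
r_cpa = revenue_cpa(N, ctr=0.002, conv_rate=0.05, cpa=40.0)  # 4000.0
```

With these (made-up) rates the CPC and CPA campaigns out-earn the CPM one, but only because the CTR and conversion-rate estimates hold up; this is why rate estimation is the statistical core of the talk.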

13 Relevance to advertisers
Revenue Models CPM website traffic CPC website traffic + ad relevance CPA website traffic + ad relevance + landing page quality Revenue dependence Relevance to advertisers Prices and Bids Ease of picking ads

14 Online Advertising Background Revenue Models Misc. Ad exchanges
Advertising Setting CPM CPC CPA Display Content Match Sponsored Search

15 Advertising Setting Pick ads Ads Advertisers Content
What do you show the user? How does the user interact with the ad system? Ad Network User Content Provider

16 Advertising Setting Display Content Match Sponsored Search

17 Advertising Setting Display Content Match Sponsored Search Pick ads

18 Advertising Setting Graphical display ads Mostly for brand awareness
Revenue model is typically CPM Display Content Match Sponsored Search

19 Advertising Setting Display Content Match Sponsored Search
Content match ad

20 Match ads to the content
Advertising Setting Display Content Match Sponsored Search Text ads Pick ads Match ads to the content

21 Advertising Setting Display Content Match Sponsored Search
The user intent is unclear Revenue model is typically CPC Query (webpage) is long and noisy Display Content Match Sponsored Search

22 Advertising Setting Display Content Match Sponsored Search
Search Query Sponsored Search Ads

23 Advertising Setting Search Query Display Content Match
Sponsored Search Pick ads Text ads Search Query Match ads to the query

24 Advertising Setting Display Content Match Sponsored Search
User “declares” his/her intention Click rates generally higher than for Content Match Revenue model is typically CPC (recently some CPA) Query is short and less noisy than Content Match Display Content Match Sponsored Search

25 Summary Different revenue models Brand awareness
Depends on the goal of the advertiser campaign Brand awareness Display advertising Pay per impression (CPM) Attracting users to advertised product Content Match, Sponsored Search Pay per click (CPC), Pay per action (CPA)

26 Online Advertising Background Revenue Models Misc. Ad exchanges
Advertising Setting CPM CPC CPA Display Content Match Sponsored Search

27 Unified Marketplace
Publishers, ad networks, and advertisers participate together in a single exchange Publishers put impressions into the exchange; advertisers and ad networks bid for them CPM, CPC, and CPA are all integrated into a single auction mechanism

28 Overview: The Open Exchange
[Diagram: a publisher has an ad impression to sell and auctions it in the exchange. Bids of $0.50 and $0.60 arrive directly; a $0.75 bid arrives via a network (e.g., Ad.com, AdSense) and becomes a $0.45 bid; a $0.65 bid wins.] Transparency and value

29 Unified scale: Expected CPM
Campaigns are CPC, CPA, CPM They may all participate in an auction together Converting to a common denomination is a challenge

30 Outline Background on online advertising The Fundamental Problem
Statistical sub-problems: Description Existing methods Challenges

31 Outline Background on online advertising The Fundamental Problem
Display advertising Sponsored Search and Content Match Statistical sub-problems: Description Existing methods Challenges

32 Display Advertising

33 Does it work? Lewis and Reiley, Retail Advertising Works! Yahoo! Technical Report Controlled experiment assigning customers to treatment and control groups for a large retailer Advertising significantly improved purchases (online and offline) Good news! Main goal is brand awareness Revenue model typically CPM, advertiser takes all the risk

34 Display advertising: Buyer and Seller
Advertiser (buyer): buys ad space well in advance or in the spot market May buy from a publisher with or without a delivery guarantee; guaranteed contracts typically cost more (higher price) Publisher (seller): sells in advance (guaranteed) or in the spot market

35 Display Advertising Main goal of advertisers: Brand Awareness
Revenue Model: Primarily Cost per impression (CPM) Traditional Advertising Model: Ads are targeted at particular demographics (user characteristics) GM ads on Y! autos shown to “males above 55” Mortgage ad shown to “everybody on Y! Front page” Book a slot well in advance “2M impressions in Jan next year” These future impressions must be guaranteed by the ad network

36 Display Advertising
Fundamental Problem: Guarantee impressions to advertisers Predict Supply: How many impressions will be available? Demographics overlap Predict Demand: How much will advertisers want of each demographic? [Venn diagram: overlapping user sets US, Young, Female, Y! Mail, with impression counts in each region]

37 Display Advertising
Fundamental Problem: Guarantee impressions to advertisers Predict Supply Predict Demand Find the optimal allocation subject to supply constraints (si), demand constraints (dj), and allocations xij from supply pool i to contract j [same Venn diagram of overlapping demographics]
Speaker notes: This gives rise to an allocation problem: which supply pools should serve which contracts? There are several feasible solutions; which one is best, given demand and supply?

38 Display Advertising Fundamental Problem: Guarantee impressions to advertisers Predict Supply Predict Demand Find the optimal allocation, subject to constraints Optimal in terms of what objective function? Depends on the goal of the ad network Advertisers don't specify too many targeting constraints: it is dangerous for ad networks to allow that (difficult to forecast), BUT advertisers expect good inventory (campaigns should yield good conversion rates)

39 Allocation through Optimization
Optimal in terms of what objective function? E.g. Maximize value of remaining inventory Cherry-picks valuable inventory, saves it for later Fairness “Spreads the wealth” subject to constraints si supply demand dj xij

40 Example
Supply pools: US,Y,nF (supply = 2, price = 1) and US,Y,F (supply = 3, price = 5) Demand: 2 impressions for "US & Y" How should we distribute impressions from the supply pools to satisfy this demand? [Venn diagram: overlapping sets Young, US, Female, Y! Mail with per-region counts]

41 Example (Cherry-picking)
Cherry-picking: fulfill demands at least cost Here both impressions of the "US & Y (2)" demand are served from the cheap pool US,Y,nF (supply = 2, price = 1); the expensive pool US,Y,F (supply = 3, price = 5) is left untouched

42 Example (Fairness)
Cherry-picking: fulfill demands at least cost Fairness: equitable distribution across the available supply pools Here the "US & Y (2)" demand is served with 1 impression from US,Y,nF (supply = 2, cost = 1) and 1 from US,Y,F (supply = 3, cost = 5)
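The two allocation schemes in this example can be sketched in pure Python. This is only the toy two-pool case with the same numbers as above; real systems solve a large constrained optimization instead, and the round-robin "fairness" rule here is just one simple stand-in for an equitable objective.

```python
# Two toy allocation schemes over the example's supply pools:
# (name, supply, cost), with a demand of 2 "US & Y" impressions.

pools = [("US,Y,nF", 2, 1), ("US,Y,F", 3, 5)]

def cherry_pick(pools, demand):
    # Fulfill the demand at least cost: drain the cheapest pools first.
    alloc = {}
    for name, supply, cost in sorted(pools, key=lambda p: p[2]):
        take = min(supply, demand)
        if take > 0:
            alloc[name] = take
            demand -= take
    return alloc

def fair(pools, demand):
    # Equitable distribution: hand out one impression at a time,
    # round-robin over pools that still have supply
    # (assumes total supply covers the demand).
    remaining = {name: supply for name, supply, _ in pools}
    alloc = {name: 0 for name, _, _ in pools}
    names, i = [name for name, _, _ in pools], 0
    while demand > 0:
        name = names[i % len(names)]
        if remaining[name] > 0:
            alloc[name] += 1
            remaining[name] -= 1
            demand -= 1
        i += 1
    return alloc

cherry = cherry_pick(pools, 2)   # {"US,Y,nF": 2}
even = fair(pools, 2)            # {"US,Y,nF": 1, "US,Y,F": 1}
```

Cherry-picking reproduces slide 41 (both impressions from the cheap pool) and the round-robin rule reproduces slide 42 (one from each).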

43 Example of an objective function

44 Display Advertising Fundamental Problem: Guarantee impressions to advertisers Predict Supply Predict Demand Find the optimal allocation, subject to constraints Pick the right objective function Further issues: Risk management: supply and demand forecasts should include both mean and variance Forecast aggregation: forecasts may be needed at multiple resolutions, in time and in demographics Challenging time-series problem Adapting the system to external events
Speaker notes: Variance estimates are important; we cannot forecast everything accurately, so solutions should use variance for risk management. For example, financial-crisis news increases traffic to Finance; we should be able to adapt our forecasts and take advantage.

45 Display Advertising Fundamental Problem: Guarantee impressions to advertisers Predict Supply Predict Demand Find the optimal allocation, subject to constraints Pick the right objective function Forecasting accuracy is critical! Overshoot → under-delivery of impressions → unhappy advertisers Undershoot → loss in revenue

46 Outline Background on online advertising The Fundamental Problem
Display advertising Sponsored Search and Content Match Statistical sub-problems: Description Existing methods Challenges

47 Sponsored Search and Content Match
Given a query: Select the top-k ads to be shown on the k slots to maximize total expected revenue What is total expected revenue?

48 Example (Content Match)
Ad Position 1 Ad Position 2 Ad Position 3 Relevant ads are placed when a user visits a webpage; no query is specified by the user. It is hard to guess what he wants; we can only use the context (i.e., the type of page he is viewing) and his other characteristics (browsing behavior, demographics). Payment happens on clicks.

49 Example (Content Match)

50 Reminder: Auction Mechanism
Revenue depends on type of auction Generalized First-price: CPC = bid on clicked ad Generalized Second-price: CPC = bid of ad below clicked ad (or the reserve price) CPC could be modified by additional factors Total expected revenue = revenue obtained in a given time window [Optimal Auction Design in a Multi-Unit Environment: The Case of Sponsored Search Auctions] by Edelman+/2006 [Internet Advertising and the Generalized Second Price Auction…] by Edelman+/2006

51 Sponsored Search and Content Match
Given a query: select the top-k ads to be shown in the k slots to maximize total expected revenue What affects the total revenue? Relevance of the ad to the query Bids on the ads User experience on the ad landing page (ad "quality") Expected total revenue is some function of these; optimizing merely by CTR is myopic
Speaker notes: The obvious factors are relevance and bids. The long-term and less obvious factor is user experience; it has to be incorporated into the ranking formula.

52 Sponsored Search and Content Match
Given a query: Select the top-k ads to be shown on the k slots to maximize total expected revenue Fundamental Problem: Estimate relevance of the ad to the query

53 Ad Relevance Computation
53

54 Overview Information Retrieval (IR)
Techniques Challenges Machine Learning using Click Feedback Online Learning

55 IR-based ad matching
"Why not use a search engine to match ads to context?" Ads are the "documents" The context (user query or webpage content) is the "query" Three broad approaches: Vector space models Probabilistic models Language models Open-source software is available (e.g., the Lemur toolkit)

56 IR-based ad matching Vector space models: Probabilistic models
Each word/phrase in the vocabulary is a separate dimension Each ad and query is a point in this vector space Example: cosine similarity Probabilistic models Language models

57 IR-based ad matching
Q1: How can we score the goodness of an ad for a context? Cosine similarity between the query vector and the ad vector Advantages: Simple and easy to interpret Normalizes for different ad and context lengths
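A minimal cosine-similarity scorer over bag-of-words term-frequency vectors looks like this. The query and ad texts are made-up toy strings; real systems would use tf-idf weights rather than raw counts.

```python
# Cosine similarity between a "query" (page context) and an "ad",
# both represented as bag-of-words term-frequency vectors.

import math
from collections import Counter

def cosine(q_text, a_text):
    q, a = Counter(q_text.split()), Counter(a_text.split())
    dot = sum(q[w] * a[w] for w in q)              # shared-term overlap
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in a.values())))
    return dot / norm if norm else 0.0

s = cosine("digital camera reviews", "cheap digital camera deals")
# shared terms "digital", "camera" give dot = 2;
# norms are sqrt(3) * sqrt(4), so s = 1/sqrt(3) ≈ 0.577
```

The length normalization in the denominator is what lets a short query be compared fairly against ads (or pages) of very different lengths.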

58 IR-based ad matching Vector space models Probabilistic models:
Predict, for every (ad, query) pair, the probability that the ad is relevant to the query Example: Okapi BM25 Language models

59 IR-based ad matching
Q1: How can we score the goodness of an ad for a context? Okapi BM25: score(q, d) = Σ_t IDF(t) · [tf_d(t)·(k1+1)] / [tf_d(t) + k1·(1 − b + b·|d|/avgdl)] · [tf_q(t)·(k3+1)] / [k3 + tf_q(t)], combining the term frequency in the ad, the term frequency in the query, the inverse document frequency, and the normalized document length, with tunable parameters k1, b, k3

60 IR-based ad matching
Q1: How can we score the goodness of an ad for a context? Okapi BM25 (continued) Advantages: Different terms are weighted differently Tunable parameters Good performance
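A small BM25 scorer over a toy ad corpus makes the weighting concrete. The corpus, query, and parameter values k1 = 1.2 and b = 0.75 are illustrative (those are conventional defaults, not values from the slides), and the query-term-frequency factor is dropped for brevity since every query term appears once.

```python
# Minimal Okapi BM25 over a toy corpus of three text ads.

import math

ads = ["cheap digital camera deals",
       "camera lens reviews",
       "discount hotel booking"]
docs = [a.split() for a in ads]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N     # average ad length
df = {}                                   # document frequency per term
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1

def bm25(query, doc, k1=1.2, b=0.75):
    score = 0.0
    for t in query.split():
        if t not in df:
            continue                      # unseen term: no evidence
        tf = doc.count(t)
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

scores = [bm25("digital camera", d) for d in docs]
best = scores.index(max(scores))          # ad 0 matches both terms
```

Note how "digital" (rare, high IDF) contributes more than "camera" (appears in two ads), which is exactly the "different terms are weighted differently" advantage above.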

61 IR-based ad matching Vector space models Probabilistic models
Language models: Ads and queries are generated by statistical models of how words are used in the language What statistical models can be used? How do we translate query and ad generation probabilities into relevance?

62 IR-based ad matching
What statistical models can be used? Bigram model Multinomial model: the likelihood of a document is a product of term probabilities (the model parameters) raised to the term frequencies, given the total length Given any ad or query, we can compute the parameter setting most likely to have generated the document

63 IR-based ad matching
How do we translate query and ad generation probabilities into relevance? Method 1: Compute the most likely query and ad params Generate the ad using the query params High probability → high relevance [Diagram: query params → query, ad params → ad]

64 IR-based ad matching
How do we translate query and ad generation probabilities into relevance? Method 2: Compute the most likely query and ad params Generate the query using the ad params High probability → high relevance [Diagram: query params → query, ad params → ad]

65 IR-based ad matching
How do we translate query and ad generation probabilities into relevance? Method 3: Compute the most likely query and ad params Compute the KL-divergence between the params Low KL-divergence → high relevance [Diagram: query params → query, ad params → ad]
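Method 3 can be sketched with unigram (multinomial) language models. The shared vocabulary, add-one smoothing, and the toy texts are assumptions for the sketch; real systems smooth against a background corpus model instead.

```python
# Fit smoothed unigram models for the query and for each ad, then
# rank ads by KL-divergence from the query model (lower = closer).

import math
from collections import Counter

def unigram(text, vocab, alpha=1.0):
    # Maximum-likelihood term probabilities with add-one smoothing.
    c = Counter(text.split())
    total = sum(c.values()) + alpha * len(vocab)
    return {w: (c[w] + alpha) / total for w in vocab}

def kl(p, q):
    # KL(p || q); both defined over the same smoothed vocabulary.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

query = "digital camera"
ad_a = "cheap digital camera deals"
ad_b = "discount hotel booking"
vocab = set((query + " " + ad_a + " " + ad_b).split())

p = unigram(query, vocab)
d_a = kl(p, unigram(ad_a, vocab))
d_b = kl(p, unigram(ad_b, vocab))
# d_a < d_b: ad_a shares terms with the query, so it is "closer"
```

Smoothing is essential here: without it, any query term absent from an ad would make the divergence infinite.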

66 Overview Information Retrieval (IR)
Techniques Challenges Machine Learning using Click Feedback Online Learning

67 Challenges of IR-based ad matching
Word matches might not always work

68 Woes of word matching
Extracting topical information increases coverage and yields more relevant matches

69 IR-based ad matching New methods to combine syntactic and semantic information For example, “A Semantic Approach to Contextual Advertising” by Broder+/SIGIR/2007 Words only provide syntactic clues Classify ads and queries into a common taxonomy Taxonomy matches provide semantic clues

70 Challenges of IR-based ad matching
Word matches might not always work Works well for frequent words, but what about rare words? They form a long tail with a big revenue impact Remedy: add more matching dimensions (phrases, …) Static: does not capture the effect of external factors E.g., high interest in a basketball page due to an event dies off after the event Click feedback is a powerful way of capturing such latent effects; it is difficult to do so through relevance alone Relevance scores may not correspond to CTR and do not provide estimates of expected revenue

71 Challenges of IR-based ad matching
Heterogeneous corpus (queries, ads): a single tf-idf scoring scheme is not applicable In Content Match, queries (webpages) are long and noisy Partial feedback does not work Not scalable Ads are short; the relevance of the landing page is difficult to determine (video, image, text)

72 Machine Learning using Click Feedback

73 Overview Information Retrieval (IR)
Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning

74 Learning from Click Feedback
Learning relevance from partial human-labeled training data Attractive but not scalable Users provide us direct feedback through ad clicks Low cost and automated learning mechanism Large amounts of feedback for big ad-networks Estimation problem: Estimate CTR = Pr(click| query, ad, user)

75 Learning from Clicks: Challenges
Noisy labels Clicks (unscrupulous users gaming the system) Negatives are unclear (some users never click on any ads) Sparseness: the (query, ad) matrix has billions of cells, with a long tail Too few data points in a large number of cells, so the MLE has high variance The goal is to learn the best cells, not all cells Dynamic and seasonal effects CTRs evolve and are subject to seasonal effects (summer, Halloween, …) Palin ads were popular yesterday, not today

76 Challenges continued Selection bias Positional bias, presentation bias
We never showed watch ads on golf pages Positional bias, presentation bias Same ad performs differently at different positions Slate bias Performance of ad depends on other ads that were displayed

77 Overview Information Retrieval (IR)
Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning

78 Feature based approach
Query, Ad characterized by features Query: bag-of-words, phrases, topic,… Ads: bag-of-words, keywords, size,… Query feature vector: q Ad feature vector: a Pr(Click|Q,A) = f(q,a;θ) Example: Logistic regression log-odds(Pr(Click|Q,A)) = q’ W a W estimated from data
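The bilinear logistic model above can be written out directly: the log-odds of a click is q' W a for a query feature vector q, an ad feature vector a, and a learned weight matrix W. All the numbers below are made up for illustration; in practice W is estimated from click data.

```python
# Bilinear logistic regression sketch: log-odds(click) = q' W a.

import math

q = [1.0, 0.0, 2.0]          # query feature vector (3 features)
a = [0.5, 1.0]               # ad feature vector (2 features)
W = [[0.2, -0.1],            # W[i][j] couples query feature i
     [0.0,  0.3],            #   with ad feature j
     [0.1,  0.05]]

log_odds = sum(q[i] * W[i][j] * a[j]
               for i in range(len(q)) for j in range(len(a)))
ctr = 1.0 / (1.0 + math.exp(-log_odds))   # inverse logit
```

Here log_odds = 0.2, so the predicted CTR is sigmoid(0.2) ≈ 0.55; with realistic ad data the weights would drive this to CTR-scale values well below 1%.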

79 Feature based models: Challenges
High dimensional, need to regularize (Priors) De-bias for positional and slate effects Negative events to be weighted appropriately Go through case studies reported in literature

80 Estimate CTR of new ads in Sponsored search
Predicting Clicks: Estimating the Click-Through Rate of New Ads, Richardson et al., WWW 2007 Estimate the CTR of new ads in sponsored search: Log-odds(CTR(ad)) = Σi wi fi(ad) Features used: Bid term CTRs of related ads (from other accounts), e.g. CTRs of all other ads with the keyword "camera" Appearance, attention, advertiser reputation, landing page quality, relevance of bid terms to the ad, bag-of-words in the ad Does not capture interactions between (query, ad); the main focus is estimating the CTR of new ads only Negative events are down-weighted based on an eye-tracking study

81 Combining relevance with Click Feedback, Chakrabarti et al, WWW 08
Content Match application CTR estimation for arbitrary (page, ad) pairs Features : Bag-of-words in query, ads; relevance scores from IR Cross-product of words: Occurs in both page and ad Learn to predict click data using such features Prediction function amenable to WAND algorithm Helps with fast retrieval at serve time

82 Proposed Method
A logistic regression model for CTR, with model parameters for: a main effect for the page (how good is the page), a main effect for the ad (how good is the ad), and an interaction effect (words shared by the page and ad)

83 Proposed Method
M_{p,w} = tf_{p,w}, M_{a,w} = tf_{a,w}, I_{p,a,w} = tf_{p,w} · tf_{a,w} So IR-based term-frequency measures are taken into account

84 Proposed Method Two sources of complexity Adding in IR scores
Word selection for efficient learning

85 Proposed Method How can IR scores fit into the model?
What is the relationship between logit(pij) and the cosine score? Empirically, a quadratic relationship [plot: logit(pij) vs. cosine score]

86 Proposed Method How can IR scores fit into the model?
This quadratic relationship can be used in two ways Put in cosine and cosine2 as features Use it as a prior

87 Proposed Method Word selection Overall, nearly 110k words in corpus
Learning parameters for each word would be: Very expensive Require a huge amount of data Suffer from diminishing returns So we want to select ~1k top words which will have the most impact

88 Proposed Method Word selection Data based:
Define an interaction measure for each word Higher values for words which have higher-than-expected CTR when they occur on both page and ad

89 Experiments
[Precision-recall curve] 25% lift in precision at 10% recall

90 Overview Information Retrieval (IR)
Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning

91 Regelson and Fain, 2006 Estimate CTRs of terms by "borrowing strength" at multiple resolutions Hierarchical clustering of related terms Clustering the advertiser-keyword matrix Estimating CTRs at finer resolutions using information at coarser resolutions Weighted average, with more weight to finer resolutions Weights selected heuristically; no principled approach

92 Estimation in the “tail”
A more principled approach to “Estimating Rates of Rare Events at Multiple Resolutions” [KDD/2007] Contextual Advertising Show an ad on a webpage (“impression”) Revenue is generated if a user clicks Problem: Estimate the click-through rate (CTR) of an ad on a page Most (ad, page) pairs have very few impressions, if any, and even fewer clicks Severe data sparsity

93 Estimation in the “tail”
Use an existing, well-understood hierarchy Categorize ads and webpages to leaves of the hierarchy CTR estimates of siblings are correlated The hierarchy allows us to aggregate data Coarser resolutions provide reliable estimates for rare events which then influences estimation at finer resolutions

94 System overview Retrospective data [URL, ad, isClicked] Crawl URLs
a sample of URLs Classify pages and ads Rare event estimation using hierarchy Impute impressions, fix sampling bias

95 Sampling of webpages Naïve strategy: sample at random from the set of URLs Sampling errors in impression volume AND click volume Instead, we propose: Crawling all URLs with at least one click, and a sample of the remaining URLs Variability is only in impression volume Sampling bias adjusted through statistical procedure (details in the paper) Only sampling pages, not ads. All ad information is available.

96 System overview Retrospective data [page, ad, isclicked] Crawl Pages
a sample of pages Classify pages and ads Rare event estimation using hierarchy Impute impressions, fix sampling bias

97 Rare rate modeling Freeman-Tukey transform:
yij = F-T(clicks and impressions in region ij) ≈ transformed CTR Variance-stabilizing transformation: Var(y) is approximately independent of E[y] → needed in further modeling
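One common form of the Freeman-Tukey transform for a binomial count (clicks c out of n impressions) is the double arcsine below; I am assuming this variant here, since the slide does not spell out the exact formula used in the paper.

```python
# Freeman-Tukey double-arcsine transform of a click count: the
# transformed value has approximately constant variance regardless
# of the underlying CTR, which is what "variance stabilizing" means.

import math

def freeman_tukey(clicks, impressions):
    n = impressions
    return 0.5 * (math.asin(math.sqrt(clicks / (n + 1)))
                  + math.asin(math.sqrt((clicks + 1) / (n + 1))))

y = freeman_tukey(3, 1000)   # ≈ 0.059 for a CTR of about 0.3%
```

Because the variance of y no longer depends on the unknown rate, standard Gaussian machinery (like the Kalman filter on the next slides) can be applied to the transformed values.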

98 Rare rate modeling
Generative model (tree-structured Markov model): each region ij has an unobserved "state" S_ij linked to its parent's state S_parent(ij), with covariates β_ij, state-transition variances W_ij, and observation variances V_ij; the observed transformed rates y_ij and y_parent(ij) are noisy readings of the corresponding states

99 Rare rate modeling Model fitting with a 2-pass Kalman filter:
Filtering: Leaf to root Smoothing: Root to leaf Linear in the number of regions

100 Tree-structured Markov model

101 Scalable Model fitting Multi-resolution Kalman filter

102 Multi-Resolution Kalman filter: Mathematical overview
Br: correlation between two sibling regions at level dr

103 Experiments 503M impressions
7-level hierarchy of which the top 3 levels were used Zero clicks in 76% regions in level 2 95% regions in level 3 Full dataset DFULL, and a 2/3 sample DSAMPLE

104 Experiments Estimate CTRs for all regions R in level 3 with zero clicks in DSAMPLE Some of these regions (denote them R>0) get clicks in DFULL A good model should predict higher CTRs for R>0 than for the other regions in R

105 Experiments We compared 4 models TS: our tree-structured model
LM (level-mean): each level smoothed independently NS (no smoothing): CTR proportional to 1/Ñ Random: Assuming |R>0| is given, randomly predict the membership of R>0 out of R

106 Experiments TS Random LM, NS

107 Experiments
Few impressions → estimates depend more on siblings Enough impressions → little "borrowing" from siblings

108 Related Work
Multi-resolution modeling has been studied in time series modeling and spatial statistics [Openshaw+/79, Cressie/90, Chou+/94] Imputation has been studied in statistics [Darroch+/1972] The application of such models to the estimation of such rare events (rates of ~10^-3) is novel

109 Summary
A method to estimate rates of extremely rare events, at multiple resolutions, under severe sparsity constraints The method has two parts: Imputation → incorporates the hierarchy, fixes sampling bias Tree-structured generative model → extremely fast parameter fitting

110 Overview Information Retrieval (IR)
Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning

111 Collaborative Filtering
Similarity-based methods: using an ad-ad similarity matrix, predict the rating (CTR) of ad i for query u from u's ratings on the local neighborhood of ad i
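The neighborhood prediction just described is a similarity-weighted average; a minimal sketch with made-up similarities and CTRs:

```python
# Item-based CF sketch: predicted CTR of ad "i" for query u is the
# average of u's CTRs on neighboring ads, weighted by ad-ad similarity.
# All similarities and CTRs are illustrative.

sim = {("i", "j"): 0.9, ("i", "k"): 0.4}   # ad-ad similarity matrix row
ctr_u = {"j": 0.05, "k": 0.01}             # query u's observed CTRs

num = sum(sim[("i", a)] * ctr_u[a] for a in ctr_u)
den = sum(sim[("i", a)] for a in ctr_u)
pred = num / den
# (0.9*0.05 + 0.4*0.01) / (0.9 + 0.4) ≈ 0.0377
```

The prediction is pulled toward ad j's CTR because j is the more similar neighbor, which is exactly the local-neighborhood idea above.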

112 Collaborative Filtering
Similarity based methods Possible adaptation Challenges: Learning similarity Simultaneously incorporating query and ad similarities Feature-based model Collaborative filtering model

113 Matrix Factorization
Each query (ad) is a linear combination of latent factors, with factor coefficients for the query and for the ad Solve for the factors under some regularization and constraints
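A tiny matrix-factorization sketch: fit rank-1 query and ad factors to a small CTR matrix by gradient descent. The data (constructed to be exactly rank-1), the learning rate, and the single latent dimension are all assumptions for illustration; real systems use regularization, many factors, and far sparser data.

```python
# Rank-1 factorization of a 2x2 query x ad CTR matrix by gradient
# descent: minimize sum_ij (R[i][j] - q[i]*a[j])^2.

R = [[0.08, 0.04],    # observed CTRs; rank-1 by construction
     [0.04, 0.02]]    #   (outer product of (0.2, 0.1) and (0.4, 0.2))
q = [0.5, 0.5]        # query factor coefficients (1 latent dimension)
a = [0.5, 0.5]        # ad factor coefficients

lr = 0.1
for _ in range(20000):
    for i in range(2):
        for j in range(2):
            err = R[i][j] - q[i] * a[j]
            qi = q[i]                    # use pre-update value for a
            q[i] += lr * err * a[j]
            a[j] += lr * err * qi

pred = [[q[i] * a[j] for j in range(2)] for i in range(2)]
```

Note the scale ambiguity: only the products q[i]*a[j] are identified, not q and a individually; regularization is one way real systems pin this down.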

114 Matrix Factorization Matrix Factorization Bi-clustering
Predictive Discrete latent factor models, Agarwal and Merugu, KDD 07.

115 Overview Information Retrieval (IR)
Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning

116 Challenges of Feature-based models
Learns from clicks but, like the relevance-based approach, still misses context in many instances Introducing features that are too granular makes it hard to learn CTRs reliably Does not capture the dynamics of the system Training cost is high Slow prediction functions are inadmissible due to latency constraints

117 Challenges of Feature-based models
Other methods Boosting, Neural nets, Decision Trees, Random Forests, …… Local models Mixture of experts: Fit local, think global Hierarchical modeling with multiple trees User interest, query, ad,.. Each tree is different How to perform smoothing with multiple disparate trees?

118 Challenges of Feature-based models
Combining cold start with warm start is the main challenge in collaborative-filtering-based methods We believe solving the basic issues is even more challenging: Positional bias Selection bias Correlation among ads on a slate Dynamic CTRs; seasonal variations

119 Online learning

120 Overview Information Retrieval (IR)
Machine Learning using Click Feedback Online Learning

121 Online learning for ad matching
All previous approaches learn from historical data This has several drawbacks: Slow response to emerging patterns in the data due to special events like elections, … Initial systemic biases are never corrected If the system has never shown “sound system dock” ads for the “iPod” query, it can never learn if this match is good System needs to be retrained periodically

122 Online learning for ad matching
Solution: Combining exploitation with exploration Exploitation: Pick ads that are good according to current model Exploration: Pick ads that increase our knowledge about the entire space of ads Multi-armed bandits Background Applications to online advertising Challenges and Open Problems

123 Background: Bandits Bandit “arms” p1 p2 p3
(unknown payoff probabilities) “Pulling” arm i yields a reward: reward = 1 with probability pi (success) reward = 0 otherwise (failure)

124 Background: Bandits Bandit “arms” p1 p2 p3
Goal: Pull arms sequentially so as to maximize the total expected reward Estimate payoff probabilities pi Bias the estimation process towards better arms Bandit “arms” p1 p2 p3 (unknown payoff probabilities)

125 Background: Bandits An algorithm to sequentially pick the arms is called a bandit policy Regret of a policy = how much extra payoff could be gained in expectation if the best arm is always pulled Of course, the best arm is not known to the policy Hence, the regret is the price of exploration Low regret implies that the policy quickly converges to the best arm What is the optimal policy?

126 Background: Bandits
Which arm should be pulled next? Not necessarily the one that looks best right now, since it might have had a few lucky successes The answer seems to depend on some complicated function of the numbers of successes si and failures fi of all arms: argmax g(s1, f1, s2, f2, …, sk, fk)?

127 Background: Bandits
What is the optimal policy? Consider a bandit with an infinite time horizon, but geometrically discounted future rewards: Rtotal = R(1) + γ·R(2) + γ²·R(3) + … (0 < γ < 1) Theorem [Gittins/1979]: The optimal policy decouples and solves a bandit problem for each arm independently: argmax g(s1, f1, s2, f2, …, sk, fk) becomes argmax {g1(s1, f1), g2(s2, f2), …, gk(sk, fk)}

128 Background: Bandits What is the optimal policy?
Theorem [Gittins, 1979]: the optimal policy decouples and solves a bandit problem for each arm independently. This significantly reduces the dimension of the problem space, but the optimal functions gi(si, fi) are hard to compute, so approximate methods are needed…

129 Background: Bandits
A bandit policy repeatedly: assigns a priority to each arm, "pulls" the arm with the maximum priority and observes the reward (allocation), then updates the priorities (estimation).

130 Background: Bandits
One common policy is UCB1 [Auer, 2002], which gives arm i the priority si/ni + sqrt(2·ln T / ni), where si is the number of successes, ni the number of observations of arm i, and T the total number of observations. The first term is the observed payoff; the second is a factor representing uncertainty.
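A minimal implementation of UCB1 under the slide's Bernoulli-reward setting (a sketch: the one-initial-pull-per-arm step follows the standard formulation, and all names are illustrative):

```python
import math
import random

def ucb1_priority(s, n, t):
    """Observed payoff s/n plus the uncertainty bonus sqrt(2 ln t / n)."""
    return s / n + math.sqrt(2 * math.log(t) / n)

def ucb1(true_ps, horizon, seed=0):
    """Run UCB1 on Bernoulli arms whose payoff probabilities true_ps
    are unknown to the policy; returns per-arm successes and pulls."""
    rng = random.Random(seed)
    k = len(true_ps)
    s = [0] * k  # successes per arm
    n = [0] * k  # pulls per arm
    for t in range(horizon):
        if t < k:
            i = t  # initialise: pull each arm once
        else:
            i = max(range(k), key=lambda a: ucb1_priority(s[a], n[a], t))
        n[i] += 1
        s[i] += 1 if rng.random() < true_ps[i] else 0
    return s, n

successes, pulls = ucb1([0.1, 0.5], horizon=2000)
```

After 2000 pulls the better arm (p = 0.5) dominates the pull counts, illustrating that only O(log T) pulls are wasted on the sub-optimal arm.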

131 Background: Bandits
As the total number of observations T becomes large, the observed payoff tends asymptotically towards the true payoff probability. The system never completely "converges" to one best arm; only the rate of exploration tends to zero.

132 Background: Bandits
Sub-optimal arms are pulled only O(log T) times; hence, UCB1 has O(log T) regret. This is the lowest possible regret.

133 Online learning for ad matching
Solution: combine exploitation with exploration. Exploitation: pick ads that are good according to the current model. Exploration: pick ads that increase our knowledge about the entire space of ads. Multi-armed bandits: background; applications to online advertising; challenges and open problems.

134 Background: Bandits
~10^9 webpages, ~10^6 ads. Bandit "arms" = ads; each webpage defines a bandit.

135 Background: Bandits
Content Match = a matrix with webpages as rows and ads as columns. Each row is one bandit; each cell has an unknown CTR.

136 Background: Bandits
Why not simply apply a bandit policy directly to our problem? Convergence is too slow: ~10^9 bandits, with ~10^6 arms per bandit. But additional structure is available that can help: taxonomies.

137 Taxonomies for dimensionality reduction
Taxonomies (Root → Apparel, Computers, Travel, …) already exist and are actively maintained, and existing classifiers can map pages and ads to taxonomy nodes. A bandit policy that uses this structure can be faster.

138 Outline
Multi-level Bandit Policy for Content Match; Experiments; Summary

139 Multi-level Policy
Both webpages and ads are grouped into classes; consider only two levels of the taxonomy.

140 Multi-level Policy
The ad parent classes (Apparel, Computers, Travel, …) and their child classes partition the matrix into blocks. Consider only two levels.

141 Multi-level Policy
Same two-level block structure as before. Key idea: CTRs in a block are homogeneous.

142 Multi-level Policy
CTRs in a block are homogeneous. This is used in allocation (picking the ad for each new page) and in estimation (updating priorities after each observation).

143 Multi-level Policy
CTRs in a block are homogeneous. This is used in allocation (picking the ad for each new page) and in estimation (updating priorities after each observation).

144 Multi-level Policy (Allocation)
[Figure: a page classifier maps the webpage into the taxonomy] Classify the webpage → page class, parent page class. Run a bandit on the ad parent classes → pick one ad parent class.

145 Multi-level Policy (Allocation)
Classify the webpage → page class, parent page class. Run a bandit on the ad parent classes → pick one ad parent class. Run a bandit among the cells of that block → pick one ad class. In general, continue from root to leaf → final ad.

146 Multi-level Policy (Allocation)
Bandits at higher levels use aggregated information and have fewer arms, so they quickly figure out the best ad parent class.
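The two-level allocation step can be sketched as two nested UCB-style choices: first over ad parent classes, then over child classes within the chosen parent. The taxonomy, counts, and class names below are illustrative, not the talk's data:

```python
import math

def ucb_score(s, n, t):
    """UCB-style priority; an unexplored node gets infinite priority."""
    return s / n + math.sqrt(2 * math.log(t) / n) if n else float("inf")

def allocate(tree, stats, t):
    """Pick the highest-priority ad parent class, then the highest-priority
    child class inside it. tree: parent -> children; stats: node -> (s, n)."""
    parent = max(tree, key=lambda p: ucb_score(*stats[p], t))
    child = max(tree[parent], key=lambda c: ucb_score(*stats[c], t))
    return parent, child

tree = {"Apparel": ["Shoes", "Shirts"], "Travel": ["Hotels", "Flights"]}
stats = {"Apparel": (1, 100), "Travel": (30, 100),
         "Shoes": (0, 10), "Shirts": (1, 10),
         "Hotels": (20, 50), "Flights": (5, 50)}
choice = allocate(tree, stats, t=200)
```

The parent-level bandit sees aggregated counts, so it identifies the better parent class long before every child has been explored.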

147 Multi-level Policy
CTRs in a block are homogeneous. This is used in allocation (picking the ad for each new page) and in estimation (updating priorities after each observation).

148 Multi-level Policy (Estimation)
CTRs in a block are homogeneous, so observations from one cell also give information about the others in the block. How can we model this dependence?

149 Multi-level Policy (Estimation)
Shrinkage model: S_cell | CTR_cell ~ Bin(N_cell, CTR_cell) and CTR_cell ~ Beta(Params_block), where N_cell is the number of impressions and S_cell the number of clicks in the cell. All cells in a block come from the same distribution.

150 Multi-level Policy (Estimation)
Intuitively, this leads to shrinkage of cell CTRs towards the block CTR: E[CTR] = α·Prior_block + (1-α)·S_cell/N_cell, i.e. the estimated CTR is a weighted combination of the Beta prior (the "block CTR") and the observed CTR.
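Under the Beta-Binomial model above, the estimated CTR is the posterior mean, which has exactly this weighted form; a sketch with hypothetical counts:

```python
def shrunk_ctr(s_cell, n_cell, a_block, b_block):
    """Posterior mean CTR under a Beta(a, b) block prior:
    (a + s) / (a + b + n), i.e. the observed CTR s/n shrunk
    towards the block prior mean a / (a + b)."""
    return (a_block + s_cell) / (a_block + b_block + n_cell)

# A cell with 2 clicks in 10 impressions (observed CTR 0.2) under a
# block prior Beta(5, 95) (prior mean CTR 0.05):
est = shrunk_ctr(2, 10, 5, 95)

# The same estimate in the slide's weighted form, with
# alpha = (a + b) / (a + b + n):
alpha = (5 + 95) / (5 + 95 + 10)
weighted = alpha * 0.05 + (1 - alpha) * (2 / 10)
```

With few impressions the estimate stays close to the block CTR; as N_cell grows, alpha shrinks and the observed CTR dominates.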

151 Experiments
Taxonomy structure: Root at depth 0; 20 nodes at depth 1; 221 nodes at depth 2 (we use these 2 levels); … down to ~7000 leaves at depth 7.

152 Experiments
Data collected over a 1-day period from only one server, under some other ad-matching rules (not our bandit); ~229M impressions. CTR values have been linearly transformed for confidentiality.

153 Experiments (Multi-level Policy)
[Plot: #clicks vs. number of pulls] The multi-level policy gives much higher #clicks.

154 Experiments (Multi-level Policy)
[Plot: mean-squared error vs. number of pulls] The multi-level policy gives much better mean-squared error ⇒ it has learnt more from its explorations.

155 Experiments (Shrinkage)
[Plots: #clicks and mean-squared error vs. number of pulls, with and without shrinkage] Shrinkage ⇒ improved mean-squared error, but no gain in #clicks.

156 Summary
Taxonomies exist for many datasets and can be used for dimensionality reduction: a multi-level bandit policy ⇒ higher #clicks; better estimation via shrinkage models ⇒ better MSE.

157 Online learning for ad matching
Solution: combine exploitation with exploration. Exploitation: pick ads that are good according to the current model. Exploration: pick ads that increase our knowledge about the entire space of ads. Multi-armed bandits: background; applications to online advertising; challenges and open problems.

158 Challenges and Open Problems
Bandit policies typically assume stationarity, but sudden changes are the norm in the online advertising world: ads may be suddenly removed when they run out of budget, and new ads are constantly added to the system. Moreover, the total number of ads is huge, and full exploration may be too costly. Relevant work: mortal multi-armed bandits [NIPS 2008]; algorithms for infinitely many-armed bandits [NIPS 2008].

159 Mortal Multi-armed Bandits
Traditional bandit policies like UCB1 spend a large fraction of their initial pulls on exploration, and this hard-earned knowledge may be lost due to finite arm lifetimes. Method 1 (Sampling): pick a random sample from the set of available arms and run UCB1 on the sample, until some fraction of the arms in the sample are lost. Pro: quicker convergence, more exploitation. Con: the best arm in the sample may be worse than the best arm overall. Pick the sample size to control this tradeoff.
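The sampling step of Method 1 can be sketched as follows (the function name and the 10% sampling fraction in the example are illustrative):

```python
import random

def sample_arms(arms, sample_frac, rng):
    """Method 1, step 1: draw a random subsample of the live arms; the
    bandit policy (e.g. UCB1) is then run only on this subsample,
    restarting once enough sampled arms have expired."""
    k = max(1, int(len(arms) * sample_frac))
    return rng.sample(arms, k)

rng = random.Random(0)
subset = sample_arms(list(range(100)), 0.1, rng)
```

A smaller sample converges faster but is less likely to contain the globally best arm; the fraction controls the tradeoff.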

160 Mortal Multi-armed Bandits
Traditional bandit policies like UCB1 spend a large fraction of their initial pulls on exploration, and this hard-earned knowledge may be lost due to finite arm lifetimes. Method 2 (Payoff threshold): a new bandit policy: if the observed payoff of any arm is higher than a threshold, pull it until it expires. Pro: good arms, once found, are exploited quickly. Con: while exploiting good arms, the best arm may be starved and may expire without being found. Pick the threshold to control this tradeoff.
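The decision rule of Method 2 can be sketched as below (a simplification: `observed` maps each arm to its (mean payoff, pull count), `committed` is the arm currently being exploited if any, and the least-pulled-arm exploration rule is an illustrative stand-in):

```python
def pick_arm(observed, threshold, committed=None):
    """Method 2 sketch: once an arm's observed payoff crosses the
    threshold, keep pulling it until it expires; otherwise keep
    exploring (here: pull the least-pulled arm)."""
    if committed is not None:
        return committed
    for arm, (mean, pulls) in observed.items():
        if pulls > 0 and mean >= threshold:
            return arm  # commit to this arm
    return min(observed, key=lambda a: observed[a][1])  # explore

obs = {"a": (0.2, 10), "b": (0.6, 5), "c": (0.0, 2)}
```

A low threshold commits quickly (risking a mediocre arm); a high threshold keeps exploring (risking that the best arm expires unfound).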

161 Mortal Multi-armed Bandits
Challenges: selecting the critical sample size or threshold correctly for arbitrary payoff distributions. What if even the payoff distribution is unknown?

162 Challenges and Open Problems
Beyond mortal multi-armed bandits: what if the bandit policy has some information about the budget? Then the policy can control which arms expire, and when ("Handling Advertisements of Unknown Quality in Search Advertising", Pandey et al., NIPS 2006). Other directions: combining budgets with extra knowledge of ad CTRs, e.g., using an ad taxonomy, or using a bandit scheme to infer/correct an ad taxonomy.

163 Conclusions

164 Conclusions
We provided an introduction to online advertising: we discussed the ecosystem and the various actors involved, and the different flavors of online advertising (Sponsored Search, Content Match, Display Advertising).

165 Conclusions
[Summary diagram] Advertising settings (Sponsored Search, Content Match, Display), revenue models (CPM, CPC, CPA), and miscellaneous topics (ad exchanges).

166 Conclusions
Outlined the associated statistical challenges for Sponsored Search, Content Match, and Display. We believe the technical roadmap comprises: Offline Modeling (regression, collaborative filtering, mixture of experts; multi-resolution models; selection bias; slate correlation; noisy labels), Online Models (time series), and Explore/Exploit (multi-armed bandits).

167 Conclusions
Offline modeling is by far the best studied so far, but lacks a careful study of selection bias, slate correlations, and noisy labels; there is good opportunity here, with more emphasis on matrix structure, where the goal is to estimate interactions. Explore/Exploit: some work using multi-armed bandits, but a long way to go; time series models are needed to capture temporal aspects. Little work exists on a holistic approach that combines all components in a principled way.




