1
Statistical Challenges in Online Advertising Deepak Agarwal Deepayan Chakrabarti (Yahoo! Research)
2
Online Advertising Multi-billion dollar industry, high growth
$9.7B in 2006 (a 17% increase), out of roughly $150B in total advertising. Why will this continue? Broadband is cheap and ubiquitous, "getting things done" is easier on the internet, and advertisers are shifting dollars online. Why does it work? Massive scale, automation, and low marginal cost. Key: monetize more and better, "learn from data". A new discipline: "Computational Advertising".
A young but multi-billion dollar industry, as evident from the phenomenal success of companies like Google, Yahoo!, and MSN; it continues to grow at a rapid rate. Broadband is cheap and ubiquitous; people spend more time on the internet, since search engines and other services make it easier to get things done online; content on the web is growing by leaps and bounds. This has caught the eye of the advertising industry, which is pumping more advertising dollars into the internet relative to other media such as television, radio, and newspapers. Why does the business work, and work so well? Many bright minds dismissed its potential only a few years ago. I will describe it abstractly now; hopefully it will make much more sense by the end of the talk. It is an extremely large-scale system, conducting several billion transactions every day in an almost fully automated fashion. Only a small fraction of them monetize, but that is enough to make it a lucrative business. The key to success is monetizing more and better, through automated learning from the massive amounts of data constantly flowing into the system. This has given rise to a new academic discipline called "computational advertising".
3
What is “Computational Advertising”?
A new scientific sub-discipline at the intersection of large-scale search and text analysis, information retrieval, statistical modeling, machine learning, optimization, and microeconomics. Multi-disciplinary; composed of several key components.
4
Online advertising: 6000 ft Overview
(Diagram: the Ad Network picks ads from Advertisers and shows them to the User alongside the Content Provider's content.) This shows one scenario, that of content match; for sponsored search, replace the content with a query. Examples of ad networks: Yahoo!, Google, MSN, RightMedia, …
5
Outline. Background on online advertising: Sponsored Search, Content Match, Display, unified marketplace. The Fundamental Problem. Statistical sub-problems: description, existing methods, challenges.
6
Online Advertising: different flavors. Revenue Models (CPM, CPC, CPA); Advertising Setting (Display, Content Match, Sponsored Search); Misc. (ad exchanges).
7
Revenue Models: CPM, Cost Per iMpression (conventionally quoted per thousand impressions). (Diagram: the Advertiser pays the Ad Network for every impression shown to the User; the Ad Network shares the revenue with the Content Provider.)
8
Revenue Models: CPC, Cost Per Click. (Diagram: as before, but the Advertiser pays the Ad Network only when the User clicks the ad.)
9
Revenue Models: CPA, Cost Per Action. (Diagram: the User clicks through to the advertiser's landing page; the Advertiser pays only when the user completes an action there.)
10
Revenue Models. Example: suppose we show an ad N times in the same spot.
Under CPM: Revenue = N × CPM.
Under CPC: Revenue = N × CTR × CPC, where CTR is the click-through rate (the probability of a click given an impression), and the CPC actually paid depends on the auction mechanism.
11
Auction Mechanism Revenue depends on type of auction
Generalized First-price: CPC = bid on clicked ad Generalized Second-price: CPC = bid of ad below clicked ad (or the reserve price) CPC could be modified by additional factors [Optimal Auction Design in a Multi-Unit Environment: The Case of Sponsored Search Auctions] by Edelman+/2006 [Internet Advertising and the Generalized Second Price Auction…] by Edelman+/2006
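A minimal sketch of the two pricing rules above, assuming ads are ranked by bid alone (real systems typically rank by bid × predicted CTR and apply the "additional factors" mentioned above); all function and variable names here are illustrative.

```python
def price_clicks(bids, reserve, second_price=True):
    """Rank ads by bid and compute the CPC each slot pays if its ad is clicked.

    bids: advertiser bids, one per ad.
    reserve: minimum price charged when no lower bid exists.
    Returns (ranked_bids, cpcs): bids in slot order and the CPC per slot.
    """
    ranked = sorted(bids, reverse=True)
    cpcs = []
    for i, bid in enumerate(ranked):
        if second_price:
            # Generalized second price: pay the bid of the ad one slot below,
            # or the reserve price if this is the last ranked ad.
            cpcs.append(ranked[i + 1] if i + 1 < len(ranked) else reserve)
        else:
            # Generalized first price: pay your own bid.
            cpcs.append(bid)
    return ranked, cpcs

ranked, cpcs = price_clicks([0.50, 1.20, 0.80], reserve=0.10)
print(ranked)  # [1.2, 0.8, 0.5]
print(cpcs)    # [0.8, 0.5, 0.1]
```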
12
Revenue Models. Example: suppose we show an ad N times in the same spot.
Under CPM: Revenue = N × CPM.
Under CPC: Revenue = N × CTR × CPC.
Under CPA: Revenue = N × CTR × Conversion Rate × CPA, where the conversion rate is the probability of a user conversion on the advertiser's landing page, given a click.
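A sketch of the formulas above for one ad spot, with made-up numbers for illustration. Note the simplification: CPM is treated here as a per-impression price, as in the slide's formulas, whereas in practice it is quoted per thousand impressions.

```python
def expected_revenue(n_impressions, model, cpm=0.0, cpc=0.0, cpa=0.0,
                     ctr=0.0, conv_rate=0.0):
    """Expected revenue for one ad slot under the slide's simplified formulas."""
    if model == "CPM":
        return n_impressions * cpm
    if model == "CPC":
        return n_impressions * ctr * cpc
    if model == "CPA":
        return n_impressions * ctr * conv_rate * cpa
    raise ValueError(f"unknown revenue model: {model}")

# The same inventory priced three ways (illustrative numbers):
N = 1_000_000
print(expected_revenue(N, "CPM", cpm=0.002))                           # 2000.0
print(expected_revenue(N, "CPC", cpc=0.40, ctr=0.005))                 # 2000.0
print(expected_revenue(N, "CPA", cpa=8.0, ctr=0.005, conv_rate=0.05))  # 2000.0
```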
13
Revenue Models: what revenue depends on.
CPM: website traffic.
CPC: website traffic + ad relevance.
CPA: website traffic + ad relevance + landing page quality.
(The models also differ in their relevance to advertisers, in prices and bids, and in the ease of picking ads.)
14
Online Advertising background (section map): Revenue Models (CPM, CPC, CPA); Advertising Setting (Display, Content Match, Sponsored Search); Misc. (ad exchanges).
15
Advertising Setting. What do you show the user? How does the user interact with the ad system? (Diagram: the Ad Network picks ads from Advertisers to show to the User alongside the Content Provider's content.)
16
Advertising Setting Display Content Match Sponsored Search
17
Advertising Setting Display Content Match Sponsored Search Pick ads
18
Advertising Setting: Display. Graphical display ads, mostly for brand awareness. Revenue model is typically CPM.
19
Advertising Setting: Content Match. (Screenshot: a content match ad placed on a webpage.)
20
Advertising Setting: Content Match. Text ads; the system picks ads by matching them to the content of the page.
21
Advertising Setting: Content Match. The user's intent is unclear, and the "query" (the webpage) is long and noisy. Revenue model is typically CPC.
22
Advertising Setting: Sponsored Search. (Screenshot: sponsored search ads shown for a search query.)
23
Advertising Setting: Sponsored Search. Text ads; the system picks ads by matching them to the search query.
24
Advertising Setting: Sponsored Search. The user "declares" his/her intent in the query, and click rates are generally higher than for Content Match. The query is short and less noisy than in Content Match. Revenue model is typically CPC (recently some CPA).
25
Summary: the revenue model depends on the goal of the advertiser's campaign.
Brand awareness: display advertising; pay per impression (CPM).
Attracting users to the advertised product: Content Match and Sponsored Search; pay per click (CPC) or pay per action (CPA).
26
Online Advertising background (section map): Revenue Models (CPM, CPC, CPA); Advertising Setting (Display, Content Match, Sponsored Search); Misc. (ad exchanges).
27
Unified Marketplace. Publishers, ad networks, and advertisers participate together in a single exchange. Publishers put impressions into the exchange; advertisers and ad networks bid for them. CPM, CPC, and CPA are all integrated into a single auction mechanism.
28
Overview: The Open Exchange
(Diagram: a publisher has an ad impression to sell and auctions it in the exchange. AdSense bids $0.50, Ad.com bids $0.60, and a third bidder bids $0.65 and wins; a $0.75 bid placed via an intermediary network becomes a $0.45 bid after the network's cut. The exchange provides transparency and value.)
29
Unified scale: Expected CPM
Campaigns may be CPC, CPA, or CPM, yet they may all participate in an auction together. Converting them to a common denomination (expected CPM) is a challenge.
30
Outline Background on online advertising The Fundamental Problem
Statistical sub-problems: Description Existing methods Challenges
31
Outline Background on online advertising The Fundamental Problem
Display advertising Sponsored Search and Content Match Statistical sub-problems: Description Existing methods Challenges
32
Display Advertising
33
Does it work? Lewis and Reiley, "Retail Advertising Works!" (Yahoo! Technical Report): a controlled experiment assigning a large retailer's customers to treatment and control groups. Advertising significantly improved purchases, both online and offline. Good news! The main goal is brand awareness; the revenue model is typically CPM, so the advertiser takes all the risk.
34
Display advertising: Buyer and Seller.
Advertiser (buyer): buys ad space well in advance or in the spot market; may buy from a publisher with or without a guarantee. Typically, guaranteed contracts cost more (higher price).
Publisher (seller): sells in advance (guaranteed) or in the spot market.
35
Display Advertising Main goal of advertisers: Brand Awareness
Revenue model: primarily cost per impression (CPM). Traditional advertising model: ads are targeted at particular demographics (user characteristics), e.g., GM ads on Y! Autos shown to "males above 55", or a mortgage ad shown to "everybody on the Y! front page". Slots are booked well in advance ("2M impressions in Jan next year"), and these future impressions must be guaranteed by the ad network.
36
Display Advertising. Fundamental problem: guarantee impressions to advertisers.
Predict supply: how many impressions will be available? Demographics overlap.
Predict demand: how much will advertisers want of each demographic?
(Diagram: a Venn diagram of overlapping user segments, e.g., US, Young, Female, Y! Mail, with impression counts in each region.)
37
Display Advertising. Fundamental problem: guarantee impressions to advertisers. Predict supply and predict demand, then find the optimal allocation subject to supply and demand constraints (supplies s_i, demands d_j, allocations x_ij). This gives rise to an allocation problem: which supply pools should serve which contracts? There are several feasible solutions; find the optimal one given demand and supply.
38
Display Advertising. Fundamental problem: guarantee impressions to advertisers. Predict supply, predict demand, find the optimal allocation subject to constraints. Optimal in terms of what objective function? That depends on the goal of the ad network. Advertisers don't specify very fine-grained targets: it is dangerous for ad networks to allow that (difficult to forecast). BUT advertisers still expect good inventory (the campaign should result in good conversion rates).
39
Allocation through Optimization
Optimal in terms of what objective function? E.g., maximize the value of remaining inventory (cherry-picks valuable inventory and saves it for later), or fairness ("spreads the wealth"), subject to the supply (s_i) and demand (d_j) constraints on the allocations x_ij.
40
Example. Supply pools: {US, Young, non-Female} with supply = 2, price = 1; {US, Young, Female} with supply = 3, price = 5. Demand: a contract for 2 "US & Young" impressions. How should we distribute impressions from the supply pools to satisfy this demand?
41
Example (Cherry-picking). Cherry-picking fulfills demands at least cost: the "US & Young (2)" demand is served entirely (both impressions) from the cheaper pool ({US, Young, non-Female}, supply = 2, price = 1), saving the more valuable pool ({US, Young, Female}, supply = 3, price = 5) for later.
42
Example (Fairness). Cherry-picking fulfills demands at least cost; fairness instead distributes the demand equitably across the available supply pools: the "US & Young (2)" demand takes 1 impression from each pool ({US, Young, non-Female}, supply = 2, cost = 1; {US, Young, Female}, supply = 3, cost = 5).
43
Example of an objective function
44
Display Advertising. Fundamental problem: guarantee impressions to advertisers. Predict supply, predict demand, find the optimal allocation subject to constraints, and pick the right objective function. Further issues: Risk management: supply and demand forecasts should come with both a mean and a variance. Forecast aggregation: forecasts may be needed at multiple resolutions, in time and in demographics, a challenging time-series problem. Adapting the system to external events. Variance estimates are important: we cannot forecast everything accurately, so solutions should use the variance for risk management. For instance, financial crisis news increases traffic to Finance pages; we should be able to adapt our forecasts and take advantage.
45
Display Advertising. Fundamental problem: guarantee impressions to advertisers. Predict supply, predict demand, find the optimal allocation subject to constraints, pick the right objective function. Forecasting accuracy is critical! Overshoot → under-delivery of impressions → unhappy advertisers. Undershoot → loss in revenue.
46
Outline Background on online advertising The Fundamental Problem
Display advertising Sponsored Search and Content Match Statistical sub-problems: Description Existing methods Challenges
47
Sponsored Search and Content Match
Given a query: select the top-k ads to be shown in the k slots so as to maximize total expected revenue. What is the total expected revenue?
48
Example (Content Match)
(Screenshot: a webpage with ads in positions 1, 2, and 3.) Relevant ads are placed when the user visits a webpage; no query is specified by the user. It is hard to guess what the user wants; we can only use the context (i.e., the type of page being viewed) and the user's other characteristics (browsing behavior, demographics). Payment happens on clicks.
49
Example (Content Match)
(A second screenshot of content match ads; the setup is the same as on the previous slide.)
50
Reminder: Auction Mechanism
Revenue depends on type of auction Generalized First-price: CPC = bid on clicked ad Generalized Second-price: CPC = bid of ad below clicked ad (or the reserve price) CPC could be modified by additional factors Total expected revenue = revenue obtained in a given time window [Optimal Auction Design in a Multi-Unit Environment: The Case of Sponsored Search Auctions] by Edelman+/2006 [Internet Advertising and the Generalized Second Price Auction…] by Edelman+/2006
51
Sponsored Search and Content Match
Given a query: select the top-k ads to be shown in the k slots so as to maximize total expected revenue. What affects the total revenue? Relevance of the ad to the query, bids on the ads, and user experience on the ad landing page (ad "quality"). Expected total revenue is some function of these; optimizing merely by CTR is myopic. The obvious factors are relevance and bids; the long-term and less obvious factor is user experience, which must also be incorporated into the ranking formula.
52
Sponsored Search and Content Match
Given a query: select the top-k ads to be shown in the k slots so as to maximize total expected revenue. Fundamental problem: estimate the relevance of the ad to the query.
53
Ad Relevance Computation
54
Overview Information Retrieval (IR)
Techniques Challenges Machine Learning using Click Feedback Online Learning
55
IR-based ad matching. "Why not use a search engine to match ads to context?" Ads are the "documents"; the context (user query or webpage content) is the "query". Three broad approaches: vector space models, probabilistic models, language models. Open-source software is available (e.g., the Lemur toolkit).
56
IR-based ad matching: vector space models. Each word/phrase in the vocabulary is a separate dimension, and each ad and query is a point in this vector space. Example: cosine similarity.
57
IR-based ad matching. Q1: How can we score the goodness of an ad for a context? Cosine similarity between the query vector and the ad vector. Advantages: simple and easy to interpret; normalizes for different ad and context lengths.
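Written out, the cosine score on this slide (rendered as an image in the original deck) is the standard one; with q and a the term-weight (e.g., tf-idf) vectors of the query and the ad:

```latex
\cos(q, a) = \frac{q \cdot a}{\lVert q \rVert \, \lVert a \rVert}
           = \frac{\sum_{w} q_w \, a_w}{\sqrt{\sum_{w} q_w^2} \; \sqrt{\sum_{w} a_w^2}}
```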
58
IR-based ad matching: probabilistic models. Predict, for every (ad, query) pair, the probability that the ad is relevant to the query. Example: Okapi BM25.
59
IR-based ad matching. Q1: How can we score the goodness of an ad for a context? Okapi BM25: score(q, a) = Σ_{t∈q} IDF(t) · tf_{t,a}·(k1+1) / [tf_{t,a} + k1·(1 − b + b·|a|/avglen)] · (k3+1)·tf_{t,q} / [k3 + tf_{t,q}], where tf_{t,a} is the term frequency in the ad, |a|/avglen the normalized document length, tf_{t,q} the term frequency in the query, IDF the inverse document frequency, and k1, b, k3 tunable parameters.
60
IR-based ad matching. Q1: How can we score the goodness of an ad for a context? Okapi BM25 (same formula as above, combining term frequency in the ad, normalized document length, and term frequency in the query). Advantages: different terms are weighted differently; tunable parameters; good performance.
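A minimal sketch of the textbook Okapi BM25 scorer, to make the pieces above concrete; the parameter defaults (k1 = 1.2, b = 0.75, k3 = 8) are conventional choices, not values from the talk, and all names are illustrative.

```python
import math
from collections import Counter

def bm25_score(query_terms, ad_terms, doc_freq, n_docs, avg_len,
               k1=1.2, b=0.75, k3=8.0):
    """Okapi BM25 score of an ad (the 'document') for a query.

    doc_freq: dict term -> number of ads containing the term.
    n_docs:   total number of ads in the corpus.
    avg_len:  average ad length in terms.
    """
    tf_ad = Counter(ad_terms)
    tf_q = Counter(query_terms)
    norm = 1 - b + b * len(ad_terms) / avg_len        # normalized document length
    score = 0.0
    for term, qf in tf_q.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))  # inverse document frequency
        tf = tf_ad[term]                                  # term frequency in ad
        score += idf * (tf * (k1 + 1)) / (tf + k1 * norm) \
                     * ((k3 + 1) * qf) / (k3 + qf)        # term frequency in query
    return score
```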
61
IR-based ad matching: language models. Ads and queries are generated by statistical models of how words are used in the language. Two questions: What statistical models can be used? How do we translate query and ad generation probabilities into relevance?
62
IR-based ad matching. What statistical models can be used? E.g., a bigram model or a multinomial (unigram) model. Under the multinomial model, a document with total length L and term frequencies tf_w is generated with probability proportional to Π_w θ_w^{tf_w}, where the term probabilities θ_w are the model parameters. Given any ad or query, we can compute the parameter setting most likely to have generated it (the MLE is θ̂_w = tf_w / L).
63
IR-based ad matching. How do we translate query and ad generation probabilities into relevance? Method 1: compute the most likely query and ad parameters, then generate the ad using the query's parameters. High probability → high relevance.
64
IR-based ad matching. Method 2: compute the most likely query and ad parameters, then generate the query using the ad's parameters. High probability → high relevance.
65
IR-based ad matching. Method 3: compute the most likely query and ad parameters, then compute the KL-divergence between the two parameter settings. Low KL-divergence → high relevance.
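For reference, a worked form of Method 3, assuming θ^(q) and θ^(a) are the fitted (and smoothed, so that no ad-side probability is zero) unigram term distributions:

```latex
\mathrm{KL}\left(\theta^{(q)} \,\Vert\, \theta^{(a)}\right)
  = \sum_{w} \theta^{(q)}_{w} \,\log \frac{\theta^{(q)}_{w}}{\theta^{(a)}_{w}},
\qquad \text{low divergence} \;\Rightarrow\; \text{high relevance}.
```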
66
Overview Information Retrieval (IR)
Techniques Challenges Machine Learning using Click Feedback Online Learning
67
Challenges of IR-based ad matching
Word matches might not always work
68
Woes of word matching. (Example: pure word matching misses related ads; extracting topical information increases coverage and yields a more relevant match.)
69
IR-based ad matching New methods to combine syntactic and semantic information For example, “A Semantic Approach to Contextual Advertising” by Broder+/SIGIR/2007 Words only provide syntactic clues Classify ads and queries into a common taxonomy Taxonomy matches provide semantic clues
70
Challenges of IR-based ad matching
Word matches might not always work: they work well for frequent words, but what about rare words? The long tail has big revenue impact. Remedy: add more matching dimensions (phrases, …). Matching is static and does not capture the effect of external factors, e.g., high interest in a basketball page due to an event, which dies off after the event. Click feedback is a powerful way of capturing such latent effects; it is difficult to do through relevance alone. Relevance scores may not correspond to CTR, and do not provide estimates of expected revenue.
71
Challenges of IR-based ad matching
Heterogeneous corpus (queries, ads): a single tf-idf score is not applicable. In content match, queries are long and noisy. Partial feedback does not work; not scalable. Ads are small, and the relevance of the landing page is difficult to determine (video, image, text).
72
Machine Learning using Click Feedback
73
Overview Information Retrieval (IR)
Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning
74
Learning from Click Feedback
Learning relevance from partially human-labeled training data is attractive but not scalable. Users provide direct feedback through ad clicks: a low-cost, automated learning mechanism, with large amounts of feedback for big ad networks. Estimation problem: estimate CTR = Pr(click | query, ad, user).
75
Learning from Clicks: Challenges
Noisy labels: clicks (unscrupulous users gaming the system) and negatives (unclear; some users never click on ads). Sparseness: the (query, ad) matrix has billions of cells with a long tail; too few data points in a large number of cells, so the MLE has high variance. The goal is to learn the best cells well, not all cells. Dynamic and seasonal effects: CTRs evolve and are subject to seasonal effects (summer, Halloween, …); Palin ads were popular yesterday, not today.
76
Challenges continued. Selection bias: e.g., we never showed watch ads on golf pages. Positional and presentation bias: the same ad performs differently at different positions. Slate bias: the performance of an ad depends on the other ads that were displayed with it.
77
Overview Information Retrieval (IR)
Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning
78
Feature based approach
Query and ad are characterized by features. Query: bag-of-words, phrases, topic, …; ad: bag-of-words, keywords, size, …. With query feature vector q and ad feature vector a, model Pr(click | Q, A) = f(q, a; θ). Example, logistic regression: log-odds(Pr(click | Q, A)) = q′ W a, with W estimated from data.
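A minimal sketch of scoring with the bilinear logistic model above; the feature vectors and weights here are random placeholders, and in practice W would be estimated from click data with regularization (next slide).

```python
import numpy as np

def click_probability(q, a, W):
    """P(click | query, ad) with log-odds = q' W a, as on the slide."""
    log_odds = q @ W @ a
    return 1.0 / (1.0 + np.exp(-log_odds))

rng = np.random.default_rng(0)
q = rng.random(5)            # query features (e.g., bag-of-words weights)
a = rng.random(4)            # ad features
W = rng.normal(size=(5, 4))  # interaction weights, learned from data
print(click_probability(q, a, W))
```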
79
Feature based models: Challenges
High-dimensional, so regularization (priors) is needed. Must de-bias for positional and slate effects, and negative events must be weighted appropriately. We now go through case studies reported in the literature.
80
"Predicting Clicks: Estimating the Click-Through Rate of New Ads", Richardson et al., WWW 2007. Estimates the CTR of new ads in sponsored search: log-odds(CTR(ad)) = Σ_i w_i f_i(ad). Features used: bid-term CTRs of related ads (from other accounts), e.g., CTRs of all other ads with keyword "camera"; appearance; attention; advertiser reputation; landing page quality; relevance of bid terms to the ad; bag-of-words in the ad. Does not capture interactions between (query, ad); the main focus is estimating the CTR of new ads only. Negative events are down-weighted based on an eye-tracking study.
81
Combining relevance with Click Feedback, Chakrabarti et al, WWW 08
Content match application: CTR estimation for arbitrary (page, ad) pairs. Features: bag-of-words in the page (the "query") and the ads; relevance scores from IR; cross-products of words that occur in both page and ad. Learn to predict the click data using such features. The prediction function is amenable to the WAND algorithm, which helps with fast retrieval at serve time.
82
Proposed Method: a logistic regression model for CTR. The model parameters comprise a main effect for the page (how good is the page), a main effect for the ad (how good is the ad), and an interaction effect (words shared by page and ad): logit(CTR) = page effect + ad effect + interaction effect.
83
Proposed Method. Feature definitions for word w: M_{p,w} = tf_{p,w}, M_{a,w} = tf_{a,w}, I_{p,a,w} = tf_{p,w} · tf_{a,w} (page main effect, ad main effect, and interaction features).
So, IR-based term frequency measures are taken into account
84
Proposed Method. Two sources of complexity: adding in IR scores, and word selection for efficient learning.
85
Proposed Method How can IR scores fit into the model?
What is the relationship between logit(p_ij) and the cosine score? Empirically, a quadratic one. (Plot: logit(p_ij) vs. cosine score.)
86
Proposed Method How can IR scores fit into the model?
This quadratic relationship can be used in two ways: put in cosine and cosine² as features, or use it as a prior.
87
Proposed Method: word selection. Overall, there are nearly 110k words in the corpus. Learning parameters for each word would be very expensive, require a huge amount of data, and suffer from diminishing returns. So we want to select the ~1k top words that will have the most impact.
88
Proposed Method: word selection, data-based. Define an interaction measure for each word, with higher values for words that have higher-than-expected CTR when they occur on both page and ad.
89
Experiments. (Precision-recall plot.) Result: a 25% lift in precision at 10% recall.
90
Overview Information Retrieval (IR)
Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning
91
Regelson and Fain, 2006: estimate the CTR of terms by "borrowing strength" at multiple resolutions. Hierarchical clustering of related terms (clustering the advertiser-keyword matrix); CTRs at finer resolutions are estimated using information at coarser resolutions, via a weighted average with more weight on finer resolutions. Weights are selected heuristically; no principled approach.
92
Estimation in the “tail”
A more principled approach: "Estimating Rates of Rare Events at Multiple Resolutions" [KDD/2007]. Contextual advertising: show an ad on a webpage (an "impression"); revenue is generated if a user clicks. Problem: estimate the click-through rate (CTR) of an ad on a page. Most (ad, page) pairs have very few impressions, if any, and even fewer clicks: severe data sparsity.
93
Estimation in the “tail”
Use an existing, well-understood hierarchy, and categorize ads and webpages to its leaves. CTR estimates of siblings are correlated, and the hierarchy allows us to aggregate data: coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions.
94
System overview: retrospective data [URL, ad, isClicked]; crawl a sample of URLs; classify pages and ads; rare-event estimation using the hierarchy; impute impressions and fix sampling bias.
95
Sampling of webpages. Naïve strategy: sample at random from the set of URLs; this incurs sampling errors in both impression volume and click volume. Instead, we propose crawling all URLs with at least one click, plus a sample of the remaining URLs; then the variability is only in impression volume. The sampling bias is adjusted through a statistical procedure (details in the paper). Only pages are sampled, not ads: all ad information is available.
96
System overview (repeated, with pages in place of URLs): retrospective data [page, ad, isClicked]; crawl a sample of pages; classify pages and ads; rare-event estimation using the hierarchy; impute impressions and fix sampling bias.
97
Rare rate modeling Freeman-Tukey transform:
y_ij = FT(clicks and impressions in cell ij) ≈ transformed CTR. This is a variance-stabilizing transformation: Var(y) is (approximately) independent of E[y], which is needed in the modeling that follows.
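For reference, the standard Freeman-Tukey forms (the talk's exact variant may differ): for a binomial count c out of n impressions, the double-arcsine transform, and for a Poisson count c, the square-root form:

```latex
y = \tfrac{1}{2}\left[\arcsin\sqrt{\tfrac{c}{n+1}} + \arcsin\sqrt{\tfrac{c+1}{n+1}}\right],
\qquad
y = \sqrt{c} + \sqrt{c+1}.
```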
98
Rare rate modeling: a generative model (tree-structured Markov model). (Diagram: each node ij has an unobserved state S_ij that evolves from its parent's state S_parent(ij) with variance W_ij and covariates β_ij; the observation y_ij is emitted from the state S_ij with variance V_ij, and similarly y_parent(ij) from S_parent(ij).)
99
Rare rate modeling Model fitting with a 2-pass Kalman filter:
Filtering: Leaf to root Smoothing: Root to leaf Linear in the number of regions
100
Tree-structured Markov model
101
Scalable Model fitting Multi-resolution Kalman filter
102
Multi-Resolution Kalman filter: Mathematical overview
B_r: the correlation between two sibling regions at level d_r.
103
Experiments: 503M impressions; a 7-level hierarchy, of which the top 3 levels were used. Zero clicks in 76% of regions at level 2 and in 95% of regions at level 3. Full dataset D_FULL, and a 2/3 sample D_SAMPLE.
104
Experiments. Estimate CTRs for all regions R at level 3 with zero clicks in D_SAMPLE. Some of these regions (call them R_{>0}) get clicks in D_FULL. A good model should predict higher CTRs for R_{>0} than for the other regions in R.
105
Experiments: we compared 4 models. TS: our tree-structured model. LM (level-mean): each level smoothed independently. NS (no smoothing): CTR proportional to 1/Ñ. Random: assuming |R_{>0}| is given, randomly predict the membership of R_{>0} out of R.
106
Experiments. (Results plot comparing TS, LM, NS, and Random.)
107
Experiments: with few impressions, estimates depend more on siblings; with enough impressions, there is little "borrowing" from siblings.
108
Related Work. Multi-resolution modeling: studied in time-series modeling and spatial statistics [Openshaw+/79, Cressie/90, Chou+/94]. Imputation: studied in statistics [Darroch+/1972]. The application of such models to the estimation of such rare events (rates of ~10^-3) is novel.
109
Summary: a method to estimate rates of extremely rare events at multiple resolutions under severe sparsity constraints. The method has two parts: imputation, which incorporates the hierarchy and fixes sampling bias; and a tree-structured generative model with extremely fast parameter fitting.
110
Overview Information Retrieval (IR)
Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning
111
Collaborative Filtering
Similarity-based methods. With an ad-ad similarity matrix s and r_{u,j} the rating (CTR) for query u of ad j, the estimate for ad i uses the local neighborhood N(i) of ad i: r̂_{u,i} = Σ_{j∈N(i)} s_{i,j} r_{u,j} / Σ_{j∈N(i)} s_{i,j}.
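A minimal sketch of this neighborhood estimator, with illustrative names throughout; the similarity function itself would have to be learned or chosen (one of the challenges noted on the next slide).

```python
def predict_ctr(query, ad, ratings, sim, neighborhood):
    """Item-based CF estimate: similarity-weighted average of the query's
    observed CTRs on ads in the local neighborhood of the target ad.

    ratings:      dict (query, ad) -> observed CTR.
    sim:          dict (ad, other_ad) -> ad-ad similarity.
    neighborhood: dict ad -> list of most similar ads.
    """
    num = den = 0.0
    for other in neighborhood[ad]:
        if (query, other) in ratings:
            s = sim[(ad, other)]
            num += s * ratings[(query, other)]
            den += abs(s)
    return num / den if den > 0 else None   # None: no usable neighbors

ratings = {("camera", "ad1"): 0.02, ("camera", "ad2"): 0.04}
sim = {("ad3", "ad1"): 0.5, ("ad3", "ad2"): 0.25}
neighborhood = {"ad3": ["ad1", "ad2"]}
print(predict_ctr("camera", "ad3", ratings, sim, neighborhood))  # ~0.027
```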
112
Collaborative Filtering
Similarity-based methods: a possible adaptation to ad matching. Challenges: learning the similarity; simultaneously incorporating query and ad similarities; combining the feature-based model with the collaborative filtering model.
113
Matrix Factorization. Each query (ad) is a linear combination of latent factors: the CTR matrix is approximated by the product of a matrix of factor coefficients for queries and a matrix of factor coefficients for ads. Solve for the factors under some regularization and constraints.
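A minimal sketch of one common way to solve for the factors, stochastic gradient descent with L2 regularization; the talk does not prescribe a particular fitting method, so this is illustrative only.

```python
import numpy as np

def factorize(R, mask, k=5, lr=0.01, reg=0.1, epochs=100, seed=0):
    """Approximate the observed cells of the query-by-ad CTR matrix R
    as U @ V.T, with k latent factors; mask marks observed cells."""
    rng = np.random.default_rng(seed)
    n_q, n_a = R.shape
    U = 0.1 * rng.standard_normal((n_q, k))  # factor coefficients for queries
    V = 0.1 * rng.standard_normal((n_a, k))  # factor coefficients for ads
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = R[i, j] - U[i] @ V[j]      # residual on one observed cell
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * U[i] - reg * V[j])
    return U, V
```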
114
Matrix Factorization and bi-clustering: "Predictive Discrete Latent Factor Models", Agarwal and Merugu, KDD 07.
115
Overview Information Retrieval (IR)
Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning
116
Challenges of Feature-based models
Learns from clicks but still misses context in many instances, as in the relevance-based approach. Introducing features that are too granular makes it hard to learn CTR reliably. Does not capture the dynamics of the system. Training cost is high, and slow prediction functions are inadmissible due to latency constraints.
117
Challenges of Feature-based models
Other methods: boosting, neural nets, decision trees, random forests, …. Local models, mixtures of experts: fit locally, think globally. Hierarchical modeling with multiple trees (user interest, query, ad, …), where each tree is different: how do we perform smoothing with multiple disparate trees?
118
Challenges of Feature-based models
Combining cold start with warm start is the main challenge in collaborative-filtering-based methods. We believe solving the basic issues is even more challenging: positional bias, selection bias, correlation among ads on a slate, and dynamic CTRs with seasonal variations.
119
Online learning
120
Overview Information Retrieval (IR)
Machine Learning using Click Feedback Online Learning
121
Online learning for ad matching
All previous approaches learn from historical data This has several drawbacks: Slow response to emerging patterns in the data due to special events like elections, … Initial systemic biases are never corrected If the system has never shown “sound system dock” ads for the “iPod” query, it can never learn if this match is good System needs to be retrained periodically
122
Online learning for ad matching
Solution: Combining exploitation with exploration Exploitation: Pick ads that are good according to current model Exploration: Pick ads that increase our knowledge about the entire space of ads Multi-armed bandits Background Applications to online advertising Challenges and Open Problems
123
Background: Bandits. Bandit "arms" have unknown payoff probabilities p1, p2, p3, …. "Pulling" arm i yields a reward: reward = 1 with probability p_i (success), reward = 0 otherwise (failure).
124
Background: Bandits. Goal: pull arms sequentially so as to maximize the total expected reward. Estimate the payoff probabilities p_i, while biasing the estimation process towards better arms.
125
Background: Bandits. An algorithm that sequentially picks the arms is called a bandit policy. The regret of a policy is how much extra payoff could be gained in expectation if the best arm were always pulled; of course, the best arm is not known to the policy, so the regret is the price of exploration. Low regret implies that the policy quickly converges to the best arm. What is the optimal policy?
126
Background: Bandits. Which arm should be pulled next? Not necessarily the one that looks best right now, since it might have had a few lucky successes. The choice seems to depend on some complicated function of the successes s_i and failures f_i of all arms: argmax g(s1, f1, s2, f2, …, sk, fk)?
127
Background: Bandits. What is the optimal policy? Consider a bandit with an infinite time horizon but geometrically discounted future rewards: R_total = R(1) + γ·R(2) + γ²·R(3) + … (0 < γ < 1). Theorem [Gittins/1979]: the optimal policy decouples and solves a bandit problem for each arm independently, so the coupled argmax g(s1, f1, s2, f2, …, sk, fk) reduces to argmax {g1(s1, f1), g2(s2, f2), …, gk(sk, fk)}.
128
Background: Bandits What is the optimal policy?
Theorem [Gittins/1979]: the optimal policy decouples and solves a bandit problem for each arm independently. This significantly reduces the dimension of the problem space, but the optimal index functions g_i(s_i, f_i) are hard to compute: we need approximate methods…
129
Background: Bandits. A generic bandit policy: assign a priority to each arm (allocation); "pull" the arm with maximum priority and observe the reward; update the priorities (estimation).
130
Background: Bandits. One common policy is UCB1 [Auer/2002]. With s_i successes and f_i failures for arm i, n_i = s_i + f_i observations of arm i, and T total observations across all arms, assign arm i the priority s_i/n_i + sqrt(2 ln T / n_i): the observed payoff plus a factor representing uncertainty.
131
Background: Bandits. As the total number of observations T becomes large, the observed payoff tends asymptotically towards the true payoff probability. The system never completely "converges" to one best arm; only the rate of exploration tends to zero.
132
Background: Bandits. Sub-optimal arms are pulled only O(log T) times, hence UCB1 has O(log T) regret. This is the lowest possible regret rate.
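A minimal runnable sketch of UCB1 as described above, simulating Bernoulli arms; the payoff probabilities below are made up for illustration.

```python
import math, random

def ucb1(payoff_probs, horizon=10000, seed=0):
    """UCB1 [Auer/2002]: pull the arm maximizing
    observed payoff + sqrt(2 ln T / n_i); returns per-arm pull counts."""
    random.seed(seed)
    k = len(payoff_probs)
    successes, pulls = [0] * k, [0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1            # initialize: pull each arm once
        else:
            arm = max(range(k), key=lambda i:
                      successes[i] / pulls[i]                   # observed payoff
                      + math.sqrt(2 * math.log(t) / pulls[i]))  # uncertainty bonus
        pulls[arm] += 1
        successes[arm] += random.random() < payoff_probs[arm]
    return pulls

print(ucb1([0.02, 0.05, 0.04]))  # pulls concentrate on the best (0.05) arm
```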
133
Online learning for ad matching
Solution: Combining exploitation with exploration Exploitation: Pick ads that are good according to current model Exploration: Pick ads that increase our knowledge about the entire space of ads Multi-armed bandits Background Applications to online advertising Challenges and Open Problems
134
Background: Bandits. In our setting there are ~10^9 pages and ~10^6 ads; the bandit "arms" are the ads, with one bandit per webpage.
135
Background: Bandits. Content match = a webpages × ads matrix. Each row (webpage) is a bandit, and each cell has an unknown CTR.
136
Background: Bandits. Why not simply apply a bandit policy directly to our problem? Convergence is too slow: ~10^9 bandits, with ~10^6 arms per bandit. But additional structure is available that can help: taxonomies.
137
Taxonomies for dimensionality reduction
Taxonomies already exist and are actively maintained, and existing classifiers can map pages and ads to taxonomy nodes (e.g., a root with children such as Apparel, Computers, Travel). A bandit policy that uses this structure can be faster.
138
Outline Multi-level Bandit Policy for Content Match Experiments
Summary
139
Multi-level Policy. (Diagram: webpage classes × ad classes.) Consider only two levels of the taxonomy.
140
Multi-level Policy. (Diagram: ad parent classes (Apparel, Computers, Travel), each split into ad child classes; groups of cells form "blocks", and each row is one bandit.) Consider only two levels.
141
Multi-level Policy. (Same diagram as on the previous slide.) Key idea: CTRs within a block are homogeneous.
142
Multi-level Policy CTRs in a block are homogeneous
Used in allocation (picking ad for each new page) Used in estimation (updating priorities after each observation)
143
Multi-level Policy CTRs in a block are homogeneous
Used in allocation (picking ad for each new page) Used in estimation (updating priorities after each observation)
144
Multi-level Policy (Allocation)
(Diagram: a page classifier over classes A, C, T.) Classify the webpage → its page class and parent page class. Run a bandit on the ad parent classes → pick one ad parent class. "We still haven't learnt that geeks and high fashion don't mix."
145
Multi-level Policy (Allocation)
(Diagram continued.) Classify the webpage → page class, parent page class. Run a bandit on the ad parent classes → pick one ad parent class. Run a bandit among the cells → pick one ad class. In general, continue from root to leaf → the final ad.
146
Multi-level Policy (Allocation)
(Diagram continued.) Bandits at higher levels use aggregated information and have fewer arms, so they quickly figure out the best ad parent class.
147
Multi-level Policy CTRs in a block are homogeneous
Used in allocation (picking ad for each new page) Used in estimation (updating priorities after each observation)
148
Multi-level Policy (Estimation)
CTRs in a block are homogeneous Observations from one cell also give information about others in the block How can we model this dependence?
149
Multi-level Policy (Estimation)
Shrinkage model: with N_cell impressions and S_cell clicks in a cell, S_cell | CTR_cell ~ Binomial(N_cell, CTR_cell) and CTR_cell ~ Beta(params_block). All cells in a block come from the same distribution.
150
Multi-level Policy (Estimation)
Intuitively, this leads to shrinkage of cell CTRs towards the block CTR: the estimated CTR is E[CTR_cell] = α · prior_block + (1 − α) · S_cell/N_cell, a weighted average of the Beta prior mean (the "block CTR") and the observed CTR.
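A one-line sketch of this estimator: the Beta-Binomial posterior mean, which equals the weighted average above with α = (a + b)/(a + b + N). The block's Beta parameters would be fit from that block's data (e.g., by empirical Bayes); the numbers below are illustrative.

```python
def shrunk_ctr(clicks, impressions, a_block, b_block):
    """Posterior-mean CTR: S ~ Binomial(N, CTR), CTR ~ Beta(a, b)."""
    return (a_block + clicks) / (a_block + b_block + impressions)

# A cell with 0 clicks in 50 impressions shrinks toward its block CTR (0.002):
print(shrunk_ctr(0, 50, a_block=2.0, b_block=998.0))  # ~0.0019
```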
151
Experiments: taxonomy structure. Depth 0: root. Depth 1: 20 nodes. Depth 2: 221 nodes (we use these two levels). … Depth 7: ~7000 leaves.
152
Experiments. Data collected over a 1-day period, from only one server, under some other ad-matching rules (not our bandit): ~229M impressions. CTR values have been linearly transformed for confidentiality.
153
Experiments (Multi-level Policy)
(Plot: clicks vs. number of pulls.) The multi-level policy gives much higher #clicks.
154
Experiments (Multi-level Policy)
(Plot: mean-squared error vs. number of pulls.) The multi-level policy gives much better mean-squared error: it has learnt more from its explorations.
155
Experiments (Shrinkage)
(Plots: clicks and mean-squared error vs. number of pulls, with and without shrinkage.) Shrinkage improved the mean-squared error, but gave no gain in #clicks.
156
Summary. Taxonomies exist for many datasets, and they can be used for dimensionality reduction: a multi-level bandit policy → higher #clicks; better estimation via shrinkage models → better MSE.
157
Online learning for ad matching
Solution: Combining exploitation with exploration Exploitation: Pick ads that are good according to current model Exploration: Pick ads that increase our knowledge about the entire space of ads Multi-armed bandits Background Applications to online advertising Challenges and Open Problems
158
Challenges and Open Problems
Bandit policies typically assume stationarity But, sudden changes are the norm in the online advertising world: Ads may be suddenly removed when they run out of budget New ads are constantly added to the system The total number of ads is huge, and full exploration may be too costly Mortal multi-armed bandits [NIPS/2008] Algorithms for infinitely multi-armed bandits [NIPS/2008]
159
Mortal Multi-armed Bandits
Traditional bandit policies like UCB1 spend a large fraction of their initial pulls on exploration Hard-earned knowledge may be lost due to finite arm lifetimes Method 1 (Sampling): Pick a random sample from the set of available arms Run UCB1 on sample, until some fraction of arms in the sample are lost Pro: Quicker convergence, more exploitation Con: Best arm in the sample may be worse than best arm overall Pick sample size to control this tradeoff
160
Mortal Multi-armed Bandits
Traditional bandit policies like UCB1 spend a large fraction of their initial pulls on exploration Hard-earned knowledge may be lost due to finite arm lifetimes Method 2 (Payoff threshold): New bandit policy: If the observed payoff of any arm is higher than a threshold, pull it till it expires Pro: Good arms, once found, are exploited quickly Con: While exploiting good arms, the best arm may be starving and may expire without being found Pick threshold to control this tradeoff
161
Mortal Multi-armed Bandits
Challenges: Selecting the critical sample size or threshold correctly, for arbitrary payoff distributions What if even the payoff distribution is unknown?
162
Challenges and Open Problems
Mortal multi-armed bandits What if the bandit policy has some information about the budget? The bandit policy can control which arms expire, and when “Handling Advertisements of Unknown Quality in Search Advertising” by Pandey+/NIPS/2006 Combining budgets with extra knowledge of ad CTRs E.g., Using an ad taxonomy Using a bandit scheme to infer/correct an ad taxonomy
163
Conclusions
164
Conclusions We provided an introduction to Online Advertising
Discussed the ecosystem and the various actors involved, and the different flavors of online advertising: Sponsored Search, Content Match, Display Advertising.
165
Conclusions (section map): Revenue Models (CPM, CPC, CPA); Advertising Setting (Display, Content Match, Sponsored Search); Misc. (ad exchanges).
166
Conclusions. We outlined the associated statistical challenges in sponsored search, content match, and display. We believe the technical roadmap has three components: offline modeling (regression, collaborative filtering, mixtures of experts, multi-resolution models; handling selection bias, slate correlation, and noisy labels); online models (time series); explore/exploit (multi-armed bandits).
167
Conclusions. Offline modeling: by far the best studied so far, but with no careful study of selection bias, slate correlations, or noisy labels; a good opportunity here. More emphasis is needed on matrix structure, where the goal is to estimate interactions. Explore/exploit: some work using multi-armed bandits; a long way to go. Time-series models to capture temporal aspects: little work. Ultimately, we need a holistic approach that combines all components in a principled way.