Handling Advertisements of Unknown Quality in Search Advertising
Sandeep Pandey, Christopher Olston (CMU and Yahoo! Research)
Sponsored Search
- How does it work?
  - The search engine displays ads next to search results
  - Advertisers pay the search engine per click
- Who benefits from it?
  - Main source of funding for search engines
  - Information flow from advertisers to users
Sponsored Search
- Click-through rate (CTR): given an ad and a query, CTR = probability that the ad receives a click
- Optimal policy to maximize the search engine's revenue: display the ads with the highest (CTR x bid) values
- [Figure: results page showing search query results alongside sponsored search results]
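As a toy illustration of the (CTR x bid) ranking rule (the ad names and numbers below are made up, not taken from the talk):

```python
# Rank ads by expected revenue per impression = CTR x bid.
ads = [
    {"name": "ad1", "ctr": 0.05, "bid": 1.00},  # expected revenue 0.050
    {"name": "ad2", "ctr": 0.02, "bid": 3.00},  # expected revenue 0.060
]
ranked = sorted(ads, key=lambda ad: ad["ctr"] * ad["bid"], reverse=True)
# ranked[0] is "ad2": a lower-CTR ad can still be shown first if its bid is high enough.
```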
Challenges in Sponsored Search
- Problem: CTRs are initially unknown; estimating CTRs requires going around the circle below
- Exploration/exploitation tradeoff:
  - explore ads to estimate CTRs
  - exploit known high-CTR ads to maximize revenue
- [Figure: feedback circle among "show ads", "record clicks", "refine CTR estimates", and "earn revenue"]
The Advertisement Problem
- Problem:
  - Advertiser A_i submits ad a_{i,j} for query phrase Q_j
  - A user click on a_{i,j} -> A_i pays b_{i,j} (the "bid value")
  - Queries arrive one after another
  - Select ads to show for each query, in an online fashion
- Constraints:
  - Show at most C ads per query
  - Advertisers have daily budgets: A_i pays at most d_i
- Goal: maximize the search engine's revenue
- [Figure: advertisers A_1, A_2, A_3 with budgets d_1, d_2, d_3, connected to query phrases Q_1, Q_2, Q_3 through their ads a_{1,1}, a_{2,1}, a_{1,3}, a_{3,2}]
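For concreteness, here is a minimal sketch of how a problem instance might be represented; the class and field names are my own, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Ad:
    advertiser: int   # i: index of advertiser A_i
    query: int        # j: index of query phrase Q_j
    bid: float        # b_{i,j}: amount A_i pays per click on a_{i,j}

@dataclass
class Instance:
    ads: list[Ad]
    budgets: dict[int, float]   # advertiser i -> daily budget d_i
    C: int                      # maximum number of ads shown per query
```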
Our Approach
- Unbudgeted advertisement problem
  - Isomorphic to the multi-armed bandit problem
- Budgeted advertisement problem
  - Similar to the bandit problem, but with additional budget constraints that span arms
  - We introduce the Budgeted Multi-armed Multi-bandit Problem (BMMP)
Unbudgeted Advertisement Problem as a Multi-armed Bandit Problem
- Bandit: classical example of online learning under the explore/exploit tradeoff
  - K arms; arm i has an associated reward r_i and an unknown payoff probability p_i
  - Pull C arms at each time instant to maximize the reward accrued over time
- Isomorphism: query phrase <-> bandit instance; ads <-> arms; CTR <-> payoff probability; bid <-> reward
- [Figure: slot-machine arms with payoff probabilities p_1, p_2, p_3]
Policy for Unbudgeted Problem
- Policy "MIX" (adopted from [Auer et al., ML'02])
- When query phrase Q_j arrives:
  - Compute the priority p_{i,j} of each ad a_{i,j}, where p_{i,j} = (e_{i,j} + sqrt(2 ln n_j / n_{i,j})) * b_{i,j}
    - e_{i,j}: the MLE of the CTR of a_{i,j}
    - b_{i,j}: the price (bid value) of ad a_{i,j}
    - n_{i,j}: number of times ad a_{i,j} has been shown so far
    - n_j: number of times query Q_j has been answered so far
  - Display the C highest-priority ads
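A minimal sketch of the MIX selection step, assuming a simple dict-based representation of ads; the field names ("e", "b", "n") and the convention of exploring never-shown ads first are my own, not from the paper:

```python
import math

def mix_select(ads, n_j, C):
    """Return the C highest-priority ads for a query phrase.

    Each ad is a dict with:
      "e": current MLE of its CTR (clicks / impressions),
      "b": its bid value b_{i,j},
      "n": number of times it has been shown for this query phrase.
    n_j is the number of times the query phrase has been answered so far.
    Priority: p_{i,j} = (e_{i,j} + sqrt(2 ln n_j / n_{i,j})) * b_{i,j}.
    """
    def priority(ad):
        if ad["n"] == 0:
            return float("inf")   # show never-displayed ads first (illustrative convention)
        bonus = math.sqrt(2 * math.log(n_j) / ad["n"])
        return (ad["e"] + bonus) * ad["b"]

    return sorted(ads, key=priority, reverse=True)[:C]
```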
Budgeted Multi-armed Multi-bandit Problem (BMMP)
- Finite set of bandit instances; each instance has a finite number of arms
- Each arm has an associated type; each type T_i has a budget d_i
  - Upper limit on the total amount of reward that can be generated by the arms of type T_i
- An external actor invokes a bandit instance at each time instant; the policy must choose C arms of the invoked instance
Meta-Policy for BMMP
- Input: a BMMP instance and a policy POL for the conventional multi-armed bandit problem
- Output: the following policy BPOL
  - Run POL in parallel for each bandit instance B_i (call the copy for B_i POL_i)
  - Whenever B_i is invoked:
    - Discard any arm(s) with depleted budget
    - If one or more arms were discarded, restart POL_i
    - Let POL_i decide which of the remaining arms to activate
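A sketch of how BPOL could be wired around an arbitrary bandit policy POL; the POL interface (a factory plus select/update methods) and the arm representation are assumptions of mine, not the paper's API:

```python
class BPOL:
    """Meta-policy: run one copy of POL per bandit instance, enforce budgets
    by discarding depleted arms, and restart that copy whenever arms are discarded."""

    def __init__(self, make_pol, budgets, C):
        self.make_pol = make_pol       # factory: list of arms -> fresh POL instance
        self.budgets = dict(budgets)   # type_id -> remaining budget d_i
        self.C = C
        self.arms = {}                 # instance_id -> live arms, arm = (arm_id, type_id, reward)
        self.pols = {}                 # instance_id -> POL copy (POL_i)

    def register_instance(self, inst_id, arms):
        self.arms[inst_id] = list(arms)
        self.pols[inst_id] = self.make_pol(self.arms[inst_id])

    def invoke(self, inst_id):
        live = [a for a in self.arms[inst_id] if self.budgets[a[1]] > 0]
        if len(live) < len(self.arms[inst_id]):       # some budget depleted since last time
            self.arms[inst_id] = live
            self.pols[inst_id] = self.make_pol(live)  # restart POL_i on the remaining arms
        return self.pols[inst_id].select(self.C)      # POL_i picks C of the remaining arms

    def record_reward(self, inst_id, arm, reward):
        self.budgets[arm[1]] = max(0.0, self.budgets[arm[1]] - reward)
        self.pols[inst_id].update(arm, reward)
```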
Performance Guarantee of BPOL
- OPT = algorithm that knows in advance:
  1. the full sequence of bandit invocations
  2. the payoff probabilities
- Claim: bpol(N) >= opt(N)/2 - O(f(N))
  - bpol(N): total expected reward of the BPOL policy after N bandit invocations
  - opt(N): total expected reward of OPT
  - f(N): regret of POL after N invocations of the regular bandit problem
Proof of Performance Guarantee
- Divide the time instants into 3 categories:
  - Category 1: BPOL chooses an arm of higher expected reward than OPT
    - opt_1(N) <= bpol_1(N)
  - Category 2: BPOL chooses an arm of lower expected reward because OPT's arm has run out of budget
    - opt_2(N) <= bpol(N) + (#types * max reward), since OPT's arm can only be depleted because BPOL has already collected (nearly) the entire budget of that type
  - Category 3: otherwise
    - opt_3(N) = O(f(N))
- Claim (follows from the above bounds):
  - opt(N) <= bpol(N) + bpol(N) + O(1) + O(f(N))
  - hence bpol(N) >= opt(N)/2 - O(f(N))
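Chaining the three category bounds (writing c for the constant #types * max reward, and using bpol_1(N) <= bpol(N)) gives the claim:

```latex
\begin{align*}
\mathrm{opt}(N) &= \mathrm{opt}_1(N) + \mathrm{opt}_2(N) + \mathrm{opt}_3(N) \\
                &\le \mathrm{bpol}_1(N) + \bigl(\mathrm{bpol}(N) + c\bigr) + O(f(N)) \\
                &\le 2\,\mathrm{bpol}(N) + O(1) + O(f(N)),
\end{align*}
```

which rearranges to bpol(N) >= opt(N)/2 - O(f(N)).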
Advertisement Policies
- BMIX: the output of our generic BPOL policy when given MIX as input
- BMIX-E: replace sqrt(2 ln n_j / n_{i,j}) in priority p_{i,j} by sqrt(min(0.25, V(n_{i,j}, n_j)) * ln n_j / n_{i,j}), where V(n_{i,j}, n_j) = e_{i,j} * (1 - e_{i,j}) + sqrt(2 ln n_j / n_{i,j})
  - Suggested in [Auer et al., ML'02]. Purpose: aggressive exploitation
- BMIX-T: replace b_{i,j} in priority p_{i,j} by b_{i,j} * throttle(d_i'), where throttle(d_i') = 1 - e^(-d_i'/d_i) and d_i' is the remaining budget of advertiser A_i
  - Suggested in [Mehta et al., FOCS'05]. Purpose: delay the depletion of advertisers' budgets
- BMIX-ET: with both the E and T modifications
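A sketch of the per-ad priority with both modifications applied; the argument names are mine, and the formulas simply transcribe the slide above:

```python
import math

def bmix_et_priority(e, b, n_ij, n_j, d_rem, d_total):
    """Priority of one ad under BMIX-ET.

    e       -- CTR estimate e_{i,j}
    b       -- bid b_{i,j}
    n_ij    -- times ad a_{i,j} has been shown
    n_j     -- times query phrase Q_j has been answered
    d_rem   -- advertiser's remaining daily budget d_i'
    d_total -- advertiser's daily budget d_i
    """
    if n_ij == 0:
        return float("inf")   # explore never-shown ads first (illustrative convention)

    # E modification: variance-aware confidence radius
    v = e * (1 - e) + math.sqrt(2 * math.log(n_j) / n_ij)
    radius = math.sqrt(min(0.25, v) * math.log(n_j) / n_ij)

    # T modification: throttle the bid as the advertiser's budget depletes
    throttle = 1 - math.exp(-d_rem / d_total)

    return (e + radius) * b * throttle
```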
Experiments
- Simulations over real data
- Data:
  - 85,000 query phrases from a Yahoo! query log
  - Yahoo! ads with daily budget constraints
  - CTRs drawn from Yahoo!'s CTR distribution
- Simulated user clicks using the CTR values
- Time horizon = multiple days; policies carried the CTR estimates over from one day to the next
Results
- GREEDY: select the ads with the highest current reward estimate (e_{i,j} * b_{i,j})
  - Does not explore; only exploits
- [Figure: revenue comparison of the policies; revenue values scaled for confidentiality reasons]
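For contrast with the MIX sketch earlier, GREEDY drops the exploration bonus entirely (using the same illustrative dict fields as before):

```python
def greedy_select(ads, C):
    # Rank purely by current estimated revenue e_{i,j} * b_{i,j}; no exploration bonus.
    return sorted(ads, key=lambda ad: ad["e"] * ad["b"], reverse=True)[:C]
```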
Conclusion
- Search advertisement problem
  - Exploration/exploitation tradeoff
  - Model as a multi-armed bandit
- Introduced a new bandit variant: the Budgeted Multi-armed Multi-bandit Problem (BMMP)
- New policy for BMMP with a performance guarantee
- In the paper:
  - Variable set of ads (ads come and go)
  - Prior CTR estimates