Probabilistic Methods for Targeted Advertising Max Chickering Microsoft Research
Outline
- Targeted Mailing: To whom should you send a solicitation?
- Targeted Advertising on the Web: How should you display banner ads to maximize click-through?
Targeted Mailing
Given a population of potential customers:

Person   X_1   X_2   …   X_n
1        0     0     …   red
2        0     3.4   …   blue
...
m        1     7     …   green

Sending an advertisement costs money:
- Postage
- Possible discount
Which potential customers do you solicit?
Motivating Application
- Advertisement: MSN subscription
- Potential customers: people who registered Windows 95
- Known variables: 15 from a questionnaire (e.g. gender, RAM size)
Naïve Solutions
- Mail to those customers most likely to subscribe to MSN: can waste money by targeting customers who would subscribe anyway
- Mail to everyone: even worse!
Response Behaviors
Will the potential customer buy the product?

Behavior            Mail   Don't Mail
Always buyer        Yes    Yes
Persuadable         Yes    No
Anti-persuadable    No     Yes
Never buyer         No     No

We only make money from mailing to the persuadable potential customers.
Expected Profit for a Population
- Population of N potential customers: N = N_alw + N_per + N_anti + N_nev
- Cost of mailing: c
- Solicited and unsolicited revenue: r
- Expected profit from mailing: (N_alw + N_per) r - N c
- Profit from not mailing: (N_alw + N_anti) r
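The two profit quantities can be read off the response-behavior table: when everyone is mailed, the always-buyers and the persuadables buy; when nobody is mailed, the always-buyers and the anti-persuadables buy. The sketch below is only an illustration under that reading. It assumes a single revenue r for solicited and unsolicited purchases, and the function names and counts are hypothetical, not taken from the talk.

```python
def profit_from_mailing(n_alw, n_per, n_anti, n_nev, c, r):
    """Profit if everyone is mailed: always-buyers and persuadables buy."""
    n = n_alw + n_per + n_anti + n_nev
    return (n_alw + n_per) * r - n * c


def profit_from_not_mailing(n_alw, n_per, n_anti, n_nev, r):
    """Profit if nobody is mailed: always-buyers and anti-persuadables buy."""
    return (n_alw + n_anti) * r


def population_lift(n_alw, n_per, n_anti, n_nev, c, r):
    """Lift = profit from mailing minus profit from not mailing."""
    return (profit_from_mailing(n_alw, n_per, n_anti, n_nev, c, r)
            - profit_from_not_mailing(n_alw, n_per, n_anti, n_nev, r))


# Hypothetical population of 1000 people (counts are made up for illustration).
counts = dict(n_alw=100, n_per=50, n_anti=10, n_nev=840)
lift_total = population_lift(**counts, c=0.50, r=9)
```

With these made-up counts the lift is negative, so mailing the whole population would lose money even though some individuals are persuadable.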
Lift in Profit From Mailing
Lift = (profit from mailing) - (profit from not mailing)
For any set of potential customers, we should only mail if the lift is positive.
Learning Expected Lift
- S ∈ {s_0, s_1} (did not subscribe, did subscribe)
- M ∈ {m_0, m_1} (did not mail, did mail)
- Identifiable if S and M are known in the training data
- Lift: -c + [p(S=s_1 | M=m_1) - p(S=s_1 | M=m_0)] r
Controlled Experiment: Identify Profitable Sub-Populations
1. Choose a small sample of the potential customers
2. Randomly divide those customers into a “treatment group” (M = m_1) and a “control group” (M = m_0)
3. Wait a specified period of time, and record S = s_0 or S = s_1 for each customer
Controlled Experiment
The customer table gains two columns, M and S, recorded for each person in the sample (e.g. person m: X_1 = 1, X_2 = 7, …, X_n = green, M = m_1, S = s_1).
Use machine-learning techniques to identify sub-populations with high positive lift, and then target those customers.
Lift(sub-population corresponding to X_n = blue) = -c + [p(S=s_1 | M=m_1, X_n=blue) - p(S=s_1 | M=m_0, X_n=blue)] r
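As a rough illustration of how the per-sub-population lift could be estimated from the experiment, the sketch below uses plain frequency counts in the treatment and control groups. This is a simple stand-in, not the decision-tree learner described in the talk; the record layout, the function name, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def estimate_lift(records, c, r):
    """Estimate lift per sub-population from (subpop, mailed, subscribed) records.

    Uses raw frequencies for p(S=s_1 | M, subpop); a sub-population is skipped
    when it lacks either mailed or unmailed examples, since the lift is then
    not identifiable from the data.
    """
    # counts[subpop][mailed] = [number subscribed, number of people]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for subpop, mailed, subscribed in records:
        cell = counts[subpop][mailed]
        cell[0] += int(subscribed)
        cell[1] += 1

    lifts = {}
    for subpop, by_mail in counts.items():
        (s1, n1), (s0, n0) = by_mail[True], by_mail[False]
        if n1 == 0 or n0 == 0:
            continue
        lifts[subpop] = -c + (s1 / n1 - s0 / n0) * r
    return lifts


# Toy experiment data (made up): sub-populations keyed by the value of X_n.
records = (
    [("blue", True, True)] * 2 + [("blue", True, False)] * 2
    + [("blue", False, True)] * 1 + [("blue", False, False)] * 3
    + [("red", True, True), ("red", True, False)]
    + [("red", False, True), ("red", False, False)]
)
lifts = estimate_lift(records, c=0.50, r=9)
targets = [subpop for subpop, lift in lifts.items() if lift > 0]
```

In this toy data only the "blue" sub-population has positive lift, so it is the only one that would be targeted.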
Identify Profitable Sub-Populations
- Known distinctions in our data: X = {X_1, …, X_n}, S, M
- Partitions of X define sub-populations, and a statistical model for p(S | M, X) defines the lift
- Approach: use decision trees
Example partition into four sub-populations, each with its own lift (Lift 1 to Lift 4):
- X_1 > 10, X_4 ≠ 2
- X_1 > 10, X_4 = 2
- X_1 < 10, X_12 = false
- X_1 < 10, X_12 = true
Probabilistic Decision Trees
The tree represents p(S | M, X_1, X_2); each leaf holds one conditional distribution, e.g. p(S | M=m_0, X_1=1, X_2=2).
Calculating Lift
[Figure: probabilistic decision tree with splits on X_2, X_1, and M (mailed / not mailed). The six leaves give p(S=subscribed) = 0.6, 0.5, 0.4, 0.2, 0.7, 0.3, with p(S=not subscribed) as the complement.]
Potential customer with {X_1=1, X_2=2}; assume c = 0.50, r = 9:
Lift = -0.5 + (0.4 - 0.2) × 9 = 1.3
Mail to this person!
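The calculation on this slide can be written as a lookup followed by the lift formula. The sketch below fills in only the one path used in the example (X_1 = 1, X_2 = 2, with p(S=subscribed) of 0.4 when mailed and 0.2 when not mailed); the dictionary layout and names are assumptions for illustration, not the representation used by the authors.

```python
# p(S = subscribed | M, X_1, X_2) for the example customer's path only.
# Keys are (x1, x2); the remaining paths of the tree are omitted here.
LEAVES = {
    (1, 2): {"mailed": 0.4, "not_mailed": 0.2},
}


def lift_for(x1, x2, c, r):
    """Lift = -c + [p(s_1 | m_1, x) - p(s_1 | m_0, x)] * r for one customer."""
    leaf = LEAVES[(x1, x2)]
    return -c + (leaf["mailed"] - leaf["not_mailed"]) * r


lift = lift_for(1, 2, c=0.50, r=9)
decision = "mail" if lift > 0 else "do not mail"
```

The mailing cost enters once, as the -c term, which is why the result is 1.3 rather than 1.8.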
Traditional Learning Algorithm
[Figure: greedy tree growth. Each candidate split X_1, X_2, …, X_n is scored on the data (Score_1(Data), …, Score_n(Data)); after a split on X_2 is applied, the remaining candidates are scored again below it.]
Lift-Aware Learning Algorithm
- Traditional learning algorithm: identify a tree that represents p(S | M, X) well
- Lift-aware: would like the tree to be good at modeling the difference p(S=s_1 | M=m_1, X=x) - p(S=s_1 | M=m_0, X=x)
A Heuristic
Only consider decision trees (for S) with the last split on M.
[Figure: candidate trees that split on X_1, X_2, …, X_n, each ending in a split on M on every path and each scored on the data (Score_1(Data), …, Score_n(Data)).]
Experiment: Real-world Dataset
- Product of interest: MSN subscription
- Potential customers: Windows 95 registrants
- Known variables (X): 15 from questionnaire (e.g. gender, RAM size)
- Cost to mail: 42 cents
- Subscription revenue: varied from 1 to 15 dollars
- Data: sample of ~110,000 potential customers (70% train, 30% test)
- Compared our algorithm (FORCE) with an unconstrained greedy algorithm (NORMAL) for various revenues
Results on Test Data: Per-person improvement over Mail-to-All
Conclusions / Future Work
- Marginal improvement over the standard decision-tree algorithm: almost every path in the “standard” trees contained a split on M. We expect a larger difference for other domains.
- Algorithm works for discounted prices (expected profit from mailing vs. profit from not mailing)
Part II: Targeted Advertising on the Web
Given information about a visitor, how do you choose which advertisement to display?
Goals of Targeted Advertising
- Maximize $$$
- Maximize clicks
- Brand presence
Naïve Targeting Scheme
Step 1: Cluster / segment users (Cluster 1, …, Cluster m)
Possible cluster attributes:
- Current page category
- Pages the user has visited on the site
- Known demographics
- Inferred demographics
- Previous advertisement clicks
Naïve Targeting Scheme
Step 2: Advertiser books ads into clusters
Step 3: Measure click probabilities
Step 4: Show best ad to each cluster
Problems (inventory management):
- Ad quotas
- Cluster overbooking
Advertisement Allocation
x_ij = number of times to show advertisement i to user cluster j

        Cluster 1   Cluster 2   …   Cluster m
Ad 1    x_11        x_12        …   x_1m
Ad 2    x_21        x_22        …   x_2m
…
Ad n    x_n1        x_n2        …   x_nm
Maximize Expected Clicks
Each cell of the schedule contributes p_ij x_ij expected clicks:

        Cluster 1     Cluster 2     …   Cluster m
Ad 1    p_11 x_11     p_12 x_12     …   p_1m x_1m
Ad 2    p_21 x_21     p_22 x_22     …   p_2m x_2m
…
Ad n    p_n1 x_n1     p_n2 x_n2     …   p_nm x_nm
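Inventory-Management Constraints
- Ad i: constraint on its row of the schedule, x_i1, …, x_im (impressions of ad i across all clusters)
- Cluster j: constraint on its column of the schedule, x_1j, …, x_nj (impressions shown to cluster j across all ads)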
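Linear Program
Find the schedule X that maximizes the expected number of clicks, Σ_i Σ_j p_ij x_ij
Subject to: the inventory-management constraints on each ad i and each cluster j
Solve using (e.g.) the simplex algorithm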
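To make the optimization concrete, here is a toy sketch that finds the best schedule by exhaustive search over integer schedules. It is a stand-in for a real LP solver such as the simplex algorithm and is only practical for tiny problems. It assumes equality constraints (each ad's row sums to a quota, each cluster's column sums to a capacity); the exact form of the constraints, the function names, and the probabilities are assumptions for illustration.

```python
from itertools import product


def compositions(total, parts):
    """Yield all tuples of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def best_schedule(p, quotas, capacities):
    """Brute-force the schedule x maximizing sum_ij p[i][j] * x[i][j].

    Assumed constraints: row i sums to quotas[i], column j sums to capacities[j].
    """
    n, m = len(quotas), len(capacities)
    rows = [list(compositions(q, m)) for q in quotas]
    best, best_clicks = None, float("-inf")
    for schedule in product(*rows):
        col_sums = [sum(row[j] for row in schedule) for j in range(m)]
        if col_sums != list(capacities):
            continue  # violates a cluster constraint
        clicks = sum(p[i][j] * schedule[i][j] for i in range(n) for j in range(m))
        if clicks > best_clicks:
            best, best_clicks = schedule, clicks
    return best, best_clicks


# Two ads, two clusters, all quotas and capacities equal to k = 3.
# The click probabilities are made up and differ only slightly.
p = [[0.010, 0.011],
     [0.011, 0.010]]
schedule, clicks = best_schedule(p, quotas=[3, 3], capacities=[3, 3])
```

Even with nearly equal probabilities, the optimum here puts every impression of each ad into a single cluster, which shows how strongly the schedule can depend on small differences in the estimates.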
A Simple Targeting System
1. Estimate the probabilities p_ij
2. Find the optimal schedule X
3. Serve ads to cluster j according to the schedule
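One plausible way to serve from a schedule is to pick, for each impression in cluster j, an ad at random with probability proportional to x_ij. The serving rule is not stated in the text here, so the proportional rule, the function name, and the example numbers below are assumptions.

```python
import random


def serve_ad(schedule_column, rng=random):
    """Pick an ad index for one impression in a cluster.

    schedule_column[i] is x_ij for the cluster being served; ad i is chosen
    with probability x_ij / sum_i x_ij (an assumed serving rule).
    """
    ads = list(range(len(schedule_column)))
    return rng.choices(ads, weights=schedule_column, k=1)[0]


# Hypothetical column of the schedule for one cluster: ad 0 gets 3 of every
# 4 impressions, ad 1 gets the rest.
rng = random.Random(0)
draws = [serve_ad([3, 1], rng) for _ in range(10_000)]
share_ad0 = draws.count(0) / len(draws)
```

Randomized serving of this kind does not hit the scheduled counts exactly on any given day; a production system would presumably also track remaining inventory.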
Sensitivity to Estimates
Two ads, two clusters, with q_1 = q_2 = c_1 = c_2 = k
Probabilities: p_ij for Ad 1 and Ad 2 in Cluster 1 and Cluster 2
Optimal schedule:

        Cluster 1   Cluster 2
Ad 1    0           k
Ad 2    k           0
Solution: Buckets
Two ads, two clusters, with q_1 = q_2 = c_1 = c_2 = k
Probabilities: p_ij for Ad 1 and Ad 2 in Cluster 1 and Cluster 2
Optimal schedule, with a + b + c + d = 2k:

        Cluster 1   Cluster 2
Ad 1    a           b
Ad 2    c           d

Secondary (linear) optimization: ads are shown as close to uniform as possible across all clusters
Passive Experiment: MSNBC (December 1998)
- Clusters defined by the current page group (such as Sports, News, Health, Opinion)
- Manual approach: advertisers buy impressions on page groups
Passive Experiment: MSNBC (December 1998)
- ~20 clusters, ~500 advertisements, ~1.6 million impressions / day
- Data from day 1: estimate p_ij (average ~4K data points per probability); find optimal schedule (less than 1 minute, no buckets)
- Data from day 2: re-estimate p_ij; evaluate schedule
- Result: 20 to 30% increase over manual schedule
Active Experiment on MSNBC (May 1999)
- Particular advertiser: 5 ads
- Data from weekend 1: estimate p_ij (~15K data points per probability); find optimal schedule (less than 1 second using buckets); rearrange advertisements for weekend 2
- Data from weekend 2: count the number of clicks and compare to weekend 1
Active Experiment Results
[Figure: clicks for the advertiser and for the control, weekend 1 (pre target) vs. weekend 2 (post target).]
- 30% increase for the advertiser, negligible increase for others
- Predicted a 20% increase on MSNBC
Extensions
- Problem: increasing total expected clicks across the site may decrease clicks for a particular advertiser
- Solution: add a (linear) constraint that expected clicks cannot decrease
- Passive experiment: MSNBC overall increase still ~20%
Extensions
- Focus of talk: p_ij = expected #clicks from showing ad i to user j
- In general: u_ij = expected utility from showing ad i to user j
- Expected utility of X = Σ_i Σ_j u_ij x_ij
- Alternative u_ij choices: weighted probabilities (w_i p_ij), probability of purchase, increase in brand awareness, expected revenue
Results on Test Data: Per-person improvement over Mail-to-All
To evaluate a test case given a model:
1. Evaluate the lift given X (ignoring M and S)
2. Recommend Mail if and only if Lift > 0
3. If the recommendation matches M from the test case, add r to the total revenue. Otherwise, ignore the case.
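The evaluation procedure above can be sketched as a loop over logged test cases that keeps only those whose logged mailing decision agrees with the model's recommendation. The slide says only "add r"; the sketch assumes this means crediting r when the person actually subscribed and charging c when the person was mailed. That reading, the function names, and the toy data are assumptions, not details confirmed by the source.

```python
def evaluate_policy(test_cases, lift_fn, c, r):
    """Score a mailing policy on logged (x, mailed, subscribed) test cases.

    A case counts only when the recommendation (mail iff lift_fn(x) > 0)
    matches what was actually done; other cases are ignored.
    Assumed payoff for a counted case: +r if subscribed, -c if mailed.
    """
    total, used = 0.0, 0
    for x, mailed, subscribed in test_cases:
        recommend_mail = lift_fn(x) > 0
        if recommend_mail != mailed:
            continue  # logged action disagrees with the policy: ignore
        used += 1
        if subscribed:
            total += r
        if mailed:
            total -= c
    return total, used


def toy_lift(x):
    """Hypothetical lift model: positive lift for 'blue', negative otherwise."""
    return 1.3 if x == "blue" else -0.5


test_cases = [
    ("blue", True, True),    # counted: mailed and subscribed
    ("blue", False, True),   # ignored: policy would have mailed
    ("red", False, False),   # counted: not mailed, no subscription
    ("red", True, True),     # ignored: policy would not have mailed
    ("red", False, True),    # counted: not mailed, subscribed anyway
]
total, used = evaluate_policy(test_cases, toy_lift, c=0.50, r=9)
```

Because the mailing decision in the experiment was randomized, the matching cases form an unbiased sample of what the policy would have produced, which is what makes ignoring the non-matching cases reasonable.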