Statistical Challenges in Online Advertising Deepak Agarwal Deepayan Chakrabarti (Yahoo! Research)

Online Advertising A young but multi-billion dollar industry with high growth, as evident from the phenomenal success of companies like Google, Yahoo!, and MSN: $9.7B in 2006 (a 17% increase), out of roughly $150B in total advertising spend. Why will this continue? Broadband is cheap and ubiquitous; people spend more time on the internet because search engines and other services make it easier to get things done online; content on the web is growing by leaps and bounds. This has caught the eye of the advertising industry, which is shifting dollars to the internet from other media such as television, radio, and newspapers. Why does it work, and work so well, when many bright minds dismissed its potential only a few years ago? It is an extremely large-scale system, with several billion transactions conducted every day in an almost automated fashion at low marginal cost. Only a small fraction of them monetize, but that is enough to make it a lucrative business. The key to success is monetizing more and better by automated learning from the massive amounts of data constantly flowing into the system. This has given rise to a new academic discipline: “Computational Advertising.”

What is “Computational Advertising”? A new, multi-disciplinary scientific sub-discipline at the intersection of large-scale search and text analysis, information retrieval, statistical modeling, machine learning, optimization, and microeconomics.

Online advertising: 6,000-ft Overview An ad network (examples: Yahoo!, Google, MSN, RightMedia, …) picks ads from advertisers and shows them to the user alongside a content provider's content. This shows only the content match scenario; in sponsored search, the content is replaced by a search query, and display advertising is different again. The revenue model is covered later.

Outline Background on online advertising The Fundamental Problem Sponsored Search, Content Match, Display, Unified marketplace The Fundamental Problem Statistical sub-problems: Description Existing methods Challenges

Online Advertising Different flavors. Advertising settings: Display, Content Match, Sponsored Search. Revenue models: CPM, CPC, CPA. Misc.: ad exchanges.

Revenue Models CPM: Cost Per iMpression. [Diagram: the ad network picks ads from the advertisers and shows them to the user alongside the content provider's content; the advertiser pays for each impression.]

Revenue Models CPC: Cost Per Click. [Diagram: same flow, but the advertiser pays only when the user clicks on the ad.]

Revenue Models CPA: Cost Per Action. [Diagram: same flow, but the advertiser pays only when the click leads to an action on the advertiser's landing page.]

Revenue Models Example: suppose we show an ad N times on the same spot. Under CPM: Revenue = N * CPM. Under CPC: Revenue = N * CTR * CPC, where CTR is the click-through rate (the probability of a click given an impression). Under CPC and CPA, revenue also depends on the auction mechanism.

Auction Mechanism Revenue depends on type of auction Generalized First-price: CPC = bid on clicked ad Generalized Second-price: CPC = bid of ad below clicked ad (or the reserve price) CPC could be modified by additional factors [Optimal Auction Design in a Multi-Unit Environment: The Case of Sponsored Search Auctions] by Edelman+/2006 [Internet Advertising and the Generalized Second Price Auction…] by Edelman+/2006
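The two pricing rules can be sketched in a few lines (a toy illustration: bids are assumed sorted high-to-low, the reserve price is a made-up number, and the bid-modifying factors mentioned above are ignored):

```python
def gsp_cpc(bids, clicked_pos, reserve=0.10):
    """CPC under a generalized second-price auction: pay the bid of the
    ad ranked just below the clicked ad, or the reserve price if the
    clicked ad is in the last slot. `bids` must be sorted high-to-low."""
    if clicked_pos + 1 < len(bids):
        return max(bids[clicked_pos + 1], reserve)
    return reserve

def gfp_cpc(bids, clicked_pos):
    """CPC under a generalized first-price auction: pay your own bid."""
    return bids[clicked_pos]

bids = [1.50, 0.90, 0.40]   # bids sorted high-to-low; amounts are made up
print(gsp_cpc(bids, 0))     # the ad in slot 0 pays the slot-1 bid: 0.9
print(gsp_cpc(bids, 2))     # the last ad pays the reserve price: 0.1
print(gfp_cpc(bids, 0))     # first-price: the ad pays its own bid, 1.5
```

The second-price rule is what makes truthful-ish bidding attractive: raising your bid changes your slot, not directly your price.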

Revenue Models Example: suppose we show an ad N times on the same spot. Under CPM: Revenue = N * CPM. Under CPC: Revenue = N * CTR * CPC. Under CPA: Revenue = N * CTR * Conversion Rate * CPA, where the conversion rate is the probability of a user conversion on the advertiser's landing page, given a click.
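The three revenue formulas above can be written directly as code (all prices, CTRs, and conversion rates below are made-up toy numbers; the CPM price is treated as per-impression rather than the customary per-thousand, to keep the example simple):

```python
def expected_revenue(n, model, price, ctr=0.0, conv_rate=0.0):
    """Expected revenue from showing an ad n times on the same spot,
    under the three pricing models from the slide."""
    if model == "CPM":
        return n * price
    if model == "CPC":
        return n * ctr * price
    if model == "CPA":
        return n * ctr * conv_rate * price
    raise ValueError("unknown model: " + model)

N = 10_000  # impressions
print(expected_revenue(N, "CPM", 0.002))                             # 20.0
print(expected_revenue(N, "CPC", 0.50, ctr=0.004))                   # 20.0
print(expected_revenue(N, "CPA", 5.00, ctr=0.004, conv_rate=0.10))   # 20.0
```

The numbers were chosen so all three models yield the same expected revenue, which is the point of converting everything to a common "expected CPM" scale later in the deck.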

Revenue Models What revenue depends on: CPM: website traffic. CPC: website traffic + ad relevance. CPA: website traffic + ad relevance + landing page quality. The models also differ in their relevance to advertisers, in prices and bids, and in the ease of picking ads.

Online Advertising Background Advertising settings: Display, Content Match, Sponsored Search. Revenue models: CPM, CPC, CPA. Misc.: ad exchanges.

Advertising Setting The ad network picks ads from the advertisers to show alongside the content provider's content. What do you show the user? How does the user interact with the ad system?

Advertising Setting: Display Graphical display ads, mostly for brand awareness. The revenue model is typically CPM.

Advertising Setting: Content Match Text ads are picked to match the content of the page.

Advertising Setting: Content Match The user intent is unclear; the query (the webpage) is long and noisy. The revenue model is typically CPC.

Advertising Setting: Sponsored Search Text ads are picked to match the user's search query.

Advertising Setting: Sponsored Search The user “declares” his/her intention, so the query is short and less noisy than in Content Match, and click rates are generally higher. The revenue model is typically CPC (recently some CPA).

Summary The revenue model depends on the goal of the advertiser campaign. Brand awareness: display advertising, pay per impression (CPM). Attracting users to the advertised product: Content Match and Sponsored Search, pay per click (CPC) or pay per action (CPA).

Online Advertising Background Advertising settings: Display, Content Match, Sponsored Search. Revenue models: CPM, CPC, CPA. Misc.: ad exchanges.

Unified Marketplace Publishers, ad networks, and advertisers participate together in a single exchange. Publishers put impressions into the exchange; advertisers and ad networks bid for them. CPM, CPC, and CPA are all integrated into a single auction mechanism.

Overview: The Open Exchange A publisher has an ad impression to sell and auctions it in the exchange. One bidder bids $0.50, another $0.60; Ad.com bids $0.75 via its network, which becomes a $0.45 bid after the network's cut; AdSense bids $0.65 and WINS. The exchange provides transparency and value.

Unified scale: Expected CPM Campaigns may be CPC, CPA, or CPM, yet they may all participate in an auction together. Converting them to a common denomination is a challenge.

Outline Background on online advertising The Fundamental Problem Statistical sub-problems: Description Existing methods Challenges

Outline Background on online advertising The Fundamental Problem Display advertising Sponsored Search and Content Match Statistical sub-problems: Description Existing methods Challenges

Display Advertising

Does it work? Lewis and Reiley, 2009, “Retail Advertising Works!” (Yahoo! technical report): a controlled experiment assigning customers of a large retailer to treatment and control groups. Advertising significantly improved purchases, both online and offline. Good news, since the main goal is brand awareness, and under the typical CPM revenue model the advertiser takes all the risk.

Display advertising: Buyer and Seller Advertiser (buyer): buys ad space well in advance or in the spot market; may buy from a publisher with or without a guarantee; guaranteed contracts typically cost more. Publisher (seller): sells in advance (guaranteed) or in the spot market.

Display Advertising Main goal of advertisers: Brand Awareness Revenue Model: Primarily Cost per impression (CPM) Traditional Advertising Model: Ads are targeted at particular demographics (user characteristics) GM ads on Y! autos shown to “males above 55” Mortgage ad shown to “everybody on Y! Front page” Book a slot well in advance “2M impressions in Jan next year” These future impressions must be guaranteed by the ad network

Display Advertising Fundamental Problem: guarantee impressions to advertisers. Predict supply: how many impressions will be available? Demographics overlap. [Diagram: a Venn diagram of user attributes such as US, Young, Female, and Y! Mail, with impression counts in each region.] Predict demand: how much will advertisers want each demographic?

Display Advertising Fundamental Problem: guarantee impressions to advertisers. Predict supply (si for each demographic region) and demand (dj for each contract), then find the optimal allocation xij subject to supply and demand constraints. This gives rise to an allocation problem: which supply regions should serve which contracts? Several feasible solutions exist; which one is best, given demand and supply?

Display Advertising Fundamental Problem: guarantee impressions to advertisers. Predict supply, predict demand, and find the optimal allocation subject to constraints. Optimal in terms of what objective function? That depends on the goal of the ad network. Advertisers don't specify too many targets (it is dangerous for ad networks to allow that, since fine-grained targets are difficult to forecast), BUT they expect good inventory: the campaign should result in good conversion rates.

Allocation through Optimization Optimal in terms of what objective function? E.g., maximize the value of remaining inventory: cherry-pick valuable inventory and save it for later. Or fairness: “spread the wealth.” Either way, the allocations xij are subject to the supply (si) and demand (dj) constraints.

Example Supply pools: {US, Young, not Female} with supply = 2 and price = 1; {US, Young, Female} with supply = 3 and price = 5. Demand: a contract for 2 “US & Young” impressions. How should we distribute impressions from the supply pools to satisfy this demand?

Example (Cherry-picking) Cherry-picking: fulfill demands at least cost. Both impressions of the “US & Young” demand are served from the cheaper pool, {US, Young, not Female} at price 1.

Example (Fairness) Fairness: equitable distribution across the available supply pools. One impression is served from each pool, {US, Young, not Female} at cost 1 and {US, Young, Female} at cost 5.
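The cherry-picking strategy can be sketched as a greedy allocation over one demand (a simplification: real allocators solve an optimization over all contracts at once; the pool names and numbers follow the toy example above):

```python
def cherry_pick(demand, pools):
    """Greedy 'cherry-picking' allocation: fulfill a demand for
    impressions from the cheapest eligible supply pools first, saving
    expensive inventory for later. `pools` is a list of
    (name, supply, price) tuples the demand may draw from."""
    allocation = {}
    remaining = demand
    for name, supply, price in sorted(pools, key=lambda p: p[2]):
        if remaining <= 0:
            break
        take = min(supply, remaining)   # take as much as the cheap pool allows
        allocation[name] = take
        remaining -= take
    if remaining > 0:
        raise ValueError("demand cannot be met from these pools")
    return allocation

# The slide's example: 2 'US & Young' impressions, two overlapping pools.
pools = [("US,Y,nF", 2, 1), ("US,Y,F", 3, 5)]
print(cherry_pick(2, pools))   # the cheap pool covers the whole demand
```

A fairness objective would instead spread the demand across pools regardless of price, as in the example above; the difference is only in the objective, not the constraints.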

Example of an objective function

Display Advertising Fundamental Problem: guarantee impressions to advertisers. Predict supply, predict demand, find the optimal allocation subject to constraints, and pick the right objective function. Further issues: Risk management: supply and demand forecasts should have both a mean and a variance; we cannot forecast everything accurately, so solutions should use the variance to manage risk. Forecast aggregation: forecasts may be needed at multiple resolutions, in time and in demographics; a challenging time-series problem. Adapting the system to external events: financial-crisis news increases traffic to Finance pages, and the forecasts should adapt to take advantage.

Display Advertising Fundamental Problem: guarantee impressions to advertisers. Predict supply, predict demand, find the optimal allocation subject to constraints, and pick the right objective function. Forecasting accuracy is critical! Overshoot → under-delivery of impressions → unhappy advertisers. Undershoot → loss in revenue.

Outline Background on online advertising The Fundamental Problem Display advertising Sponsored Search and Content Match Statistical sub-problems: Description Existing methods Challenges

Sponsored Search and Content Match Given a query: Select the top-k ads to be shown on the k slots to maximize total expected revenue What is total expected revenue?

Example (Content Match) Relevant ads are placed in ad positions 1, 2, and 3 when the user visits a webpage; no query is specified by the user. It is hard to guess what the user wants; we can only use the context (the type of page being viewed) and the user's other characteristics (browsing behavior, demographics). Payment happens for clicks.


Reminder: Auction Mechanism Revenue depends on type of auction Generalized First-price: CPC = bid on clicked ad Generalized Second-price: CPC = bid of ad below clicked ad (or the reserve price) CPC could be modified by additional factors Total expected revenue = revenue obtained in a given time window [Optimal Auction Design in a Multi-Unit Environment: The Case of Sponsored Search Auctions] by Edelman+/2006 [Internet Advertising and the Generalized Second Price Auction…] by Edelman+/2006

Sponsored Search and Content Match Given a query: select the top-k ads to be shown in the k slots to maximize total expected revenue. What affects the total revenue? The obvious factors: relevance of the ad to the query, and bids on the ads. Less obvious and longer-term: the user experience on the ad landing page (ad “quality”), which must also be incorporated into the ranking formula. Expected total revenue is some function of these; optimizing merely by CTR is myopic.

Sponsored Search and Content Match Given a query: Select the top-k ads to be shown on the k slots to maximize total expected revenue Fundamental Problem: Estimate relevance of the ad to the query

Ad Relevance Computation

Overview Information Retrieval (IR) Techniques Challenges Machine Learning using Click Feedback Online Learning

IR-based ad matching “Why not use a search engine to match ads to context?” Ads are the “documents” Context (user query or webpage content) is the “query” Three broad approaches: Vector space models Probabilistic models Language models Open-source software is available: Lemur (www.lemurproject.org)

IR-based ad matching Vector space models: each word/phrase in the vocabulary is a separate dimension, and each ad and query is a point in this vector space. Example: cosine similarity. (Probabilistic models and language models are covered next.)

IR-based ad matching Q1: How can we score the goodness of an ad for a context? Cosine similarity between the query vector and the ad vector. Advantages: simple and easy to interpret; normalizes for different ad and context lengths.
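A minimal sketch of the cosine score over raw term-frequency vectors (no idf weighting or phrase features; the example strings are made up):

```python
import math
from collections import Counter

def cosine(query_text, ad_text):
    """Cosine similarity between bag-of-words term-frequency vectors:
    dot product divided by the product of vector lengths, so scores
    are comparable across ads and contexts of different lengths."""
    q = Counter(query_text.lower().split())
    a = Counter(ad_text.lower().split())
    dot = sum(q[w] * a[w] for w in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in a.values())))
    return dot / norm if norm else 0.0

print(round(cosine("cheap digital camera", "digital camera sale"), 3))  # 0.667
print(cosine("car insurance", "digital camera"))                        # 0.0
```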

IR-based ad matching Probabilistic models: predict, for every (ad, query) pair, the probability that the ad is relevant to the query. Example: Okapi BM25.

IR-based ad matching Q1: How can we score the goodness of an ad for a context? Okapi BM25: for each query term, multiply an inverse document frequency weight by a term-frequency factor for the ad, normalized by document length and controlled by tunable parameters, and sum over the query terms. Advantages: different terms are weighted differently; tunable parameters; good performance.
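A sketch of the BM25 score just described, using one common variant of the idf formula (k1 and b are the usual tunable parameters; the query term-frequency component is omitted since each query term here appears once; the tiny corpus is made up):

```python
import math

def bm25(query_terms, ad_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one ad for a query. `corpus` is the list of
    all ads (each a list of terms), used for the idf weights and the
    average document length."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # +1 keeps idf positive
        tf = ad_terms.count(term)                         # term frequency in the ad
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(ad_terms) / avgdl))
    return score

corpus = [["digital", "camera", "sale"], ["car", "insurance", "quote"]]
query = ["digital", "camera"]
print(bm25(query, corpus[0], corpus) > bm25(query, corpus[1], corpus))  # True
```

The b parameter controls how strongly long ads are penalized; k1 controls how quickly repeated terms saturate.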

IR-based ad matching Vector space models Probabilistic models Language models: Ads and queries are generated by statistical models of how words are used in the language What statistical models can be used? How do we translate query and ad generation probabilities into relevance?

IR-based ad matching What statistical models can be used? A bigram model, or a multinomial model over term probabilities (the model parameters). Given any ad or query, with its total length and term frequencies, we can compute the parameter setting most likely to have generated the document.

IR-based ad matching How do we translate query and ad generation probabilities into relevance? Method 1: compute the most likely query and ad parameters, then generate the ad using the query parameters. High probability → high relevance.

IR-based ad matching How do we translate query and ad generation probabilities into relevance? Method 2: compute the most likely query and ad parameters, then generate the query using the ad parameters. High probability → high relevance.

IR-based ad matching How do we translate query and ad generation probabilities into relevance? Method 3: compute the most likely query and ad parameters, then compute the KL-divergence between the two parameter settings. Low KL-divergence → high relevance.
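Method 3 can be sketched with multinomial (unigram) models, with smoothing so the KL-divergence stays finite (the add-mu smoothing and the toy texts are illustrative choices, not the slide's prescribed ones):

```python
import math
from collections import Counter

def unigram_lm(terms, vocab, mu=1.0):
    """Maximum-likelihood multinomial model with add-mu smoothing, so
    every vocabulary word gets nonzero probability."""
    tf = Counter(terms)
    total = len(terms) + mu * len(vocab)
    return {w: (tf[w] + mu) / total for w in vocab}

def kl(p, q):
    """KL(p || q); low divergence between the query model and the ad
    model is read as high relevance."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

query = ["digital", "camera"]
ad1, ad2 = ["digital", "camera", "sale"], ["car", "insurance"]
vocab = set(query + ad1 + ad2)
pq = unigram_lm(query, vocab)
print(kl(pq, unigram_lm(ad1, vocab)) < kl(pq, unigram_lm(ad2, vocab)))  # True
```

The matching ad's model is closer to the query's model, so its divergence is lower.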

Overview Information Retrieval (IR) Techniques Challenges Machine Learning using Click Feedback Online Learning

Challenges of IR-based ad matching Word matches might not always work

Woes of word matching Extracting topical information increases coverage and gives more relevant matches.

IR-based ad matching New methods to combine syntactic and semantic information For example, “A Semantic Approach to Contextual Advertising” by Broder+/SIGIR/2007 Words only provide syntactic clues Classify ads and queries into a common taxonomy Taxonomy matches provide semantic clues

Challenges of IR-based ad matching Word matches might not always work Works well for frequent words, what about rare words? Long tail, big revenue impact. Remedy: Add more matching dimensions (phrase,…) Static, does not capture effect of external factors E.g. high interest in basketball page due to an event; dies off after the event Click feedback a powerful way of capturing such latent effects; difficult to do it through relevance only Relevance scores may not correspond to CTR; does not provide estimates of expected revenue

Challenges of IR-based ad matching Heterogeneous corpus (query, ads). Single tfidf scores not applicable. In content match, queries long and noisy Partial feedback does not work Not scalable Ads are small, relevance of landing page difficult to determine (video, image, text)

Machine Learning using Click Feedback

Overview Information Retrieval (IR) Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning

Learning from Click Feedback Learning relevance from partial human-labeled training data Attractive but not scalable Users provide us direct feedback through ad clicks Low cost and automated learning mechanism Large amounts of feedback for big ad-networks Estimation problem: Estimate CTR = Pr(click| query, ad, user)

Learning from Clicks: Challenges Noisy labels Clicks (unscrupulous users gaming the system) Negatives (not clear; I never click on ads ) Sparseness (query, ad) matrix has billions of cells; long tail Too few data points in large number of cells; MLE has high variance Goal is to learn the best cells, not all cells Dynamic and seasonal effects CTRs evolve; subject to seasonal effects Summer, Halloween,.. Palin ads popular yesterday, not today

Challenges continued Selection bias: we never showed watch ads on golf pages. Positional bias and presentation bias: the same ad performs differently at different positions. Slate bias: the performance of an ad depends on the other ads that were displayed with it.

Overview Information Retrieval (IR) Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning

Feature based approach Query, Ad characterized by features Query: bag-of-words, phrases, topic,… Ads: bag-of-words, keywords, size,… Query feature vector: q Ad feature vector: a Pr(Click|Q,A) = f(q,a;θ) Example: Logistic regression log-odds(Pr(Click|Q,A)) = q’ W a W estimated from data
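The bilinear logistic model log-odds(Pr(Click|Q,A)) = q' W a can be evaluated directly once W is estimated (W and the feature vectors below are made-up toy values, not fitted parameters):

```python
import math

def click_probability(q, a, W):
    """Pr(Click | Q, A) with log-odds = q' W a: a bilinear form in the
    query features q and the ad features a, with weight matrix W."""
    log_odds = sum(q[i] * W[i][j] * a[j]
                   for i in range(len(q)) for j in range(len(a)))
    return 1.0 / (1.0 + math.exp(-log_odds))

q = [1.0, 0.0]        # toy query features (e.g., topic indicators)
a = [1.0, 1.0]        # toy ad features
W = [[0.8, -0.2],     # W[i][j]: weight tying query feature i to ad feature j
     [0.1, 0.3]]
print(round(click_probability(q, a, W), 3))  # 0.646
```

W is what gets estimated from click data; with bag-of-words features it is huge and sparse, which is why the regularization discussed next matters.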

Feature based models: Challenges High dimensional, need to regularize (Priors) De-bias for positional and slate effects Negative events to be weighted appropriately Go through case studies reported in literature

Predicting Clicks: Estimating the Click-Through Rates of New Ads, Richardson et al., WWW 2007. Estimates the CTR of new ads in sponsored search: log-odds(CTR(ad)) = Σ wi fi(ad). Features used: CTRs of related ads sharing the bid term, from other accounts (e.g., the CTRs of all other ads with the keyword “camera”); appearance; attention; advertiser reputation; landing page quality; relevance of bid terms to the ad; bag-of-words in the ad. Negative events are down-weighted based on an eye-tracking study. Does not capture interactions between query and ad; the main focus is estimating the CTR of new ads only.

Combining relevance with Click Feedback, Chakrabarti et al, WWW 08 Content Match application CTR estimation for arbitrary (page, ad) pairs Features : Bag-of-words in query, ads; relevance scores from IR Cross-product of words: Occurs in both page and ad Learn to predict click data using such features Prediction function amenable to WAND algorithm Helps with fast retrieval at serve time

Proposed Method A logistic regression model for CTR. The model parameters are a main effect for the page (how good is the page), a main effect for the ad (how good is the ad), and an interaction effect (words shared by the page and the ad).

Proposed Method Mp,w = tfp,w Ma,w = tfa,w Ip,a,w = tfp,w * tfa,w So, IR-based term frequency measures are taken into account

Proposed Method Two sources of complexity Adding in IR scores Word selection for efficient learning

Proposed Method How can IR scores fit into the model? What is the relationship between logit(pij) and the cosine score? Empirically, a quadratic relationship.

Proposed Method How can IR scores fit into the model? This quadratic relationship can be used in two ways Put in cosine and cosine2 as features Use it as a prior

Proposed Method Word selection Overall, nearly 110k words in corpus Learning parameters for each word would be: Very expensive Require a huge amount of data Suffer from diminishing returns So we want to select ~1k top words which will have the most impact

Proposed Method Word selection Data based: Define an interaction measure for each word Higher values for words which have higher-than-expected CTR when they occur on both page and ad

Experiments Precision-recall results: a 25% lift in precision at 10% recall.

Overview Information Retrieval (IR) Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning

Regelson and Fain, 2006 Estimate the CTR of terms by “borrowing strength” at multiple resolutions: hierarchically cluster related terms (by clustering the advertiser-keyword matrix), then estimate CTR at finer resolutions using information from coarser resolutions, via a weighted average with more weight on the finer resolutions. The weights are selected heuristically; there is no principled approach.

Estimation in the “tail” A more principled approach to “Estimating Rates of Rare Events at Multiple Resolutions” [KDD/2007] Contextual Advertising Show an ad on a webpage (“impression”) Revenue is generated if a user clicks Problem: Estimate the click-through rate (CTR) of an ad on a page Most (ad, page) pairs have very few impressions, if any, and even fewer clicks Severe data sparsity

Estimation in the “tail” Use an existing, well-understood hierarchy Categorize ads and webpages to leaves of the hierarchy CTR estimates of siblings are correlated The hierarchy allows us to aggregate data Coarser resolutions provide reliable estimates for rare events which then influences estimation at finer resolutions

System overview Retrospective data [URL, ad, isClicked] → crawl a sample of URLs → classify pages and ads → rare-event estimation using the hierarchy → impute impressions, fix sampling bias.

Sampling of webpages Naïve strategy: sample at random from the set of URLs Sampling errors in impression volume AND click volume Instead, we propose: Crawling all URLs with at least one click, and a sample of the remaining URLs Variability is only in impression volume Sampling bias adjusted through statistical procedure (details in the paper) Only sampling pages, not ads. All ad information is available.


Rare rate modeling Freeman-Tukey transform: yij = F-T(clicks and impressions at node ij) ≈ a transformed CTR. This is a variance-stabilizing transformation: Var(y) is independent of E[y], which is needed in the modeling that follows.
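One standard form of the Freeman-Tukey transform for a binomial rate is the double-arcsine below (the paper's exact variant may differ; the click counts are made up). Its variance is roughly 1/(4n), independent of the underlying CTR, which is the stabilization the slide refers to:

```python
import math

def freeman_tukey(clicks, impressions):
    """Freeman-Tukey double-arcsine transform of a click-through rate:
    the average of two arcsine-square-root terms, which keeps the
    transform finite and well-behaved even at zero clicks."""
    n = impressions
    return 0.5 * (math.asin(math.sqrt(clicks / (n + 1)))
                  + math.asin(math.sqrt((clicks + 1) / (n + 1))))

print(freeman_tukey(3, 1000) > freeman_tukey(0, 1000))  # True: monotone in clicks
print(freeman_tukey(0, 1000) > 0.0)                     # True: finite at zero clicks
```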

Rare rate modeling Generative model: a tree-structured Markov model. Each node ij has an unobserved “state” Sij that evolves from its parent's state Sparent(ij) with variance Wij; covariates enter through coefficients βij; and the observation yij is emitted from Sij with variance Vij.

Rare rate modeling Model fitting with a 2-pass Kalman filter: Filtering: Leaf to root Smoothing: Root to leaf Linear in the number of regions

Tree-structured Markov model

Scalable Model fitting Multi-resolution Kalman filter

Multi-Resolution Kalman filter: Mathematical overview Br: correlation between two sibling regions at level dr

Experiments 503M impressions; a 7-level hierarchy, of which the top 3 levels were used. Zero clicks in 76% of regions at level 2 and 95% of regions at level 3. Full dataset DFULL, and a 2/3 sample DSAMPLE.

Experiments Estimate CTRs for all regions R in level 3 with zero clicks in DSAMPLE Some of these regions R>0 get clicks in DFULL A good model should predict higher CTRs for R>0 as against the other regions in R

Experiments We compared 4 models TS: our tree-structured model LM (level-mean): each level smoothed independently NS (no smoothing): CTR proportional to 1/Ñ Random: Assuming |R>0| is given, randomly predict the membership of R>0 out of R

Experiments [Plot comparing the four models: TS, Random, LM, NS.]

Experiments Few impressions → estimates depend more on siblings. Enough impressions → little “borrowing” from siblings.

Related Work Multi-resolution modeling: studied in time series modeling and spatial statistics [Openshaw+/79, Cressie/90, Chou+/94] Imputation: studied in statistics [Darroch+/1972] Application of such models to estimation of such rare events (rates of ~10^-3) is novel

Summary A method to estimate rates of extremely rare events at multiple resolutions under severe sparsity constraints The method has two parts: Imputation → incorporates hierarchy, fixes sampling bias Tree-structured generative model → extremely fast parameter fitting

Overview Information Retrieval (IR) Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning

Collaborative Filtering Similarity-based methods: use an ad-ad similarity matrix; predict the rating (CTR) for query u of ad i from the local neighborhood of ad i
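A minimal sketch of the similarity-based prediction just described; `predict_ctr`, the `ctr` dictionary, and the `sim` matrix are hypothetical names, and the neighborhood rule (the k ads most similar to i with an observation for the query) is one plausible choice, not the tutorial's specific method:

```python
import numpy as np

def predict_ctr(u, i, ctr, sim, k=3):
    """Similarity-based CTR prediction (a sketch).

    ctr: dict mapping (query, ad) -> observed CTR
    sim: ad-ad similarity matrix (ads indexed 0..m-1)
    Prediction = similarity-weighted average of query u's observed
    CTRs over the k ads most similar to ad i.
    """
    # local neighborhood: most-similar ads that have an observation for u
    neigh = [(sim[i, j], ctr[(u, j)])
             for j in np.argsort(-sim[i]) if j != i and (u, j) in ctr][:k]
    if not neigh:
        return None                      # cold start: no usable neighbors
    w = np.array([s for s, _ in neigh])
    r = np.array([c for _, c in neigh])
    return float(np.dot(w, r) / w.sum())
```

The cold-start `None` branch is exactly where the challenges on the next slide (learning the similarity, combining query and ad similarities) bite.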

Collaborative Filtering Similarity-based methods: possible adaptation combining a feature-based model with a collaborative filtering model Challenges: learning the similarity; simultaneously incorporating query and ad similarities

Matrix Factorization Each query (ad) is a linear combination of latent factors Solve for the factors, under some regularization and constraints, yielding factor coefficients for each ad and each query
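A sketch of the idea under stated assumptions: squared-error matrix factorization of the observed query × ad CTR matrix, fitted by SGD with L2 regularization; the function name and hyper-parameters are illustrative, not the authors' method:

```python
import numpy as np

def factorize(R, mask, k=2, lr=0.02, reg=0.05, epochs=500, seed=0):
    """Factorize an observed query x ad CTR matrix R ≈ U @ V.T (a sketch).

    mask[q, a] = 1 where a CTR observation exists.  U[q] / V[a] are the
    latent factor coefficients for query q / ad a; reg is the L2
    regularization weight.
    """
    rng = np.random.default_rng(seed)
    n_q, n_a = R.shape
    U = 0.1 * rng.standard_normal((n_q, k))
    V = 0.1 * rng.standard_normal((n_a, k))
    obs = list(zip(*np.nonzero(mask)))
    for _ in range(epochs):
        for q, a in obs:
            err = R[q, a] - U[q] @ V[a]          # residual on one cell
            U[q] += lr * (err * V[a] - reg * U[q])
            V[a] += lr * (err * U[q] - reg * V[a])
    return U, V
```

Predictions for unobserved (query, ad) cells then come from the same inner product U[q] @ V[a], which is what makes the factorization useful for sparse click data.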

Matrix Factorization Bi-clustering: Predictive Discrete Latent Factor models, Agarwal and Merugu, KDD 07

Overview Information Retrieval (IR) Machine Learning using Click Feedback Advantages and Challenges of Click Feedback Feature-based models Description Case Studies Hierarchical Models Matrix Factorization and Collaborative Filtering Challenges and Open Problems Online Learning

Challenges of Feature-based models Learns from clicks but still misses context in many instances, as in the relevance-based approach Introducing features that are too granular makes it hard to learn CTR reliably Does not capture the dynamics of the system Training cost is high Slow prediction functions are inadmissible due to latency constraints

Challenges of Feature-based models Other methods: Boosting, Neural nets, Decision Trees, Random Forests, … Local models Mixture of experts: fit local, think global Hierarchical modeling with multiple trees (user interest, query, ad, …), where each tree is different How to perform smoothing with multiple disparate trees?

Challenges of Feature-based models Combining cold start with warm start is the main challenge in collaborative-filtering-based methods We believe solving basic issues is more challenging: Positional bias Selection bias Correlation in ads on a slate Dynamic CTR; seasonal variations

Online learning

Overview Information Retrieval (IR) Machine Learning using Click Feedback Online Learning

Online learning for ad matching All previous approaches learn from historical data This has several drawbacks: Slow response to emerging patterns in the data due to special events like elections, … Initial systemic biases are never corrected If the system has never shown “sound system dock” ads for the “iPod” query, it can never learn if this match is good System needs to be retrained periodically

Online learning for ad matching Solution: Combining exploitation with exploration Exploitation: Pick ads that are good according to current model Exploration: Pick ads that increase our knowledge about the entire space of ads Multi-armed bandits Background Applications to online advertising Challenges and Open Problems

Background: Bandits Bandit “arms” p1 p2 p3 (unknown payoff probabilities) “Pulling” arm i yields a reward: reward = 1 with probability pi (success) reward = 0 otherwise (failure)

Background: Bandits Bandit "arms" p1 p2 p3 (unknown payoff probabilities) Goal: Pull arms sequentially so as to maximize the total expected reward Estimate the payoff probabilities pi Bias the estimation process towards better arms

Background: Bandits An algorithm to sequentially pick the arms is called a bandit policy Regret of a policy = how much extra payoff could be gained in expectation if the best arm is always pulled Of course, the best arm is not known to the policy Hence, the regret is the price of exploration Low regret implies that the policy quickly converges to the best arm What is the optimal policy?

Background: Bandits Which arm should be pulled next? Not necessarily what looks best right now, since it might have had a few lucky successes Seems to depend on some complicated function of the numbers of successes si and failures fi of all arms: argmax g(s1, f1, s2, f2, …, sk, fk)?

Background: Bandits What is the optimal policy? Consider a bandit which has an infinite time horizon, but future rewards are geometrically discounted: Rtotal = R(1) + γ·R(2) + γ²·R(3) + … (0 < γ < 1) Theorem [Gittins/1979]: The optimal policy decouples and solves a bandit problem for each arm independently: argmax g(s1, f1, s2, f2, …, sk, fk) becomes argmax {g1(s1, f1), g2(s2, f2), …, gk(sk, fk)}

Background: Bandits What is the optimal policy? Theorem [Gittins/1979]: The optimal policy decouples and solves a bandit problem for each arm independently Significantly reduces the dimension of the problem space But, the optimal functions gi(si, fi) are hard to compute Need approximate methods…

Background: Bandits Bandit policy: assign a priority to each arm; "pull" the arm with max priority and observe the reward; update the priorities The two components: allocation (which arm to pull) and estimation (updating priorities)

Background: Bandits One common policy is UCB1 [Auer/2002]: priority of arm i = si/(si + fi) + sqrt(2 ln T / (si + fi)), where si and fi are the numbers of successes and failures of arm i (so si + fi = number of observations of arm i) and T is the total number of observations The first term is the observed payoff; the second is a factor representing uncertainty
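A sketch of UCB1 on simulated Bernoulli arms, showing priority = observed payoff + uncertainty bonus; the simulation harness (true probabilities passed in only to generate rewards) is illustrative:

```python
import math
import random

def ucb1(arms, pulls):
    """UCB1 [Auer/2002] over Bernoulli arms (a sketch).

    arms: list of true success probabilities, unknown to the policy
    and used only to simulate rewards.  Returns the per-arm pull
    counts and empirical payoff estimates.
    """
    k = len(arms)
    n = [0] * k                       # observations per arm
    s = [0] * k                       # successes per arm
    for t in range(1, pulls + 1):
        if t <= k:                    # pull each arm once to initialize
            i = t - 1
        else:                         # priority = payoff + uncertainty bonus
            i = max(range(k),
                    key=lambda j: s[j] / n[j] + math.sqrt(2 * math.log(t) / n[j]))
        n[i] += 1
        s[i] += random.random() < arms[i]
    return n, [s[j] / n[j] for j in range(k)]
```

As the slides note, sub-optimal arms keep getting occasional pulls (the bonus grows with ln T), but only O(log T) of them.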

Background: Bandits As the total number of observations T becomes large: the observed payoff tends asymptotically towards the true payoff probability The system never completely "converges" to one best arm; only the rate of exploration tends to zero

Background: Bandits Sub-optimal arms are pulled only O(log T) times Hence, UCB1 has O(log T) regret, which is the lowest possible regret

Online learning for ad matching Solution: Combining exploitation with exploration Exploitation: Pick ads that are good according to current model Exploration: Pick ads that increase our knowledge about the entire space of ads Multi-armed bandits Background Applications to online advertising Challenges and Open Problems

Background: Bandits In content match: ~10^9 webpages, ~10^6 ads Bandit "arms" = ads; each webpage defines its own bandit

Background: Bandits Content Match = a webpages × ads matrix Each row (webpage) is a bandit Each cell has an unknown CTR

Background: Bandits Why not simply apply a bandit policy directly to our problem? Convergence is too slow: ~10^9 bandits, with ~10^6 arms per bandit Additional structure is available that can help → taxonomies

Taxonomies for dimensionality reduction Taxonomies (e.g., Apparel, Computers, Travel) already exist and are actively maintained Existing classifiers map pages and ads to taxonomy nodes A bandit policy that uses this structure can be faster

Outline Multi-level Bandit Policy for Content Match Experiments Summary

Multi-level Policy Consider only two levels: webpage classes × ad classes

Multi-level Policy Ad parent classes (Apparel, Computers, Travel, …) and ad child classes Cells under the same pair of parent classes form a block; each row is one bandit

Multi-level Policy Key idea: CTRs in a block are homogeneous

Multi-level Policy CTRs in a block are homogeneous Used in allocation (picking ad for each new page) Used in estimation (updating priorities after each observation)

Multi-level Policy CTRs in a block are homogeneous Used in allocation (picking ad for each new page) Used in estimation (updating priorities after each observation)

Multi-level Policy (Allocation) Classify the webpage → page class, parent page class Run a bandit on the ad parent classes → pick one ad parent class

Multi-level Policy (Allocation) Classify the webpage → page class, parent page class Run a bandit on the ad parent classes → pick one ad parent class Run a bandit among the cells of that block → pick one ad class In general, continue from root to leaf → final ad

Multi-level Policy (Allocation) Bandits at higher levels use aggregated information and have fewer bandit arms → they quickly figure out the best ad parent class
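The allocation steps above can be sketched as nested UCB1-style choices, coarse level first; all data structures here (`stats`, `taxonomy`, the key layout) are illustrative assumptions, not the paper's implementation:

```python
import math

def two_level_pick(page_parent, stats, taxonomy, t):
    """Two-level allocation (a sketch): run a UCB1-style bandit over ad
    parent classes first, then over the ad child classes (cells) inside
    the chosen block.

    stats[key] = [successes, observations]; taxonomy maps each ad
    parent class to its list of child classes; t = total observations.
    """
    def ucb(keys):
        def prio(key):
            s, n = stats.get(key, [0, 0])
            if n == 0:
                return float("inf")      # force initial exploration
            return s / n + math.sqrt(2 * math.log(t) / n)
        return max(keys, key=prio)

    parent = ucb([(page_parent, ap) for ap in taxonomy])       # coarse level
    ad_parent = parent[1]
    cell = ucb([(page_parent, ad_parent, ac)                   # fine level
                for ac in taxonomy[ad_parent]])
    return cell
```

The coarse-level bandit has only as many arms as parent classes, which is why it aggregates evidence and converges quickly, as the slide notes.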

Multi-level Policy CTRs in a block are homogeneous Used in allocation (picking ad for each new page) Used in estimation (updating priorities after each observation)

Multi-level Policy (Estimation) CTRs in a block are homogeneous Observations from one cell also give information about others in the block How can we model this dependence?

Multi-level Policy (Estimation) Shrinkage model: Scell | CTRcell ~ Bin(Ncell, CTRcell), where Scell = # clicks and Ncell = # impressions in the cell CTRcell ~ Beta(Paramsblock) All cells in a block come from the same distribution

Multi-level Policy (Estimation) Intuitively, this leads to shrinkage of cell CTRs towards block CTRs: E[CTR] = α·Priorblock + (1−α)·Scell/Ncell, i.e., estimated CTR = a convex combination of the Beta prior ("block CTR") and the observed CTR
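Under the Beta-Binomial model on the previous slide, the shrinkage estimate is exactly the posterior mean; a minimal sketch (the Beta parameter names are assumed):

```python
def shrunk_ctr(clicks, impressions, alpha_block, beta_block):
    """Posterior-mean CTR under the shrinkage model (a sketch).

    CTR_cell ~ Beta(alpha_block, beta_block), shared by all cells in a
    block; S_cell | CTR_cell ~ Bin(N_cell, CTR_cell).  The posterior
    mean (alpha + S) / (alpha + beta + N) equals the slide's convex
    combination alpha_weight * prior_mean + (1 - alpha_weight) * observed_ctr
    with alpha_weight = (alpha + beta) / (alpha + beta + N).
    """
    a, b, n = alpha_block, beta_block, impressions
    return (a + clicks) / (a + b + n)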

Experiments Taxonomy structure: Root at depth 0, 20 nodes at depth 1, 221 nodes at depth 2, …, ~7000 leaves at depth 7 We use the 2 levels at depths 1 and 2

Experiments Data collected over a 1 day period Collected from only one server, under some other ad-matching rules (not our bandit) ~229M impressions CTR values have been linearly transformed for purposes of confidentiality

Experiments (Multi-level Policy) [Plot: #clicks vs. number of pulls] Multi-level gives a much higher #clicks

Experiments (Multi-level Policy) [Plot: mean-squared error vs. number of pulls] Multi-level gives a much better mean-squared error → it has learnt more from its explorations

Experiments (Shrinkage) [Plots: #clicks and mean-squared error vs. number of pulls, with and without shrinkage] Shrinkage → improved mean-squared error, but no gain in #clicks

Summary Taxonomies exist for many datasets They can be used for dimensionality reduction: Multi-level bandit policy → higher #clicks Better estimation via shrinkage models → better MSE

Online learning for ad matching Solution: Combining exploitation with exploration Exploitation: Pick ads that are good according to current model Exploration: Pick ads that increase our knowledge about the entire space of ads Multi-armed bandits Background Applications to online advertising Challenges and Open Problems

Challenges and Open Problems Bandit policies typically assume stationarity But sudden changes are the norm in the online advertising world: Ads may be suddenly removed when they run out of budget New ads are constantly added to the system The total number of ads is huge, and full exploration may be too costly Mortal multi-armed bandits [NIPS/2008] Algorithms for infinitely many-armed bandits [NIPS/2008]

Mortal Multi-armed Bandits Traditional bandit policies like UCB1 spend a large fraction of their initial pulls on exploration Hard-earned knowledge may be lost due to finite arm lifetimes Method 1 (Sampling): Pick a random sample from the set of available arms Run UCB1 on sample, until some fraction of arms in the sample are lost Pro: Quicker convergence, more exploitation Con: Best arm in the sample may be worse than best arm overall Pick sample size to control this tradeoff

Mortal Multi-armed Bandits Traditional bandit policies like UCB1 spend a large fraction of their initial pulls on exploration Hard-earned knowledge may be lost due to finite arm lifetimes Method 2 (Payoff threshold): New bandit policy: If the observed payoff of any arm is higher than a threshold, pull it till it expires Pro: Good arms, once found, are exploited quickly Con: While exploiting good arms, the best arm may be starving and may expire without being found Pick threshold to control this tradeoff
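Method 2 above can be illustrated with a toy simulation, under stated assumptions (Bernoulli payoffs; lifetimes are known to the simulator only, to model expiry); the concrete policy details here are a sketch, not the NIPS/2008 algorithm verbatim:

```python
import random

def threshold_policy(arms, threshold, horizon, seed=0):
    """Payoff-threshold policy for mortal bandits (a sketch of Method 2).

    arms: dict arm_id -> (true_payoff_prob, lifetime); both fields are
    used only to simulate rewards and expiry.  The policy explores a
    random live arm, and keeps pulling it while its observed payoff
    stays above the threshold (or until it expires).
    """
    rng = random.Random(seed)
    stats = {a: [0, 0] for a in arms}            # successes, pulls per arm
    life = {a: lifetime for a, (_, lifetime) in arms.items()}
    current, total = None, 0
    for _ in range(horizon):
        if not life:
            break                                # every arm has expired
        if current not in life:
            current = rng.choice(list(life))     # explore a random live arm
        reward = rng.random() < arms[current][0]
        total += reward
        stats[current][0] += reward
        stats[current][1] += 1
        life[current] -= 1
        if life[current] == 0:
            del life[current]                    # arm dies after its lifetime
        s, n = stats[current]
        if s / n < threshold:                    # not good enough: re-explore
            current = None
    return total
```

The tradeoff the slide describes is visible here: a high threshold abandons good arms too eagerly, while a low one lets the policy camp on a mediocre arm while the best arm quietly expires.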

Mortal Multi-armed Bandits Challenges: Selecting the critical sample size or threshold correctly, for arbitrary payoff distributions What if even the payoff distribution is unknown?

Challenges and Open Problems Mortal multi-armed bandits What if the bandit policy has some information about the budget? The bandit policy can control which arms expire, and when “Handling Advertisements of Unknown Quality in Search Advertising” by Pandey+/NIPS/2006 Combining budgets with extra knowledge of ad CTRs E.g., Using an ad taxonomy Using a bandit scheme to infer/correct an ad taxonomy

Conclusions

Conclusions We provided an introduction to Online Advertising Discussed the eco-system and various actors involved Discussed different flavors of online advertising Sponsored Search, Content Match, Display Advertising

Conclusions Advertising settings: Sponsored Search, Content Match, Display Revenue models: CPM, CPC, CPA Misc.: ad exchanges

Conclusions Outlined the statistical challenges associated with Sponsored Search, Content Match, and Display We believe the following to be a technical roadmap: Offline Modeling (regression, collaborative filtering, mixture of experts, multi-resolution models; selection bias, slate correlation, noisy labels) → Online Models (time series) → Explore/Exploit (multi-armed bandits)

Conclusions Offline Modeling: by far the best studied so far, but with no careful study of selection bias, slate correlations, or noisy labels; good opportunity here More emphasis needed on matrix structure, where the goal is to estimate interactions Explore/Exploit: some work using multi-armed bandits; a long way to go Time series models to capture temporal aspects: little work A holistic approach that combines all components in a principled way is still missing