CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.

Slides:



Advertisements
Similar presentations
6.896: Topics in Algorithmic Game Theory Lecture 21 Yang Cai.
Advertisements

Lecture 7. Network Flows We consider a network with directed edges. Every edge has a capacity. If there is an edge from i to j, there is an edge from.
Great Theoretical Ideas in Computer Science
Great Theoretical Ideas in Computer Science for Some.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Ad Auctions: An Algorithmic Perspective Amin Saberi Stanford University Joint work with A. Mehta, U.Vazirani, and V. Vazirani.
Search Advertising These slides are modified from those by Anand Rajaram & Jeff UllmanAnand Rajaram & Jeff Ullman.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Combinatorial Algorithms
The Theory of NP-Completeness
Complexity ©D Moshkovitz 1 Approximation Algorithms Is Close Enough Good Enough?
Sponsored Search Presenter: Lory Al Moakar. Outline Motivation Problem Definition VCG solution GSP(Generalized Second Price) GSP vs. VCG Is GSP incentive.
Great Theoretical Ideas in Computer Science.
1 Discrete Structures & Algorithms Graphs and Trees: II EECE 320.
Computational problems, algorithms, runtime, hardness
1 Spanning Trees Lecture 20 CS2110 – Spring
CS 345 Data Mining Online algorithms Search advertising.
Item Pricing for Revenue Maximization in Combinatorial Auctions Maria-Florina Balcan, Carnegie Mellon University Joint with Avrim Blum and Yishay Mansour.
1 Ecole Polytechnque, Nov 7, 2007 Scheduling Unit Jobs to Maximize Throughput Jobs:  all have processing time (length) = 1  release time r j  deadline.
Approximation Algorithms
Increasing graph connectivity from 1 to 2 Guy Kortsarz Joint work with Even and Nutov.
CS 345 Data Mining Online algorithms Search advertising.
Online Matching for Internet Auctions DIMACS REU Presentation 20 June 2006 Slide 1 Online Matching for Internet Auctions Ben Sowell Advisor: S. Muthukrishnan.
1 Introduction to Approximation Algorithms Lecture 15: Mar 5.
Approximation Algorithms Motivation and Definitions TSP Vertex Cover Scheduling.
Handling Advertisements of Unknown Quality in Search Advertising Sandeep Pandey Christopher Olston (CMU and Yahoo! Research)
A General Approach to Online Network Optimization Problems Seffi Naor Computer Science Dept. Technion Haifa, Israel Joint work: Noga Alon, Yossi Azar,
Packing Element-Disjoint Steiner Trees Mohammad R. Salavatipour Department of Computing Science University of Alberta Joint with Joseph Cheriyan Department.
1 The Theory of NP-Completeness 2012/11/6 P: the class of problems which can be solved by a deterministic polynomial algorithm. NP : the class of decision.
Approximation Algorithms for NP-hard Combinatorial Problems Magnús M. Halldórsson Reykjavik University
10.4 How to Find a Perfect Matching We have a condition for the existence of a perfect matching in a graph that is necessary and sufficient. Does this.
APPROXIMATION ALGORITHMS VERTEX COVER – MAX CUT PROBLEMS
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
1 Online algorithms Typically when we solve problems and design algorithms we assume that we know all the data a priori. However in many practical situations.
Prims’ spanning tree algorithm Given: connected graph (V, E) (sets of vertices and edges) V1= {an arbitrary node of V}; E1= {}; //inv: (V1, E1) is a tree,
Dynamic Covering for Recommendation Systems Ioannis Antonellis Anish Das Sarma Shaddin Dughmi.
Discrete Structures Lecture 12: Trees Ji Yanyan United International College Thanks to Professor Michael Hvidsten.
1 The Theory of NP-Completeness 2 Cook ’ s Theorem (1971) Prof. Cook Toronto U. Receiving Turing Award (1982) Discussing difficult problems: worst case.
Graph Colouring L09: Oct 10. This Lecture Graph coloring is another important problem in graph theory. It also has many applications, including the famous.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Frequency Capping in Online Advertising Moran Feldman Technion Joint work with: Niv Buchbinder,The Open University of Israel Arpita Ghosh,Yahoo! Research.
1 CS612 Algorithms for Electronic Design Automation CS 612 – Lecture 8 Lecture 8 Network Flow Based Modeling Mustafa Ozdal Computer Engineering Department,
© The McGraw-Hill Companies, Inc., Chapter 12 On-Line Algorithms.
CPS Computational problems, algorithms, runtime, hardness (a ridiculously brief introduction to theoretical computer science) Vincent Conitzer.
CS 3343: Analysis of Algorithms Lecture 19: Introduction to Greedy Algorithms.
11 -1 Chapter 12 On-Line Algorithms On-Line Algorithms On-line algorithms are used to solve on-line problems. The disk scheduling problem The requests.
TU/e Algorithms (2IL15) – Lecture 12 1 Linear Programming.
Non-LP-Based Approximation Algorithms Fabrizio Grandoni IDSIA
Approximation Algorithms by bounding the OPT Instructor Neelima Gupta
Approximation Algorithms based on linear programming.
The NP class. NP-completeness Lecture2. The NP-class The NP class is a class that contains all the problems that can be decided by a Non-Deterministic.
Analysis of Massive Data Sets Prof. dr. sc. Siniša Srbljić Doc. dr. sc. Dejan Škvorc Doc. dr. sc. Ante Đerek Faculty of Electrical Engineering and Computing.
Unconstrained Submodular Maximization Moran Feldman The Open University of Israel Based On Maximizing Non-monotone Submodular Functions. Uriel Feige, Vahab.
TU/e Algorithms (2IL15) – Lecture 8 1 MAXIMUM FLOW (part II)
TU/e Algorithms (2IL15) – Lecture 11 1 Approximation Algorithms.
The NP class. NP-completeness
What about streaming data?
Computational Advertising
AdWords and Generalized On-line Matching
Maximum Matching in the Online Batch-Arrival Model
Great Theoretical Ideas in Computer Science
Cloud Computing Group 7.
Discrete Mathematics for Computer Science
Instructor: Shengyu Zhang
Online algorithms Search advertising
Coverage Approximation Algorithms
Advertising on the web.
Algorithms (2IL15) – Lecture 7
Computational Advertising and
Presentation transcript:

CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at:

 Classic model of algorithms  You get to see the entire input, then compute some function of it  In this context, “offline algorithm”  Online Algorithms  You get to see the input one piece at a time, and need to make irrevocable decisions along the way 2J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

4 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Bipartite Graphs  Bipartite graph:  Two sets of nodes: A and B  There are no edges between nodes that belong to the same set.  Edges are only between nodes in different sets a b c d AB

5 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Bipartite Matching  Maximum Bipartite Matching: Choose a subset of edges E M such that: 1. Each vertex is connected to at most one edge in E M 2. The size of E M is as large as possible  Example: Matching projects to groups a b c d Projects Groups M = {(1,a),(2,b),(3,d)} is a matching Cardinality of matching = |M| = 3

6 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Bipartite Matching  Maximum Bipartite Matching: Choose a subset of edges E M such that: 1. Each vertex is connected to at most one edge in E M 2. The size of E M is as large as possible  Example: Matching projects to groups a b c d Projects Groups M = {(1,c),(2,b),(3,d),(4,a)} is a maximum matching Cardinality of matching = |M| = 4

M = {(1,c),(2,b),(3,d),(4,a)} is a perfect matching 7J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Perfect matching … all vertices of the graph are matched Maximum matching … a matching that contains the largest possible number of matches a b c d Projects Groups

 Problem: Find a maximum matching for a given bipartite graph  A perfect one if it exists  There is a polynomial-time offline algorithm based on augmenting paths (Hopcroft & Karp 1973, see )  But what if we do not know the entire graph upfront? 8J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

9 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Online Bipartite Matching Problem  Initially, we are given the set of projects  The TA receives an indicating the preferences of one group.  The TA must decide at that point to either: assign a prefered project to this group, or not assign any projects to this group  Objective is to maximize the number of preferred assignments Note: This is not how your projects were assigned

10 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Greedy Online Bipartite Matching  Greedy algorithm For each group g Let P g be the set of projects group g prefers if there is a p ∈ P g that is not already assigned to another group assign project p to group g else do not assign any project to g

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, a b c d (1,a) (2,b) (3,d)

 For input I, suppose greedy produces matching M greedy while an optimal matching is M opt Competitive ratio = min all possible inputs I (|M greedy |/|M opt |) (what is greedy’s worst performance over all possible inputs I) 12J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

13 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analysis of the Greedy Algorithm Step 1: Find a lower bound for the competitive ratio A L B Definitions: M o : The optimal matching M g : The greedy matching L: The set of vertices from A that are in M o, but not in M g R: The set of vertices from B that are connected to at least one vertex in L R

14 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analysis of the Greedy Algorithm (cont’d)  Claim: All vertices in R must be in M g Proof: By contradiction, assume there is a vertex v ∈ R that is not in M g. There must be another vertex u ∈ L that is connected to v. By definition u is not in M g either. When the greedy algorithm processed edge (u, v), both vertices u and v were available, but it matched none of them. This is a contradiction!  Fact: |M o | ≤ |M g | + |L| Adding the missing elements to Mg will make its size to be at least the size of the optimal matching.  Fact: |L| ≤ |R| Each vertex in L was matched to another vertex in M o

15 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analysis of the Greedy Algorithm (cont’d)  Fact: |R| ≤ |M g | All vertices in R are in M g  Summary: |M o | ≤ |M g | + |L| |L| ≤ |R| |R| ≤ |M g |  Combine: |M o | ≤ |M g | + |L| ≤ |M g | + |R| ≤ 2 |M g | Lower-bound for competitive ratio:

16 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analysis of the Greedy Algorithm (cont’d)  We have shown that the competitive ratio is at least 1/2. However, can it be better than 1/2?  Step 2: Find an upper bound for competitive ratio: Typical approach: Find an example. If there is at least one example that has competitive ratio of r, it must mean that competitive ratio cannot be greater than r a b c (1,a),(2,b) d Competitive ratio = ½ for this example So, competitive ratio <= ½ The optimal matching is: (4, a), (3,b), (1,c), (2, d) Greedy matching:

17 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Greedy Matching Algorithm  We have shown that competitive ratio for the greedy algorithm is 1/2.  We proved that both lower bound and upper bound is 1/2  Conclusion: The online greedy algorithm can result in a matching solution that has half the size of an optimal offline algorithm in the worst case.

 Banner ads ( )  Initial form of web advertising  Popular websites charged X$ for every 1,000 “impressions” of the ad  Called “CPM” rate (Cost per thousand impressions)  Modeled similar to TV, magazine ads  From untargeted to demographically targeted  Low click-through rates  Low ROI for advertisers J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, CPM…cost per mille Mille…thousand in Latin

 Introduced by Overture around 2000  Advertisers bid on search keywords  When someone searches for that keyword, the highest bidder’s ad is shown  Advertiser is charged only if the ad is clicked on  Similar model adopted by Google with some changes around 2002  Called Adwords 20J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

21J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

 Performance-based advertising works!  Multi-billion-dollar industry  Interesting problem: What ads to show for a given query?  (This lecture)  If I am an advertiser, which search terms should I bid on and how much should I bid?  (Not focus of this lecture) 22J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

 Given:  1. A set of bids by advertisers for search queries  2. A click-through rate for each advertiser-query pair  3. A budget for each advertiser (say for 1 month)  4. A limit on the number of ads to be displayed with each search query  Respond to each search query with a set of advertisers such that:  1. The size of the set is no larger than the limit on the number of ads per query  2. Each advertiser has bid on the search query  3. Each advertiser has enough budget left to pay for the ad if it is clicked upon J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

 A stream of queries arrives at the search engine: q 1, q 2, …  Several advertisers bid on each query  When query q i arrives, search engine must pick a subset of advertisers whose ads are shown  Goal: Maximize search engine’s revenues  Simplification: Instead of raw bids, use the “expected revenue per click” (i.e., Bid*CTR)  Clearly we need an online algorithm! 24J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

25 AdvertiserBidCTRBid * CTR A B C $1.00 $0.75 $0.50 1% 2% 2.5% 1 cent 1.5 cents cents Click through rate Expected revenue

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, AdvertiserBidCTRBid * CTR A B C $1.00 $0.75 $0.50 1% 2% 2.5% 1 cent 1.5 cents cents

 Two complications:  Budget  CTR of an ad is unknown  Each advertiser has a limited budget  Search engine guarantees that the advertiser will not be charged more than their daily budget J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

 CTR: Each ad has a different likelihood of being clicked  Advertiser 1 bids $2, click probability = 0.1  Advertiser 2 bids $1, click probability = 0.5  Clickthrough rate (CTR) is measured historically  Very hard problem: Exploration vs. exploitation Exploit: Should we keep showing an ad for which we have good estimates of click-through rate or Explore: Shall we show a brand new ad to get a better sense of its click-through rate J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

29 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Simplified Problem  We will start with the following simple version of Adwords:  One ad shown for each query  All advertisers have the same budget B  All bids are $1  All ads are equally likely to be clicked and CTR = 1  We will generalize it later.

30 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Greedy Algorithm  Simple greedy algorithm: For the current query q, pick any advertiser who: 1. has bid 1 on q 2. has remaining budget  What is the competitive ratio of this greedy algorithm?  Can we model this problem as bipartite matching?

31 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Bipartite Matching Model B nodes for each advertiser bidsqueries Online algorithm: For each new query q assign a bid if available Equivalent to the online greedy bipartitite matching algorithm, which had competitive ratio = 1/2. So, the competitive ratio of this algorithm is also ½.

 Two advertisers A and B  A bids on query x, B bids on x and y  Both have budgets of $4  Query stream: x x x x y y y y  Worst case greedy choice: B B B B _ _ _ _  Optimal: A A A A B B B B  Competitive ratio = ½  This is the worst case!  Note: Greedy algorithm is deterministic – it always resolves draws in the same way 32J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

 BALANCE Algorithm by Mehta, Saberi, Vazirani, and Vazirani  For each query, pick the advertiser with the largest unspent budget  Break ties arbitrarily (but in a deterministic way) 33J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

 Two advertisers A and B  A bids on query x, B bids on x and y  Both have budgets of $4  Query stream: x x x x y y y y  BALANCE choice: A B A B B B _ _  Optimal: A A A A B B B B  Competitive ratio ≤ ¾ 34J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

35 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analyzing BALANCE: Simple Case  Try to prove a lower bound for the competitive ratio  i.e. Consider the worst-case behavior of BALANCE algorithm  Start with the simple case:  2 advertisers A 1 and A 2 with equal budgets B  Optimal solution exhausts both budgets  All queries assigned to at least one advertiser in the optimal solution Remove the queries that are not assigned by the optimal algorithm This only makes things worse for BALANCE A1A1 A2A2 B Queries allocated to A 1 in the optimal solution Queries allocated to A 2 in the optimal solution

36 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analysis of BALANCE: Simple Case  Claim: BALANCE must exhaust the budget of at least one advertiser  Proof by contradiction: Assume both advertisers have left over budgets Consider query q that is assigned in the optimal solution, but not in BALANCE. Contradiction: q should have been assigned to at least the same advertiser because both advertisers have available budget.

37 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analysis of BALANCE: Simple Case x y B A1A1 A2A2 x Optimal solution A1A1 A2A2 B BALANCE solution  Without loss of generality, assume the whole budget of A 2 is exhausted.  Claim: All blue queries (the ones assigned to A 1 in the optimal solution) must be assigned to A 1 and/or A 2 in the BALANCE solution.  Proof by contradiction: Assume a blue query q not assigned to either A 1 or A 2. Since budget of A 1 is not exhausted, it should have been assigned to A 1. z

38 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analysis of BALANCE: Simple Case Optimal solution A1A1 A2A2 B  Some of the green queries (the ones assigned to A 2 in the optimal solution) are not assigned to either A 1 or A 2. Let x be the # of such queries.  Prove an upper bound for x  Worst case for the BALANCE algorithm. x y B A1A1 A2A2 x BALANCE solution z

39 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analysis of BALANCE: Simple Case Optimal solution A1A1 A2A2 B  Consider two cases for z:  Case 1: z ≥ B/2 size (A 1 ) = y + z ≥ B/2 size (A 1 + A 2 ) = B + y + z ≥ 3B/2 x y B A1A1 A2A2 x BALANCE solution z

40 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analysis of BALANCE: Simple Case Optimal solution A1A1 A2A2 B  Case 2: z < B/2  Consider the time when last blue query was assigned to A 2 : x y B A1A1 A2A2 x BALANCE solution z A1A1 A2A2 ≥ B/2 A 2 has remaining budget of ≤ B/2 For A 2 to be chosen, A 1 must also have remaining budget of ≤ B/2

41 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analysis of BALANCE: Simple Case Optimal solution A1A1 A2A2 B  Case 2: z < B/2 size (A 1 ) ≥ B/2 size (A 1 + A 2 ) = B + size(A 1 ) ≥ 3B/2 x y B A1A1 A2A2 x BALANCE solution z

42 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Analysis of BALANCE: Simple Case

 For multiple advertisers, worst competitive ratio of BALANCE is 1–1/e = approx  Interestingly, no online algorithm has a better competitive ratio!  See textbook for the worst-case analysis. 43J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

 Arbitrary bids and arbitrary budgets!  In a general setting BALANCE can be terrible  Consider two advertisers A 1 and A 2  A 1 : x 1 = 1, b 1 = 110  A 2 : x 2 = 10, b 2 = 100  Assume we see 10 instances of q  BALANCE always selects A 1 and earns 10  Optimal earns J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

 Arbitrary bids: consider query q, bidder i  Bid = x i  Budget = b i  Amount spent so far = m i  Fraction of budget left over f i = 1-m i /b i  Define  i (q) = x i (1-e -f i )  Allocate query q to bidder i with largest value of  i (q)  Same competitive ratio (1-1/e) 45J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

46 CS 425 – Lecture 7Mustafa Ozdal, Bilkent University Conclusions  Web Advertising: Try to maximize ad revenue from a stream of queries  Online algorithms: Make decisions without seeing the whole input set  Approximation algorithms: Theoretically prove upper and lower bounds w.r.t. the optimal solutions.