What about streaming data?
The Stream Model Data enters at a rapid rate from one or more input ports Such data are called stream tuples The system cannot store the entire (infinite) stream Distribution changes over time How do you make critical calculations (in real time) about the stream using a limited amount of (secondary) memory?
General Stream Processing Model Streams Entering Ad-hoc Queries . . . 1, 5, 2, 7, 0, 9, 3 . . . a, r, v, t, y, h, b . . . 0, 0, 1, 0, 1, 1, 0 time Processor Standing Queries Output Archival Storage Limited Working Storage
Applications In general, stream processing is important for applications where New data arrives frequently Important queries tend to ask about the most recent data, or summaries of data Mining query streams Google wants to know what queries are more frequent today than yesterday Mining click streams Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour Mining social network news feeds Look for trending topics on Twitter, Facebook
Sliding Windows A useful model of stream processing is that queries are about a window of length N the N most recent elements received (or last N time units) Interesting case: N is still so large that it cannot be stored in memory, or even on disk Or, there are so many streams that windows for all cannot be stored
Sliding Windows: Single Stream q w e r t y u i o p a s d f g h j k l z x c v b n m q w e r t y u i o p a s d f g h j k l z x c v b n m Past Future Window size = 5 Step = 2 Window size = 5 Step = 1
Example: Averages Stream of integers Window size = N Standing query: What is the average of the integers in the window? For the first N inputs average = sum/count Afterward, when a new input i arrives Update average: average + (i – j)/N where j is the oldest value in the window
Let’s start simple: Counting bits Problem Given a stream of 0’s and 1’s Answer queries of the form “how many 1’s in the last k bits?” where k ≤ N Obvious solution Store the most recent N bits (i.e., window size = N) When a new bit arrives, discard the N +1st bit Real Problem Slow - need to scan k-bits to count What if we cannot afford to store N bits? E.g., we are processing 1 billion streams and N = 1 billion Estimate with an approximate answer N = 6
Example: Exponentially Increasing Blocks
How Good Is This? Store O(log22N) bits per stream (<< N bits) O(log N) counts of log N bits Gives approximate answer How accurate? If 1’s are evenly distributed, never off by more than 50% Otherwise error can be unbounded! E.g, all the 1’s are in the unknown block!
DGIM* Method Store O(log22N) bits per stream (<< N bits) O(log N) counts of log N bits Gives approximate answer Never off by more than 50% Error factor can be reduced to any fraction > 0, with more complicated algorithm and proportionally more stored bits Trick: Block size is based on count of 1s instead of fix-sized blocks *Datar, Gionis, Indyk, and Motwani (Stanford)
DGIM: Timestamps Each bit in the stream has a timestamp, starting 1, 2, … Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log2N) bits If N = 1M, then we need 20 bits
DGIM: Buckets A bucket is a record consisting of The timestamp of its right (most recent) end O(log2 N) bits The number of 1’s between its beginning and end Number of 1’s must be a power of 2 O(log2 log2 N) bits
Representing a Stream by Buckets Either one or two buckets with the same power-of-2 number of 1’s Buckets do not overlap in timestamps Buckets are sorted by size (# of 1’s). Earlier buckets are not smaller than later buckets Buckets disappear when their (right) end-time is > N time units in the past O(log2 N) buckets
Updating Buckets When a new bit comes in, drop the last (oldest) bucket if its (right) end-time is prior to N time units before the current time If the current bit is 0, no other changes are needed
Updating Buckets If the current bit is 1: Create a new bucket of size 1, for just this bit End timestamp = current time If there are now three buckets of size 1, combine the oldest two into a bucket of size 2 If there are now three buckets of size 2, combine the oldest two into a bucket of size 4 And so on…
Example
Querying To estimate the number of 1’s in the most recent N bits: Sum the size of all buckets but the last “size” means the number of 1’s in the bucket Add half the size of the last bucket NOTE: We do not know how many 1’s of the last bucket are still within the wanted window
Example: Bucketized stream How many 1’s? 1 + 1 + 2 + 4 + 4 + 8 + 8 + ½*16 = 36
How good is the scheme? Suppose the last bucket has size 2r Assuming 2r-1 (i.e., half) of its 1’s are still within the window, we make an error of at most 2r-1 Since there is at least one bucket of each of the sizes less than 2r , the true sum is at least 1+2+4+…+2r-1 = 2r -1 actually 2r as the first bit of the last bucket is always 1 Thus, error is at most 50%
Extensions How many 1’s in the last k where k < N? Find earliest bucket B that overlaps with k Number of 1’s is the sum of sizes of more recent buckets + ½ size of B How about counting integers (sum of the last k elements)?
Counting positive integers Stream of positive integers We want the sum of the last k elements Amazon: Avg price of last k sales Solution If you know all integers have at most m bits (i.e. integer value upto 2m) Treat each of the m bits of each integer as a separate stream Use DGIM to count 1s in each bit stream The sum is =
Filtering data streams Consider a web crawler It keeps, centrally, a list of all the URLs it has found so far It assigns these URLs to any of a number of parallel tasks These tasks stream back the URLs they find in the links they discover on a page It needs to filter out those URLs it has seen before
Filtering data streams Recall: Each element of data stream is a tuple Given a list of keys S, determine which tuples of stream are in S Naïve solution: Hash table But suppose we do not have enough memory to store all of S in a hash table E.g., we might be processing millions of filters on the same stream
Naïve solution Given a set of keys S that we want to filter Create a bit array B of n bits Initialize to all 0s Choose a hash function h with range [0,n) Hash each member of s S to one of n buckets, and set that bit to 1, i.e., B[h(s)] = 1 Hash each element a of the stream and output only those that hash to bit that was set to 1 Output a if B[h(a)] == 1
Naïve solution S = {A1, A2, A3} H(A1) = 7 H(A2) = 0 H(A3) = 2 1 Bit array H(a1) = 1. Since bit at position 1 is 0, drop a1. H(a2) = 7. Since bit at position 7 is 1, output a2.
Naïve solution False positives? Will invalid item passed through the filter? If the item is NOT in S, it may still be output Why? Collision False negatives? Will valid items be filtered out? If item is in S, we surely output it
Analysis on False Positives: Throwing darts If we throw m darts into n equally likely targets, what is the probability that a target gets at least one dart? In our case: Targets = bits/buckets Darts = hash values of items Equals 1/e as n → m = n(m/n) 1 – ( 1 – 1/n )n (m/n) = 1 – e –m/n Prob some target X not hit by a dart Prob at least one dart hits target X
Analysis: Throwing darts Fraction of 1’s in the array B = probability of false positives = 1-e-m/n Example: 109 darts, 8∙109 targets Fraction of 1’s in B = 1-e-1/8 = 0.1175 Lower than 1/8 = 0.125 (why?) Why? collision
Bloom filter Consider: |S| = m; |B| = n Use k independent hash functions h1, …, hk Initialization Set B to all 0’s Hash each element s S k times using each hash function hi, set B[hi(s)] = 1 (for each i = 1, .., k) Run-time When a stream element with key x arrives If B[hi(x)] == 1 for all i = 1, .., k then declare that x is in S That is, x hashes to a bucket set to 1 for every hash function hi(x) Otherwise, discard the element x
Bloom filter S = {A1, A2, A3} H1(A1) = 7 H2(A1) = 2 H1(A2) = 0 H = {H1, H2} 1 1 1 1 1 1 Bit array H1(a1) = 1 H2(a1) = 3 Since bit at position 3 is 0, drop a1 H1(a2) = 7 H2(a2) = 5 Since bits at positions 5 and 7 are both 1, output a2 (potentially a false positive)
Bloom filter: Analysis What fraction of the bit vector B are 1’s? Throwing k∙m darts at n targets So fraction of 1’s is (1 – e –km/n) But we have k independent hash functions and we only let the element x through if all k hash element x to a bucket of value 1 So false positive probability = (1 – e –km/n)k
Bloom Filter Bloom filters guarantee no false negatives and use limited memory Great for pre-processing before more expensive checks Suitable for hardware implementation Hash function computations can be parallelized Is it better to have 1 big B or k small B’s? It is the same: is (1 – e –km/n)k vs (1 – e –m/(n/k))k But keeping 1 big B is simpler
So far, … We have looked at some “operations” on data streams when memory is limited Counting bits Counting integers Filtering streams
Online (Web) advertising: Streaming queries, dynamic answers
Online advertisements Multi-billion-dollar industry! It was reported in 2012 that Google alone made 32.2 billion from advertising.
Banner Ads
Pay-Per-Impression Pay-per-1000 impressions (PPM): advertiser pays each time ad is displayed Models existing standards from magazine, radio, television Main business model for banner ads to date Untargeted to demographically targeted Low clickthrough rates Low ROI for advertisers Banner “blindness”: effectiveness drops with user experience Barrier to entry for small advertisers Contracts negotiated on a case-by-case basis with large minimums (typically, a few thousand dollars per month)
Pay-Per-Click Pay-per-click (PPC): advertiser pays only when user clicks on ad Common in search advertising Middle ground between PPM and PPA Does not require host to trust advertiser Provides incentives for host to improve ad displays
Sponsored (Performance-based) advertising Advertisements sold automatically through auctions: Advertisers “bids” (indicating value for clicks) on search keywords Low barrier-to-entry Increased transparency of mechanism Keyword bidding allowed increased targeting opportunities When someone searches for that keyword, a mechanism selects the “best” ad to display Advertiser is charged only if the ad is clicked on Example: Google’s “Adwords”
Ads vs. search results
Algorithmic Challenges Interesting problems What ads to show for a search? If I’m an advertiser, which search terms should I bid on and how much to bid? Not our focus
Scenario A stream of queries arrives at the search engine q1, q2, q3, … Several advertisers bid on each query When query qi arrives, search engine must pick a subset of advertisers whose ads are shown Goal: maximize search engine’s revenues Clearly we need an online algorithm!
How to maximize revenue? Greedy solution: Initial GoTo model (First-price auction) Advertise based on highest bidders Upon a click, advertiser is charged a price equal to his bid Used first by Overture/Yahoo! Complications Each ad has a different likelihood of being clicked Advertiser 1 bids $2, click probability = 0.1 Advertiser 2 bids $1, click probability = 0.5 Clickthrough rate measured historically
How to maximize revenue? Google model: Second-price auction Advertisers ranked according to bid and click-through rate (CTR) Use the “expected revenue per click” Upon a click, advertiser is charged minimum amount required to maintain position in ranking
The Adwords Innovation Advertiser Bid CTR Bid * CTR A $1.00 1% 1 cent B $0.75 2% 1.5 cents C $0.50 2.5% 1.125 cents Click through rate Expected revenue
The Adwords Innovation Advertiser Bid CTR Bid * CTR B $0.75 2% 1.5 cents C $0.50 2.5% 1.125 cents A $1.00 1% 1 cent Instead of sorting advertisers by bid, sort by expected revenue!
More complications Each advertiser has a limited budget Search engine guarantees that the advertiser will not be charged more than their daily budget Solution: Ensure budget is not “exhausted” too quickly unnecessarily
Simplified model (for now) All bids are 0 or 1 Each advertiser has the same budget B One ad per query All ads are equally likely to be clicked Let’s try the greedy algorithm Arbitrarily pick an eligible advertiser (bid 1) for each keyword
Bad scenario for greedy Two advertisers A and B A bids on query x, B bids on x and y Both have budgets of $4 Query stream: x x x x y y y y Worst case greedy choice: B B B B _ _ _ _ Optimal: A A A A B B B B Competitive ratio minall possible inputs Revenuegreedy/Revenueopt = ½
Adwords Problem Given A set of bids by advertisers for search queries A click-through rate for each advertise-query pair A budget for each advertiser (say for 1 day, 1 month, …) A limit on the number of ads to be displayed with each search query, N Respond to each search query with a set of advertisers such that The size of the set is no larger than N Each advertiser has bid on the search query Each advertiser has enough budget left to pay for the ad if it is clicked
BALANCE algorithm [Mehta, Saberi, Vazirani, and Vazirani] For each query, pick the advertiser with the largest unspent budget Break ties arbitrarily but deterministically Mehta, A. Saberi, U. Vazirani, V. Vazirani, Adwords and Generalized On-line Matching , Journal of the ACM (2007). Conference version appeared in IEEE Symposium on Foundations of Computer Science (2005).
Example: BALANCE Two advertisers A and B Query stream: x x x x y y y y A bids on query x, B bids on x and y Both have budgets of $4 Query stream: x x x x y y y y BALANCE choice: A B A B B B _ _ Optimal: A A A A B B B B Competitive ratio = ¾ For BALANCE with 2 advertisers
Analyzing BALANCE Consider simple case: two advertisers, A1 and A2, each with budget B (assume B ≥ 1) Assume optimal solution exhausts both advertisers’ budgets BALANCE must exhaust at least one advertiser’s budget If not, we can allocate more queries Assume BALANCE exhausts A2’s budget
Analyzing Balance A1 A2 B Queries allocated to A1 in optimal solution Opt revenue = 2B BALANCE revenue = B+y Goal: Show that y ≥ B/2 x y B A1 A2 BAL: B+B/2 = 3B/2 BAL/OPT = 3/4 Competitive Ratio = 3/4
Analyzing Balance A1 A2 B Queries allocated to A1 in optimal solution Case 1: BALANCE assigns at least B/2 blue queries to A1. So, y ≥ B/2 Case 2: BALANCE assigns more than B/2 blue queries to A2 At that time, A2‘s unspent budget must have been at least as big as A1‘s That means at least as many queries have been assigned to A1 as to A2 At this point, we have already assigned at least B/2 queries to A2 . S, y ≥ B/2 x y B A1 A2
General Result: > 2 advertisers Worst competitive ratio of BALANCE is 1–1/e = approx. 0.63 Interestingly, no online algorithm has a better competitive ratio!
General version of problem So far, all bids = 1, all budgets equal (=B) Arbitrary bids, budgets Consider query q, advertiser i Bid = xi Budget = bi BALANCE can be terrible Consider two advertisers A1 and A2 A1: x1 = 1, b1 = 110 A2: x2 = 10, b2 = 100 Suppose we see 10 instances of q BALANCE always selects A1 and earns 10 Optimal earns 100! Competitive ratio: 0.1!! Generalized BALANCE scheme with competitive ratio of 1 – 1/e Modify BALANCE to be based on fraction of unspent budget, and some others ….
Actual problem is even more sophisticated … Upon each search, Interested advertisers are selected from database using keyword matching algorithm Budget allocation algorithm retains interested advertisers with sufficient budget Advertisers compete for ad slots in allocation mechanism Upon click, advertiser charged with pricing scheme CTR updated according to CTR learning algorithm for future auctions
Conclusion Data streaming applications are becoming increasingly important Online algorithms are needed We have looked at data stream processing in general, and web advertising in particular
Acknowledgements The slides is adapted from numerous sources (not all listed here): Jure and Ullman (Stanford) Bergemann (Yale)