What about streaming data?


The Stream Model: data enters at a rapid rate from one or more input ports; such data items are called stream tuples. The system cannot store the entire (infinite) stream, and the distribution of the data changes over time. How do you make critical calculations about the stream, in real time, using a limited amount of (secondary) memory?

General Stream Processing Model (figure): streams of tuples (e.g., integers, characters, bits) enter a processor over time; the processor answers standing queries and ad-hoc queries, writes output, and has only limited working storage plus archival storage.

Applications: in general, stream processing is important for applications where new data arrives frequently and important queries tend to ask about the most recent data, or summaries of it. Examples: mining query streams (Google wants to know which queries are more frequent today than yesterday), mining click streams (Yahoo wants to know which of its pages got an unusual number of hits in the past hour), and mining social network news feeds (looking for trending topics on Twitter and Facebook).

Sliding Windows: a useful model of stream processing is that queries are about a window of length N, the N most recent elements received (or the last N time units). The interesting case is when N is still so large that the window cannot be stored in memory, or even on disk, or when there are so many streams that the windows for all of them cannot be stored.

Sliding Windows: Single Stream (figure): a window of size 5 slides over a character stream, shown once with step 1 and once with step 2; elements to the left of the window are the past, elements to the right are the future.

Example: Averages. Stream of integers, window size = N, standing query: what is the average of the integers in the window? For the first N inputs, average = sum/count. Afterward, when a new input i arrives, update the average to average + (i − j)/N, where j is the oldest value in the window (which is then dropped).
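A minimal sketch of this update in Python, assuming the window of N values itself still fits in memory (the class and method names are illustrative, not from the slides):

```python
# Running average over a sliding window: when a new value arrives,
# the oldest value j is dropped and the average shifts by (i - j)/N.
from collections import deque

class WindowAverage:
    def __init__(self, n):
        self.n = n              # window size N
        self.window = deque()   # the N most recent values
        self.total = 0

    def add(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.n:
            self.total -= self.window.popleft()   # drop oldest value j
        return self.total / len(self.window)
```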

Let's start simple: Counting bits. Problem: given a stream of 0's and 1's, answer queries of the form "how many 1's are in the last k bits?" where k ≤ N. Obvious solution: store the most recent N bits (window size = N); when a new bit arrives, discard the (N+1)st most recent bit. Real problems: answering a query is slow, since we need to scan k bits to count, and we may not be able to afford to store N bits at all, e.g., if we are processing 1 billion streams and N = 1 billion. Instead, estimate the count with an approximate answer.
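For comparison, here is a sketch of the exact approach that stores all N bits; it only works while N bits fit in memory, which is exactly the assumption we are about to drop (names are illustrative):

```python
# Exact windowed bit counting: keep the last N bits plus a running count.
from collections import deque

class ExactBitCounter:
    def __init__(self, n):
        self.n = n
        self.bits = deque()
        self.ones = 0

    def add(self, bit):
        self.bits.append(bit)
        self.ones += bit
        if len(self.bits) > self.n:
            self.ones -= self.bits.popleft()   # discard the oldest bit

    def count_ones(self, k=None):
        # 1's in the last k bits (k <= N); k = None means the whole window.
        if k is None:
            return self.ones
        return sum(list(self.bits)[-k:])       # the slow scan noted above
```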

Example: Exponentially Increasing Blocks

How Good Is This? We store O(log^2 N) bits per stream (much less than N bits): O(log N) counts of O(log N) bits each, and we get an approximate answer. How accurate? If the 1's are evenly distributed, the estimate is never off by more than 50%; otherwise the error can be unbounded, e.g., if all the 1's are in the unknown (oldest) block!

DGIM* Method: store O(log^2 N) bits per stream (much less than N bits), namely O(log N) counts of O(log N) bits each. It gives an approximate answer that is never off by more than 50%, and the error factor can be reduced to any fraction > 0 with a more complicated algorithm and proportionally more stored bits. The trick: bucket sizes are based on the count of 1's, instead of using fixed-size blocks. *Datar, Gionis, Indyk, and Motwani (Stanford)

DGIM: Timestamps. Each bit in the stream has a timestamp, starting from 1, 2, … Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log2 N) bits; if N = 1M, we need 20 bits.

DGIM: Buckets. A bucket is a record consisting of (1) the timestamp of its right (most recent) end, which takes O(log2 N) bits, and (2) the number of 1's between its beginning and end, which must be a power of 2 and therefore takes only O(log2 log2 N) bits to record.

Representing a Stream by Buckets: there are either one or two buckets with the same power-of-2 number of 1's; buckets do not overlap in timestamps; buckets are sorted by size (number of 1's), with earlier buckets no smaller than later buckets; and buckets disappear when their (right) end-time is more than N time units in the past. In total there are O(log2 N) buckets.

Updating Buckets When a new bit comes in, drop the last (oldest) bucket if its (right) end-time is prior to N time units before the current time If the current bit is 0, no other changes are needed

Updating Buckets If the current bit is 1: Create a new bucket of size 1, for just this bit End timestamp = current time If there are now three buckets of size 1, combine the oldest two into a bucket of size 2 If there are now three buckets of size 2, combine the oldest two into a bucket of size 4 And so on…

Example

Querying: to estimate the number of 1's in the most recent N bits, sum the sizes of all buckets except the last (oldest), where "size" means the number of 1's in the bucket, then add half the size of the last bucket. NOTE: we do not know how many of the last bucket's 1's are still within the wanted window.
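A compact sketch of the DGIM update and query rules just described; for readability it stores (timestamp, size) pairs in a plain Python list rather than the bit-packed O(log^2 N) representation, so it is illustrative rather than space-optimal.

```python
# Illustrative DGIM counter: buckets are (end_timestamp, size) pairs,
# newest first; sizes are powers of 2 and at most two buckets share a size.
class DGIM:
    def __init__(self, n):
        self.n = n            # window size N
        self.t = 0            # current timestamp (not reduced modulo N here)
        self.buckets = []     # list of (end_time, size), newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket if it has fallen out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit == 0:
            return
        # New bucket of size 1 for this 1-bit.
        self.buckets.insert(0, (self.t, 1))
        # Whenever three buckets share a size, merge the oldest two,
        # keeping the more recent end timestamp; merges may cascade.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                t1, s = self.buckets[i + 1]
                t2, _ = self.buckets[i + 2]
                self.buckets[i + 1:i + 3] = [(max(t1, t2), 2 * s)]
            else:
                i += 1

    def estimate_ones(self):
        # Sum all bucket sizes except the oldest, plus half the oldest.
        if not self.buckets:
            return 0
        sizes = [s for _, s in self.buckets]
        return sum(sizes[:-1]) + sizes[-1] // 2
```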

Example: Bucketized stream How many 1’s? 1 + 1 + 2 + 4 + 4 + 8 + 8 + ½*16 = 36

How good is the scheme? Suppose the last (oldest) bucket has size 2^r. By assuming that 2^(r-1) of its 1's (i.e., half) are still within the window, we make an error of at most 2^(r-1). Since there is at least one bucket of each of the sizes less than 2^r, the true count is at least 1 + 2 + 4 + … + 2^(r-1) = 2^r − 1, and in fact at least 2^r, because the right end of the last bucket is a 1 that is still inside the window. Thus, the error is at most 50%.

Extensions How many 1’s in the last k where k < N? Find earliest bucket B that overlaps with k Number of 1’s is the sum of sizes of more recent buckets + ½ size of B How about counting integers (sum of the last k elements)?

Counting positive integers: given a stream of positive integers, we want the sum of the last k elements (e.g., Amazon: average price of the last k sales). Solution: if we know all integers have at most m bits (i.e., values up to 2^m), treat each of the m bit positions as a separate 0/1 stream and use DGIM to count the 1's in each. The sum is then estimated as the sum over i = 0, …, m−1 of c_i · 2^i, where c_i is the estimated count of 1's in the stream of i-th bits.
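A sketch of this reduction, assuming the illustrative DGIM class from the earlier sketch is in scope:

```python
# Sum of the integers in the window via m per-bit DGIM counters.
class DGIMSum:
    def __init__(self, n, m):
        self.m = m                                   # max bits per integer
        self.counters = [DGIM(n) for _ in range(m)]  # one counter per bit position

    def add(self, value):
        for i in range(self.m):
            self.counters[i].add((value >> i) & 1)

    def estimate_sum(self):
        # Weight each per-bit estimate by its power of two.
        return sum(c.estimate_ones() << i for i, c in enumerate(self.counters))
```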

Filtering data streams Consider a web crawler It keeps, centrally, a list of all the URLs it has found so far It assigns these URLs to any of a number of parallel tasks These tasks stream back the URLs they find in the links they discover on a page It needs to filter out those URLs it has seen before

Filtering data streams Recall: Each element of data stream is a tuple Given a list of keys S, determine which tuples of stream are in S Naïve solution: Hash table But suppose we do not have enough memory to store all of S in a hash table E.g., we might be processing millions of filters on the same stream

Naïve solution: given the set of keys S that we want to filter, create a bit array B of n bits, initialized to all 0's, and choose a hash function h with range [0, n). Hash each member s ∈ S to one of the n buckets and set that bit to 1, i.e., B[h(s)] = 1. Then hash each element a of the stream and output only those that hash to a bit that was set to 1, i.e., output a if B[h(a)] == 1.
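A minimal sketch of this single-hash filter (not yet a Bloom filter); the particular hash function is my choice, not the slides':

```python
# Single-hash stream filter: bit array B plus one hash function h.
import hashlib

def h(key, n):
    # Map a string key to one of n buckets.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n

def build_filter(keys, n):
    bits = [0] * n
    for s in keys:
        bits[h(s, n)] = 1          # B[h(s)] = 1 for every s in S
    return bits

def maybe_in_set(bits, item):
    # True may be a false positive (collision); False is always correct.
    return bits[h(item, len(bits))] == 1
```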

Naïve solution, example: S = {A1, A2, A3} with H(A1) = 7, H(A2) = 0, H(A3) = 2, so bits 0, 2, and 7 of the bit array are set to 1. For a stream element a1 with H(a1) = 1, the bit at position 1 is 0, so a1 is dropped. For a2 with H(a2) = 7, the bit at position 7 is 1, so a2 is output.

Naïve solution: false positives? Will invalid items pass through the filter? Yes: if an item is NOT in S, it may still be output. Why? A hash collision. False negatives? Will valid items be filtered out? No: if an item is in S, we surely output it.

Analysis of false positives: throwing darts. If we throw m darts into n equally likely targets, what is the probability that a given target gets at least one dart? In our case, the targets are the bits/buckets and the darts are the hash values of the items. The probability that a particular target X is not hit by any dart is (1 − 1/n)^m = ((1 − 1/n)^n)^(m/n), and (1 − 1/n)^n approaches 1/e as n → ∞. So the probability that at least one dart hits target X is 1 − (1 − 1/n)^m ≈ 1 − e^(−m/n).

Analysis: throwing darts. The fraction of 1's in the array B equals the probability of a false positive, 1 − e^(−m/n). Example: 10^9 darts and 8·10^9 targets give a fraction of 1's in B of 1 − e^(−1/8) ≈ 0.1175, slightly lower than 1/8 = 0.125. Why lower? Collisions: some darts land in buckets that are already set.
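A quick numeric check of this example (purely illustrative):

```python
# Compare the exact probability that a given bit is set with the
# 1 - e^(-m/n) approximation for 10^9 darts and 8*10^9 targets.
import math

m, n = 10**9, 8 * 10**9
exact = 1 - (1 - 1/n)**m        # fraction of 1's in B
approx = 1 - math.exp(-m/n)     # ~= 0.1175, a bit below 1/8
print(exact, approx)
```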

Bloom filter. Consider |S| = m and |B| = n, and use k independent hash functions h_1, …, h_k. Initialization: set B to all 0's, then hash each element s ∈ S with every hash function and set B[h_i(s)] = 1 for each i = 1, …, k. Run-time: when a stream element with key x arrives, if B[h_i(x)] == 1 for all i = 1, …, k, declare that x is in S (that is, x hashes to a 1 bit under every hash function); otherwise, discard x.
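A minimal Bloom filter sketch along these lines; deriving the k hash values from two base hashes (double hashing) is a common implementation shortcut I am assuming here, not something the slides specify:

```python
# Bloom filter with n_bits bits and k hash positions per key.
import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.n for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False => definitely not in S; True => in S or a false positive.
        return all(self.bits[p] == 1 for p in self._positions(key))
```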

Bloom filter, example: S = {A1, A2, A3}, hash functions H = {H1, H2}, with H1(A1) = 7, H2(A1) = 2, H1(A2) = 0, and so on; the corresponding bits of the bit array are set to 1. For a stream element a1 with H1(a1) = 1 and H2(a1) = 3, the bit at position 3 is 0, so a1 is dropped. For a2 with H1(a2) = 7 and H2(a2) = 5, the bits at positions 5 and 7 are both 1, so a2 is output (potentially a false positive).

Bloom filter: analysis. What fraction of the bit vector B is 1's? We are throwing k·m darts at n targets, so the fraction of 1's is (1 − e^(−km/n)). But we have k independent hash functions and we only let element x through if all k of them hash x to a bucket of value 1, so the false positive probability is (1 − e^(−km/n))^k.

Bloom Filter: Bloom filters guarantee no false negatives and use limited memory, which makes them great for pre-processing before more expensive checks. They are also suitable for hardware implementation, since the hash function computations can be parallelized. Is it better to have one big B or k small B's? It is the same: (1 − e^(−km/n))^k vs (1 − e^(−m/(n/k)))^k, but keeping one big B is simpler.

So far, … We have looked at some “operations” on data streams when memory is limited Counting bits Counting integers Filtering streams

Online (Web) advertising: Streaming queries, dynamic answers

Online advertisements: a multi-billion-dollar industry! It was reported in 2012 that Google alone made $32.2 billion from advertising.

Banner Ads

Pay-Per-Impression. Pay-per-1000 impressions (PPM): the advertiser pays each time the ad is displayed. This models existing standards from magazines, radio, and television, and has been the main business model for banner ads to date, ranging from untargeted to demographically targeted placements. Drawbacks: low clickthrough rates and low ROI for advertisers; banner "blindness", where effectiveness drops with user experience; and a barrier to entry for small advertisers, since contracts are negotiated on a case-by-case basis with large minimums (typically a few thousand dollars per month).

Pay-Per-Click Pay-per-click (PPC): advertiser pays only when user clicks on ad Common in search advertising Middle ground between PPM and PPA Does not require host to trust advertiser Provides incentives for host to improve ad displays

Sponsored (performance-based) advertising: advertisements are sold automatically through auctions. Advertisers bid on search keywords, indicating their value for clicks, which gives a low barrier to entry, increased transparency of the mechanism, and increased targeting opportunities through keyword bidding. When someone searches for a keyword, a mechanism selects the "best" ad to display, and the advertiser is charged only if the ad is clicked on. Example: Google's "Adwords".

Ads vs. search results

Algorithmic Challenges Interesting problems What ads to show for a search? If I’m an advertiser, which search terms should I bid on and how much to bid? Not our focus

Scenario A stream of queries arrives at the search engine q1, q2, q3, … Several advertisers bid on each query When query qi arrives, search engine must pick a subset of advertisers whose ads are shown Goal: maximize search engine’s revenues Clearly we need an online algorithm!

How to maximize revenue? Greedy solution: Initial GoTo model (First-price auction) Advertise based on highest bidders Upon a click, advertiser is charged a price equal to his bid Used first by Overture/Yahoo! Complications Each ad has a different likelihood of being clicked Advertiser 1 bids $2, click probability = 0.1 Advertiser 2 bids $1, click probability = 0.5 Clickthrough rate measured historically

How to maximize revenue? Google model: Second-price auction Advertisers ranked according to bid and click-through rate (CTR) Use the “expected revenue per click” Upon a click, advertiser is charged minimum amount required to maintain position in ranking

The Adwords Innovation

Advertiser | Bid   | CTR  | Bid * CTR
A          | $1.00 | 1%   | 1 cent
B          | $0.75 | 2%   | 1.5 cents
C          | $0.50 | 2.5% | 1.25 cents

(CTR = click-through rate; Bid * CTR = expected revenue per impression.)

The Adwords Innovation: instead of sorting advertisers by bid, sort by expected revenue (Bid * CTR)!

Advertiser | Bid   | CTR  | Bid * CTR
B          | $0.75 | 2%   | 1.5 cents
C          | $0.50 | 2.5% | 1.25 cents
A          | $1.00 | 1%   | 1 cent
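A toy sketch of the ranking step, sorting advertisers by expected revenue per impression (bid × CTR) as in the table above:

```python
# Rank advertisers by bid * CTR rather than by bid alone.
advertisers = [
    ("A", 1.00, 0.010),   # name, bid in dollars, CTR as a fraction
    ("B", 0.75, 0.020),
    ("C", 0.50, 0.025),
]

ranked = sorted(advertisers, key=lambda a: a[1] * a[2], reverse=True)
for name, bid, ctr in ranked:
    print(f"{name}: expected revenue per impression = {100 * bid * ctr:.3f} cents")
```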

More complications: each advertiser has a limited budget, and the search engine guarantees that the advertiser will not be charged more than their daily budget. Solution: ensure the budget is not "exhausted" unnecessarily early.

Simplified model (for now) All bids are 0 or 1 Each advertiser has the same budget B One ad per query All ads are equally likely to be clicked Let’s try the greedy algorithm Arbitrarily pick an eligible advertiser (bid 1) for each keyword

Bad scenario for greedy: two advertisers A and B; A bids on query x, B bids on x and y; both have budgets of $4. Query stream: x x x x y y y y. Worst-case greedy choice: B B B B _ _ _ _. Optimal: A A A A B B B B. Competitive ratio = min over all possible inputs of (Revenue_greedy / Revenue_opt) = 1/2.

Adwords Problem Given A set of bids by advertisers for search queries A click-through rate for each advertise-query pair A budget for each advertiser (say for 1 day, 1 month, …) A limit on the number of ads to be displayed with each search query, N Respond to each search query with a set of advertisers such that The size of the set is no larger than N Each advertiser has bid on the search query Each advertiser has enough budget left to pay for the ad if it is clicked

BALANCE algorithm [Mehta, Saberi, Vazirani, and Vazirani]: for each query, pick the advertiser with the largest unspent budget; break ties arbitrarily but deterministically. Reference: A. Mehta, A. Saberi, U. Vazirani, V. Vazirani, "AdWords and Generalized On-line Matching", Journal of the ACM (2007); the conference version appeared in the IEEE Symposium on Foundations of Computer Science (2005).
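A sketch of BALANCE under the simplified model (0/1 bids, unit price per click); the data layout and the alphabetical tie-breaking rule are my own choices:

```python
# BALANCE: assign each query to the eligible bidder with the largest
# unspent budget, breaking ties deterministically.
def balance(queries, bids, budgets):
    """bids: query -> set of advertisers bidding on it;
       budgets: advertiser -> budget (whole units)."""
    remaining = dict(budgets)
    allocation = []
    for q in queries:
        eligible = [a for a in bids.get(q, set()) if remaining[a] >= 1]
        if not eligible:
            allocation.append(None)        # query goes unserved
            continue
        # Largest unspent budget; ties broken by alphabetical order.
        chosen = min(eligible, key=lambda a: (-remaining[a], a))
        remaining[chosen] -= 1
        allocation.append(chosen)
    return allocation

# Example from the next slide: A bids on x; B bids on x and y; budgets of 4.
print(balance(list("xxxxyyyy"),
              {"x": {"A", "B"}, "y": {"B"}},
              {"A": 4, "B": 4}))
```

Run on the slide's example, this reproduces the allocation A B A B B B _ _, i.e., revenue 6 against an optimal 8.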

Example: BALANCE. Two advertisers A and B; A bids on query x, B bids on x and y; both have budgets of $4. Query stream: x x x x y y y y. BALANCE choice: A B A B B B _ _. Optimal: A A A A B B B B. Competitive ratio = 3/4 (for BALANCE with 2 advertisers).

Analyzing BALANCE Consider simple case: two advertisers, A1 and A2, each with budget B (assume B ≥ 1) Assume optimal solution exhausts both advertisers’ budgets BALANCE must exhaust at least one advertiser’s budget If not, we can allocate more queries Assume BALANCE exhausts A2’s budget

Analyzing BALANCE (figure): the optimal solution allocates B queries to each of A1 and A2, so Opt revenue = 2B. BALANCE exhausts A2's budget (revenue B) and earns y from A1, so BALANCE revenue = B + y. Goal: show that y ≥ B/2, which gives BALANCE revenue ≥ B + B/2 = 3B/2 and hence a competitive ratio of (3B/2)/(2B) = 3/4.

Analyzing BALANCE, case analysis on the B queries that the optimal solution allocates to A1 (the "blue" queries in the figure): Case 1: BALANCE assigns at least B/2 of them to A1, so y ≥ B/2. Case 2: BALANCE assigns more than B/2 of them to A2. At the time that happened, A2's unspent budget must have been at least as big as A1's, which means at least as many queries had been assigned to A1 as to A2; since at least B/2 queries had already been assigned to A2, at least B/2 had been assigned to A1 as well. So, y ≥ B/2.

General Result: > 2 advertisers Worst competitive ratio of BALANCE is 1–1/e = approx. 0.63 Interestingly, no online algorithm has a better competitive ratio!

General version of the problem: so far, all bids were 1 and all budgets were equal (= B). Now allow arbitrary bids and budgets: for query q and advertiser i, the bid is x_i and the budget is b_i. BALANCE can be terrible here. Consider two advertisers A1 and A2 with A1: x_1 = 1, b_1 = 110 and A2: x_2 = 10, b_2 = 100, and suppose we see 10 instances of q. BALANCE always selects A1 (the larger unspent budget) and earns 10, while the optimal earns 100, so the competitive ratio is 0.1! A generalized BALANCE scheme achieves a competitive ratio of 1 − 1/e by modifying BALANCE to take the fraction of unspent budget into account, among other changes.
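A sketch of one such generalization: each bid is scaled by a function of the fraction of the budget already spent. The trade-off function psi(f) = 1 − e^(f−1) is my recollection of the Mehta et al. scheme and should be treated as an assumption rather than a statement of their exact algorithm:

```python
# Generalized BALANCE sketch: allocate each query to the bidder that
# maximizes bid * psi(fraction of budget already spent).
import math

def psi(spent_fraction):
    return 1 - math.exp(spent_fraction - 1)   # discounts nearly-spent budgets

def generalized_balance(queries, bids, budgets):
    """bids: query -> {advertiser: bid}; budgets: advertiser -> budget."""
    spent = {a: 0.0 for a in budgets}
    allocation = []
    for q in queries:
        eligible = {a: x for a, x in bids.get(q, {}).items()
                    if spent[a] + x <= budgets[a]}
        if not eligible:
            allocation.append(None)
            continue
        chosen = max(eligible,
                     key=lambda a: eligible[a] * psi(spent[a] / budgets[a]))
        spent[chosen] += eligible[chosen]
        allocation.append(chosen)
    return allocation
```

On the bad example above (x_1 = 1, b_1 = 110 vs x_2 = 10, b_2 = 100), this rule prefers A2 and earns 100 rather than 10.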

Actual problem is even more sophisticated … Upon each search, Interested advertisers are selected from database using keyword matching algorithm Budget allocation algorithm retains interested advertisers with sufficient budget Advertisers compete for ad slots in allocation mechanism Upon click, advertiser charged with pricing scheme CTR updated according to CTR learning algorithm for future auctions

Conclusion Data streaming applications are becoming increasingly important Online algorithms are needed We have looked at data stream processing in general, and web advertising in particular

Acknowledgements: these slides are adapted from numerous sources (not all listed here), including Jure Leskovec and Jeff Ullman (Stanford) and Bergemann (Yale).