Big Data Infrastructure


Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 12: Real-Time Data Analytics (1/2) March 28, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are available at http://lintool.github.io/bigdata-2017w/ This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

ETL (Extract, Transform, and Load) users → Frontend → Backend → OLTP database → ETL → Data Warehouse → BI tools → analysts. "My data is a day old… Meh."

Twitter’s data warehousing architecture

Mishne et al. Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture. SIGMOD 2013.

Case Studies: Steve Jobs passes away

Initial Implementation Algorithm: co-occurrences within query sessions. Implementation: Pig scripts over query logs on HDFS. Problem: query suggestions were several hours old! Why? Log collection lag, Hadoop scheduling lag, Hadoop job latencies. We need real-time processing! (stay tuned next time for the actual solution)

real-time vs. online streaming

What is a data stream? Sequence of items: Structured (e.g., tuples) Ordered (implicitly or timestamped) Arriving continuously at high volumes Sometimes not possible to store entirely Sometimes not possible to even examine all items

Applications Network traffic monitoring Datacenter telemetry monitoring Sensor networks monitoring Credit card fraud detection Stock market analysis Online mining of click streams Monitoring social media streams

What exactly do you do? “Standard” relational operations: Select Project Transform (i.e., apply custom UDF) Group by Join Aggregations What else do you need to make this “work”?

Issues of Semantics Group by… aggregate: when do you stop grouping and start aggregating? Joining a stream and a static source: simple lookup. Joining two streams: how long do you wait for the join key in the other stream? Joining two streams, group by and aggregation: when do you stop joining? What's the solution?

Windows Windows restrict processing scope: Windows based on ordering attributes (e.g., time) Windows based on item (record) counts Windows based on explicit markers (e.g., punctuations)

Windows on Ordering Attributes Assumes the existence of an attribute that defines the order of stream elements (e.g., time). Let T be the window size in units of the ordering attribute. Sliding window: overlapping windows [ti, ti'] with ti' – ti = T, advancing as the stream progresses. Tumbling window: consecutive non-overlapping windows with ti+1 – ti = T.

Windows on Counts Window of size N elements (sliding, tumbling) over the stream.
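Count-based windows are straightforward to sketch in code. Below is a minimal illustration (not from the slides; function names are ours) of tumbling and sliding windows of N elements over a stream, written as Python generators:

```python
from collections import deque

def tumbling_windows(stream, n):
    """Yield non-overlapping windows of n consecutive items."""
    window = []
    for item in stream:
        window.append(item)
        if len(window) == n:
            yield list(window)
            window.clear()

def sliding_windows(stream, n):
    """Yield the n most recent items after each new arrival."""
    window = deque(maxlen=n)  # the oldest item falls off automatically
    for item in stream:
        window.append(item)
        if len(window) == n:
            yield list(window)
```

For example, a tumbling window of size 2 over 0…5 yields [0, 1], [2, 3], [4, 5], while a sliding window of size 2 yields [0, 1], [1, 2], [2, 3], ….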

Windows from "Punctuations" Application-inserted "end-of-processing" markers. Example: in a stream of actions, "end of user session". Advantage: application-controlled semantics. Disadvantage: unpredictable window size (too large or too small).

Streams Processing Challenges Inherent challenges Latency requirements Space bounds System challenges Load balancing Unreliable and out-of-order message delivery Fault-tolerance Consistency semantics (at most once, exactly once, at least once)

Common Techniques Source: Wikipedia (Forge)

Space: The Final Frontier Data streams are potentially infinite Unbounded space! How do we get around this? Throw old data away Accept some approximation General techniques Sampling Hashing

Reservoir Sampling Task: select s elements from a stream of size N with uniform probability. N can be very large; we might not even know what N is! (infinite stream) Solution: reservoir sampling. Store the first s elements. For the k-th element thereafter, keep it with probability s/k (randomly discarding an existing element). Example: s = 10. Keep the first 10 elements; keep the 11th with probability 10/11, the 12th with probability 10/12, …

Reservoir Sampling: How does it work? Example: s = 10. Keep the first 10 elements; keep the 11th with probability 10/11. If we decide to keep it, it is sampled uniformly by definition. Probability an existing item is discarded: 10/11 × 1/10 = 1/11; probability an existing item survives: 10/11. General case, at the (k + 1)-th element: probability of having selected each item up until now is s/k; probability an existing item is discarded: s/(k + 1) × 1/s = 1/(k + 1); probability an existing item survives: k/(k + 1). Probability each item survives to the (k + 1)-th round: (s/k) × k/(k + 1) = s/(k + 1).
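The procedure above translates directly into a few lines of Python. This is a minimal sketch (variable names are ours, not from the slides):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Select s elements uniformly from a stream of unknown length."""
    reservoir = []
    for k, item in enumerate(stream, start=1):
        if k <= s:
            reservoir.append(item)              # store the first s elements
        elif rng.random() < s / k:              # keep the k-th item w.p. s/k
            reservoir[rng.randrange(s)] = item  # randomly discard one
    return reservoir
```

Note that the stream is consumed in a single pass with O(s) memory, which is exactly what makes this workable on unbounded data.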

Hashing for Three Common Tasks Cardinality estimation What’s the cardinality of set S? How many unique visitors to this page? HashSet HLL counter Set membership Is x a member of set S? Has this user seen this ad before? HashSet Bloom Filter Frequency estimation How many times have we observed x? How many queries has this user issued? HashMap CMS

HyperLogLog Counter Task: cardinality estimation of a set. size() → number of unique elements in the set. Observation: hash each item and examine the hash code. On expectation, 1/2 of the hash codes will start with 1, 1/4 will start with 01, 1/8 with 001, 1/16 with 0001, … How do we take advantage of this observation?
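This observation can be turned into a crude estimator. The sketch below is a toy Flajolet–Martin-style estimate, not the full HyperLogLog algorithm (which keeps many registers and combines them with a harmonic mean to cut variance); SHA-1 stands in for a uniform hash, and the names are our own:

```python
import hashlib

def hash32(item):
    """32-bit hash code (SHA-1 truncated, standing in for a uniform hash)."""
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:4], "big")

def fm_estimate(stream):
    """Estimate cardinality as 2**r, where r is one more than the
    longest run of leading zeros observed in any hash code."""
    r = 0
    for item in stream:
        h = hash32(item)
        r = max(r, 32 - h.bit_length() + 1)  # leading zeros + 1
    return 2 ** r
```

Duplicates never change the estimate, since the same item always hashes the same way; the single-register estimate is very noisy, which is precisely the problem HLL's many registers address.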

Bloom Filters Task: keep track of set membership. Operations: put(x) → insert x into the set; contains(x) → yes if x is a member of the set. Components: m-bit bit vector; k hash functions h1 … hk.

Bloom Filters: put. put x computes h1(x) = 2, h2(x) = 5, h3(x) = 11 and sets bits 2, 5, and 11 of the bit vector to 1.

Bloom Filters: contains. contains x computes h1(x) = 2, h2(x) = 5, h3(x) = 11 and ANDs the probed bits A[h1(x)], A[h2(x)], A[h3(x)]; all three are 1, so the answer is YES.

Bloom Filters: contains. contains y computes h1(y) = 2, h2(y) = 6, h3(y) = 9 and ANDs A[h1(y)], A[h2(y)], A[h3(y)]; not all of them are 1, so the answer is NO. What's going on here?

Bloom Filters Error properties of contains(x): false positives possible, no false negatives. Usage: constraints are capacity and error probability; tunable parameters are the size of the bit vector m and the number of hash functions k.
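Putting the pieces together, here is a minimal Bloom filter sketch (our own illustration, not production code; the k hash functions are derived by salting a single SHA-1 digest, one common trick among several):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m          # m-bit bit vector

    def _positions(self, x):
        # Derive k hash functions h1 ... hk by salting one digest.
        for i in range(self.k):
            d = hashlib.sha1(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def put(self, x):
        for p in self._positions(x):
            self.bits[p] = 1

    def contains(self, x):
        # AND over the k probed bits: no false negatives,
        # but false positives are possible.
        return all(self.bits[p] for p in self._positions(x))
```

The false-positive rate grows as the bit vector fills, which is why capacity and error probability are the constraints to size m and k against.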

Count-Min Sketches Task: frequency estimation. Operations: put(x) → increment the count of x by one; get(x) → return the estimated frequency of x. Components: an m-by-k array of counters; k hash functions h1 … hk.

Count-Min Sketches: put. put x computes h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4 and increments the corresponding counter in each row; each of those counters is now 1.

Count-Min Sketches: put. A second put x increments the same counters to 2.

Count-Min Sketches: put. put y computes h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2 and increments those counters; h2(y) = 5 collides with h2(x) = 5, so that counter now reads 3, while y's other counters read 1.

Count-Min Sketches: get. get x reads A[h1(x)], A[h2(x)], A[h3(x)], A[h4(x)] and returns MIN = 2, the true count, despite the colliding counter reading 3.

Count-Min Sketches: get. get y reads A[h1(y)], A[h2(y)], A[h3(y)], A[h4(y)] and returns MIN = 1; the colliding counter reads 3, but the minimum filters it out.

Count-Min Sketches Error properties of get(x): reasonable estimation of heavy hitters; frequent over-estimation of the tail (collisions only ever add, so counts are never under-estimated). Usage: constraints are the number of distinct events, the distribution of events, and the error bounds; tunable parameters are the number of counters m, the number of hash functions k, and the size of the counters.
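The same salted-hash trick used for the Bloom filter gives a minimal Count-Min sketch (again a sketch of the idea, not production code; names are ours):

```python
import hashlib

class CountMinSketch:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [[0] * m for _ in range(k)]  # k rows of m counters

    def _positions(self, x):
        # One counter index per row, via k salted hash functions.
        for i in range(self.k):
            d = hashlib.sha1(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def put(self, x):
        for row, p in zip(self.counters, self._positions(x)):
            row[p] += 1

    def get(self, x):
        # The minimum over rows filters out most collision noise;
        # collisions only add, so get(x) never under-estimates.
        return min(row[p] for row, p in zip(self.counters, self._positions(x)))
```

Taking the minimum rather than, say, the average is what bounds the over-estimate: a single collision-free row is enough to return the true count.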

Next time: Stream Processing Architectures Source: Wikipedia (River)

Questions? Source: Wikipedia (Japanese rock garden)