Big Data Infrastructure

Slides:

Advertisements

Similar presentations

Sampling From a Moving Window Over Streaming Data Brian Babcock * Mayur Datar Rajeev Motwani * Speaker Stanford University.

Advertisements

Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker SIGMOD.

1 11. Streaming Data Management Chapter 18 Current Issues: Streaming Data and Cloud Computing The 3rd edition of the textbook.

Mining Data Streams.

Estimating TCP Latency Approximately with Passive Measurements Sriharsha Gangam, Jaideep Chandrashekar, Ítalo Cunha, Jim Kurose.

New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.

Heavy hitter computation over data stream

Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:

Look-up problem IP address did we see the IP address before?

What ’ s Hot and What ’ s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishman Rutgers University ACM Principles of Database Systems.

Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,

CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.

Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman.

06/10/2015Applied Algorithmics - week81 Combinatorial Group Testing  Much of the current effort of the Human Genome Project involves the screening of.

Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.

Right In Time Presented By: Maria Baron Written By: Rajesh Gadodia

1 Approximating Quantiles over Sliding Windows Srimathi Harinarayanan CMPS 565.

A Formal Analysis of Conservative Update Based Approximate Counting Gil Einziger and Roy Freidman Technion, Haifa.

Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.

The Replica Location Service The Globus Project™ And The DataGrid Project Copyright (c) 2002 University of Chicago and The University of Southern California.

Jennifer Rexford Princeton University MW 11:00am-12:20pm Measurement COS 597E: Software Defined Networking.

Big Data Infrastructure Jimmy Lin University of Maryland Monday, April 20, 2015 Session 11: Beyond MapReduce — Stream Processing This work is licensed.

The Bloom Paradox Ori Rottenstreich Joint work with Yossi Kanizo and Isaac Keslassy Technion, Israel.

New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang

Calculating frequency moments of Data Stream

Mining of Massive Datasets Ch4. Mining Data Streams

Big Data Infrastructure Week 12: Real-Time Data Analytics (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.

Streaming Semantic Data COMP6215 Semantic Web Technologies Dr Nicholas Gibbins –

Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.

Data Streams COMP3017 Advanced Databases Dr Nicholas Gibbins –

Big Data Infrastructure Week 9: Data Mining (4/4) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States.

András Benczúr Head, “Big Data – Momentum” Research Group Big Data Analytics Institute for Computer.

Big Data Infrastructure

CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Mining Data Streams (Part 1)

Big Data Infrastructure

Distributed, real-time actionable insights on high-volume data streams

Frequency Counts over Data Streams

COMP3211 Advanced Databases

Big Data Infrastructure

International Conference on Data Engineering (ICDE 2016)

Updating SF-Tree Speaker: Ho Wai Shing.

SONATA: Scalable Streaming Analytics for Network Monitoring

The Stream Model Sliding Windows Counting 1’s

Lower bounds for approximate membership dynamic data structures

Applying Control Theory to Stream Processing Systems

Dynamo: Amazon’s Highly Available Key-value Store

The Variable-Increment Counting Bloom Filter

COMP9313: Big Data Management Lecturer: Xin Cao Course web site:

Streaming & sampling.

Data Warehouse.

SONATA: Query-Driven Network Telemetry

COS 518: Advanced Computer Systems Lecture 11 Michael Freedman

Mining Data Streams (Part 1)

Chapter 15 QUERY EXECUTION.

COMP9313: Big Data Management Lecturer: Xin Cao Course web site:

Data-Intensive Distributed Computing

Qun Huang, Patrick P. C. Lee, Yungang Bao

Lecture 15: Bitmap Indexes

Approximate Frequency Counts over Data Streams

Range-Efficient Computation of F0 over Massive Data Streams

Data Intensive and Cloud Computing Data Streams Lecture 10

Introduction to Stream Computing and Reservoir Sampling

Heavy Hitters in Streams and Sliding Windows

Data science laboratory (DSLAB)

COS 518: Advanced Computer Systems Lecture 12 Michael Freedman

Lu Tang , Qun Huang, Patrick P. C. Lee

Author: Ramana Rao Kompella, Kirill Levchenko, Alex C

(Learned) Frequency Estimation Algorithms

Presentation transcript:

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 12: Real-Time Data Analytics (1/2) March 29, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are available at http://lintool.github.io/bigdata-2016w/ This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

OLTP/OLAP Architecture ETL (Extract, Transform, and Load)

Twitter’s data warehousing architecture What’s the issue? Twitter’s data warehousing architecture

real-time vs. online streaming

What is a data stream? Sequence of items: Structured (e.g., tuples) Ordered (implicitly or timestamped) Arriving continuously at high volumes Sometimes not possible to store entirely Sometimes not possible to even examine all items

What to do with data streams? Network traffic monitoring Datacenter telemetry monitoring Sensor networks monitoring Credit card fraud detection Stock market analysis Online mining of click streams Monitoring social media streams

What’s the scale? Packet data streams Single 2 Gb/sec link; say avg. packet size is 50 bytes Number of packets/sec = 5 million Time per packet = 0.2 microseconds If we only capture header information per packet: source/destination IP, time, no. of bytes, etc. – at least 10 bytes 50 MB per second 4+ TB per day Per link! What if you wanted to do deep-packet inspection? Source: Minos Garofalakis, Berkeley CS 286

R2 R1 R3 Set-Expression Query SQL Join Query Off-line analysis – Data access is slow, expensive DBMS What are the top (most frequent) 1000 (source, dest) pairs seen by R1 over the last month? How many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3? Set-Expression Query Network Operations Center (NOC) SELECT COUNT (R1.source, R1.dest) FROM R1, R2 WHERE R1.source = R2.source SQL Join Query R1 R2 BGP R3 Peer Converged IP/MPLS Network Enterprise Networks PSTN DSL/Cable Networks Source: Minos Garofalakis, Berkeley CS 286

Common Architecture queries DSMS data feeds DBMS DSMS data streams queries Data stream management system (DSMS) at observation points Voluminous streams-in, reduced streams-out Database management system (DBMS) Outputs of DSMS can be treated as data feeds to databases Source: Peter Bonz

OLTP/OLAP Architecture ETL (Extract, Transform, and Load)

DBMS vs. DSMS DBMS DSMS Model: persistent relations Relation: tuple set/bag Data update: modifications Query: transient Query answer: exact Query evaluation: arbitrary Query plan: fixed DSMS Model: (mostly) transient relations Relation: tuple sequence Data update: appends Query: persistent Query answer: approximate Query evaluation: one pass Query plan: adaptive Source: Peter Bonz

What makes it hard? Intrinsic challenges: System challenges: Volume Velocity Limited storage Strict latency requirements System challenges: Load balancing Unreliable and out-of-order message delivery Fault-tolerance Consistency semantics (at most once, exactly once, at least once)

What exactly do you do? “Standard” relational operations: Select Project Transform (i.e., apply custom UDF) Group by Join Aggregations What else do you need to make this “work”?

Issues of Semantics What’s the solution? Group by… aggregate When do you stop grouping and start aggregating? Joining a stream and a static source Simple lookup Joining two streams How long do you wait for the join key in the other stream? Joining two streams, group by and aggregation When do you stop joining? What’s the solution?

Windows Mechanism for extracting finite relations from an infinite stream Windows restrict processing scope: Windows based on ordering attributes (e.g., time) Windows based on item (record) counts Windows based on explicit markers (e.g., punctuations) Variants (e.g., some semantic partitioning constraint)

Windows on Ordering Attributes Assumes the existence of an attribute that defines the order of stream elements (e.g., time) Let T be the window size in units of the ordering attribute sliding window t1 t2 t3 t4 t1' t2’ t3’ t4’ ti’ – ti = T t3 tumbling window t1 t2 ti+1 – ti = T Source: Peter Bonz

Windows on Counts Window of size N elements (sliding, tumbling) over the stream Challenges: Problematic with non-unique timestamps: non-deterministic output Unpredictable window size (and storage requirements) t1 t2 t1' t3 t2’ t3’ t4’ Source: Peter Bonz

Windows from “Punctuations” Application-inserted “end-of-processing” Example: stream of actions… “end of user session” Properties Advantage: application-controlled semantics Disadvantage: unpredictable window size (too large or too small)

Common Techniques Source: Wikipedia (Forge)

“Hello World” Stream Processing Problem: Count the frequency of items in the stream Why? Take some action when frequency exceeds a threshold Data mining: raw counts → co-occurring counts → association rules

The Raw Stream… Source: Peter Bonz

Divide Into Windows… Source: Peter Bonz

First Window Source: Peter Bonz

Second Window Source: Peter Bonz

Solutions are approximate (or lossy) Window Counting What’s the issue? Lessons learned? Solutions are approximate (or lossy)

General Strategies Sampling Hashing

Reservoir Sampling Task: select s elements from a stream of size N with uniform probability N can be very very large We might not even know what N is! (infinite stream) Solution: Reservoir sampling Store first s elements For the k-th element thereafter, keep with probability s/k (randomly discard an existing element) Example: s = 10 Keep first 10 elements 11th element: keep with 10/11 12th element: keep with 10/12 …

Reservoir Sampling: How does it work? Example: s = 10 Keep first 10 elements 11th element: keep with 10/11 General case: at the (k + 1)th element Probability of selecting each item up until now is s/k Probability existing item is discarded: s/(k+1) × 1/s = 1/(k + 1) Probability existing item survives: k/(k + 1) Probability each item survives to (k + 1)th round: (s/k) × k/(k + 1) = s/(k + 1) If we decide to keep it: sampled uniformly by definition probability existing item is discarded: 10/11 × 1/10 = 1/11 probability existing item survives: 10/11

Hashing for Three Common Tasks Cardinality estimation What’s the cardinality of set S? How many unique visitors to this page? Set membership Is x a member of set S? Has this user seen this ad before? Frequency estimation How many times have we observed x? How many queries has this user issued? HashSet HLL counter HashSet Bloom Filter HashMap CMS

How do we take advantage of this observation? HyperLogLog Counter Task: cardinality estimation of set size() → number of unique elements in the set Observation: hash each item and examine the hash code On expectation, 1/2 of the hash codes will start with 1 On expectation, 1/4 of the hash codes will start with 01 On expectation, 1/8 of the hash codes will start with 001 On expectation, 1/16 of the hash codes will start with 0001 … How do we take advantage of this observation?

Bloom Filters Task: keep track of set membership Components put(x) → insert x into the set contains(x) → yes if x is a member of the set Components m-bit bit vector k hash functions: h1 … hk

Bloom Filters: put h1(x) = 2 put x h2(x) = 5 h3(x) = 11

Bloom Filters: put put x 1 1 1

Bloom Filters: contains h1(x) = 2 contains x h2(x) = 5 h3(x) = 11 1 1 1

Bloom Filters: contains h1(x) = 2 contains x h2(x) = 5 h3(x) = 11 AND = YES A[h1(x)] A[h2(x)] A[h3(x)] 1 1 1

Bloom Filters: contains h1(y) = 2 contains y h2(y) = 6 h3(y) = 9 1 1 1

Bloom Filters: contains h1(y) = 2 contains y h2(y) = 6 h3(y) = 9 AND = NO A[h1(y)] A[h2(y)] A[h3(y)] 1 1 1 What’s going on here?

Bloom Filters Error properties: contains(x) Usage: False positives possible No false negatives Usage: Constraints: capacity, error probability Tunable parameters: size of bit vector m, number of hash functions k

Count-Min Sketches Task: frequency estimation Components put(x) → increment count of x by one get(x) → returns the frequency of x Components k hash functions: h1 … hk m by k array of counters m k

Count-Min Sketches: put h1(x) = 2 put x h2(x) = 5 h3(x) = 11 h4(x) = 4

Count-Min Sketches: put x 1

Count-Min Sketches: put h1(x) = 2 put x h2(x) = 5 h3(x) = 11 h4(x) = 4 1

Count-Min Sketches: put x 2

Count-Min Sketches: put h1(y) = 6 put y h2(y) = 5 h3(y) = 12 h4(y) = 2 2

Count-Min Sketches: put y 2 1 3

Count-Min Sketches: get h1(x) = 2 get x h2(x) = 5 h3(x) = 11 h4(x) = 4 2 1 3

Count-Min Sketches: get h1(x) = 2 get x h2(x) = 5 h3(x) = 11 h4(x) = 4 A[h3(x)] MIN = 2 A[h1(x)] A[h2(x)] A[h4(x)] 2 1 3

Count-Min Sketches: get h1(y) = 6 get y h2(y) = 5 h3(y) = 12 h4(y) = 2 2 1 3

Count-Min Sketches: get h1(y) = 6 get y h2(y) = 5 h3(y) = 12 h4(y) = 2 MIN = 1 A[h3(y)] A[h1(y)] A[h2(y)] A[h4(y)] 2 1 3

Count-Min Sketches Error properties: Usage: Reasonable estimation of heavy-hitters Frequent over-estimation of tail Usage: Constraints: number of distinct events, distribution of events, error bounds Tunable parameters: number of counters m, number of hash functions k, size of counters

Three Common Tasks Cardinality estimation Set membership What’s the cardinality of set S? How many unique visitors to this page? Set membership Is x a member of set S? Has this user seen this ad before? Frequency estimation How many times have we observed x? How many queries has this user issued? HashSet HLL counter HashSet Bloom Filter HashMap CMS

Stream Processing Architectures Next time: Stream Processing Architectures Source: Wikipedia (River)

Questions? Source: Wikipedia (Japanese rock garden)