Big Data Infrastructure Week 12: Real-Time Data Analytics (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.

Big Data Infrastructure, Week 12: Real-Time Data Analytics (2/2). CS 489/698 Big Data Infrastructure (Winter 2016), Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo, March 31, 2016. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

Twitter's data warehousing architecture: what's the issue?

Hashing for Three Common Tasks
Cardinality estimation: what's the cardinality of set S? (How many unique visitors to this page?) Exact: HashSet. Approximate: HLL counter.
Set membership: is x a member of set S? (Has this user seen this ad before?) Exact: HashSet. Approximate: Bloom filter.
Frequency estimation: how many times have we observed x? (How many queries has this user issued?) Exact: HashMap. Approximate: CMS.

HyperLogLog Counter
Task: cardinality estimation of a set; size() → number of unique elements in the set.
Observation: hash each item and examine the hash code. On expectation, 1/2 of the hash codes will start with 1; 1/4 will start with 01; 1/8 will start with 001; 1/16 will start with 0001; and so on.
How do we take advantage of this observation?
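A minimal sketch of how to exploit that observation (a Flajolet-Martin-style estimator, not the full HyperLogLog algorithm with buckets and bias correction; the hash function choice and names are illustrative):

import scala.util.hashing.MurmurHash3

// Hash each item, track the longest run of leading zeros seen in any hash code,
// and estimate the cardinality as roughly 2^(maxLeadingZeros + 1). If 1/2^k of
// hash codes start with k zeros, observing k leading zeros suggests ~2^k items.
object LeadingZeroEstimator {
  def estimateCardinality(items: Iterator[String]): Long = {
    var maxLeadingZeros = 0
    for (item <- items) {
      val h = MurmurHash3.stringHash(item)              // 32-bit hash code
      val leadingZeros = Integer.numberOfLeadingZeros(h)
      if (leadingZeros > maxLeadingZeros) maxLeadingZeros = leadingZeros
    }
    1L << (maxLeadingZeros + 1)
  }

  def main(args: Array[String]): Unit = {
    // ~5000 distinct users among 100000 events; expect a rough order-of-magnitude estimate.
    val users = Iterator.fill(100000)("user" + scala.util.Random.nextInt(5000))
    println(estimateCardinality(users))
  }
}

The single-register estimate is very noisy; HyperLogLog's refinement is to keep many such registers (one per hash-code prefix bucket) and combine them with a bias-corrected mean.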

Bloom Filters
Task: keep track of set membership. put(x) → insert x into the set; contains(x) → yes if x is a member of the set.
Components: an m-bit bit vector and k hash functions h1 … hk.

Bloom Filters: put x. put(x) computes h1(x) = 2, h2(x) = 5, h3(x) = 11.

Bloom Filters: put x. The bits at positions 2, 5, and 11 of the bit vector are set to 1.

Bloom Filters: contains x. contains(x) computes h1(x) = 2, h2(x) = 5, h3(x) = 11.

Bloom Filters: contains x. contains(x) computes h1(x) = 2, h2(x) = 5, h3(x) = 11 and checks A[h1(x)] AND A[h2(x)] AND A[h3(x)] = YES.

Bloom Filters: contains y. contains(y) computes h1(y) = 2, h2(y) = 6, h3(y) = 9.

Bloom Filters: contains y. contains(y) computes h1(y) = 2, h2(y) = 6, h3(y) = 9 and checks A[h1(y)] AND A[h2(y)] AND A[h3(y)] = NO. What's going on here?

Bloom Filters
Error properties of contains(x): false positives are possible; there are no false negatives.
Usage: the constraints are capacity and error probability; the tunable parameters are the size of the bit vector m and the number of hash functions k.
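A minimal Bloom filter sketch under those constraints (illustrative; it uses MurmurHash3 with k different seeds as the k hash functions):

import scala.collection.mutable.BitSet
import scala.util.hashing.MurmurHash3

// m-bit bit vector plus k hash functions. put sets k bits; contains answers yes
// only if all k bits are set, so false positives are possible but false negatives are not.
class BloomFilter(m: Int, k: Int) {
  private val bits = new BitSet()
  private def positions(x: String): Seq[Int] =
    (0 until k).map { seed =>
      val h = MurmurHash3.stringHash(x, seed)
      ((h % m) + m) % m          // map the hash code to a bit position in [0, m)
    }

  def put(x: String): Unit = positions(x).foreach(bits += _)
  def contains(x: String): Boolean = positions(x).forall(bits.contains)
}

// Usage (hypothetical values):
//   val bf = new BloomFilter(m = 1 << 20, k = 3)
//   bf.put("user123:ad42")
//   bf.contains("user123:ad42")   // true
//   bf.contains("user999:ad17")   // false, except with small false-positive probability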

Count-Min Sketches
Task: frequency estimation. put(x) → increment the count of x by one; get(x) → return the frequency of x.
Components: k hash functions h1 … hk and an m-by-k array of counters.

Count-Min Sketches: put x. put(x) computes h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4.

Count-Min Sketches: put x. In each of the four rows, the counter at the hashed position (2, 5, 11, and 4, respectively) is incremented by one.

Count-Min Sketches: put x again. The same positions are computed: h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4.

Count-Min Sketches: put x again. The counters at those positions are incremented a second time, so each now holds 2.

Count-Min Sketches: put y. put(y) computes h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2.

Count-Min Sketches: put y. The counters at those positions are incremented; note that h2(y) = 5 coincides with h2(x) = 5, so that counter is shared between x and y.

Count-Min Sketches: get x. get(x) computes h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4.

Count-Min Sketches: get x. get(x) computes h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4 and returns MIN(A[h1(x)], A[h2(x)], A[h3(x)], A[h4(x)]) = 2.

Count-Min Sketches: get y. get(y) computes h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2.

Count-Min Sketches: get y. get(y) computes h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2 and returns MIN(A[h1(y)], A[h2(y)], A[h3(y)], A[h4(y)]) = 1. The collision in row 2 inflates that counter, but taking the minimum over rows still yields the correct count of 1.

Count-Min Sketches
Error properties: reasonable estimation of heavy hitters; frequent over-estimation of the tail.
Usage: the constraints are the number of distinct events, the distribution of events, and the error bounds; the tunable parameters are the number of counters m, the number of hash functions k, and the size of the counters.
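A minimal count-min sketch along the same lines (illustrative; one seeded MurmurHash3 hash per row):

import scala.util.hashing.MurmurHash3

// k rows of m counters, one hash function per row. put increments one counter
// per row; get takes the minimum across rows, so estimates never under-count
// but may over-count, especially for rare items sharing counters with heavy hitters.
class CountMinSketch(m: Int, k: Int) {
  private val counts = Array.ofDim[Long](k, m)
  private def col(x: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(x, row)
    ((h % m) + m) % m
  }

  def put(x: String): Unit =
    for (row <- 0 until k) counts(row)(col(x, row)) += 1L

  def get(x: String): Long =
    (0 until k).map(row => counts(row)(col(x, row))).min
}

// Usage (hypothetical values):
//   val cms = new CountMinSketch(m = 2000, k = 4)
//   cms.put("query:weather"); cms.put("query:weather"); cms.put("query:news")
//   cms.get("query:weather")   // at least 2; exactly 2 unless there were collisions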

Three Common Tasks (recap)
Cardinality estimation (how many unique visitors to this page?): HashSet exactly, HLL counter approximately.
Set membership (has this user seen this ad before?): HashSet exactly, Bloom filter approximately.
Frequency estimation (how many queries has this user issued?): HashMap exactly, CMS approximately.

Stream Processing Architectures (image source: Wikipedia, River)

Producer/Consumers. How do consumers get data from producers?

Producer/Consumers. Producer pushes (e.g., via callback).

Producer/Consumers. Consumer pulls (e.g., poll, tail).

Producer/Consumers. Multiple producers, multiple consumers.

Producer/Consumers with a broker in between: queue or pub/sub semantics (e.g., Kafka).
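The broker pattern is what Kafka implements; the following is a minimal sketch (not from the slides) of a producer pushing to a topic and a consumer pulling from it, using the Kafka Java client from Scala. It assumes Scala 2.13 (for CollectionConverters) and a recent Kafka client (poll takes a Duration); the broker address, topic name, and consumer group are illustrative.

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object BrokerExample {
  def main(args: Array[String]): Unit = {
    // Producer pushes records to the broker; it never talks to consumers directly.
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092")
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](producerProps)
    producer.send(new ProducerRecord("clicks", "user42", "page7"))
    producer.close()

    // Consumer pulls from the broker at its own pace, decoupled from the producer.
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "localhost:9092")
    consumerProps.put("group.id", "click-counter")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](consumerProps)
    consumer.subscribe(java.util.Collections.singletonList("clicks"))
    for (record <- consumer.poll(Duration.ofSeconds(1)).asScala)
      println(s"${record.key} -> ${record.value}")
    consumer.close()
  }
}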

Tuple-at-a-Time Processing

Storm
Open-source real-time distributed stream processing system. Started at BackType; BackType was acquired by Twitter in 2011; now an Apache project. Storm aspires to be the Hadoop of real-time processing!

Storm Topologies
A Storm topology = a "job": once started, it runs continuously until killed.
A Storm topology is a computation graph: the graph contains nodes and edges; nodes hold processing logic (i.e., transformations over their input); directed edges indicate communication between nodes.
Processing semantics: at most once (without acknowledgments) or at least once (with acknowledgments).

Streams, Spouts, and Bolts
Streams: the basic collection abstraction, an unbounded sequence of tuples; streams are transformed by the processing elements of a topology.
Spouts: stream generators; may propagate a single stream to multiple consumers.
Bolts: subscribe to streams, act as stream transformers, and process incoming streams to produce new ones.

Stream Groupings
Bolts are executed by multiple workers in parallel. When a bolt emits a tuple, where should it go? Stream groupings decide: shuffle grouping (round-robin) or fields grouping (based on a data value). A wiring sketch follows below.
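A minimal, hedged wiring sketch (not from the original deck) showing both groupings. It assumes the org.apache.storm package layout of recent Storm releases (older releases used backtype.storm) and TestWordSpout, a demo spout bundled with Storm; the bolt classes and parallelism hints are illustrative.

import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.testing.TestWordSpout
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.tuple.{Fields, Tuple, Values}

// Illustrative bolt: lower-cases each incoming word and re-emits it on field "word".
class LowerCaseBolt extends BaseBasicBolt {
  def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(input.getString(0).toLowerCase))
  def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

// Illustrative sink bolt: just prints what it receives and emits nothing.
class PrintBolt extends BaseBasicBolt {
  def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    println(input.getString(0))
  def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {}
}

object GroupingDemo {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("words", new TestWordSpout, Integer.valueOf(2))
    // Shuffle grouping: tuples from the spout are spread round-robin over the 4 bolt tasks.
    builder.setBolt("lower", new LowerCaseBolt, Integer.valueOf(4)).shuffleGrouping("words")
    // Fields grouping: tuples with the same "word" value always reach the same task,
    // which is what a downstream counting bolt would need.
    builder.setBolt("print", new PrintBolt, Integer.valueOf(4)).fieldsGrouping("lower", new Fields("word"))

    val cluster = new LocalCluster
    cluster.submitTopology("grouping-demo", new Config, builder.createTopology())
    Thread.sleep(10000)
    cluster.shutdown()
  }
}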

From Storm to Heron. Heron = API-compatible re-implementation of Storm.

Mini-Batch Processing

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs: chop up the live data stream into batches of X seconds; Spark Streaming treats each batch of data as an RDD and processes it using RDD operations; the processed results of the RDD operations are returned in batches.
(Source: the following Spark Streaming slides are by Tathagata Das.)

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs: batch sizes as low as half a second, latency of about one second. Potential for combining batch processing and stream processing in the same system.

Example: Get hashtags from Twitter
val tweets = ssc.twitterStream(<username>, <password>)
A DStream is a sequence of RDDs representing a stream of data. The tweets DStream is fed by the Twitter streaming API; each batch is stored in memory as an RDD (immutable, distributed).

Example: Get hashtags from Twitter
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
A transformation modifies the data in one DStream to create another DStream: flatMap produces the new hashTags DStream (e.g., [#cat, #dog, …]), with new RDDs created for every batch.

Example: Get hashtags from Twitter
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
An output operation pushes data to external storage: every batch of the hashTags DStream is saved to HDFS.

Fault-tolerance
RDDs remember the sequence of operations that created them from the original fault-tolerant input data. Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant. Data lost due to a worker failure can be recomputed from the replicated input data: the lost partitions of the hashTags RDD are recomputed from the tweets RDD on other workers.

Key concepts
DStream: a sequence of RDDs representing a stream of data; sources include Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka actors, and TCP sockets.
Transformations: modify data from one DStream to another; standard RDD operations (map, countByValue, reduce, join, …) and stateful operations (window, countByValueAndWindow, …).
Output operations: send data to an external entity; saveAsHadoopFiles saves to HDFS, foreach does anything with each batch of results. A minimal end-to-end sketch follows below.
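To make these concepts concrete, here is a minimal, self-contained sketch (not from the original slides) with the same source/transformation/output structure. It substitutes a TCP socket source for the Twitter source so no credentials are needed; the host, port, and output path are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HashtagStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HashtagStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))        // 1-second batches

    // DStream source: lines of text arriving on a TCP socket (e.g., fed by `nc -lk 9999`)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformation: one DStream to another (extract hashtag-like tokens)
    val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))

    // Output operations: push each batch to the console and to external storage
    hashTags.countByValue().print()
    hashTags.saveAsTextFiles("hdfs:///tmp/hashtags")

    ssc.start()
    ssc.awaitTermination()
  }
}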

Example: Count the hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()
Under the hood, each batch runs a flatMap, map, and reduceByKey, producing the tagCounts DStream (e.g., [(#cat, 10), (#dog, 25), …]).

Example: Count the hashtags over the last 10 minutes
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
This is a sliding window operation: Minutes(10) is the window length and Seconds(1) is the sliding interval.

Example: Count the hashtags over the last 10 minutes
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
The sliding window over the hashTags DStream is materialized, and countByValue counts over all the data in the window to produce tagCounts.

Smart window-based countByValue
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
Instead of recounting the whole window each time, add the counts from the new batch entering the window and subtract the counts from the batch leaving the window.

Smart window-based reduce
The technique of incrementally computing the count generalizes to many reduce operations; it needs a function to "inverse reduce" ("subtract", in the case of counting). The counting above could have been implemented as:
hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), …)
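As a hedged sketch of what that call looks like in a full pipeline (continuing the socket-based hashTags DStream from the sketch above, with illustrative paths): Spark's reduceByKeyAndWindow overload with an inverse function operates on a keyed DStream and requires checkpointing so the window state can be recovered.

import org.apache.spark.streaming.Minutes

// The inverse-reduce variant needs a checkpoint directory for its window state.
ssc.checkpoint("hdfs:///tmp/checkpoints")

val tagCounts = hashTags
  .map(tag => (tag, 1L))
  .reduceByKeyAndWindow(
    _ + _,            // add counts from the batch entering the window
    _ - _,            // subtract counts from the batch leaving the window
    Minutes(10),      // window length
    Seconds(1))       // sliding interval

tagCounts.print()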

Integrating Batch and Online Processing

Summingbird
A domain-specific language (in Scala) designed to integrate batch and online MapReduce computations.
Idea #1: algebraic structures provide the basis for seamless integration of batch and online processing (probabilistic data structures as monoids).
Idea #2: for many tasks, close enough is good enough.

Batch and Online MapReduce
The "map" side: flatMap[T, U](fn: T => List[U]): List[U]; map[T, U](fn: T => U): List[U]; filter[T](fn: T => Boolean): List[T].
The "reduce" side: sumByKey.

Idea #1: algebraic structures provide the basis for seamless integration of batch and online processing.
Semigroup = (M, ⊕), where ⊕ : M × M → M such that (m1 ⊕ m2) ⊕ m3 = m1 ⊕ (m2 ⊕ m3) for all m1, m2, m3 ∈ M.
Monoid = semigroup + identity: there exists ε such that ε ⊕ m = m ⊕ ε = m for all m ∈ M.
Commutative monoid = monoid + commutativity: m1 ⊕ m2 = m2 ⊕ m1 for all m1, m2 ∈ M.
Simplest example: the integers with + (addition).
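To make the definitions concrete, here is a small sketch with illustrative names (not Summingbird's actual algebra library, which is Algebird): a Monoid trait, the integer-addition instance from the slide, and spot checks of the laws.

// A minimal Monoid: an associative "plus" with an identity element.
trait Monoid[M] {
  def plus(a: M, b: M): M   // must be associative: plus(plus(a, b), c) == plus(a, plus(b, c))
  def zero: M               // identity: plus(zero, m) == plus(m, zero) == m
}

// The simplest example from the slide: integers under addition.
object IntAddition extends Monoid[Long] {
  def plus(a: Long, b: Long): Long = a + b
  val zero: Long = 0L
}

object MonoidLaws {
  def main(args: Array[String]): Unit = {
    val (a, b, c) = (3L, 5L, 7L)
    // Associativity is what lets batch and online systems group the stream differently.
    assert(IntAddition.plus(IntAddition.plus(a, b), c) == IntAddition.plus(a, IntAddition.plus(b, c)))
    assert(IntAddition.plus(IntAddition.zero, a) == a)
    // Commutativity (which addition also has) additionally allows out-of-order combining.
    assert(IntAddition.plus(a, b) == IntAddition.plus(b, a))
    println("monoid laws hold for these values")
  }
}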

Idea #1: algebraic structures provide the basis for seamless integration of batch and online processing.
(a ⊕ b ⊕ c ⊕ d ⊕ e ⊕ f): you can put the parentheses anywhere!
Batch (Hadoop): (((((a ⊕ b) ⊕ c) ⊕ d) ⊕ e) ⊕ f).
Mini-batches (online, Storm): ((a ⊕ b ⊕ c) ⊕ (d ⊕ e ⊕ f)).
Summingbird values must be at least semigroups (most are commutative monoids in practice). The power of associativity: the results are exactly the same!

Summingbird Word Count
def wordCount[P <: Platform[P]](source: Producer[P, String], store: P#Store[String, Long]) =
  source.flatMap { sentence => toWords(sentence).map(_ -> 1L) }.sumByKey(store)
The flatMap is the "map", sumByKey is the "reduce", source is where the data comes from, and store is where the data goes.
Run on Scalding (Cascading/Hadoop), reading from HDFS and writing to HDFS:
Scalding.run { wordCount[Scalding](Scalding.source[Tweet]("source_data"), Scalding.store[String, Long]("count_out")) }
Run on Storm, reading from a message queue and writing to a key-value store:
Storm.run { wordCount[Storm](new TweetSpout(), new MemcacheStore[String, Long]) }

(Diagram: the batch pipeline Input → Map → Reduce → Output corresponds to the online pipeline Spout → Bolt → Bolt → memcached.)

"Boring" monoids: addition, multiplication, max, min; moments (mean, variance, etc.); sets; hashmaps with monoid values; tuples of monoids. More interesting monoids?

Idea #2: for many tasks, close enough is good enough!
"Interesting" monoids: Bloom filters (set membership), HyperLogLog counters (cardinality estimation), count-min sketches (event counts).
Common features: (1) variations on hashing, (2) bounded error.
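Why these structures are (commutative) monoids: two Bloom filters built with the same m and k merge by bitwise OR of their bit vectors, and two count-min tables of the same shape merge by element-wise addition; both operations are associative and commutative, with the empty structure as identity. The sketch below uses simplified, illustrative stand-ins (a set of set bit positions and a 2-D counter array) to show the merge operations.

object SketchMonoids {
  // Bloom filter "plus": OR of set bit positions (represented here as a Set of indices).
  def bloomPlus(a: Set[Int], b: Set[Int]): Set[Int] = a union b

  // Count-min "plus": element-wise sum of two counter tables of identical shape.
  def cmsPlus(a: Array[Array[Long]], b: Array[Array[Long]]): Array[Array[Long]] =
    a.zip(b).map { case (rowA, rowB) => rowA.zip(rowB).map { case (x, y) => x + y } }

  def main(args: Array[String]): Unit = {
    // Hourly Bloom filters of "user saw ad" events can be OR-ed into a daily filter,
    // and per-batch count-min tables can be summed into a rolling total.
    val morning = Set(2, 5, 11)
    val evening = Set(5, 6, 9)
    println(bloomPlus(morning, evening))   // the union of set bit positions

    val batch1 = Array(Array(1L, 0L, 2L), Array(0L, 3L, 0L))
    val batch2 = Array(Array(0L, 1L, 1L), Array(2L, 0L, 0L))
    println(cmsPlus(batch1, batch2).map(_.mkString(",")).mkString(" | "))  // 1,1,3 | 2,3,0
  }
}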

Cheat sheet
Task            | Exact   | Approximate
Set membership  | set     | Bloom filter
Set cardinality | set     | HyperLogLog counter
Frequency count | hashmap | count-min sketch

Task: count queries by hour.
Exact, with hashmaps:
def wordCount[P <: Platform[P]](source: Producer[P, Query], store: P#Store[Long, Map[String, Long]]) =
  source.flatMap { query => (query.getHour, Map(query.getQuery -> 1L)) }.sumByKey(store)
Approximate, with a count-min sketch:
def wordCount[P <: Platform[P]](source: Producer[P, Query], store: P#Store[Long, SketchMap[String, Long]])(implicit countMonoid: SketchMapMonoid[String, Long]) =
  source.flatMap { query => (query.getHour, countMonoid.create((query.getQuery, 1L))) }.sumByKey(store)

Hybrid Online/Batch Processing
(Diagram) A single Summingbird program is run two ways: sources are ingested both into a message queue and into HDFS; a Storm topology reads the message queue and writes online results to a key-value store, while a Hadoop job reads HDFS and writes batch results to another key-value store; a client library reads both stores and combines online and batch results to answer queries.
Example: count historical clicks and clicks in real time.

Questions? (Image source: Wikipedia, Japanese rock garden)