Spark Streaming: Large-scale near-real-time stream processing


Spark Streaming: Large-scale near-real-time stream processing. Tathagata Das (TD), UC Berkeley

What is Spark Streaming? Framework for large-scale stream processing. Scales to 100s of nodes. Can achieve second-scale latencies. Integrates with Spark's batch and interactive processing. Provides a simple batch-like API for implementing complex algorithms. Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

Motivation Many important applications must process large streams of live data and provide results in near-real-time: social network trends, website statistics, intrusion detection systems, etc. These workloads require large clusters and latencies of a few seconds.

Need for a framework for building such complex stream processing applications. But what are the requirements from such a framework?

Requirements Scalable to large clusters Second-scale latencies Simple programming model

Case study: Conviva, Inc. Real-time monitoring of online video metadata (HBO, ESPN, ABC, SyFy, …). Two processing stacks: a custom-built distributed stream processing system computing 1000s of complex metrics on millions of video sessions, requiring many dozens of nodes; and a Hadoop backend for offline analysis, generating daily and monthly reports with similar computation as the streaming system.

Case study: XYZ, Inc. Any company that wants to process live streaming data has this problem: twice the effort to implement any new function, twice the number of bugs to solve, twice the headache. Two processing stacks: a custom-built distributed stream processing system computing 1000s of complex metrics on millions of video sessions, requiring many dozens of nodes; and a Hadoop backend for offline analysis, generating daily and monthly reports with similar computation as the streaming system.

Requirements Scalable to large clusters Second-scale latencies Simple programming model Integrated with batch & interactive processing

Stateful Stream Processing Traditional streaming systems have an event-driven, record-at-a-time processing model: each node has mutable state, and for each record it updates that state and sends new records downstream. State is lost if a node dies! Making stateful stream processing fault-tolerant is challenging. (Diagram: input records flow through nodes 1-3, each holding mutable state.) Traditional streaming systems have what we call a "record-at-a-time" processing model. Each node in the cluster processing a stream has mutable state. As records arrive one at a time, the mutable state is updated, and a newly generated record is pushed to downstream nodes. Making this mutable state fault-tolerant is hard.

Existing Streaming Systems Storm – Replays a record if it is not processed by a node, so each record is processed at least once. But mutable state may be updated twice, and can still be lost due to failure! Trident – Uses transactions to update state, so each record is processed exactly once. But per-state transaction updates are slow.

Requirements Scalable to large clusters Second-scale latencies Simple programming model Integrated with batch & interactive processing Efficient fault-tolerance in stateful computations

Spark Streaming

Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs: chop up the live data stream into batches of X seconds, let Spark treat each batch of data as an RDD and process it using RDD operations, and finally return the processed results of the RDD operations in batches. (Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results.)

Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs. Batch sizes as low as ½ second, latency ~ 1 second. Potential for combining batch processing and streaming processing in the same system.
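As a concrete illustration of where the batch size comes from, here is a minimal sketch of creating a streaming context, assuming the Spark 0.7-era API described later in these slides (spark.streaming package; a constructor taking a master URL, an application name, and a batch duration):

import spark.streaming.{Seconds, StreamingContext}

// Every DStream created from this context is chopped into 1-second batches;
// the slides report batch sizes as low as half a second.
val ssc = new StreamingContext("local[2]", "BatchSizeExample", Seconds(1))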

Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of RDDs representing a stream of data. tweets is a DStream fed by the Twitter Streaming API; each batch (@ t, t+1, t+2, …) is stored in memory as an RDD (immutable, distributed).

Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
flatMap is a transformation: it modifies the data in one DStream to create another DStream. For every batch of the tweets DStream, a new RDD is created in the hashTags DStream ([#cat, #dog, …]).

Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
saveAsHadoopFiles is an output operation: it pushes data to external storage. Every batch of the hashTags DStream is saved to HDFS.
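Putting the three lines together, a minimal end-to-end sketch of Example 1 under the Spark 0.7-era API used in these slides looks roughly like the following. The getTags helper is hypothetical (the slides never define it) and is sketched as a simple whitespace split; args(0) and args(1) stand in for the <Twitter username> and <Twitter password> placeholders; and saveAsTextFiles is used because, at least in released versions of the API, saveAsHadoopFiles is defined for key-value DStreams, while a DStream of plain strings is written with saveAsTextFiles.

import spark.streaming.{Seconds, StreamingContext}
import twitter4j.Status

object TwitterHashTags {
  // Hypothetical helper: extract the hashtags from a tweet's text
  def getTags(status: Status): Seq[String] =
    status.getText.split(" ").filter(_.startsWith("#"))

  def main(args: Array[String]) {
    // 1-second batches; args(0)/args(1) are the Twitter username and password
    val ssc = new StreamingContext("local[2]", "TwitterHashTags", Seconds(1))

    val tweets = ssc.twitterStream(args(0), args(1))
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsTextFiles("hdfs://...")

    ssc.start()
  }
}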

Java Example
Scala:
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Java:
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
In Java, a function object is used to define the transformation.

Fault-tolerance RDDs remember the sequence of operations that created them from the original fault-tolerant input data. Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant. Data lost due to a worker failure can be recomputed from the input data: lost partitions of the hashTags RDD are recomputed on other workers from the replicated tweets RDD.

Key concepts DStream – a sequence of RDDs representing a stream of data (from Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actors, TCP sockets, …). Transformations – modify data from one DStream to another: standard RDD operations (map, countByValue, reduce, join, …) and stateful operations (window, countByValueAndWindow, …). Output operations – send data to an external entity: saveAsHadoopFiles saves to HDFS, foreach does anything with each batch of results.

Example 2 – Count the hashtags
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()
Within each batch, tweets flow through flatMap to hashTags, and then through map and reduceByKey to tagCounts ([(#cat, 10), (#dog, 25), …]).
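To make the foreach output operation listed under Key concepts concrete, here is a small hedged sketch (foreach was the name in this era of the API; later releases call it foreachRDD). It takes a function over each batch's RDD, here just printing a few of the tag counts on the driver.

// For every batch of tagCounts, pull a handful of (tag, count) pairs
// back to the driver and print them.
tagCounts.foreach(rdd => {
  val sample = rdd.take(10)
  println("Tag counts in this batch: " + sample.mkString(", "))
})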

Example 3 – Count the hashtags over the last 10 mins
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
window is a sliding window operation: Minutes(10) is the window length, Seconds(1) is the sliding interval.

Example 3 – Counting the hashtags over the last 10 mins
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
At every sliding interval, countByValue counts over all the data currently in the window of hashTags batches to produce tagCounts.

Smart window-based countByValue
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
Instead of recounting the whole window, countByValueAndWindow adds the counts from the new batch entering the window and subtracts the counts from the batch that just left it.

Smart window-based reduce The technique for incrementally computing counts generalizes to many reduce operations. It needs a function to "inverse reduce" ("subtract" for counting). Counting could have been implemented as:
hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), …)
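Spelled out, a hedged sketch of the incremental version: reduceByKeyAndWindow is defined on key-value DStreams, so each hashtag is first mapped to a (tag, 1) pair. The first function adds the counts of the batch entering the window, the second ("inverse reduce") subtracts the counts of the batch leaving it, and the window parameters match Example 3 (Minutes and Seconds imported from spark.streaming as before). The explicit (a: Int, b: Int) lambdas are equivalent to the slide's _ + _ and _ - _.

// Incremental windowed count: only the entering and leaving batches are touched.
val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,   // add counts from the new batch in the window
    (a: Int, b: Int) => a - b,   // subtract counts from the batch before the window
    Minutes(10),                 // window length
    Seconds(1))                  // sliding interval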

Demo

Fault-tolerant Stateful Processing All intermediate data are RDDs, hence they can be recomputed if lost (this applies to every batch of the hashTags and tagCounts DStreams).

Fault-tolerant Stateful Processing State data is not lost even if a worker node dies, and the value of your result does not change. Exactly-once semantics for all transformations – no double counting!

Other Interesting Operations Maintaining arbitrary state, e.g. tracking sessions: maintain per-user mood as state and update it with each user's tweets.
tweets.updateStateByKey(tweet => updateMood(tweet))
Do arbitrary Spark RDD computations within a DStream, e.g. join incoming tweets with a spam file to filter out bad tweets.
tweets.transform(tweetsRDD => { tweetsRDD.join(spamHDFSFile).filter(...) })
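A hedged sketch of what the mood-tracking one-liner expands to: updateStateByKey is defined on key-value DStreams, and its update function receives all new values for a key in the current batch together with that key's previous state. The Mood type, the scoring logic, and keying by the tweeting user's id are hypothetical details added for illustration; tweets is the DStream of twitter4j Status objects from Example 1.

// Hypothetical per-user mood, carried across batches as DStream state.
case class Mood(score: Double)

def updateMood(newTweets: Seq[Status], previous: Option[Mood]): Option[Mood] = {
  val mood = previous.getOrElse(Mood(0.0))
  // ... inspect the text of newTweets and adjust the score here ...
  Some(mood)
}

val moods = tweets
  .map(status => (status.getUser.getId, status))   // key each tweet by user id
  .updateStateByKey(updateMood _)                   // DStream[(Long, Mood)]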

Performance Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency Tested with 100 streams of data on 100 EC2 instances with 4 cores each

Comparison with Storm and S4 Higher throughput than Storm: Spark Streaming reaches 670k records/second/node, versus 115k records/second/node for Storm and 7.5k records/second/node for Apache S4. Spark Streaming offers similar speed while providing fault-tolerance and consistency guarantees that these systems lack.

Fast Fault Recovery Recovers from faults/stragglers within 1 sec

Real Applications: Conviva Real-time monitoring of video metadata Achieved 1-2 second latency Millions of video sessions processed Scales linearly with cluster size

Real Applications: Mobile Millennium Project Traffic transit time estimation using online machine learning on GPS observations, via Markov chain Monte Carlo simulations. Very CPU intensive: requires dozens of machines for useful computation. Scales linearly with cluster size.

Vision - one stack to rule them all: ad-hoc queries, batch processing, and stream processing on Spark + Shark + Spark Streaming.

Spark program vs Spark Streaming program
Spark Streaming program on a Twitter stream:
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Spark program on a Twitter log file:
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")

Vision - one stack to rule them all
Explore data interactively using the Spark shell / PySpark to identify problems, use the same code in Spark stand-alone programs to find problems in production logs, and use similar code in Spark Streaming to find problems in live log streams.

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
scala> val mapped = file.map(...)

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = file.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = stream.map(...)
    ...
  }
}

Vision - one stack to rule them all: ad-hoc queries, batch processing, and stream processing (Spark + Shark + Spark Streaming) all share the code pattern shown above.

Alpha Release with Spark 0.7 Integrated with Spark 0.7: import spark.streaming to get all the functionality. Both Java and Scala APIs. Give it a spin! Run it locally or on a cluster. Try it out in the hands-on tutorial later today.

Summary A stream processing framework that is scalable to large clusters, achieves second-scale latencies, has a simple programming model, integrates with batch & interactive workloads, and ensures efficient fault-tolerance in stateful computations. For more information, check out our paper: http://tinyurl.com/dstreams