Spark Streaming Large-scale near-real-time stream processing UC BERKELEY Tathagata Das (TD)

1 Spark Streaming Large-scale near-real-time stream processing UC BERKELEY Tathagata Das (TD)

2 Motivation  Many important applications must process large data streams at second-scale latencies – Check-ins, status updates, site statistics, spam filtering, …  They require large clusters to handle the workload  They require latencies of a few seconds

3 Case study: Conviva, Inc.  Real-time monitoring of online video metadata  Custom-built distributed streaming system – 1000s of complex metrics on millions of video sessions – Requires many dozens of nodes for processing  Hadoop backend for offline analysis – Generating daily and monthly reports – Similar computations to the streaming system  Painful to maintain two stacks

4 Goals  Framework for large-scale stream processing  Scalable to large clusters (~100 nodes) with near-real-time latency (~1 second)  Efficiently recovers from faults and stragglers  Simple programming model that integrates well with batch & interactive queries  Existing systems do not achieve all of these

5 Existing Streaming Systems  Record-at-a-time processing model – Each node has mutable state – For each record, update state & send new records  [Diagram: input records pushed through nodes 1–3, each holding mutable state]

6 Existing Streaming Systems  Storm – Replays records if not processed due to failure – Processes each record at least once – May update mutable state twice! – Mutable state can be lost due to failure!  Trident – Uses transactions to update state – Processes each record exactly once – Per-state transaction updates are slow  Neither integrates with batch processing, and neither can handle stragglers

7 Spark Streaming

8 Discretized Stream Processing  Run a streaming computation as a series of very small, deterministic batch jobs  Batch processing models, like MapReduce, recover from faults and stragglers efficiently – Divide the job into deterministic tasks – Rerun failed/slow tasks in parallel on other nodes  The same recovery techniques apply at lower time scales
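The discretization idea above can be sketched in a few lines of plain Scala (no Spark; the names `Record`, `discretize`, and `batchJob` are illustrative, not part of any API): records are grouped into fixed-length time intervals, and the same pure, deterministic function runs over each micro-batch, so a failed batch can simply be re-run.

```scala
// Hypothetical sketch: a stream as a series of small deterministic batch jobs.
case class Record(timeMs: Long, value: String)

// Split incoming records into fixed-size time intervals (micro-batches).
def discretize(records: Seq[Record], batchMs: Long): Map[Long, Seq[Record]] =
  records.groupBy(r => r.timeMs / batchMs)

// One deterministic "batch job": count occurrences of each value.
def batchJob(batch: Seq[Record]): Map[String, Int] =
  batch.groupBy(_.value).map { case (v, rs) => (v, rs.size) }

val stream  = Seq(Record(100, "a"), Record(300, "b"), Record(600, "a"))
val results = discretize(stream, batchMs = 500).map { case (t, b) => (t, batchJob(b)) }
// Batch 0 covers 0-500ms, batch 1 covers 500-1000ms.
```

Because `batchJob` is deterministic, re-running it on another node after a failure yields the same output, which is the property the recovery argument relies on.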

9 Spark Streaming  State between batches is kept in memory as an immutable, fault-tolerant dataset – Specifically, as Spark’s Resilient Distributed Dataset (RDD)  Batch sizes can be reduced to as low as 1/2 second to achieve ~1 second latency  Potentially combine streaming and batch workloads to build a single unified stack

10 Discretized Stream Processing  [Diagram: the input stream is divided into intervals (time = 0-1, time = 1-2, …); each interval becomes an immutable distributed dataset, replicated in memory and stored as an RDD; batch operations transform it into a state / output stream, itself a series of immutable distributed datasets]

11 Fault Recovery  State is stored as Resilient Distributed Datasets (RDDs) – Deterministically re-computable parallel collections – Each remembers the lineage of operations used to create it  Fault / straggler recovery is done in parallel on other nodes  [Diagram: an operation derives the state RDD (not replicated) from the input dataset (replicated and fault-tolerant)]  Fast recovery from faults without full data replication
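The lineage idea on this slide can be illustrated with a hypothetical plain-Scala sketch (the types `Lineage`, `Source`, and `Mapped` are invented for illustration, not Spark's actual classes): a derived dataset stores the operation and parent used to create it, so a lost copy can be recomputed from the replicated input instead of being replicated itself.

```scala
// Hypothetical sketch of lineage-based recovery.
sealed trait Lineage[A] { def compute(): Seq[A] }

// The input data is replicated and fault-tolerant, so it is always available.
case class Source[A](data: Seq[A]) extends Lineage[A] {
  def compute(): Seq[A] = data
}

// A derived dataset remembers its parent and the deterministic function
// applied to it; on failure, any node can re-run the function.
case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
  def compute(): Seq[B] = parent.compute().map(f)
}

val input = Source(Seq(1, 2, 3))
val state = Mapped(input, (x: Int) => x * 10)
// If the node holding `state` fails, recompute it from lineage:
val recovered = state.compute()
```

Because recomputation only needs the lineage and the replicated input, no full replication of the derived state is required, which is exactly the slide's claim.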

12 Programming Model  A Discretized Stream, or DStream, is a series of RDDs representing a stream of data – API very similar to RDDs  DStreams can be created… – Either from live streaming data – Or by transforming other DStreams

13 DStream Data Sources  Many sources out of the box – HDFS – Kafka – Flume – Twitter – TCP sockets – Akka actors – ZeroMQ (several contributed by external developers)  Easy to add your own

14 Transformations  Build new streams from existing streams – RDD-like operations: map, flatMap, filter, count, reduce, groupByKey, reduceByKey, sortByKey, join, etc. – New window and stateful operations: window, countByWindow, reduceByWindow, countByValueAndWindow, reduceByKeyAndWindow, updateStateByKey, etc.

15 Output Operations  Send data to the outside world – saveAsHadoopFiles – print – prints on the driver’s screen – foreach – arbitrary operation on every RDD

16 Example  Process a stream of Tweets to find the 20 most popular hashtags in the last 10 minutes 1. Get the stream of Tweets and isolate the hashtags 2. Count the hashtags over a 10-minute window 3. Sort the hashtags by their counts 4. Get the top 20 hashtags

17 1. Get the stream of hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
[Diagram: the flatMap transformation maps the tweets DStream (one RDD per interval, t-1 … t+4) to the hashTags DStream]

18 2. Count the hashtags over 10 min
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1))
                        .map(tag => (tag, 1))
                        .reduceByKey(_ + _)
[Diagram: a sliding window operation over the hashTags DStream produces the tagCounts DStream]

19 2. Count the hashtags over 10 min
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
[Diagram: the window count is updated incrementally – counts from batches entering the window are added, counts from batches leaving it are subtracted]

20 Smart window-based reduce  The technique used for count generalizes to reduce – Needs a function to “subtract” – Applies to any invertible reduce function  Counting could have been implemented as: hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), …)
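The incremental trick above can be demonstrated in plain Scala, independent of Spark (the function `slidingCounts` is illustrative): rather than re-summing the whole window at every step, add the count of the batch entering the window and subtract the count of the batch falling out, which works because + has the inverse -.

```scala
// Hypothetical sketch of the "smart" sliding-window reduce for counts.
// batchCounts(i) is the count in the i-th micro-batch; the result gives
// the total over each window of `window` consecutive batches.
def slidingCounts(batchCounts: Seq[Int], window: Int): Seq[Int] = {
  val first = batchCounts.take(window).sum           // full sum, once
  batchCounts.drop(window).zip(batchCounts).scanLeft(first) {
    // entering = batch sliding into the window, leaving = batch sliding out
    case (acc, (entering, leaving)) => acc + entering - leaving
  }
}

val perBatch = Seq(3, 1, 4, 1, 5)         // count in each 1-second batch
val windowed = slidingCounts(perBatch, 3) // totals over the last 3 batches
// windowed == Seq(8, 6, 10): 3+1+4, then 8+1-3, then 6+5-1
```

Each step costs O(1) instead of O(window), which is why the slide restricts this optimization to invertible reduce functions: a non-invertible function (e.g. max) has no "subtract".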

21 3. Sort the hashtags by their counts
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(1), Seconds(1))
val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                          .transform(_.sortByKey(false))
transform allows arbitrary RDD operations to create a new DStream

22 4. Get the top 20 hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(1), Seconds(1))
val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                          .transform(_.sortByKey(false))
sortedTags.foreach(showTopTags(20) _)
foreach is an output operation

23 10 popular hashtags in the last 10 min
// Create the stream of tweets
val tweets = ssc.twitterStream(<username>, <password>)
// Count the tags over a 10 minute window
val tagCounts = tweets.flatMap(status => getTags(status))
                      .countByValueAndWindow(Minutes(10), Seconds(1))
// Sort the tags by counts
val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
                          .transform(_.sortByKey(false))
// Show the top 10 tags
sortedTags.foreach(showTopTags(10) _)
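For intuition, the same pipeline can be sketched over a single in-memory window of tweets in plain Scala, with no Spark involved (the functions `getTags` and `topTags` here are illustrative stand-ins, assuming hashtags are whitespace-separated tokens starting with '#'):

```scala
// Hypothetical single-window analogue of the slide's pipeline:
// extract hashtags, count them, sort descending, take the top n.
def getTags(status: String): Seq[String] =
  status.split("\\s+").toSeq.filter(_.startsWith("#"))

def topTags(tweets: Seq[String], n: Int): Seq[(String, Int)] =
  tweets.flatMap(getTags)
        .groupBy(identity)
        .toSeq
        .map { case (tag, occurrences) => (tag, occurrences.size) }
        .sortBy { case (_, count) => -count }     // highest count first
        .take(n)

val windowTweets = Seq("#spark is #fast", "love #spark", "#streaming demo")
val top = topTags(windowTweets, 2)   // "#spark" appears twice, so it leads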

24 Demo

25 Other Operations  Maintaining arbitrary state, tracking sessions: tweets.updateStateByKey(tweet => updateMood(tweet))  Selecting data directly from a DStream: tagCounts.slice(<fromTime>, <toTime>).sortByKey()  [Diagram: updateStateByKey maps the tweets DStream to a per-user mood state stream]
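The updateStateByKey idea – folding each new batch into per-key state kept between batches – can be sketched in plain Scala (the function `updateState` is illustrative, not Spark's API):

```scala
// Hypothetical sketch: carry per-key state across micro-batches by
// folding each batch of (key, value) pairs into the previous state map.
def updateState(state: Map[String, Int],
                batch: Seq[(String, Int)]): Map[String, Int] =
  batch.foldLeft(state) { case (s, (key, value)) =>
    s.updated(key, s.getOrElse(key, 0) + value)
  }

// Two successive batches: the state from batch 1 feeds into batch 2.
val s1 = updateState(Map.empty, Seq(("alice", 1), ("bob", 2)))
val s2 = updateState(s1, Seq(("alice", 3)))
// s2 accumulates across batches: alice -> 4, bob -> 2
```

In the real system the state lives in an RDD (immutable and recoverable via lineage) rather than a mutable map, which is what distinguishes this model from the record-at-a-time systems criticized earlier.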

26 Performance  Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency

27 Comparison with others  Higher throughput than Storm – Spark Streaming: 670k records/second/node – Storm: 115k records/second/node – Apache S4: 7.5k records/second/node

28 Fast Fault Recovery  Recovers from faults/stragglers within 1 sec

29 Real Applications: Conviva  Real-time monitoring of video metadata  Ported parts of Conviva’s Hadoop stack to run on Spark Streaming  Implemented Shadoop – a wrapper for Hadoop jobs to run over Spark / Spark Streaming:
val shJob = new SparkHadoopJob[…]( )
shJob.run( )

30 Real Applications: Conviva  Real-time monitoring of video metadata – Achieved 1-2 second latency – Millions of video sessions processed – Scales linearly with cluster size

31 Real Applications: Mobile Millennium Project  Traffic estimation using online machine learning – Markov chain Monte Carlo simulations on GPS observations – Very CPU intensive, requires 10s of machines for useful computation – Scales linearly with cluster size

32 Failure Semantics  Input data is replicated by the system  The lineage of deterministic operations is used to recompute RDDs from the input data if a worker node fails  Transformations – exactly once  Output operations – at least once

33 Java API for Streaming  Developed by Patrick Wendell  Similar to the Spark Java API  You don’t need to know Scala to try streaming!

34 Contributors  5 contributors from UCB, 3 external contributors (marked *) – Matei Zaharia, Haoyuan Li – Patrick Wendell – Denny Britz – Sean McNamara* – Prashant Sharma* – Nick Pentreath* – Tathagata Das

35 Vision - one stack to rule them all  [Diagram: Ad-hoc Queries, Batch Processing, and Stream Processing unified by Spark + Spark Streaming]

36

37 Conclusion  Alpha to be released with Spark 0.7 by the weekend  Look at the new Streaming Programming Guide  More about the Spark Streaming system in our paper: http://tinyurl.com/dstreams  Join us at Strata on Feb 26 in Santa Clara

