1
Spark Streaming
Large-scale near-real-time stream processing
UC Berkeley
Tathagata Das (TD)
2
Motivation
Many important applications must process large data streams at second-scale latencies
– Check-ins, status updates, site statistics, spam filtering, …
– Require large clusters to handle the workload
– Require latencies of a few seconds
3
Case study: Conviva, Inc.
Real-time monitoring of online video metadata
Custom-built distributed streaming system
– 1000s of complex metrics on millions of video sessions
– Requires many dozens of nodes for processing
Hadoop backend for offline analysis
– Generating daily and monthly reports
– Similar computation as the streaming system
Painful to maintain two stacks
4
Goals
Framework for large-scale stream processing
– Scalable to large clusters (~100 nodes) with near-real-time latency (~1 second)
– Efficiently recovers from faults and stragglers
– Simple programming model that integrates well with batch & interactive queries
Existing systems do not achieve all of these goals.
5
Existing Streaming Systems
Record-at-a-time processing model
– Each node has mutable state
– For each record, update state & send new records
[Diagram: input records pushed through nodes 1–3, each holding mutable state]
6
Existing Streaming Systems
Storm
– Replays records if not processed due to failure
– Processes each record at least once
– May update mutable state twice!
– Mutable state can be lost due to failure!
Trident
– Uses transactions to update state
– Processes each record exactly once
– Per-state transaction updates are slow
Neither integrates with batch processing, and neither can handle stragglers.
7
Spark Streaming
8
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
Batch processing models, like MapReduce, recover from faults and stragglers efficiently
– Divide the job into deterministic tasks
– Rerun failed/slow tasks in parallel on other nodes
Apply the same recovery techniques at lower time scales
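The idea can be sketched in plain Scala (a hypothetical illustration, not the Spark API): a stream is cut into small batches, and each batch is processed by a deterministic, pure function, so a failed or slow batch can simply be re-run.

```scala
// Hypothetical sketch: a stream as a series of small, deterministic batch jobs.
case class Batch(time: Int, records: Seq[String])

// A deterministic batch job: word counts for one batch.
def processBatch(b: Batch): Map[String, Int] =
  b.records.groupBy(identity).view.mapValues(_.size).toMap

val stream = Seq(
  Batch(0, Seq("a", "b", "a")),
  Batch(1, Seq("b", "c"))
)

// Each batch is processed independently; re-running one yields the same result.
val results = stream.map(processBatch)
```

Because `processBatch` depends only on its input, rerunning it after a failure reproduces the lost output exactly, which is the recovery property the slide describes.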
9
Spark Streaming
State between batches is kept in memory as an immutable, fault-tolerant dataset
– Specifically, as Spark’s Resilient Distributed Dataset
Batch sizes can be reduced to as low as 1/2 second to achieve ~1 second latency
Potentially combine streaming and batch workloads to build a single unified stack
10
Discretized Stream Processing
[Diagram: for time = 0–1, time = 1–2, …, the input stream is divided into immutable distributed datasets (replicated in memory); batch operations on each produce a state/output stream of immutable distributed datasets, stored in memory as RDDs]
11
Fault Recovery
State stored as a Resilient Distributed Dataset (RDD)
– Deterministically re-computable parallel collection
– Remembers the lineage of operations used to create it
Fault / straggler recovery is done in parallel on other nodes
[Diagram: an operation turns the replicated, fault-tolerant input dataset into a state RDD, which is not replicated]
Fast recovery from faults without full data replication
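Lineage-based recovery can be sketched as follows (a toy model, not Spark's implementation): rather than replicating the state, remember the deterministic operations that produced it and replay them from the replicated input when a copy is lost.

```scala
// Hypothetical sketch of lineage-based recovery: the dataset records the
// function (its lineage) that derives it from the fault-tolerant input.
case class LineageRDD[A, B](input: Seq[A], lineage: Seq[A] => Seq[B]) {
  def compute: Seq[B] = lineage(input) // deterministic replay from the input
}

val input = Seq(1, 2, 3, 4) // replicated, fault-tolerant input
val state = LineageRDD(input, (xs: Seq[Int]) => xs.map(_ * 2).filter(_ > 4))

// If the in-memory state is lost, recomputing from lineage restores it exactly.
val recovered = state.compute
```

Since the lineage is deterministic, lost partitions can be recomputed in parallel on other nodes without ever having replicated the state itself.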
12
Programming Model
A Discretized Stream, or DStream, is a series of RDDs representing a stream of data
– API very similar to RDDs
DStreams can be created…
– Either from live streaming data
– Or by transforming other DStreams
13
DStream Data Sources
Many sources out of the box (several contributed by external developers)
– HDFS
– Kafka
– Flume
– Twitter
– TCP sockets
– Akka actors
– ZeroMQ
Easy to add your own
14
Transformations
Build new streams from existing streams
– RDD-like operations: map, flatMap, filter, count, reduce, groupByKey, reduceByKey, sortByKey, join, etc.
– New window and stateful operations: window, countByWindow, reduceByWindow, countByValueAndWindow, reduceByKeyAndWindow, updateStateByKey, etc.
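What the RDD-like operations compute can be sketched against plain Scala collections (the names mirror the Spark operations, but this is an illustration, not the Spark API):

```scala
// Sketch of RDD-like transformations on an ordinary Scala collection.
val words = Seq("spark", "storm", "spark", "flume")

val mapped   = words.map(w => (w, 1))          // like map: pair each word with 1
val filtered = words.filter(_.startsWith("s")) // like filter

// Like reduceByKey(_ + _): group by key, then sum the values per key.
val counts = mapped.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
```

On a DStream these same operations run once per batch, producing a new DStream of transformed RDDs.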
15
Output Operations
Send data to the outside world
– saveAsHadoopFiles
– print – prints on the driver’s screen
– foreach – arbitrary operation on every RDD
16
Example
Process a stream of Tweets to find the 20 most popular hashtags in the last 10 minutes
1. Get the stream of Tweets and isolate the hashtags
2. Count the hashtags over a 10-minute window
3. Sort the hashtags by their counts
4. Get the top 20 hashtags
17
1. Get the stream of hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
[Diagram: the flatMap transformation maps each RDD of the tweets DStream (t-1, t, t+1, …) to an RDD of the hashTags DStream]
18
2. Count the hashtags over 10 min
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).map(tag => (tag, 1)).reduceByKey(_ + _)
[Diagram: a sliding window operation over the hashTags DStream produces the tagCounts DStream]
19
2. Count the hashtags over 10 min (alternative)
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
[Diagram: as the window slides, the new tagCounts batch is computed by adding the entering batch and subtracting the leaving one]
20
Smart window-based reduce
The technique used for count generalizes to reduce
– Needs a function to “subtract” the batch leaving the window
– Applies to invertible reduce functions
Counting could have been implemented as:
hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), …)
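The incremental window trick can be checked with a small sketch in plain Scala: with an invertible reduce (here + and -), each new window total is the previous total plus the batch entering the window minus the batch leaving it, so each slide costs O(1) instead of re-reducing the whole window.

```scala
// Sketch of the invertible-reduce sliding window over per-batch counts.
val batchSums = Seq(3, 1, 4, 1, 5) // counts for five consecutive batches
val windowLen = 3

// Naive: re-sum every window from scratch.
val naive = batchSums.sliding(windowLen).map(_.sum).toSeq

// Incremental: start from the first window's sum, then, at each slide,
// add the entering batch and subtract the leaving one.
val incremental = batchSums.drop(windowLen).zip(batchSums)
  .scanLeft(batchSums.take(windowLen).sum) {
    case (total, (entering, leaving)) => total + entering - leaving
  }
```

Both give the same window totals; the incremental version is what makes long windows with short slide intervals cheap.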
21
3. Sort the hashtags by their counts
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }.transform(_.sortByKey(false))
transform allows arbitrary RDD operations to create a new DStream
22
4. Get the top 20 hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }.transform(_.sortByKey(false))
sortedTags.foreach(showTopTags(20) _)
foreach is an output operation
23
10 popular hashtags in last 10 min
// Create the stream of tweets
val tweets = ssc.twitterStream(<username>, <password>)
// Count the tags over a 10 minute window
val tagCounts = tweets.flatMap(status => getTags(status)).countByValueAndWindow(Minutes(10), Seconds(1))
// Sort the tags by counts
val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }.transform(_.sortByKey(false))
// Show the top 10 tags
sortedTags.foreach(showTopTags(10) _)
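The whole pipeline over one window's worth of tweets can be simulated in plain Scala (a sketch, not the Spark API; `getTags` here is a hypothetical stand-in for the helper assumed on the slides):

```scala
// Extract hashtags, count them, sort by count descending, take the top N.
def getTags(status: String): Seq[String] =
  status.split("\\s+").toSeq.filter(_.startsWith("#"))

val tweets = Seq("#spark is fast", "#spark #streaming demo", "no tags here")

val topTags = tweets
  .flatMap(getTags)                                       // isolate hashtags
  .groupBy(identity).map { case (t, ts) => (t, ts.size) } // count per tag
  .toSeq.sortBy { case (_, n) => -n }                     // sort descending
  .take(10)                                               // top 10
```

Spark Streaming runs this same logic continuously, recomputing the ranking once per slide interval over the sliding window.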
24
Demo
25
Other Operations
Maintaining arbitrary state, tracking sessions:
tweets.updateStateByKey(tweet => updateMood(tweet))
Selecting data directly from a DStream:
tagCounts.slice(<fromTime>, <toTime>).sortByKey()
[Diagram: the tweets DStream (t-1, t, t+1, …) updates a per-user mood state]
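Stateful processing in the updateStateByKey style can be sketched as a fold of each micro-batch into a running per-key state (plain Scala illustration; the slide's `updateMood` would play the role of the update logic):

```scala
// Fold one batch of (key, value) pairs into the accumulated per-key state.
def updateState(state: Map[String, Int],
                batch: Seq[(String, Int)]): Map[String, Int] =
  batch.foldLeft(state) { case (s, (k, v)) =>
    s.updated(k, s.getOrElse(k, 0) + v)
  }

val batches = Seq(
  Seq(("alice", 1)),
  Seq(("alice", 2), ("bob", 1))
)

// Applying the update once per micro-batch yields the session state.
val finalState = batches.foldLeft(Map.empty[String, Int])(updateState)
```

In Spark Streaming the state itself is an RDD, so it inherits the same lineage-based fault tolerance as every other dataset.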
26
Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency
27
Comparison with others
Higher throughput than Storm
– Spark Streaming: 670k records/second/node
– Storm: 115k records/second/node
– Apache S4: 7.5k records/second/node
28
Fast Fault Recovery
Recovers from faults/stragglers within 1 sec
29
Real Applications: Conviva
Real-time monitoring of video metadata
– Implemented Shadoop, a wrapper for Hadoop jobs to run over Spark / Spark Streaming
– Ported parts of Conviva’s Hadoop stack to run on Spark Streaming
val shJob = new SparkHadoopJob[…]( )
shJob.run( )
30
Real Applications: Conviva
Real-time monitoring of video metadata
– Achieved 1-2 second latency
– Millions of video sessions processed
– Scales linearly with cluster size
31
Real Applications: Mobile Millennium Project
Traffic estimation using online machine learning
– Markov chain Monte Carlo simulations on GPS observations
– Very CPU intensive, requires 10s of machines for useful computation
– Scales linearly with cluster size
32
Failure Semantics
Input data is replicated by the system
The lineage of deterministic operations is used to recompute RDDs from the input data if a worker node fails
Transformations – exactly once
Output operations – at least once
33
Java API for Streaming
Developed by Patrick Wendell
Similar to the Spark Java API
Don’t need to know Scala to try streaming!
34
Contributors
5 contributors from UC Berkeley, 3 external contributors (*)
– Matei Zaharia, Haoyuan Li
– Patrick Wendell
– Denny Britz
– Sean McNamara*
– Prashant Sharma*
– Nick Pentreath*
– Tathagata Das
35
Vision - one stack to rule them all
Ad-hoc queries, batch processing, and stream processing on a single stack: Spark + Spark Streaming
37
Conclusion
Alpha to be released with Spark 0.7 by the weekend
Look at the new Streaming Programming Guide
More about the Spark Streaming system in our paper: http://tinyurl.com/dstreams
Join us at Strata on Feb 26 in Santa Clara