1
Streaming data processing using Spark
Spark streaming
2
Agenda
- What is Spark Streaming?
- Why Spark Streaming?
- How Spark Streaming works
- A Twitter example
- Spark Streaming alternatives
- Q & A
- Demo
3
Motivation
Many important applications must process large streams of live data and provide results in near real time:
- Social network trends
- Website statistics
- Intrusion detection systems
- etc.
These applications require large clusters to handle their workloads and latencies of a few seconds.
4
What is Spark Streaming?
Framework for large-scale stream processing
- Scales to 100s of nodes
- Can achieve second-scale latencies
- Integrates with Spark’s batch and interactive processing
- Provides a simple batch-like API
- Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
5
What is Spark Streaming?
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput processing of live data streams. Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets.
6
How Spark streaming works
Spark Streaming receives live input data streams and divides the data into batches. Each batch is processed by the Spark engine to generate the final stream of results.
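The batching step described above can be modelled without Spark at all. The following is a minimal plain-Scala sketch, not Spark's implementation: the `Record` type, the `discretize` function, and the millisecond timestamps are all assumptions introduced for illustration. It cuts a sequence of timestamped records into consecutive batches of a fixed interval.

```scala
// Plain-Scala model of micro-batching; no Spark dependency.
// All names here are hypothetical and exist only for illustration.
object MicroBatchSketch {
  // A record tagged with its arrival time in milliseconds.
  final case class Record(arrivalMs: Long, value: String)

  // Group records into consecutive batches of `batchIntervalMs` each.
  // Returns (batchIndex, values) pairs ordered by batch index.
  // Intervals that received no records are simply absent from the result.
  def discretize(records: Seq[Record], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    records
      .groupBy(r => r.arrivalMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (idx, rs) => (idx, rs.sortBy(_.arrivalMs).map(_.value)) }

  def main(args: Array[String]): Unit = {
    val input = Seq(
      Record(100, "a"), Record(900, "b"), Record(1200, "c"), Record(2500, "d")
    )
    // With a 1-second interval, "a" and "b" share batch 0, "c" is batch 1, "d" is batch 2.
    discretize(input, 1000L).foreach { case (idx, values) =>
      println(s"batch $idx: ${values.mkString(", ")}")
    }
  }
}
```

In this model the batch interval is the only tuning knob: a shorter interval lowers latency but produces more, smaller batches.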
7
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
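To make "a series of very small, deterministic batch jobs" concrete, here is a hedged plain-Scala sketch in which ordinary collections stand in for the batches. The names `wordCount` and `runStream` are hypothetical, and the word-count job is just one possible batch job, chosen for illustration.

```scala
// Plain-Scala model: a streaming computation as one small batch job per batch.
// Names are hypothetical; no Spark dependency is assumed.
object BatchJobsSketch {
  // A deterministic batch job: count occurrences of each word in one batch.
  def wordCount(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // Run the same job independently on every batch, yielding one result per batch.
  def runStream(batches: Seq[Seq[String]]): Seq[Map[String, Int]] =
    batches.map(wordCount)

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("spark streaming", "spark"), Seq("storm"))
    runStream(batches).zipWithIndex.foreach { case (counts, i) =>
      println(s"batch $i -> $counts")
    }
  }
}
```

Because `wordCount` depends only on its input batch, running it again on the same batch always gives the same result, which is the sense of "deterministic" in this sketch.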
8
Example – Get hashtags from Twitter

    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")

Output operation: pushes data to external storage.

[Diagram: at each batch time (t, t+1, t+2), flatMap turns the current batch of the tweets DStream into a batch of the hashTags DStream, which is then saved.]

Every batch is saved to HDFS. An output operation could equally write to a database, update an analytics UI, or do whatever else is appropriate.
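The slide does not show `getTags`. One possible shape for it is sketched below in plain Scala, under the assumption that a status can be reduced to its text; the regex-based extraction and the `hashTagsPerBatch` helper are illustrative choices, not the presenter's code.

```scala
// Hypothetical stand-in for the getTags helper used in the example.
// A status is modelled as its plain text; no Twitter or Spark types are assumed.
object HashtagSketch {
  // '#' followed by one or more word characters (ASCII letters, digits, underscore).
  // Hashtags in other scripts are not matched by this simple pattern.
  private val Tag = "#\\w+".r

  // Extract every hashtag from one status text, in order of appearance.
  def getTags(statusText: String): Seq[String] =
    Tag.findAllIn(statusText).toList

  // Model of flatMap on a stream: apply getTags to every status in every batch.
  def hashTagsPerBatch(batches: Seq[Seq[String]]): Seq[Seq[String]] =
    batches.map(batch => batch.flatMap(getTags))

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq("Learning #spark and #bigdata!", "no tags here"),
      Seq("#streaming is fun")
    )
    hashTagsPerBatch(batches).foreach(tags => println(tags.mkString(" ")))
  }
}
```

Using `flatMap` rather than `map` matters here: each status can yield zero, one, or many hashtags, and `flatMap` flattens them into a single collection per batch.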
9
Example (Scala and Java)

Scala

    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")

Java

    JavaDStream<Status> tweets = ssc.twitterStream();
    JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { });
10
Combinations of Batch and Streaming Computations
Combine a batch RDD with a streaming DStream.

Example: join incoming tweets with a spam HDFS file to filter out bad tweets.

    tweets.transform(tweetsRDD => {
      tweetsRDD.join(spamHDFSFile).filter(...)
    })
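The join-then-filter pattern can be sketched with plain Scala collections. Everything below is an assumption made for illustration: tweets are keyed by user name, the spam file is modelled as a map from user name to a spam score, and the elided filter condition is replaced by a hypothetical score threshold.

```scala
// Plain-Scala model of joining one batch of tweets with static spam data.
// The keying by user, the spam scores, and the threshold are all hypothetical.
object SpamJoinSketch {
  // Inner join on key: keep only keys present on both sides.
  def join[K, A, B](left: Seq[(K, A)], right: Map[K, B]): Seq[(K, (A, B))] =
    left.flatMap { case (k, a) => right.get(k).map(b => (k, (a, b))).toList }

  // Join the batch with the spam scores, then keep tweets scoring below the threshold.
  def filterSpam(
      tweets: Seq[(String, String)],
      spamScores: Map[String, Double],
      threshold: Double
  ): Seq[(String, String)] =
    join(tweets, spamScores)
      .filter { case (_, (_, score)) => score < threshold }
      .map { case (user, (text, _)) => (user, text) }

  def main(args: Array[String]): Unit = {
    val tweets = Seq(("alice", "hi"), ("bob", "buy now"), ("carol", "hello"))
    val scores = Map("alice" -> 0.1, "bob" -> 0.9)
    filterSpam(tweets, scores, 0.5).foreach(println)
  }
}
```

Note that an inner join also drops tweets from users with no entry in the spam data ("carol" above); if those tweets should be kept, a left outer join would be the closer fit.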
11
Similar components: higher throughput than Storm
- Spark Streaming: 670k records/second/node
- Storm: 115k records/second/node
- Apache S4: 7.5k records/second/node
12
Thank You…