Streaming data processing using Spark

Spark streaming

Agenda
- What is Spark streaming?
- Why Spark streaming?
- How Spark streaming works
- A Twitter example
- Spark streaming alternatives
- Q & A
- Demo

Motivation
Many important applications must process large streams of live data and provide results in near real time:
- Social network trends
- Website statistics
- Intrusion detection systems
- etc.
These applications require large clusters to handle their workloads and latencies of a few seconds.

What is Spark Streaming?
- Framework for large-scale stream processing
- Scales to 100s of nodes
- Can achieve second-scale latencies
- Integrates with Spark's batch and interactive processing
- Provides a simple batch-like API
- Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

What is Spark streaming?
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput stream processing of live data streams. Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets.

How Spark streaming works
Spark Streaming receives live input data streams and divides the data into batches. Each batch is processed by the Spark engine to generate the final stream of results.
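As a rough illustration of the batching described above, the following sketch reads lines from a TCP socket and counts words once per batch. It assumes Spark Streaming is on the classpath; the host `localhost`, the port `9999`, the 1-second batch interval, and the names `BatchIntervalSketch` and `tokenize` are placeholders chosen for this example, not part of the original slides.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalSketch {
  // Hypothetical helper: split one input line into words.
  def tokenize(line: String): Array[String] = line.split(" ")

  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchIntervalSketch")

    // The second argument is the batch interval: incoming data is cut
    // into one batch per second (an assumed value for this sketch).
    val ssc = new StreamingContext(conf, Seconds(1))

    // Live input stream; host and port are placeholders.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each batch is processed by the Spark engine like a small batch job.
    val counts = lines.flatMap(tokenize(_)).map(word => (word, 1)).reduceByKey(_ + _)

    // Output operation: print the first few results of every batch.
    counts.print()

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until the computation is stopped
  }
}
```

To try it, something like `nc -lk 9999` in another terminal can supply input lines; the counts printed each second cover only the lines that arrived in that batch.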

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
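The idea can be modelled without Spark at all. The sketch below is a plain-Scala toy, not Spark's implementation: it assigns timestamped events to fixed-width batches and runs the same deterministic function on each batch. `Event`, `MicroBatchSketch`, `discretize`, and the whitespace-based hashtag rule are all assumptions made for illustration.

```scala
// Plain-Scala model of micro-batching; no Spark required.
final case class Event(timestampMs: Long, text: String)

object MicroBatchSketch {
  // Assign each event to a batch by integer-dividing its timestamp
  // by the batch interval, then order the batches by batch id.
  def discretize(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(event => event.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy { case (batchId, _) => batchId }

  // A deterministic per-batch job: extract hashtags from every event.
  def hashTags(batch: Seq[Event]): Seq[String] =
    batch.flatMap(event => event.text.split("\\s+").filter(word => word.startsWith("#")).toSeq)

  // Run the same job on every batch, yielding one result per batch.
  def run(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    discretize(events, batchIntervalMs).map { case (batchId, batch) =>
      (batchId, hashTags(batch))
    }
}
```

Because each batch job is a pure function of its input, re-running a batch on the same events gives the same result.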

Example – Get hashtags from Twitter

    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")

Output operation: pushes data to external storage.
[Figure: at each batch interval (t, t+1, t+2), flatMap transforms the current batch of the tweets DStream into the corresponding batch of the hashTags DStream, which is then saved.]
Every batch is saved to HDFS; an output operation could also write to a database, update an analytics UI, or do whatever is appropriate.

Example (Scala and Java)
Scala

    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")

Java

    JavaDStream<Status> tweets = ssc.twitterStream();
    JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { });

Combinations of Batch and Streaming Computations
Combine a batch RDD with a streaming DStream.
Example: join incoming tweets with a spam file stored in HDFS to filter out bad tweets.

    tweets.transform(tweetsRDD => {
      tweetsRDD.join(spamHDFSFile).filter(...)
    })
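A fuller version of this pattern might look like the sketch below. It is an illustration under several assumptions rather than the presenter's code: tweets arrive on a socket as `user<TAB>text` lines instead of through the Twitter source, the spam file lists one user name per line and its path is passed as the first program argument, and `SpamFilterSketch` and `parseTweet` are hypothetical names. It uses a left outer join so that tweets from users absent from the spam list are kept.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SpamFilterSketch {
  // Hypothetical input format: "user<TAB>text".
  def parseTweet(line: String): (String, String) = {
    val parts = line.split("\t", 2)
    (parts(0), if (parts.length > 1) parts(1) else "")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SpamFilterSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Batch side: an ordinary RDD keyed by user name.
    // args(0) is the path of the spam-user file (assumed: one name per line).
    val spamUsers = ssc.sparkContext.textFile(args(0)).map(user => (user, true))

    // Streaming side: (user, text) pairs; host and port are placeholders.
    val tweets = ssc.socketTextStream("localhost", 9999).map(parseTweet)

    // transform exposes the RDD behind each batch, so any RDD operation,
    // including a join against the batch RDD, can be applied per batch.
    val cleanTweets = tweets.transform { tweetsRDD =>
      tweetsRDD
        .leftOuterJoin(spamUsers)
        .filter { case (_, (_, spamFlag)) => spamFlag.isEmpty } // keep non-spam users
        .map { case (user, (text, _)) => (user, text) }
    }

    cleanTweets.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

On Spark versions older than 1.3, the pair-RDD operations used here may additionally need `import org.apache.spark.SparkContext._`.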

Similar components: higher throughput than Storm
- Spark Streaming: 670k records/second/node
- Storm: 115k records/second/node
- Apache S4: 7.5k records/second/node

Thank You…
