Productionalizing Spark Streaming Applications
By: Robert Sanders
Quick Poll
Robert Sanders: Big Data Manager and Engineer
Robert Sanders is an Engineering Manager at Clairvoyant. In his day job, Robert wears multiple hats, going back and forth between architecting and engineering large-scale data platforms. Robert has a deep background in enterprise systems, initially working on full-stack implementations and then focusing on building data management platforms.
About Clairvoyant
A boutique consulting firm centered on building data solutions and products. All things Web and Data Engineering, Analytics, ML, and User Experience to bring it all together. Supports the core Hadoop platform and data engineering pipelines, and provides administrative and DevOps expertise focused on Hadoop.
Agenda
What is Spark Streaming and Kafka?
Steps to Production:
Managing the Streaming Application (Starting and Stopping)
Monitoring
Preventing Data Loss
Checkpointing
Implementing Kafka Delivery Semantics
Stability
Summary
What is Spark Streaming?
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
Processing in Spark Streaming
Spark processes micro-batches of data from the input on the Spark engine.
What is Kafka? Apache Kafka® is a Distributed Streaming Platform
Kafka is a circular buffer: data gets written to disk, and as the buffer fills up, the oldest files are removed.
The Starting Point
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
  ConsumerConfig.GROUP_ID_CONFIG -> groupId,
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()

// Start the computation
ssc.start()
ssc.awaitTermination()
Starting the Spark Streaming Application
Build your JAR (or Python file)
Execute the spark-submit command:
$ spark-submit --class "org.apache.spark.testSimpleApp" --master local[4] /path/to/jar/simple-project_ jar
What's that --master option?
Spark Masters
Local: --master local, --master local[2], --master local[*]
Spark Standalone: --master spark://{HOST}:{PORT}
YARN: --master yarn
Mesos: --master mesos://{HOST}:{PORT}
Kubernetes: --master k8s://{HOST}:{PORT}
Spark-YARN Integration
Spark Version <= 1.6.3:
YARN Client Mode: --master yarn-client
YARN Cluster Mode: --master yarn-cluster
Spark Version >= 2.0:
YARN Client Mode: --master yarn --deploy-mode client
YARN Cluster Mode: --master yarn --deploy-mode cluster
Spark Architecture (diagram)
YARN Client Mode (diagram)
YARN Cluster Mode (diagram)
Use YARN Cluster Mode
YARN Cluster Mode Configurations
spark-defaults.conf:
spark.yarn.maxAppAttempts=2
spark.yarn.am.attemptFailuresValidityInterval=1h
YARN will retry the application up to 2 times, and failed attempts older than 1 hour no longer count against that limit.
YARN Cluster Mode Configurations
$ spark-submit --class "org.apache.testSimpleApp" \
    --master yarn --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=2 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    /path/to/jar/simple-project_ jar
YARN Cluster Mode Configurations
val sparkConf = new SparkConf()
  .setAppName("App")
  .set("spark.yarn.maxAppAttempts", "1")
  .set("spark.yarn.am.attemptFailuresValidityInterval", "2h")
val ssc = new StreamingContext(sparkConf, Seconds(2))
Shutting Down the Spark Streaming Application
yarn application -kill {ApplicationID}
What if a Micro Batch is processing when we kill the application?!
Shut the Streaming Application down Gracefully
Graceful Shutdown
On Spark Streaming startup:
Create a touch file in HDFS (see the sketch below)
Within the Spark code, periodically check if the touch file still exists
If the touch file doesn't exist, start the Graceful Shutdown process
To stop:
Delete the touch file and wait for the Graceful Shutdown process to complete
Tip: Build a shell script to do these start and stop operations
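A minimal sketch of creating the touch file on the driver at startup, assuming the Hadoop FileSystem API; the helper name and marker path are illustrative, not from the presentation:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative helper: create the HDFS "touch" file when the streaming job starts.
// Its presence means "keep running"; deleting it later requests a graceful stop.
def createShutdownMarker(hadoopConf: Configuration, markerLocation: String): Unit = {
  val fs = FileSystem.get(hadoopConf)
  val markerPath = new Path(markerLocation)
  if (!fs.exists(markerPath)) {
    fs.create(markerPath).close()   // write an empty file
  }
}

// e.g. createShutdownMarker(ssc.sparkContext.hadoopConfiguration, "/apps/myapp/running.marker")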
The Starting Point - Step one to Graceful Shutdown
In the original example, replace the blocking call at the end:
// Start the computation
ssc.start()
ssc.awaitTermination()   // <- Replace
Graceful Shutdown
var TRIGGER_STOP = false
var ssc: StreamingContext = … // Define Stream Creation, Transformations and Actions.
ssc.start()
var isStopped = false
while (!isStopped) {
  isStopped = ssc.awaitTerminationOrTimeout(SPARK_SHUTDOWN_CHECK_MILLIS)
  if (isStopped)
    LOGGER.info("The Spark Streaming context is stopped. Exiting application...")
  else
    LOGGER.info("Streaming App is still running. Timeout...")
  checkShutdownMarker(ssc, SPARK_SHUTDOWN_RUNNING_MARKER_TOUCH_FILE_LOCATION)
  if (!isStopped && TRIGGER_STOP) {
    LOGGER.info("Stopping the ssc Spark Streaming Context...")
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    LOGGER.info("Spark Streaming Context is Stopped!")
  }
}
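The presentation does not show the body of checkShutdownMarker; a plausible sketch, assuming the same HDFS touch-file convention described above:

// Sketch of the checkShutdownMarker helper referenced above: flip TRIGGER_STOP
// once the HDFS touch file has been deleted by the stop script.
def checkShutdownMarker(ssc: StreamingContext, markerLocation: String): Unit = {
  if (!TRIGGER_STOP) {
    val fs = org.apache.hadoop.fs.FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    val markerExists = fs.exists(new org.apache.hadoop.fs.Path(markerLocation))
    TRIGGER_STOP = !markerExists   // marker gone => operator requested a graceful stop
  }
}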
Monitoring
Operational Monitoring: Ganglia, Graphite
StreamingListener (Spark >= 2.1) - see the sketch below:
onBatchSubmitted
onBatchStarted
onBatchCompleted
onReceiverStarted
onReceiverStopped
onReceiverError
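A minimal sketch of plugging in a custom listener; the class name and log output are illustrative. Register it before ssc.start():

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Illustrative listener: report per-batch metrics so they can be forwarded to
// an operational monitoring system such as Ganglia or Graphite.
class BatchMonitoringListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch ${info.batchTime}: ${info.numRecords} records, " +
      s"scheduling delay ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processing time ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

ssc.addStreamingListener(new BatchMonitoringListener())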
Monitoring - Spark UI - Streaming Tab
Preventing Data Loss - Checkpointing
Metadata checkpointing: Configuration, DStream operations, Incomplete batches
Data checkpointing: saves the RDDs in each micro-batch to reliable storage

Metadata checkpointing - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application (discussed in detail later). Metadata includes:
Configuration - The configuration that was used to create the streaming application.
DStream operations - The set of DStream operations that define the streaming application.
Incomplete batches - Batches whose jobs are queued but have not completed yet.
Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to the dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
Checkpointing must be enabled for applications with any of the following requirements:
Usage of stateful transformations - If either updateStateByKey or reduceByKeyAndWindow (with inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.
Recovering from failures of the driver running the application - Metadata checkpoints are used to recover with progress information.
Note that simple streaming applications without the aforementioned stateful transformations can be run without enabling checkpointing. The recovery from driver failures will also be partial in that case (some received but unprocessed data may be lost). This is often acceptable, and many run Spark Streaming applications in this way. Support for non-Hadoop environments is expected to improve in the future.
Preventing Data Loss - Checkpointing
Required if using stateful transformations (updateStateByKey or reduceByKeyAndWindow) - see the sketch below
Used to recover from Driver failures
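For context, a small sketch of a stateful transformation that makes checkpointing mandatory; it keeps a running word count across batches, building on the wordCounts DStream from the earlier example (the checkpoint path is a placeholder):

// updateStateByKey keeps state across micro-batches, so Spark requires a
// checkpoint directory to be set before the context can be started.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // placeholder path

val updateTotals = (newCounts: Seq[Long], runningTotal: Option[Long]) =>
  Some(newCounts.sum + runningTotal.getOrElse(0L))

val runningWordCounts = wordCounts.updateStateByKey[Long](updateTotals)
runningWordCounts.print()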
Enable Metadata Checkpointing
val checkpointDirectory = "hdfs://..."   // define checkpoint directory

// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)     // new context
  val lines = ssc.socketTextStream(...)   // create DStreams
  ssc.checkpoint(checkpointDirectory)     // set checkpoint directory
  ssc                                     // Return the StreamingContext
}

// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)
ssc.start()   // Start the context
Checkpointing Problems
Can't survive Spark version upgrades
Clear the checkpoint between code upgrades
Use Checkpointing (but be careful)
Receiver Based Streaming (diagram)
Why have a WAL (Write Ahead Log)?
Data in the Receiver is stored within the Executor's memory
If we don't have a WAL, the data will be lost on Executor failure
Once the data is written to the WAL, an acknowledgement is passed to Kafka
Recovering data with the WAL
Enable Checkpointing - logs will be written to the Checkpoint Directory
Enable the WAL in the Spark Configuration: spark.streaming.receiver.writeAheadLog.enable=true
When using the WAL, the data is already persisted to HDFS, so disable in-memory replication: use StorageLevel.MEMORY_AND_DISK_SER
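A sketch of how those settings fit together with the older receiver-based Kafka API (org.apache.spark.streaming.kafka.KafkaUtils.createStream from the 0.8 integration); the ZooKeeper quorum, group id, topic map and paths are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils   // receiver-based (Kafka 0.8) integration

val sparkConf = new SparkConf()
  .setAppName("ReceiverBasedApp")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // turn on the WAL
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // WAL segments are written under this directory

// The WAL already persists received data to HDFS, so skip the extra in-memory replica.
val receiverStream = KafkaUtils.createStream(
  ssc,
  "zkhost1:2181,zkhost2:2181",   // placeholder ZooKeeper quorum
  "example-consumer-group",      // placeholder consumer group id
  Map("example-topic" -> 1),     // topic -> number of receiver threads
  StorageLevel.MEMORY_AND_DISK_SER)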
Thought: Kafka already stores replicated copies of the data in a circular buffer. Why do I need a WAL?
Direct Stream (diagram)
Since Kafka already stores the data, we won't need a WAL.
Use the Direct Stream
Kafka Topics
Creating a Topic:
kafka-topics --zookeeper <host>:<port> --create --topic <topic-name> --partitions <number-of-partitions> --replication-factor <number-of-replicas>
Kafka Writes (diagram)
Consuming from a Kafka Topic
Kafka Reads (diagram)
When setting up your Kafka Topics, set up multiple Partitions
Direct Stream Gotchas
Reminder: Checkpoints are not recoverable across code or cluster upgrades
You need to track your own Kafka Offsets: use ZooKeeper, HDFS, HBase, Kudu, a DB, etc.
For an Exactly-Once Delivery Semantic:
Store offsets after an idempotent output, OR
Store offsets in an atomic transaction alongside the output
Managing Kafka Offsets
Managing Offsets
val storedOffsets: Option[mutable.Map[TopicPartition, Long]] = loadOffsets(spark, kuduContext)
val kafkaDStream = storedOffsets match {
  case None =>
    LOGGER.info("storedOffsets was None")
    kafkaParams += ("auto.offset.reset" -> "latest")
    KafkaUtils.createDirectStream[String, Array[Byte]](ssc, PreferConsistent,
      ConsumerStrategies.Subscribe[String, Array[Byte]](topicsSet, kafkaParams))
  case Some(fromOffsets) =>
    LOGGER.info("storedOffsets was Some(" + fromOffsets + ")")
    kafkaParams += ("auto.offset.reset" -> "none")
    KafkaUtils.createDirectStream[String, Array[Byte]](ssc, PreferConsistent,
      ConsumerStrategies.Assign[String, Array[Byte]](fromOffsets.keys.toList, kafkaParams, fromOffsets))
}
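The loadOffsets helper above (reading from Kudu) is not shown in the presentation; here is a hedged sketch of the complementary half, capturing each batch's offset ranges and persisting them only after the output succeeds (saveOffsets is a hypothetical helper):

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

kafkaDStream.foreachRDD { rdd =>
  // RDDs produced by the direct stream expose the Kafka offset ranges they cover.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... write this batch's results here (idempotent sink, or same transaction as the offsets) ...

  // Only after the output is written, persist the offsets for the next restart.
  saveOffsets(offsetRanges)   // hypothetical helper: store to Kudu, HBase, ZooKeeper, a DB, etc.
}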
Stability
Average Batch Processing Time < Batch Interval
Stability - Spark UI - Streaming Tab
Improving Stability
Optimize reads, transformations and writes
Caching
Increase Parallelism: more partitions in Kafka, more Executors
Repartition the data after receiving it: dstream.repartition(100) (see the sketch below)
Increase the Batch Duration
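A small sketch of two of those knobs applied to the earlier word-count stream: repartitioning after the receive to get more parallelism, and caching a DStream that feeds more than one output (the partition count is illustrative):

// Extract the payload first, then spread the work across more tasks than
// the number of Kafka partitions would otherwise allow.
val payloads = messages.map(_.value)
val repartitioned = payloads.repartition(100)   // illustrative partition count

repartitioned.cache()   // reused by the two outputs below, so compute it once per batch

repartitioned.count().print()                         // records per batch
repartitioned.flatMap(_.split(" ")).count().print()   // words per batch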
Summary
Use YARN Cluster Mode
Gracefully shutdown your application
Monitor your job
Use Checkpointing (but be careful)
Setup multiple Partitions in your Kafka Topics
Use Direct Streams
Save your Offsets
Stabilize your Streaming Application
Thank You! Questions? robert.sanders@clairvoyantsoft.com