Productionalizing Spark Streaming Applications
By: Robert Sanders
Quick Poll
Robert Sanders: Big Data Manager and Engineer
Robert Sanders is an Engineering Manager at Clairvoyant. In his day job, Robert wears multiple hats, going back and forth between architecting and engineering large-scale data platforms. Robert has a deep background in enterprise systems, initially working on full-stack implementations and then focusing on building data management platforms.
About Clairvoyant
A boutique consulting firm centered on building data solutions and products. All things Web and Data Engineering, Analytics, ML, and User Experience to bring it all together. Supports the core Hadoop platform and data engineering pipelines, and provides administrative and DevOps expertise focused on Hadoop.
Agenda
What is Spark Streaming and Kafka?
Steps to Production:
Managing the Streaming Application (Starting and Stopping)
Monitoring
Preventing Data Loss
Checkpointing
Implementing Kafka Delivery Semantics
Stability
Summary
What is Spark Streaming?
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
Processing in Spark Streaming
Spark processes micro-batches of data from the input on the Spark engine.
What is Kafka? Apache Kafka® is a Distributed Streaming Platform
Kafka is a circular buffer: data gets written to disk, and as the buffer fills up, the oldest files are removed.
The Starting Point
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
  ConsumerConfig.GROUP_ID_CONFIG -> groupId,
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()

// Start the computation
ssc.start()
ssc.awaitTermination()
Starting the Spark Streaming Application
Build your JAR (or Python file)
Execute the spark-submit command:
$ spark-submit --class "org.apache.spark.testSimpleApp" --master local[4] /path/to/jar/simple-project_ jar
What's that --master option?
Spark Masters
Local: --master local, --master local[2], --master local[*]
Spark Standalone: --master spark://{HOST}:{PORT}
YARN: --master yarn
Mesos: --master mesos://{HOST}:{PORT}
Kubernetes: --master k8s://{HOST}:{PORT}
Spark-YARN Integration
Spark Version <= 1.6.3:
YARN Client Mode: --master yarn-client
YARN Cluster Mode: --master yarn-cluster
Spark Version >= 2.0:
YARN Client Mode: --master yarn --deploy-mode client
YARN Cluster Mode: --master yarn --deploy-mode cluster
Spark Architecture (diagram)
YARN Client Mode (diagram)
YARN Cluster Mode (diagram)
Use YARN Cluster Mode
YARN Cluster Mode Configurations
spark-defaults.conf:
spark.yarn.maxAppAttempts=2
spark.yarn.am.attemptFailuresValidityInterval=1h
YARN will retry the application up to 2 times, and failed attempts older than 1 hour no longer count against that limit.
YARN Cluster Mode Configurations
$ spark-submit --class "org.apache.testSimpleApp" \
    --master yarn --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=2 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    /path/to/jar/simple-project_ jar
YARN Cluster Mode Configurations
val sparkConf = new SparkConf()
  .setAppName("App")
  .set("spark.yarn.maxAppAttempts", "1")
  .set("spark.yarn.am.attemptFailuresValidityInterval", "2h")
val ssc = new StreamingContext(sparkConf, Seconds(2))
Shutting Down the Spark Streaming Application
yarn application -kill {ApplicationID}
What if a Micro Batch is processing when we kill the application?!
Shut the Streaming Application down Gracefully
Graceful Shutdown
On Spark Streaming startup:
Create a touch file in HDFS (see the sketch below)
Within the Spark code, periodically check if the touch file still exists
If the touch file doesn't exist, start the Graceful Shutdown process
To stop:
Delete the touch file and wait for the Graceful Shutdown process to complete
Tip: Build a shell script to do these start and stop operations
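A minimal sketch of creating the touch file on the driver at startup, assuming the Hadoop FileSystem API; the helper name and marker path are illustrative, not from the presentation:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative helper: create the HDFS "touch" file when the streaming job starts.
// Its presence means "keep running"; deleting it later requests a graceful stop.
def createShutdownMarker(hadoopConf: Configuration, markerLocation: String): Unit = {
  val fs = FileSystem.get(hadoopConf)
  val markerPath = new Path(markerLocation)
  if (!fs.exists(markerPath)) {
    fs.create(markerPath).close()   // write an empty file
  }
}

// e.g. createShutdownMarker(ssc.sparkContext.hadoopConfiguration, "/apps/myapp/running.marker")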
The Starting Point - Step one to Graceful Shutdown
In the original example, replace the blocking call at the end:
// Start the computation
ssc.start()
ssc.awaitTermination()   // <- Replace
Graceful Shutdown
var TRIGGER_STOP = false
var ssc: StreamingContext = … // Define Stream Creation, Transformations and Actions.
ssc.start()
var isStopped = false
while (!isStopped) {
  isStopped = ssc.awaitTerminationOrTimeout(SPARK_SHUTDOWN_CHECK_MILLIS)
  if (isStopped)
    LOGGER.info("The Spark Streaming context is stopped. Exiting application...")
  else
    LOGGER.info("Streaming App is still running. Timeout...")
  checkShutdownMarker(ssc, SPARK_SHUTDOWN_RUNNING_MARKER_TOUCH_FILE_LOCATION)
  if (!isStopped && TRIGGER_STOP) {
    LOGGER.info("Stopping the ssc Spark Streaming Context...")
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    LOGGER.info("Spark Streaming Context is Stopped!")
  }
}
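The presentation does not show the body of checkShutdownMarker; a plausible sketch, assuming the same HDFS touch-file convention described above:

// Sketch of the checkShutdownMarker helper referenced above: flip TRIGGER_STOP
// once the HDFS touch file has been deleted by the stop script.
def checkShutdownMarker(ssc: StreamingContext, markerLocation: String): Unit = {
  if (!TRIGGER_STOP) {
    val fs = org.apache.hadoop.fs.FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    val markerExists = fs.exists(new org.apache.hadoop.fs.Path(markerLocation))
    TRIGGER_STOP = !markerExists   // marker gone => operator requested a graceful stop
  }
}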
Monitoring
Operational Monitoring: Ganglia, Graphite
StreamingListener (Spark >= 2.1) - see the sketch below:
onBatchSubmitted
onBatchStarted
onBatchCompleted
onReceiverStarted
onReceiverStopped
onReceiverError
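A minimal sketch of plugging in a custom listener; the class name and log output are illustrative. Register it before ssc.start():

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Illustrative listener: report per-batch metrics so they can be forwarded to
// an operational monitoring system such as Ganglia or Graphite.
class BatchMonitoringListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch ${info.batchTime}: ${info.numRecords} records, " +
      s"scheduling delay ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processing time ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

ssc.addStreamingListener(new BatchMonitoringListener())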
Monitoring - Spark UI - Streaming Tab
Preventing Data Loss - Checkpointing
Metadata checkpointing: Configuration, DStream operations, Incomplete batches
Data checkpointing: saves the RDDs in each micro-batch to reliable storage

Metadata checkpointing - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application (discussed in detail later). Metadata includes:
Configuration - The configuration that was used to create the streaming application.
DStream operations - The set of DStream operations that define the streaming application.
Incomplete batches - Batches whose jobs are queued but have not completed yet.
Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to the dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
Checkpointing must be enabled for applications with any of the following requirements:
Usage of stateful transformations - If either updateStateByKey or reduceByKeyAndWindow (with inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.
Recovering from failures of the driver running the application - Metadata checkpoints are used to recover with progress information.
Note that simple streaming applications without the aforementioned stateful transformations can be run without enabling checkpointing. The recovery from driver failures will also be partial in that case (some received but unprocessed data may be lost). This is often acceptable, and many run Spark Streaming applications in this way. Support for non-Hadoop environments is expected to improve in the future.
Preventing Data Loss - Checkpointing
Required if using stateful transformations (updateStateByKey or reduceByKeyAndWindow) - see the sketch below
Used to recover from Driver failures
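For context, a small sketch of a stateful transformation that makes checkpointing mandatory; it keeps a running word count across batches, building on the wordCounts DStream from the earlier example (the checkpoint path is a placeholder):

// updateStateByKey keeps state across micro-batches, so Spark requires a
// checkpoint directory to be set before the context can be started.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // placeholder path

val updateTotals = (newCounts: Seq[Long], runningTotal: Option[Long]) =>
  Some(newCounts.sum + runningTotal.getOrElse(0L))

val runningWordCounts = wordCounts.updateStateByKey[Long](updateTotals)
runningWordCounts.print()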
Enable Metadata Checkpointing
val checkpointDirectory = "hdfs://..."   // define checkpoint directory

// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)     // new context
  val lines = ssc.socketTextStream(...)   // create DStreams
  ssc.checkpoint(checkpointDirectory)     // set checkpoint directory
  ssc                                     // Return the StreamingContext
}

// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)
ssc.start()   // Start the context
Checkpointing Problems
Can't survive Spark version upgrades
Clear the checkpoint between code upgrades
Use Checkpointing (but be careful)
Receiver Based Streaming (diagram)
Why have a WAL (Write Ahead Log)?
Data in the Receiver is stored within the Executor's memory
If we don't have a WAL, the data will be lost on Executor failure
Once the data is written to the WAL, an acknowledgement is passed to Kafka
Recovering data with the WAL
Enable Checkpointing - logs will be written to the Checkpoint Directory
Enable the WAL in the Spark Configuration: spark.streaming.receiver.writeAheadLog.enable=true
When using the WAL, the data is already persisted to HDFS, so disable in-memory replication: use StorageLevel.MEMORY_AND_DISK_SER
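A sketch of how those settings fit together with the older receiver-based Kafka API (org.apache.spark.streaming.kafka.KafkaUtils.createStream from the 0.8 integration); the ZooKeeper quorum, group id, topic map and paths are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils   // receiver-based (Kafka 0.8) integration

val sparkConf = new SparkConf()
  .setAppName("ReceiverBasedApp")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // turn on the WAL
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // WAL segments are written under this directory

// The WAL already persists received data to HDFS, so skip the extra in-memory replica.
val receiverStream = KafkaUtils.createStream(
  ssc,
  "zkhost1:2181,zkhost2:2181",   // placeholder ZooKeeper quorum
  "example-consumer-group",      // placeholder consumer group id
  Map("example-topic" -> 1),     // topic -> number of receiver threads
  StorageLevel.MEMORY_AND_DISK_SER)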
Thought: Kafka already stores replicated copies of the data in a circular buffer. Why do I need a WAL?
Direct Stream (diagram)
Since Kafka already stores the data, we won't need a WAL.
Use the Direct Stream
Kafka Topics
Creating a Topic:
kafka-topics --zookeeper <host>:<port> --create --topic <topic-name> --partitions <number-of-partitions> --replication-factor <number-of-replicas>
Kafka Writes (diagram)
Consuming from a Kafka Topic
Kafka Reads (diagram)
When setting up your Kafka Topics, set up multiple Partitions
Direct Stream Gotchas
Reminder: Checkpoints are not recoverable across code or cluster upgrades
You need to track your own Kafka Offsets: use ZooKeeper, HDFS, HBase, Kudu, a DB, etc.
For an Exactly-Once Delivery Semantic:
Store offsets after an idempotent output, OR
Store offsets in an atomic transaction alongside the output
Managing Kafka Offsets
Managing Offsets
val storedOffsets: Option[mutable.Map[TopicPartition, Long]] = loadOffsets(spark, kuduContext)
val kafkaDStream = storedOffsets match {
  case None =>
    LOGGER.info("storedOffsets was None")
    kafkaParams += ("auto.offset.reset" -> "latest")
    KafkaUtils.createDirectStream[String, Array[Byte]](ssc, PreferConsistent,
      ConsumerStrategies.Subscribe[String, Array[Byte]](topicsSet, kafkaParams))
  case Some(fromOffsets) =>
    LOGGER.info("storedOffsets was Some(" + fromOffsets + ")")
    kafkaParams += ("auto.offset.reset" -> "none")
    KafkaUtils.createDirectStream[String, Array[Byte]](ssc, PreferConsistent,
      ConsumerStrategies.Assign[String, Array[Byte]](fromOffsets.keys.toList, kafkaParams, fromOffsets))
}
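The loadOffsets helper above (reading from Kudu) is not shown in the presentation; here is a hedged sketch of the complementary half, capturing each batch's offset ranges and persisting them only after the output succeeds (saveOffsets is a hypothetical helper):

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

kafkaDStream.foreachRDD { rdd =>
  // RDDs produced by the direct stream expose the Kafka offset ranges they cover.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... write this batch's results here (idempotent sink, or same transaction as the offsets) ...

  // Only after the output is written, persist the offsets for the next restart.
  saveOffsets(offsetRanges)   // hypothetical helper: store to Kudu, HBase, ZooKeeper, a DB, etc.
}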
Stability
Average Batch Processing Time < Batch Interval
Stability - Spark UI - Streaming Tab
Improving Stability
Optimize reads, transformations and writes
Caching
Increase Parallelism: more partitions in Kafka, more Executors
Repartition the data after receiving it: dstream.repartition(100) (see the sketch below)
Increase the Batch Duration
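A small sketch of two of those knobs applied to the earlier word-count stream: repartitioning after the receive to get more parallelism, and caching a DStream that feeds more than one output (the partition count is illustrative):

// Extract the payload first, then spread the work across more tasks than
// the number of Kafka partitions would otherwise allow.
val payloads = messages.map(_.value)
val repartitioned = payloads.repartition(100)   // illustrative partition count

repartitioned.cache()   // reused by the two outputs below, so compute it once per batch

repartitioned.count().print()                         // records per batch
repartitioned.flatMap(_.split(" ")).count().print()   // words per batch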
Summary
Use YARN Cluster Mode
Gracefully shutdown your application
Monitor your job
Use Checkpointing (but be careful)
Setup multiple Partitions in your Kafka Topics
Use Direct Streams
Save your Offsets
Stabilize your Streaming Application
Thank You! Questions? robert.sanders@clairvoyantsoft.com