1
CSCI5570 Large Scale Data Processing Systems
Distributed Stream Processing Systems James Cheng CSE, CUHK Slide Ack.: modified based on the slides from Aleksandar Bradic
2
Discretized Streams: Fault-Tolerant Streaming Computation at Scale
Matei Zaharia, Tathagata Das (TD), Haoyuan (HY) Li, Timothy Hunter, Scott Shenker, Ion Stoica SOSP 2013
3
Motivation Many big-data applications need to process large data streams in near-real time
Require tens to hundreds of nodes Require second-scale latencies Website monitoring Fraud detection Ad monetization There are many big-data applications that need to process large streams of data and produce results in near real time. For example, a website monitoring system may want to watch websites for load spikes. A fraud detection system may want to monitor bank transactions in real time to detect fraud. An ad agency may want to count clicks in real time. The common property across such systems is that they require large clusters of tens to hundreds of nodes to handle the streaming load, and they require an end-to-end latency of a few seconds.
4
Challenge Stream processing systems must recover from failures and stragglers quickly and efficiently More important for streaming systems than batch systems Traditional streaming systems don’t achieve these properties simultaneously D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes Not just failures: streaming systems must also deal with stragglers, and they must recover from both quickly and efficiently. This is in fact more important for streaming systems than for batch systems, because you don’t want your fraud detection system to be down for a long time. The problem is that traditional streaming systems don’t achieve both of these properties together.
5
Outline Limitations of Traditional Streaming Systems
Discretized Stream Processing Unification with Batch and Interactive Processing In this talk, I am going to first discuss the limitations of traditional streaming systems. Then I am going to elaborate on discretized stream processing. Finally, I am going to talk about how this unifies stream processing with batch and interactive processing.
6
Traditional Streaming Systems
Continuous operator model Each node runs an operator with in-memory mutable state For each input record, state is updated and new records are sent out Mutable state is lost if node fails Various techniques exist to make state fault-tolerant Traditional stream processing systems use the continuous operator model, where every node in the processing pipeline continuously runs an operator with in-memory mutable state. As each input record is received, the mutable state is updated and new records are sent out to downstream nodes. The problem with this model is that the mutable state is lost if the node fails. To deal with this, various techniques have been developed to make the state fault-tolerant. I will divide them into two broad classes and explain their limitations.
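To make the mutable-state problem concrete, here is a minimal sketch of such an operator keeping a running per-URL count in an in-memory mutable map. This is my own illustration, not code from any of these systems; the class name and record format are hypothetical.

```scala
// Sketch of a continuous operator: one long-lived object per node, holding
// mutable state that is updated record by record. If the node crashes,
// `counts` is lost with it.
class RunningCountOperator {
  private val counts = scala.collection.mutable.Map[String, Long]().withDefaultValue(0L)

  // Called once per input record; returns the records to forward downstream.
  def process(url: String): Seq[(String, Long)] = {
    counts(url) += 1
    Seq(url -> counts(url))
  }
}

object ContinuousOperatorDemo extends App {
  val op = new RunningCountOperator
  Seq("/home", "/about", "/home").foreach { url =>
    println(op.process(url))   // state mutates in place as each record arrives
  }
}
```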
7
Fault-tolerance in Traditional Systems
Node Replication [e.g. Borealis, Flux] Separate set of “hot failover” nodes process the same data streams Synchronization protocols ensure exact ordering of records in both sets On failure, the system switches over to the failover nodes Fast recovery, but 2x hardware cost First, systems like Borealis and Flux use node replication, where a separate set of nodes processes the same stream alongside the main nodes; these act as hot failover nodes. Synchronization protocols ensure that both sets of nodes process the records in exactly the same order. When a node fails, the system immediately switches over to the failover nodes, and processing continues with little or no disruption. This gives fast recovery, but at the cost of double the hardware. At large scale that may not be desirable: you may not want to run 200 nodes to handle the processing load of 100.
8
Fault-tolerance in Traditional Systems
Upstream Backup [e.g. TimeStream, Storm] Each node maintains a backup of the forwarded records since the last checkpoint A “cold failover” node is maintained On failure, upstream nodes replay the backup records serially to the failover node to recreate the state Only need 1 standby, but slow recovery The second method is upstream backup, where each node keeps a backup of all the records it has forwarded downstream, until the downstream node acknowledges that it has finished processing them and checkpointed its state. A cold failover node is maintained, and when a node fails, the system replays the backed-up records serially to the failover node to recreate the lost state. While this only requires one standby node, recovery is slow because of the recomputation. So you can either get fast recovery at high cost, or slow recovery cheaply.
9
Slow Nodes in Traditional Systems
Node Replication Upstream Backup Neither approach handles stragglers Furthermore, consider what happens when there are stragglers. Suppose one node slows down. This causes the downstream nodes to slow down as well. And in the case of replication, since the replicas need to stay in sync, they effectively slow down too. So neither approach handles stragglers.
10
Design Goals Scales to hundreds of nodes Achieves second-scale latency
Tolerate node failures and stragglers Sub-second fault and straggler recovery Minimal overhead beyond base processing (e.g. avoid 2x replication overhead) So we want to build a streaming system with the following ambitious goals. Besides scaling to hundreds of nodes with second-scale latencies, the system should tolerate faults and stragglers. In particular, it should recover from faults and stragglers within seconds, which is hard for upstream backup to do. And it should use minimal extra hardware beyond normal processing, unlike node replication.
11
Why is it hard? Stateful continuous operators tightly integrate “computation” with “mutable state” Makes it harder to define clear boundaries where computation and state can be moved around So we accepted the challenge and tried to understand why this is hard. It is because the continuous operator model tightly integrates computation with mutable state. This makes it harder to define clear boundaries in the computation and state such that they can be moved around.
12
Dissociate computation from state
Make state immutable and break computation into small, deterministic, stateless tasks Defines clear boundaries where state and computation can be moved around independently Instead, what we propose is to dissociate the computation from the state: make the state immutable, and break the computation into small, deterministic, stateless tasks. To perform a stateful operation, the system takes the previous state as an input to a stateless task and generates the next state as its output, and this continues with further tasks. This defines clear boundaries where state and computation can be moved around independently. This may sound abstract, but the interesting thing is that existing systems already do this.
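As a rough sketch of the same running count written in this dissociated style (again an illustration with hypothetical names, not the paper's code), state becomes an immutable value and each task a pure function from (previous state, input batch) to the next state:

```scala
// Sketch: state is an immutable value, and each "task" is a pure function
// from (previous state, batch of input) to the next state. Because the task
// is deterministic and stateless, it can be re-run anywhere to rebuild lost state.
object StatelessTaskSketch extends App {
  type State = Map[String, Long]   // immutable map

  def countTask(prevState: State, batch: Seq[String]): State =
    batch.foldLeft(prevState) { (st, url) =>
      st.updated(url, st.getOrElse(url, 0L) + 1)
    }

  val state1 = countTask(Map.empty, Seq("/home", "/about"))
  val state2 = countTask(state1, Seq("/home"))
  println(state2)   // Map(/home -> 2, /about -> 1)
}
```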
13
Batch Processing Systems!
14
Batch Processing Systems
Batch processing systems like MapReduce divide Data into small partitions Jobs into small, deterministic, stateless map / reduce tasks Systems like MapReduce divide the data into small partitions, and the job into small, deterministic, stateless map and reduce tasks. Let's walk through a MapReduce job. You start from an immutable dataset that is already divided into partitions. For each partition, the system runs stateless map tasks, which generate immutable map outputs. These outputs are then shuffled together, and stateless reduce tasks generate the immutable output dataset.
15
Parallel Recovery Failed tasks are re-executed on the other nodes in parallel Because these tasks are stateless, they can be moved around easily. For example, if a node fails, some map tasks and some reduce tasks are lost, and some partitions of the output are not generated. The system can automatically re-run the lost map tasks in parallel on the other nodes, then the reduce tasks, finally regenerating the missing output partitions. Let's see how we use this parallel-recovery property in our stream processing model, which we call Discretized Stream Processing.
16
Discretized Stream Processing
17
Discretized Stream Processing
Run a streaming computation as a series of small, deterministic batch jobs Store intermediate state data in cluster memory Try to make batch sizes as small as possible to get second-scale latencies
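A toy sketch of this micro-batch loop, under the assumption that records are grouped into fixed 1-second intervals; in the real system each batch is a distributed Spark job over RDDs rather than a local fold, and the interval data and batch function here are illustrative:

```scala
// Toy illustration of discretized streaming: collect records for a short
// interval, then run an ordinary deterministic batch computation over that
// interval's data together with the previous state.
object MicroBatchLoopSketch extends App {
  type State = Map[String, Long]

  // One "small batch job": add this interval's per-URL counts to the state.
  def processBatch(prev: State, batch: Seq[String]): State = {
    val batchCounts = batch.groupBy(identity).map { case (url, hits) => url -> hits.size.toLong }
    batchCounts.foldLeft(prev) { case (st, (url, c)) => st.updated(url, st.getOrElse(url, 0L) + c) }
  }

  // Pretend these arrived in three consecutive 1-second intervals.
  val intervals = Seq(Seq("/home", "/about"), Seq("/home"), Seq("/about", "/about"))

  // Keep each interval's resulting state as its own immutable dataset,
  // mirroring the per-interval RDDs a D-Stream produces.
  val statePerInterval = intervals.scanLeft(Map.empty[String, Long])(processBatch).tail
  statePerInterval.zipWithIndex.foreach { case (st, i) => println(s"state after interval $i: $st") }
}
```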
18
Discretized Stream Processing
In each interval (e.g. time = 0 - 1), the batch of input is processed with batch operations. Input: replicated dataset stored reliably across the cluster. Output or State: new immutable dataset stored in memory without replication. In the next interval (time = 1 - 2), the intermediate state is an RDD that is processed along with the next batch of input data, and so on along the input stream and the output / state stream.
19
Example: Counting page views
Discretized Stream (DStream) is a sequence of immutable, partitioned datasets Can be created from live data streams or by applying bulk, parallel transformations on other DStreams views = readStream("…", "1 sec") ones = views.map(ev => (ev.url, 1)) counts = ones.runningReduce((x,y) => x+y) Let's see how to program against this model. The programming abstraction we provide is called a DStream, short for discretized stream, which is a sequence of immutable, partitioned datasets. A DStream can be created either from a live data stream, as readStream does in the program above for the stream of page views, or by applying bulk parallel transformations to other DStreams, as the map and runningReduce transformations do. This program maintains a running count of the page views. The underlying datasets form a views DStream, a ones DStream, and a counts DStream across intervals (t: 0-1, t: 1-2, ...); the counts datasets are the state datasets that add up the counts across batches.
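For reference, the snippet above is the paper's pseudocode. A roughly equivalent program against the public Spark Streaming API might look like the sketch below; the socket source, host/port, and checkpoint directory are illustrative, and updateStateByKey stands in for the paper's runningReduce.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PageViewCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PageViewCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))   // 1-second batches
    ssc.checkpoint("/tmp/pageview-checkpoints")          // needed for stateful ops (path illustrative)

    // Page-view URLs arriving one per line on a socket (host/port illustrative);
    // this plays the role of readStream in the slide's pseudocode.
    val views = ssc.socketTextStream("localhost", 9999)

    val ones = views.map(url => (url, 1))

    // runningReduce in the pseudocode roughly corresponds to updateStateByKey:
    // fold each key's new values into its running count, kept as DStream state.
    val counts = ones.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newValues.sum)
    }

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```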
20
Fine-grained Lineage Datasets track fine-grained operation lineage
Datasets are periodically checkpointed asynchronously to prevent long lineages The system keeps track of fine-grained, partition-level lineage across the views, ones, and counts datasets in each interval. We don't want the lineage to grow too long, as that would cause a lot of recomputation, so we periodically checkpoint some of the state RDDs to stable storage and cut off the lineage. Note that these datasets are immutable, so checkpointing can be done asynchronously without blocking the computation.
21
Parallel Fault Recovery
Lineage is used to recompute partitions lost due to failures Datasets on different time steps recomputed in parallel Partitions within a dataset also recomputed in parallel This lineage is used to recover lost data partitions in parallel. For example, suppose some nodes die and a few partitions here and there are lost. Since datasets in different time intervals do not depend on each other, they are recomputed in parallel by re-running the map and reduce tasks that generated them. Similarly, partitions within the same dataset are also recomputed in parallel. Now compare this to upstream backup.
22
Comparison to Upstream Backup
Faster recovery than upstream backup, without the 2x cost of node replication Discretized Stream Processing: parallelism within a batch and parallelism across time intervals Upstream Backup: state stream replayed serially Upstream backup replays the whole stream serially to one node to recreate the state that was lost. In our model, however, the lineage exposes everything that can be recomputed in parallel: parallelism within a dataset as well as across time, that is, across datasets in different time intervals. This ensures faster recovery than upstream backup without incurring the double cost of node replication.
23
How much faster than Upstream Backup?
System capacity: assume the maximum workload a machine can process is 1 per minute (i.e. CPU usage is 100%) Actual workload: each node's actual workload is λ (0 ≤ λ ≤ 1) per minute (i.e. actual CPU usage is λ) There are N nodes in total in the cluster Assume we checkpoint once per minute, and a failure recovers from the last checkpoint If one node fails, the lost λ of workload must be recomputed, and recovery proceeds at maximum capacity (i.e. at 100% CPU usage)
24
How much faster than Upstream Backup?
For upstream backup, recovery takes t_up minutes:
The single backup node has to process the lost workload λ while still ingesting λ of new workload per minute during the recovery
t_up * 1 = λ + t_up * λ, so t_up = λ / (1 − λ)
For parallel recovery, recovery takes t_par minutes:
The remaining N − 1 nodes share the lost workload λ (λ / (N − 1) each), and meanwhile each node also ingests λ * N / (N − 1) of new workload per minute (its own stream plus its share of the failed node's stream)
t_par * 1 = λ / (N − 1) + t_par * λ * N / (N − 1), so t_par = λ / (N(1 − λ) − 1) ≈ λ / (N(1 − λ))
With more nodes, parallel recovery catches up with the arriving stream much faster than upstream backup
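A quick numerical sanity check of these two estimates (a small sketch; the chosen values of λ and N are arbitrary):

```scala
// Numerically compare the two recovery-time estimates derived above:
//   upstream backup:   t_up  = λ / (1 − λ)
//   parallel recovery: t_par = λ / (N(1 − λ) − 1)
object RecoveryTimeCheck extends App {
  def tUp(lambda: Double): Double          = lambda / (1 - lambda)
  def tPar(lambda: Double, n: Int): Double = lambda / (n * (1 - lambda) - 1)

  for (lambda <- Seq(0.3, 0.5, 0.7); n <- Seq(5, 10, 20))
    println(f"load=$lambda%.1f N=$n%2d  t_up=${tUp(lambda)}%5.2f min  t_par=${tPar(lambda, n)}%5.2f min")
}
```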
25
How much faster than Upstream Backup?
Recovery time = time taken to recompute and catch up Depends on available resources in the cluster Lower system load before failure allows faster recovery Parallel recovery with 10 nodes faster than 5 nodes Parallel recovery with 5 nodes faster than upstream backup How much faster can we be than upstream backup? For this we did a theoretical analysis. We define the recovery time as the time taken to recompute the lost data and catch up with the live stream. This naturally depends on the available resources in the cluster, which in turn depend on the system load prior to the failure. The graph shows recovery time for different system loads, and naturally lower system load leads to faster recovery. Note that parallel recovery with 5 nodes is significantly faster than upstream backup, and parallel recovery with 10 nodes is faster than with 5 nodes, because with 10 nodes there is more parallelism available for fault recovery. This is another very nice property of the model: larger clusters let you recover faster.
26
Parallel Straggler Recovery
Straggler mitigation techniques Detect slow tasks (e.g. 2x slower than other tasks) Speculatively launch more copies of the tasks in parallel on other machines Masks the impact of slow nodes on the progress of the system The same principles can be used to recover from stragglers as well. Straggler mitigation in batch processing typically works as follows. First, slow or straggling tasks are detected by some criterion, for example tasks running more than 2 times slower than other tasks in the same stage. These tasks are then speculatively re-executed by running more copies of them in parallel on other machines; whichever copy finishes first wins. Since these tasks are stateless and their output immutable, running multiple copies and letting them race to the finish is semantically equivalent. This essentially masks the impact of slow nodes on the progress of the system.
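A toy illustration of such a detection rule (not Spark's actual scheduler logic; the threshold, task representation, and numbers are made up):

```scala
// Toy illustration of straggler detection: flag tasks whose running time
// exceeds some multiple of the median task time, and mark them for
// speculative re-execution on another node.
object StragglerDetectionSketch extends App {
  case class RunningTask(id: Int, elapsedSec: Double)

  def speculationCandidates(tasks: Seq[RunningTask], factor: Double = 2.0): Seq[RunningTask] = {
    val sortedTimes = tasks.map(_.elapsedSec).sorted
    val median = sortedTimes(sortedTimes.size / 2)
    tasks.filter(_.elapsedSec > factor * median)   // e.g. more than 2x slower than typical
  }

  val tasks = Seq(RunningTask(1, 1.1), RunningTask(2, 0.9), RunningTask(3, 1.0), RunningTask(4, 4.2))
  println(speculationCandidates(tasks))   // List(RunningTask(4,4.2))
}
```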
27
Master recovery Stores D-Stream metadata in HDFS
The DAG graph (the user’s D-Streams and Scala function objects representing user code) The time of the last checkpoint The IDs of RDDs since the checkpoint Upon recovery, the new master reads this metadata from HDFS, reconnects to the workers, fetches the locations of the RDD partitions (i.e. which node each partition resides on), and resumes processing of each missed timestep
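A hypothetical sketch of what this persisted metadata might look like as a data type; the field names are mine, not the paper's:

```scala
// Hypothetical sketch of the kind of D-Stream metadata the master persists
// to HDFS so that a new master can take over (field names are illustrative).
case class MasterMetadata(
  dstreamGraph: Array[Byte],          // serialized DAG of D-Streams and user function objects
  lastCheckpointTime: Long,           // time of the last checkpoint
  rddIdsSinceCheckpoint: Seq[Int]     // IDs of the RDDs computed since that checkpoint
)
```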
28
Spark Streaming Implemented using Spark processing engine*
Spark allows datasets to be stored in memory, and automatically recovers them using lineage Modifications required to reduce job launching overheads from seconds to milliseconds To evaluate this model, we built Spark Streaming on top of the Spark processing engine. For those who are not familiar with Spark, it is a fast batch processing engine that allows datasets to be stored in memory and automatically recovers them using lineage. We had to make significant modifications to Spark to reduce the job launching overhead from seconds to milliseconds, in order to achieve sub-second end-to-end latencies. [ *Resilient Distributed Datasets - NSDI, 2012 ]
29
System Architecture What is added or modified over Spark
30
System Architecture Master, Workers, and Client library
Master: tracks the D-Stream lineage graph and schedules tasks to compute new RDD partitions. Workers: receive data, store the partitions of input and computed RDDs, and execute tasks. Client library: sends data into the system.
31
Optimizations for stream processing
Network communication Use asynchronous I/O to let tasks with remote inputs (e.g. the reduce tasks) fetch their input RDDs faster Timestep pipelining Tasks inside each timestep may not perfectly utilize the cluster (e.g., at the end of a timestep, there might only be a few tasks left running), so Spark Streaming allows tasks from the next timestep to start before the current one has finished Task scheduling Optimized the Spark framework (e.g. by reducing the size of control messages) to allow launching parallel jobs of hundreds of tasks every few hundred milliseconds
32
Optimizations for stream processing
Storage layer Rewrote Spark’s storage layer to support asynchronous checkpointing Checkpointing proceeds without blocking computation or slowing down jobs Lineage cutoff The lineage graph between RDDs in D-Streams can grow indefinitely, so the scheduler was modified to forget lineage after an RDD has been checkpointed
33
Consistency Exactly-once processing across the cluster
Consistency guaranteed by the determinism of computations and the separate naming of RDDs from different intervals This holds because D-Streams is essentially (small-)batch processing rather than a real-time querying system, so, for example, the same count will never be computed by two machines.
34
Evaluation
35
How fast is Spark Streaming?
Can process 60M records/second on 100 nodes at 1 second latency Tested with EC2 instances and 100 streams of text Count the sentences having a keyword (Grep) WordCount over 30 sec sliding window Spark Streaming can process 60 million records per second on 100 nodes with an end-to-end latency of 1 second. This was tested on EC2 instances with 100 streams of text. The first workload was Grep, which counts the number of sentences containing a keyword. We also tried a stateful workload, WordCount over a 30-second sliding window; its throughput was somewhat lower because of the increased computation, but still in the range of tens of millions of records per second. In both cases, the throughput scales linearly with the size of the cluster. Furthermore, if you allow a 2-second latency, the throughput goes up a little more.
36
How does it compare to others?
Throughput comparable to other commercial stream processing systems Throughput per core [records/sec]: Spark Streaming 160k; Oracle CEP 125k; Esper 100k; StreamBase 30k; Storm (value not shown) If you compare with other commercial streaming systems, Spark Streaming has per-core performance comparable to the numbers reported by Oracle CEP, Esper and StreamBase. Note that those three are single-node systems, not distributed. [ Refer to the paper for citations ]
37
How fast can it recover from faults?
Recovery time improves with more frequent checkpointing and more nodes Failure Word Count over 30 sec window We tested how fast the system can recover from faults. The graph shows the time taken to process each batch before and after a failure: right after the failure the processing time spikes while the lost partitions are recomputed, and it then falls back to normal, with recovery being faster when checkpointing is more frequent and when more nodes are available.
38
How fast can it recover from stragglers?
Speculative execution of slow tasks masks the effect of stragglers We also tested how the system performs with stragglers by slowing down a node. Without the straggler, the processing time for both workloads was about half a second; with the straggler it increased to 2 to 3 seconds; with speculative execution it came back to levels comparable to the original half second. This shows that speculative execution can mask the effect of stragglers. To our knowledge, no other system has shown this kind of evaluation.
39
Unification with Batch and Interactive Processing
40
Unification with Batch and Interactive Processing
Discretized Streams create a single programming and execution model for running streaming, batch and interactive jobs Combine live data streams with historic data liveCounts.join(historicCounts).map(...) Interactively query live streams liveCounts.slice(“21:00”, “21:05”).count()
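As a hedged sketch of how these two patterns map onto the public Spark Streaming API (host, port, file path, and the 30-second window are illustrative; the slide's liveCounts.join(historicCounts) roughly corresponds to transform plus an RDD join):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LivePlusHistoricSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LivePlusHistoric").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Historic per-URL counts, e.g. precomputed from old logs (path illustrative).
    val historicCounts = ssc.sparkContext
      .textFile("/data/historic_counts.tsv")
      .map(_.split("\t"))
      .map(fields => (fields(0), fields(1).toLong))

    // Live page views counted over a 30-second sliding window, updated every second.
    val liveCounts = ssc.socketTextStream("localhost", 9999)
      .map(url => (url, 1L))
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(1))

    // Each micro-batch of the DStream is an ordinary RDD, so it can be joined
    // directly with the historic RDD.
    val combined = liveCounts.transform(_.join(historicCounts))
    combined.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```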
41
App combining live + historic data
Mobile Millennium Project: Real-time estimation of traffic transit times using live and past GPS observations Markov chain Monte Carlo simulations on GPS observations Very CPU intensive Scales linearly with cluster size
42
Recent Related Work Naiad – Full cluster rollback on recovery
SEEP – Extends continuous operators to enable parallel recovery, but does not handle stragglers TimeStream – Recovery similar to upstream backup MillWheel – State stored in BigTable, transactions per state update can be expensive Naiad, the paper you are going to hear about next, takes a completely different approach to unifying different processing models without sacrificing latency, which is great. But in terms of fault tolerance, its state is checkpointed synchronously, compared to our asynchronous checkpointing, and in case of failure all the nodes in the cluster have to roll back to their previous checkpoints, which can be very costly. SEEP extends continuous operators to enable parallel recovery of their state, but it requires invasive rewriting of the operators; in our case, parallel recovery falls out of the batch model without changing the operators.
43
Takeaways Large scale streaming systems must handle faults and stragglers Discretized Streams model streaming computation as a series of batch jobs Uses simple techniques to exploit parallelism in streams Scales to 100 nodes with 1 second latency Recovers from failures and stragglers very fast Spark Streaming is open source - spark-project.org Used in production by ~ 10 organizations!