Discretized Streams: Fault-Tolerant Streaming Computation at Scale
Matei Zaharia, Tathagata Das (TD), Haoyuan (HY) Li, Timothy Hunter, Scott Shenker, Ion Stoica

Motivation
Many big-data applications need to process large data streams in near-real time:
– Website monitoring
– Fraud detection
– Ad monetization
These applications require tens to hundreds of nodes and second-scale latencies.

Challenge
Stream processing systems must recover from failures and stragglers quickly and efficiently
– This is more important for streaming systems than for batch systems
Traditional streaming systems do not achieve these properties simultaneously.

Outline
– Limitations of Traditional Streaming Systems
– Discretized Stream Processing
– Unification with Batch and Interactive Processing

Traditional Streaming Systems
Continuous operator model:
– Each node runs an operator with in-memory mutable state
– For each input record, the state is updated and new records are sent out
[figure: nodes 1-3 each running an operator over mutable state and exchanging input records]
Mutable state is lost if a node fails. Various techniques exist to make state fault-tolerant.

Fault-tolerance in Traditional Systems: Node Replication [e.g. Borealis, Flux]
– A separate set of "hot failover" nodes processes the same data streams
– Synchronization protocols ensure exact ordering of records in both sets
– On failure, the system switches over to the failover nodes
[figure: primary and hot failover nodes consuming the same input, kept in step by a sync protocol]
Fast recovery, but 2x hardware cost.

Fault-tolerance in Traditional Systems: Upstream Backup [e.g. TimeStream, Storm]
– Each node keeps a backup of the records it has forwarded since the last checkpoint
– A "cold failover" node is maintained
– On failure, upstream nodes replay the backed-up records serially to the failover node to recreate the state
[figure: upstream nodes buffering input and replaying the backup to a cold failover node]
Only one standby node is needed, but recovery is slow.

Slow Nodes in Traditional Systems
Neither node replication nor upstream backup handles stragglers.
[figure: a slow node delaying output under both node replication and upstream backup]

Our Goal
– Scale to hundreds of nodes
– Achieve second-scale latency
– Tolerate node failures and stragglers
– Sub-second fault and straggler recovery
– Minimal overhead beyond base processing

Why is it hard?
Stateful continuous operators tightly integrate "computation" with "mutable state", which makes it harder to define clear boundaries where computation and state can be moved around.
[figure: stateful continuous operators coupling mutable state with their input and output record streams]

Dissociate computation from state
Make state immutable and break computation into small, deterministic, stateless tasks. This defines clear boundaries where state and computation can be moved around independently.
[figure: stateless tasks reading immutable state and input datasets and producing new state datasets]

Batch Processing Systems!

Batch Processing Systems
Batch processing systems like MapReduce divide
– Data into small partitions
– Jobs into small, deterministic, stateless map / reduce tasks
[figure: an immutable input dataset flowing through stateless map tasks, immutable map outputs, and stateless reduce tasks into an immutable output dataset]

Parallel Recovery
Failed tasks are re-executed on the other nodes in parallel.
[figure: the map and reduce tasks of a failed node re-executed in parallel across the remaining nodes]

Discretized Stream Processing

– Run a streaming computation as a series of small, deterministic batch jobs
– Store intermediate state data in cluster memory
– Try to make batch sizes as small as possible to get second-scale latencies
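
To make the micro-batch idea concrete, here is a minimal, self-contained sketch in plain Scala. It is illustrative only, not Spark Streaming's actual scheduler: the runBatchJob helper and the hard-coded input batches are assumptions for the example.

// Illustrative micro-batch loop: the stream is processed as a series of small,
// deterministic batch jobs, each turning the previous immutable state dataset
// plus one interval of input into a new immutable state dataset.
object MicroBatchSketch {
  // One batch job: count URL occurrences in this interval and merge them into
  // the previous counts, producing a brand-new map rather than mutating state.
  def runBatchJob(input: Seq[String], prev: Map[String, Long]): Map[String, Long] =
    input.groupBy(identity).foldLeft(prev) { case (acc, (url, hits)) =>
      acc.updated(url, acc.getOrElse(url, 0L) + hits.size)
    }

  def main(args: Array[String]): Unit = {
    // Stand-in for a live stream that has already been split into 1-second batches.
    val batches = Seq(Seq("a.com", "b.com", "a.com"), Seq("b.com"), Seq.empty[String])
    var state = Map.empty[String, Long]   // immutable dataset, replaced every interval
    for (batch <- batches) {
      state = runBatchJob(batch, state)
      println(s"counts after this interval: $state")
    }
  }
}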

Discretized Stream Processing
In each interval (time = 0-1, time = 1-2, ...), the newly arrived input is processed with batch operations:
– Input: replicated dataset stored in memory
– Output or state: non-replicated dataset stored in memory
[figure: an input stream split into per-interval batches, each producing an in-memory output / state dataset]

Example: Counting page views
A Discretized Stream (DStream) is a sequence of immutable, partitioned datasets
– Can be created from live data streams or by applying bulk, parallel transformations on other DStreams

views = readStream("...", "1 sec")            // creating a DStream (source URL elided on the slide)
ones = views.map(ev => (ev.url, 1))           // transformation
counts = ones.runningReduce((x,y) => x+y)

[figure: the views, ones, and counts DStreams at successive intervals, linked by map and reduce steps]
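
The snippet above uses the paper's pseudocode. A rough sketch of the same computation in the released Spark Streaming Scala API is shown below; the socket source, host/port, and checkpoint directory are illustrative assumptions, and updateStateByKey plays the role of runningReduce.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PageViewCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PageViewCounts")
    val ssc  = new StreamingContext(conf, Seconds(1))    // 1-second batches
    ssc.checkpoint("/tmp/pageview-checkpoints")          // stateful operators need a checkpoint dir

    // Assumed source: a socket emitting one page-view URL per line.
    val views  = ssc.socketTextStream("localhost", 9999)
    val ones   = views.map(url => (url, 1))
    val counts = ones.updateStateByKey[Long] { (newOnes: Seq[Int], prev: Option[Long]) =>
      Some(prev.getOrElse(0L) + newOnes.sum)             // running count per URL
    }

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}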

Fine-grained Lineage
– Datasets track fine-grained operation lineage
– Datasets are periodically checkpointed asynchronously to prevent long lineage chains
[figure: lineage graph of the views, ones, and counts datasets across intervals]
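
Continuing the Spark Streaming sketch above, the checkpoint frequency of a stateful DStream can be tuned explicitly; the 10-second interval here is only an illustrative choice.

// Checkpoint the running counts every 10 seconds so lineage chains stay short;
// more frequent checkpoints mean less recomputation after a failure.
counts.checkpoint(Seconds(10))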

Parallel Fault Recovery
– Lineage is used to recompute partitions lost due to failures
– Datasets from different time steps are recomputed in parallel
– Partitions within a dataset are also recomputed in parallel
[figure: lost partitions of the views, ones, and counts datasets recomputed in parallel across intervals]
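
As a toy illustration of this recovery strategy (not Spark's actual recovery code), the sketch below rebuilds several lost partitions of the counts dataset from their lineage (views, then map, then reduce), with each partition recomputed in parallel via Scala futures; the partition contents are made up for the example.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Recompute one lost partition of `counts` from the corresponding `views` partition.
def recomputePartition(viewsPartition: Seq[String]): Map[String, Long] =
  viewsPartition.map(url => (url, 1L))                      // the map step
    .groupBy(_._1)
    .map { case (url, ones) => url -> ones.map(_._2).sum }  // the reduce step

// Several lost partitions, rebuilt concurrently rather than replayed serially.
val lostViewsPartitions = Seq(Seq("a.com", "b.com", "a.com"), Seq("b.com"))
val recovered = Await.result(
  Future.sequence(lostViewsPartitions.map(p => Future(recomputePartition(p)))),
  30.seconds
)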

Comparison to Upstream Backup
– Upstream backup: the stream is replayed serially to rebuild the lost state
– Discretized stream processing: recovery exploits parallelism both across time intervals and within each batch
Faster recovery than upstream backup, without the 2x cost of node replication.
[figure: serial replay in upstream backup versus parallel recomputation of the views, ones, and counts datasets]

How much faster than Upstream Backup?
Recovery time = time taken to recompute the lost state and catch up
– Depends on the resources available in the cluster
– Lower system load before the failure allows faster recovery
Parallel recovery with 5 nodes is faster than upstream backup, and parallel recovery with 10 nodes is faster than with 5 nodes.

Parallel Straggler Recovery
Straggler mitigation techniques:
– Detect slow tasks (e.g. 2x slower than other tasks)
– Speculatively launch additional copies of those tasks in parallel on other machines
This masks the impact of slow nodes on the progress of the system.
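
In Spark, this policy is exposed through speculative-execution settings. A rough configuration sketch follows; the thresholds are illustrative, and exact property names and defaults can vary across Spark versions.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("StreamingWithSpeculation")
  .set("spark.speculation", "true")            // re-launch copies of suspected straggler tasks
  .set("spark.speculation.multiplier", "2")    // a task is "slow" at roughly 2x the median runtime
  .set("spark.speculation.quantile", "0.75")   // start checking once 75% of tasks have finished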

Evaluation

Spark Streaming
Implemented using the Spark processing engine*
– Spark allows datasets to be stored in memory and automatically recovers them using lineage
– Modifications were required to reduce job-launching overheads from seconds to milliseconds
[* Resilient Distributed Datasets - NSDI, 2012]

How fast is Spark Streaming?
– Can process 60M records/second on 100 nodes at 1-second latency
– Tested on EC2 instances with 100 streams of text
Workloads: counting the sentences containing a keyword, and WordCount over a 30-second sliding window.

How does it compare to others?
Throughput is comparable to other commercial stream processing systems.
[Refer to the paper for citations]

How fast can it recover from faults?
Recovery time improves with more frequent checkpointing and with more nodes.
[figure: Word Count over a 30-second window; per-batch processing time spikes at the failure, then drops back]

How fast can it recover from stragglers?
Speculative execution of slow tasks masks the effect of stragglers.

Unification with Batch and Interactive Processing

Discretized Streams create a single programming and execution model for running streaming, batch and interactive jobs.

Combine live data streams with historical data:
liveCounts.join(historicCounts).map(...)

Interactively query live streams:
liveCounts.slice("21:00", "21:05").count()
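
In the released Spark Streaming API, roughly the same combination can be written with transform and slice. This is a sketch only: historicCounts is assumed to be a pre-loaded RDD[(String, Long)], liveCounts is the stateful DStream from the earlier sketch, and the five-minute window is an arbitrary choice.

import org.apache.spark.streaming.{Seconds, Time}

// Join each live batch of counts with a historical dataset (an ordinary RDD).
val combined = liveCounts.transform { rdd => rdd.join(historicCounts) }

// Interactive query over a recent slice of the live stream: slice returns the
// RDDs that fall in [from, to], which can then be counted like any batch data.
val to   = Time(System.currentTimeMillis())
val from = to - Seconds(5 * 60)
val recentTotal = liveCounts.slice(from, to).map(_.count()).sum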

App combining live + historic data
Mobile Millennium Project: real-time estimation of traffic transit times using live and past GPS observations
– Markov chain Monte Carlo simulations on GPS observations
– Very CPU intensive
– Scales linearly with cluster size

Recent Related Work
– Naiad: full cluster rollback on recovery
– SEEP: extends continuous operators to enable parallel recovery, but does not handle stragglers
– TimeStream: recovery similar to upstream backup
– MillWheel: state stored in BigTable; transactions per state update can be expensive

Takeaways
Discretized Streams model streaming computation as a series of batch jobs
– Uses simple techniques to exploit parallelism in streams
– Scales to 100 nodes with 1-second latency
– Recovers from failures and stragglers very fast
Spark Streaming is open source: spark-project.org
– Used in production by ~10 organizations!
Large-scale streaming systems must handle faults and stragglers.