Dryad and dataflow systems

Dryad and dataflow systems Michael Isard misard@microsoft.com Microsoft Research 4th June, 2014

Talk outline Why is dataflow so useful? What is Dryad? An engineering sweet spot Beyond Dryad Conclusions

Computation on large datasets. Performance is mostly about efficient resource use: locality (data placed correctly in the memory hierarchy) and scheduling (get enough work done before being interrupted). Decompose the work into independent batches: parallel computation means controlling communication and synchronization; distributed computation means writes must be explicitly shared.

Computational model. Vertices are independent, with explicit state and scheduling and explicit batching and communication; this is what makes dataflow so powerful. (Diagram: processing vertices connected by channels, with inputs and outputs.)

Why dataflow now? A collection-oriented programming model: operations on collections of objects. Turn a spurious (unordered) for into a foreach (though not every for is a foreach). Aggregation (sum, count, max, etc.), grouping, Join, Zip, iteration. LINQ since ca. 2008; now Spark via Scala and Java.
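
As a small illustration of the for-into-foreach point (my example, not from the talk), the imperative loop below fixes an iteration order the computation never needed, while the collection-oriented version expresses the same aggregation with no order implied, which is exactly what lets a system parallelize it.

using System;
using System.Linq;

class ForVersusForeach
{
    static void Main()
    {
        int[] values = { 3, 1, 4, 1, 5, 9, 2, 6 };

        // Spurious (unordered) for: nothing here depends on i increasing.
        int total = 0, max = int.MinValue;
        for (int i = 0; i < values.Length; i++)
        {
            total += values[i];
            if (values[i] > max) max = values[i];
        }

        // Declarative aggregation over the collection: no order implied,
        // so the work can be re-ordered, partitioned, or batched freely.
        int total2 = values.Sum();
        int max2 = values.Max();

        Console.WriteLine($"{total}=={total2}, {max}=={max2}");
    }
}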

Well-chosen syntactic sugar. Given some lines of text, find the most commonly occurring words: read the lines from a file, split each line into its constituent words, count how many times each word appears, and find the words with the highest counts.

var lines = FS.ReadAsLines(inputFileName);
var words = lines.SelectMany(x => x.Split(' '));
var counts = words.CountInGroups();
var highest = counts.OrderByDescending(x => x.count).Take(10);

(Slide example: the words red, blue, yellow produce counts such as red,2 blue,4 yellow,3 in a Collection<KeyValuePair<string,int>>.)

The sugar at work is type inference (var), lambda expressions, and generics with extension methods. Compare the untyped callback

int SortKey(void* x) { return ((KeyValuePair<string,int>*)x)->count; }

with the generic, statically typed

int SortKey(KeyValuePair<string,int> x) { return x.count; }

and the collection-specific

FooCollection FooTake(FooCollection c, int count) { … }

with the generic extension method

Collection<T> Take(this Collection<T> c, int count) { … }
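
To make the query above easy to try locally, here is a self-contained C# version of the same word count using standard LINQ-to-Objects operators in place of the DryadLINQ-style helpers on the slide; mapping FS.ReadAsLines to File.ReadLines and CountInGroups to GroupBy/Count is my assumption, not something stated in the talk.

using System;
using System.IO;
using System.Linq;

class WordCount
{
    static void Main(string[] args)
    {
        string inputFileName = args[0];   // e.g. WordCount.exe lines.txt

        var lines = File.ReadLines(inputFileName);              // FS.ReadAsLines on the slide
        var words = lines.SelectMany(x => x.Split(' '));
        var counts = words.GroupBy(w => w)                      // CountInGroups on the slide
                          .Select(g => new { word = g.Key, count = g.Count() });
        var highest = counts.OrderByDescending(x => x.count).Take(10);

        foreach (var x in highest)
            Console.WriteLine($"{x.word},{x.count}");           // e.g. blue,4
    }
}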

Collections compile to dataflow. Each operator specifies a single data-parallel step, and communication between steps is explicit: collections reference collections, not individual objects! Communication is under the control of the system, which can partition, pipeline, and exchange automatically. The LINQ innovation is embedded user-defined functions, e.g. var words = lines.SelectMany(x => x.Split(' ')). This is very expressive, and the programmer 'naturally' writes pure functions.
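
To make the "single data-parallel step" idea concrete, here is a toy sketch (my own illustration, not Dryad's API): because SelectMany only names a pure per-record function, the same lambda can be applied independently to every partition of a distributed collection, leaving all cross-partition communication to the system.

using System;
using System.Collections.Generic;
using System.Linq;

static class PartitionedOps
{
    // A logical collection represented as a set of partitions; in Dryad each
    // partition would be processed by its own vertex, usually on a different machine.
    public static IEnumerable<IEnumerable<TOut>> SelectManyPerPartition<TIn, TOut>(
        this IEnumerable<IEnumerable<TIn>> partitions,
        Func<TIn, IEnumerable<TOut>> userFunction)
    {
        foreach (var partition in partitions)
            yield return partition.SelectMany(userFunction);   // same UDF, every partition
    }
}

class Demo
{
    static void Main()
    {
        IEnumerable<IEnumerable<string>> linePartitions = new[]
        {
            new[] { "red blue", "blue yellow" },   // partition 0
            new[] { "yellow yellow red" },         // partition 1
        };

        // The user-defined function is shipped unchanged to every partition.
        var wordPartitions = linePartitions.SelectManyPerPartition(x => x.Split(' '));

        foreach (var partition in wordPartitions)
            Console.WriteLine(string.Join(",", partition));
    }
}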

Distributed sorting. var sorted = set.OrderBy(x => x.key); compiles to a plan that samples the set, computes a histogram of keys, range-partitions the records by key, and sorts each partition locally to produce the sorted output.
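
Below is a single-process sketch of that plan (mine, for intuition only): sample the keys, use the sample as a histogram to choose range boundaries, partition every record by range, then sort each partition locally; concatenating the partitions in boundary order yields a globally sorted result.

using System;
using System.Linq;

class RangePartitionSort
{
    static void Main()
    {
        var random = new Random(0);
        int[] set = Enumerable.Range(0, 10000).Select(_ => random.Next(1000000)).ToArray();
        int numPartitions = 4;

        // 1. Sample the keys and pick approximately balanced range boundaries.
        var sample = set.Where((key, index) => index % 100 == 0).OrderBy(k => k).ToArray();
        var boundaries = Enumerable.Range(1, numPartitions - 1)
                                   .Select(i => sample[i * sample.Length / numPartitions])
                                   .ToArray();

        // 2. Range-partition by key: every key in partition i precedes every key in partition i+1.
        var partitions = set.GroupBy(k => PartitionOf(k, boundaries)).OrderBy(g => g.Key);

        // 3. Sort locally within each partition; the concatenation is globally sorted.
        var sorted = partitions.SelectMany(g => g.OrderBy(k => k)).ToArray();

        Console.WriteLine(sorted.SequenceEqual(set.OrderBy(k => k)));   // prints True
    }

    static int PartitionOf(int key, int[] boundaries)
    {
        int i = 0;
        while (i < boundaries.Length && key >= boundaries[i]) i++;
        return i;
    }
}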

Quiet revolution in parallelism. The programming model is more attractive: simpler, more concise, readable, maintainable. The program is easier to optimize: the programmer separates computation from communication, so the system can re-order, distribute, batch, etc.

Talk outline Why is dataflow so useful? What is Dryad? An engineering sweet spot Beyond Dryad Conclusions

What is Dryad? A general-purpose DAG execution engine (ca. 2005), cited as inspiration for e.g. Hyracks and Tez. The engine behind Microsoft Cosmos/SCOPE: initially MSN Search/Bing, now used throughout MSFT. The core of a research batch cluster environment (ca. 2009): DryadLINQ, the Quincy scheduler, TidyFS.

What Dryad does. Abstracts cluster resources: the set of computers, the network topology, etc. Recovers from transient failures: reruns computations on a machine or network fault, and speculates duplicates for slow computations. Schedules a local DAG of work at each vertex.

Scheduling and fault tolerance. The DAG makes things easy: schedule from source to sink in any order, re-execute a subgraph on failure, and execute "duplicates" for slow vertices.
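
A toy sketch of why the DAG makes this easy (my illustration, not Dryad's scheduler): any vertex whose inputs are complete can run, in any order; a vertex that fails is simply left unfinished and retried, because re-running a deterministic vertex regenerates the same outputs. Speculative duplicates for slow vertices are omitted here.

using System;
using System.Collections.Generic;
using System.Linq;

class ToyDagScheduler
{
    // upstream[v] lists the vertices whose outputs v reads.
    // tryExecute runs a vertex and returns false on a transient failure.
    static void Run(Dictionary<string, List<string>> upstream, Func<string, bool> tryExecute)
    {
        var done = new HashSet<string>();
        while (done.Count < upstream.Count)
        {
            // Ready = not yet done, and every input already produced.
            var ready = upstream.Keys
                .Where(v => !done.Contains(v) && upstream[v].All(done.Contains))
                .ToList();

            foreach (var v in ready)
                if (tryExecute(v))
                    done.Add(v);
                // On failure, v stays unscheduled and is retried on a later pass.
        }
    }

    static void Main()
    {
        var dag = new Dictionary<string, List<string>>
        {
            ["read"]  = new List<string>(),
            ["sort"]  = new List<string> { "read" },
            ["merge"] = new List<string> { "sort" },
        };
        var rng = new Random(0);
        // Each attempt fails 30% of the time to simulate transient faults.
        Run(dag, v =>
        {
            bool ok = rng.NextDouble() > 0.3;
            Console.WriteLine($"{v}: {(ok ? "ok" : "failed, will retry")}");
            return ok;
        });
    }
}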

Resources are virtualized. Each graph vertex is a process: it writes its outputs to disk (usually) and reads its inputs from upstream nodes' output files. The graph is generally larger than cluster RAM: a 1 TB partitioned input at a 250 MB part size is 4,000 parts. The cluster is shared, so don't size the program for an exact cluster; use whatever share of resources is available.

Integrated system. Collection-oriented programming model (LINQ). Partitioned file system (TidyFS): manages replication and distribution of large data. Cluster scheduler (Quincy): jointly schedules multiple jobs at a time, with fine-grain multiplexing between jobs, balancing locality and fairness. Monitoring and debugging (Artemis): within a job and across jobs.

Dryad Cluster Scheduling (scheduler diagram).

Quincy without preemption

Quincy with preemption

Dryad features. Well tested at scales of up to 15k cluster computers, and in heavy production use for 8 years. The dataflow graph is mutable at runtime: repartition to avoid skew, specialize matrices as dense or sparse, harden fault tolerance.

Talk outline Why is dataflow so useful? What is Dryad? An engineering sweet spot Beyond Dryad Conclusions

Stateless DAG dataflow: MapReduce, Dryad, Spark, … The stateless-vertex constraint hampers performance (iteration and streaming overheads). Why does this design keep repeating?

Software engineering. Fault tolerance is well understood (e.g., Chandy-Lamport, rollback recovery, etc.); the basic mechanism is a checkpoint plus a log. A stateless DAG needs no checkpoint! The programming model "tricked" the user: all communication is on typed channels, so only channel data needs to be persisted, and fault tolerance comes without programmer effort, even with UDFs.
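
A minimal sketch of that point (my example, not Dryad code): a vertex is just a function of its typed input channels, so losing its output requires no checkpoint of internal state; the scheduler re-reads the persisted channel data and re-runs the vertex.

using System;
using System.Collections.Generic;
using System.Linq;

class ReexecutionSketch
{
    // A vertex is a pure function from its input channel to its output channel,
    // even though it contains a user-defined function.
    static IEnumerable<KeyValuePair<string, int>> WordCountVertex(IEnumerable<string> lines) =>
        lines.SelectMany(l => l.Split(' '))
             .GroupBy(w => w)
             .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()));

    static void Main()
    {
        // The persisted output of the upstream vertex (its channel data).
        var inputChannel = new[] { "red blue", "blue yellow blue" };

        var output = WordCountVertex(inputChannel).ToList();

        // If the file holding 'output' is lost, recovery is just re-execution
        // over the same channel data: there is no vertex state to checkpoint.
        var recovered = WordCountVertex(inputChannel).ToList();

        Console.WriteLine(output.SequenceEqual(recovered));   // prints True
    }
}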

Talk outline Why is dataflow so useful? What is Dryad? An engineering sweet spot Beyond Dryad Conclusions

What about stateful dataflow? Naiad: add state to vertices, and support streaming and iteration. Opportunities: much lower latency, and mutable state can be modeled with dataflow. Challenges: scheduling, coordination, fault tolerance.

Timely dataflow: batch processing, stream processing, and graph processing. For example, we have batch-processing frameworks like MapReduce, Dryad and Spark, which are based on stateless, synchronous task execution; stream-processing systems like Storm and SEEP that use an asynchronous streaming model; and graph-processing systems like Pregel and GraphLab that use a vertex-oriented execution model. If you want to write a program that combines these models, you end up having to use the lowest common denominator, which is batch processing. For some workloads this works just fine: in the last talk you heard about how a batch system – Spark – can handle many stream-processing workloads with sub-second latency. But what if you wanted to feed your stream into a graph algorithm? Or combine the result of a graph-based machine learning algorithm with the results of a batch job? We've seen over the years that batch systems are inadequate for those sorts of tasks, which is why so many specialized systems have bloomed. With timely dataflow, we have aimed to discover a better common substrate that supports all of these different workloads and their combinations, and to use it to build a system that achieves the performance of a specialized system while retaining the flexibility of a generic framework that lets you compose these styles together.

Batching (synchronous) vs. streaming (asynchronous). Now, we could use a batch execution model to run this dataflow, but to understand why this isn't always sufficient, let's look at how one of these dataflows executes, by considering a single vertex. In a batch system, like Spark or Dryad, execution is synchronous: all of the inputs for a vertex must be available before the stage becomes active; it then runs for a while, and produces its outputs. By contrast, a streaming system, like Storm, is asynchronous: a vertex is always active, and when input comes in from any source, it can produce output immediately. The two approaches have different strengths: asynchronous streaming is typically more efficient in a distributed system, because independent vertices can run without coordination; synchronous execution is sometimes necessary, for example if a vertex has to compute an aggregate value like a count or an average, but it requires the system to coordinate. In short: batching requires coordination but supports aggregation; streaming needs no coordination but makes aggregation difficult.

Batch DAG execution (central coordinator).

Streaming DAG execution (inline coordination).

Batch iteration (central coordinator). Returning to our cyclic graph, we see that if all execution is in synchronous batch mode, a batch of data arrives at the input and then moves through the vertices in lock step before reaching the output and being emitted. The nice thing is that there's a one-to-one correspondence between input and output, but the cost is in the latency – there's no overlapping of computation within the graph.

Streaming iteration. In the streaming setting, all vertices remain active, so as data streams in it flows around the loop and output is produced continuously. The latency can be much lower, but the correspondence between input and output is lost, which makes it hard to do things like keep a sliding window of counts per second, as we saw in the previous talk.

Messages. Messages are delivered asynchronously: B.SendBy(edge, message, time) on the sending vertex becomes C.OnRecv(edge, message, time) on the receiver. The asynchronous primitives are for message exchange, and indeed all message exchange in a timely dataflow graph is asynchronous. When a vertex B is running, it can invoke SendBy, specifying one of its outgoing edges, a message body, and a time. The time identifies records that arrived in the same batch – for now think of it as a sequence number – but since message delivery is asynchronous, the system can deliver the message without having to put it in order. The system delivers this message to vertex C, which receives a callback to its OnRecv method with the same edge, message and time. Just these primitives let you build any vertex that works for asynchronous streaming: in particular, things like transformations and filters, but also things like joins of two streams. Sometimes you need synchronization, however…

Notifications support batching: C.SendBy(_, _, time); D.NotifyAt(time); D.OnRecv(_, _, time); D.OnNotify(time) once no more messages at time or earlier can arrive. …and that is the purpose of notifications, which are delivered in a specific order so that we can simulate batching. When it is active, vertex D can request a notification by calling the NotifyAt primitive, specifying a time at which it should be notified. This doesn't block the system, so D can continue to receive messages and do some processing in OnRecv. It finally receives the notification, through its OnNotify callback, once all upstream work at earlier times has completed. Once the notification is received, D knows that it has seen every message it will ever see at that time. In this respect, notifications are similar to punctuations in stream processing: they guarantee that a vertex will never again see any messages before a certain time. OnNotify can therefore perform work that depends on knowing every value for a particular time. This is particularly useful for computing the final value of an aggregation, but it can also be useful for performing cleanup at the end of an operation.
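
To show how OnRecv and OnNotify fit together, here is a schematic per-batch counting vertex written against the primitives named on this slide; it is only a sketch in their spirit, and the real Naiad interfaces differ in detail.

using System.Collections.Generic;

class CountingVertex
{
    // Partial counts buffered per logical time (batch).
    readonly Dictionary<int, Dictionary<string, int>> countsByTime =
        new Dictionary<int, Dictionary<string, int>>();

    // Asynchronous path: called for every incoming message.
    public void OnRecv(string word, int time)
    {
        if (!countsByTime.TryGetValue(time, out var counts))
            countsByTime[time] = counts = new Dictionary<string, int>();
        counts[word] = counts.TryGetValue(word, out var c) ? c + 1 : 1;
        NotifyAt(time);   // ask to be told when no more messages at 'time' can arrive
    }

    // Ordered path: called once every message at 'time' or earlier has been seen.
    public void OnNotify(int time)
    {
        foreach (var kv in countsByTime[time])
            SendBy(kv.Key, kv.Value, time);   // emit the final counts for the batch
        countsByTime.Remove(time);            // cleanup at the end of the batch
    }

    // Stand-ins for the system-provided primitives named on the slide.
    void NotifyAt(int time) { /* the system will later call OnNotify(time) */ }
    void SendBy(string word, int count, int time) { /* deliver downstream on an edge */ }
}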

Coordination in timely dataflow. Local scheduling with global progress tracking: coordination happens through a shared counter, not a scheduler, and it has an efficient, scalable implementation.
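
A deliberately simplified, single-process illustration of the shared-counter idea (mine, not Naiad's distributed protocol): track how much work is outstanding at each timestamp, and allow a notification at time t only when nothing is outstanding at t or any earlier time.

using System.Collections.Generic;
using System.Linq;

class ProgressTracker
{
    // Outstanding messages and notification requests, keyed by timestamp.
    readonly SortedDictionary<int, int> outstanding = new SortedDictionary<int, int>();

    public void Produced(int time)          // a vertex sent a message at 'time'
    {
        outstanding.TryGetValue(time, out var count);
        outstanding[time] = count + 1;
    }

    public void Consumed(int time)          // that message has now been processed
    {
        if (--outstanding[time] == 0)
            outstanding.Remove(time);
    }

    // A notification at 'time' may be delivered once no work remains at 'time'
    // or earlier; this single global fact is all the workers need to agree on.
    public bool CanNotify(int time) => outstanding.Keys.All(t => t > time);
}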

Interactive graph analysis (dataflow diagram: 32K tweets/s in, #x and @y extraction, ⋈ joins, max, 10 queries/s of z? lookups). PageRank on a static graph is pretty dull, and doesn't stretch Naiad's functionality. For the finale, I'm going to revisit the application that I started with, where we're going to process tweets incrementally in a graph algorithm, combine that with the extracted hashtags, and then run interactive queries against the result. To make things concrete, we'll ingest 32 thousand tweets per second on a cluster of 32 machines, and take in one query every hundred milliseconds. Note that this doesn't come anywhere close to saturating the query throughput, but it does let us see the latency at a fine granularity.

Query latency: max 140 ms, 99th percentile 70 ms, median 5.2 ms, on 32 servers, each an 8-core 2.1 GHz AMD Opteron with 16 GB RAM, connected by Gigabit Ethernet. There are more results in the paper, but this one captures things the best. Every one hundred milliseconds, a username query is generated at random and submitted to the computation. The graph shows a time series of the latencies of those queries, with latency on a log scale. There are 200 queries in this 20-second fragment of the trace. All but one of them return in under 100 milliseconds, the 99th percentile is 70 milliseconds, and the median latency is just over 5 milliseconds. The latency spikes are due to interactive queries queuing behind the "batch update" work in the connected components. We could probably improve this by encouraging the batch work to yield, but that remains future work.

Mutable state. In batch DAG systems collections are immutable: each has a functional definition in terms of the preceding subgraph. Adding streaming or iteration introduces mutability: a collection varies as a function of the epoch or loop iteration.

Key-value store as dataflow: var lookup = data.join(query, d => d.key, q => q.key). This models random access with dataflow… adding or removing a key is a streaming update to data, and looking up a key is a streaming update to query. High throughput requires batching, but that was true anyway, in general.
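
As a local analogue of the line above (illustration only; in Naiad both collections would be streaming inputs updated epoch by epoch), the same join over in-memory collections shows the lookup semantics: each query key is answered from the current contents of data.

using System;
using System.Linq;

class KeyValueAsJoin
{
    static void Main()
    {
        var data = new[]
        {
            new { key = "a", value = 1 },
            new { key = "b", value = 2 },
        };
        var query = new[] { new { key = "b" } };

        // Look up each queried key by joining it against the data collection.
        var lookup = data.Join(query,
                               d => d.key,
                               q => q.key,
                               (d, q) => new { d.key, d.value });

        foreach (var result in lookup)
            Console.WriteLine($"{result.key} -> {result.value}");   // prints: b -> 2
    }
}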

What can't dataflow do? A programming model for mutable state? It is not as intuitive as functional collection manipulation. Policies for placement are still primitive: hash everything and hope. There are great research opportunities at the intersection of OS, network, runtime, and language.

Talk outline Why is dataflow so useful? What is Dryad? An engineering sweet spot Beyond Dryad Conclusions

Conclusions. Dataflow is a great structuring principle: we know good programming models, and we know how to write high-performance systems. Dataflow is the status quo for batch processing; mutable state is the current research frontier. Apache 2.0 licensed source on GitHub: http://research.microsoft.com/en-us/um/siliconvalley/projects/BigDataDev/