Dryad and dataflow systems Michael Isard misard@microsoft.com Microsoft Research 4th June, 2014
Talk outline Why is dataflow so useful? What is Dryad? An engineering sweet spot Beyond Dryad Conclusions
Computation on large datasets Performance is mostly about efficient resource use: locality (data placed correctly in the memory hierarchy) and scheduling (get enough work done before being interrupted). Decompose into independent batches: parallel computation (control communication and synchronization); distributed computation (writes must be explicitly shared).
Computational model Vertices are independent. State and scheduling, plus explicit batching and communication, are what make dataflow so powerful. [Figure: processing vertices connected by channels, with inputs and outputs]
Why dataflow now? Collection-oriented programming model: operations on collections of objects. Turn spurious (unordered) for into foreach (not every for is a foreach). Aggregation (sum, count, max, etc.), grouping, Join, Zip, iteration. LINQ since ca. 2008; now Spark via Scala and Java.
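To make the for/foreach point concrete, here is a minimal C# sketch (a local List<string> stands in for a large distributed collection): the first loop imposes an index order the computation never needed, while the collection-oriented version expresses the same work as unordered operations a system is free to reorder and parallelize.

using System;
using System.Collections.Generic;
using System.Linq;

class ForVersusForeach
{
    static void Main()
    {
        var words = new List<string> { "red", "blue", "yellow", "blue" };

        // Spurious 'for': the index imposes an order the computation does not need.
        var lengths = new int[words.Count];
        for (int i = 0; i < words.Count; i++)
            lengths[i] = words[i].Length;

        // Collection-oriented version: no ordering constraint, free to run in parallel.
        var lengths2 = words.Select(w => w.Length);       // per-element transform
        var total = words.Sum(w => w.Length);             // aggregation
        var byFirstLetter = words.GroupBy(w => w[0]);     // grouping

        Console.WriteLine(string.Join(",", lengths2));    // 3,4,6,4
        Console.WriteLine(total);                         // 17
        Console.WriteLine(byFirstLetter.Count());         // 3
    }
}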
Generics and extension methods Given some lines of text, find the most commonly occurring words: read the lines from a file, split each line into its constituent words, count how many times each word appears, and find the words with the highest counts. The result is a Collection<KeyValuePair<string,int>> (e.g. input lines like "red blue yellow" yield pairs like red,2 blue,4 yellow,3). Well-chosen syntactic sugar. Generics: int SortKey(KeyValuePair<string,int> x) { return x.count; } rather than int SortKey(void* x) { return ((KeyValuePair<string,int>*)x)->count; } Type inference and lambda expressions: var lines = FS.ReadAsLines(inputFileName); var words = lines.SelectMany(x => x.Split(' ')); var counts = words.CountInGroups(); var highest = counts.OrderByDescending(x => x.count).Take(10); Extension methods: Collection<T> Take(this Collection<T> c, int count) { … } rather than FooCollection FooTake(FooCollection c, int count) { … }
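The pipeline on this slide is almost runnable as-is; the sketch below fills in the gaps under stated assumptions: CountInGroups is written here as the kind of generic extension method the slide is advocating (it is a DryadLINQ-style helper, not a standard LINQ operator), File.ReadLines stands in for FS.ReadAsLines, and the standard KeyValuePair's Value property replaces the slide's count field.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class CollectionExtensions
{
    // Generic extension method: works for any element type T, called as words.CountInGroups().
    public static IEnumerable<KeyValuePair<T, int>> CountInGroups<T>(this IEnumerable<T> items)
    {
        return items.GroupBy(x => x)
                    .Select(g => new KeyValuePair<T, int>(g.Key, g.Count()));
    }
}

class WordCount
{
    static void Main(string[] args)
    {
        string inputFileName = args[0];

        var lines = File.ReadLines(inputFileName);             // local stand-in for FS.ReadAsLines
        var words = lines.SelectMany(x => x.Split(' '));
        var counts = words.CountInGroups();
        var highest = counts.OrderByDescending(x => x.Value).Take(10);

        foreach (var kv in highest)
            Console.WriteLine($"{kv.Key},{kv.Value}");         // e.g. blue,4
    }
}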
Collections compile to dataflow Each operator specifies a single data-parallel step. Communication between steps is explicit: collections reference collections, not individual objects! Communication is under control of the system: partition, pipeline, and exchange automatically. LINQ innovation: embedded user-defined functions, e.g. var words = lines.SelectMany(x => x.Split(' ')); Very expressive, and the programmer 'naturally' writes pure functions.
Distributed sorting var sorted = set.OrderBy(x => x.key) compiles to a plan: sample the set, compute a histogram, range partition by key, sort each partition locally, and the result is sorted.
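A single-machine sketch of the plan this slide describes, for illustration only: sample the input, derive split points from the sorted sample (a stand-in for the histogram), range partition by key, and sort each partition locally. The sampling rate and partition count are arbitrary choices here, not Dryad's actual policy.

using System;
using System.Collections.Generic;
using System.Linq;

class RangePartitionSort
{
    static void Main()
    {
        var rng = new Random(42);
        int[] set = Enumerable.Range(0, 10000).Select(_ => rng.Next(1000000)).ToArray();
        int partitions = 4;

        // 1. Sample the set (here roughly 1% of the records).
        var sample = set.Where(_ => rng.NextDouble() < 0.01).OrderBy(x => x).ToArray();

        // 2. Derive partition boundaries from the sorted sample (a crude histogram).
        var splits = Enumerable.Range(1, partitions - 1)
                               .Select(i => sample[i * sample.Length / partitions])
                               .ToArray();

        // 3. Range partition by key: a record's partition is the number of splits at or below it.
        var parts = set.GroupBy(x => splits.Count(s => s <= x));

        // 4. Sort each partition locally; partitions concatenated in range order are globally sorted.
        var sorted = parts.OrderBy(g => g.Key)
                          .SelectMany(g => g.OrderBy(x => x))
                          .ToArray();

        Console.WriteLine(sorted.SequenceEqual(set.OrderBy(x => x)));   // True
    }
}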
Quiet revolution in parallelism Programming model is more attractive Simpler, more concise, readable, maintainable Program is easier to optimize Programmer separates computation and communication System can re-order, distribute, batch, etc. etc.
Talk outline Why is dataflow so useful? What is Dryad? An engineering sweet spot Beyond Dryad Conclusions
What is Dryad? General-purpose DAG execution engine ca 2005 Cited as inspiration for e.g. Hyracks, Tez Engine behind Microsoft Cosmos/SCOPE Initially MSN Search/Bing, now used throughout MSFT Core of research batch cluster environment ca 2009 DryadLINQ Quincy scheduler TidyFS
What Dryad does Abstracts cluster resources Set of computers, network topology, etc. Recovers from transient failures Rerun computations on machine or network fault Speculatively executes duplicates of slow computations Schedules a local DAG of work at each vertex
Scheduling and fault tolerance DAG makes things easy Schedule from source to sink in any order Re-execute subgraph on failure Execute “duplicates” for slow vertices
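A toy sketch of why the DAG makes scheduling and recovery easy, under much simpler assumptions than Dryad's: vertices are pure functions of their inputs, a vertex is runnable once all its inputs exist, and a lost output is repaired by re-running just the vertices whose outputs are missing. The Vertex and DagScheduler names are illustrative, not Dryad's API.

using System;
using System.Collections.Generic;
using System.Linq;

class Vertex
{
    public string Name;
    public List<Vertex> Inputs = new List<Vertex>();
    public Func<string[], string> Compute;     // pure function of the input channel data
    public string Output;                      // null until the vertex has run (or after a fault)
}

class DagScheduler
{
    // Run every vertex whose output is missing, in any order consistent with the DAG.
    public static void Run(List<Vertex> dag)
    {
        bool progress = true;
        while (progress)
        {
            progress = false;
            foreach (var v in dag)
            {
                if (v.Output != null) continue;                      // already done
                if (v.Inputs.Any(i => i.Output == null)) continue;   // not yet runnable
                v.Output = v.Compute(v.Inputs.Select(i => i.Output).ToArray());
                Console.WriteLine($"ran {v.Name}");
                progress = true;
            }
        }
    }

    static void Main()
    {
        var a = new Vertex { Name = "A", Compute = _ => "data" };
        var b = new Vertex { Name = "B", Inputs = { a }, Compute = x => x[0] + "+B" };
        var c = new Vertex { Name = "C", Inputs = { b }, Compute = x => x[0] + "+C" };
        var dag = new List<Vertex> { a, b, c };

        Run(dag);            // ran A, ran B, ran C

        b.Output = null;     // simulate losing B's output on a machine fault
        Run(dag);            // re-executes only B; A's and C's outputs are still present
    }
}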
Resources are virtualized Each graph vertex is a process Writes outputs to disk (usually) Reads inputs from upstream nodes’ output files Graph generally larger than cluster RAM 1TB partitioned input, 250MB part size, 4000 parts Cluster is shared Don’t size program for exact cluster Use whatever share of resources are available
Integrated system Collection-oriented programming model (LINQ) Partitioned file system (TidyFS) Manages replication and distribution of large data Cluster scheduler (Quincy) Jointly schedule multiple jobs at a time Fine-grain multiplexing between jobs Balance locality and fairness Monitoring and debugging (Artemis) Within job and across jobs
Dryad Cluster Scheduling [Figure: animation of the Dryad scheduler placing work on the cluster]
Quincy without preemption
Quincy with preemption
Dryad features Well-tested at scales up to 15k cluster computers In heavy production use for 8 years Dataflow graph is mutable at runtime Repartition to avoid skew Specialize matrices dense/sparse Harden fault-tolerance
Talk outline Why is dataflow so useful? What is Dryad? An engineering sweet spot Beyond Dryad Conclusions
Stateless DAG dataflow MapReduce, Dryad, Spark, … Stateless vertex constraint hampers performance Iteration and streaming overheads Why does this design keep repeating?
Software engineering Fault tolerance well understood E.g., Chandy-Lamport, rollback recovery, etc. Basic mechanism: checkpoint plus log Stateless DAG: no checkpoint! Programming model “tricked” user All communication on typed channels Only channel data needs to be persisted Fault tolerance comes without programmer effort Even with UDFs
Talk outline Why is dataflow so useful? What is Dryad? An engineering sweet spot Beyond Dryad Conclusions
What about stateful dataflow? Naiad Add state to vertices Support streaming and iteration Opportunities Much lower latency Can model mutable state with dataflow Challenges Scheduling Coordination Fault tolerance
Timely dataflow [Figure: batch processing, stream processing, and graph processing converging on timely dataflow] For example, we have: batch-processing frameworks like MapReduce, Dryad and Spark, which are based on stateless, synchronous task execution; stream-processing systems like Storm and SEEP that use an asynchronous streaming model; and graph-processing systems like Pregel and GraphLab that use a vertex-oriented execution model. If you want to write a program that combines these models, you end up having to use the lowest common denominator, which is batch processing. For some workloads this works just fine: in the last talk you heard how a batch system, Spark, can handle many stream-processing workloads with sub-second latency. But what if you wanted to feed your stream into a graph algorithm? Or combine the result of a graph-based machine learning algorithm with the results of a batch job? We've seen over the years that batch systems are inadequate for those sorts of tasks, which is why so many specialized systems have bloomed. With timely dataflow, we have aimed to discover a better common substrate that supports all of these different workloads and their combinations, and to use it to build a system that achieves the performance of a specialized system while retaining the flexibility of a generic framework that lets you compose these styles together.
Batching (synchronous) vs. streaming (asynchronous) Batching requires coordination but supports aggregation; streaming needs no coordination but makes aggregation difficult. Now, we could use a batch execution model to run this dataflow, but to understand why this isn't always sufficient, let's look at how one of these dataflows executes, by considering a single vertex. In a batch system, like Spark or Dryad, execution is synchronous: all of the inputs for a vertex must be available before the stage becomes active. It then runs for a while, and produces its outputs. By contrast, a streaming system, like Storm, is asynchronous: a vertex is always active, and when input comes in from any source, it can produce output immediately. The two approaches have different strengths: asynchronous streaming is typically more efficient in a distributed system, because independent vertices can run without coordination. Synchronous execution is sometimes necessary, for example if a vertex has to compute an aggregate value like a count or an average, but it requires the system to coordinate.
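A hedged sketch of the two vertex disciplines contrasted here, with hypothetical interface names (IBatchVertex and IStreamingVertex are not real APIs): the batch vertex only runs once all of its input is present, which makes aggregation trivial but requires coordination; the streaming vertex reacts to each message immediately but can never know on its own when an aggregate is final.

using System.Collections.Generic;

// Synchronous (batch) vertex: the scheduler hands it all inputs at once,
// so aggregation is easy but nothing happens until every input is ready.
interface IBatchVertex<TIn, TOut>
{
    IEnumerable<TOut> Run(IReadOnlyList<TIn> allInputs);
}

// Asynchronous (streaming) vertex: always active, produces output per message,
// so no coordination is needed but aggregates like counts are awkward to finalize.
interface IStreamingVertex<TIn, TOut>
{
    IEnumerable<TOut> OnRecv(TIn message);
}

// Example: a count is trivial in batch mode...
class BatchCount : IBatchVertex<string, int>
{
    public IEnumerable<int> Run(IReadOnlyList<string> allInputs)
    {
        yield return allInputs.Count;
    }
}

// ...but a streaming vertex never knows, on its own, when the count is final.
class StreamingCount : IStreamingVertex<string, int>
{
    private int count;
    public IEnumerable<int> OnRecv(string message)
    {
        count++;
        yield return count;    // emits a running count; the "final" value needs outside help
    }
}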
Batch DAG execution Central coordinator
Streaming DAG execution
Streaming DAG execution Inline coordination
Batch iteration Central coordinator Returning to our cyclic graph, we see that if all execution is in synchronous batch mode, a batch of data arrives at the input, then moves through the vertices in lock step, before reaching the output and emitting the result. The nice thing is that there's a one-to-one correspondence between input and output, but the cost is in the latency: there's no overlapping of computation within the graph.
Streaming iteration In the streaming setting, all vertices remain active: data streams in, flows around the loop, and output is produced continuously. The latency can be much lower, but the correspondence between input and output is lost, which makes it hard to do things like keep a sliding window of counts per second, like we saw in the previous talk.
Messages Messages are delivered asynchronously: B.SendBy(edge, message, time) results in C.OnRecv(edge, message, time). The asynchronous primitives are for message exchange, and indeed all message exchange in a timely dataflow graph is asynchronous. When a vertex B is running, it can invoke SendBy, specifying one of its outgoing edges, a message body, and a time. The time identifies records that arrived in the same batch (for now, think of it as a sequence number), but since message delivery is asynchronous, the system can deliver the message without having to put it in order. The system delivers this message to vertex C, which receives a callback to its OnRecv method, with the same edge, message and time. Just these primitives let you build any vertex that works for asynchronous streaming: in particular, things like transformations and filters, but also things like joins of two streams. Sometimes you need synchronization, however…
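A sketch of what a purely asynchronous vertex might look like against the SendBy/OnRecv primitives named on this slide; the edge/message/time representations and the FilterVertex class are assumptions for illustration, not Naiad's actual types.

// Hypothetical shape for the send primitive on this slide.
delegate void SendBy(int edge, string message, long time);

// A filter needs only the asynchronous primitives: every OnRecv can be answered
// immediately with SendBy, with no coordination and no notion of batch boundaries.
class FilterVertex
{
    private readonly SendBy send;        // provided by the runtime: this.SendBy(edge, message, time)
    private readonly int outputEdge;

    public FilterVertex(SendBy send, int outputEdge)
    {
        this.send = send;
        this.outputEdge = outputEdge;
    }

    // C.OnRecv(edge, message, time): forward matching messages, preserving their logical time.
    public void OnRecv(int edge, string message, long time)
    {
        if (message.Contains("#naiad"))
            send(outputEdge, message, time);
    }
}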
Notifications support batching While C.SendBy(_, _, time) continues to deliver D.OnRecv(_, _, time), D.NotifyAt(time) requests a callback D.OnNotify(time) that fires once there can be no more messages at time or earlier. …and that is the purpose of notifications, which are delivered in a specific order so that we can simulate batching. When it is active, vertex D can request a notification by calling the NotifyAt primitive, specifying a time at which it should be notified. This doesn't block the system, so D can continue to receive messages and do some processing in OnRecv. It finally receives the notification, through its OnNotify callback, once all upstream work at earlier times has completed. Once the notification is received, D knows that it has seen every message it will ever see at that time. In this respect, notifications are similar to punctuations in stream processing: they guarantee that a vertex will never again see any messages at or before a certain time. OnNotify can therefore perform work that depends on knowing every value for a particular time. This is particularly useful for computing the final value of an aggregation, but it can also be useful for performing cleanup at the end of an operation.
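Continuing the same illustrative sketch (again with hypothetical types, not Naiad's API), a counting vertex shows why notifications matter: it accumulates per-time counts in OnRecv, requests a notification for each time it sees, and emits the count only from OnNotify, once it is guaranteed that no more messages can arrive at or before that time.

using System.Collections.Generic;

delegate void SendBy(int edge, string message, long time);
delegate void NotifyAt(long time);

class CountVertex
{
    private readonly SendBy send;
    private readonly NotifyAt notifyAt;   // provided by the runtime: this.NotifyAt(time)
    private readonly int outputEdge;
    private readonly Dictionary<long, int> counts = new Dictionary<long, int>();

    public CountVertex(SendBy send, NotifyAt notifyAt, int outputEdge)
    {
        this.send = send;
        this.notifyAt = notifyAt;
        this.outputEdge = outputEdge;
    }

    public void OnRecv(int edge, string message, long time)
    {
        if (!counts.ContainsKey(time))
        {
            counts[time] = 0;
            notifyAt(time);              // ask to be told when this time is complete
        }
        counts[time]++;                  // keep accumulating; no output yet
    }

    // Delivered only once no message at this time or earlier can arrive: the count is final.
    public void OnNotify(long time)
    {
        send(outputEdge, counts[time].ToString(), time);
        counts.Remove(time);             // cleanup at the end of the batch
    }
}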
Coordination in timely dataflow Local scheduling with global progress tracking Coordination with a shared counter, not a scheduler Efficient, scalable implementation
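One way to picture "coordination with a shared counter, not a scheduler" is the simplified sketch below; it is not Naiad's actual progress protocol (which also tracks dataflow locations and loop counters), just the core idea of counting outstanding messages per logical time and delivering a notification for a time once that count and all earlier counts have drained to zero.

using System;
using System.Collections.Generic;
using System.Linq;

class ProgressTracker
{
    // Outstanding (sent but not yet received) message count per logical time.
    private readonly Dictionary<long, long> outstanding = new Dictionary<long, long>();
    private readonly SortedDictionary<long, List<Action>> pendingNotifications =
        new SortedDictionary<long, List<Action>>();

    public void OnSend(long time)    { Bump(time, +1); }
    public void OnReceive(long time) { Bump(time, -1); TryDeliver(); }

    public void NotifyAt(long time, Action callback)
    {
        if (!pendingNotifications.TryGetValue(time, out var list))
            pendingNotifications[time] = list = new List<Action>();
        list.Add(callback);
        TryDeliver();
    }

    private void Bump(long time, long delta)
    {
        outstanding.TryGetValue(time, out var n);
        outstanding[time] = n + delta;
        if (outstanding[time] == 0) outstanding.Remove(time);
    }

    // A notification at time t can fire once no message at any time <= t is outstanding.
    private void TryDeliver()
    {
        while (pendingNotifications.Count > 0)
        {
            long t = pendingNotifications.Keys.First();
            if (outstanding.Keys.Any(time => time <= t)) return;
            foreach (var cb in pendingNotifications[t]) cb();
            pendingNotifications.Remove(t);
        }
    }
}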
Interactive graph analysis [Figure: dataflow with a 32K tweets/s input, hashtag (#x) and mention (@y) streams, joins and an incremental max computation, and interactive lookups (z?) at 10 queries/s] PageRank on a static graph is pretty dull, and doesn't stretch Naiad's functionality. For the finale, I'm going to revisit the application that I started with, where we're going to process tweets incrementally in a graph algorithm, combine that with the extracted hashtags, and then run interactive queries against the result. To make things concrete, we'll ingest 32 thousand tweets per second on a cluster of 32 machines, and take in one query every hundred milliseconds. Now note that this doesn't come anywhere close to saturating the query throughput, but it does let us see the latency at a fine granularity.
Query latency Max: 140 ms, 99th percentile: 70 ms, median: 5.2 ms. Cluster: 32 servers, 8-core 2.1 GHz AMD Opteron, 16 GB RAM per server, Gigabit Ethernet. There are more results in the paper, but this one captures things the best. Every one hundred milliseconds, a username query is generated at random and submitted to the computation. This graph shows a time series of the latencies of those queries, where the latency is on a log scale. There are 200 queries in this 20-second fragment of the trace. All but one of them returns in under 100 milliseconds, the 99th percentile is 70 milliseconds, and the median latency is just over 5 milliseconds. The latency spikes are due to interactive queries queuing behind the “batch update” work in the connected components. We could probably improve this by encouraging the batch work to yield, but that remains future work.
Mutable state In batch DAG systems collections are immutable Functional definition in terms of preceding subgraph Adding streaming or iteration introduces mutability Collection varies as function of epoch, loop iteration
Key-value store as dataflow var lookup = data.Join(query, d => d.key, q => q.key) Modeled random access with dataflow… Add/remove key is a streaming update to data Look up key is a streaming update to query High throughput requires batching But that was true anyway, in general
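A local sketch of the idea, with hypothetical epoch-batched inputs: updates on the data stream maintain a keyed collection, and each batch of queries is answered by joining it against that collection, so "random access" is just another streaming join; batching the queries is what buys throughput, as the slide notes.

using System;
using System.Collections.Generic;
using System.Linq;

class KeyValueStoreAsDataflow
{
    static void Main()
    {
        // State accumulated from the 'data' stream (add/remove key is a streaming update).
        var data = new Dictionary<string, int>();

        // Epoch 1: updates arrive on the data stream.
        var updates = new[] { ("red", 2), ("blue", 4), ("yellow", 3) };
        foreach (var (key, value) in updates) data[key] = value;

        // Epoch 2: a batch of lookups arrives on the query stream;
        // the lookup is just a join of the query batch with the data collection.
        var query = new[] { "blue", "green" };
        var lookup = query.Join(data, q => q, d => d.Key, (q, d) => $"{q} => {d.Value}");

        foreach (var result in lookup) Console.WriteLine(result);   // blue => 4
    }
}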
What can’t dataflow do? Programming model for mutable state? Not as intuitive as functional collection manipulation Policies for placement still primitive Hash everything and hope Great research opportunities Intersection of OS, network, runtime, language
Talk outline Why is dataflow so useful? What is Dryad? An engineering sweet spot Beyond Dryad Conclusions
Conclusions Dataflow is a great structuring principle We know good programming models We know how to write high-performance systems Dataflow is the status quo for batch processing Mutable state is the current research frontier Apache 2.0 licensed source on GitHub http://research.microsoft.com/en-us/um/siliconvalley/projects/BigDataDev/