Parallel Computing with Dryad
Today's Speakers
Raman Grover, MS Student. Advisor: Prof. Michael Carey
Sanjay Kulhari, PhD Student. Advisor: Prof. Vassilis Tsotras
Acknowledgments
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Distributed Data-Parallel Computing Using a High-Level Programming Language
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
Google Tech Talks
MSDN Channel 9
Parallel Distributed Computing... Why?
Large-scale Internet services depend on clusters of hundreds or thousands of general-purpose servers.
Future advances in local computing power will come from increasing the number of cores on a chip rather than from improving the speed or instruction-level parallelism of a single core.
Hard Problems
High-latency, unreliable networks
Control of resources by separate, federated, or competing entities
Issues of identity for authentication and access control
The programming model: reliability, efficiency, and scalability of the applications
Achieving Scalability
Two broad approaches:
Systems that automatically discover and exploit parallelism in sequential programs
Systems that require the developer to explicitly expose the data dependencies of a computation, for example: Condor, shader languages developed for graphics processing units, parallel databases, and Google's MapReduce system
Reasons for Success
The developer is explicitly forced to consider the data parallelism of the computation.
The developer need have no understanding of standard concurrency mechanisms such as threads and fine-grain concurrency control.
Developers now work at a suitable level of abstraction for writing scalable applications, since the resources available at execution time are not generally known at the time the code is written.
Limitations: Not a Free Lunch!
These systems restrict an application's communication flow, for different reasons:
GPU shader languages are strongly tied to an efficient hardware implementation.
MapReduce is designed for the widest possible class of developers and aims for simplicity at the expense of generality and performance.
Parallel databases are designed for relational-algebra manipulations (e.g. SQL), where the communication graph is implicit.
Dryad
Gives control over the communication graph as well as the subroutines that live at its vertices.
The developer specifies an arbitrary directed acyclic graph to describe the application's communication patterns, and expresses the data transport mechanisms (files, TCP pipes, and shared-memory FIFOs) between the computation vertices.
MapReduce restricts all computations to take a single input set and generate a single output set. SQL and shader languages allow multiple inputs but generate a single output from the user's perspective, though SQL query plans internally use multiple-output vertices.
Dryad is notable for allowing graph vertices (and computations in general) to use an arbitrary number of inputs and outputs.
In This Talk
Dryad: System Overview
Describing a Dryad Graph
Communication Channels
Dryad Jobs and Job Execution
Fault Tolerance
Runtime Graph Refinement
Experimental Evaluation
Building on Dryad
Before we dive into Details…
Unix pipes (1-D): grep | sed | sort | awk | perl
Dryad (2-D): grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
Dryad is a generalization of the Unix piping mechanism: instead of one-dimensional (chain) pipelines, it provides two-dimensional pipelines. The unit is still a process connected by point-to-point channels, but the processes are replicated.
Dryad = Execution Layer
Analogy: Job (application) is to Dryad is to Cluster as Pipeline is to Shell is to Machine.
In the same way that the Unix shell does not understand the pipeline running on top of it, but manages its execution (e.g., killing processes when one exits), Dryad does not understand the job running on top of it.
Virtualized 2-D Pipelines
This is a possible schedule of a Dryad job using 2 machines.
The Unix pipeline is generalized in three ways:
2-D instead of 1-D (a DAG rather than a chain)
Spans multiple machines
Resources are virtualized: you can run the same large job on many or few machines
Dryad Job Structure
grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
[Figure: the job graph for this pipeline: input files feed a stage of grep vertices, followed by sed, sort, awk, and perl stages, connected by channels and producing output files.]
This is the basic Dryad terminology: vertices (processes), channels, stages, input files, and output files.
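As a rough illustration of this terminology (the real Dryad graph-description API is a C++ library with composition operators, not shown here), the following is a hedged C# sketch that models a job as stages of replicated vertices joined by channels, in the spirit of grep^1000 | sed^500. All type and method names (Vertex, Stage, Connect) are invented for illustration.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical model of Dryad terminology: a Stage is a set of replicated
    // Vertices (processes); channels connect vertices of adjacent stages.
    class Vertex
    {
        public string Name;                        // e.g. "grep.17"
        public List<Vertex> Inputs = new List<Vertex>();
    }

    class Stage
    {
        public List<Vertex> Vertices;
        public Stage(string program, int replicas) =>
            Vertices = Enumerable.Range(0, replicas)
                                 .Select(i => new Vertex { Name = $"{program}.{i}" })
                                 .ToList();
    }

    class GraphSketch
    {
        // Connect every vertex of 'to' to every vertex of 'from' (a full bipartite
        // connection; real Dryad also supports pointwise and more refined connectors).
        static void Connect(Stage from, Stage to)
        {
            foreach (var v in to.Vertices) v.Inputs.AddRange(from.Vertices);
        }

        static void Main()
        {
            var grep = new Stage("grep", 1000);
            var sed  = new Stage("sed", 500);
            Connect(grep, sed);                    // channels between the two stages
            Console.WriteLine($"sed.0 reads from {sed.Vertices[0].Inputs.Count} grep vertices");
        }
    }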
Channels: Finite Streams of Items
Distributed filesystem (persistent)
SMB/NTFS files (temporary)
TCP pipes (inter-machine)
Memory FIFOs (intra-machine)
Channels are very abstract, enabling a variety of transport mechanisms. The performance and fault tolerance of these mechanisms vary widely.
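One way to picture the channel abstraction is a single interface with interchangeable transports behind it. Below is a minimal hedged sketch (IChannel, MemoryFifoChannel, and FileChannel are invented names, not Dryad's actual channel API); it shows why vertex code can stay unchanged while the runtime swaps a fast in-memory FIFO for a persistent file.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;

    // A channel is just a finite stream of items; the transport behind it can vary.
    interface IChannel
    {
        void Write(string item);
        IEnumerable<string> ReadAll();   // called once the writer has finished
    }

    // Intra-machine transport: an in-memory FIFO (fast, but not fault tolerant).
    class MemoryFifoChannel : IChannel
    {
        private readonly ConcurrentQueue<string> _items = new ConcurrentQueue<string>();
        public void Write(string item) => _items.Enqueue(item);
        public IEnumerable<string> ReadAll() => _items;
    }

    // Persistent transport: a temporary file (slower, but survives a reader crash).
    class FileChannel : IChannel
    {
        private readonly string _path;
        public FileChannel(string path) => _path = path;
        public void Write(string item) => File.AppendAllLines(_path, new[] { item });
        public IEnumerable<string> ReadAll() => File.ReadLines(_path);
    }

    class ChannelDemo
    {
        static void Main()
        {
            IChannel c = new MemoryFifoChannel();   // vertex code is unaware of the transport
            c.Write("item-1");
            c.Write("item-2");
            foreach (var item in c.ReadAll()) Console.WriteLine(item);
        }
    }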
Architecture
[Figure: the data plane (vertices exchanging data over files, TCP, and FIFOs across the network) is separate from the control plane (the job manager, name server NS, and per-machine process daemons PD exchanging the job schedule and control messages with the cluster).]
The brain of a Dryad job is a centralized Job Manager, which maintains the complete state of the job. The JM controls the processes running on the cluster, but never exchanges data with them: the data plane is completely separated from the control plane.
Runtime Services
Name server
Daemon (one per cluster machine)
Job Manager: a centralized coordinating process; the user application constructs the graph and is linked with Dryad libraries for scheduling vertices.
Vertex executable: Dryad libraries communicate with the JM; the user application sees channels in and out, and can run arbitrary application code that uses the local file system.
The Job Manager (JM) handles the scheduling of all the processes at the vertices of the graph, using a topological sort of the graph. When nodes in the graph fail, parts of the subgraph may need to be re-executed, because the inputs they require may no longer be available; the vertices that generated those inputs may have to be re-run. The JM attempts to re-execute a minimal part of the graph to recompute the missing data. When executing a vertex, the JM must choose a machine to run it on.
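The notes above say the JM schedules vertices using a topological sort of the graph. A hedged, single-machine sketch of that idea follows (Kahn-style: a vertex becomes runnable once all of its inputs are ready; the graph encoding and names are illustrative, and a real JM would also pick a machine and handle failures).

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class SchedulerSketch
    {
        // edges[v] = the vertices that consume v's output.
        static void Run(Dictionary<string, List<string>> edges)
        {
            var remainingInputs = edges.Keys.ToDictionary(v => v, _ => 0);
            foreach (var consumers in edges.Values)
                foreach (var c in consumers) remainingInputs[c]++;

            // Start with vertices whose inputs are all ready (none pending).
            var runnable = new Queue<string>(remainingInputs.Where(kv => kv.Value == 0)
                                                            .Select(kv => kv.Key));
            while (runnable.Count > 0)
            {
                var v = runnable.Dequeue();
                Console.WriteLine($"run {v}");     // real JM: pick a machine, start the process
                foreach (var c in edges[v])
                    if (--remainingInputs[c] == 0) runnable.Enqueue(c);
            }
        }

        static void Main()
        {
            Run(new Dictionary<string, List<string>>
            {
                ["A1"] = new List<string> { "B" },
                ["A2"] = new List<string> { "B" },
                ["B"]  = new List<string>(),
            });
        }
    }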
Job = Directed Acyclic Graph
[Figure: inputs, processing vertices connected by channels (file, pipe, shared memory), and outputs.]
A Dryad application is composed of a collection of processing vertices (processes). The vertices communicate with each other through channels. The vertices and channels should always compose into a directed acyclic graph.
Job Execution
The scheduler keeps track of the state and history of each vertex in the graph.
If the job manager fails, the job is terminated, but the scheduler can implement checkpointing or replication to avoid this.
An execution record is attached to each vertex. When an execution record is paired with an available computer, the remote daemon is instructed to run the vertex.
Job Execution (cont.)
If an execution of a vertex fails, it can be started again, and more than one instance of the vertex may be executing at the same time. Each execution names its output channels uniquely using a version number.
Vertex life cycle: a new execution record is created and added to the scheduling queue; once its inputs are ready, the record is paired with an available computer; the job manager then receives periodic status updates from the running vertex.
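A hedged sketch of the versioning idea mentioned above: because more than one instance of the same vertex may run at once, each execution record names its output channels with its own version number, so attempts never clash. ExecutionRecord and the naming scheme are invented for illustration.

    using System;
    using System.Collections.Generic;

    // One attempt at running a vertex; re-executions get fresh version numbers.
    class ExecutionRecord
    {
        public string Vertex;
        public int Version;
        // Output channels are named uniquely per attempt, so concurrent or
        // re-executed attempts never overwrite each other's data.
        public string OutputChannel => $"{Vertex}.out.v{Version}";
    }

    class ExecutionDemo
    {
        static readonly Dictionary<string, int> NextVersion = new Dictionary<string, int>();

        static ExecutionRecord NewAttempt(string vertex)
        {
            NextVersion.TryGetValue(vertex, out var v);
            NextVersion[vertex] = v + 1;
            return new ExecutionRecord { Vertex = vertex, Version = v + 1 };
        }

        static void Main()
        {
            var first = NewAttempt("sort.3");       // original execution
            var retry = NewAttempt("sort.3");       // duplicate started after a failure or slow run
            Console.WriteLine(first.OutputChannel); // sort.3.out.v1
            Console.WriteLine(retry.OutputChannel); // sort.3.out.v2
        }
    }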
Fault tolerance policy
All vertex programs are assumed to be deterministic, so every terminating execution of the job gives the same results regardless of the failures that occur over the course of execution.
The job manager will in any case know that something bad happened to a vertex.
Vertices belong to stages, and a stage manager can take care of slow or failed vertices within its stage.
Fault tolerance policy (cont.)
If A fails, run it again.
If A's inputs are gone, run the upstream vertices again.
If A is slow, run another copy elsewhere and use the output from whichever copy finishes first.
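A hedged sketch of these three rules as a stage manager's decision function (the types, the timing threshold, and the return strings are invented for illustration; Dryad's actual policy is more involved).

    using System;

    enum VertexState { Running, Failed, Completed }

    class VertexStatus
    {
        public string Name;
        public VertexState State;
        public bool InputsAvailable;       // false if an upstream output has been lost
        public TimeSpan Elapsed;
    }

    class FaultPolicySketch
    {
        // Returns the action a stage manager might take for vertex A.
        static string Decide(VertexStatus a, TimeSpan stageMedian)
        {
            if (a.State == VertexState.Failed && !a.InputsAvailable)
                return "re-run upstream vertices, then re-run " + a.Name;
            if (a.State == VertexState.Failed)
                return "re-run " + a.Name;
            // Slow outlier: start a duplicate and keep whichever copy finishes first.
            if (a.State == VertexState.Running && a.Elapsed > stageMedian + stageMedian)
                return "start a duplicate of " + a.Name + "; use the first output to finish";
            return "no action";
        }

        static void Main()
        {
            var slow = new VertexStatus { Name = "A", State = VertexState.Running,
                                          InputsAvailable = true, Elapsed = TimeSpan.FromMinutes(9) };
            Console.WriteLine(Decide(slow, TimeSpan.FromMinutes(3)));
        }
    }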
Run-time graph refinement
The goal is to scale to large input sets while conserving scarce network bandwidth.
For associative and commutative computations, an aggregation tree can be helpful: if internal vertices perform data reduction, network traffic between racks is reduced.
Refinement continues as upstream vertices complete, with a stage manager for each input layer.
Run-time graph refinement (cont.)
A partial aggregation operation processes k sets in parallel; the data mining example that follows uses this.
Dynamic refinement is useful because neither the amount of data to be written nor the set of required input channels is known in advance.
Run-time graph refinement (cont.)
[Figure: a DISTINCT computation with and without partial aggregation. Each of the five A vertices sends 10,000 tuples. Without refinement, the B vertex receives all 50,000 tuples and executes DISTINCT on them. With refinement, the * vertex gets 30,000 tuples, runs DISTINCT, and returns 500 tuples, and the + vertex gets 20,000 tuples, runs DISTINCT, and returns 500 tuples, so the B vertex gets only 1,000 tuples before running the final DISTINCT.]
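The arithmetic in this figure can be reproduced with ordinary LINQ on one machine: each partial-aggregation vertex applies Distinct locally, so the final B vertex sees roughly 1,000 tuples instead of 50,000. A hedged sketch follows (the key distribution and partitioning are invented so that the counts approximately match the figure).

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class PartialAggregationSketch
    {
        static void Main()
        {
            var rand = new Random(0);
            // Five A vertices, each emitting 10,000 tuples drawn from ~500 distinct keys.
            var aOutputs = Enumerable.Range(0, 5)
                .Select(_ => Enumerable.Range(0, 10_000).Select(__ => rand.Next(500)).ToList())
                .ToList();

            // Without refinement: B receives all 50,000 tuples.
            var direct = aOutputs.SelectMany(x => x).ToList();

            // With refinement: intermediate vertices (*, +) each run DISTINCT on a
            // subset of the A outputs before forwarding to B.
            var star = aOutputs.Take(3).SelectMany(x => x).Distinct().ToList();  // ~500 tuples
            var plus = aOutputs.Skip(3).SelectMany(x => x).Distinct().ToList();  // ~500 tuples
            var bInput = star.Concat(plus).ToList();                             // ~1,000 tuples

            Console.WriteLine($"B without refinement: {direct.Count} tuples");
            Console.WriteLine($"B with refinement:    {bInput.Count} tuples");
            Console.WriteLine($"Final DISTINCT:       {bInput.Distinct().Count()} tuples");
        }
    }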
Experimental evaluation
Hardware:
A cluster of 10 computers (SkyServer query experiment) and a cluster of 1,800 computers (data mining experiment).
Each computer had two dual-core Opteron processors running at 2 GHz (4 CPU cores in total), 8 GB of DRAM, a 400 GB Western Digital disk, 1 Gbit/s Ethernet, and Windows Server 2003 Enterprise x64 Edition SP1.
Case Study I: SkyServer Query
A 3-way join to find the gravitational lens effect.
Table U: (objId, color), 11.8 GB
Table N: (objId, neighborId), 41.8 GB
Find neighboring stars with similar colors:
Join U and N to find T = (U.color, N.neighborId) where U.objId = N.objId
Join U and T to find U.objId where U.objId = T.neighborId and U.color ≈ T.color
SkyServer DB Query
Took the SQL plan, manually coded it in Dryad, and manually partitioned the data:
select u.color, n.neighborobjid from u join n where u.objid = n.objid
select u.objid from u join <temp> where u.objid = <temp>.neighborobjid and |u.color - <temp>.color| < d
[Figure: the hand-coded Dryad plan over inputs u (objid, color) and n (objid, neighborobjid): select (u.color, n.neighborobjid), re-partition by n.neighborobjid, order by n.neighborobjid, distinct, merge outputs, then join against u partitioned by objid; diagram vertex labels H, n, X, U, N.]
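For readers more comfortable with LINQ than with the hand-coded Dryad graph, the same two joins can be written roughly as follows. This is a hedged LINQ-to-Objects sketch with tiny in-memory tables; the class and field names follow the SQL above, and the color test is the simplified |u.color - t.color| < d form.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class U { public long ObjId; public double Color; }
    class N { public long ObjId; public long NeighborObjId; }

    class SkyServerSketch
    {
        static void Main()
        {
            var u = new List<U> { new U { ObjId = 1, Color = 0.5 }, new U { ObjId = 2, Color = 0.52 } };
            var n = new List<N> { new N { ObjId = 1, NeighborObjId = 2 } };
            const double d = 0.05;

            // First join: pair each object's color with its neighbor's id.
            var temp = from uu in u
                       join nn in n on uu.ObjId equals nn.ObjId
                       select new { uu.Color, nn.NeighborObjId };

            // Second join: keep neighbors whose color is within distance d.
            var result = from uu in u
                         join t in temp on uu.ObjId equals t.NeighborObjId
                         where Math.Abs(uu.Color - t.Color) < d
                         select uu.ObjId;

            Console.WriteLine(string.Join(", ", result.Distinct()));
        }
    }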
Optimization
[Figure: the optimized SkyServer query plan; diagram vertex labels D, M, 4n S, Y, H, n, X, U, N, with the sort (S) and merge (M) stages expanded into multiple vertices.]
Speed-up
[Figure: speed-up versus number of computers (2 to 10) for three configurations: Dryad in-memory, Dryad two-pass, and SQLServer 2005; the y-axis runs from 0.0 to 16.0.]
Case Study II: Query Histogram Computation
Input: a log file (n partitions)
Extract queries from the log partitions
Re-partition by hash of query (k buckets)
Compute the histogram within each bucket
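In LINQ terms, these steps are roughly: extract the query, group by a hash of the query into k buckets, and count within each bucket. A hedged single-machine sketch follows (the log format and the bucket count are invented).

    using System;
    using System.Linq;

    class HistogramSketch
    {
        static void Main()
        {
            string[] logLines =                      // stand-in for the n log partitions
            {
                "2009-01-01 dryad", "2009-01-01 linq", "2009-01-02 dryad",
            };
            const int k = 4;                         // number of hash buckets

            var histogram =
                logLines.Select(line => line.Split(' ')[1])              // extract the query
                        .GroupBy(q => Math.Abs(q.GetHashCode()) % k)     // re-partition by hash into k buckets
                        .SelectMany(bucket => bucket                     // histogram within each bucket
                            .GroupBy(q => q)
                            .Select(g => new { Query = g.Key, Count = g.Count() }));

            foreach (var entry in histogram)
                Console.WriteLine($"{entry.Query}: {entry.Count}");
        }
    }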
Naïve histogram topology
[Figure: n Q vertices feed k R vertices. Each Q is built from P (parse lines), D (hash distribute), and S (sort); each R is built from MS (merge sort) and C (count occurrences).]
Efficient histogram topology
[Figure: the refined topology with n Q' vertices, k T vertices, and k R vertices, built from P (parse lines), D (hash distribute), S (sort), C (count occurrences), MS (merge sort), and M (non-deterministic merge).]
Final histogram refinement
[Figure: the final refined graph, annotated with vertex counts per stage (450, 217, 10,405, 99,713) and data volumes (33.4 GB, 118 GB, 154 GB, 10.2 TB).]
Overall: 1,800 computers, 43,171 vertices, 11,072 processes, 11.5 minutes.
Optimizing Dryad applications
General-purpose refinement rules
Processes formed from subgraphs
Re-arrange computations, change I/O type
Application code is not modified; the system is at liberty to make optimization choices
High-level front ends hide this from the user
All this sounds good! But how do I interact with Dryad?
Nebula scripting language
Allows users to specify a computation as a series of stages, each taking input from one or more previous stages or from the file system.
Presents Dryad as a generalization of the UNIX piping mechanism, so distributed applications can be written using programs such as perl or grep.
There is also a front end that uses perl scripts and SQL select, project, and join.
Interacting with Dryad (Cont.)
Integration with SQL Server
SQL Server Integration Services (SSIS) supports workflow-based application programming on a single instance of SQL Server.
The SSIS input graph is generated and tested on a single computer, then run in a distributed fashion using Dryad.
Each Dryad vertex is an instance of SQL Server running an SSIS subgraph of the complete job.
This has been deployed in a live production system.
LINQ: Microsoft's Language INtegrated Query
Available in Visual Studio products.
A set of operators to manipulate datasets in .NET, supporting traditional relational operators: Select, Join, GroupBy, Aggregate, etc.
Integrated into .NET programming languages: programs can call operators, and operators can invoke arbitrary .NET functions.
Data model: data elements are strongly typed .NET objects, much more expressive than SQL tables.
Highly extensible: add new custom operators and new execution providers.
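As a concrete (hedged) example of these operators over strongly typed objects, here is a small LINQ-to-Objects query; the Page class and the data are invented for illustration.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Page { public string Url; public int Hits; }

    class LinqExample
    {
        static void Main()
        {
            var pages = new List<Page>
            {
                new Page { Url = "/home", Hits = 12 },
                new Page { Url = "/docs", Hits = 7 },
                new Page { Url = "/home", Hits = 3 },
            };

            // Relational-style operators (GroupBy, Select, OrderBy) over .NET objects;
            // these are the same operators DryadLINQ later distributes across a cluster.
            var totals = pages.GroupBy(p => p.Url)
                              .Select(g => new { Url = g.Key, Total = g.Sum(p => p.Hits) })
                              .OrderByDescending(x => x.Total);

            foreach (var t in totals) Console.WriteLine($"{t.Url}: {t.Total}");
        }
    }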
LINQ System Architecture
[Figure: a .NET program (C#, VB, F#, etc.) on the local machine issues queries through the LINQ provider interface to different execution engines, in increasing order of scalability: LINQ-to-Objects (single-core), PLINQ (multi-core), LINQ-to-SQL, and DryadLINQ (cluster).]
DryadLINQ: Automatically Distribute a LINQ Program
More general than distributed SQL: it inherits the flexible C# type system and libraries (data clustering, EM, inference, …).
Uniform data-parallel programming model, from SMP to clusters, with few Dryad-specific extensions.
The same source program runs on a single core, on multi-core, and up to a cluster.
DryadLINQ System Architecture
[Figure: on the client machine, a .NET program invokes a query; DryadLINQ turns the query expression into a distributed query plan plus vertex code and submits it to the data center, where the JM drives Dryad execution over the input tables. The output tables are returned as a DryadTable of .NET objects, consumed via ToTable or foreach.]
Word Count in DryadLINQ
Count word frequency in a set of documents:
var docs = DryadLinq.GetTable<Doc>("file://docs.txt");
var words = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
counts.ToDryadTable("counts.txt");
Execution Plan for Word Count
[Figure: the per-partition plan: SM (SelectMany), Q (sort), GB (groupby), C (count), and D (distribute) are pipelined; the results are then combined by MS (mergesort), GB (groupby), and Sum.]
Execution Plan for Word Count
[Figure: the same plan instantiated across four partitions: each partition runs the pipelined SM, Q, GB, C, D stages; the partitioned outputs are exchanged, and each reducer runs MS, GB, and Sum.]
DryadLINQ: From LINQ to Dryad
Automatic query plan generation; distributed query execution by Dryad.
LINQ query:
var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
[Figure: the query is translated into a query plan (logs → where → select) and then into a Dryad graph.]
How does it work?
Sequential code “operates” on datasets, but really it just builds an expression graph (lazy evaluation).
When a result is retrieved, the entire graph is handed to DryadLINQ, the optimizer builds an efficient DAG, and the program is executed on the cluster.
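Deferred execution is visible even with plain LINQ-to-Objects: the query below is only a description until it is enumerated, which is what allows DryadLINQ to hand the whole expression graph to its optimizer before anything runs. A hedged illustration:

    using System;
    using System.Linq;

    class LazyDemo
    {
        static void Main()
        {
            var data = Enumerable.Range(1, 5);

            // Building the query runs nothing yet; it only records the operations.
            var query = data.Where(x =>
            {
                Console.WriteLine($"evaluating {x}");
                return x % 2 == 0;
            });

            Console.WriteLine("query built, nothing evaluated yet");

            // Retrieving a result triggers execution of the whole pipeline.
            foreach (var x in query) Console.WriteLine($"result {x}");
        }
    }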
Future Directions Goal: Use a cluster as if it is a single computer
Dryad/DryadLINQ represent a modest step.
On-going research:
What can we write with DryadLINQ? Where and how to generalize the programming model? Performance, usability, etc.
How to debug/profile/analyze DryadLINQ apps?
Job scheduling: how to schedule/execute N concurrent jobs?
Caching and incremental computation: how to reuse previously computed results?
Static program checking: a very compelling case for program analysis? Better to catch bugs statically than to fight them in the cloud.
Conclusions Goal: Use a compute cluster as if it is a single computer
Dryad/DryadLINQ represent a significant step.
This requires close collaborations across many fields of computing, including:
Distributed systems
Distributed and parallel databases
Programming language design and analysis