Distributed computing using Dryad Michael Isard Microsoft Research Silicon Valley
Distributed Data-Parallel Computing
Cloud
– Transparent scaling
– Resource virtualization
Commodity clusters
– Fault tolerance with good performance
Workloads beyond standard SQL, HPC
– Data-mining, graph analysis, …
– Semi-structured/unstructured data
Execution layer
This talk: system-level middleware
– Yuan Yu will describe the DryadLINQ programming model on Saturday
Algorithm -> execution plan by magic
Problem domain
Large inputs
– Tens of GB is a “small test dataset”
– Single job up to hundreds of TB
– Semi-structured data is common
Not latency sensitive
– Overhead of seconds for a trivial job
– Large job could take days
– Batch computation, not online queries
Simplifies fault tolerance, caching, etc.
Talk overview
Some typical computations
DAG implementation choices
The Dryad execution engine
Comparison with MapReduce
Discussion
Map
Independent transformation of dataset
– for each x in S, output x’ = f(x)
E.g. simple grep for word w
– output line x only if x contains w
[Diagram: S -> f -> S’; partitioned version: S1, S2, S3 each pass through f to produce S1’, S2’, S3’]
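To make the pattern concrete, here is a minimal Python sketch (not Dryad code) of map as an independent per-partition transformation, specialized to the grep example; the partition contents and the grep_map helper are purely illustrative.

```python
# Minimal sketch of the "map" pattern: apply f independently to every
# record of every input partition. Plain Python, not the Dryad API.

def grep_map(lines, word):
    """f(x): emit a line only if it contains the given word."""
    return [line for line in lines if word in line]

# Hypothetical partitioned input S = {S1, S2, S3}.
S = [
    ["the quick brown fox", "lorem ipsum"],        # S1
    ["a fox jumps over the lazy dog", "hello"],    # S2
    ["no match here"],                             # S3
]

# Each partition is processed independently, hence in parallel on a cluster.
S_prime = [grep_map(part, "fox") for part in S]
print(S_prime)  # [['the quick brown fox'], ['a fox jumps over the lazy dog'], []]
```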
Reduce
Grouping plus aggregation
– 1) Group x in S according to key selector k(x)
– 2) For each group g, output r(g)
E.g. simple word count
– group by k(x) = x
– for each group g output key (word) and count of g
[Diagram: S -> G -> r -> S’]
[Diagram: distributed version: the partitions of S are redistributed by D, grouped by G, and reduced by r to produce output partitions Si’]
D is distribute, e.g. by hash or range
[Diagram: same plan with an initial reduce ir applied within each input partition before distribution]
ir is initial reduce, e.g. compute a partial sum
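The following Python sketch (not Dryad code) walks through the distributed reduce plan on the word-count example: a per-partition group plus initial reduce, a hash distribute, and a final reduce. The partition contents and function names are illustrative assumptions.

```python
# Sketch of the distributed reduce plan: G = group, ir = initial (partial)
# reduce, D = distribute by hash, r = final reduce. Plain Python only.
from collections import Counter, defaultdict

def initial_reduce(partition):
    """G + ir within one input partition: partial word counts."""
    return Counter(word for line in partition for word in line.split())

def distribute(partials, n_out):
    """D: route each (word, partial count) to an output partition by key hash."""
    buckets = [[] for _ in range(n_out)]
    for counts in partials:
        for word, c in counts.items():
            buckets[hash(word) % n_out].append((word, c))
    return buckets

def final_reduce(bucket):
    """G + r on one output partition: sum the partial counts per word."""
    totals = defaultdict(int)
    for word, c in bucket:
        totals[word] += c
    return dict(totals)

S = [["a b a", "c"], ["b b", "a c"], ["c c c"]]        # three input partitions
partials = [initial_reduce(p) for p in S]              # runs per input partition
outputs = [final_reduce(b) for b in distribute(partials, 2)]
print(outputs)                                         # counts split across 2 output partitions
```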
K-means
Set of points P, initial set of cluster centres C
Iterate until convergence:
– For each c in C
  Initialize count_c, centre_c to 0
– For each p in P
  Find c in C that minimizes dist(p,c)
  Update: count_c += 1, centre_c += p
– For each c in C
  Replace c <- centre_c / count_c
K-means
[Diagram: centres C0 -> C1 -> C2 -> C3 computed iteratively; each step combines the current centres with the point set P (partitions P1, P2, P3) via ac and cc vertices]
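A small Python sketch of the k-means loop above, with P split into partitions so the heavy per-partition work (assign points, accumulate sums and counts) could run as independent vertices and only the small per-cluster summaries are combined. Not Dryad code; the data and fixed iteration count are illustrative.

```python
def dist2(p, c):
    """Squared Euclidean distance."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def accumulate(points, centres):
    """Per-partition step: sum and count of the points nearest each centre."""
    k, d = len(centres), len(centres[0])
    sums, counts = [[0.0] * d for _ in range(k)], [0] * k
    for p in points:
        j = min(range(k), key=lambda i: dist2(p, centres[i]))
        counts[j] += 1
        sums[j] = [s + pi for s, pi in zip(sums[j], p)]
    return sums, counts

def kmeans(partitions, centres, iterations=5):
    """Fixed iteration count for brevity; the slide iterates until convergence."""
    for _ in range(iterations):                     # unrolled loop: C0 -> C1 -> ...
        k, d = len(centres), len(centres[0])
        tot_sums, tot_counts = [[0.0] * d for _ in range(k)], [0] * k
        for part in partitions:                     # each partition could be its own vertex
            sums, counts = accumulate(part, centres)
            for j in range(k):                      # combine the small per-partition summaries
                tot_counts[j] += counts[j]
                tot_sums[j] = [a + b for a, b in zip(tot_sums[j], sums[j])]
        centres = [
            [s / tot_counts[j] for s in tot_sums[j]] if tot_counts[j] else list(centres[j])
            for j in range(k)
        ]
    return centres

P = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0)], [(11.0, 9.0), (0.0, 1.0)]]
print(kmeans(P, centres=[(0.0, 0.0), (10.0, 10.0)]))
```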
Graph algorithms
Set N of nodes with data (n,x)
Set E of directed edges (n,m)
Iterate until convergence:
– For each node (n,x) in N
  For each outgoing edge n->m in E, n_m = f(x,n,m)
– For each node (m,x) in N
  Find set of incoming updates i_m = {n_m : n->m in E}
  Replace (m,x) <- (m, r(i_m))
E.g. power iteration (PageRank)
PageRank
[Diagram: node data N0 -> N1 -> N2 -> N3 computed iteratively by combining with the edge set E; distributed version partitions N and E (E1, E2, E3), with distribute (D) vertices routing updates between iterations]
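Here is a short Python sketch (not Dryad code) of the node-update pattern specialized to PageRank by power iteration: f sends a share of each node's rank along its outgoing edges and r sums the incoming updates. The tiny graph, damping factor, and iteration count are illustrative, and it assumes every node has at least one outgoing edge.

```python
def pagerank(nodes, edges, iterations=20, d=0.85):
    """nodes: list of node ids; edges: list of directed (n, m) pairs."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for n, _ in edges:
        out_deg[n] += 1
    for _ in range(iterations):                 # unrolled iterations: N0 -> N1 -> ...
        incoming = {n: [] for n in nodes}
        for n, m in edges:                      # f(x, n, m): send a share of n's rank to m
            incoming[m].append(rank[n] / out_deg[n])
        rank = {m: (1 - d) / len(nodes) + d * sum(u)   # r(i_m): combine incoming updates
                for m, u in incoming.items()}
    return rank

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
print(pagerank(nodes, edges))
```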
DAG abstraction
Absence of cycles
– Allows re-execution for fault-tolerance
– Simplifies scheduling: no deadlock
Cycles can often be replaced by unrolling
– Unsuitable for fine-grain inner loops
Very popular
– Databases, functional languages, …
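A tiny Python sketch of why acyclicity simplifies scheduling: a vertex can be dispatched as soon as all of its inputs are ready, and because there are no cycles the schedule always makes progress. The graph representation and vertex names are illustrative, not Dryad's data structures.

```python
# Sketch: schedule a DAG by dispatching any vertex whose inputs are complete.
# Acyclicity guarantees every vertex eventually becomes ready (no deadlock).
from collections import deque

def schedule(vertices, edges):
    """edges: list of (u, v) meaning v consumes u's output."""
    indeg = {v: 0 for v in vertices}
    consumers = {v: [] for v in vertices}
    for u, v in edges:
        indeg[v] += 1
        consumers[u].append(v)
    ready = deque(v for v in vertices if indeg[v] == 0)   # all inputs available
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)             # in a real system: dispatch u to some machine
        for v in consumers[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order                    # every vertex appears exactly once

print(schedule(["S1", "G1", "D1", "r1", "out"],
               [("S1", "G1"), ("G1", "D1"), ("D1", "r1"), ("r1", "out")]))
```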
Rewrite graph at runtime
Loop unrolling with convergence tests
Adapt partitioning scheme at run time
– Choose #partitions based on runtime data volume
– Broadcast join vs. hash join, etc.
Adaptive aggregation and distribution trees
– Based on data skew and network topology
Load balancing
– Data/processing skew (cf. work-stealing)
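Two of the simpler runtime decisions above, sketched in Python: picking the partition count from the data volume actually observed upstream, and choosing broadcast vs. hash join from the input sizes. The thresholds and function names are invented for illustration, not values used by Dryad.

```python
# Illustrative runtime-rewrite decisions; all constants are arbitrary.

TARGET_PARTITION_BYTES = 1 << 30            # aim for roughly 1 GB per partition

def choose_partition_count(observed_bytes, max_partitions=1000):
    """Pick #partitions once the upstream output size is known."""
    n = (observed_bytes + TARGET_PARTITION_BYTES - 1) // TARGET_PARTITION_BYTES
    return min(max(1, n), max_partitions)

def choose_join_strategy(left_bytes, right_bytes, broadcast_limit=64 << 20):
    """Broadcast the smaller side if it is small enough, else hash-partition both."""
    return "broadcast-join" if min(left_bytes, right_bytes) <= broadcast_limit else "hash-join"

# The job manager would make these decisions after upstream vertices finish,
# then rewrite the downstream part of the graph accordingly.
print(choose_partition_count(50 * (1 << 30)))          # 50 GB -> 50 partitions
print(choose_join_strategy(200 << 30, 10 << 20))       # tiny right side -> broadcast-join
```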
Push vs Pull
Databases typically ‘pull’ using the iterator model
– Avoids buffering
– Can prevent unnecessary computation
But DAG must be fully materialized
– Complicates rewriting
– Prevents resource virtualization in a shared cluster
[Diagram: the distributed reduce plan (G, D, r vertices over partitions S1, S2) used to illustrate the discussion]
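For contrast with the push model, here is a minimal sketch of the database-style pull (iterator) pipeline using Python generators: each operator demands rows from its child on request, so nothing is buffered and work that is never requested is never done. The operators and the example plan are illustrative.

```python
# Sketch of the 'pull' (iterator) model: operators pull rows on demand.

def scan(table):                      # leaf operator
    for row in table:
        yield row

def select(child, predicate):         # filter operator
    for row in child:
        if predicate(row):
            yield row

def limit(child, n):                  # stop pulling after n rows:
    for i, row in enumerate(child):   # upstream work beyond that is never executed
        if i >= n:
            return
        yield row

table = range(1_000_000)
plan = limit(select(scan(table), lambda x: x % 7 == 0), 3)
print(list(plan))                     # only enough of the scan runs to produce 3 rows
```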
Fault tolerance
Buffer data in (some) edges
Re-execute on failure using buffered data
Speculatively re-execute for stragglers
‘Push’ model makes this very simple
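A rough Python sketch of the re-execution idea: because a vertex's inputs are buffered and the vertex is deterministic, a failure is handled by simply running the vertex again on the same inputs. The retry limit and the flaky example vertex are invented for illustration.

```python
# Sketch: re-execute a failed vertex from its buffered inputs.

def run_with_retries(vertex_fn, buffered_inputs, max_attempts=4):
    for attempt in range(1, max_attempts + 1):
        try:
            return vertex_fn(buffered_inputs)     # re-reads the same buffered data
        except Exception as e:                    # e.g. machine lost, process crash
            print(f"attempt {attempt} failed ({e}); re-executing vertex")
    raise RuntimeError("vertex failed on every attempt; fail the job")

attempts = {"n": 0}
def flaky_sum(xs):                                # stand-in for a vertex program
    attempts["n"] += 1
    if attempts["n"] == 1:                        # fail once, then succeed
        raise IOError("transient failure")
    return sum(xs)

print(run_with_retries(flaky_sum, [1, 2, 3, 4]))
```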
Dryad
General-purpose execution engine
– Batch processing on immutable datasets
– Well-tested on large clusters
Automatically handles
– Fault tolerance
– Distribution of code and intermediate data
– Scheduling of work to resources
Dryad System Architecture
[Diagram: a scheduler coordinating work across cluster machines]
Dryad Job Model
Directed acyclic graph (DAG)
Clean abstraction
– Hides cluster services
– Clients manipulate graphs
Flexible and expressive
– General-purpose programs
– Complicated execution plans
Dryad Inputs and Outputs
Partitioned data set
– Records do not cross partition boundaries
– Data on compute machines: NTFS, SQLServer, …
Optional semantics
– Hash-partition, range-partition, sorted, etc.
Loading external data
– Partitioning “automatic”
– File system chooses sensible partition sizes
– Or known partitioning from user
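The two partitioning semantics mentioned above, sketched in Python: a record lands in a partition chosen either by a hash of its key or by comparing its key against range boundaries, and records never cross partition boundaries. The record layout and key function are illustrative.

```python
# Sketch of hash- and range-partitioning of a record set.
import bisect

def hash_partition(records, key, n):
    """Route each record to one of n partitions by hash of its key."""
    parts = [[] for _ in range(n)]
    for r in records:
        parts[hash(key(r)) % n].append(r)
    return parts

def range_partition(records, key, boundaries):
    """boundaries: sorted keys splitting the key space into len(boundaries)+1 ranges."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        parts[bisect.bisect_right(boundaries, key(r))].append(r)
    return parts

records = [("apple", 3), ("kiwi", 1), ("pear", 5), ("banana", 2)]
print(hash_partition(records, key=lambda r: r[0], n=2))
print(range_partition(records, key=lambda r: r[0], boundaries=["c", "m"]))
```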
Channel abstraction
[Diagram: the distributed reduce plan (G, ir, D, r vertices), used to illustrate channels between vertices]
Push vs Pull
Channel types define connected components
– Shared-memory or TCP channels must be gang-scheduled
Pull within a gang, push between gangs
MapReduce (Hadoop)
MapReduce restricts
– Topology of DAG
– Semantics of function in compute vertex
Sequence of instances for non-trivial tasks
[Diagram: the reduce plan from earlier with a map vertex f added per input partition: Si -> f -> G -> ir -> D -> G -> r -> Si’]
MapReduce complexity
Simple to describe MapReduce system
Can be hard to map algorithm to framework
– cf. k-means: combine C+P, broadcast C, iterate, …
– HIVE, PigLatin etc. mitigate programming issues
Implementation not uniform
– Different fault-tolerance for mappers, reducers
– Add more special cases for performance
Hadoop introducing TCP channels, pipelines, …
– Dryad has same state machine everywhere
Discussion
DAG abstraction supports many computations
– Can be targeted by high-level languages!
– Run-time rewriting extends applicability
DAG-structured jobs scale to large clusters
– Over 10k computers in large Dryad clusters
– Transient failures common, disk failures daily
Trade off fault-tolerance against performance
– Buffer vs TCP, still manual choice in Dryad system
– Also external vs in-memory working set
Conclusion
Dryad well-tested, scalable
– Daily use supporting Bing for over 3 years
Applicable to large number of computations
– 250 computer cluster at MSR SVC, Mar->Nov
  Distinct users (~50 lab members + interns)
  15k jobs (tens of millions of processes executed)
  Hundreds of distinct programs
– Network trace analysis, privacy-preserving inference, light-transport simulation, decision-tree training, deep belief network training, image feature extraction, …