1
Distributed computing using Dryad
Michael Isard
Microsoft Research, Silicon Valley
2
Distributed Data-Parallel Computing
Cloud
– Transparent scaling
– Resource virtualization
Commodity clusters
– Fault tolerance with good performance
Workloads beyond standard SQL, HPC
– Data-mining, graph analysis, …
– Semi-structured/unstructured data
3
Execution layer
This talk: system-level middleware
– Yuan Yu will describe the DryadLINQ programming model on Saturday
Algorithm -> execution plan by magic
4
Problem domain
Large inputs
– Tens of GB is a “small test dataset”
– A single job can be up to hundreds of TB
– Semi-structured data is common
Not latency sensitive
– Overhead of seconds even for a trivial job
– A large job could take days
– Batch computation, not online queries
Simplifies fault tolerance, caching, etc.
5
Talk overview
Some typical computations
DAG implementation choices
The Dryad execution engine
Comparison with MapReduce
Discussion
6
Map
Independent transformation of a dataset
– for each x in S, output x' = f(x)
E.g. simple grep for word w
– output line x only if x contains w
7
Map
[Diagram: a single input S passes through f to produce S']
8
Map
[Diagram: partitioned input; each partition S1, S2, S3 passes independently through f to produce S1', S2', S3']
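The map stage above is just an independent per-record transformation, so each partition can be processed with no coordination. A minimal sketch of the grep example in plain Python (not Dryad/DryadLINQ code; the sample partitions and names are made up):

```python
def grep_map(lines, w):
    """f: emit a line x only if it contains the word w."""
    for x in lines:
        if w in x:
            yield x

# Each partition S1, S2, ... is transformed independently, so the stage
# parallelizes trivially across machines.
partitions = [
    ["the quick fox", "lazy dog"],
    ["fox again", "nothing to see"],
]
outputs = [list(grep_map(p, "fox")) for p in partitions]
print(outputs)  # [['the quick fox'], ['fox again']]
```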
9
Reduce
Grouping plus aggregation
– 1) Group x in S according to key selector k(x)
– 2) For each group g, output r(g)
E.g. simple word count
– group by k(x) = x
– for each group g, output the key (word) and the count of g
10
Reduce
[Diagram: S → G (group) → r (reduce) → S']
11
Reduce
[Diagram: S → G → r → S', as above]
12
Reduce
[Diagram: partitioned inputs S1, S2, S3 each pass through D; each output partition S1', S2', S3' is produced by G then r]
D is distribute, e.g. by hash or range
13
Reduce
[Diagram: each input partition passes through G then ir, then D; each output partition is produced by G then r]
ir is initial reduce, e.g. compute a partial sum
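A minimal word-count sketch of the full reduce pipeline built up on slides 9-13 (initial reduce ir per input partition, hash distribute D, final reduce r), in plain Python rather than Dryad code; the helper names and sample data are illustrative:

```python
from collections import Counter, defaultdict

def initial_reduce(partition):
    """ir: group by k(x) = x within one input partition and take partial counts."""
    return Counter(partition)

def distribute(partials, n):
    """D: route each (word, partial count) pair to a bucket by hash of the key."""
    buckets = [[] for _ in range(n)]
    for counts in partials:
        for word, c in counts.items():
            buckets[hash(word) % n].append((word, c))
    return buckets

def final_reduce(bucket):
    """r: sum the partial counts that arrive for each word."""
    totals = defaultdict(int)
    for word, c in bucket:
        totals[word] += c
    return dict(totals)

partitions = [["a", "b", "a"], ["b", "b", "c"]]
partials = [initial_reduce(p) for p in partitions]             # ir on each input partition
results = [final_reduce(b) for b in distribute(partials, 2)]   # D, then r per output partition
print(results)  # total counts a:2, b:3, c:1, split across buckets by the hash
```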
14
K-means
Set of points P, initial set of cluster centres C
Iterate until convergence:
– For each c in C
  Initialize count_c and centre_c to 0
– For each p in P
  Find the c in C that minimizes dist(p,c)
  Update: count_c += 1, centre_c += p
– For each c in C
  Replace c <- centre_c / count_c
15
K-means
[Diagram: cluster centres iterated C0 → C1 → C2 → C3, each step computed from the point set P]
16
K-means
[Diagram: the same iteration with P partitioned into P1, P2, P3; per-partition vertices feed each new set of centres]
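A minimal sketch of the k-means loop from slide 14 in plain Python on 2-D points (not Dryad code; the sample points, the squared-distance metric, and the fixed iteration bound are illustrative choices):

```python
def kmeans_step(P, C):
    """One iteration: accumulate each point into its nearest centre, then average."""
    counts = [0] * len(C)
    sums = [(0.0, 0.0)] * len(C)
    for px, py in P:
        # Find the c in C that minimizes dist(p, c).
        j = min(range(len(C)), key=lambda i: (px - C[i][0]) ** 2 + (py - C[i][1]) ** 2)
        counts[j] += 1
        sums[j] = (sums[j][0] + px, sums[j][1] + py)
    # Replace c <- centre_c / count_c (keep the old centre if its cluster is empty).
    return [(sx / n, sy / n) if n else C[i]
            for i, ((sx, sy), n) in enumerate(zip(sums, counts))]

P = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (4.9, 5.0)]
C = [(0.0, 0.0), (5.0, 5.0)]
for _ in range(10):   # the slide iterates "until convergence"; a fixed bound keeps this short
    C = kmeans_step(P, C)
print(C)              # roughly [(0.05, 0.1), (4.95, 5.05)]
```

In the partitioned plan on slide 16, the per-point accumulation runs independently over each Pi, and the partial sums and counts are then combined to produce the next set of centres.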
17
Graph algorithms
Set N of nodes with data (n,x)
Set E of directed edges (n,m)
Iterate until convergence:
– For each node (n,x) in N
  For each outgoing edge n->m in E, compute an update n_m = f(x,n,m)
– For each node (m,x) in N
  Find the set of incoming updates i_m = {n_m : n->m in E}
  Replace (m,x) <- (m, r(i_m))
E.g. power iteration (PageRank)
18
PageRank
[Diagram: the rank set iterated N0 → N1 → N2 → N3 against the edge set E]
19
PageRank
[Diagram: the same iteration with N and E partitioned (N01..N03, E1..E3); D vertices redistribute updates between iterations]
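A minimal sketch of the per-node update scheme from slide 17 specialized to PageRank power iteration, in plain Python (the damping factor 0.85, the three-node graph, and the fixed iteration count are illustrative, not taken from the slides):

```python
def pagerank_step(ranks, edges, d=0.85):
    """f: node n sends ranks[n]/outdegree(n) along each edge n->m; r: sum the updates."""
    out_degree = {n: 0 for n in ranks}
    for n, _ in edges:
        out_degree[n] += 1
    incoming = {m: 0.0 for m in ranks}
    for n, m in edges:
        incoming[m] += ranks[n] / out_degree[n]       # n_m = f(x, n, m)
    # Replace (m, x) <- (m, r(i_m)), with the usual damping term.
    return {m: (1 - d) / len(ranks) + d * incoming[m] for m in ranks}

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = {n: 1 / 3 for n in "abc"}
for _ in range(20):    # iterate to (approximate) convergence
    ranks = pagerank_step(ranks, edges)
print(ranks)           # the three ranks sum to ~1.0
```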
20
DAG abstraction
Absence of cycles
– Allows re-execution for fault tolerance
– Simplifies scheduling: no deadlock
Cycles can often be replaced by unrolling
– Unsuitable for fine-grain inner loops
Very popular
– Databases, functional languages, …
21
Rewrite graph at runtime
Loop unrolling with convergence tests
Adapt the partitioning scheme at run time
– Choose #partitions based on runtime data volume
– Broadcast join vs. hash join, etc.
Adaptive aggregation and distribution trees
– Based on data skew and network topology
Load balancing
– Data/processing skew (cf. work-stealing)
22
Push vs Pull
Databases typically ‘pull’ using the iterator model
– Avoids buffering
– Can prevent unnecessary computation
But the DAG must be fully materialized
– Complicates rewriting
– Prevents resource virtualization in a shared cluster
[Diagram: the partitioned reduce graph from slide 12 (S1, S2 → D → G → r → S1', S2')]
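To make the contrast concrete, a minimal sketch (plain Python, purely illustrative) of a grep stage written pull-style, where a lazy generator lets the consumer drive the computation, versus push-style, where the producer writes into an edge buffer that can later be replayed:

```python
# Pull: the consumer iterates; records are produced lazily, so nothing is
# buffered and unconsumed records are never computed.
def pull_grep(lines, w):
    return (x for x in lines if w in x)

print(sum(1 for _ in pull_grep(["fox", "dog", "red fox"], "fox")))   # 2

# Push: the producer eagerly pushes matching records downstream. Buffering the
# channel's contents is what makes rewriting and re-execution straightforward.
def push_grep(lines, w, sink):
    for x in lines:
        if w in x:
            sink.append(x)

edge_buffer = []    # a (hypothetical) in-memory edge buffer between two stages
push_grep(["fox", "dog", "red fox"], "fox", edge_buffer)
print(len(edge_buffer))   # 2
```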
23
Fault tolerance
Buffer data in (some) edges
Re-execute on failure using the buffered data
Speculatively re-execute for stragglers
The ‘push’ model makes this very simple
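A minimal sketch of re-execution from buffered channel data, assuming a deterministic vertex function and in-memory buffers (hypothetical helper names and a simulated failure; not Dryad's actual scheduler logic):

```python
def run_vertex(vertex_fn, buffered_inputs, max_attempts=3):
    """Re-run a failed vertex from the same buffered edge data."""
    for attempt in range(1, max_attempts + 1):
        try:
            return vertex_fn(buffered_inputs)     # deterministic: re-reads the same inputs
        except RuntimeError:
            if attempt == max_attempts:
                raise
            # Otherwise, schedule the vertex again (possibly on another machine).

def make_flaky_count(failures_before_success):
    """A vertex function that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}
    def flaky_count(xs):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated transient machine failure")
        return len(xs)
    return flaky_count

print(run_vertex(make_flaky_count(2), ["a", "b", "c"]))   # 3, succeeds on the third attempt
```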
24
Dryad
General-purpose execution engine
– Batch processing on immutable datasets
– Well-tested on large clusters
Automatically handles
– Fault tolerance
– Distribution of code and intermediate data
– Scheduling of work to resources
25
Dryad System Architecture
[Architecture diagram: a scheduler dispatching work across the cluster]
28
Dryad Job Model
Directed acyclic graph (DAG)
Clean abstraction
– Hides cluster services
– Clients manipulate graphs
Flexible and expressive
– General-purpose programs
– Complicated execution plans
29
Dryad Inputs and Outputs
Partitioned data set
– Records do not cross partition boundaries
– Data on compute machines: NTFS, SQL Server, …
Optional semantics
– Hash-partition, range-partition, sorted, etc.
Loading external data
– Partitioning “automatic”
– File system chooses sensible partition sizes
– Or a known partitioning from the user
30
Channel abstraction
[Diagram: the reduce graph from slide 13, with its edges (G → ir → D → G → r) shown as channels]
31
Push vs Pull
Channel types define connected components
– Shared-memory or TCP channels must be gang-scheduled
Pull within a gang, push between gangs
32
MapReduce (Hadoop)
MapReduce restricts
– Topology of the DAG
– Semantics of the function in each compute vertex
A sequence of MapReduce instances is needed for non-trivial tasks
[Diagram: map vertices f on S1, S2, S3 feeding the G → ir → D → G → r reduce graph]
33
MapReduce complexity
The MapReduce system is simple to describe
It can be hard to map an algorithm onto the framework
– cf. k-means: combine C+P, broadcast C, iterate, …
– HIVE, Pig Latin, etc. mitigate the programming issues
Implementation is not uniform
– Different fault tolerance for mappers and reducers
– More special cases added for performance
– Hadoop is introducing TCP channels, pipelines, …
Dryad has the same state machine everywhere
34
Discussion
The DAG abstraction supports many computations
– Can be targeted by high-level languages!
– Run-time rewriting extends applicability
DAG-structured jobs scale to large clusters
– Over 10k computers in large Dryad clusters
– Transient failures common, disk failures daily
Trade off fault tolerance against performance
– Buffer vs TCP: still a manual choice in the Dryad system
– Also external vs in-memory working set
35
Conclusion
Dryad is well-tested and scalable
– In daily use supporting Bing for over 3 years
Applicable to a large number of computations
– On a 250-computer cluster at MSR SVC, Mar->Nov 09:
  47 distinct users (~50 lab members + interns)
  15k jobs (tens of millions of processes executed)
  Hundreds of distinct programs
– Network trace analysis, privacy-preserving inference, light-transport simulation, decision-tree training, deep belief network training, image feature extraction, …