Dryad / DryadLINQ Slides adapted from those of Yuan Yu and Michael Isard.

Slides:



Advertisements
Similar presentations
Distributed Data-Parallel Programming using Dryad Andrew Birrell, Mihai Budiu, Dennis Fetterly, Michael Isard, Yuan Yu Microsoft Research Silicon Valley.
Advertisements

Cluster Computing with Dryad Mihai Budiu, MSR-SVC LiveLabs, March 2008.
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Introduction to Data Center Computing Derek Murray October 2010.
The DryadLINQ Approach to Distributed Data-Parallel Computing
Distributed Data-Parallel Computing Using a High-Level Programming Language Yuan Yu Michael Isard Joint work with: Andrew Birrell, Mihai Budiu, Jon Currey,
epiC: an Extensible and Scalable System for Processing Big Data
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
C# and LINQ Yuan Yu Microsoft Research Silicon Valley.
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
Big Data Platforms Mihai Budiu, Oct My work Ph.D. from Carnegie Mellon, 2003 Hardware synthesis Reconfigurable hardware Compilers and computer.
DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep.
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Optimus: A Dynamic Rewriting Framework for Data-Parallel Execution Plans Qifa Ke, Michael Isard, Yuan Yu Microsoft Research Silicon Valley EuroSys 2013.
DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep.
Distributed Computations
From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.
Distributed computing using Dryad Michael Isard Microsoft Research Silicon Valley.
Large Scale Data Processing with DryadLINQ Dennis Fetterly Microsoft Research, Silicon Valley Workshop on Data-Intensive Scientific Computing Using DryadLINQ.
Cluster Computing with DryadLINQ Mihai Budiu, MSR-SVC PARC, May
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.
MapReduce Simplified Data Processing on Large Clusters Google, Inc. Presented by Prasad Raghavendra.
Distributed Computations MapReduce
MapReduce : Simplified Data Processing on Large Clusters Hongwei Wang & Sihuizi Jin & Yajing Zhang
Cloud Computing Systems Lin Gu Hong Kong University of Science and Technology Oct. 3, 2011 Hadoop, HDFS and Microsoft Cloud Computing Technologies.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
SIDDHARTH MEHTA PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008) INTERESTS: SYSTEMS, WEB.
MapReduce.
Dryad and dataflow systems
Dryad and DryadLINQ Theophilus Benson CS Distributed Data-Parallel Programming using Dryad By Andrew Birrell, Mihai Budiu, Dennis Fetterly, Michael.
Dryad and DryadLINQ Presented by Yin Zhu April 22, 2013 Slides taken from DryadLINQ project page: projects/dryadlinq/default.aspx.
Image Processing Image Processing Windows HPC Server 2008 HPC Job Scheduler Dryad DryadLINQ Machine Learning Graph Analysis Graph Analysis Data Mining.NET.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
Programming clusters with DryadLINQ Mihai Budiu Microsoft Research, Silicon Valley Association of C and C++ Users (ACCU) Mountain View, CA, April 13, 2011.
Cloud Computing Other High-level parallel processing languages Keke Chen.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6.
Training Kinect Mihai Budiu Microsoft Research, Silicon Valley UCSD CNS 2012 RESEARCH REVIEW February 8, 2012.
1 Dryad Distributed Data-Parallel Programs from Sequential Building Blocks Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly of Microsoft.
Pregel: A System for Large-Scale Graph Processing Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and.
MapReduce M/R slides adapted from those of Jeff Dean’s.
SALSASALSASALSASALSA Design Pattern for Scientific Applications in DryadLINQ CTP DataCloud-SC11 Hui Li Yang Ruan, Yuduo Zhou Judy Qiu, Geoffrey Fox.
MATRIX MULTIPLY WITH DRYAD B649 Course Project Introduction.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
Distributed Computations MapReduce/Dryad M/R slides adapted from those of Jeff Dean’s Dryad slides adapted from those of Michael Isard.
Dryad and DryaLINQ. Dryad and DryadLINQ Dryad provides automatic distributed execution DryadLINQ provides automatic query plan generation Dryad provides.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
MATRIX MULTIPLY WITH DRYAD B649 Course Project Introduction.
Definition DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC.
A N I N - MEMORY F RAMEWORK FOR E XTENDED M AP R EDUCE 2011 Third IEEE International Conference on Coud Computing Technology and Science.
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
Large-scale Machine Learning using DryadLINQ Mihai Budiu Microsoft Research, Silicon Valley Ambient Intelligence: From Sensor Networks to Smart Environments.
Jimmy Lin and Michael Schatz Design Patterns for Efficient Graph Algorithms in MapReduce Michele Iovino Facoltà di Ingegneria dell’Informazione, Informatica.
EpiC: an Extensible and Scalable System for Processing Big Data Dawei Jiang, Gang Chen, Beng Chin Ooi, Kian Lee Tan, Sai Wu School of Computing, National.
CS239-Lecture 3 DryadLINQ Madan Musuvathi Visiting Professor, UCLA
Some slides adapted from those of Yuan Yu and Michael Isard
Majd F. Sakr, Suhail Rehman and
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
CSCI5570 Large Scale Data Processing Systems
Distributed Computations MapReduce/Dryad
Parallel Computing with Dryad
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
MapReduce Simplied Data Processing on Large Clusters
湖南大学-信息科学与工程学院-计算机与科学系
DryadInc: Reusing work in large-scale computations
5/7/2019 Map Reduce Map reduce.
Fast, Interactive, Language-Integrated Cluster Computing
COS 518: Distributed Systems Lecture 11 Mike Freedman
Presentation transcript:

Dryad / DryadLINQ Slides adapted from those of Yuan Yu and Michael Isard

Dryad Similar goals as MapReduce –focus on throughput, not latency –Automatic management of scheduling, distribution, fault tolerance Computations expressed as a graph –Vertices are computations –Edges are communication channels –Each vertex has several input and output edges

WordCount in Dryad Count Word:n MergeSort Word:n Count Word:n Distribute Word:n

Why using a dataflow graph? Many programs can be represented as a distributed dataflow graph –The programmer may not have to know this “SQL-like” queries: LINQ Dryad will run them for you

Runtime Job Manager (JM) consults name server(NS) to discover available machines. JM maintains job graph and schedules vertices Daemon process (D) executes vertices Vertices (V) run arbitrary app code Vertices exchange data through files, TCP pipes etc. Vertices communicate with JM to report status

Job = Directed Acyclic Graph Processing vertices Channels (file, pipe, shared memory) Inputs Outputs

Scheduling at JM General scheduling rules: –Vertex can run anywhere once all its inputs are ready Prefer executing a vertex near its inputs –Fault tolerance If A fails, run it again If A’s inputs are gone, run upstream vertices again (recursively) If A is slow, run another copy elsewhere and use output from whichever finishes first

Advantages of DAG over MapReduce Big jobs more efficient with Dryad –MapReduce: big job runs >=1 MR stages reducers of each stage write to replicated storage Output of reduce: 2 network copies, 3 disks –Dryad: each job is represented with a DAG intermediate vertices write to local file

Advantages of DAG over MapReduce Dryad provides explicit join –MapReduce: mapper (or reducer) needs to read from shared table(s) as a substitute for join –Dryad: explicit join combines inputs of different types Dryad “Split” produces outputs of different types –Parse a document, output text and references

DAG optimizations: merge tree

Dryad Optimizations: data- dependent re-partitioning Distribute to equal-sized ranges Sample to estimate histogram Randomly partitioned inputs

Dryad example 1: SkyServer Query 3-way join to find gravitational lens effect Table U: (objId, color) 11.8GB Table N: (objId, neighborId) 41.8GB Find neighboring stars with similar colors: –Join U+N to find T = N.neighborID where U.objID = N.objID, U.color –Join U+T to find U.objID where U.objID = T.neighborID and U.color ≈ T.color

SkyServer query u: objid, color n: objid, neighborobjid [partition by objid] select u.color,n.neighborobjid from u join n where u.objid = n.objid

(u.color,n.neighborobjid) [re-partition by n.neighborobjid] [order by n.neighborobjid] [distinct] [merge outputs] select u.objid from u join where u.objid =.neighborobjid and |u.color -.color| < d

Dryad example 2: Query histogram computation Input: log file (n partitions) Extract queries from log partitions Re-partition by hash of query (k buckets) Compute histogram within each bucket

Naïve histogram topology Pparse lines D hash distribute S quicksort C count occurrences MSmerge sort

Efficient histogram topology Pparse lines D hash distribute S quicksort C count occurrences MSmerge sort M non-deterministic merge Q' is:Each R is: Each MS C M P C S Q' RR k T k n T is: Each MS D C

RR T Q’ MS►C►D M►P►S►C MS►C Pparse linesDhash distribute SquicksortMSmerge sort Ccount occurrencesMnon-deterministic merge R

MS►C►D M►P►S►C MS►C Pparse linesDhash distribute SquicksortMSmerge sort Ccount occurrencesMnon-deterministic merge RR T R Q’

MS►C►D M►P►S►C MS►C Pparse linesDhash distribute SquicksortMSmerge sort Ccount occurrencesMnon-deterministic merge RR T R Q’ T

MS►C►D M►P►S►C MS►C Pparse linesDhash distribute SquicksortMSmerge sort Ccount occurrencesMnon-deterministic merge RR T R Q’ T

Pparse linesDhash distribute SquicksortMSmerge sort Ccount occurrencesMnon-deterministic merge MS►C►D M►P►S►C MS►C RR T R Q’ T

Pparse linesDhash distribute SquicksortMSmerge sort Ccount occurrencesMnon-deterministic merge MS►C►D M►P►S►C MS►C RR T R Q’ T

Final histogram refinement 1,800 computers 43,171 vertices 11,072 processes 11.5 minutes

DryadLINQ

LINQ: Relational queries integrated in C# More general than distributed SQL –Inherits flexible C# type system and libraries –Data-clustering, EM, inference, … Uniform data-parallel programming model –From SMP to clusters

LINQ Collection collection; bool IsLegal(Key); string Hash(Key); var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value};

Collection collection; bool IsLegal(Key k); string Hash(Key); var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value}; DryadLINQ = LINQ + Dryad C# collection results C# Vertex code Query plan (Dryad job) Data

Dryad DryadLINQ System Architecture DryadLINQ Client machine (11) Distributed query plan.NET program Query Expr Cluster Output Tables Results Input Tables Invoke Query plan Output Table Dryad Execution.Net Objects ToTable foreach Vertex code

DryadLINQ example: PageRank PageRank scores web pages using the hyperlink graph To compute the pagerank of (i+1)-th iteration: A page u’s score is contributed by all neighboring pages v that link to it The contribution of v is its pagerank normalized by the number of outgoing links

DryadLINQ example: PageRank DryadLINQ express each iteration as a SQL query 1.Join pages with ranks 2.Distribute ranks on outgoing edges 3.GroupBy edge destination 4.Aggregate into ranks 5.Repeat

One PageRank Step in DryadLINQ // one step of pagerank: dispersing and re-accumulating rank public static IQueryable PRStep(IQueryable pages, IQueryable ranks) { // join pages with ranks, and disperse updates var updates = from page in pages join rank in ranks on page.name equals rank.name select page.Disperse(rank); // re-accumulate. return from list in updates from rank in list group rank.rank by rank.name into g select new Rank(g.Key, g.Sum()); }

The Complete PageRank Program var pages = PartitionedTable.Get (“dfs://pages.txt”); var ranks = pages.Select(page => new Rank(page.name, 1.0)); // repeat the iterative computation several times for (int iter = 0; iter < n; iter++) { ranks = PRStep(pages, ranks); } ranks.ToPartitionedTable (“dfs://outputranks.txt”); public struct Page { public UInt64 name; public Int64 degree; public UInt64[] links; public Page(UInt64 n, Int64 d, UInt64[] l) { name = n; degree = d; links = l; } public Rank[] Disperse(Rank rank) { Rank[] ranks = new Rank[links.Length]; double score = rank.rank / this.degree; for (int i = 0; i < ranks.Length; i++) { ranks[i] = new Rank(this.links[i], score); } return ranks; } } public struct Rank { public UInt64 name; public double rank; public Rank(UInt64 n, double r) { name = n; rank = r; } } public static IQueryable PRStep(IQueryable pages, IQueryable ranks) { // join pages with ranks, and disperse updates var updates = from page in pages join rank in ranks on page.name equals rank.name select page.Disperse(rank); // re-accumulate. return from list in updates from rank in list group rank.rank by rank.name into g select new Rank(g.Key, g.Sum()); }

Multi-Iteration PageRank pagesranks Iteration 1 Iteration 2 Iteration 3 Memory FIFO

Lessons of Dryad/DryadLINQ Acyclic dataflow graph is a powerful computation model Language integration increases programmer productivity Decoupling of Dryad and DryadLINQ –Dryad: execution engine (given DAG, do scheduling and fault tolerance) –DryadLINQ: programming model (given query, generate DAG)