DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing


DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
Microsoft Research Silicon Valley

Distributed Data-Parallel Computing
Research problem: How to write distributed data-parallel programs for a compute cluster?
The DryadLINQ programming model
– Sequential, single machine programming abstraction
– Same program runs on single-core, multi-core, or cluster
– Familiar programming languages
– Familiar development environment

DryadLINQ Overview
Automatic query plan generation by DryadLINQ
Automatic distributed execution by Dryad

LINQ
Microsoft's Language INtegrated Query
– Available in Visual Studio products
A set of operators to manipulate datasets in .NET
– Support traditional relational operators: Select, Join, GroupBy, Aggregate, etc.
– Integrated into .NET programming languages
  Programs can call operators
  Operators can invoke arbitrary .NET functions
Data model
– Data elements are strongly typed .NET objects
– Much more expressive than SQL tables
Highly extensible
– Add new custom operators
– Add new execution providers
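As a rough illustration of the points above (operators over strongly typed objects, with an operator invoking an ordinary .NET function), here is a minimal LINQ-to-Objects sketch. The tuple shape, `WithBonus`, and `PayrollByDept` are hypothetical names chosen for this example and do not come from the slides.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqOperatorsSketch
{
    // An ordinary .NET method; LINQ operators can call it like any other function.
    // The bonus amount is an arbitrary value for illustration.
    public static int WithBonus(int salary) => salary + 500;

    // Select applies the custom function, GroupBy groups by department,
    // and Sum aggregates each group. Elements stay strongly typed throughout.
    public static Dictionary<string, int> PayrollByDept(
        IEnumerable<(string Name, string Dept, int Salary)> staff) =>
        staff.Select(e => (Name: e.Name, Dept: e.Dept, Salary: WithBonus(e.Salary)))
             .GroupBy(e => e.Dept)
             .ToDictionary(g => g.Key, g => g.Sum(e => e.Salary));
}
```

Calling `LinqOperatorsSketch.PayrollByDept(staff)` on an in-memory collection returns one total per department; nothing in the query itself is specific to a single machine.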

LINQ System Architecture
[Figure: a .NET program (C#, VB, F#, etc.) on the local machine passes query objects through the LINQ provider interface to an execution engine: LINQ-to-Obj, PLINQ, LINQ-to-SQL, or DryadLINQ. Scalability axis: single-core, multi-core, cluster.]

Dryad System Architecture
[Figure: the job manager and the cluster (labels NS, PD, V). Control plane: job schedule. Data plane: files, TCP, FIFO, network.]

A Simple LINQ Example: Word Count
Count word frequency in a set of documents:
    var docs = [A collection of documents];
    var words = docs.SelectMany(doc => doc.words);
    var groups = words.GroupBy(word => word);
    var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
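The word-count query can be tried on one machine with LINQ-to-Objects. The sketch below is one possible runnable version, assuming each document is a plain string split on spaces; the slide's `doc.words` and `WordCount` types are not defined in the transcript, so a dictionary stands in for the result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountSketch
{
    // Same operator chain as the slide: SelectMany -> GroupBy -> Select (here ToDictionary).
    public static Dictionary<string, int> Count(IEnumerable<string> docs)
    {
        // Assumption: a document is a string and words are separated by spaces.
        var words = docs.SelectMany(
            doc => doc.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
        var groups = words.GroupBy(word => word);
        return groups.ToDictionary(g => g.Key, g => g.Count());
    }
}
```

For example, `WordCountSketch.Count(new[] { "a b a", "b c" })` maps "a" to 2, "b" to 2, and "c" to 1.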

Word Count in DryadLINQ
Count word frequency in a set of documents:
    var docs = DryadLinq.GetTable("file://docs.txt");
    var words = docs.SelectMany(doc => doc.words);
    var groups = words.GroupBy(word => word);
    var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
    counts.ToDryadTable("counts.txt");

Distributed Execution of Word Count
[Figure: DryadLINQ turns the LINQ expression IN -> SM -> GB -> S -> OUT into a Dryad execution.]

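DryadLINQ System Architecture
[Figure: on the client machine, a .NET program hands a query expression to DryadLINQ (via ToTable or foreach), which produces a distributed query plan and vertex code and invokes the query. In the data center, the JM runs the Dryad execution over the input tables and writes the output tables. Results return to the program as an output DryadTable of .NET objects.]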

DryadLINQ Internals
Distributed execution plan
– Static optimizations: pipelining, eager aggregation, etc.
– Dynamic optimizations: data-dependent partitioning, dynamic aggregation, etc.
Automatic code generation
– Vertex code that runs on vertices
– Channel serialization code
– Callback code for runtime optimizations
– Automatically distributed to cluster machines
Separate LINQ query from its local context
– Distribute referenced objects to cluster machines
– Distribute application DLLs to cluster machines

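Execution Plan for Word Count
[Figure: (1) the logical plan SM -> GB -> S expands into the operator sequence SM (SelectMany), Q (sort), GB (groupby), C (count), D (distribute), MS (mergesort), GB (groupby), Sum, with pipelined operators marked.]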

Execution Plan for Word Count
[Figure: (1) SM -> GB -> S expands into SM, Q, GB, C, D, MS, GB, Sum; (2) that sequence is replicated as several parallel copies, one per partition.]
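The stages named in the plan (SelectMany, sort, groupby, count, distribute, mergesort, groupby, Sum) can be imitated in a single process to see why the partial count before the distribute step is safe. This is only a local simulation under assumptions: partitions are in-memory collections of space-separated lines, `Bucket` is a made-up stand-in for the hash used by the distribute step, and none of the names come from DryadLINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountPlanSketch
{
    // First half of the plan, run once per input partition:
    // SM (SelectMany), Q (sort), GB (groupby), C (partial count).
    public static List<KeyValuePair<string, int>> PartialCount(IEnumerable<string> partition) =>
        partition.SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                 .OrderBy(w => w, StringComparer.Ordinal)
                 .GroupBy(w => w)
                 .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
                 .ToList();

    // D (distribute): hypothetical deterministic hash that routes a word to one consumer.
    private static int Bucket(string word, int consumers) => word.Sum(c => (int)c) % consumers;

    // Second half, run once per consumer: MS (merge sorted partials), GB (groupby), Sum.
    public static Dictionary<string, int> Run(IEnumerable<IEnumerable<string>> partitions, int consumers)
    {
        var partials = partitions.Select(p => PartialCount(p)).ToList();
        var result = new Dictionary<string, int>();
        for (int r = 0; r < consumers; r++)
        {
            var merged = partials
                .SelectMany(p => p.Where(kv => Bucket(kv.Key, consumers) == r))
                .OrderBy(kv => kv.Key, StringComparer.Ordinal)
                .GroupBy(kv => kv.Key, kv => kv.Value)
                .Select(g => new KeyValuePair<string, int>(g.Key, g.Sum()));
            // Each word hashes to exactly one consumer, so keys never collide here.
            foreach (var kv in merged) result[kv.Key] = kv.Value;
        }
        return result;
    }
}
```

Because every occurrence of a word is routed to the same consumer, summing the partial counts gives the same totals as counting all words in one place, while less data crosses the distribute step.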

MapReduce in DryadLINQ
    MapReduce(source,        // sequence of Ts
              mapper,        // T -> Ms
              keySelector,   // M -> K
              reducer)       // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;       // sequence of Rs
    }
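The slide gives the MapReduce signature in abbreviated form. A compilable version over `IEnumerable<T>` might look like the following sketch; the generic parameter names follow the slide's comments, while the use of `IEnumerable` and `IGrouping` (rather than whatever types the DryadLINQ version takes) and the `WordCount` helper are assumptions made so the example runs locally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapReduceSketch
{
    // Three LINQ operators are enough to express MapReduce.
    public static IEnumerable<R> MapReduce<T, M, K, R>(
        this IEnumerable<T> source,                      // sequence of Ts
        Func<T, IEnumerable<M>> mapper,                  // T -> Ms
        Func<M, K> keySelector,                          // M -> K
        Func<IGrouping<K, M>, IEnumerable<R>> reducer)   // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;                                   // sequence of Rs
    }

    // Hypothetical usage: word count expressed through MapReduce.
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines) =>
        lines.MapReduce(
                 line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries),
                 word => word,
                 g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) })
             .ToDictionary(kv => kv.Key, kv => kv.Value);
}
```

The reducer returns a sequence, so a key may emit zero, one, or many results, which is why the last step is `SelectMany` rather than `Select`.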

Map-Reduce Plan (when reduce is combiner-enabled)
[Figure: each input partition runs M (map), Q (sort), G1 (groupby), C (combine), D (distribute); the consumers run MS (mergesort), G2 (groupby), R (reduce). Dynamic aggregation inserts additional mergesort, groupby, reduce stages between the two.]

An Example: PageRank
Ranks web pages by propagating scores along hyperlink structure
Each iteration as an SQL query:
1. Join edges with ranks
2. Distribute ranks on edges
3. GroupBy edge destination
4. Aggregate into ranks
5. Repeat

One PageRank Step in DryadLINQ
    // one step of pagerank: dispersing and re-accumulating rank
    public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks)
    {
        // join pages with ranks, and disperse updates
        var updates = from page in pages
                      join rank in ranks on page.name equals rank.name
                      select page.Disperse(rank);

        // re-accumulate.
        return from list in updates
               from rank in list
               group rank.rank by rank.name into g
               select new Rank(g.Key, g.Sum());
    }

The Complete PageRank Program
    var pages = DryadLinq.GetTable<Page>("file://pages.txt");
    var ranks = pages.Select(page => new Rank(page.name, 1.0));

    // repeat the iterative computation several times
    for (int iter = 0; iter < iterations; iter++)
    {
        ranks = PRStep(pages, ranks);
    }
    ranks.ToDryadTable("outputranks.txt");

    public struct Page
    {
        public UInt64 name;
        public Int64 degree;
        public UInt64[] links;

        public Page(UInt64 n, Int64 d, UInt64[] l)
        {
            name = n; degree = d; links = l;
        }

        public Rank[] Disperse(Rank rank)
        {
            Rank[] ranks = new Rank[links.Length];
            double score = rank.rank / this.degree;
            for (int i = 0; i < ranks.Length; i++)
            {
                ranks[i] = new Rank(this.links[i], score);
            }
            return ranks;
        }
    }

    public struct Rank
    {
        public UInt64 name;
        public double rank;

        public Rank(UInt64 n, double r)
        {
            name = n; rank = r;
        }
    }

    public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks)
    {
        // join pages with ranks, and disperse updates
        var updates = from page in pages
                      join rank in ranks on page.name equals rank.name
                      select page.Disperse(rank);

        // re-accumulate.
        return from list in updates
               from rank in list
               group rank.rank by rank.name into g
               select new Rank(g.Key, g.Sum());
    }
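To experiment with the program without a cluster, the same query can be run over in-memory collections. The sketch below keeps the `Page`, `Rank`, and `PRStep` definitions from the slide but swaps `IQueryable` for `IEnumerable` and replaces the DryadLINQ table calls with a hypothetical `Run` method; those substitutions are assumptions for local testing, not part of the original program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public struct Rank
{
    public UInt64 name;
    public double rank;
    public Rank(UInt64 n, double r) { name = n; rank = r; }
}

public struct Page
{
    public UInt64 name;
    public Int64 degree;
    public UInt64[] links;
    public Page(UInt64 n, Int64 d, UInt64[] l) { name = n; degree = d; links = l; }

    // Split this page's rank evenly across its outgoing links.
    public Rank[] Disperse(Rank rank)
    {
        Rank[] ranks = new Rank[links.Length];
        double score = rank.rank / this.degree;
        for (int i = 0; i < ranks.Length; i++)
        {
            ranks[i] = new Rank(this.links[i], score);
        }
        return ranks;
    }
}

public static class PageRankLocalSketch
{
    // Same query as the slide's PRStep, evaluated by LINQ-to-Objects.
    public static IEnumerable<Rank> PRStep(IEnumerable<Page> pages, IEnumerable<Rank> ranks)
    {
        var updates = from page in pages
                      join rank in ranks on page.name equals rank.name
                      select page.Disperse(rank);

        return from list in updates
               from rank in list
               group rank.rank by rank.name into g
               select new Rank(g.Key, g.Sum());
    }

    // Hypothetical local driver: start every page at rank 1.0 and iterate.
    public static List<Rank> Run(IEnumerable<Page> pages, int iterations)
    {
        var pageList = pages.ToList();
        var ranks = pageList.Select(page => new Rank(page.name, 1.0)).ToList();
        for (int iter = 0; iter < iterations; iter++)
        {
            // Materialize each step so the next iteration reads concrete values.
            ranks = PRStep(pageList, ranks).ToList();
        }
        return ranks;
    }
}
```

On a three-page graph where page 1 links to pages 2 and 3, page 2 links to page 3, and page 3 links to page 1, one iteration gives page 1 a rank of 1.0, page 2 a rank of 0.5, and page 3 a rank of 1.5. Note that, as written on the slide, a page with no incoming links receives no rank entry after the first step.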

One Iteration PageRank
[Figure: per-partition operator pipeline, replicated in parallel:
J: Join pages and ranks
S: Disperse page's rank
G: Group rank by page
C: Accumulate ranks, partially
D: Hash distribute
M: Merge the data
G: Group rank by page
R: Accumulate ranks
Dynamic aggregation adds further M, G, R stages.]

Multi-Iteration PageRank
[Figure: pages and ranks feed Iteration 1, Iteration 2, and Iteration 3 in sequence; labels: memory FIFO.]

LINQ System Architecture
[Figure: a .NET program (C#, VB, F#, etc.) on the local machine passes query objects through the LINQ provider interface to an execution engine: LINQ-to-Obj, PLINQ, LINQ-to-SQL, or DryadLINQ. Scalability axis: single-core, multi-core, cluster.]

Combining with PLINQ
[Figure: Query -> DryadLINQ -> subquery -> PLINQ.]
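The transcript does not show how DryadLINQ hands a subquery to PLINQ, so the following is only a sketch of what a PLINQ subquery looks like on one multi-core machine. `AsParallel` is the standard PLINQ entry point; `IsPrime` and `CountPrimes` are hypothetical names for this example.

```csharp
using System;
using System.Linq;

public static class PlinqSketch
{
    // A CPU-bound predicate, so spreading work across cores is worthwhile.
    public static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int i = 2; (long)i * i <= n; i++)
        {
            if (n % i == 0) return false;
        }
        return true;
    }

    // AsParallel switches the rest of the query to PLINQ, which partitions
    // the range across the available cores and combines the partial counts.
    public static int CountPrimes(int upTo) =>
        Enumerable.Range(1, upTo).AsParallel().Count(IsPrime);
}
```

The query text is the same as its sequential form apart from the `AsParallel` call, which is the property that lets a cluster-level plan delegate per-machine work to a multi-core engine.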

Combining with LINQ-to-SQL
[Figure: Query -> DryadLINQ -> subquery -> LINQ-to-SQL.]

Combining with LINQ-to-Objects
[Figure: the same query runs through LINQ-to-Object on the local machine for debugging and through DryadLINQ on the cluster for production.]
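One way to picture the debug path is to write the query once against `IQueryable<T>` and feed it an in-memory sample through `AsQueryable()`, which evaluates it with LINQ-to-Objects. This is a sketch under assumptions: `LongWordCounts` and `RunLocally` are hypothetical names, and the production side (passing a DryadLINQ table as the `IQueryable` source) is not shown because its API does not appear in enough detail in the transcript.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalDebugSketch
{
    // The query is written once against IQueryable, so it does not depend on
    // which LINQ provider ends up executing it.
    public static IQueryable<KeyValuePair<string, int>> LongWordCounts(
        IQueryable<string> words, int minLength) =>
        words.Where(w => w.Length >= minLength)
             .GroupBy(w => w)
             .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()));

    // Debug path: wrap a small in-memory sample with AsQueryable() and run it locally.
    public static List<KeyValuePair<string, int>> RunLocally(
        IEnumerable<string> sample, int minLength) =>
        LongWordCounts(sample.AsQueryable(), minLength)
            .OrderBy(kv => kv.Key)
            .ToList();
}
```

Stepping through `RunLocally` in a debugger on a small sample exercises the same query logic that would later run on full data.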

Current Status
Works with any LINQ enabled language
– C#, VB, F#, IronPython, …
Works with multiple storage systems
– NTFS, SQL, Windows Azure, Cosmos DFS
Released internally within Microsoft
– Used on a variety of applications
External academic release announced at PDC
– DryadLINQ in source, Dryad in binary
– UW, UCSD, Indiana, ETH, Cambridge, …

Software Stack
[Figure: layered stack. Applications: image processing, machine learning, graph analysis, data mining, …, other applications. Languages: DryadLINQ, other languages. Execution: Dryad. Cluster: cluster services, Azure Platform, Windows Server. Storage: CIFS/NTFS, SQL Servers, Cosmos DFS.]

Lessons
Deep language integration worked out well
– Easy expression of massive parallelism
– Elegant, unified data model based on .NET objects
– Multiple language support: C#, VB, F#, …
– Visual Studio and .NET libraries
– Interoperate with PLINQ, LINQ-to-SQL, LINQ-to-Object, …
Key enablers
– Language side
  LINQ extensibility: custom operators/providers
  .NET reflection, dynamic code generation, …
– System side
  Dryad generality: DAG model, runtime callback
  Clean separation of Dryad and DryadLINQ

Future Directions
Goal: Use a cluster as if it is a single computer
– Dryad/DryadLINQ represent a modest step
On-going research
– What can we write with DryadLINQ?
  Where and how to generalize the programming model?
– Performance, usability, etc.
  How to debug/profile/analyze DryadLINQ apps?
– Job scheduling
  How to schedule/execute N concurrent jobs?
– Caching and incremental computation
  How to reuse previously computed results?
– Static program checking
  A very compelling case for program analysis?
  Better catch bugs statically than fighting them in the cloud?

Conclusions
A powerful, elegant programming environment for large-scale data-parallel computing
To request a copy of Dryad/DryadLINQ, contact
– For academic use only
See a demo of the system at the poster session!