Definition
DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data-parallel applications that run on large PC clusters. The goal of DryadLINQ is to make distributed computing on large compute clusters simple enough for every programmer. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).

The Evolution of DryadLINQ
Dryad had its roots in an idea developed in October 2004 by Isard.
Yu recognized the potential of LINQ to serve as the front-end programming tool for Dryad, and started the DryadLINQ project in September 2006.
By early 2008, the Dryad/DryadLINQ combination was made available within Microsoft.
The DryadLINQ research paper won a best-paper award in 2008 at the eighth USENIX Symposium on Operating Systems Design and Implementation.

Current Status
Works with any LINQ-enabled language
  – C#, VB, F#, IronPython, …
Works with multiple storage systems
  – NTFS, SQL, Windows Azure, Cosmos DFS
Released internally within Microsoft
  – Used on a variety of applications
External academic release announced at PDC
  – DryadLINQ in source, Dryad in binary
  – UW, UCSD, Indiana, ETH, Cambridge, …

Dryad Definition
Dryad is an infrastructure that allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

The Structure of Dryad Jobs

Dryad Services
An API to create distributed applications (jobs) by specifying which processes have to be executed and the communication channels linking them.
Scheduling of the processes on the cluster machines.
Fault tolerance through re-execution of processes after transient failures.
Monitoring of the computation and statistics collection.
Job visualization.
An API for run-time resource management policies.
Support for efficient bulk data transfer between processes.

Software Stack
[Figure: software stack diagram. Components shown: applications (image processing, machine learning, graph analysis, data mining, other applications); DryadLINQ and other languages; Dryad; cluster services, Azure platform, Cosmos DFS, SQL Servers, CIFS/NTFS; Windows Server.]

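Dryad System Architecture
[Figure: the job manager schedules the job on the cluster over the control plane (components labeled NS and PD); the vertices (V) exchange data over the data plane through files, TCP, FIFO, and the network.]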

LINQ Framework
[Figure: a .NET program (C#, VB, F#, etc.) on the local machine produces query objects that pass through the LINQ provider interface to the execution engines: LINQ-to-SQL, LINQ-to-XML, PLINQ, and DryadLINQ. Scalability axis: single-core, multi-core, cluster.]
Extremely open and extensible

DryadLINQ Operators
Operators present in LINQ which are implemented by DryadLINQ.
Adaptations of operators present in LINQ which return scalar values (i.e., not IQueryable), but which are modified to return an IQueryable instead. For example, Count returns an integer, while CountAsQueryable returns an IQueryable whose actual contents will be a single integer. The AsQueryable variants can be chained together to produce complex queries, while using the scalar variants would require breaking queries into small sub-queries, which could decrease efficiency.
New operators, which exist only in DryadLINQ. We have added new operators which cannot be synthesized efficiently from compositions of primitive LINQ operators, and which can substantially improve the performance of queries in the context of a distributed execution environment like Dryad.

Combining with LINQ-to-SQL
[Figure: a DryadLINQ query containing subqueries that are handed to LINQ-to-SQL.]

DryadLINQ and LINQ
C# and LINQ data objects become distributed partitioned files.
LINQ queries become distributed Dryad jobs.
C# methods become code running on the vertices of a Dryad job.

DryadLINQ representation

DryadLINQ Features
Declarative programming: computations are expressed in a high-level language similar to SQL.
Automatic parallelization: from sequential declarative code the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine, DryadLINQ relies on the PLINQ parallelization framework.
Integration with Visual Studio: programmers in DryadLINQ take advantage of the comprehensive VS set of tools: IntelliSense, code refactoring, integrated debugging, build, source code management.
Integration with .NET: all .NET libraries, including Visual Basic, and dynamic languages are available.
Type safety: distributed computations are statically type-checked.
Automatic serialization: data transport mechanisms automatically handle all .NET object types.
Job graph optimizations
  – static: a rich set of term-rewriting query optimization rules is applied to the query plan, optimizing locality and improving performance.
  – dynamic: run-time query plan optimizations automatically adapt the plan, taking into account the statistics of the data set processed.
Conciseness: the following line of code is a complete implementation of the Map-Reduce computation framework in DryadLINQ:
    public static IQueryable<R> MapReduce<S,M,K,R>(this IQueryable<S> source,
        Expression<Func<S,IEnumerable<M>>> mapper,
        Expression<Func<M,K>> keySelector,
        Expression<Func<K,IEnumerable<M>,R>> reducer)
    {
        return source.SelectMany(mapper).GroupBy(keySelector, reducer);
    }
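To make the conciseness claim concrete, here is a minimal sketch that exercises a MapReduce method of the same shape on an in-memory sequence. It assumes ordinary LINQ to Objects (via AsQueryable) in place of a Dryad cluster, so it shows the programming model only, not distributed execution. The class name MapReduceSketch and the helper WordCount are hypothetical names chosen for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class MapReduceSketch
{
    // Same shape as the one-line MapReduce on the slide:
    // map every element, group the mapped values by key, reduce each group.
    public static IQueryable<R> MapReduce<S, M, K, R>(
        this IQueryable<S> source,
        Expression<Func<S, IEnumerable<M>>> mapper,
        Expression<Func<M, K>> keySelector,
        Expression<Func<K, IEnumerable<M>, R>> reducer)
    {
        return source.SelectMany(mapper).GroupBy(keySelector, reducer);
    }

    // Hypothetical helper: word count expressed as a MapReduce call.
    // Assumption: runs in memory; no Dryad job is created here.
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines)
    {
        return lines.AsQueryable()
            .MapReduce(
                line => line.Split(new[] { ' ' }),   // map: line -> words
                word => word,                        // key: the word itself
                (word, occurrences) =>               // reduce: count per word
                    new KeyValuePair<string, int>(word, occurrences.Count()))
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    public static void Main()
    {
        var counts = WordCount(new[] { "a b a", "b c" });
        foreach (var kv in counts.OrderBy(kv => kv.Key))
            Console.WriteLine($"{kv.Key}: {kv.Value}");
        // prints:
        // a: 2
        // b: 2
        // c: 1
    }
}
```

Because the mapper, key selector, and reducer are passed as expression trees rather than compiled delegates, a provider such as DryadLINQ can inspect them before deciding how to run the query.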

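DryadLINQ System Architecture
[Figure: on the client machine, a .NET program builds a query expression and invokes the query through DryadLINQ (ToTable, foreach). In the data center, the job manager (JM) runs the distributed query plan and vertex code as a Dryad execution over the input tables and writes the output tables. The results come back to the program as an output DryadTable of .NET objects. Steps are numbered in the original diagram.]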

A query provider translates IQueryable objects to a suitable format and ships them to a remote execution engine. It also transforms the remote data into C# objects. DryadLINQ is just one instance of such a provider; it interfaces with the Dryad remote execution framework.
[Figure: Query Provider. A local program invokes a query; the provider transforms the query object and sends it for remote execution; the resulting data are transformed back into C# objects and returned to the program. Steps are numbered (1) to (12) in the original diagram.]
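What makes this translation possible is that an IQueryable carries its query as an expression tree that a provider can walk before anything runs. The sketch below assumes the built-in in-memory provider (AsQueryable) rather than DryadLINQ's own, and the class QueryProviderSketch and its Operators method are hypothetical names for this illustration. It only lists the operators recorded in the tree, which is the raw material a real provider would translate into a remote plan.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class QueryProviderSketch
{
    // Walks a query's expression tree from the outermost operator inward
    // and returns the operator names in that order.
    public static List<string> Operators(IQueryable query)
    {
        var names = new List<string>();
        Expression node = query.Expression;
        // Queryable operators are static methods whose first argument
        // is the expression for the upstream query.
        while (node is MethodCallExpression call)
        {
            names.Add(call.Method.Name);
            node = call.Arguments[0];
        }
        return names;
    }

    public static void Main()
    {
        IQueryable<int> source = new[] { 1, 2, 3, 4 }.AsQueryable();
        IQueryable<int> query = source.Where(x => x % 2 == 0).Select(x => x * 10);

        // Nothing has executed yet; the query is still a description.
        Console.WriteLine(string.Join(",", Operators(query)));   // prints "Select,Where"

        // Enumerating the query asks the provider to execute it.
        Console.WriteLine(string.Join(",", query));               // prints "20,40"
    }
}
```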

Execution stages of a Dryad Job

Partitioned File Structure

Reductions (Aggregations)
    var result = input.Aggregate((x,y) => x+y);

    [Associative] int Add(int x, int y);
    var sum = input.Aggregate((x,y) => Add(x,y));
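The reason associativity matters can be sketched in memory: when the function is associative, each partition can be aggregated on its own and the partial results combined afterwards, with the same answer as a sequential pass. The example below is an illustration under that assumption using plain LINQ to Objects; the class PartialAggregation and the method AggregatePartitioned are hypothetical and do not reproduce DryadLINQ's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartialAggregation
{
    public static int Add(int x, int y) => x + y;

    // Aggregates each non-empty partition independently, then combines
    // the partial results with the same function.
    // Throws InvalidOperationException if every partition is empty.
    public static int AggregatePartitioned(IEnumerable<int[]> partitions, Func<int, int, int> op)
    {
        return partitions
            .Where(p => p.Length > 0)
            .Select(p => p.Aggregate(op))
            .Aggregate(op);
    }

    public static void Main()
    {
        var partitions = new[] { new[] { 1, 2, 3 }, new[] { 4, 5 }, new[] { 6 } };
        var all = partitions.SelectMany(p => p).ToArray();

        // Associative function: both evaluation orders agree.
        Console.WriteLine(all.Aggregate(Add));                        // prints 21
        Console.WriteLine(AggregatePartitioned(partitions, Add));     // prints 21

        // Non-associative function: the partitioned result differs,
        // so it is not safe to aggregate it per partition.
        Func<int, int, int> sub = (x, y) => x - y;
        Console.WriteLine(all.Aggregate(sub));                        // prints -19
        Console.WriteLine(AggregatePartitioned(partitions, sub));     // prints -9
    }
}
```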

Apply
The Select delegate receives each element individually, while the Apply delegate receives the whole stream.
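The difference can be sketched with an in-memory stand-in. The Apply extension method below is a hypothetical analogue written for IEnumerable, not DryadLINQ's operator; it only illustrates that a whole-stream delegate can compute things that depend on neighboring elements, such as consecutive differences, which a per-element Select delegate cannot see.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ApplySketch
{
    // Hypothetical in-memory analogue of Apply: the delegate is handed
    // the entire stream instead of one element at a time.
    public static IEnumerable<R> Apply<T, R>(
        this IEnumerable<T> source,
        Func<IEnumerable<T>, IEnumerable<R>> streamFunction)
    {
        return streamFunction(source);
    }

    // A computation that needs the previous element, so it cannot be
    // written as a per-element Select delegate.
    public static IEnumerable<int> Differences(IEnumerable<int> stream)
    {
        bool first = true;
        int previous = 0;
        foreach (int x in stream)
        {
            if (!first) yield return x - previous;
            previous = x;
            first = false;
        }
    }

    public static void Main()
    {
        var values = new[] { 1, 4, 9, 16 };

        // Select: the delegate sees one element per call.
        Console.WriteLine(string.Join(",", values.Select(x => x * 2)));             // prints "2,8,18,32"

        // Apply: the delegate sees the whole stream.
        Console.WriteLine(string.Join(",", values.Apply(s => Differences(s))));     // prints "3,5,7"
    }
}
```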

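MapReduce in DryadLINQ
    public static IQueryable<R> MapReduce<T, M, K, R>(
        this IQueryable<T> source,                                   // sequence of Ts
        Expression<Func<T, IEnumerable<M>>> mapper,                  // T -> Ms
        Expression<Func<M, K>> keySelector,                          // M -> K
        Expression<Func<IGrouping<K, M>, IEnumerable<R>>> reducer)   // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;                                               // sequence of Rs
    }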

Map-Reduce Plan (when reduce is combiner-enabled)
[Figure: execution plan graph. Vertex labels: M (map), Q (sort), G1 (groupby), C (combine), D (distribute), MS (mergesort), G2 (groupby), R (reduce). Stage groupings shown: map, dynamic aggregation, reduce.]
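The idea behind a combiner-enabled plan can be sketched with a word count: each input partition is mapped, grouped, and combined into partial counts locally, and only those partial counts are merged and reduced afterwards. This is a simplified in-memory illustration, not the plan DryadLINQ generates. It assumes a single final reducer and omits the sort and distribute steps; the class CombinerPlanSketch and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CombinerPlanSketch
{
    // Per-partition stage: map lines to words, group locally,
    // and combine each group into a partial count.
    public static Dictionary<string, int> MapAndCombine(IEnumerable<string> partition)
    {
        return partition
            .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    // Final stage: merge the partial counts for each key and reduce by summing.
    public static Dictionary<string, int> Reduce(IEnumerable<Dictionary<string, int>> partials)
    {
        return partials
            .SelectMany(p => p)
            .GroupBy(kv => kv.Key, kv => kv.Value)
            .ToDictionary(g => g.Key, g => g.Sum());
    }

    public static void Main()
    {
        var partitions = new[]
        {
            new[] { "a b a" },
            new[] { "b c", "a" }
        };

        // Only the small partial dictionaries cross the partition boundary.
        var partials = partitions.Select(MapAndCombine).ToList();
        var totals = Reduce(partials);

        foreach (var kv in totals.OrderBy(kv => kv.Key))
            Console.WriteLine($"{kv.Key}: {kv.Value}");
        // prints:
        // a: 3
        // b: 2
        // c: 1
    }
}
```

Summing partial counts is valid here because counting is associative; a reducer without that property could not be split into a combine step and a final reduce step this way.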

An Example: PageRank
Ranks web pages by propagating scores along hyperlink structure.
Each iteration as an SQL query:
1. Join edges with ranks
2. Distribute ranks on edges
3. GroupBy edge destination
4. Aggregate into ranks
5. Repeat
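The five steps above map directly onto LINQ operators. The sketch below is a simplified in-memory version under stated assumptions: there is no damping factor, each page splits its rank evenly across its outgoing edges, and pages with no incoming edges drop out of the result. The class PageRankSketch, the method Step, and the tuple field names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageRankSketch
{
    // One PageRank iteration (simplified: no damping factor).
    public static Dictionary<int, double> Step(
        IEnumerable<(int Src, int Dst)> edges,
        Dictionary<int, double> ranks)
    {
        // Number of outgoing edges per source page.
        var outDegree = edges.GroupBy(e => e.Src)
                             .ToDictionary(g => g.Key, g => g.Count());

        return edges
            // 1. Join edges with ranks, and
            // 2. distribute each page's rank evenly on its edges.
            .Join(ranks,
                  e => e.Src,
                  r => r.Key,
                  (e, r) => (Dst: e.Dst, Share: r.Value / outDegree[e.Src]))
            // 3. GroupBy edge destination.
            .GroupBy(c => c.Dst, c => c.Share)
            // 4. Aggregate the shares into new ranks.
            .ToDictionary(g => g.Key, g => g.Sum());
    }

    public static void Main()
    {
        var edges = new[] { (Src: 1, Dst: 2), (Src: 1, Dst: 3), (Src: 2, Dst: 3), (Src: 3, Dst: 1) };
        var ranks = new Dictionary<int, double> { [1] = 1.0 / 3, [2] = 1.0 / 3, [3] = 1.0 / 3 };

        // 5. Repeat for a fixed number of iterations.
        for (int i = 0; i < 3; i++)
            ranks = Step(edges, ranks);

        foreach (var kv in ranks.OrderBy(kv => kv.Key))
            Console.WriteLine($"page {kv.Key}: {kv.Value}");
    }
}
```

In DryadLINQ the same query shape would run over partitioned tables; here everything is evaluated locally, so the sketch shows only the per-iteration dataflow.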

Multi-Iteration PageRank
[Figure: pages and ranks flow through Iteration 1, Iteration 2, and Iteration 3, connected by memory FIFO channels.]

Dryad Enters the Market
A big step is coming, as Dryad and DryadLINQ become fully productized as part of the Microsoft HPC Server suite.
It will be integrated with Microsoft SQL Server and Windows Azure to give customers, from academia to the business community, a new, powerful computing tool.
It offers an easy-to-use but powerful data-intensive computing tool.
It benefits a whole new set of Microsoft customers.

Windows Azure and DryadLINQ
Windows Azure is a platform for building scalable, highly reliable, multi-tiered web service applications. It is hosted in Microsoft's large data centers in the United States, Europe, and Asia.
Windows Azure has both compute and data resources. The compute resources are designed to allow applications to scale to thousands of servers and data resources.
There is no port of Hadoop or Dryad/DryadLINQ currently available for Windows Azure. However, Windows Azure is an excellent platform for experimenting with new variations on large-scale map-reduce algorithms, as these patterns are easily coded as worker role networks.

“We’re convinced that we will delight our customers, both with the pure capability of the system and with its ease of use. What I really like about Dryad is that it is not just about handling a problem in a better way; it is also about new possibilities in computing that you couldn’t imagine before.”

Resources
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language - Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations - Yuan Yu, Pradeep Kumar Gunda, Michael Isard
Distributed Data-Parallel Computing Using a High-Level Programming Language - Michael Isard, Yuan Yu
Some sample programs written in DryadLINQ - Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey, Frank McSherry, Kannan Achan