From LINQ to DryadLINQ
Michael Isard
Workshop on Data-Intensive Scientific Computing Using DryadLINQ

Overview
From sequential code to parallel execution
Dryad fundamentals
Simple program example, plan for practicals

Distributed computation
Single computer, shared memory
– All objects always available for read and write
Cluster of workstations
– Each computer sees a subset of objects
– Writes on one computer must be explicitly shared
System automatically handles complexity
– Needs some help

Data-parallel computation
LINQ is a high-level declarative specification
Same action on entire collection of objects
set.Select(x => f(x))
– Compute f(x) on each x in set, independently
set.GroupBy(x => key(x))
– Group by unique keys, independently
set.OrderBy(x => key(x))
– Sort whole set (system chooses how)
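
A minimal single-machine sketch of the three operators above, using ordinary LINQ on a small in-memory array; the collection, element type, and key function are invented for illustration. DryadLINQ applies the same operators to partitioned tables spread across a cluster.

using System;
using System.Linq;

class LinqOperatorsSketch
{
    static void Main()
    {
        var set = new[] { 3, 1, 4, 1, 5, 9, 2, 6 };

        // Select: compute f(x) on each element independently.
        var doubled = set.Select(x => x * 2);

        // GroupBy: one group per unique key (here, even vs. odd).
        var groups = set.GroupBy(x => x % 2);

        // OrderBy: sort the whole set by key.
        var sorted = set.OrderBy(x => x);

        Console.WriteLine(string.Join(",", doubled));
        foreach (var g in groups)
            Console.WriteLine($"key={g.Key}: {string.Join(",", g)}");
        Console.WriteLine(string.Join(",", sorted));
    }
}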

Distributed cluster computing
Dataset is stored on local disks of cluster
[diagram: the logical collection set is split into partitions set.0 through set.7, spread across the cluster machines]

Simple distributed computation
var set2 = set.Select(x => f(x))
[diagram: f is applied to each partition set.0 … set.7 independently, producing the corresponding output partitions set2.0 … set2.7]
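
A minimal single-machine sketch of why Select parallelises: applying f to each partition separately and concatenating the results yields the same answer as applying f to the whole set. The partition count and f below are invented for illustration.

using System;
using System.Linq;

class SelectPartitionSketch
{
    static void Main()
    {
        Func<int, int> f = x => x + 1;
        var set = Enumerable.Range(0, 16).ToArray();

        // Whole-set computation.
        var set2 = set.Select(f).ToArray();

        // Partition-wise computation: 4 partitions, f applied to each independently.
        var partitions = set.Select((x, i) => new { x, i })
                            .GroupBy(p => p.i % 4, p => p.x);
        var set2Parts = partitions.SelectMany(part => part.Select(f));

        // Same multiset of results either way (order may differ across partitions).
        Console.WriteLine(set2.OrderBy(v => v).SequenceEqual(set2Parts.OrderBy(v => v)));  // True
    }
}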

Directed acyclic graph
Computation reads and writes along edges
Graph shows parallelism via independence
Goals of DryadLINQ optimizer
– Extract parallelism (find independent work)
– Control data skew (balance work across nodes)
– Limit cross-computer data transfer

Distributed grouping
var groups = set.GroupBy(x => x.key)
set is a collection of records, each with a key
Don't know what keys are present
– Or in which partitions
First, reorganize data
– All records with same key on same computer
Then can do final grouping in parallel

Distributed grouping
var groups = set.GroupBy(x => x.key)
[diagram: set → hash partition by key → group locally → groups; after repartitioning, all records with a given key sit on one computer, so the final grouping runs in parallel]
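
A minimal single-machine sketch of the two-stage plan above, assuming string-keyed records and a made-up number of computers; the hash-partition step stands in for the network shuffle Dryad performs between machines.

using System;
using System.Linq;

class DistributedGroupingSketch
{
    static void Main()
    {
        var set = new[] { "a", "c", "a", "d", "d", "b", "b", "a" };
        const int numComputers = 3;

        // Stage 1: hash partition by key, so equal keys land on the same "computer".
        var repartitioned = set.GroupBy(x => Math.Abs(x.GetHashCode()) % numComputers);

        // Stage 2: each computer groups its own partition locally, in parallel.
        var groups = repartitioned.SelectMany(partition => partition.GroupBy(x => x));

        foreach (var g in groups)
            Console.WriteLine($"{g.Key}: {g.Count()}");
    }
}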

Distributed sorting
var sorted = set.OrderBy(x => x.key)
[diagram: set → sample → compute histogram → range partition by key → sort locally → sorted; the sampled key histogram picks range boundaries (e.g. key ranges [1,1] and [2,100]) so each computer receives roughly equal work]
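
A minimal single-machine sketch of the sort plan above, assuming integer keys; the sample rate and the single boundary (two ranges) are invented for illustration, whereas the real system picks one boundary per output partition from the sampled histogram.

using System;
using System.Linq;

class DistributedSortSketch
{
    static void Main()
    {
        var rng = new Random(0);
        var set = Enumerable.Range(0, 1000).Select(_ => rng.Next(100)).ToArray();

        // Step 1: sample the keys and pick a range boundary (median of the sample).
        var sample = set.Where((_, i) => i % 50 == 0).OrderBy(x => x).ToArray();
        int boundary = sample[sample.Length / 2];

        // Step 2: range partition by key.
        var low = set.Where(x => x < boundary);
        var high = set.Where(x => x >= boundary);

        // Step 3: sort each range locally; concatenating the ranges gives a global sort.
        var sorted = low.OrderBy(x => x).Concat(high.OrderBy(x => x));

        Console.WriteLine(sorted.SequenceEqual(set.OrderBy(x => x)));  // True
    }
}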

Additional optimizations
var histogram = set.GroupBy(x => x.key).Select(x => {x.Key, x.Count()})
[diagram: naive plan — set → hash partition by key → group locally → count → histogram; every record (the a, b, d keys shown) crosses the network before being counted, producing a,6 b,6 d,4]

var histogram = set.GroupBy(x => x.key).Select(x => {x.Key, x.Count()})
[diagram: optimized plan — set → group locally → count → hash partition by key → group locally → combine counts → histogram; partial counts such as a,2 and b,2 are computed before the network transfer, so only one small record per key per partition moves, and combining them still yields a,6 b,6 d,4]
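
A minimal single-machine sketch of the partial-aggregation optimization above, with an invented 16-record input and partition size; phase 1 counts within each partition before any "network" movement, and phase 2 combines the per-partition counts by key.

using System;
using System.Linq;

class PartialAggregationSketch
{
    static void Main()
    {
        var set = new[] { "a","b","b","a","a","d","b","d","a","b","b","a","a","b","d","d" };
        const int partitionSize = 4;

        // The partitions the input would already have on the cluster.
        var partitions = set.Select((x, i) => new { x, i })
                            .GroupBy(p => p.i / partitionSize, p => p.x);

        // Phase 1 (local): each partition counts its own keys before any data moves.
        var partialCounts = partitions.SelectMany(part =>
            part.GroupBy(x => x).Select(g => new { Key = g.Key, Count = g.Count() }));

        // Phase 2 (after hash partitioning by key): combine the per-partition counts.
        var histogram = partialCounts.GroupBy(p => p.Key)
                                     .Select(g => new { Key = g.Key, Count = g.Sum(p => p.Count) });

        foreach (var h in histogram)
            Console.WriteLine($"{h.Key},{h.Count}");   // a,6  b,6  d,4
    }
}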

What Dryad does
Abstracts cluster resources
– Set of computers, network topology, etc.
Schedules the DAG: chooses cluster computers
– Fairly among competing jobs
– So computation is close to data
Recovers from transient failures
– Reruns computations on machine or network fault
– Speculatively runs duplicates of slow computations

Resources are virtualized
Each graph node is a process
– Writes outputs to disk
– Reads inputs from upstream nodes' output files
Graph generally larger than cluster
– e.g. 1 TB input, 250 MB per partition, 4000 parts
Cluster is shared
– Don't size program for exact cluster
– Use whatever share of resources is available

What controls parallelism
Initially based on partitioning of inputs
After reorganization, system or user decides

DryadLINQ-specific operators
set = PartitionedTable.Get<T>(uri)
set.ToPartitionedTable(uri)
set.HashPartition(x => f(x), numberOfParts)
set.AssumeHashPartition(x => f(x))
[Associative] f(x) { … }
RangePartition(…), Apply(…), Fork(…)
[Decomposable], [Homomorphic], [Resource]
Field mappings, multiple partitioned tables, …
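
A hedged usage sketch of two of the operators above, not taken from the slides: the Record type, its key field, and the uri and part count are hypothetical. AssumeHashPartition asserts an existing partitioning so the optimizer can skip a shuffle, while HashPartition forces one.

// If the table was already written out hash partitioned by key, assert it:
var set = PartitionedTable.Get<Record>(uri).AssumeHashPartition(x => x.key);
var groups = set.GroupBy(x => x.key);   // can now group locally, no repartitioning

// Or explicitly repartition by key into a chosen number of parts:
var repartitioned = set.HashPartition(x => x.key, 100);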

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using LinqToDryad;

namespace Count
{
    class Program
    {
        // The stream URI was omitted in the transcript.
        public const string inputUri = "...";

        static void Main(string[] args)
        {
            // The element type parameter was lost in transcription; LineRecord is
            // the record type used by the workshop's Count sample (assumed here).
            PartitionedTable<LineRecord> table = PartitionedTable.Get<LineRecord>(inputUri);
            Console.WriteLine("Lines: {0}", table.Count());
            Console.ReadKey();
        }
    }
}
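
A hedged extension of the Count sample, not from the slides: it computes the word-count histogram discussed earlier and writes it back to the cluster. It assumes LineRecord exposes its text as a line field (as in the workshop samples) and uses a hypothetical output URI.

var table = PartitionedTable.Get<LineRecord>(inputUri);

var histogram =
    table.SelectMany(r => r.line.Split(' '))                     // words in each line
         .GroupBy(w => w)                                        // distributed grouping
         .Select(g => new { Word = g.Key, Count = g.Count() });  // (word, count) pairs

histogram.ToPartitionedTable("tidyfs://datasets/wordcount-output");  // hypothetical URI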

Form into groups
9 groups, one MSRI member per group
Try to pick common interest for project later

sherwood-246 — sherwood-253, sherwood-255
Samples: d:\dryad\data\Workshop\DryadLINQ\samples (Count, Points, Robots)
Cluster job browser: d:\dryad\data\Workshop\DryadLINQ\job_browser\DryadAnalysis.exe
TidyFS (file system) browser: d:\dryad\data\Workshop\DryadLINQ\bin\retail\tidyfsexplorerwpf.exe