DryadInc: Reusing work in large-scale computations


DryadInc: Reusing work in large-scale computations
Lucian Popa*+, Mihai Budiu+, Yuan Yu+, Michael Isard+
+ Microsoft Research Silicon Valley   * UC Berkeley

Problem Statement
A distributed computation transforms append-only input data into outputs. Goal: reuse (part of) prior computations to speed up the current job, increase cluster throughput, and reduce energy and costs.

Two Proposed Approaches
1. IDE: reuse IDEntical computations from the past (like make or memoization).
2. MER: perform only incremental computation on the new data and MERge the results with the previous ones (like patch).

Context
Implemented for Dryad. A Dryad job is a computational DAG: each vertex is an arbitrary computation with inputs and outputs, and each edge is a data flow. Simple example, Record Count: input partitions I1 and I2 each feed a Count vertex (C), and an Add vertex (A) sums the counts into the output.
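As a rough illustration of this structure (Python is used here only for sketching; the Vertex and Dag names below are hypothetical and are not the actual Dryad interfaces), the record-count job can be written down as a small DAG of vertices and edges:

```python
# A minimal sketch of the record-count DAG from the slide above.
# Vertex and Dag are illustrative stand-ins, not the actual Dryad API.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Vertex:
    name: str      # e.g. "I1", "C1", "A"
    program: str   # the computation this vertex performs

@dataclass
class Dag:
    vertices: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (source, destination) pairs

# Record Count: two input partitions, one Count vertex per partition, one Add vertex.
i1, i2 = Vertex("I1", "input"), Vertex("I2", "input")
c1, c2 = Vertex("C1", "count"), Vertex("C2", "count")
add = Vertex("A", "add")

dag = Dag(
    vertices=[i1, i2, c1, c2, add],
    edges=[(i1, c1), (i2, c2), (c1, add), (c2, add)],
)
```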

IDE – IDEntical Computation
Record Count, first execution DAG: input partitions I1 and I2 each feed a Count vertex (C), and an Add vertex (A) combines the counts into the output.

IDE – IDEntical Computation
Record Count, second execution DAG: a new input partition I3 is appended and gets its own Count vertex; the subDAG over I1 and I2 is identical to the one in the first execution.

IDE – IDEntical Computation
Replace the identical computational subDAG with its edge data cached from the previous execution. In the IDE-modified DAG, only I3 is counted, and the Add vertex combines that count with the cached data. DAG fingerprints are used to determine whether computations are identical.
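One plausible way to compute such fingerprints is sketched below. The slides do not spell out the exact scheme, so hashing the vertex program together with the fingerprints of its inputs is an assumption for illustration, not the precise DryadInc algorithm:

```python
# Sketch: a subDAG's fingerprint combines the vertex's program with the
# fingerprints of its inputs, so an unchanged computation over unchanged
# data hashes to the same value and can be served from the cache.
import hashlib

def input_fingerprint(name, content_hash):
    # In a real system the content hash would come from the storage layer.
    return hashlib.sha256(f"input:{name}:{content_hash}".encode()).hexdigest()

def vertex_fingerprint(program, upstream_fingerprints):
    h = hashlib.sha256()
    h.update(program.encode())
    for fp in upstream_fingerprints:   # fixed order keeps the hash deterministic
        h.update(fp.encode())
    return h.hexdigest()

def lookup(cache, program, upstream_fingerprints):
    """Return cached edge data if this exact subDAG ran before, else None."""
    return cache.get(vertex_fingerprint(program, upstream_fingerprints))
```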

Semantic Knowledge Can Help
Instead of recomputing over I1 and I2, reuse the previous output directly: a Merge vertex (here, Add) combines the previous output with the result of an incremental DAG that counts only the new input I3.

MER – MERgeable Computation
The Merge vertex (Add) is user-specified; the incremental DAG over the new input is automatically inferred and built by the system.

MER – MERgeable Computation
The output of the Merge vertex is saved to the cache for the next run. The incremental DAG is obtained by removing the old inputs (I1 and I2 become empty), so only the new input I3 is actually processed.
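A toy sketch of how MER could execute for the record-count example follows. The function names and the dictionary-backed cache are hypothetical; only the idea that the new partitions are counted and merged with the previous output comes from the slides:

```python
# Sketch of MER for record count: count only the new partitions, then merge
# with the previous output via the user-specified merge function (addition).
def count_records(partition):
    return len(partition)

def merge_add(previous_total, incremental_total):
    # User-specified Merge vertex for record count: totals simply add up.
    return previous_total + incremental_total

def run_mer(cache, new_partitions):
    incremental = sum(count_records(p) for p in new_partitions)
    previous = cache.get("record_count", 0)
    merged = merge_add(previous, incremental)
    cache["record_count"] = merged   # saved so the next append starts from here
    return merged

cache = {"record_count": 7}                        # previous output over I1, I2
print(run_mer(cache, new_partitions=[[1, 2, 3]]))  # I3 has 3 records -> 10
```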

IDE in practice: a 6-input DAG, and the IDE-modified 6 (4+2)-input DAG in which the subDAG over the 4 previously processed inputs is replaced by cached data and only the 2 new inputs are recomputed.

MER in practice: a 9-input DAG, and the MER-modified 9 (5+4)-input DAG in which the 5 old inputs are replaced by empty inputs and their prior result is merged in from the cache; only the 4 new inputs are processed.

Evaluation – Running time
Word histogram application on 8 cluster nodes, comparing running times with No Cache, IDE, and MER.

Discussion
MapReduce is just a particular case: IDE reuses the output of the Mappers, while MER requires a Reduce function whose results can be combined (merged), as sketched below.
Combining IDE with MER: the benefits do not simply add up, but IDE can be applied to the incremental DAG used by MER.
More semantic knowledge opens further opportunities: generating the merge function automatically and improving the incremental DAG.
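As an illustration of the "combinable Reduce" requirement, here is a hypothetical Python sketch: the word-histogram Reduce is mergeable because two partial histograms combine by adding per-word counts.

```python
# Sketch: the word-histogram Reduce is mergeable, which is what MER needs.
from collections import Counter

def reduce_histogram(words):
    return Counter(words)                 # per-word counts for one run

def merge_histograms(previous, incremental):
    merged = Counter(previous)
    merged.update(incremental)            # adds counts word by word
    return merged

old = reduce_histogram(["dryad", "inc", "dryad"])
new = reduce_histogram(["inc", "inc"])
print(merge_histograms(old, new))         # Counter({'inc': 3, 'dryad': 2})
```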

Conclusions & Questions Problem: reuse work in distributed computations on append-only data Two methods: INC – reuse IDEntical past computations No user effort MER – MERge past results with new ones Small user effort, potentially larger gains Implemented for Dryad

Backup Slides

Architecture
The IDE/MER rerun logic runs inside the Dryad Job Manager: it consults the Cache Server to modify the DAG before the run, executes the job, and updates the cache after the run.
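The control flow on this slide could be sketched roughly as follows; the class and method names are invented for illustration and do not correspond to the actual DryadInc interfaces:

```python
# Sketch of the rerun loop: rewrite the DAG using the cache before the run,
# execute it, and push the new results back into the cache afterwards.
class CacheServer:
    def __init__(self):
        self.entries = {}                       # fingerprint -> cached outputs

    def rewrite(self, dag):
        # IDE/MER rerun logic: replace already-computed subDAGs (no-op here).
        return dag

    def update(self, dag, outputs):
        self.entries[repr(dag)] = outputs       # remember results for next time

def submit_job(dag, cache_server, run):
    rewritten = cache_server.rewrite(dag)       # modify DAG before run
    outputs = run(rewritten)                    # execute on the cluster
    cache_server.update(rewritten, outputs)     # update cache after run
    return outputs

# Toy usage: "run" counts records across partitions.
cache = CacheServer()
print(submit_job([[1, 2], [3, 4, 5]], cache, run=lambda d: sum(len(p) for p in d)))  # 5
```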

IDE – IDEntical Computation
Fingerprints identify identical computational DAGs and guarantee that both the computation and the data are unchanged. After the first DAG runs, <hash, outputs> pairs are stored in the cache; the fingerprints of the second DAG are looked up against them.

IDE – IDEntical Computation
A heuristic selects which vertices' outputs to cache (the figure compares the first and second DAGs and marks the cached vertices).

Word Histogram Application
The n input partitions feed Sort and Hash-Distribute vertices (the Map stage); their output is routed to m Merge, Sort, and Count vertices (the Reduce stage), which produce the final outputs.

Evaluation – Running time
Running times for No Cache, IDE, and MER when 4 extra inputs are appended and when 20 extra inputs (~1.2 GB) are appended; the results are similar in both cases.