Twister4Azure: Iterative MapReduce for Windows Azure Cloud. Thilina Gunarathne, Indiana University.

Twister4Azure – Iterative MapReduce
- Decentralized iterative MapReduce architecture for clouds, built on highly available and scalable cloud services
- Extends the MapReduce programming model
- Multi-level data caching with cache-aware hybrid scheduling
- Multiple MapReduce applications per job
- Collective communication primitives
- Outperforms Hadoop on a local cluster by 2 to 4 times
- Dynamic scheduling, load balancing, fault tolerance, monitoring, and local testing/debugging

MRRoles4Azure
- Built on Azure cloud services: highly available and scalable; uses eventually-consistent, high-latency cloud services effectively; minimal maintenance and management overhead
- Decentralized: avoids a single point of failure; global-queue-based dynamic scheduling; dynamically scales up/down
- MapReduce: first pure MapReduce runtime for Azure; typical MapReduce fault tolerance

MRRoles4Azure uses Azure Queues for scheduling, Azure Tables to store metadata and monitoring data, and Azure Blobs for input/output/intermediate data storage.
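
To make that division of labor concrete, the sketch below shows a worker-side polling loop, assuming hypothetical thin wrappers (TaskQueue, MetadataTable, BlobStore) over the Azure Queue, Table, and Blob services; the actual MRRoles4Azure classes and Azure SDK calls are not shown in the slides and will differ.

```java
// Hypothetical sketch of a decentralized MRRoles4Azure-style map worker.
// TaskQueue, MetadataTable and BlobStore are assumed wrappers over
// Azure Queues, Tables and Blobs; they are not real SDK types.
interface TaskQueue { String dequeueTask(); void deleteTask(String task); }
interface MetadataTable { void updateStatus(String taskId, String status); }
interface BlobStore { byte[] download(String blobName); void upload(String blobName, byte[] data); }

public class MapWorker {
    private final TaskQueue queue;
    private final MetadataTable table;
    private final BlobStore blobs;

    public MapWorker(TaskQueue queue, MetadataTable table, BlobStore blobs) {
        this.queue = queue;
        this.table = table;
        this.blobs = blobs;
    }

    /** Poll the global queue; any idle worker can pick up any task (dynamic scheduling). */
    public void run() throws InterruptedException {
        while (true) {
            String task = queue.dequeueTask();            // scheduling via Azure Queues
            if (task == null) { Thread.sleep(1000); continue; }
            table.updateStatus(task, "RUNNING");          // monitoring via Azure Tables
            byte[] input = blobs.download(task + ".in");  // input data via Azure Blobs
            byte[] output = map(input);
            blobs.upload(task + ".out", output);          // intermediate data via Azure Blobs
            table.updateStatus(task, "DONE");
            queue.deleteTask(task);                       // delete only after success (fault tolerance)
        }
    }

    private byte[] map(byte[] input) { return input; }    // placeholder for the user map function
}
```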

Data Intensive Iterative Applications
- Growing class of applications: clustering, data mining, machine learning and dimension reduction
- Driven by the data deluge and emerging computation fields
- Lots of scientific applications

A typical iterative computation:

    k ← 0; MAX ← maximum iterations
    δ[0] ← initial delta value
    while ( k < MAX_ITER || f(δ[k], δ[k-1]) )
        foreach datum in data
            β[datum] ← process(datum, δ[k])
        end foreach
        δ[k+1] ← combine(β[])
        k ← k + 1
    end while
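
A minimal Java sketch of this driver loop is shown below, purely for illustration; processDatum, combine and converged are hypothetical stand-ins for the "process", "combine" and "f" steps, and the loop uses the conventional "stop at MAX_ITER or on convergence" reading.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the iterative pattern in the pseudocode above.
public class IterativeDriver {
    static final int MAX_ITER = 100;

    public static double[] run(List<double[]> data, double[] initialDelta) {
        double[] delta = initialDelta;              // loop-variant data (small)
        double[] previous = null;
        for (int k = 0; k < MAX_ITER && !converged(delta, previous); k++) {
            List<double[]> beta = new ArrayList<>();
            for (double[] datum : data) {           // loop-invariant data (large)
                beta.add(processDatum(datum, delta));
            }
            previous = delta;
            delta = combine(beta);                  // e.g. the reduce/merge step
        }
        return delta;
    }

    // Placeholders for the application-specific steps.
    static double[] processDatum(double[] datum, double[] delta) { return datum; }
    static double[] combine(List<double[]> beta) { return beta.isEmpty() ? new double[0] : beta.get(0); }
    static boolean converged(double[] current, double[] previous) {
        return previous != null && java.util.Arrays.equals(current, previous);
    }
}
```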

Data Intensive Iterative Applications (cont.)
- Growing class of applications: clustering, data mining, machine learning and dimension reduction, driven by the data deluge and emerging computation fields
- Each iteration consists of compute, communication, and a reduce/barrier step before the new iteration begins
- The larger loop-invariant data stays fixed across iterations; the smaller loop-variant data is broadcast each iteration

Iterative MapReduce – MapReduceMerge
- Extensions to support additional broadcast (and other) input data; the third parameter of each function below carries the broadcast data
- Map(<key>, <value>, list_of<key,value>)
- Reduce(<key>, list_of<value>, list_of<key,value>)
- Merge(list_of<key,list_of<value>>, list_of<key,value>)
- Execution flow: Map -> Combine -> Shuffle -> Sort -> Reduce -> Merge -> Broadcast

Merge Step
- Extension to the MapReduce programming model to support iterative applications: Map -> Combine -> Shuffle -> Sort -> Reduce -> Merge
- Receives all the Reduce outputs and the broadcast data for the current iteration
- The user can add a new iteration or schedule a new MapReduce job from the Merge task
  - Serves as the "loop test" in the decentralized architecture, based on the number of iterations or a comparison of the previous and current iteration's results
- The output of Merge can become the broadcast data of the next iteration
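
As a rough illustration of the extended programming model, here is a minimal Java sketch; the interface and method names are assumptions made for illustration and do not reproduce the actual Twister4Azure API.

```java
import java.util.List;
import java.util.Map;

// Sketch of the extended model: every user function receives the per-iteration
// broadcast data as an extra argument, and Merge acts as the loop test.
interface IterativeMapReduce<K, V> {

    // Map(<key>, <value>, list_of<key,value>): broadcast data is the third argument.
    void map(K key, V value, List<Map.Entry<K, V>> broadcastData, Collector<K, V> out);

    // Reduce(<key>, list_of<value>, list_of<key,value>)
    void reduce(K key, List<V> values, List<Map.Entry<K, V>> broadcastData, Collector<K, V> out);

    // Merge(list_of<key,list_of<value>>, list_of<key,value>): sees all reduce
    // outputs plus the broadcast data, decides whether to start a new iteration,
    // and its result can become the next iteration's broadcast data.
    MergeResult<K, V> merge(List<Map.Entry<K, List<V>>> reduceOutputs,
                            List<Map.Entry<K, V>> broadcastData);
}

interface Collector<K, V> { void emit(K key, V value); }

class MergeResult<K, V> {
    final boolean startNewIteration;                 // the "loop test"
    final List<Map.Entry<K, V>> nextBroadcastData;   // broadcast for the next iteration
    MergeResult(boolean startNewIteration, List<Map.Entry<K, V>> nextBroadcastData) {
        this.startNewIteration = startNewIteration;
        this.nextBroadcastData = nextBroadcastData;
    }
}
```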

Multi-Level Caching
(Recap of the extensions: in-memory caching of static data, programming model extensions to support broadcast data, the Merge step, and hybrid intermediate data transfer.)
- In-memory/disk caching of static data
- Caching BLOB data on disk
- Caching loop-invariant data in memory
  - Direct in-memory
  - Memory-mapped files
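
A minimal sketch of the two loop-invariant caching options is given below, assuming a hypothetical per-worker cache class; the real Twister4Azure cache implementation is not shown on the slides.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-worker cache for loop-invariant data; it survives across
// iterations because the worker role process itself stays alive.
public class StaticDataCache {
    // Option 1: direct in-memory caching of loop-invariant data.
    private static final ConcurrentHashMap<String, byte[]> inMemory = new ConcurrentHashMap<>();

    public static byte[] getInMemory(String blobName,
                                     java.util.function.Function<String, byte[]> fetchFromBlob) {
        return inMemory.computeIfAbsent(blobName, fetchFromBlob); // fetch once, reuse every iteration
    }

    // Option 2: memory-mapped file over BLOB data already cached on local disk,
    // letting the OS page the data instead of holding it on the heap.
    public static MappedByteBuffer mapFromDisk(Path localCopy) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(localCopy.toFile(), "r");
             FileChannel channel = raf.getChannel()) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```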

Cache Aware Hybrid Scheduling
[Diagram: worker roles hold an in-memory/disk data cache and a map-task metadata cache; new iterations and new jobs are announced on a Job Bulletin Board (e.g. "Job 1, iteration 2, bcast..", "Job 2, iteration 26, bcast.."), while leftover tasks go through the scheduling queue to the map workers (Map 1 .. Map n) and reduce workers (Red 1 .. Red m).]
- Decentralized
- Fault tolerant
- Multiple MapReduce applications within an iteration
- Load balancing
- Multiple waves
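
The scheduling decision could look roughly like the sketch below; JobBulletinBoard, TaskMetadataCache and GlobalTaskQueue are hypothetical stand-ins for the structures named on the slide, not Twister4Azure classes.

```java
import java.util.List;

// Sketch of cache-aware hybrid scheduling: prefer locally cached tasks announced
// via the bulletin board, then fall back to the global queue for leftover tasks.
interface JobBulletinBoard { IterationAnnouncement pollNewIteration(); }
interface TaskMetadataCache { List<String> tasksWithLocallyCachedData(String jobId, int iteration); }
interface GlobalTaskQueue { String dequeueLeftoverTask(); }
interface TaskExecutor { void execute(String taskId, String broadcastDataId); }

record IterationAnnouncement(String jobId, int iteration, String broadcastDataId) {}

class CacheAwareScheduler {
    private final JobBulletinBoard board;
    private final TaskMetadataCache metadata;
    private final GlobalTaskQueue queue;

    CacheAwareScheduler(JobBulletinBoard board, TaskMetadataCache metadata, GlobalTaskQueue queue) {
        this.board = board; this.metadata = metadata; this.queue = queue;
    }

    void scheduleOnce(TaskExecutor executor) {
        IterationAnnouncement it = board.pollNewIteration();
        if (it == null) return;
        // 1. Run the map tasks whose loop-invariant data is already cached on this worker.
        for (String taskId : metadata.tasksWithLocallyCachedData(it.jobId(), it.iteration())) {
            executor.execute(taskId, it.broadcastDataId());
        }
        // 2. Help with leftover tasks from the global queue (load balancing, multiple waves).
        for (String taskId; (taskId = queue.dequeueLeftoverTask()) != null; ) {
            executor.execute(taskId, it.broadcastDataId());
        }
    }
}
```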

Data Transfer
- Iterative vs. traditional MapReduce
  - Iterative computation tasks are finer grained
  - Intermediate data are relatively smaller
- Hybrid data transfer based on the use case
  - Blob + Table storage based transport
  - Direct TCP transport
- Push data from Map to Reduce
- Optimized data broadcasting
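
One plausible shape for the hybrid transfer choice is sketched below, with a hypothetical BlobTableTransport as the durable fallback; the slides do not specify the actual selection logic or class names.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch: push small intermediate data from Map to Reduce over a direct TCP
// connection when the reducer is reachable, otherwise fall back to the
// Blob + Table based transport. BlobTableTransport is a hypothetical stand-in.
interface BlobTableTransport { void upload(String key, byte[] data); }

class IntermediateDataSender {
    private final BlobTableTransport fallback;

    IntermediateDataSender(BlobTableTransport fallback) { this.fallback = fallback; }

    void send(String key, byte[] data, InetSocketAddress reducer) {
        try (Socket socket = new Socket()) {
            socket.connect(reducer, 2000);            // direct TCP push from Map to Reduce
            OutputStream out = socket.getOutputStream();
            out.write(data);
            out.flush();
        } catch (IOException e) {
            fallback.upload(key, data);               // durable Blob + Table transport as fallback
        }
    }
}
```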

Fault Tolerance for Iterative MapReduce
- Iteration level: roll back iterations
- Task level: re-execute the failed tasks
- Hybrid data communication using a combination of faster non-persistent and slower persistent media
  - Direct TCP (non-persistent) with blob uploading in the background
- Decentralized control avoiding single points of failure
- Duplicate execution of slow tasks
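
Iteration-level rollback can be sketched as checkpointing the merged loop-variant data after each iteration; the CheckpointStore interface and method names below are assumptions, not the actual Twister4Azure mechanism.

```java
import java.util.Optional;

// Illustrative iteration-level fault tolerance: persist the merge output after
// every iteration; on restart after a failure, roll back to the last completed
// iteration. CheckpointStore is hypothetical (e.g. backed by blob storage).
interface CheckpointStore {
    void save(int iteration, byte[] loopVariantData);
    Optional<byte[]> loadLatest();
    int latestIteration();
}

class RollbackDriver {
    private final CheckpointStore checkpoints;

    RollbackDriver(CheckpointStore checkpoints) { this.checkpoints = checkpoints; }

    byte[] run(byte[] initialDelta, int maxIterations) {
        Optional<byte[]> latest = checkpoints.loadLatest();
        byte[] delta = latest.orElse(initialDelta);
        int start = latest.isPresent() ? checkpoints.latestIteration() + 1 : 0;
        for (int k = start; k < maxIterations; k++) {
            byte[] merged = runOneIteration(delta, k); // map -> reduce -> merge for iteration k
            checkpoints.save(k, merged);               // durable checkpoint of loop-variant data
            delta = merged;
        }
        return delta;
    }

    byte[] runOneIteration(byte[] delta, int iteration) { return delta; } // placeholder
}
```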

Collective Communication Primitives for Iterative MapReduce
- Support common higher-level communication patterns
- Performance: the framework can optimize these operations transparently to the user
- Multi-algorithm: avoids unnecessary steps in traditional and iterative MapReduce
- Ease of use: users do not have to implement this logic manually (e.g. Reduce and Merge tasks), and the Map and Reduce APIs are preserved
- AllGather is available; we are working on several other primitives as well
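
The sketch below is a local simulation of AllGather semantics only, to show what the primitive gives the application; it is not the Twister4Azure implementation.

```java
import java.util.ArrayList;
import java.util.List;

// AllGather semantics: every task contributes its partial result and every task
// receives the full collection for the next iteration, so the application does
// not need hand-written Reduce/Merge/broadcast steps for this pattern.
class AllGatherExample {

    /** Gather the per-task partial results and hand the full list to every task. */
    static <T> List<List<T>> allGather(List<T> partialResults) {
        List<List<T>> perTaskView = new ArrayList<>();
        for (int task = 0; task < partialResults.size(); task++) {
            perTaskView.add(new ArrayList<>(partialResults)); // each task sees all partials
        }
        return perTaskView;
    }

    public static void main(String[] args) {
        // e.g. MDS BCCalc or PageRank with the in-links matrix: each map task
        // produces one block, and every task needs all blocks in the next iteration.
        List<double[]> partialBlocks = List.of(new double[]{1, 2}, new double[]{3, 4});
        List<List<double[]>> gathered = allGather(partialBlocks);
        System.out.println("Each of the " + gathered.size() + " tasks now holds "
                + gathered.get(0).size() + " blocks");
    }
}
```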

AllGather Primitive
- Used by MDS BCCalc and PageRank (with the in-links matrix)

Performance
[Charts: performance with/without data caching; speedup gained using the data cache; scaling speedup with increasing number of iterations; number of executing map tasks histogram; strong scaling with 128M data points; weak scaling; task execution time histogram.]
- The first iteration performs the initial data fetch
- There is overhead between iterations
- Scales better than Hadoop on bare metal

Multi Dimensional Scaling
[Charts: weak scaling and data size scaling, with performance adjusted for the sequential performance difference.]
- Each iteration runs three MapReduce-Merge jobs: BC: Calculate BX (Map, Reduce, Merge); X: Calculate invV(BX) (Map, Reduce, Merge); Calculate Stress (Map, Reduce, Merge); then a new iteration begins.
- Scalable Parallel Scientific Computing Using Twister4Azure. Thilina Gunarathne, BingJing Zang, Tak-Lon Wu and Judy Qiu. Submitted to the Journal of Future Generation Computer Systems. (Invited as one of the best 6 papers of UCC 2011.)

Multi Dimensional Scaling

Performance Comparisons
- BLAST sequence search
- Cap3 sequence assembly
- Smith-Waterman sequence alignment
- MapReduce in the Clouds for Science, Thilina Gunarathne, et al. CloudCom 2010, Indianapolis, IN.

KMEANS CLUSTERING DEMO
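
For context on the demo, here is a rough sketch of how KMeans clustering maps onto the Map/Reduce/Merge pattern described earlier: centroids are the small loop-variant broadcast data and points are the large, cached loop-invariant data. The encoding and class names are assumptions made for illustration, not the demo's actual code.

```java
import java.util.List;

// Illustrative KMeans in the iterative MapReduce pattern.
class KMeansSketch {

    /** Map: assign each cached point to its nearest centroid, emit partial sums. */
    static double[][] mapPartition(List<double[]> points, double[][] centroids) {
        int k = centroids.length, d = centroids[0].length;
        double[][] sumsAndCounts = new double[k][d + 1];      // last column holds the count
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double dist = 0;
                for (int j = 0; j < d; j++) dist += (p[j] - centroids[c][j]) * (p[j] - centroids[c][j]);
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            for (int j = 0; j < d; j++) sumsAndCounts[best][j] += p[j];
            sumsAndCounts[best][d] += 1;
        }
        return sumsAndCounts;
    }

    /** Reduce/Merge: combine partial sums into new centroids, the next broadcast data. */
    static double[][] merge(List<double[][]> partials, int k, int d) {
        double[][] total = new double[k][d + 1];
        for (double[][] part : partials)
            for (int c = 0; c < k; c++)
                for (int j = 0; j <= d; j++) total[c][j] += part[c][j];
        double[][] newCentroids = new double[k][d];
        for (int c = 0; c < k; c++)
            for (int j = 0; j < d; j++)
                newCentroids[c][j] = total[c][d] == 0 ? 0 : total[c][j] / total[c][d];
        return newCentroids;
    }
}
```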

Acknowledgements
- Microsoft Extreme Computing Group for the Azure compute grant
- Persistent Systems
- IU SALSA Group

Hybrid Task Scheduling
- First iteration: tasks are scheduled through the queues
- New iterations: announced on the Job Bulletin Board
- Workers use the data in cache plus the task metadata history to pick tasks
- Leftover tasks are picked up from the queue