Twister4Azure : Iterative MapReduce for Azure Cloud


Twister4Azure: Iterative MapReduce for Azure Cloud. Thilina Gunarathne, Judy Qiu, Geoffrey Fox, {tgunarat, xqiu, gcf}@indiana.edu. CCA 2011, April 12 – 13, 2011.

MapReduceRoles for Azure: Familiar MapReduce programming model. Built using highly available and scalable Azure cloud services. Coexists with the eventual consistency and high latency of cloud services. Decentralized control, with no single point of failure. Supports dynamically scaling the compute resources up and down. MapReduce fault tolerance.

MapReduceRoles for Azure: We use Azure Queues for scheduling, Tables to store metadata and monitoring data, and Blobs for input, output, and intermediate data storage.
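The division of labor among the three storage services can be sketched in a few lines. This is only an illustration under stated assumptions: the queue, table, and blob store below are plain in-memory Python objects standing in for the Azure services, not calls to any Azure SDK, and the names (task_queue, task_table, blobs, submit_job, run_worker, reduce_all) and the word-count job are hypothetical.

```python
from collections import Counter, deque

# In-memory stand-ins for the cloud services (assumptions, not the Azure SDK).
blobs = {}            # blob name -> data (input, intermediate, output)
task_table = {}       # task id -> status (metadata and monitoring)
task_queue = deque()  # scheduling queue of task ids


def submit_job(inputs):
    """Store each input split as a blob and enqueue one map task per split."""
    for i, text in enumerate(inputs):
        blobs[f"input/{i}"] = text
        task_table[f"map-{i}"] = "queued"
        task_queue.append(f"map-{i}")


def run_worker():
    """Drain the queue. Any worker may take any task, so no master is needed."""
    while task_queue:
        task_id = task_queue.popleft()
        index = task_id.split("-")[1]
        task_table[task_id] = "running"
        # Map: count words in the input split, write the result as a blob.
        blobs[f"intermediate/{index}"] = Counter(blobs[f"input/{index}"].split())
        task_table[task_id] = "done"


def reduce_all():
    """Reduce: combine every intermediate blob into one output blob."""
    total = Counter()
    for name, data in list(blobs.items()):
        if name.startswith("intermediate/"):
            total.update(data)
    blobs["output/result"] = total
    return total


submit_job(["a b a", "b c"])
run_worker()
result = reduce_all()
print(dict(result))
```

In this sketch the worker holds no job state of its own; everything it needs comes from the queue, the table, or the blob store, which is what lets workers be added or removed freely.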

Twister for Azure: Merge step. In-memory caching of static data. Cache-aware hybrid scheduling using Queues as well as a bulletin board (a special table). In our current work, we extend MR4Azure to support iterative MapReduce computations. We added a merge step to the programming model, in which the computation decides whether to start a new iteration. We also support in-memory caching of static data between iterations, and we developed a hybrid scheduling strategy to perform cache-aware scheduling.
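A minimal sketch of the map, reduce, merge loop, assuming a one-dimensional K-means job as the example computation. The function names (map_task, reduce_task, merge, run), the convergence tolerance, and the sample data are all hypothetical and do not come from the Twister4Azure API; the point is only to show static data being loaded once and the merge step deciding whether to iterate again.

```python
def map_task(points, centroids):
    """Assign each cached point to its nearest centroid; emit partial sums."""
    partial = {}
    for p in points:
        k = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        s, n = partial.get(k, (0.0, 0))
        partial[k] = (s + p, n + 1)
    return partial


def reduce_task(partials, centroids):
    """Combine the partial sums from all map tasks into new centroids."""
    totals = {}
    for partial in partials:
        for k, (s, n) in partial.items():
            ts, tn = totals.get(k, (0.0, 0))
            totals[k] = (ts + s, tn + n)
    # A centroid with no assigned points keeps its previous value.
    return [totals[k][0] / totals[k][1] if k in totals else c
            for k, c in enumerate(centroids)]


def merge(old, new, tolerance=1e-6):
    """Merge step: return True if another iteration is needed."""
    return max(abs(a - b) for a, b in zip(old, new)) > tolerance


def run(partitions, centroids, max_iterations=50):
    # Static data is loaded once and reused by every iteration (the cache).
    cache = [list(p) for p in partitions]
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        partials = [map_task(points, centroids) for points in cache]
        new_centroids = reduce_task(partials, centroids)
        keep_going = merge(centroids, new_centroids)
        centroids = new_centroids  # only this small, dynamic data changes
        if not keep_going:
            break
    return centroids, iteration


centroids, iterations = run([[1.0, 2.0, 9.0], [10.0, 1.5, 9.5]], [0.0, 5.0])
print(centroids, iterations)
```

Only the centroids change between iterations, so they are the only data passed around; the points stay in the cache for the life of the job.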

Twister for Azure: We don't have a master node that holds global knowledge about the cached data. Hence, in each iteration, tasks are posted to the bulletin board, which workers check first to identify any tasks that require a data product they have in cache. If they find none, they fall back to the queue.
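The worker-side decision can be sketched as follows. This is an illustration under assumptions, not the Twister4Azure implementation: the bulletin board is a plain list rather than an Azure table, the queue is a deque, and the Worker class, the task dictionaries, and run_iteration are hypothetical. The slides do not say how tasks with no cached data reach the queue, so the sketch simply queues the first iteration and posts the second to the board.

```python
from collections import deque


class Worker:
    def __init__(self, name):
        self.name = name
        self.cache = set()  # ids of data products held in memory

    def next_task(self, bulletin_board, queue):
        # 1. Prefer a bulletin-board task whose data is already cached here.
        for task in bulletin_board:
            if task["data_id"] in self.cache:
                bulletin_board.remove(task)
                return task, "cache"
        # 2. Otherwise fall back to the shared queue and cache what was loaded.
        if queue:
            task = queue.popleft()
            self.cache.add(task["data_id"])
            return task, "queue"
        return None, None


def run_iteration(workers, bulletin_board, queue):
    """Let workers take turns until none of them can find a task."""
    log = []
    progress = True
    while progress:
        progress = False
        for worker in workers:
            task, source = worker.next_task(bulletin_board, queue)
            if task is not None:
                log.append((worker.name, task["data_id"], source))
                progress = True
    return log


workers = [Worker("w0"), Worker("w1")]
data_ids = ["d0", "d1", "d2", "d3"]

# Iteration 1: caches are cold, so every task comes from the queue.
log1 = run_iteration(workers, [], deque({"data_id": d} for d in data_ids))

# Iteration 2: tasks are posted to the board; workers claim what they cached.
log2 = run_iteration(workers, [{"data_id": d} for d in data_ids], deque())
print(log1)
print(log2)
```

Each worker consults only its own cache, so no component needs a global view of where data lives.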

Performance, KMeans Clustering: KMeans iterative MapReduce performance. [Figure] 16 Azure Small instances, 6 iterations, 8 to 48 million 20-D data points. Left: Performance with and without data caching. Right: Speedup obtained from using the data cache. [Figure] Left: Scaling speedup with increasing number of instances (Azure Small) and data for 10 iterations. Right: Increasing number of iterations using 16 million data points with caching using 16 Azure Small instances.

Performance Comparisons: [Figures] BLAST sequence search. Smith-Waterman sequence alignment. Cap3 sequence assembly.

Conclusion: Twister4Azure enables users to easily and efficiently perform large-scale iterative data analysis and scientific computations on the Azure cloud. It utilizes a novel hybrid scheduling mechanism to provide caching of static data across iterations. It utilizes cloud infrastructure services effectively to deliver robust and efficient applications. http://salsahpc.indiana.edu/twister4azure