MATE-EC2: A Middleware for Processing Data with Amazon Web Services. Tekin Bicer, David Chiu* and Gagan Agrawal, Department of Computer Science and Engineering.

Presentation transcript:

MATE-EC2: A Middleware for Processing Data with Amazon Web Services
Tekin Bicer, David Chiu* and Gagan Agrawal
Department of Computer Science and Engineering, Ohio State University
* School of Engineering and Computer Science, Washington State University, Vancouver (Presenter)
The 4th Workshop on Many-Task Computing on Grids and Supercomputers: MTAGS’11

2 Outline
Background
–Problem and Goal
System Overview
Experiments
Conclusion
T Bicer, D Chiu, G Agrawal. Mate-EC2: A Middleware for Processing Data with AWS (MTAGS 2011, Seattle WA)

3 The Problem
Today’s applications are increasingly data and compute intensive
Many-Task Computing paradigms are becoming pervasive: workflows (WF), Map-Reduce (MR)
E.g., Map-Reducible applications are solving common problems
–Data mining
–Graph processing
–etc.

4 Clouds
Infrastructure-as-a-Service
–Anyone, anywhere can allocate “unlimited” virtualized compute/storage resources
Amazon Web Services:
–Most popular IaaS provider
–Elastic Compute Cloud (EC2)
–Simple Storage Service (S3)

5 Amazon Web Services: EC2
On-Demand Instances (Virtual Machines)
–Types: Extra Large, Large, Small, etc.
For example, Extra Large Instance:
–Cost: $0.68 per allocated-hour
–15GB memory; 1.7TB disk (ephemeral)
–8 Compute Units
–High I/O performance
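As a rough illustration of the per-allocated-hour pricing above, the sketch below estimates the bill for a job. It assumes that every partial hour is billed as a full hour for each instance, which is one reading of "allocated-hour"; the function name and the example workload are hypothetical.

```python
import math

EXTRA_LARGE_RATE = 0.68  # USD per allocated-hour, taken from the slide


def estimate_cost(runtime_hours, instances, rate=EXTRA_LARGE_RATE):
    """Estimate on-demand cost, assuming each partial hour is billed as a full hour."""
    billed_hours = math.ceil(runtime_hours)
    return round(billed_hours * instances * rate, 2)


# Hypothetical workload: 4 instances running for 2.5 hours each.
print(estimate_cost(2.5, 4))
```

Under this assumption a job that finishes just past an hour boundary pays for the whole next hour, so runtime and instance count trade off in whole-hour steps.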

6 Amazon Web Services: S3
Accessible anywhere. High reliability and availability
Objects are arbitrary data blobs
Objects stored as (key,value) in Buckets
–Up to 5TB of data per object
–Unlimited objects per bucket

7 Amazon Web Services: S3 (Cont.)
Simple FTP-like interface using web service protocol
–Put, Get (Partial Get), and Delete
–SOAP and REST
High throughput (~40MB/sec)
–Scales well to multiple clients
Low costs
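Over REST, a Partial Get is usually expressed with the standard HTTP Range header on a GET request. The sketch below only builds that header value and performs no network call; the helper name is hypothetical.

```python
def range_header(offset, length):
    """Build an HTTP Range header value for bytes [offset, offset + length)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={offset}-{offset + length - 1}"


# Headers for fetching the first 1024 bytes of an object.
headers = {"Range": range_header(offset=0, length=1024)}
print(headers)
```

Because each request names its own byte range, several clients or threads can read disjoint parts of the same object at once.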

8 Amazon Web Services: S3 (Cont.)
449 billion objects in S3 as of July 2011
–Doubling each year

9 Motivation and Focus
Virtualization is characteristic of any cloud environment: Clouds are black boxes
Storage and elastic compute services exhibit performance variabilities we should leverage

10 Goals
As users are increasingly moving to cloud-based solutions for computing...
We have a need for services and tools that can...
–Get the most out of cloud resources for data-intensive processing
–Provide a simple programming interface

11 Outline
Background
System Overview
Experiments
Conclusion

12 MATE-EC2 System Design
Cloud middleware that is able to
–Use a set of possibly heterogeneous EC2 instances to scalably and efficiently process data stored in S3
MATE is a generalized reduction PDC structure like Map-Reduce

13 MATE and Map-Reduce
[Figure: two panels labeled MATE and Map-Reduce]
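The comparison on this slide survives only as figure labels, so the sketch below is an illustration rather than the authors' API. It assumes the usual description of generalized reduction: each data element is folded directly into a reduction object, and the per-split reduction objects are merged in a final combination step, whereas Map-Reduce emits intermediate (key, value) pairs that are grouped before reducing. All function names are hypothetical.

```python
from collections import defaultdict


def mapreduce_style(records):
    # Map: emit (key, 1) pairs; these intermediate pairs are materialized.
    pairs = [(word, 1) for word in records]
    # Shuffle/group by key, then reduce each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


def generalized_reduction_style(splits):
    # Each split updates its own reduction object in place; no intermediate pairs.
    local_objects = []
    for split in splits:
        robj = defaultdict(int)
        for word in split:
            robj[word] += 1
        local_objects.append(robj)
    # Global combination: merge the per-split reduction objects.
    result = defaultdict(int)
    for robj in local_objects:
        for key, value in robj.items():
            result[key] += value
    return dict(result)


print(mapreduce_style(["a", "b", "a", "c"]))
print(generalized_reduction_style([["a", "b"], ["a", "c"]]))
```

Both functions compute the same counts; the difference is that the second never builds the list of intermediate pairs, which is where the size of the reduction object becomes the main memory concern.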

14 MATE-EC2 Design

15 MATE-EC2 Design: Data Organization
Objects: Physical representation of the data in S3
Chunks: Logical data partitions within objects (exploits memory utilization)
Units: Fixed data units within a chunk for retrieval (exploits concurrency)
Metadata: chunk offset, chunk size, unit size
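The object/chunk/unit layout can be sketched as plain byte-range arithmetic driven by the three metadata fields. This is a minimal illustration, assuming chunks and units are contiguous byte ranges and that the last chunk or unit may be shorter; the `Unit` type and `plan_units` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    chunk_id: int
    offset: int  # absolute byte offset within the object
    size: int    # bytes


def plan_units(object_size, chunk_size, unit_size):
    """Split an object into chunks, and each chunk into fixed-size units."""
    units = []
    for chunk_id, chunk_offset in enumerate(range(0, object_size, chunk_size)):
        chunk_end = min(chunk_offset + chunk_size, object_size)
        for unit_offset in range(chunk_offset, chunk_end, unit_size):
            size = min(unit_size, chunk_end - unit_offset)
            units.append(Unit(chunk_id, unit_offset, size))
    return units


# A 10-byte object with 4-byte chunks and 2-byte units.
for unit in plan_units(10, 4, 2):
    print(unit)
```

The chunk size bounds how much data a node holds in memory at once, while the unit size sets how many independent requests a chunk can be split into.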

16 MATE-EC2 Design: Data Retrieval
Threaded chunk retrieval: Chunk retrieved concurrently with a number of threads
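Threaded chunk retrieval can be sketched with a thread pool that fetches the units of one chunk in parallel and reassembles them in order. This is an assumption-laden illustration: `fetch_range` is a hypothetical stand-in for a ranged S3 GET and simply slices an in-memory bytes object.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_range(blob, offset, size):
    """Stand-in for a ranged GET; here it just slices an in-memory bytes object."""
    return blob[offset:offset + size]


def retrieve_chunk(blob, chunk_offset, chunk_size, unit_size, threads=4):
    """Fetch one chunk as fixed-size units using a pool of threads."""
    chunk_end = min(chunk_offset + chunk_size, len(blob))
    offsets = range(chunk_offset, chunk_end, unit_size)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves input order, so units reassemble in the right order.
        parts = pool.map(
            lambda off: fetch_range(blob, off, min(unit_size, chunk_end - off)),
            offsets,
        )
        return b"".join(parts)


blob = bytes(range(256)) * 4
chunk = retrieve_chunk(blob, chunk_offset=100, chunk_size=300, unit_size=64)
print(len(chunk))
```

With a real storage backend the threads spend most of their time waiting on the network, which is why several concurrent requests per chunk can raise throughput.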

17 Dynamic Load Balancing
Job Pool

18 Dynamic Load Balancing
Job Pool
Schedule for processing
Simple greedy heuristic: Select a unit belonging to the least connected chunk
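One possible reading of the greedy heuristic is sketched below: the job pool tracks how many active connections each data source currently has and hands out the next job from the source with the fewest. The `JobPool` class, its methods, and the notion of `source` (the chunk or object that connections are counted against) are assumptions made for illustration.

```python
class JobPool:
    """Toy job pool; `source` is whatever the connections are counted against."""

    def __init__(self, jobs_by_source):
        self.pending = {src: list(jobs) for src, jobs in jobs_by_source.items()}
        self.connections = {src: 0 for src in jobs_by_source}

    def next_job(self):
        candidates = [src for src, jobs in self.pending.items() if jobs]
        if not candidates:
            return None
        # Greedy choice: the source with the fewest active connections.
        src = min(candidates, key=lambda s: self.connections[s])
        self.connections[src] += 1
        return src, self.pending[src].pop(0)

    def finish(self, src):
        """Release the connection once the job's data has been retrieved."""
        self.connections[src] -= 1


pool = JobPool({"A": [1, 2], "B": [3]})
print(pool.next_job())
print(pool.next_job())
```

Spreading requests this way avoids piling many readers onto one source while others sit idle.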

19 MATE-EC2 Processing Flow
(1) Compute node requests a job from Master

20 MATE-EC2 Processing Flow
(2) Chunk retrieved in units

21 MATE-EC2 Processing Flow
(3) Pass to Compute Layer, and process

22 MATE-EC2 Processing Flow
(4) Request another job from Master

23 MATE-EC2 Processing Flow
Slave Instance...
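Steps (1) to (4) amount to a pull-based loop: each compute node asks for a job, retrieves and processes it, then asks again until the pool is empty. The sketch below mimics this with threads and an in-process queue standing in for the Master; `run_workers` and its parameters are hypothetical, and the chunks are passed in directly rather than retrieved from S3.

```python
import queue
import threading


def run_workers(chunks, process, num_workers=4):
    """Workers repeatedly pull a job, process it, and fold the result locally."""
    jobs = queue.Queue()
    for chunk in chunks:
        jobs.put(chunk)
    partials = [0] * num_workers

    def worker(worker_id):
        while True:
            try:
                chunk = jobs.get_nowait()  # steps (1) and (4): request a job
            except queue.Empty:
                return  # no jobs left
            # Steps (2) and (3): the chunk would be retrieved here, then processed.
            partials[worker_id] += process(chunk)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)  # combine the per-worker results


print(run_workers([[1, 2], [3], [4, 5, 6]], sum, num_workers=2))
```

Because workers pull jobs instead of receiving a fixed share up front, a faster node naturally ends up processing more chunks, which is the point of dynamic load balancing on heterogeneous instances.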

24 Experiments
Goals
–Finding the most suitable setting for AWS
–Performance of MATE-EC2 on heterogeneous and homogeneous compute environments
–Performance comparison of MATE-EC2 and Map-Reduce

25 Experiments (Cont.)
Setup:
–4 Large EC2 slave instances
–1 Large instance for master instance
–For each application, the dataset is split into 16 data objects on S3
Large Instance:
–4 compute units (each comparable to GHz)
–7.5GB (memory)
–850GB (disk, ephemeral)
–High I/O

26 Experiments (Cont.)
App | I/O | Comp | RObj Size | Dataset
KMeans Clustering | Low/Med | Med/High | Small | 8.2GB, 10.7 billion points
PCA | Low | High | Large | 8.2GB
PageRank | High | Low/Med | Very Large | 1GB, 9.6M nodes, 131M edges

27 Effect of Chunk Sizes (KMeans)
1 Data Retrieval Thread
Performance increase:
–128KB vs. >8M
–2.07x to 2.49x speedup

28 Data Retrieval (KMeans)
8M vs. others speedup: 1.13x x
Data Retrieval Threads
One Thread vs. others: 1.37x x
128M Chunk Size

29 Job Assignment (KMeans)
Speedup:
–1.01x for 8M
–1.1x to 1.14x for others

30 MATE-EC2 vs. Elastic MapReduce
Chunk Size: 128MB
Data retrieval Threads: 16
PageRank: Speedups vs. EMR-combine 4.08x to 7.54x
KMeans: Speedups vs. EMR-combine 3.54x to 4.58x

31 MATE-EC2 on Heterogeneous Instances
Overheads
–KMeans: 1%
–PCA: 1.1%, 7.4%, 11.7%

32 In Conclusion...
AWS environment is explored for data-intensive computing
–64M and 128M data chunks w/ 16 data retrieval threads seem to be optimal for our middleware
Our data retrieval and processing optimizations significantly improve the performance of our middleware
MATE-EC2 outperforms MR both in scalability and performance

33 Thank You
Questions & Discussion
Acknowledgments:
–We want to thank the MTAGS’11 organizers and the reviewers for their constructive comments
–This work was sponsored by:

34 Some Instance Types
Cluster GPU
–22 GB, 33.5 CU (2 x Intel Xeon X5570, quad-core “Nehalem” arch.)
–2 x NVIDIA Tesla “Fermi” M2050 GPUs
–1690 GB of instance storage
–64-bit platform, I/O Performance: Very High (10 Gigabit Ethernet)
Cluster Compute
–23 GB, 33.5 CU (2 x Intel Xeon X5570, quad-core “Nehalem”)
–1690 GB of instance storage
–64-bit platform, I/O Performance: Very High (10 Gigabit Ethernet)
Small/Large Instance
–1.7/7.5 GB, 1/4 CU (1/2 virtual cores with 1/2 EC2 Compute Units each)
–160/850 GB instance storage
–I/O Performance: Moderate/High (10 Gigabit Ethernet)

35 Bucket/Key

