Supporting Fault Tolerance in a Data-Intensive Computing Middleware. Tekin Bicer, Wei Jiang and Gagan Agrawal, Department of Computer Science and Engineering, The Ohio State University. IPDPS 2010.

Presentation transcript:

Supporting Fault Tolerance in a Data-Intensive Computing Middleware
Tekin Bicer, Wei Jiang and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
IPDPS 2010, Atlanta, Georgia

Motivation
Data-intensive computing
– Distributed large datasets
– Distributed computing resources
Cloud environments
Long execution time
High probability of failures

A Data-Intensive Computing API: FREERIDE
– The reduction object represents the intermediate state of the execution.
– The reduce function is commutative and associative.
– Sorting and grouping overheads are eliminated with the reduction function/object.
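A minimal, self-contained sketch of what this programming model looks like in code; the names below (ReductionObject, Reduce) are invented for illustration and are not FREERIDE's actual API. It shows why a commutative and associative reduce function lets elements be folded into the reduction object in any order, with no sorting or grouping.

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <random>
    #include <vector>

    // Hypothetical reduction object: just the accumulated intermediate state.
    struct ReductionObject { long long sum = 0; };

    // User-supplied reduce function; '+' is commutative and associative.
    void Reduce(ReductionObject& robj, int element) { robj.sum += element; }

    int main() {
        std::vector<int> data(1000);
        std::iota(data.begin(), data.end(), 1);          // 1, 2, ..., 1000

        ReductionObject in_order;
        for (int e : data) Reduce(in_order, e);

        std::shuffle(data.begin(), data.end(), std::mt19937{42});
        ReductionObject shuffled;
        for (int e : data) Reduce(shuffled, e);          // no sorting/grouping needed

        // Same intermediate state either way, so the runtime is free to choose
        // any processing order or data distribution.
        std::cout << in_order.sum << " == " << shuffled.sum << "\n";
        return 0;
    }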

Simple Example
(Figure) A large dataset is split across the compute nodes; each node folds its portion into its reduction object with a local reduction (+), and a global reduction (+) over the per-node objects gives the final result of 71.
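A short sketch of the example above, assuming invented element values (the slide only shows the final result of 71): each node's chunk is reduced locally with +, then the per-node reduction objects are combined with a global reduction.

    #include <iostream>
    #include <numeric>
    #include <vector>

    // Minimal sketch: one chunk of the dataset per compute node, a local
    // reduction (+) into each node's reduction object, and a final global
    // reduction (+) over the per-node objects. Values are for illustration.
    int main() {
        std::vector<std::vector<int>> chunks = {{8, 5, 9}, {7, 6, 4, 3}, {10, 2, 6, 11}};

        std::vector<int> robj;                       // one reduction object per node
        for (const auto& chunk : chunks)             // local reduction on each node
            robj.push_back(std::accumulate(chunk.begin(), chunk.end(), 0));

        int result = std::accumulate(robj.begin(), robj.end(), 0);  // global reduction
        std::cout << "Result = " << result << "\n";  // prints 71 for these values
        return 0;
    }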

Remote Data Analysis
– Co-locating data and compute resources gives the best performance, but it is not always possible (cost, availability, etc.).
– Data hosts and compute hosts are separated, which fits grid/cloud computing.
– FREERIDE-G is a version of FREERIDE that supports remote data analysis.

Fault Tolerance Systems
Checkpoint based
– System- or application-level snapshot
– Architecture dependent
– High overhead
Replication based
– Service or application replication
– Resource allocation
– Low overhead

Outline: Motivation and Introduction, Fault Tolerance System Approach, Implementation of the System, Experimental Evaluation, Related Work, Conclusion

A Fault Tolerance System Based on the Reduction Object
The reduction object…
– represents the intermediate state of the computation
– is small in size
– is independent of the machine architecture
The reduction object and function have associative and commutative properties, which makes them suitable for a checkpoint-based fault tolerance system.
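A rough sketch of why checkpointing such an object is cheap; this is not FREERIDE-G code, and the file format and names below are invented for illustration. The snapshot is a copy of a small, flat buffer rather than a full process image.

    #include <cstdio>
    #include <iostream>
    #include <vector>

    // The reduction object is small and flat, so a checkpoint is just a copy
    // of a few bytes (or a short message to a peer node).
    struct ReductionObject {
        std::vector<double> accum;   // e.g. per-cluster sums and counts for k-means
    };

    bool SaveCheckpoint(const ReductionObject& robj, const char* path) {
        std::FILE* f = std::fopen(path, "wb");
        if (!f) return false;
        std::size_t n = robj.accum.size();
        std::fwrite(&n, sizeof n, 1, f);
        std::fwrite(robj.accum.data(), sizeof(double), n, f);
        std::fclose(f);
        return true;
    }

    bool LoadCheckpoint(ReductionObject& robj, const char* path) {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return false;
        std::size_t n = 0;
        if (std::fread(&n, sizeof n, 1, f) != 1) { std::fclose(f); return false; }
        robj.accum.assign(n, 0.0);
        std::size_t got = std::fread(robj.accum.data(), sizeof(double), n, f);
        std::fclose(f);
        return got == n;
    }

    int main() {
        ReductionObject robj{{1.5, 2.5, 4.0}};      // tiny intermediate state
        SaveCheckpoint(robj, "robj.ckpt");          // in the middleware, this snapshot
                                                    // would go to a peer node instead
        ReductionObject restored;
        LoadCheckpoint(restored, "robj.ckpt");
        std::cout << "restored " << restored.accum.size() << " values\n";
        return 0;
    }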

An Illustration
(Figure) A node folds elements into its reduction object with the local reduction (+); the figure traces the object's value (0, 8, 21, and later 25) as the data is processed.

Modified Processing Structure for FTS

    { * Initialize FTS * }
    While {
        Foreach ( element e ) {
            (i, val) = Process(e);
            RObj(i) = Reduce(RObj(i), val);
            { * Store Red. Obj. * }
        }
        if ( CheckFailure() )
            { * Redistribute Data * }
        { * Global Reduction * }
    }
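A single-process sketch of the loop above, with invented stand-ins for Process, Reduce, StoreReductionObject and CheckFailure; the real loop runs distributed across compute nodes, so this is only meant to make the control flow concrete.

    #include <iostream>
    #include <utility>
    #include <vector>

    using ReductionObject = std::vector<double>;

    std::pair<int, double> Process(double e) { return {0, e}; }        // (slot index, value)
    double Reduce(double current, double val) { return current + val; }
    void StoreReductionObject(const ReductionObject&) { /* ship snapshot to a peer */ }
    bool CheckFailure() { return false; }                               // no failure in this run

    int main() {
        ReductionObject robj(1, 0.0);                                   // { * Initialize FTS * }
        std::vector<std::vector<double>> blocks = {{1, 2, 3}, {4, 5, 6}};

        for (const auto& block : blocks) {                              // While (data remains)
            for (double e : block) {                                    // Foreach (element e)
                auto [i, val] = Process(e);
                robj[i] = Reduce(robj[i], val);
                StoreReductionObject(robj);   // { * Store Red. Obj. * } (in practice, periodically)
            }
            if (CheckFailure()) {
                // { * Redistribute Data * }: reassign the failed node's remaining
                // blocks to the surviving nodes.
            }
        }
        // { * Global Reduction * }: with a single node there is nothing to merge.
        std::cout << "robj = " << robj[0] << "\n";                      // prints 21
        return 0;
    }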

Outline: Motivation and Introduction, Fault Tolerance System Design, Implementation of the System, Experimental Evaluation, Related Work, Conclusion

Simple Implementation of the Algorithm
– The reduction object is stored on another compute node (pair-wise reduction object exchange).
– Failure detection is done by the alive peer.
(Figure) Compute nodes CN0 through CNn exchange reduction objects pair-wise.
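A small simulation-style sketch of this scheme, with invented types and values; it only illustrates the pairing and how the alive peer notices a missed exchange, not the actual FREERIDE-G protocol.

    #include <iostream>
    #include <vector>

    // Each node periodically sends its reduction-object snapshot to a fixed
    // partner; the partner doubles as its failure detector when an exchange
    // is missed.
    struct Node {
        int id;
        bool alive = true;
        std::vector<double> robj;          // the node's own reduction object
        std::vector<double> partner_robj;  // last snapshot received from its partner
    };

    int PartnerOf(int id) { return (id % 2 == 0) ? id + 1 : id - 1; }  // (CN0,CN1), (CN2,CN3), ...

    int main() {
        std::vector<Node> nodes;
        for (int i = 0; i < 4; ++i) nodes.push_back({i, true, {double(i + 1)}, {}});

        nodes[2].alive = false;                          // simulate a crash of CN2

        for (auto& node : nodes) {
            if (!node.alive) continue;
            Node& partner = nodes[PartnerOf(node.id)];
            if (partner.alive) {
                node.partner_robj = partner.robj;        // exchange succeeded
            } else {
                // A missed exchange (a timeout in the real system) lets the alive
                // peer declare the failure; the snapshot it already holds preserves
                // the failed node's completed work.
                std::cout << "CN" << node.id << " detected failure of CN"
                          << partner.id << "\n";
            }
        }
        return 0;
    }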

Demonstration
(Figure) Nodes N0 through N3 each run a local reduction and exchange reduction-object snapshots pair-wise (N0 with N1, N2 with N3). When a failure is detected, the failed node's remaining data is redistributed among the surviving nodes, and the global reduction produces the final result.
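A sketch of the recovery arithmetic in this demonstration, with invented numbers: the snapshot held by the failed node's peer preserves its completed work, its remaining data is redistributed, and the global reduction still yields the same result as a failure-free run.

    #include <iostream>
    #include <numeric>
    #include <vector>

    // N2 fails midway; N3 already holds a snapshot of N2's reduction object, and
    // N2's unprocessed elements are given to a survivor, so the final result
    // is unchanged. All values are invented for illustration.
    int main() {
        std::vector<std::vector<int>> data = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};  // per-node chunks

        // N2 fails after processing only the element 5; its snapshot survives on N3.
        double snapshot_of_n2 = 5;
        std::vector<int> remaining_of_n2 = {6};

        // Survivors finish their own chunks.
        double r0 = std::accumulate(data[0].begin(), data[0].end(), 0.0);
        double r1 = std::accumulate(data[1].begin(), data[1].end(), 0.0);
        double r3 = std::accumulate(data[3].begin(), data[3].end(), 0.0);

        // Redistribute N2's remaining data (here: all of it goes to N0).
        for (int e : remaining_of_n2) r0 += e;

        double result = r0 + r1 + r3 + snapshot_of_n2;   // global reduction
        std::cout << "result = " << result << "\n";      // 36, same as the failure-free run
        return 0;
    }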

Outline: Motivation and Introduction, Fault Tolerance System Design, Implementation of the System, Experimental Evaluation, Related Work, Conclusion

Goals for the Experiments
– Observe the reduction object size
– Evaluate the overhead of the FTS
– Study the slowdown when one node fails
– Compare with Hadoop (a MapReduce implementation)

Experimental Setup
FREERIDE-G
– Data hosts and compute nodes are separated
Applications
– K-means and PCA
Hadoop (MapReduce implementation)
– Data is replicated among all nodes

Experiments (K-means)
Without-failure configurations
– Without FTS
– With FTS
With-failure configuration
– Failure after processing 50% of the data (on one node)
(Figure) Execution times with K-means, 25.6 GB dataset
– Reduction object size: 2 KB
– With-FT overheads: % (max: 8 compute nodes, 25.6 GB)
– Relative: 5.38 – 21.98% (max: 4 compute nodes, 25.6 GB)
– Absolute: 0 – 4.78% (max: 8 compute nodes, 25.6 GB)

Experiments (PCA)
(Figure) Execution times with PCA, 17 GB dataset
– Reduction object size: 128 KB
– With-FT overheads: 0 – 15.36% (max: 4 compute nodes, 4 GB)
– Relative: 7.77 – 32.48% (max: 4 compute nodes, 4 GB)
– Absolute: 0.86 – 14.08% (max: 4 compute nodes, 4 GB)

Comparison with Hadoop
(Figure) K-means clustering, 6.4 GB dataset; w/f = with failure; the failure happens after processing 50% of the data on one node
Overheads: Hadoop | | FREERIDE-G | 8.18 | 9.18

Comparison with Hadoop
(Figure) K-means clustering, 6.4 GB dataset, 8 compute nodes; one of the compute nodes failed after processing 25%, 50% and 75% of its data
Overheads: Hadoop | | FREERIDE-G 9.52 | 8.18 | 8.14

Outline: Motivation and Introduction, Fault Tolerance System Design, Implementation of the System, Experimental Evaluation, Related Work, Conclusion

Related Work
Application-level checkpointing
– Bronevetsky et al.: C^3 (SC '06, ASPLOS '04, PPoPP '03)
– Zheng et al.: FTC-Charm++ (Cluster '04)
Message logging
– Agbaria et al.: Starfish (Cluster '03)
– Bouteiller et al.: MPICH-V (Int. Journal of High Performance Computing '06)
Replication-based fault tolerance
– Abawajy et al. (IPDPS '04)

Outline: Motivation and Introduction, Fault Tolerance System Design, Implementation of the System, Experimental Evaluation, Related Work, Conclusion

Conclusion
– The reduction object represents the state of the computation.
– Our FTS has very low overhead and effectively recovers from failures.
– Different fault-tolerance designs can be implemented using the reduction object.
– Our system outperforms Hadoop both in the absence and in the presence of failures.

Thanks