Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis Thesis Defense: Ashish Nagavaram Graduate student Computer Science and Engineering.

Presentation transcript:

Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis
Thesis Defense: Ashish Nagavaram, Graduate Student, Computer Science and Engineering
Advisor: Dr. Gagan Agrawal
Committee: Dr. Rajiv Ramnath, Dr. Michael Freitas

Introduction
- Cloud computing
  - Resources on demand
  - Pay-as-you-go
  - Elasticity
- Resource allocation on the cloud
  - Dynamic resource allocation

Motivation
- Use the elasticity of the cloud for executing scientific applications
  - Avoid over-provisioning and under-provisioning
  - Avoid wasting resources
- No generalized scientific workflow exists for executing applications in a dynamic fashion
- Allocate resources during execution
- Meet time constraints by using more resources

Background: MassMatrix
- Developed by Dr. Hua Xu and Dr. Michael Freitas at Ohio State University
- A database search program for rapid characterization of proteins and peptides
  - Supports multiple input data formats, such as .mgf, .mzXML, and raw data
  - The input database is in .fasta or .BAS format

MassMatrix Application Flow
[Flowchart. Inputs: theoretical protein database, MS/MS data input file. Steps: digest the sequence; check whether the sequence has been searched before (yes: do not add it to the final result; no: full-scan search for matching peptides); clear insignificant peptides; statistical analysis to generate results. Output: results.]

Contributions (1/2)
- A framework for parallelizing the MassMatrix application
- A dynamic workflow
  - Resources are allocated adaptively
  - QoS is achieved through parameter prediction
  - A benefit function gives the user control

Contributions (2/2)
- Allows the user to specify the time constraint within which the application should complete
- Publication: "A cloud-based Dynamic Workflow for Mass spectrometry Data Analysis", Ashish Nagavaram, Gagan Agrawal, Michael Freitas, Gaurang Mehta, 7th IEEE Conference on E-Science, Dec 2011

Outline
- Introduction
- Motivation
- Background
- Parallelization of MassMatrix
- Adaptive Resource Allocation
- Experimental Results
- Parameter Prediction
- Conclusion

Parallel MassMatrix
- Parallelize the full-scan search phase
  - It takes the longest time to execute
  - The remaining phases are sequential
- A split-merge approach is followed
  - The user can specify the number of splits
  - Splits are made at specific tags
  - The split index is embedded in the split file name
  - Other options were also considered
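The split step above can be sketched roughly as follows. This is a minimal illustration, not the thesis code: it assumes the "specific tags" are the `BEGIN IONS` / `END IONS` markers that delimit spectra in the Mascot Generic Format (.mgf), and the function names and the `<stem>_split<index>.mgf` naming scheme are hypothetical.

```python
from pathlib import Path


def split_mgf(text: str, num_splits: int) -> list[str]:
    """Split MGF text into at most num_splits chunks of whole spectra.

    Assumption: spectra are delimited by BEGIN IONS / END IONS lines.
    Lines outside any spectrum (e.g. file-level headers) are dropped here.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    spectra, current = [], []
    for line in text.splitlines():
        if line.strip() == "BEGIN IONS":
            current = [line]
        elif line.strip() == "END IONS" and current:
            current.append(line)
            spectra.append("\n".join(current))
            current = []
        elif current:
            current.append(line)
    # Ceiling division so spectra are spread as evenly as possible.
    per_split = max(1, -(-len(spectra) // num_splits))
    return [
        "\n".join(spectra[i:i + per_split]) + "\n"
        for i in range(0, len(spectra), per_split)
    ]


def write_splits(chunks: list[str], out_dir: Path, stem: str) -> list[Path]:
    """Write each chunk to <stem>_split<index>.mgf (index embedded in the name)."""
    paths = []
    for index, chunk in enumerate(chunks, start=1):
        path = out_dir / f"{stem}_split{index}.mgf"
        path.write_text(chunk)
        paths.append(path)
    return paths
```

Cutting only at spectrum boundaries keeps every record intact, and carrying the index in the file name lets a later merge step recover the original split order.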

Parallel MassMatrix (contd.)
- Only the input file is split
  - Splitting the database as well leads to redundant results
  - Splitting both the input and the database causes the same problem
- The intermediate files are written to disk
  - Pointers are serialized
  - Data is written as comma-separated values
- A Python script keeps polling the job queue to check whether the parallel phase has completed
  - The sequential phase is suspended until then
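The polling behaviour described above might look something like the sketch below. It is an illustration under assumptions: `pending_jobs` is a hypothetical callable standing in for whatever queries the scheduler's job queue, and the interval and timeout parameters are not taken from the thesis.

```python
import time
from typing import Callable, Optional


def wait_for_parallel_phase(
    pending_jobs: Callable[[], int],
    poll_interval: float = 30.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until no parallel jobs remain in the queue.

    pending_jobs is a hypothetical hook that returns how many jobs of the
    parallel phase are still queued or running. Returns True when the queue
    has drained and False if the optional timeout expires first.
    """
    waited = 0.0
    while pending_jobs() > 0:
        if timeout is not None and waited >= timeout:
            return False
        sleep(poll_interval)  # the sequential phase stays suspended meanwhile
        waited += poll_interval
    return True
```

Passing the queue query and the sleep function in as arguments keeps the loop independent of any particular scheduler and makes it easy to exercise without real waiting.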

Parallel MassMatrix (contd.)
- The intermediate files are read back in and re-indexed while merging
- The merging process is complicated
  - Complex data structures (matrix of matrices)
  - Need to get inside each data structure to maximize them
  - Intermediate files are indexed among each other
  - Both a local and a global index are maintained while re-indexing
  - The data structures are also renumbered while merging

Parallel MassMatrix (contd.)
- Intermediate files are merged in the order of the splits they process
- Unnecessary intermediate files are not loaded back
  - Saves memory
  - Helps with large data files
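The re-indexing idea from the merge slides can be illustrated with a deliberately simplified sketch. The real intermediate data is described as a matrix of matrices; here each split's result is assumed to be a flat list of records with hypothetical `local_index` and `score` fields, purely to show how a local and a global index can be kept side by side while merging in split order.

```python
def merge_splits(split_results: dict[int, list[dict]]) -> list[dict]:
    """Merge per-split records in split order and assign global indices.

    split_results maps a split index to that split's records. Each record is
    assumed (hypothetically) to carry a 'local_index' and a 'score'.
    """
    merged = []
    next_global = 0
    for split_index in sorted(split_results):  # merge in order of the splits
        for record in split_results[split_index]:
            merged.append({
                "global_index": next_global,  # renumbered across all splits
                "split": split_index,
                "local_index": record["local_index"],  # original index kept
                "score": record["score"],
            })
            next_global += 1
    return merged
```

Keeping the split number and local index next to the new global index means any merged record can still be traced back to the intermediate file it came from.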

MassMatrix Flow (Parallel)
[Diagram. Labels: Configuration File, Input File, Input Database, Python Script, split1, split2, splitN, massmatrix, Merge, Sequential phase.]

Experimental results (Parallelization)
Experimental setup:
- 8-core Intel Xeon node with 6 GB of DDR400 RAM
- The theoretical database was a 20 MB database in .fasta format
- The code was run on 6 different datasets
  - Each had 50,000 records on average
  - Each was in .mgf format
- Experiments were run with 1, 2, 4, and 8 splits
  - Run on a single node with 8 cores

Experimental results (Parallelization)
[Figure: Execution times when datasets are run for 1, 2, 4 and 8 splits]

Experimental results (Parallelization)
[Figure: Execution times for datasets when run on 1, 2, 4 and 8 cores]

Background (Pegasus)
- Used to help create the adaptive version of MassMatrix
  - A software system for managing workflows
  - Manages resources on local, grid, and cloud platforms
  - Provides APIs to create workflows
- Creates a DAG to represent dependencies
  - The DAG has an edge between two nodes if there is a dependency between them
- Creates a plan for the execution of the application
  - Executes the application according to this plan
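To make the DAG idea concrete, the sketch below models a split-search-merge workflow as a dependency graph and derives a valid execution order from it. It uses Python's standard `graphlib` module rather than the Pegasus API, and the job names (`split`, `search_1`, `merge`, `sequential_phase`) are hypothetical.

```python
from graphlib import TopologicalSorter


def build_workflow(num_splits: int) -> dict[str, set[str]]:
    """Return a mapping from each job to the set of jobs it depends on."""
    searches = [f"search_{i}" for i in range(1, num_splits + 1)]
    deps: dict[str, set[str]] = {"split": set()}
    for job in searches:
        deps[job] = {"split"}           # every search needs the split output
    deps["merge"] = set(searches)       # merge waits for all searches
    deps["sequential_phase"] = {"merge"}
    return deps


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """One order in which the jobs can run without violating dependencies."""
    return list(TopologicalSorter(deps).static_order())
```

The search jobs have no edges between them, which is exactly what allows a scheduler to run them in parallel once the split job has finished.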

Background (Condor)
- Uses Wrangler to start nodes in the cloud
  - New nodes are added to the cluster automatically
  - Uses Amazon private and public keys to identify the user
  - Configuration is specified in an XML file
- Condor is the job scheduler used
  - Developed at the University of Wisconsin
  - Jobs are stored in a queue
  - Jobs are submitted from the queue to the cluster in FIFO order
  - Provides fault tolerance through checkpointing

The Pegasus workflow
[Figure: Pegasus workflow showing the workflow of the MassMatrix application]

Parallel Pegasus workflow
[Figure: Pegasus workflow for the parallel version of the MassMatrix application]

Adaptive Resource Allocation
- An approach for dynamic resource allocation
  - The decision is based on the rate of execution
  - Calculates the number of additional resources needed to meet the time constraint
- Initial assumption: the input is divided into equal splits
- The decision is made on the basis of the execution time of the initial N splits

Adaptive Resource Allocation (contd.)
- The code is initially run with N resources
- In our case, N = 4
- Let $T_{per\_split}$ be the execution time of a single split
- Let $T_{constraint}$ be the user-specified time constraint
- Then we can say that
  $T_{time\_constraint} = T_{constraint} - (2 \times T_{per\_split})$    (1)

Adaptive Resource Allocation (contd.)
- Another N splits must have already started execution
  - Hence we do not consider them in the calculation
- Hence, if we use N resources, the predicted execution time is
  $T_{execution\_pred} = T_{per\_split} \times (split\_count - 2 \times N)$    (2)
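As a quick arithmetic check of equations (1) and (2), consider some illustrative values that are not taken from the thesis: N = 4, split_count = 16, a per-split time of 10 minutes, and a user constraint of 100 minutes.

$$
T_{time\_constraint} = 100 - (2 \times 10) = 80 \text{ min}, \qquad
T_{execution\_pred} = 10 \times (16 - 2 \times 4) = 80 \text{ min}
$$

With these assumed numbers the predicted execution time equals the remaining time budget.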

Adaptive Resource Allocation (contd.)
- Based on equations (1) and (2), we can calculate the number of nodes needed [equation not preserved in the transcript]
- Nodes required is the number of additional nodes that need to be spawned

Adaptive Algorithm
[Figure: Algorithm showing the steps involved in calculating the additional resources needed to meet the time constraint]
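The algorithm figure itself is not available in this transcript, so the following is only a sketch of how such a calculation could be written. Equations (1) and (2) are taken from the slides; the final step, which divides the predicted time by the remaining budget and subtracts the N nodes already running, is an assumption of this sketch and not the formula from the thesis. The function and parameter names are hypothetical.

```python
import math


def additional_nodes(
    t_per_split: float, t_constraint: float, split_count: int, n: int
) -> int:
    """Estimate how many extra nodes to spawn to meet the time constraint."""
    # Equation (1): time budget left after the first two rounds of splits.
    t_time_constraint = t_constraint - 2 * t_per_split
    # Equation (2): predicted time for the splits that have not started yet.
    t_execution_pred = t_per_split * (split_count - 2 * n)
    if t_execution_pred <= 0:
        return 0  # nothing left to schedule
    if t_time_constraint <= 0:
        raise ValueError("time constraint cannot be met: no budget remains")
    # ASSUMPTION (not from the source): the total node count needed is the
    # predicted time divided by the remaining budget, rounded up.
    nodes_needed = math.ceil(t_execution_pred / t_time_constraint)
    return max(0, nodes_needed - n)
```

For example, under this assumed rule a run with N = 4, 32 splits, 10 minutes per split, and a 60-minute constraint would request two extra nodes.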

Experimental Goals
- To evaluate the efficiency of our system with different datasets
- To show that the framework is effective
  - Calculates the additional nodes required
  - Meets the time constraints
  - Tested for different time constraints

Experimental results (Adaptive)
Experimental setup:
- Cloud infrastructure: Amazon EC2
- Submit host to submit jobs to the cloud
- Pegasus version
- Condor job scheduler version
- Results for 2 datasets and different time constraints

Experimental Results (contd.)
[Figure: Results obtained when the algorithm is run for different time constraints on dataset1]

Experimental Results (contd.)
[Figure: Results obtained for dataset2 when run with the same time constraints]

Benefit function and Parameter prediction (QOS)
Motivation:
- Provide quality of service
  - Trade-off between execution time and quality of results
  - Quality depends on the parameter values
  - Provide a way for the user to control the quality of results
  - Quality is defined as an equation in terms of the parameters
- The user has the flexibility to decide which parameter is more important
- The prediction is made such that the execution time is as close as possible to the time constraint

Benefit function and Parameter prediction (QOS)
- Benefit function: an equation made of some or all of the parameters of the application
  - We use this equation to set the importance of each parameter
  - This is the minimal set of equations needed to obtain the required quality
- The goal is to maximize this benefit function within the user-specified time constraint
  - It is calculated for different parameter combinations
- The decision is made using tables constructed from data from previous executions
  - Hash tables are used

Benefit function and Parameter prediction (QOS)
- Tables contain mappings from parameter combinations to execution times and vice versa
- Multiple datasets can be used for prediction
  - Parameters are mapped to the average execution time
  - This reduces the error percentage
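A table-driven prediction of this kind could be sketched as below. This is an illustration under assumptions, not the thesis implementation: parameter combinations are represented as plain tuples, the benefit function is any user-supplied callable, and ties in benefit are broken in favour of the combination whose average time is closest to the constraint.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Iterable, Optional


def build_time_table(runs: Iterable[tuple[tuple, float]]) -> dict[tuple, float]:
    """Hash table from parameter combination to average execution time.

    runs holds (parameter_combination, execution_time) pairs gathered from
    previous executions, possibly over several datasets.
    """
    samples = defaultdict(list)
    for params, seconds in runs:
        samples[params].append(seconds)
    return {params: mean(times) for params, times in samples.items()}


def predict_parameters(
    time_table: dict[tuple, float],
    benefit: Callable[[tuple], float],
    time_constraint: float,
) -> Optional[tuple]:
    """Pick the feasible combination with the highest benefit.

    Returns None if no recorded combination fits within the constraint.
    """
    feasible = [p for p, t in time_table.items() if t <= time_constraint]
    if not feasible:
        return None
    # Highest benefit first; among equals, the longest feasible time, which
    # is the one closest to the constraint.
    return max(feasible, key=lambda p: (benefit(p), time_table[p]))
```

A weighted sum such as `lambda p: 2 * p[0] + p[1]` is one way a user could express that the first parameter matters twice as much as the second.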

Parameter prediction process

Experimental Results
- Experiments were conducted on a Linux desktop machine with 2 cores and 1 GB of memory
- The tables were populated using two datasets, data1.mgf and data2.mgf
- The parameter combinations were predicted for two other datasets, data3.mgf and data4.mgf

Experimental Results
[Figure: Parameter prediction results when run for different benefit function and constraints]

Experimental Results
[Figure: Parameter prediction results for a different benefit function]

Conclusion
- Presented a framework for dynamic execution of scientific workflows
- A user-specified time constraint can be used to drive the allocation of resources
- Effective dynamic allocation
- Maximizing the benefit function
  - Parameter prediction within this value
  - Provides quality results based on user requirements

Thank you