
Evaluating Distributed Platforms for Protein-Guided Scientific Workflow
Natasha Pavlovikj, Kevin Begcy, Sairam Behera, Malachy Campbell, Harkamal Walia, Jitender S. Deogun
University of Nebraska-Lincoln
XSEDE '14, July 2014, Atlanta, GA, USA

Introduction
 Gene expression and transcriptome analysis are among the main research focuses for a great number of biologists and scientists
 The analysis of this so-called "big data" is done with a complex combination of many software tools
 This creates a growing demand for powerful computational resources where the data can be stored and analyzed

Assembly Pipeline
 Assembly of raw sequence data is a complex multi-stage process composed of preprocessing, assembling, and post-processing
 An assembly pipeline simplifies the entire assembly process by automating these steps

blast2cap3
 The multiple approaches used for assembling the filtered reads produce highly redundant transcripts
 The overlap-based assembly program CAP3 is used to merge transcripts based on overlapping regions with a specified identity
 However, because most of the produced transcripts code for a protein, protein similarity should also be considered during merging

blast2cap3
 Blast2cap3 is a protein-guided assembly approach that first clusters the transcripts based on similarity to a common protein and then passes each cluster to CAP3
 Blast2cap3 is a Python script written by Vince Buffalo from the Plant Sciences Department, UCD
 A recent use of blast2cap3 on the wheat transcriptome assembly shows that it generates fewer artificially fused sequences and reduces the total number of transcripts by 8-9%

blast2cap3
 The assembled transcripts are aligned against protein datasets from organisms closely related to the one being assembled; transcripts sharing a common protein hit are then merged using CAP3
 The current implementation of blast2cap3 supports only serial execution
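The clustering step described above can be sketched in a few lines of Python. This is not the actual blast2cap3 code; it is a minimal illustration, with hypothetical names, of how transcripts sharing a protein hit (e.g. columns 1 and 2 of BLAST tabular output) would be grouped before each group is handed to CAP3.

```python
from collections import defaultdict

def cluster_by_protein(blast_hits):
    """Group transcript IDs by the protein they hit.

    blast_hits: iterable of (transcript_id, protein_id) pairs, as could be
    parsed from BLAST tabular output. Transcripts sharing a protein hit form
    one cluster; each such cluster would then be merged with CAP3.
    """
    clusters = defaultdict(list)
    for transcript, protein in blast_hits:
        clusters[protein].append(transcript)
    # Only clusters with more than one transcript are candidates for merging
    return {p: ts for p, ts in clusters.items() if len(ts) > 1}

hits = [("t1", "protA"), ("t2", "protA"), ("t3", "protB"), ("t4", "protA")]
print(cluster_by_protein(hits))  # {'protA': ['t1', 't2', 't4']}
```

Each resulting cluster is an independent unit of work, which is exactly what makes the parallel decomposition in the next slides possible.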

Pegasus Workflow Management System
 The modularity of blast2cap3 allows us to decompose the existing approach into multiple tasks, some of which can run in parallel
 The protein-guided assembly can therefore be structured as a scientific workflow

Pegasus Workflow Management System
 Pegasus WMS is a framework that automatically maps high-level scientific workflows, organized as directed acyclic graphs (DAGs), onto a wide range of execution platforms, including clusters, grids, and clouds
 Pegasus uses DAX (Directed Acyclic graph in XML) files to specify an abstract workflow
 The abstract workflow contains descriptions of all executable files and the logical names of the input files used by the workflow
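To make the DAX idea concrete, the sketch below emits a simplified DAX-like XML for a fan-out/fan-in workflow using only the standard library. Real Pegasus workflows would use Pegasus's own DAX generator APIs, and the element names here are a deliberately stripped-down approximation of the format, not the full schema.

```python
import xml.etree.ElementTree as ET

def make_dax(n_clusters):
    """Build a simplified DAX-like abstract workflow: a 'split' job fans
    out to n_clusters parallel 'cap3' jobs, which feed a 'merge' job."""
    adag = ET.Element("adag", name="blast2cap3")
    ET.SubElement(adag, "job", id="split", name="split_clusters")
    ET.SubElement(adag, "job", id="merge", name="merge_results")
    for i in range(n_clusters):
        jid = "cap3_%d" % i
        ET.SubElement(adag, "job", id=jid, name="cap3")
        # each cap3 job depends on the split job
        child = ET.SubElement(adag, "child", ref=jid)
        ET.SubElement(child, "parent", ref="split")
    # the merge job depends on every cap3 job
    merge = ET.SubElement(adag, "child", ref="merge")
    for i in range(n_clusters):
        ET.SubElement(merge, "parent", ref="cap3_%d" % i)
    return ET.tostring(adag, encoding="unicode")

print(make_dax(2))
```

The `<child>`/`<parent>` edges are what let the planner schedule all `cap3` jobs concurrently while still serializing the split and merge steps.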

blast2cap3 with Pegasus WMS
 Each node represents a workflow task, while each edge represents a dependency between tasks
 An archive of all required prebuilt libraries and tools (Python, Biopython, CAP3) is used; downloading and extracting this archive is itself defined as a task in the workflow
 The Pegasus WMS implementation of blast2cap3 reduces the running time of the current serial implementation by more than 95%
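The node/edge structure determines which tasks a workflow engine can dispatch at any moment. The following toy scheduler sketch (hypothetical task names, not the paper's actual DAG) shows the core rule: a task is runnable once all of its parent tasks have finished.

```python
def runnable(tasks, deps, done):
    """Return the tasks whose dependencies have all completed,
    i.e. the set of jobs that could run in parallel right now."""
    return {t for t in tasks - done if deps.get(t, set()) <= done}

tasks = {"fetch_archive", "split", "cap3_0", "cap3_1", "join"}
deps = {"split": {"fetch_archive"},
        "cap3_0": {"split"}, "cap3_1": {"split"},
        "join": {"cap3_0", "cap3_1"}}

print(sorted(runnable(tasks, deps, set())))                       # ['fetch_archive']
print(sorted(runnable(tasks, deps, {"fetch_archive", "split"})))  # ['cap3_0', 'cap3_1']
```

Once the archive-extraction and split tasks finish, every CAP3 task becomes runnable at once, which is where the >95% running-time reduction comes from.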


Execution Platforms
 The resources that scientific workflows require can exceed the capabilities of local computational resources
 Scientific workflows are therefore usually executed on distributed platforms, such as campus clusters, grids, or clouds
 The execution platforms used in this work are described next

Sandhills: University of Nebraska Campus Cluster
 Sandhills is one of the High Performance Computing (HPC) clusters at the University of Nebraska-Lincoln Holland Computing Center (HCC)
 Used by faculty and students
 Sandhills was constructed in 2011 and has 1,440 AMD cores housed in a total of 44 nodes
 Every new HCC user account is required to be associated with a faculty member or research group

OSG: Open Science Grid
 OSG is a national consortium of geographically distributed academic institutions and laboratories that provide hundreds of computing and storage resources to OSG users
 OSG is organized into Virtual Organizations (VOs)
 OSG itself does not own any computing or storage resources; instead, users run on resources contributed by the other members of the OSG and its VOs
 Every new user applies for an OSG certificate

Amazon EC2: Amazon Elastic Compute Cloud
 Amazon Elastic Compute Cloud (Amazon EC2) is a large commercial web-based service provided by Amazon.com
 Users have access to virtual machine (VM) instances on which they deploy VM images with customized software and libraries
 Amazon EC2 is a scalable, elastic, and flexible platform
 Amazon EC2 users are billed hourly for the number and type of resources they use

Experiments
 Investigate the behavior of the modified Pegasus WMS implementation of blast2cap3 when the workflow is composed of 30, 110, 210, 610, 1,010, and 2,010 tasks
 Run each workflow multiple times on the different execution platforms in order to capture differences in workflow performance as well as in resource availability over time

Experiments
 Compare the total workflow running time across the different execution platforms
 Examine the number of running versus idle jobs over time for each workflow

Experimental Data
 Diploid wheat Triticum urartu dataset from NCBI
 The assembled transcripts were generated with Velvet as a de novo assembler
 These transcripts were aligned against proteins from closely related organisms (Barley, Brachypodium, Rice, Maize, Sorghum, Arabidopsis)
 "transcripts.fasta": 404 MB, 236,529 assembled transcripts
 "alignments.out": 155 MB, 1,717,454 protein hits
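Varying the workflow size from 30 to 2,010 tasks implies partitioning the 236,529 transcripts' clusters across the CAP3 tasks. The sketch below shows one straightforward round-robin partitioning; it is illustrative only and not claimed to be the authors' actual splitting scheme.

```python
def partition(items, n_tasks):
    """Round-robin partition of work items into n_tasks roughly equal
    groups, one group per parallel CAP3 task in the workflow."""
    groups = [[] for _ in range(n_tasks)]
    for i, item in enumerate(items):
        groups[i % n_tasks].append(item)
    return groups

# 236,529 transcripts spread over 2,000 parallel tasks (sizes differ by at most 1)
groups = partition(range(236529), 2000)
print(len(groups), len(groups[0]), len(groups[-1]))  # 2000 119 118
```

With more tasks each CAP3 invocation processes fewer clusters, so the workflow's critical path shrinks as long as enough execution slots are available.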

Comparing Running Time on Sandhills, OSG and Amazon EC2 for Workflows with Different Numbers of Tasks

Comparing the Number of Running Jobs versus the Number of Idle Jobs Over Time for Workflows with Different Task Counts


Cost Comparison of Different Execution Platforms
 The main and most important difference between the commercial cloud and the academic distributed resources is the cost
 Sandhills: generally free resources
 OSG: completely free resources
 Amazon EC2: complex pricing model; 50 m1.large spot instances at $0.04 per hour, for a total of $122.84
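The hourly-billing model from the EC2 slide can be captured in a one-line cost function. The rate and duration below are illustrative placeholders (actual spot prices fluctuate), and the round-up-to-full-hours rule reflects EC2's hourly billing at the time of the paper.

```python
import math

def spot_cost(n_instances, hourly_rate, hours):
    """EC2-style hourly billing: every instance pays for each
    started hour, so partial hours round up."""
    return n_instances * hourly_rate * math.ceil(hours)

# e.g. 50 m1.large spot instances at $0.04/hour for 3.5 hours of wall time
print(spot_cost(50, 0.04, 3.5))  # 8.0
```

This is why long-tailed workflows with a few straggler jobs can be disproportionately expensive on the cloud: a handful of idle instances still accrue full billed hours.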

Conclusion
 Using more than 100 tasks in a workflow significantly reduces the running time on all execution platforms
 The resource allocation on Sandhills and OSG is opportunistic, and resource availability changes over time
 The results are almost constant when Amazon EC2 is used
 No workflow failures were encountered on Sandhills and Amazon EC2

Conclusion
 The predictability of the Amazon EC2 resources leads to better workflow running times when the cloud is used as the execution platform
 For our blast2cap3 workflow, better running time and better usage of the allocated resources were achieved with Amazon EC2
 Given the Amazon EC2 cost, however, the academic distributed systems can be a good alternative

Acknowledgments
 University of Nebraska Holland Computing Center
 Open Science Grid