
1 EXTENDING SCIENTIFIC WORKFLOW SYSTEMS TO SUPPORT MAPREDUCE BASED APPLICATIONS IN THE CLOUD
Shashank Gugnani, Tamas Kiss

2 OUTLINE
- Introduction and motivation
- Approach
- Previous work
- Implementation
- Experiments and results
- Conclusion

3 INTRODUCTION
Hadoop
- Open-source implementation of the MapReduce framework introduced by Google in 2004
- MapReduce: processes large datasets in parallel on thousands of nodes in a reliable and fault-tolerant manner
- Map: input data is divided into chunks and analysed on different nodes in parallel
- Reduce: collates the work and combines the results into a single value
- Monitoring, scheduling and re-executing failed tasks are the responsibility of the MapReduce framework
- Originally for bare-metal clusters – popularity in the cloud is growing
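To make the map and reduce roles concrete, here is a minimal word-count sketch, assuming a plain shell pipeline on a single machine; Hadoop would run many mapper instances on input chunks in parallel and route equal keys to the same reducer. The file name input.txt is a placeholder.

#!/usr/bin/env bash
# Map: emit one "word<TAB>1" record per word of the input chunk.
map()    { tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t1" }'; }
# Reduce: sum the counts that arrive for each distinct word.
reduce() { awk -F '\t' '{ count[$1] += $2 } END { for (w in count) print w "\t" count[w] }'; }
# Locally, sort plays the role of Hadoop's shuffle phase (grouping equal keys).
map < input.txt | sort | reduce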

4 INTRODUCTION
Aim
- Integration of Hadoop with workflow systems and science gateways
- Automatic setup of the Hadoop software and infrastructure
- Utilization of the power of cloud computing
Motivation
- Many scientific applications (e.g. weather forecasting, DNA sequencing, molecular dynamics) are parallelized using the MapReduce framework
- Installation and configuration of a Hadoop cluster is well beyond the capabilities of domain scientists

5 INTRODUCTION
Motivation: the CloudSME project
- Develops a cloud-based simulation platform for manufacturing and engineering
- Funded by the European Commission FP7 programme, FoF: Factories of the Future
- July 2013 – December 2015; EUR 4.5 million overall funding
- Coordinated by the University of Westminster
- 29 project partners from 8 European countries: 24 companies (SMEs) and 5 academic/research institutions
- Industrial use case: data mining of aircraft maintenance data using Hadoop-based parallelisation

6 INTRODUCTION

7 APPROACH
- Set up a disposable cluster in the cloud, execute the Hadoop job and destroy the cluster
- Cluster-related parameters and input files are provided by the user
- The workflow node executable is a program that sets up the Hadoop cluster, transfers files to and from the cluster and executes the Hadoop job
- Two methods proposed: the Single Node Method and the Three Node Method

8 PREVIOUS WORK
Hadoop portlet developed by BIFI within the SCI-BUS project
- Liferay-based portlet
- Submits Hadoop jobs to user-specified clusters in an OpenStack cloud
- User only needs to provide the job executable and the cluster configuration
- Easy to set up and use
- Front end based on Ajax web services; back end based on Java
- Standalone portlet, no integration with WS-PGRADE workflows

9 PREVIOUS WORK
Workflow integration could be achieved directly through the Hadoop portlet, with a bash script that submits, monitors and retrieves jobs from the portlet (see the sketch below):
 (a) Submit the job to the MapReduce portlet
 (b) Get the JobId of the submitted job from the portlet
 (c) Get the job status and logs from the portlet periodically until the job is finished
 (d) Get the output of the job if it is successful
Drawbacks: requires an additional portlet to be installed on the gateway and adds communication overhead.
Components involved: shell script, curl, Hadoop portlet, OpenStack Java API, OpenStack cloud.
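A hedged sketch of such a script, assuming hypothetical REST-style endpoints on the portlet; the gateway URL, paths and status values are illustrative, not the portlet's documented interface.

#!/usr/bin/env bash
set -euo pipefail
PORTLET="https://gateway.example.org/hadoop-portlet"   # hypothetical base URL

# (a) Submit the job executable and cluster configuration; assume the
#     response body is the JobId (b).
JOB_ID=$(curl -sf -F "executable=@job.jar" -F "config=@cluster.conf" "$PORTLET/jobs")

# (c) Poll the status and collect logs until the job leaves the RUNNING state.
while true; do
  STATUS=$(curl -sf "$PORTLET/jobs/$JOB_ID/status")
  curl -sf "$PORTLET/jobs/$JOB_ID/logs" >> job.log
  if [ "$STATUS" != "RUNNING" ]; then break; fi
  sleep 30
done

# (d) Retrieve the output if the job finished successfully.
if [ "$STATUS" = "SUCCEEDED" ]; then
  curl -sf -o output.tar.gz "$PORTLET/jobs/$JOB_ID/output"
fi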

10 SINGLE NODE METHOD
Working (sketched below):
1. Connect to the OpenStack cloud and launch servers
2. Connect to the master node server and set up the cluster configuration
3. Transfer the input files and the job executable to the master node
4. Start the Hadoop job by running a script on the master node
5. When the job is finished, delete the servers from the OpenStack cloud and retrieve the output if the job was successful
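A sketch of these steps as a single workflow node executable, assuming the standard OpenStack command line client and two hypothetical helper scripts (setup-hadoop.sh, run-job.sh); image, flavor, key and user names are placeholders, and error handling with guaranteed cleanup is omitted for brevity.

#!/usr/bin/env bash
set -euo pipefail

# 1. Connect to the OpenStack cloud (credentials come from the environment)
#    and launch the master and worker servers.
openstack server create --image ubuntu-14.04 --flavor m1.medium \
  --key-name hadoop-key --wait hadoop-master
for i in 1 2 3 4; do
  openstack server create --image ubuntu-14.04 --flavor m1.medium \
    --key-name hadoop-key --wait "hadoop-worker-$i"
done
# Resolve the master node's IP address (field parsing simplified here).
MASTER=$(openstack server show hadoop-master -f value -c addresses | grep -oE '([0-9]+\.){3}[0-9]+' | head -1)

# 2. Connect to the master node and set up the cluster configuration.
ssh "ubuntu@$MASTER" 'bash -s' < setup-hadoop.sh

# 3. Transfer the input files and the job executable to the master node.
scp -r input job.jar "ubuntu@$MASTER:~/"

# 4. Start the Hadoop job by running a script on the master node.
ssh "ubuntu@$MASTER" ./run-job.sh

# 5. Retrieve the output, then delete the servers from the cloud.
scp -r "ubuntu@$MASTER:~/output" ./output
openstack server delete hadoop-master hadoop-worker-1 hadoop-worker-2 hadoop-worker-3 hadoop-worker-4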

11 SINGLE NODE METHOD

12 THREE NODE METHOD
Working (sketched below):
- Stage 1, Deploy Hadoop node: launch servers in the OpenStack cloud, connect to the master node, set up the Hadoop cluster and save the Hadoop cluster configuration
- Stage 2, Execute node: upload the input files and the job executable to the master node, execute the job and get the results back
- Stage 3, Destroy Hadoop node: destroy the cluster to free up resources
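A compact sketch of how the same steps split across the three workflow nodes; the stage scripts and the cluster.conf file passed between them are illustrative names, not part of the published implementation.

# Stage 1 (Deploy Hadoop node): launch servers, set up the Hadoop cluster
# and save its configuration (master address, credentials) for later stages.
./deploy-hadoop.sh > cluster.conf

# Stage 2 (Execute node): upload the input files and the job executable,
# run the job and fetch the results; this node can be repeated for several
# Hadoop jobs against the same cluster.
./execute-job.sh cluster.conf job.jar input/ output/

# Stage 3 (Destroy Hadoop node): destroy the cluster to free up resources.
./destroy-hadoop.sh cluster.conf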

13 THREE NODE METHOD

14 IMPLEMENTATION

15 IMPLEMENTATION – CLOUDBROKER PLATFORM
(Architecture diagram: end users, software vendors and resource providers reach engineering, chemistry, biology, pharma and other applications through user tools (web browser UI, REST web service API, CLI, Java client library) built on the CloudBroker Platform, which connects to Amazon, CloudSigma, OpenStack, OpenNebula and Eucalyptus clouds.)
Seamless access to heterogeneous cloud resources – high-level interoperability

16 IMPLEMENTATION – WS-PGRADE/GUSE
- General-purpose, workflow-oriented gateway framework
- Supports the development and execution of workflow-based applications
- Enables the multi-cloud and multi-grid execution of any workflow
- Supports the fast development of gateway instances by a customization technology

17 IMPLEMENTATION – SHIWA WORKFLOW REPOSITORY
- Workflow repository to store directly executable workflows
- Supports various workflow systems including WS-PGRADE, Taverna, Moteur, Galaxy, etc.
- Fully integrated with WS-PGRADE/gUSE

18 IMPLEMENTATION – SUPPORTED STORAGE SOLUTIONS
- Local (user's machine): bottleneck for large files; multiple file transfers: local machine – WS-PGRADE – CloudBroker – bootstrap node – master node – HDFS
- Swift: two file transfers: Swift – master node – HDFS
- Amazon S3: direct transfer from S3 to HDFS using Hadoop's distributed copy application
- Input/output locations can be mixed and matched in one workflow (see the sketch below)
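For illustration, this is roughly how the Swift and S3 cases look when run from the master node; bucket, container and path names are placeholders, and the S3 credentials are assumed to be configured in the Hadoop site configuration.

# Amazon S3: copy the input straight into HDFS with Hadoop's distributed copy
# (older Hadoop releases, including the 2.5.x line, use the s3n:// scheme
# rather than s3a://).
hadoop distcp s3a://my-bucket/input hdfs:///user/demo/input

# Swift: two transfers, from the object store to the master node and then
# from the master node into HDFS.
swift download my-container input.dat
hdfs dfs -put input.dat /user/demo/input/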

19 EXPERIMENTS AND RESULTS
Testbed
- CloudSME production gUSE (v3.6.6) portal
- Jobs submitted using the CloudSME CloudBroker Platform
- All jobs submitted to the University of Westminster OpenStack cloud
- Hadoop v2.5.1 on Ubuntu 14.04 (Trusty) servers

20 EXPERIMENTS AND RESULTS
Hadoop applications used for the experiments
- WordCount: the standard Hadoop example
- Rule-Based Classification: a classification algorithm adapted for MapReduce
- Prefix Span: a MapReduce version of the popular sequential pattern mining algorithm

21 EXPERIMENTS AND RESULTS
- Single Node Method: the Hadoop cluster is created and destroyed multiple times, once per job
- Three Node Method: multiple Hadoop jobs run between a single pair of create/destroy nodes

22 EXPERIMENTS AND RESULTS
(Chart: 5 jobs, each on a 5-node cluster, using the WS-PGRADE parameter sweep feature, Single Node Method)
(Chart: single Hadoop jobs on a 5-node cluster, Single Node Method)

23 CONCLUSION
- The solution works for any Hadoop application
- The proposed approach is generic and can be used with any gateway environment and cloud
- Users can choose the appropriate method (Single or Three Node) for their application
- The parameter sweep feature of WS-PGRADE can be used to run Hadoop jobs with multiple input datasets simultaneously
- Can be used for large-scale scientific simulations

