EXTENDING SCIENTIFIC WORKFLOW SYSTEMS TO SUPPORT MAPREDUCE BASED APPLICATIONS IN THE CLOUD Shashank Gugnani Tamas Kiss.


OUTLINE
- Introduction and motivations
- Approach
- Previous work
- Implementation
- Experiments and results
- Conclusion

INTRODUCTION
Hadoop
- Open source implementation of the MapReduce framework introduced by Google in 2004
- MapReduce: processes large datasets in parallel on thousands of nodes in a reliable and fault-tolerant manner
- Map: input data is divided into chunks and analysed on different nodes in parallel
- Reduce: collates the work and combines the results into a single value
- Monitoring, scheduling and re-executing failed tasks are the responsibility of MapReduce
- Originally designed for bare-metal clusters; its popularity in the cloud is growing
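The map and reduce steps above can be illustrated with a minimal, single-process sketch of a word count. This is not Hadoop code: the function names are hypothetical, the "nodes" are simulated by plain list chunks, and the shuffle step is a simplified stand-in for what the framework does between the two phases.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(lines, n_chunks):
    """Divide the input into chunks, one per (simulated) node."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key before they are reduced."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, n_chunks=3):
    """Run the map, shuffle and reduce steps in sequence."""
    chunks = split_into_chunks(lines, n_chunks)
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))
```

In a real cluster each chunk would be mapped on a different node, and the framework would reschedule any task that fails.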

INTRODUCTION
Aim
- Integration of Hadoop with workflow systems and science gateways
- Automatic setup of Hadoop software and infrastructure
- Utilization of the power of cloud computing
Motivation
- Many scientific applications (such as weather forecasting, DNA sequencing and molecular dynamics) are parallelized using the MapReduce framework
- Installation and configuration of a Hadoop cluster is well beyond the capabilities of domain scientists

INTRODUCTION
Motivation: CloudSME project
- Aim: to develop a cloud-based simulation platform for manufacturing and engineering
- Funded by the European Commission FP7 programme, FoF: Factories of the Future
- July 2013 – December 2015
- EUR 4.5 million overall funding
- Coordinated by the University of Westminster
- 29 project partners from 8 European countries: 24 companies (SMEs) and 5 academic/research institutions
- Industrial use case: data mining of aircraft maintenance data using Hadoop-based parallelisation

INTRODUCTION

APPROACH
- Set up a disposable cluster in the cloud, execute the Hadoop job and destroy the cluster
- Cluster-related parameters and input files are provided by the user
- The workflow node executable is a program that sets up the Hadoop cluster, transfers files to and from the cluster and executes the Hadoop job
- Two methods proposed:
  - Single Node Method
  - Three Node Method

PREVIOUS WORK
- Hadoop portlet developed by BIFI within the SCI-BUS project
- Liferay-based portlet
- Submits Hadoop jobs to user-specified clusters in an OpenStack cloud
- User only needs to provide the job executable and cluster configuration
- Easy to set up and use
- Front end based on Ajax web services
- Back end based on Java
- Standalone portlet, no integration with WS-PGRADE workflows

PREVIOUS WORK
- Workflow integration could be achieved directly using the Hadoop portlet
- Bash script to submit, monitor and retrieve jobs from the portlet:
  (a) Submit job to the MapReduce portlet
  (b) Get the JobId of the submitted job from the portlet
  (c) Get job status and logs from the portlet periodically until the job is finished
  (d) Get the output of the job if the job is successful
- Requires an additional portlet to be installed on the gateway
- Adds communication overhead
[Diagram labels: Shell Script, Curl, Hadoop Portlet, OpenStack Java API, OpenStack Cloud]
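Steps (a) to (d) amount to a submit-and-poll loop. The slide describes a Bash script using curl; the sketch below shows the same control flow in Python instead. The three callables stand in for the portlet requests, and the status strings `"SUCCEEDED"` and `"FAILED"` are assumptions, since the portlet's actual interface is not given here.

```python
import time


def run_via_portlet(submit, get_status, get_output,
                    poll_seconds=30, sleep=time.sleep):
    """Submit a job, poll until it finishes, and fetch output on success.

    submit, get_status and get_output are hypothetical callables that wrap
    the portlet requests; they are not part of any real portlet API.
    """
    job_id = submit()                      # steps (a) and (b)
    while True:
        status = get_status(job_id)        # step (c): periodic status check
        if status in ("SUCCEEDED", "FAILED"):
            break
        sleep(poll_seconds)
    if status == "SUCCEEDED":
        return get_output(job_id)          # step (d)
    return None
```

Each polling round is an extra request to the portlet, which is one source of the communication overhead noted above.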

SINGLE NODE METHOD
Working:
1. Connect to the OpenStack cloud and launch servers
2. Connect to the master node server and set up the cluster configuration
3. Transfer input files and the job executable to the master node
4. Start the Hadoop job by running a script on the master node
5. When the job is finished, delete the servers from the OpenStack cloud and retrieve the output if the job is successful
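The sequence above can be sketched as a single function that owns the whole cluster lifecycle. Everything here is illustrative: `FakeCloud` and its method names are hypothetical stand-ins for a real cloud client. The sketch also makes two assumptions that the slide does not state: output is fetched before the servers are deleted, and the servers are deleted even if an earlier step raises an error.

```python
class FakeCloud:
    """In-memory stand-in for a cloud client; records the calls it receives."""

    def __init__(self, job_succeeds=True):
        self.job_succeeds = job_succeeds
        self.calls = []

    def launch_servers(self, count):
        self.calls.append("launch")
        return [f"server-{i}" for i in range(count)]

    def configure_cluster(self, master, servers):
        self.calls.append("configure")

    def upload(self, master, files):
        self.calls.append("upload")

    def run_job(self, master, executable):
        self.calls.append("run")
        return self.job_succeeds

    def download_output(self, master):
        self.calls.append("download")
        return "job-output"

    def delete_servers(self, servers):
        self.calls.append("delete")


def run_single_node(cloud, n_servers, input_files, job_executable):
    """Create a cluster, run one Hadoop job on it, then destroy the cluster."""
    servers = cloud.launch_servers(n_servers)
    try:
        master = servers[0]  # assumption: the first server acts as master
        cloud.configure_cluster(master, servers)
        cloud.upload(master, list(input_files) + [job_executable])
        succeeded = cloud.run_job(master, job_executable)
        return cloud.download_output(master) if succeeded else None
    finally:
        # Always free the cloud resources, whether or not the job succeeded.
        cloud.delete_servers(servers)
```

Because one workflow node does everything, every job pays the full cost of creating and destroying a cluster.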

SINGLE NODE METHOD

THREE NODE METHOD
Working:
- Stage 1 (Deploy Hadoop Node): launch servers in the OpenStack cloud, connect to the master node, set up the Hadoop cluster and save the Hadoop cluster configuration
- Stage 2 (Execute Node): upload input files and the job executable to the master node, execute the job and get the result back
- Stage 3 (Destroy Hadoop Node): destroy the cluster to free up resources
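The three stages can be sketched as separate functions that communicate only through the saved cluster configuration. The JSON file layout, the function names and the callables passed in are all assumptions made for illustration; the slides do not say how the configuration is stored or passed between workflow nodes.

```python
import json
import os


def deploy(launch_servers, n_servers, config_path):
    """Stage 1: launch the cluster and save its configuration to a file."""
    servers = launch_servers(n_servers)
    config = {"master": servers[0], "servers": servers}
    with open(config_path, "w") as fh:
        json.dump(config, fh)
    return config


def execute(run_job, config_path, job_executable, input_files):
    """Stage 2: read the saved configuration and run a job on the master."""
    with open(config_path) as fh:
        config = json.load(fh)
    return run_job(config["master"], job_executable, input_files)


def destroy(delete_servers, config_path):
    """Stage 3: delete the servers listed in the saved configuration."""
    with open(config_path) as fh:
        config = json.load(fh)
    delete_servers(config["servers"])
    os.remove(config_path)  # the configuration is no longer valid
```

Since `execute` only needs the configuration file, it can be called any number of times between one `deploy` and one `destroy`.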

THREE NODE METHOD

IMPLEMENTATION

IMPLEMENTATION – CLOUDBROKER PLATFORM
[Architecture diagram labels:]
- End users, software vendors, resource providers
- User tools: Web Browser UI*, Java Client Library*, CLI*, REST Web Service API*
- CloudBroker Platform*
- Applications: chemistry, biology, pharma, engineering, …
- Clouds: Amazon*, CloudSigma*, OpenStack*, OpenNebula*, Eucalyptus, …
Seamless access to heterogeneous cloud resources – high level interoperability

IMPLEMENTATION – WS-PGRADE/GUSE
- General purpose, workflow-oriented gateway framework
- Supports the development and execution of workflow-based applications
- Enables the multi-cloud and multi-grid execution of any workflow
- Supports the fast development of gateway instances by a customization technology

IMPLEMENTATION – SHIWA WORKFLOW REPOSITORY
- Workflow repository to store directly executable workflows
- Supports various workflow systems, including WS-PGRADE, Taverna, Moteur and Galaxy
- Fully integrated with WS-PGRADE/gUSE

IMPLEMENTATION – SUPPORTED STORAGE SOLUTIONS
- Local (user's machine):
  - Bottleneck for large files
  - Multiple file transfers: local machine – WS-PGRADE – CloudBroker – Bootstrap node – Master node – HDFS
- Swift:
  - Two file transfers: Swift – Master node – HDFS
- Amazon S3:
  - Direct transfer from S3 to HDFS using Hadoop's distributed copy application
- Input/output locations can be mixed and matched in one workflow
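A small sketch can make the difference between the three options concrete by counting the transfers in each chain listed above. It also builds the command line for Hadoop's distributed copy tool (`hadoop distcp <source> <destination>`). The dictionary keys, function names and any bucket or path passed in are illustrative, and the URI scheme to use for S3 depends on the Hadoop version and its configuration.

```python
# Transfer chains as listed on the slide, from source to HDFS.
TRANSFER_CHAINS = {
    "local": ["local machine", "WS-PGRADE", "CloudBroker",
              "Bootstrap node", "Master node", "HDFS"],
    "swift": ["Swift", "Master node", "HDFS"],
    "s3": ["Amazon S3", "HDFS"],
}


def transfer_count(storage):
    """Number of file transfers needed before the data reaches HDFS."""
    return len(TRANSFER_CHAINS[storage]) - 1


def distcp_command(source_uri, hdfs_path):
    """Argument list for a direct copy into HDFS with Hadoop's distcp tool."""
    return ["hadoop", "distcp", source_uri, hdfs_path]
```

The returned list could be passed to `subprocess.run` on the master node; the sketch stops short of running it because that requires a working Hadoop installation.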

EXPERIMENTS AND RESULTS
Testbed
- CloudSME production gUSE (v3.6.6) portal
- Jobs submitted using the CloudSME CloudBroker platform
- All jobs submitted to the University of Westminster OpenStack cloud
- Hadoop v2.5.1 on Ubuntu Trusty servers

EXPERIMENTS AND RESULTS
Hadoop applications used for experiments
- WordCount: the standard Hadoop example
- Rule Based Classification: a classification algorithm adapted for MapReduce
- Prefix Span: MapReduce version of the popular sequential pattern mining algorithm

EXPERIMENTS AND RESULTS
- Single node: Hadoop cluster created and destroyed multiple times
- Three node: multiple Hadoop jobs run between a single pair of create/destroy nodes
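The trade-off between the two methods can be expressed as a simple cost model. This is an illustrative sketch, not a result from the experiments: it assumes jobs run one after another, that every job takes the same time, and that cluster creation and destruction times are constant.

```python
def single_node_total(n_jobs, t_create, t_job, t_destroy):
    """Every job creates and destroys its own cluster."""
    return n_jobs * (t_create + t_job + t_destroy)


def three_node_total(n_jobs, t_create, t_job, t_destroy):
    """One cluster is created, reused by all jobs, then destroyed."""
    return t_create + n_jobs * t_job + t_destroy


def time_saved(n_jobs, t_create, t_destroy):
    """Setup and teardown time avoided by reusing the cluster."""
    return (n_jobs - 1) * (t_create + t_destroy)
```

Under these assumptions the two methods cost the same for a single job, and the saving from the Three Node Method grows linearly with the number of jobs that share the cluster.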

EXPERIMENTS AND RESULTS
[Chart: 5 jobs, each on a 5-node cluster, using the WS-PGRADE parameter sweep feature (Single Node Method)]
[Chart: single Hadoop jobs on a 5-node cluster (Single Node Method)]

CONCLUSION
- Solution works for any Hadoop application
- Proposed approach is generic and can be used for any gateway environment and cloud
- Users can choose the appropriate method (Single Node or Three Node) according to their application
- Parameter sweep feature of WS-PGRADE can be used to run Hadoop jobs with multiple input datasets simultaneously
- Can be used for large-scale scientific simulations