Supporting Big Data Processing via Science Gateways EGI CF 2015, 10-13 November, Bari, Italy Dr Tamas Kiss, CloudSME Project Director University of Westminster,


Supporting Big Data Processing via Science Gateways
EGI CF 2015, 10-13 November, Bari, Italy
Dr Tamas Kiss, CloudSME Project Director, University of Westminster, London, UK
Authors: Tamas Kiss, Shashank Gugnani, Gabor Terstyanszky, Peter Kacsuk, Carlos Blanco, Giuliano Castelli

Introduction: MapReduce and big data

MapReduce/Hadoop
* MapReduce: processes large datasets in parallel on thousands of nodes in a reliable, fault-tolerant manner
* Map: input data is divided into chunks that are analysed on different nodes in parallel
* Reduce: collates the work and combines the results into a single value
* The MapReduce framework is responsible for monitoring, scheduling and re-executing failed tasks
* Originally designed for bare-metal clusters; popularity in the cloud is growing
* Hadoop: open-source implementation of the MapReduce framework introduced by Google in 2004
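The Map and Reduce roles described above can be illustrated with a minimal, single-process word count in plain Python. This is only a sketch of the programming model, not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative, and a real framework would distribute the chunks across nodes and handle failures.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently; here sequentially, on a cluster in parallel.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


chunks = ["big data big clusters", "data on clusters"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped without reference to any other chunk, the map calls could run on different nodes; only the grouping step needs to see all emitted pairs.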

Motivations

Motivation
* Many scientific applications (such as weather forecasting, DNA sequencing and molecular dynamics) have been parallelized using the MapReduce framework
* Installing and configuring a Hadoop cluster is well beyond the capabilities of domain scientists

Aim
* Integration of Hadoop with workflow systems and science gateways
* Automatic setup of Hadoop software and infrastructure
* Utilization of the power of cloud computing

Motivations: CloudSME project
* Aim: to develop a cloud-based simulation platform for manufacturing and engineering
* Funded by the European Commission FP7 programme, FoF: Factories of the Future
* July 2013 – March 2016
* EUR 4.5 million overall funding
* Coordinated by the University of Westminster
* 29 project partners from 8 European countries: 24 companies (all SMEs) and 5 academic/research institutions
* Spin-off company established: CloudSME UG
* One of the industrial use cases: data mining of aircraft maintenance data using MapReduce-based parallelisation

Approach
* Set up a disposable cluster in the cloud, execute the Hadoop job and destroy the cluster
* Cluster-related parameters and input files are provided by the user
* The workflow node executable is a program that sets up the Hadoop cluster, transfers files to and from the cluster and executes the Hadoop job
* Two methods are proposed:
  * Single Node Method
  * Three Node Method

Approach: Infrastructure-aware workflow
* Aim:
  * execute the MapReduce job on cloud resources
  * automatically set up and destroy the execution environment in the cloud
* Infrastructure-aware workflow:
  * the necessary execution environment is transparently set up before execution and destroyed after it
  * everything is carried out from the workflow without further user intervention
* Steps:
  1. the execution environment is created dynamically in the cloud
  2. the workflow tasks are executed
  3. the infrastructure is broken down, releasing the resources

Approach: Single node method
* Connect to the cloud and launch servers
* Connect to the master node server and set up the cluster configuration
* Transfer the input files and the job executable to the master node
* Start the Hadoop job by running a script on the master node
* When the job is finished, delete the servers from the cloud and retrieve the output if the job was successful
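The single node steps above can be sketched as one function that always tears the cluster down, even when the job fails. Everything here is hypothetical: `FakeCloud` is an in-memory stand-in that only records calls, not the CloudBroker or OpenStack API, and the sketch assumes the output is fetched from the master node before the servers are deleted.

```python
class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud API; it only records calls."""

    def __init__(self):
        self.log = []
        self.servers = []

    def launch_servers(self, count):
        self.servers = [f"node-{i}" for i in range(count)]
        self.log.append(f"launch {count}")
        return self.servers

    def delete_servers(self):
        self.log.append(f"delete {len(self.servers)}")
        self.servers = []


def run_single_node_method(cloud, cluster_size, run_job):
    """Create a disposable cluster, run one job on it, then destroy the cluster."""
    servers = cloud.launch_servers(cluster_size)
    master = servers[0]
    try:
        cloud.log.append(f"configure cluster via {master}")
        cloud.log.append(f"upload inputs and executable to {master}")
        succeeded = run_job(master)  # stands in for the script run on the master
        # Assumption: output is retrieved only for successful jobs.
        return f"output from {master}" if succeeded else None
    finally:
        # Resources are released whether the job succeeded, failed or raised.
        cloud.delete_servers()


cloud = FakeCloud()
result = run_single_node_method(cloud, 5, lambda master: True)
print(result)
print(cloud.log)
```

Putting the deletion in a `finally` block is one way to keep a failed job from leaving paid-for servers running.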

Approach: Three node method
* Stage 1 (Deploy Hadoop node): launch servers in the cloud, connect to the master node, set up the Hadoop cluster and save the Hadoop cluster configuration
* Stage 2 (Execute node): upload the input files and the job executable to the master node, execute the job and get the result back
* Stage 3 (Destroy Hadoop node): destroy the cluster to free up resources
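A rough sketch of how the three stages might hand over state: the deploy stage saves the cluster configuration, each execute stage reads it, and the destroy stage discards it. The JSON file, its fields and the function names are assumptions made for illustration; the slides do not specify how the configuration is stored, and no real servers are launched here.

```python
import json
import os
import tempfile


def deploy_hadoop(config_path, cluster_size):
    """Stage 1: launch servers (simulated) and save the cluster configuration."""
    config = {
        "master": "node-0",
        "workers": [f"node-{i}" for i in range(1, cluster_size)],
    }
    with open(config_path, "w") as handle:
        json.dump(config, handle)


def execute_job(config_path, job_name):
    """Stage 2: read the saved configuration and run one job (simulated)."""
    with open(config_path) as handle:
        config = json.load(handle)
    return f"{job_name} ran on {config['master']} with {len(config['workers'])} workers"


def destroy_hadoop(config_path):
    """Stage 3: release the cluster (simulated) and discard its configuration."""
    os.remove(config_path)


config_path = os.path.join(tempfile.mkdtemp(), "cluster.json")
deploy_hadoop(config_path, cluster_size=5)
# Several execute stages can reuse the same cluster between deploy and destroy.
results = [execute_job(config_path, name) for name in ("wordcount", "prefixspan")]
destroy_hadoop(config_path)
print(results)
```

Because the configuration outlives any single job, more than one execute stage can sit between one deploy and one destroy stage.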

Implementation

Implementation: CloudBroker platform
[Architecture diagram, © CloudBroker GmbH. All rights reserved. Labels recovered from the diagram, with asterisks as in the original:]
* Users: End Users, Software Vendors, Resource Providers
* Interfaces: Web Browser UI*, User Tools, Java Client Library*, CLI*, REST Web Service API*
* Platform: CloudBroker Platform*
* Applications: Chemistry, Biology, Pharma, Engineering, …
* Clouds: Amazon Cloud*, CloudSigma Cloud*, OpenStack Cloud*, OpenNebula Cloud*, Eucalyptus Cloud, …
* Seamless access to heterogeneous cloud resources – high-level interoperability

Implementation: WS-PGRADE/gUSE
* General-purpose, workflow-oriented gateway framework
* Supports the development and execution of workflow-based applications
* Enables the multi-cloud and multi-grid execution of any workflow
* Supports the fast development of gateway instances through a customization technology

Implementation: WS-PGRADE/gUSE
* Each box describes a task
* Each arrow describes information flow, such as input files and output files
* A special node describes parameter sweeps

Implementation: SHIWA workflow repository
* Workflow repository that stores directly executable workflows
* Supports various workflow systems, including WS-PGRADE, Taverna, Moteur, Galaxy etc.
* Fully integrated with WS-PGRADE/gUSE

Implementation: Supported storage solutions

Local (user’s machine):
* Bottleneck for large files
* Multiple file transfers: local machine – WS-PGRADE – CloudBroker – Bootstrap node – Master node – HDFS

Swift:
* Two file transfers: Swift – Master node – HDFS

Amazon S3:
* Direct transfer from S3 to HDFS using Hadoop’s distributed copy application

Input/output locations can be mixed and matched in one workflow
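For the S3 case, a direct transfer with Hadoop's distributed copy tool (DistCp) amounts to one `hadoop distcp <source> <destination>` invocation on the cluster. The helper below only builds that command line as a sketch. The bucket, key and HDFS directory are made-up examples, and the URI scheme is an assumption: which S3 connector is available (`s3n`, `s3a`) depends on the Hadoop version and its configuration, and credentials have to be configured separately.

```python
def distcp_command(bucket, key, hdfs_dir, scheme="s3n"):
    """Build the argument list for a DistCp copy from S3 into HDFS.

    The scheme is an assumption; use whichever S3 connector the cluster has.
    """
    source = f"{scheme}://{bucket}/{key}"
    return ["hadoop", "distcp", source, hdfs_dir]


# Hypothetical bucket and paths, for illustration only.
cmd = distcp_command("example-bucket", "input/data.txt", "hdfs:///user/hadoop/input")
print(" ".join(cmd))
```

On a real cluster the list could be passed to `subprocess.run(cmd, check=True)` on the master node; the copy itself then runs as a distributed job instead of passing through the gateway.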

Experiments and results: Initial testbed
* CloudSME production gUSE (v3.6.6) portal
* Jobs submitted using the CloudSME CloudBroker platform
* All jobs submitted to the University of Westminster OpenStack cloud
* Hadoop v2.5.1 on Ubuntu Trusty servers

Experiments and results: Hadoop applications used for experiments
* WordCount: the standard Hadoop example
* Rule Based Classification: a classification algorithm adapted for MapReduce
* Prefix Span: MapReduce version of the popular sequential pattern mining algorithm

Experiments and results
* Single node: the Hadoop cluster is created and destroyed multiple times
* Three node: multiple Hadoop jobs run between a single pair of create/destroy nodes
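The difference between the two methods can be made concrete with a small cost model. It is a simplification, not a result from the experiments: it assumes jobs run one after another and counts total cluster time. The timings (10 minutes to deploy, 4 to run a job, 1 to destroy) are invented for illustration and are not measured values from the testbed.

```python
def single_node_total(jobs, deploy, run, destroy):
    """Single node method: every job pays for its own create and destroy."""
    return jobs * (deploy + run + destroy)


def three_node_total(jobs, deploy, run, destroy):
    """Three node method: one create and one destroy are shared by all jobs."""
    return deploy + jobs * run + destroy


# Hypothetical timings in minutes; not measurements.
deploy, run, destroy = 10, 4, 1
for jobs in (1, 5):
    single = single_node_total(jobs, deploy, run, destroy)
    three = three_node_total(jobs, deploy, run, destroy)
    print(f"{jobs} job(s): single node = {single} min, three node = {three} min")
```

Under these assumptions the two methods cost the same for a single job, and the three node method pulls ahead as more jobs share one cluster. When single node jobs run in parallel on separate clusters, wall-clock time behaves differently from this total.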

Experiments and results
[Figure: Single node method, single Hadoop jobs on a 5-node cluster]
[Figure: Single node method, 5 jobs on a 5-node cluster each, using the WS-PGRADE parameter sweep feature]

Conclusion
* The solution works for any Hadoop application
* The proposed approach is generic and can be used for any gateway environment and cloud
* The user can choose the appropriate method (Single or Three Node) according to the application
* The parameter sweep feature of WS-PGRADE can be used to run Hadoop jobs with multiple input datasets simultaneously
* Can be used for large-scale scientific simulations
* EGI Federated Cloud integration:
  * Already runs on some EGI FedCloud resources: SZTAKI, BIFI
  * WS-PGRADE is fully integrated with EGI FedCloud
  * CloudBroker does not currently support EGI FedCloud directly

Any questions?