Supporting Big Data Processing via Science Gateways
EGI CF 2015, November, Bari, Italy
Dr Tamas Kiss, CloudSME Project Director, University of Westminster, London, UK
Authors: Tamas Kiss, Shashank Gugnani, Gabor Terstyanszky, Peter Kacsuk, Carlos Blanco, Giuliano Castelli
Introduction: MapReduce and big data
MapReduce/Hadoop
* MapReduce: a framework for processing large datasets in parallel on thousands of nodes in a reliable and fault-tolerant manner
* Map: input data is divided into chunks and analysed on different nodes in parallel
* Reduce: collates the work and combines the results into a single value
* Monitoring, scheduling and re-executing failed tasks are the responsibility of the MapReduce framework
* Originally designed for bare-metal clusters; popularity in the cloud is growing
* Hadoop: open source implementation of the MapReduce framework introduced by Google in 2004
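The Map and Reduce roles described on this slide can be illustrated with a minimal, single-process sketch of WordCount. It is plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the grouping step that a real framework performs across nodes is simulated here in memory.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single value."""
    return key, sum(values)


def word_count(chunks):
    # On a cluster each chunk would be mapped on a different node;
    # here the chunks are simply processed one after another.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big", "data gateways"])
```

Scheduling, monitoring and re-execution of failed tasks are deliberately left out; in Hadoop those are handled by the framework rather than by the job author.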
Motivations
Motivation
* Many scientific applications (such as weather forecasting, DNA sequencing and molecular dynamics) have been parallelized using the MapReduce framework
* Installing and configuring a Hadoop cluster is well beyond the capabilities of domain scientists
Aim
* Integration of Hadoop with workflow systems and science gateways
* Automatic setup of Hadoop software and infrastructure
* Utilization of the power of cloud computing
Motivations: CloudSME project
* Aim: to develop a cloud-based simulation platform for manufacturing and engineering
* Funded by the European Commission FP7 programme, FoF: Factories of the Future
* July 2013 – March 2016
* EUR 4.5 million overall funding
* Coordinated by the University of Westminster
* 29 project partners from 8 European countries: 24 companies (all SMEs) and 5 academic/research institutions
* Spin-off company established: CloudSME UG
* One of the industrial use cases: data mining of aircraft maintenance data using MapReduce-based parallelisation
Approach
* Set up a disposable cluster in the cloud, execute the Hadoop job and destroy the cluster
* Cluster-related parameters and input files are provided by the user
* The workflow node executable is a program that sets up the Hadoop cluster, transfers files to and from the cluster and executes the Hadoop job
* Two methods are proposed:
  * Single Node Method
  * Three Node Method
Approach: Infrastructure-aware workflow
* Aim:
  * execute a MapReduce job on cloud resources
  * automatically set up and destroy the execution environment in the cloud
* Infrastructure-aware workflow:
  * the necessary execution environment is transparently set up before execution and destroyed after it
  * everything is carried out from the workflow without further user intervention
* Steps:
  1. the execution environment is created dynamically in the cloud
  2. the workflow tasks are executed
  3. the infrastructure is broken down, releasing the resources
Approach: Single node method
* Connect to the cloud and launch servers
* Connect to the master node server and set up the cluster configuration
* Transfer input files and the job executable to the master node
* Start the Hadoop job by running a script on the master node
* When the job is finished, delete the servers from the cloud and retrieve the output if the job was successful
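The single node method can be sketched as one function that owns the whole cluster lifecycle. This is an illustration only: `FakeCloud` is a hypothetical in-memory stand-in that merely logs calls (it is not the CloudBroker API), and `job` is a plain Python callable standing in for the script started on the master node.

```python
class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud API; it only logs calls."""

    def __init__(self):
        self.log = []
        self.servers = []

    def launch_servers(self, count):
        self.servers = [f"node-{i}" for i in range(count)]
        self.log.append(f"launch {count}")
        return self.servers

    def delete_servers(self):
        self.log.append(f"delete {len(self.servers)}")
        self.servers = []


def run_single_node_job(cloud, node_count, input_files, job):
    """Create a cluster, run one job on it and always tear the cluster down."""
    servers = cloud.launch_servers(node_count)
    master = servers[0]
    try:
        cloud.log.append(f"configure cluster via {master}")
        cloud.log.append(f"upload {len(input_files)} input file(s) to {master}")
        cloud.log.append(f"start job on {master}")
        return job(input_files)  # stands in for the script run on the master
    finally:
        # Assumption of this sketch: servers are released even if the job fails.
        cloud.delete_servers()


cloud = FakeCloud()
result = run_single_node_job(cloud, 5, ["a.txt"], lambda files: len(files))
```

Using `try`/`finally` is a design choice of the sketch, so that a failed job does not leave cloud resources running; the slides only state that the servers are deleted once the job is finished.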
Approach: Three node method
* Stage 1 (Deploy Hadoop node): launch servers in the cloud, connect to the master node, set up the Hadoop cluster and save the Hadoop cluster configuration
* Stage 2 (Execute node): upload input files and the job executable to the master node, execute the job and get the result back
* Stage 3 (Destroy Hadoop node): destroy the cluster to free up resources
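The three stages can be sketched as three independent functions that share nothing except the saved cluster configuration. Everything here is hypothetical: the function names, the JSON layout of the configuration and the node names are assumptions made for illustration, and no real servers are launched.

```python
import json


def deploy_node(node_count):
    """Stage 1: pretend to launch servers and return the saved cluster config."""
    config = {
        "master": "node-0",
        "workers": [f"node-{i}" for i in range(1, node_count)],
    }
    return json.dumps(config)  # the saved configuration is handed to later stages


def execute_node(config_json, input_files, job):
    """Stage 2: read the saved config and run one job on the existing cluster."""
    config = json.loads(config_json)
    return {"master": config["master"], "result": job(input_files)}


def destroy_node(config_json):
    """Stage 3: list every server named in the saved config as released."""
    config = json.loads(config_json)
    return [config["master"]] + config["workers"]


saved = deploy_node(5)
# Several execute stages can reuse the same cluster before it is destroyed.
runs = [execute_node(saved, files, len) for files in (["a"], ["b", "c"])]
released = destroy_node(saved)
```

Because the stages communicate only through the saved configuration, any number of execute stages can sit between one deploy stage and one destroy stage.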
Implementation: CloudBroker platform
Seamless access to heterogeneous cloud resources; high-level interoperability
[Architecture diagram, © CloudBroker GmbH, all rights reserved: end users, software vendors and resource providers access the CloudBroker Platform through user tools (web browser UI, REST web service API, Java client library, CLI); the platform hosts chemistry, biology, pharma, engineering and other applications and runs them on Eucalyptus, OpenNebula, OpenStack, Amazon and CloudSigma clouds]
Implementation: WS-PGRADE/gUSE
* General-purpose, workflow-oriented gateway framework
* Supports the development and execution of workflow-based applications
* Enables the multi-cloud and multi-grid execution of any workflow
* Supports the fast development of gateway instances by a customization technology
Implementation: WS-PGRADE/gUSE
* Each box describes a task
* Each arrow describes information flow, such as input files and output files
* A special node describes parameter sweeps
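The box-and-arrow model amounts to a directed acyclic graph in which a task may start only after the tasks feeding it have finished. The sketch below uses Python's standard `graphlib` module to derive a valid execution order; the task names are hypothetical and the dictionary is not WS-PGRADE's own workflow format.

```python
from graphlib import TopologicalSorter

# Each key is a task (a box); its set lists the tasks whose outputs
# it consumes (the incoming arrows).
workflow = {
    "deploy_hadoop": set(),
    "execute_job": {"deploy_hadoop"},
    "destroy_hadoop": {"execute_job"},
}

# static_order() yields every task after all of its predecessors.
order = list(TopologicalSorter(workflow).static_order())
```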
Implementation: SHIWA workflow repository
* Workflow repository to store directly executable workflows
* Supports various workflow systems, including WS-PGRADE, Taverna, Moteur and Galaxy
* Fully integrated with WS-PGRADE/gUSE
Implementation: Supported storage solutions
* Local (user’s machine):
  * Bottleneck for large files
  * Multiple file transfers: local machine – WS-PGRADE – CloudBroker – Bootstrap node – Master node – HDFS
* Swift:
  * Two file transfers: Swift – Master node – HDFS
* Amazon S3:
  * Direct transfer from S3 to HDFS using Hadoop’s distributed copy application
* Input/output locations can be mixed and matched in one workflow
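For the Amazon S3 case, the direct transfer corresponds to a single invocation of Hadoop's `distcp` tool with a source and a target URI. The sketch below only builds and prints that command line; the bucket name and HDFS path are hypothetical, and the URI scheme for S3 (`s3n` here) is an assumption that depends on the Hadoop version and its configuration.

```python
import shlex


def distcp_command(source_uri, target_uri):
    """Build the argument list for Hadoop's distributed copy tool."""
    return ["hadoop", "distcp", source_uri, target_uri]


# Hypothetical locations, used for illustration only.
cmd = distcp_command("s3n://example-bucket/input", "hdfs:///user/hadoop/input")
print(shlex.join(cmd))
```

On the master node the list could be passed to `subprocess.run(cmd, check=True)`; the sketch stops short of that so it can be read without a Hadoop installation.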
Experiments and results: Initial testbed
* CloudSME production gUSE (v3.6.6) portal
* Jobs submitted using the CloudSME CloudBroker platform
* All jobs submitted to the University of Westminster OpenStack cloud
* Hadoop v2.5.1 on Ubuntu Trusty servers
Experiments and results: Hadoop applications used for experiments
* WordCount: the standard Hadoop example
* Rule Based Classification: a classification algorithm adapted for MapReduce
* PrefixSpan: a MapReduce version of the popular sequential pattern mining algorithm
Experiments and results
* Single node method: the Hadoop cluster is created and destroyed multiple times
* Three node method: multiple Hadoop jobs run between a single pair of create and destroy nodes
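The difference between the two methods can be expressed as a simple cost model. This is a back-of-the-envelope sketch, not a measurement: it assumes the jobs run one after another, that setup, run and teardown times are the same for every job, and the numbers in the example are arbitrary placeholder time units rather than results from the experiments.

```python
def single_node_total(jobs, setup, run, teardown):
    """Every job pays for its own cluster creation and destruction."""
    return jobs * (setup + run + teardown)


def three_node_total(jobs, setup, run, teardown):
    """The cluster is created once, reused by every job, then destroyed once."""
    return setup + jobs * run + teardown


# Placeholder values in arbitrary time units, not measured data.
single = single_node_total(3, setup=10, run=5, teardown=2)
three = three_node_total(3, setup=10, run=5, teardown=2)
```

Under these assumptions the three node method saves `(jobs - 1) * (setup + teardown)` time units; when the single node jobs run in parallel on separate clusters, the comparison shifts from elapsed time to resource usage.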
Experiments and results: Single node method
* [Figure] 5 jobs on a 5-node cluster each, using the WS-PGRADE parameter sweep feature
* [Figure] Single Hadoop jobs on a 5-node cluster
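A parameter sweep runs the same job once per input dataset, with the runs proceeding side by side. The sketch below imitates this with a thread pool; `run_hadoop_job` is a hypothetical stand-in for one single node method run on its own cluster, and the datasets are placeholder strings rather than the inputs used in the experiments.

```python
from concurrent.futures import ThreadPoolExecutor


def run_hadoop_job(dataset):
    """Hypothetical stand-in for one complete run; it just counts the words."""
    return dataset, len(dataset.split())


# Five placeholder datasets, one per sweep instance.
datasets = ["a b c", "d e", "f", "g h i j", "k l m n o"]

# One worker per dataset, so all instances can run at the same time.
with ThreadPoolExecutor(max_workers=len(datasets)) as pool:
    results = dict(pool.map(run_hadoop_job, datasets))
```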
Conclusion
* The solution works for any Hadoop application
* The proposed approach is generic and can be used for any gateway environment and cloud
* The user can choose the appropriate method (Single or Three Node) according to the application
* The parameter sweep feature of WS-PGRADE can be used to run Hadoop jobs with multiple input datasets simultaneously
* The approach can be used for large-scale scientific simulations
* EGI Federated Cloud integration:
  * Already runs on some EGI FedCloud resources: SZTAKI, BIFI
  * WS-PGRADE is fully integrated with EGI FedCloud
  * CloudBroker does not currently support EGI FedCloud directly
Any questions?