Flowbster: Dynamic creation of data pipelines in clouds
Peter Kacsuk, Jozsef Kovacs and Zoltan Farkas
MTA SZTAKI
kacsuk@sztaki.hu
Motivations for Flowbster
- Processing big data typically means executing a set of tasks on a large data set
- These tasks are typically executed in a certain order
- Such an ordering of tasks can be represented as a dataflow graph
- Such dataflow graphs can be executed by dataflow workflow systems
- The goal is to execute such dataflow workflow systems in clouds, using as many cloud resources as needed (on-demand resource usage)
- The name of this new workflow system is Flowbster
Flowbster versus job-oriented workflow systems
- Job-oriented workflow systems work based on service orchestration
- Nodes of a workflow represent jobs to be executed in the infrastructure
- There is a workflow enactor (orchestrator) that recognizes that a certain node/job can be executed and submits this job together with the required data
- The result data is typically transferred back to the enactor (or its storage)
- This execution mechanism is not optimal: it requires too much data transfer
Flowbster versus job-oriented workflow systems
- In Flowbster there is no enactor; it works based on service choreography
- Nodes of the workflow communicate the data directly among themselves
- Data is passed through the workflow as a data stream
- A node is activated and executes the assigned task when all of its input data has arrived
- There is no useless data transfer
- Nodes of Flowbster workflows are deployed in the cloud as VMs and they exist until all the input data sets are processed
- As a result, a Flowbster workflow works as a temporary virtual infrastructure deployed in the cloud
- Input data sets flow through this virtual infrastructure and, while they flow through it, they are processed by the nodes of the workflow
Concept of Flowbster
The goal of Flowbster is to enable:
- The quick deployment of the workflow as a pipeline infrastructure in the cloud
- Once the pipeline infrastructure is created in the cloud, it is activated and the data elements of the data set to be processed flow through the pipeline
- As the data set flows through the pipeline, its data elements are processed as defined by the Flowbster workflow
(Figure: the data set to be processed is taken from an EUDAT service, flows through the pipeline deployed as an EGI FedCloud service, and the processed data set is stored in an EUDAT service)
Structure of the Flowbster workflow system
- Goal: to create the Flowbster workflow in the cloud without any cloud knowledge
- Solution: to provide a layered concept where users with different expertise can enter the Flowbster stack at the level matching their expertise
- 4 layers:
  - Flowbster layers:
    - Graphical design layer
    - Application description layer
    - Workflow system layer
  - Occopus layer:
    - Cloud deployment and orchestration layer
Occopus Layer
- Occopus is a cloud orchestrator and manager tool
- It automatically deploys virtual infrastructures (like Flowbster workflows) in the cloud based on an Occopus descriptor that consists of:
  - Virtual infrastructure description:
    - Specifies the nodes (services) to be deployed and all cloud-independent attributes, e.g. input values for a service
    - Specifies the dependencies among the nodes, to decide the order of deployment
    - Specifies scaling-related attributes, like the min and max number of instances
  - Node definition:
    - Defines how to construct the node on a target cloud. This contains all cloud-dependent settings, e.g. image id, flavour, contextualization
- See detailed tutorials at the Occopus web page: http://occopus.lpds.sztaki.hu/tutorials
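For illustration, a minimal virtual infrastructure description for a two-node pipeline might look like the sketch below. This is an assumption-based sketch: the node names (GENERATOR, WORKER) and attribute values are invented here, and the exact keys should be checked against the Occopus tutorials.

    # Hypothetical Occopus virtual infrastructure description (illustrative only;
    # see http://occopus.lpds.sztaki.hu/tutorials for the authoritative format)
    user_id: someuser@example.org
    infra_name: flowbster_pipeline
    nodes:
      - &G
        name: GENERATOR        # cloud-independent attributes of the node
        type: generator_node   # refers to a separate, cloud-dependent node definition
        scaling:
          min: 1
          max: 1
      - &W
        name: WORKER
        type: worker_node
        scaling:
          min: 1               # minimum/maximum number of instances
          max: 5
    dependencies:
      - connection: [ *W, *G ] # WORKER depends on GENERATOR, so GENERATOR is deployed first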
Flowbster Workflow System Layer
- Contains uniform Flowbster workflow nodes, which have the internal structure shown in the figure (Receiver --inputs ready--> Executor --job finished--> Forwarder)
- Every node provides the following actions:
  - Receives and keeps track of the input items
  - Executes the (pre-)configured application when the inputs are ready
  - Identifies and forwards the results of the execution towards a (pre-)configured endpoint
- Contains 3 components:
  - Receiver: service to receive inputs
  - Executor: service to execute the predefined application
  - Forwarder: service to send the results of the finished application to a predefined remote location
- Also requires 2 config files in order to customize the node according to the workflow definition (see the sketch after this list):
  - sysconf: global settings, i.e. working directories, port, logging, etc.
  - appconf: application definition, i.e. exe, args, inputs, outputs, endpoints, etc.
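The two configuration files could look roughly like the sketch below. The field names only mirror the items listed above (working directory, port, logging; exe, args, inputs, outputs, endpoints) and are illustrative assumptions, not the exact Flowbster schema.

    # sysconf -- global node settings (field names are illustrative)
    working_directory: /var/flowbster/work
    receiver_port: 5000
    log_level: info

    # appconf -- application definition of the node (keys follow the Vina example
    # shown later in this talk; values are placeholders)
    app:
      exe:
        filename: process.run               # executable started by the Executor
        tgzurl: http://foo.bar/process.tgz  # archive containing the executable
      args: ''
      in:
        - name: input.dat                   # the Receiver waits until all inputs arrive
      out:
        - name: result.dat                  # the Forwarder sends this to its endpoint
          targetnode: NEXT_NODE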
Connecting Flowbster nodes into a workflow
(Figure: NodeA's Forwarder sends NodeA.out1 to NodeB's Receiver and NodeA.out2 to NodeC's Receiver; each node consists of a Receiver, Executor and Forwarder plus its sysconf and appconf files)
- Flowbster workflow nodes work in a service choreography
- In appconf the list of outputs is defined
- For a certain output the endpoints are defined
- An endpoint must point to a receiver node (see the sketch below)
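As an example, the wiring in the figure above could be expressed in NodeA's appconf roughly as follows; the keys follow the Vina example shown later in this talk, and the node and port names are placeholders.

    # Sketch of the output list in NodeA's appconf (illustrative)
    out:
      - name: out1
        targetname: in    # name of the input port on the receiving node
        targetnode: NodeB # NodeA.out1 -> NodeB.in
      - name: out2
        targetname: in
        targetnode: NodeC # NodeA.out2 -> NodeC.in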
Flowbster Graphical Design Layer
Flowbster Application Description Layer
- Automatically generated from the graphical view
- It contains the Occopus descriptor of the Flowbster workflow:
  - Virtual infrastructure descriptor representing the workflow graph
  - Customized node definitions for each node of the workflow
- E.g. the Vina node (the scaling section defines its parallelism; it receives the ligands, config and receptor inputs from the GENERATOR node and sends its output to the COLLECTOR node; appconf holds the application definition, i.e. exe, args, inputs, outputs, endpoints, etc.):

    - &Vina
      name: Vina
      type: flowbster_node
      scaling:
        min: 5
        max: 5
      variables:
        jobflow:
          app:
            exe:
              filename: vina.run
              tgzurl: http://foo.bar/vina.tgz
            args: ''
            in:
              - name: ligands.zip
              - name: config.txt
              - name: receptor.pdbqt
            out:
              - name: output.tar
                targetname: output.tar
                targetnode: COLLECTOR
Feeding and gathering data set elements to be processed
(Figure: the data set to be processed enters the workflow through the Feeder; the Collector delivers the processed data set)
- Feeder: not part of Flowbster, should be written by the user
  - Command line tool
  - Feeds a given node/port of the Flowbster workflow with input data items
- Collector: not part of Flowbster, should be written by the user
  - Web service acting as a receiver
  - Transfers the incoming data items into the target storage
Exploitable parallelisms in Flowbster
- Parallel branch parallelism
- Pipeline parallelism
- Node scalability parallelism
Node scalability parallelism in Flowbster
Generator-Worker-Collector parameter sweep processing pattern:
- The Generator generates N output data items from 1 input data item
- The Worker should be executed for every input data item -> N Worker instances can run in parallel for processing the N data items
- The Collector collects the N results coming from the N Worker instances and, after processing them, creates 1 output data item
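In the application description this pattern is expressed by the scaling attributes of the Worker node, exactly as in the Vina example shown earlier (min = max = 5 there); the fragment below is a sketch reusing those keys with a placeholder node name.

    # Worker node scaled to N parallel instances (N = 8 here); the Generator and
    # Collector nodes keep min = max = 1
    - &WORKER
      name: WORKER
      type: flowbster_node
      scaling:
        min: 8
        max: 8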
Heterogeneous multi-cloud setup of Flowbster in EGI FedCloud
(Figure: Occopus reads the VI descriptor and, via OCCI, deploys the Generator on one FedCloud site (e.g. CESNET OpenNebula), the Worker on another (e.g. SZTAKI) and the Collector on a third (e.g. BIFI OpenStack))
- Occopus can utilise multiple clouds in a federation like EGI FedCloud
- Nodes of the deployable VI are instantiated on different FedCloud sites
- Connection is based on public IPs
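Binding the nodes to different sites is done in the node definitions; a rough, assumption-based sketch of such definitions for the OCCI resource handler is shown below. The endpoint URLs, template and image identifiers are placeholders, and the exact keys should be taken from the Occopus tutorials.

    # Hypothetical node definitions binding nodes to different EGI FedCloud sites
    'node_def:generator_node':
      -
        resource:
          type: occi                                   # OCCI resource handler
          endpoint: https://occi.cesnet.example:11443  # e.g. CESNET OpenNebula site
          os_tpl: 'os_tpl#placeholder_image'
          resource_tpl: 'resource_tpl#small'
    'node_def:worker_node':
      -
        resource:
          type: occi
          endpoint: https://occi.bifi.example:8787     # e.g. BIFI OpenStack site
          os_tpl: 'os_tpl#placeholder_image'
          resource_tpl: 'resource_tpl#small'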
Performance measurements
- Experiment with the Autodock Vina workflow
- Question: what speedup can be achieved by node scalability parallelism?
- 256 docking simulations with 2, 4, 8 and 16 instances of the Vina (worker) node
- Results table: Case (number of Vina node instances) / Makespan (minutes): 2 8 4 5 3 16
Current state of Occopus
- Open-source (License: Apache v2)
- 6 releases so far (latest in August 2016)
- Now: Release v1.2 (3rd production release)
- Python 2.7
- Base webpage: http://occopus.lpds.sztaki.hu
- Git: https://github.com/occopus
- Documentation:
  - Users' Guide
  - Developers' Guide
  - Tutorials (e.g. building a Docker/Swarm cluster)
- Package repository: http://pip.lpds.sztaki.hu/packages
Current state of Flowbster
- Open-source (License: Apache v2)
- Running prototype
- First release comes in October 2016
- Available at Git: https://github.com/occopus
- Documentation under development:
  - Users' Guide
  - Developers' Guide
  - Tutorials
- Further development plans:
  - Dynamic scalability for node scalability parallelism
  - Built-in error diagnostic and fault-recovery mechanism