
User-guided provisioning in federated clouds for distributed calculations
A. J. Rubio-Montero 1, E. Huedo 2, R. Mayo-García 1
1 CIEMAT – 2 UCM
ARMS-CC Workshop (as part of PODC 2015) – San Sebastián – July 20th, 2015

Index
– The problem
– The followed approach
– Results obtained
– Conclusions

Typical approach with commercial providers (AWS)
Focused on managing customised environments equivalent to the existing ones within the institution
Elasticity for HTC:
– Growing the data-centre capacity on demand: clusters
– Consolidating services: scientific portals
But, how many providers will we count on?
– Current EGI FedCloud: 40 sites
– EGI grid: 573 CEs (126 CEs allowed for the biomed VO)
A single cloud provider is not enough

The Provisioning Problem: the inter-cloud approach
Interoperation issues:
– Multiple interfaces / protocols
– Different authentication, authorisation and payment mechanisms and accounts
– Monitoring and control
Limited characterisation
Limited scheduling
Not scalable

The Provisioning Problem: Federations
Interoperability requirement | Enables | EGI FedCloud implementation
– Standardised interfaces | Interoperation | OCCI v1.1, CDMI
– Information Systems (IS) | Discovering and scheduling providers | LDAP GLUE schema, site- and top-BDIIs
– Authentication and Authorisation Infrastructure (AAI) | Global security and Virtual Organisations | X.509, CAs, VOMS
– Common monitoring, accounting and tracking systems | QoS control, SLA achievement | Nagios/SAM, APEL, GGUS, GOCDB…
Other tools designed for the cloud:
– AppDB Marketplace
– Science Gateways
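As an illustration of provider discovery through the IS, here is a minimal sketch of an LDAP query against a top-BDII using the GLUE2 schema; the BDII hostname is a placeholder and the exact interface-name value published by FedCloud sites is an assumption.

```python
# Minimal sketch: discover OCCI endpoints from a top-BDII (GLUE2 over LDAP).
# The BDII hostname and the InterfaceName value are assumptions.
import ldap  # python-ldap

conn = ldap.initialize("ldap://topbdii.example.org:2170")  # standard BDII port
conn.simple_bind_s()  # anonymous bind
entries = conn.search_s(
    "o=glue", ldap.SCOPE_SUBTREE,
    "(&(objectClass=GLUE2Endpoint)"
    "(GLUE2EndpointInterfaceName=eu.egi.cloud.vm-management.occi))",
    ["GLUE2EndpointURL"],
)
for dn, attrs in entries:
    print(attrs["GLUE2EndpointURL"][0].decode())
```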

Scheduling highly distributed HTC calculations
Compatible HTC applications use standardised interfaces/APIs for distributed computing:
– DRMAA, SAGA, OGSA-BES, OGSA-DAI (see the DRMAA sketch below)
Other legacy applications use interfaces derived from Grid computing:
– GRAM, GridFTP, SRM
Efficient scheduling should take into account the performance of the instantiated VMs and the location of the data when distributing jobs
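For example, an application coded against DRMAA can submit a bag of tasks without knowing where they will run. A minimal sketch with the drmaa Python bindings, assuming a DRMAA library (such as the one shipped with GridWay) is installed; the executable path and arguments are placeholders:

```python
# Minimal DRMAA sketch: submit a bag of 2,000 tasks and wait for them.
import drmaa

s = drmaa.Session()
s.initialize()
try:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/opt/app/run_task.sh"  # hypothetical task wrapper
    jt.args = ["input.dat"]                    # placeholder arguments
    # Submit 2,000 independent tasks as a bulk (array) job
    job_ids = s.runBulkJobs(jt, 1, 2000, 1)
    # Block until every task has finished
    s.synchronize(job_ids, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
    s.deleteJobTemplate(jt)
finally:
    s.exit()
```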

Scheduling highly distributed HTC calculations
Most of the few brokering frameworks working on IaaS cloud federations are oriented to consolidating complex services on demand:
– SlipStream, QBROKAGE, CompatibleOne
Exception: the PMES broker offers the OGSA-BES interface to allow instantiating a single VM for each job

Characterisation, overheads and reliability
As in Grid:
– GLUE (and OCCI) are incomplete, e.g. per-user quotas, QoS…
– ISs usually do not reflect reality
– Cloud providers can be overbooked
– Repeatedly instantiating VMs can be an unacceptable overhead
– Cloud providers unexpectedly fail
This hinders advanced scheduling

Use of pilot frameworks in clouds
– Concurrent use of clouds and Grids (and local clusters or desktop computing, if supported by the framework)
– Compatibility with applications previously adapted
– Preserving the achieved performance or speedup and the robustness of the system
– Keeping the expertise acquired from Grid, reducing the user training gap and the operational costs
– Complementary tools such as web UIs, monitoring tools, portals, etc.

Comparative of pilot systems
– LRMS-based: Elastic Clusters (Schlouder, PBS, SGE)
– Push-based: glideinWMS; application-oriented: BigJob, Grid Scheduler, ServiceSs, GWpilot
– Pull-based: LHC-oriented (DIRAC); Volunteer Computing (3G-Bridge)
Compared features: easy deployment, communication, customisable characterisation, legacy applications, automated scheduling, guided provisioning, Grid compatibility

GWcloud + GWpilot advantages
Compatibility and interoperability:
– Friendly interface and compatibility with legacy applications (DRMAA and OGSA-BES)
– Grid security standards and protocols are preserved to enable compatibility with external Grid services
– Independent and easy configuration, with overheads low enough to allow decentralised and local installations, even on the user's PC
– Minimal contextualisation, fully customisable by the user

GWcloud + GWpilot advantages
Scheduling features:
– Automatic discovery and monitoring of providers that belong to several cloud federations
– Scheduling capabilities based on constraints, ranking, fair-sharing, deadlines, etc., to instantiate VMs at providers with certain characteristics (see the sketch below):
  · a specific VM image (e.g. by its appdb.egi.eu identifier)
  · available hardware, with advance reservations
  · associated costs and budgets, QoS, etc.
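A toy illustration (not GWpilot source code) of how constraint filtering plus a rank expression could select the provider for the next VM; the provider records and field names are assumptions:

```python
# Toy sketch of constraint + ranking scheduling; records are assumptions.
providers = [
    {"id": "A", "os_tpl": "appdb-ubuntu-lts", "free_cores": 120, "cost": 0.0},
    {"id": "F", "os_tpl": "appdb-ubuntu-lts", "free_cores": 600, "cost": 0.0},
    {"id": "X", "os_tpl": "appdb-centos",     "free_cores": 900, "cost": 0.1},
]

def requirements(p):
    # Constraint: a specific appdb.egi.eu image and some spare capacity
    return p["os_tpl"] == "appdb-ubuntu-lts" and p["free_cores"] >= 2

def rank(p):
    # Higher is better: prefer free capacity, penalise cost
    return p["free_cores"] - 1000 * p["cost"]

suitable = [p for p in providers if requirements(p)]
best = max(suitable, key=rank)
print(best["id"])  # provider chosen for the next VM instantiation -> "F"
```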

GWcloud + GWpilot advantages
Scheduling features:
– Personalised user-level scheduling of tasks and VMs that allows:
  · post-configuration of VMs on demand
  · customised monitoring of new configurations
  · personalised provisioning
  · efficient execution of very short jobs
– Stackable with other scheduling tools
– Parallel accounting of associated costs

Architecture (diagram): legacy applications launch tasks through DRMAA or the GridWay CLI; the GridWay core schedules them; the GWpilot Factory starts pilots, which pull tasks from the GWpilot Server over HTTPS; the GWcloud Execution Driver instantiates VMs at providers A and B through OCCI (or BES) and handles task troubleshooting; the GWcloud Information Driver queries the Top BDII through LDAP. Together these components form the IaaS federation abstraction over the federated IaaS cloud.

Cloud-Init contextualisation:
a) creation of a new user with sudo privileges;
b) creation of a file with the temporal user proxy;
c) inclusion of EUGridPMA repositories;
d) pre-installation of CA certificates and minimal grid tools (globus-url-copy);
e) execution of the pilot.

GWcloud Information Driver output:
– the URI contact endpoint, the protocol, hypervisor and VIM release, maximum number of cores, etc.;
– every os_tpl# and its appdb.egi.eu identifier, in a list of pairs included as new tags;
– every resource_tpl#, shown as a different queue with its own characterisation: number of cores, memory, etc.
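A minimal sketch of the Cloud-Init user data implied by steps a)-e); the user name, repository URL, package names and pilot endpoint are all assumptions:

```python
# Sketch of Cloud-Init user data for steps a)-e); names and URLs are assumptions.
USER_DATA = """#cloud-config
users:
  - name: gwpilot                       # a) new user with sudo privileges
    sudo: ALL=(ALL) NOPASSWD:ALL
write_files:
  - path: /home/gwpilot/proxy.pem       # b) file holding the temporal user proxy
    permissions: '0600'
    content: |
      {proxy}
runcmd:
  # c) include the EUGridPMA repositories (URL assumed)
  - echo 'deb http://repository.egi.eu/ ...' > /etc/apt/sources.list.d/egi.list
  # d) pre-install CA certificates and minimal grid tools (globus-url-copy)
  - apt-get update && apt-get install -y ca-policy-egi-core globus-gass-copy-progs
  # e) start the pilot (endpoint assumed)
  - sudo -u gwpilot /home/gwpilot/pilot.sh https://gwpilot-server.example.org:8443
"""

def build_user_data(proxy_pem: str) -> str:
    # Indent the proxy so it stays inside the YAML block scalar
    return USER_DATA.format(proxy=proxy_pem.replace("\n", "\n      "))
```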


Experiments: Suitable EGI FedCloud providers
A basic Ubuntu LTS image was selected from the appdb.egi.eu repository
– OCCI 1.1, encrypted endpoint
Misconfigured and unreliable sites were omitted (7)

Provider | OCCI endpoint | resource_tpl# | GB | Hyp. | Max. cores | VIM
A | https://carach5.ics.muni.cz:11443 | small | 2 | Xen | 960 | OpenNebula
B | https://controller.ceta-ciemat.es:8787 | m1-small | 2 | KVM | 224 | OpenStack
C | https://egi-cloud.pd.infn.it:8787 | m1-small | 2 | KVM | 96 | OpenStack
D | https://fc-one.i3m.upv.es:11443 | small | 1 | KVM | 16 | OpenNebula
E | https://nova2.ui.savba.sk:8787 | m1-small | 2 | KVM | 408 | OpenStack
F | https://prisma-cloud.ba.infn.it:8787 | small | 1 | KVM | 600 | OpenStack
G | https://stack-server-01.ct.infn.it:8787 | m1-small | 2 | KVM | 600 | OpenStack

Experiments: Scheduling configuration
Factory:
– Maximum of 200 pilots (running on VMs)
– Waits a maximum of 600 s for a pilot to start (creation, booting, contextualisation…)
Scheduling:
– The number of tasks and VMs managed during a scheduling cycle of 10 s was set to 20
– One VM per suitable provider in every cycle
– Resource banning is enabled: the only way to know the quotas established at providers
A sketch of the implied provisioning loop follows.
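This is a hedged sketch of the provisioning loop these parameters suggest; the Provider class and its methods are hypothetical placeholders, not GWpilot's actual interfaces:

```python
# Sketch of the provisioning loop implied by the configuration above.
# Provider.instantiate_vm() and Provider.banned are hypothetical placeholders.
import time

MAX_PILOTS = 200      # maximum pilots, one per VM
PILOT_TIMEOUT = 600   # s to create, boot and contextualise a VM
CYCLE = 10            # s, scheduling cycle
CHUNK = 20            # tasks/VMs managed per cycle

def provisioning_loop(providers, pending_tasks, pilots):
    while pending_tasks():
        started = 0
        for provider in providers:
            if started >= CHUNK or len(pilots) >= MAX_PILOTS:
                break
            if provider.banned:   # banning is how provider quotas are discovered
                continue
            # One VM per suitable provider in every cycle
            pilots.append(provider.instantiate_vm(timeout=PILOT_TIMEOUT))
            started += 1
        time.sleep(CYCLE)
```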

Experiments: VMs provisioned ProviderTest 1Test 2Test 3 set upfailedset upfailedset upfailed A B C D E F G Unreliable (7) BEAMnrc – – 4·10 8 particles on a rectangular geometry – – The workload was divided into 2,000 tasks ( s)

Experiments: Test 1 (results figure)

Experiments: Test 2 (results figure)

Experiments: Test 3 (results figure)

Conclusions
– A generic framework for performing massive distributed calculations in federated clouds has been presented
– The system is able to perform dynamic provisioning based on the current status of the cloud federations while supporting legacy applications
– The incorporation of diverse complex algorithms devoted to specific workloads will be evaluated in the future:
  · stacking self-schedulers
  · provisioning based on economic criteria
  · QoS and budgets
  · inclusion of deadlines or the management of checkpoints

Thank you for your attention
CIEMAT – Avda. Complutense, 40 – Madrid –

Backup: architecture diagram – legacy applications launch tasks via DRMAA or the GridWay CLI; pilots started by the GWpilot Factory pull tasks from the GWpilot Server over HTTPS; the GWcloud Execution Driver reaches providers A and B through OCCI or BES and handles task troubleshooting; the GWcloud Information Driver queries the Top BDII via LDAP; the GridWay scheduler ties them together as the IaaS federation abstraction over the federated IaaS cloud.

Backup: direct execution variant (diagram) – the GWcloud Execution Driver runs tasks or remote BoTs directly on the provisioned VMs, without pilots; only suitable for long jobs or BoTs.

GWcloud Information Driver
– The URI contact endpoint, the protocol, hypervisor and VIM release, the maximum number of available cores, etc.
– Every OS template name (the os_tpl) and its appdb.egi.eu image identifier are compiled in a list of pairs and included as new tags.
– Every resource template (resource_tpl) is shown as a different queue, with its own characterisation: number of cores, memory, etc.
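A sketch, under assumed field names, of the host record such a driver could hand to the GridWay core for provider A from the table above; the os_tpl UUID and appdb identifier are hypothetical:

```python
# Assumed structure of one provider record published by the Information Driver.
provider_record = {
    "HOSTNAME": "https://carach5.ics.muni.cz:11443",  # URI contact endpoint
    "PROTOCOL": "OCCI 1.1",
    "HYPERVISOR": "Xen",
    "VIM": "OpenNebula",
    "MAX_CORES": 960,
    # every os_tpl paired with its appdb.egi.eu identifier, exposed as tags
    "TAGS": {"os_tpl#uuid_basic_ubuntu_lts": "appdb.egi.eu:ubuntu-lts"},
    # every resource_tpl published as a queue with its own characterisation
    "QUEUES": [{"name": "resource_tpl#small", "cores": 1, "memory_gb": 2}],
}
```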

GWcloud Execution Driver
1. It gets and stores the match, i.e. the description of the job and the URI of the OCCI service.
2. It interprets the job description to obtain the input, output and executable URIs, the os_tpl and the resource_tpl.
3. It creates the contextualisation file.
4. It builds and performs an OCCI create operation that includes the contextualisation file, the resource_tpl, the os_tpl and the URI of the provider. Subsequently, the job is considered to be in a PENDING state.
5. It waits for the VM to start in order to change the job state to ACTIVE. To check this periodically, it performs OCCI describe operations. If the VM does not start within the timeout set in the job description, the job is considered FAILED.
6. When the VM is running, the driver waits for the VM to become inactive; subsequently, the job is considered DONE. However, if any other VM condition is reached, it returns FAILED.
7. Finally, it deletes the VM.
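Putting the seven steps together, a hedged sketch of the driver's life cycle for one job; parse_job_description, build_context_file and the occi_* helpers are hypothetical stand-ins for the real OCCI client calls:

```python
# Sketch of the Execution Driver life cycle; all helpers are hypothetical.
import time

def run_job(match, timeout, poll=10):
    # 1-2. Get the match and interpret the job description
    job = parse_job_description(match)
    # 3. Create the contextualisation (Cloud-Init) file
    ctx = build_context_file(job)
    # 4. OCCI create with contextualisation, resource_tpl and os_tpl; job is PENDING
    vm = occi_create(match.endpoint, job.os_tpl, job.resource_tpl, context=ctx)
    deadline = time.time() + timeout
    # 5. Poll with OCCI describe until the VM is active; fail on timeout
    while occi_describe(vm) != "active":
        if time.time() > deadline:
            occi_delete(vm)
            return "FAILED"
        time.sleep(poll)
    # Job is now ACTIVE
    # 6. Wait until the VM becomes inactive (DONE); any other state is FAILED
    state = occi_describe(vm)
    while state == "active":
        time.sleep(poll)
        state = occi_describe(vm)
    result = "DONE" if state == "inactive" else "FAILED"
    # 7. Always delete the VM
    occi_delete(vm)
    return result
```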