1 Resource Provisioning Overview Laurence Field 12 April 2015.

Slides:



Advertisements
Similar presentations
1/16 Distributed Systems Architecture Research Group Universidad Complutense de Madrid An Introduction to Virtualization and Cloud Technologies to Support.
Advertisements

University of Notre Dame
System Center 2012 R2 Overview
1 User Analysis Workgroup Update  All four experiments gave input by mid December  ALICE by document and links  Very independent.
HEPiX Virtualisation Working Group Status, July 9 th 2010
VMware Virtualization Last Update Copyright Kenneth M. Chipps Ph.D.
“It’s going to take a month to get a proof of concept going.” “I know VMM, but don’t know how it works with SPF and the Portal” “I know Azure, but.
INTRODUCTION TO CLOUD COMPUTING CS 595 LECTURE 4.
1 Bridging Clouds with CernVM: ATLAS/PanDA example Wenjing Wu
MultiJob PanDA Pilot Oleynik Danila 28/05/2015. Overview Initial PanDA pilot concept & HPC Motivation PanDA Pilot workflow at nutshell MultiJob Pilot.
ProjectWise Virtualization Kevin Boland. What is Virtualization? Virtualization is a technique for deploying technologies. Virtualization creates a level.
Microsoft ® Application Virtualization 4.6 Infrastructure Planning and Design Published: September 2008 Updated: February 2010.
Bob Jones, CERN, IT department 4 October Why Procurement The activities within the Helix Nebula initiative have shown that public research organisations.
LHC Experiment Dashboard Main areas covered by the Experiment Dashboard: Data processing monitoring (job monitoring) Data transfer monitoring Site/service.
Pilots 2.0: DIRAC pilots for all the skies Federico Stagni, A.McNab, C.Luzzi, A.Tsaregorodtsev On behalf of the DIRAC consortium and the LHCb collaboration.
Assessment of Core Services provided to USLHC by OSG.
Mostafa Abdollahi Mazandaran University Of Science And Technology January 2011.
YAN, Tian On behalf of distributed computing group Institute of High Energy Physics (IHEP), CAS, China CHEP-2015, Apr th, OIST, Okinawa.
The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Dataset Caitlin Minteer & Kelly Clynes.
The Data Bridge Laurence Field IT/SDC 6 March 2015.
Large Scale Sky Computing Applications with Nimbus Pierre Riteau Université de Rennes 1, IRISA INRIA Rennes – Bretagne Atlantique Rennes, France
From Virtualization Management to Private Cloud with SCVMM 2012 Dan Stolts Sr. IT Pro Evangelist Microsoft Corporation
Wenjing Wu Andrej Filipčič David Cameron Eric Lancon Claire Adam Bourdarios & others.
Magellan: Experiences from a Science Cloud Lavanya Ramakrishnan.
Cloud Status Laurence Field IT/SDC 09/09/2014. Cloud Date Title 2 SaaS PaaS IaaS VMs on demand.
Towards a Global Service Registry for the World-Wide LHC Computing Grid Maria ALANDES, Laurence FIELD, Alessandro DI GIROLAMO CERN IT Department CHEP 2013.
1 The Adoption of Cloud Technology within the LHC Experiments Laurence Field IT/SDC 17/10/2014.
Virtualised Worker Nodes Where are we? What next? Tony Cass GDB /12/12.
Visual Studio Windows Azure Portal Rest APIs / PS Cmdlets US-North Central Region FC TOR PDU Servers TOR PDU Servers TOR PDU Servers TOR PDU.
Virtual Lab for e-Science Towards a new Science Paradigm.
Windows Azure Virtual Machines Anton Boyko. A Continuous Offering From Private to Public Cloud.
Enterprise Cloud Computing
6/23/2005 R. GARDNER OSG Baseline Services 1 OSG Baseline Services In my talk I’d like to discuss two questions:  What capabilities are we aiming for.
Report from the WLCG Operations and Tools TEG Maria Girone / CERN & Jeff Templon / NIKHEF WLCG Workshop, 19 th May 2012.
20409A 7: Installing and Configuring System Center 2012 R2 Virtual Machine Manager Module 7 Installing and Configuring System Center 2012 R2 Virtual.
MidVision Enables Clients to Rent IBM WebSphere for Development, Test, and Peak Production Workloads in the Cloud on Microsoft Azure MICROSOFT AZURE ISV.
Workload management, virtualisation, clouds & multicore Andrew Lahiff.
Cloud Computing Lecture 5-6 Muhammad Ahmad Jan.
Julia Andreeva on behalf of the MND section MND review.
D. Giordano Second Price Enquiry Update L. Field On behalf of the IT-SDC WLC Resource Integration Team.
Lecture III: Challenges for software engineering with the cloud CS 4593 Cloud-Oriented Big Data and Software Engineering.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Monitoring of the LHC Computing Activities Key Results from the Services.
MND review. Main directions of work  Development and support of the Experiment Dashboard Applications - Data management monitoring - Job processing monitoring.
Possibilities for joint procurement of commercial cloud services for WLCG WLCG Overview Board Bob Jones (CERN) 28 November 2014.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Operations Automation Team Kickoff Meeting.
Testing CernVM-FS scalability at RAL Tier1 Ian Collier RAL Tier1 Fabric Team WLCG GDB - September
1 Cloud Services Requirements and Challenges of Large International User Groups Laurence Field IT/SDC 2/12/2014.
Building Cloud Solutions Presenter Name Position or role Microsoft Azure.
CERN IT Department CH-1211 Geneva 23 Switzerland t ES 1 how to profit of the ATLAS HLT farm during the LS1 & after Sergio Ballestrero.
Ian Collier, STFC, Romain Wartel, CERN Maintaining Traceability in an Evolving Distributed Computing Environment Introduction Security.
Volunteer Clouds for the LHC experiments H. Riahi – 12/11/15 EGI User Forum Laurence Field Hassen Riahi CERN IT-SDC.
The Helix Nebula marketplace 13 May 2015 Bob Jones, CERN.
Amazon Web Services. Amazon Web Services (AWS) - robust, scalable and affordable infrastructure for cloud computing. This session is about:
CernVM and Volunteer Computing Ivan D Reid Brunel University London Laurence Field CERN.
Accounting Review Summary and action list from the (pre)GDB Julia Andreeva CERN-IT WLCG MB 19th April
New Paradigms: Clouds, Virtualization and Co.
Laurence Field IT/SDC Cloud Activity Coordination
Virtualization and Clouds ATLAS position
Dag Toppe Larsen UiB/CERN CERN,
Dag Toppe Larsen UiB/CERN CERN,
ATLAS Cloud Operations
StratusLab Final Periodic Review
StratusLab Final Periodic Review
Provisioning 160,000 cores with HEPCloud at SC17
WLCG experiments FedCloud through VAC/VCycle in the EGI
WLCG Collaboration Workshop;
20409A 7: Installing and Configuring System Center 2012 R2 Virtual Machine Manager Module 7 Installing and Configuring System Center 2012 R2 Virtual.
Outline Virtualization Cloud Computing Microsoft Azure Platform
Productive + Hybrid + Intelligent + Trusted
Nolan Leake Co-Founder, Cumulus Networks Paul Speciale
Presentation transcript:

1 Resource Provisioning Overview Laurence Field 12 April 2015

Overview Architectural Model – Focus Areas – Resource Groups Area Details Commercial Cloud Activities 2

From Pilot Jobs to Pilot VMs 3

Focus Areas Image Management Capacity Management Monitoring Accounting Pilot Job Framework Data Access and Networking Quota Management Supporting Services 4

Resource Groups WLCG Tier 0/1/2 HLT Farms Commercial Providers Volunteer Computing HPC Parameter space contains 160 possibilities – 5 resource groups, 8 areas, 4 VOs – Need consolidation of solutions The cloud paradigm – The adoption of generic solutions Reducing our total cost of ownership 5

Image Management Image provides the job environment – PilotJob Pilot VMs – Software CVMFS Contextualization – Balance pre- and post-instantiation operations Simple or Complex Frequency of Updates Data Transferred Transient – No updates of running machines Destroy (gracefully) and create new instance 6

CernVM The OS via CVMFS – HTTP replication of a reference file system Stratum 0 Why? – Because CVMFS is already a requirement Reduces the overhead of distributed image management – Manage version control centrally CernVM as a common requirement – Parameter space reduction 20 possibilities => 1 CernVM – 5 resource groups, 4 VOs Availability becomes an infrastructure issue – Potentially 20 different contextualizations Responsibility of the VO The goal is to start a CernVM-based instance – There are exceptions where this is not possible 7

Capacity Management Pilot Jobs => Pilot VMs Ensure there are enough resources (capacity) – Scale Up/Down Managing the VM life cycle isn’t the focus Requires a specific component with some intelligence – Do I need to start of VM and if so where? – Do I need to stop a VM and if so where? – Are the VMs that I started OK? Detection of the not ok state maybe non-trivial Existing solutions focus on deploying applications in the cloud – Difference components, one cloud – May not be appropriate for the resource group – May manage load balancing and failover Is this a load balancing problem? – One configuration, many places, enough instances? 8

Vacuum Model Pilot VMs appear spontaneously out of the vacuum – Robust generic approach Can choose whether or not to use the resource Pilot Job framework parameter space reduction – 20 possibilities => 4 approaches 1 per experiment Stimulating resources via framework – Optimization – Adds complexity – Requires additional communication No parameter space reduction Area for discussion 9

Monitoring Fabric management – The responsibility of the Capacity Manager – Basic monitoring is required The objective is to triage the machines – Invoke a restart operation if it not ok Detection of the not ok state maybe non-trivial Other metrics may be of interest Spotting dark resources – Deployed but not usable Can help to identify issues in other systems – Discovering inconsistent information through cross-checks A common requirement for all – Potential for 20 => 1 parameter space reduction Jobs monitoring is VO specific – Should be no difference with Grid jobs 10

Provider Accounting Commercial suppliers send invoices – They have their own (proprietary) accounting systems WLCG sites are suppliers – Cloud accounting generates monthly “invoices” We trust sites and hence their invoices Need to method to record usage – To cross-check invoices – Detect issues and inefficiencies No abstraction of job concept in IaaS layer – Job activities are in the domain of the VO A job is the VOs measurement of work done – i.e. value for money 11

Consumer Accounting Monitoring and recording resource usage – What, where, when for VM resources And whatever else is billed for – Course granularity acceptable Report generation – Mirror invoices Use same metrics as the invoice Needs a uniform approach for all the VOs – Require 20 => 1 parameter space reduction – Provide the same information to the budget holder Job information is not considered – But could potentially be repurposed Both monitoring and job information could be used – An area for discussion 12

Other Considerations Data access and networking – Have so far focus on non-data intensive workloads Quota Management – Currently mostly fixed limits Leading the partitioning of resources between VOs – How can the sharing of resources be implemented? Supporting Services – What else is required? – Eg squid caches in the provider – How are these managed and by who? Non-Virtualized approaches – Instantiation of a pilot job Without CE 13

Commercial Clouds Helix Nebula – A public-private partnership Between research organizations and IT industry Microsoft Azure Pilot – Preliminary discussions with CERN OpenLab Amazon – BNL RACF for ATLAS and CMS – With new Scientific Computing group at AWS Deutsche Börse Cloud Exchange AG – Beta testing platform – Will go live beginning of May PICSE – Procurement Innovation for Cloud Services in Europe European Science Cloud Pilot – Pre-Commercial Procurement (PCP) proposal Buyers group public organizations that are members of the WLCG collaboration 14

Next Steps Define and publish contextualizations – For each VO For the different resource groups WLCG Resource Accounting – No significant capacity from Tier 1/2 sites without it Commercial Resources – Have gained experience for a number of initiatives – Who pays? No significant capacity without a budget – Need to be ready to integrate commercial resources Capacity Management – Multiple solutions Continue to experiment – Consolidate 15

Summary Resources in many places – Not just traditional Tier 0/1/2 sites Common approaches – To avoid an explosions in our parameter space The vacuum model as the generic approach – Centered around contextualizing a CernVM To run a pilot job Have to consider specializations – Commercial clouds – Volunteer Computing – HPC But keep just the special bits special – And have common things everywhere else 16