1 The Adoption of Cloud Technology within the LHC Experiments Laurence Field IT/SDC 17/10/2014

2 Cloud
 SaaS
 PaaS
 IaaS
 VMs on demand

3 High Level View

4 Areas
 Image Management
 Capacity Management
 Monitoring
 Accounting
 Pilot Job Framework
 Data Access and Networking
 Quota Management
 Supporting Services

5 Image Management
 Provides the job environment
   Software: CVMFS, pilot job
   Configuration: contextualization
 Balance pre- and post-instantiation operations
   Simplicity, complexity, data transfer, frequency of updates
 Images are transient
   No updates of running machines
   Destroy (gracefully) and create a new instance (see the sketch below)
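A minimal sketch of the "transient image" pattern mentioned above, assuming a hypothetical CloudProvider wrapper (list_instances, drain, delete and create_instance are illustrative names, not a real IaaS API): rather than patching a running VM, each instance is drained and replaced by a fresh one built from the new image.

```python
# Sketch only: CloudProvider and the instance methods are hypothetical stand-ins
# for whatever EC2- or OpenStack-style API a site actually exposes.

def roll_image(provider, old_image_id, new_image_id, flavour, user_data):
    """Replace running instances of an outdated image with fresh ones."""
    for instance in provider.list_instances(image=old_image_id):
        # Ask the VM to stop accepting new payloads and finish the current one.
        instance.drain()
        # Destroy gracefully rather than updating the running machine.
        instance.delete()
        # Start a replacement from the new image with the same contextualization.
        provider.create_instance(image=new_image_id,
                                 flavour=flavour,
                                 user_data=user_data)
```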

6 CernVM
 The OS is delivered via CVMFS
   Replicates a reference file system (Stratum 0) via HTTP
 Why?
   Because CVMFS is already a requirement
   Removes the overhead of distributed image management
   Version control is managed centrally
 CernVM as a common requirement
   Availability becomes an infrastructure issue
 Recipe to contextualize
   Responsibility of the VO
 The goal is to start a CernVM-based instance
   Which needs minimal contextualization (a sketch follows below)
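As an illustration only (the exact contextualization mechanism depends on the site and CernVM version; the user-data layout and the start_pilot.sh path below are assumptions, not the real recipe), the VO-supplied part can be as small as a boot script that probes the experiment's CVMFS repository and launches the pilot:

```python
# Hypothetical example: build a minimal user-data payload for a CernVM-based
# instance. The script contents and the provider call are illustrative only.

USER_DATA_TEMPLATE = """#!/bin/sh
# Check the experiment software repository is reachable and start the VO pilot.
cvmfs_config probe {repo}
exec /cvmfs/{repo}/pilot/start_pilot.sh --queue {queue}
"""

def make_user_data(repo="atlas.cern.ch", queue="cloud"):
    """Render the minimal contextualization handed to the IaaS at boot."""
    return USER_DATA_TEMPLATE.format(repo=repo, queue=queue)

# provider.create_instance(image="cernvm-3", flavour="m1.medium",
#                          user_data=make_user_data())
```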

7 Capacity Management
 Managing the VM life cycle isn't the focus
   It is about ensuring there is enough capacity (resources)
 Requires a specific component with some intelligence
   Do I need to start a VM, and if so where?
   Do I need to stop a VM, and if so where?
   Are the VMs that I started OK?
 Existing solutions focus on deploying applications in the cloud
   Different components, one cloud
   May manage load balancing and failover
 Is this a load balancing problem?
   One configuration, many places, enough instances?
 Developing our own solutions (a sketch of the decision loop follows below)
   Site centric: the VAC model
   VO centric
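A minimal sketch of the kind of intelligence such a component needs, assuming a hypothetical provider API and task-queue query (get_queued_jobs, list_instances, is_ok, drain and friends are illustrative names): compare demand with running capacity and start or stop instances accordingly.

```python
# Sketch of a VO-centric capacity manager loop. All APIs are hypothetical.

import time

def manage_capacity(provider, task_queue, image, flavour, user_data,
                    max_vms=100, poll_seconds=300):
    while True:
        demand = task_queue.get_queued_jobs()          # how much work is waiting
        instances = provider.list_instances(image=image)
        healthy = [vm for vm in instances if vm.is_ok()]

        # Replace instances that never became usable ("dark" resources).
        for vm in instances:
            if not vm.is_ok():
                vm.delete()

        # Start VMs while there is unmet demand and room under the quota.
        to_start = min(demand - len(healthy), max_vms - len(healthy))
        for _ in range(max(0, to_start)):
            provider.create_instance(image=image, flavour=flavour,
                                     user_data=user_data)

        # Drain idle VMs when there is no work left for them.
        if demand == 0:
            for vm in healthy:
                vm.drain()  # let the pilot finish, then the VM shuts itself down

        time.sleep(poll_seconds)
```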

8 Monitoring
 Fabric management is the responsibility of the VO
   Basic monitoring is required
 The objective is to triage the machines
   Invoke a restart operation if a machine is not OK
   Detection of the "not OK" state may be non-trivial
 Other metrics may be of interest
   Spotting dark resources: deployed but not usable
   Can help to identify issues in other systems
   Discovering inconsistent information through cross-checks (see the sketch below)
 Common for all VOs
   Pilot job monitoring is VO specific
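A sketch of the triage and cross-check idea, assuming hypothetical monitoring and provider interfaces (list_reports, last_heartbeat, list_instances and restart are illustrative): a VM that stops reporting is restarted, and instances billed by the provider but unknown to the monitoring are flagged as dark resources.

```python
# Hypothetical triage loop: heartbeat-based health check plus a cross-check
# between what the provider reports and what the monitoring actually sees.

import time

HEARTBEAT_TIMEOUT = 1800  # seconds without a report before a VM is "not OK"

def triage(provider, monitoring):
    now = time.time()
    monitored = {m.instance_id: m for m in monitoring.list_reports()}

    dark = []
    for vm in provider.list_instances():
        report = monitored.get(vm.id)
        if report is None:
            # Billed for, but never showed up in monitoring: a dark resource.
            dark.append(vm.id)
        elif now - report.last_heartbeat > HEARTBEAT_TIMEOUT:
            # Detected as not OK; triage by restarting the instance.
            vm.restart()
    return dark
```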

9 Provider Accounting
 Helix Nebula
   Pathfinder project: development and exploitation of a cloud computing infrastructure
   Divided into supply and demand
   Three flagship applications: CERN (ATLAS simulation), EMBL, ESA
 "FW: New Invoice! Can you please confirm that these are legit?"
   Need a method to record usage to cross-check invoices
   Dark resources: billed for x machines but not delivered (controllable)

10 Consumer-Side Accounting
 Monitor resource usage
   Coarse granularity is acceptable; no need to measure precisely
   What, where and when for resources
   Basic infrastructure level: VM instances and whatever else is billed for
 Report generation (a sketch follows below)
   Mirror invoices: use the same metrics as charged for
 Needs a uniform approach
   Should work for all VOs
   Deliver the same information to the budget holder
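A minimal sketch of such consumer-side records, with an assumed record layout (provider, flavour, start/stop times) chosen only to mirror how providers typically charge; this is not a real WLCG schema.

```python
# Illustrative consumer-side accounting: record coarse "what, where, when"
# per VM instance and aggregate it with the same metrics the invoice uses.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class UsageRecord:
    provider: str    # where
    flavour: str     # what
    start: float     # when (epoch seconds)
    stop: float

def report(records):
    """Wall-clock hours per (provider, flavour), mirroring a typical invoice."""
    hours = defaultdict(float)
    for r in records:
        hours[(r.provider, r.flavour)] += (r.stop - r.start) / 3600.0
    return dict(hours)
```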

11 Cloud Accounting in WLCG
 Sites are the suppliers
   Site accounting generates the invoices for resources used
 Need to monitor the resource usage on the consumer side
   We trust sites and hence their invoices
   Comparison can nevertheless detect issues and inefficiencies (see the sketch below)
 Job activities are in the domain of the VO
   Measurement of work done, i.e. value for money
   This information is not included in cloud accounting
   Need a common approach to provide it: a dashboard?
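To make the cross-check concrete, a hedged sketch (continuing the hypothetical report() above): compare the hours on a site's invoice with the consumer-side totals and flag large discrepancies.

```python
# Illustrative cross-check between supplier invoices and consumer-side records.

def cross_check(invoice_hours, consumer_hours, tolerance=0.05):
    """Return (provider, flavour) entries where invoiced and measured hours
    disagree by more than the given fractional tolerance."""
    discrepancies = {}
    for key, billed in invoice_hours.items():
        measured = consumer_hours.get(key, 0.0)
        if billed and abs(billed - measured) / billed > tolerance:
            discrepancies[key] = (billed, measured)
    return discrepancies
```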

12 Comparison to Grid
 Grid accounting = supply-side accounting
 In the cloud there is no CE or batch system
   Different information source
   No per-job information available
 Only concerned with resources used
   Mainly time-based, per flavour
   A flavour is a specific composition of CPU, memory and disk

13 Core Metrics
 Time: billed by time per flavour
 Capacity: how many instances were used?
 Power: performance will differ by flavour, and potentially over time
 Total computing done = power x capacity x time
 Efficiency: how much of that computing did something useful (a small worked example follows below)
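A small worked example of these metrics; the numbers and the arbitrary power unit are purely illustrative.

```python
# Illustrative core-metric arithmetic: total computing and efficiency.

def total_computing(power_per_vm, capacity, hours):
    """Total computing delivered = power x capacity x time."""
    return power_per_vm * capacity * hours

def efficiency(useful_computing, delivered_computing):
    """Fraction of the delivered computing that did something useful."""
    return useful_computing / delivered_computing

# e.g. 10 (arbitrary power units) x 50 VMs x 720 hours = 360000 unit-hours
delivered = total_computing(power_per_vm=10, capacity=50, hours=720)
print(delivered, efficiency(useful_computing=300000, delivered_computing=delivered))
```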

14 Measuring Computing
 Resources are provided by flavour
   How can we compare them?
   What benchmarking metrics, and how do we obtain them?
 Flavour = SLA
   SLA monitoring: how do we do this?
 Rating sites against the SLA
   Variance of delivered performance (a sketch follows below)
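One possible reading of "rating against the SLA", sketched with made-up numbers and an assumed list of measured power values per flavour (not the method actually in use): compare the mean delivered performance to the nominal value promised by the flavour and report the spread.

```python
# Illustrative SLA rating: mean delivered power vs nominal, plus its spread.

from statistics import mean, pstdev

def rate_flavour(measured_power, nominal_power):
    """Return (rating, spread) for a flavour: rating is the fraction of the
    nominal performance actually delivered on average."""
    avg = mean(measured_power)
    return avg / nominal_power, pstdev(measured_power)

# e.g. benchmark results from several instances of the same flavour
print(rate_flavour([9.2, 10.1, 8.7, 9.5], nominal_power=10.0))
```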

15 Data Mining Approach
 VOs already have metrics on completed jobs
   Can they be used to define relative performance? (see the sketch below)
 Avoids discussion on benchmarking
   And on how the benchmark compares to the actual job
 May work within the VO, but what about WLCG?
   A specification for procurement may not accept a VO-specific metric
 Approach currently being investigated
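A hedged sketch of what "relative performance from job metrics" could look like; the events-per-second metric and the reference-flavour normalisation are assumptions for illustration, not the approach actually adopted.

```python
# Illustrative data-mining approach: derive a relative performance factor per
# flavour from the throughput of completed jobs, normalised to a reference flavour.

from collections import defaultdict
from statistics import median

def relative_performance(jobs, reference_flavour):
    """jobs: iterable of (flavour, events_processed, cpu_seconds) tuples."""
    throughput = defaultdict(list)
    for flavour, events, cpu_seconds in jobs:
        throughput[flavour].append(events / cpu_seconds)
    ref = median(throughput[reference_flavour])
    # Median throughput per flavour relative to the reference flavour.
    return {f: median(v) / ref for f, v in throughput.items()}
```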

16 The Other Areas
 Data access and networking
   Have so far focused on non-data-intensive workloads
 Quota management
   Currently have fixed limits, leading to the partitioning of resources between VOs
   How can the sharing of resources be implemented?
 Supporting services
   What else is required? E.g. Squid caches at the provider
   How are these managed, and by whom?

17 Status of Adoption

18 ALICE
 Will gradually increase adoption
 Using CERN's Agile Infrastructure (AI) for release validation
   Dynamic provisioning of an elastic virtual cluster
 Offline simulation jobs on the HLT farm
   Nodes are set up as hypervisors for VMs
   VMs are suspended or even terminated when the farm is required for DAQ
 May enable more I/O-intensive jobs during LHC shutdown periods

19 ALICE
 CERN Analysis Facility
   Another cluster in CERN's AI, allowing elastic scaling
   Disk space needs are served directly by EOS
 Using CernVM via the WebAPI portal
   An upcoming outreach activity
   To be extended as a volunteer computing platform to support simulation activities

20 ATLAS
 HLT farm
   Virtualization is used to ensure isolation for tasks that require external network access
   OpenStack is used to provide an IaaS layer
 Distributed computing
   Existing Grid sites can also provide resources through their IaaS
   Cloud Scheduler is used for capacity management
   VAC is also used by some sites in the UK; also experimenting with VCycle
   VM instances are monitored via a Ganglia service
   Repurposed for consumer-side accounting

21
 Volunteer computing is supported through BOINC
   Uses a pre-loaded CernVM image
   Run on the volunteer's machine using VirtualBox
   Job files are injected via a shared directory
 An ARC-CE is used as a gateway interface between PanDA and the BOINC server
   PanDA submits the job to the ARC-CE as normal
   The ARC-CE uses a specific BOINC backend plugin

22

23 CMS
 HLT farm
   Focus of the majority of the work with respect to cloud adoption
   Overlaying IaaS provisioning based on OpenStack
   Open vSwitch to virtualize the network
   CMSoooooCloud: CMS Openstack, OpenSwitch-ed, Opportunistic, Overlay, Online-cluster Cloud
   GlideinWMS v3 is used to manage job submission
 CERN's Agile Infrastructure
   Two different projects based at CERN and Wigner
   Plan to consolidate the resources into a single Tier-0 resource
 Volunteer computing
   A prototype has recently been developed
   Further development is required for production

24 Prototype (diagram)
 Components: BOINC plugin, BOINC server, ActiveMQ server, web server, and a BOINC client running CernVM with the CMS Job Agent
 Job and data flow: the plugin pushes the job description and job files; the BOINC client pulls and starts CernVM, which then pulls the job description and job files

25 LHCb
 CernVM is used for the VM image
   Monitoring is done via Ganglia
 The pilot system is the same as for bare-metal worker-node execution
   A pilot is injected into the VM via CVMFS
   The DIRAC job agent is started; it contacts the central task queue and retrieves a matching payload for execution
 VAC for hypervisor-only sites
   Used in production on several WLCG sites, mainly T2 sites for simulation payloads
 VCycle for IaaS-controlled sites
   Currently using the CERN AI; planned to be expanded to more sites
 Developing a volunteer computing project using BOINC
 Not using virtualization on the HLT farm
   Offline workloads run directly on the physical machines

26 Summary
 It is all about starting a CernVM image and running a job agent
   Similar to a pilot job
 Different options are being explored
   To manage the VM life cycle
   To deliver elastic capacity
 A great deal of commonality already exists between the VOs
   It should be exploited and built upon to provide common solutions
 How are resources going to be accounted for?
   Counting cores is easy; normalizing for power is hard
   Investigating the use of job metrics to discover relative performance
 Many open questions remain