CERN Computing Infrastructure

Presentation transcript:

CERN Computing Infrastructure
Tim Bell, tim.bell@cern.ch
20/10/2016

Computing Infrastructure
Diverse computing services:
- Physics computing
- IT and experiment services
- Administrative computing
The target is standardised procedures and bulk purchasing.

THE CERN MEYRIN DATA CENTRE (http://goo.gl/maps/K5SoG)
Recording and analysing the data takes a lot of computing power. The CERN computer centre was built in the 1970s for mainframes and Crays. Now running at 3.5 MW, it houses 11,000 servers but is at the limit of its cooling and electrical power. It is also a tourist attraction, with over 80,000 visitors last year. As you can see, the racks are only partially filled because of the cooling limits.


Public Procurement Cycle

Step (time in days / elapsed days):
- User expresses requirement; market survey prepared: 15 / 15
- Market survey for possible vendors: 30 / 45
- Specifications prepared: 15 / 60
- Vendor responses: 30 / 90
- Test systems evaluated: 30 / 120
- Offers adjudicated: 10 / 130
- Finance committee: 30 / 160
- Hardware delivered: 90 / 250
- Burn in and acceptance: 30 days typical (380 worst case) / 280
- Total: 280+ days

However, CERN is a publicly funded body with strict purchasing rules to make sure that the contributions from our member states flow back to them: our hardware purchases should be distributed across the countries in proportion to their contributions. So we have a public procurement cycle that takes 280 days in the best case; we define the specifications six months before we actually have the hardware available, and that is the best case. Worst case, we find issues when the servers are delivered. We have had cases such as swapping out 7,000 disk drives, where you stop tracking by the drive and measure by the pallet of disks. With these constraints, we needed to find an approach that allows us to be flexible for the physicists while still being compliant with the rules.

O’Reilly Consideration
So how did we choose our tools? The technical requirements are a significant factor, but there is also the need to look at the community ecosystem. Open source on its own is not enough: our fragile legacy tools were open source but were lacking a community. A typical indicator is the O’Reilly books; once the O’Reilly book is out, the tool is worth a good look. Furthermore, it greatly helps to train new staff: you can buy them a copy and let them work through it to learn, rather than needing guru mentoring.

Job Trends Consideration
CERN staff are generally on short-term contracts of 2-5 years and come from all over the member states, often straight out of university or their first jobs. We look for potential rather than specific skills in the current tools. After their time at CERN, they leave with expert skills and experience in our tools, which is a great help in finding future job opportunities and keeps them motivated to the end of their contracts.

CERN Tool Chain (diagram)

Upstream
Upstream OpenStack on its own does not give you a cloud service; cloud is a service. Around the OpenStack APIs you still need: packaging, integration, burn in, builds, CI, upgrades, network design, config management, scale out, high availability, infra onboarding, capacity planning, metrics, monitoring and alerting, cloud monitoring, log processing, remediation, incident resolution, metering and chargeback, autoscaling, SLAs, net/info sec, customer support and user experience. (Source: eBay)
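
As a concrete illustration of the glue that has to be built around the bare OpenStack APIs, here is a minimal capacity-reporting sketch using the openstacksdk Python library. It is only a sketch: the cloud name "mycloud" is an invented placeholder for an entry in clouds.yaml, and real metering, chargeback and alerting services would build on data like this rather than a simple print-out.

    # Minimal capacity-reporting sketch around the Nova hypervisor API.
    # Assumes openstacksdk is installed and a cloud called "mycloud" is
    # defined in clouds.yaml -- both assumptions, not part of the slides.
    import openstack

    conn = openstack.connect(cloud="mycloud")

    total_vcpus = used_vcpus = 0
    total_ram_mb = used_ram_mb = 0

    # Walk the hypervisors and sum their capacity counters.
    for hv in conn.compute.hypervisors(details=True):
        total_vcpus += hv.vcpus or 0
        used_vcpus += hv.vcpus_used or 0
        total_ram_mb += hv.memory_size or 0
        used_ram_mb += hv.memory_used or 0

    print(f"vCPUs: {used_vcpus}/{total_vcpus} used")
    print(f"RAM:   {used_ram_mb}/{total_ram_mb} MB used")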

CERN OpenStack project milestones:
- July 2013: CERN OpenStack production service
- February 2014: CERN OpenStack Havana release
- October 2014: CERN OpenStack Icehouse release
- March 2015: CERN OpenStack Juno release
- September 2015: CERN OpenStack Kilo release
- August 2016: CERN OpenStack Liberty release

Upstream OpenStack releases and components ((*) = pilot):
- ESSEX (April 2012): Nova, Swift, Glance, Horizon, Keystone
- FOLSOM (September 2012): Nova, Swift, Glance, Horizon, Keystone, Quantum, Cinder
- GRIZZLY (April 2013): Nova, Swift, Glance, Horizon, Keystone, Quantum, Cinder, Ceilometer
- HAVANA (October 2013): Nova, Swift, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer, Heat
- ICEHOUSE (April 2014): Nova, Swift, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer, Heat
- JUNO (October 2014): Nova, Swift, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer, Heat, Rally
- KILO (April 2015): Nova, Swift, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer, Heat, Rally
- LIBERTY (October 2015): Nova, Swift, Glance, Horizon, Keystone, Neutron (*), Cinder, Ceilometer, Heat, Rally, Magnum (*), Barbican (*)
- MITAKA (April 2016): Nova, Swift, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer, Heat, Rally, Magnum, Barbican (*)

>190K cores in production under OpenStack
>90% of CERN compute resources are virtualised
>5,000 VMs migrated from old hardware in 2016
>100K cores to be added in the next 6 months

Having started in 2012, we went into production in 2013 on Grizzly and have since done online upgrades across six releases. We are currently running around 50 cells distributed across two data centres, with around 7,500 hypervisors. Automation is key: a single framework for managing compute resources with limited staffing, many of whom are only at CERN for a few years before returning to their home countries.
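
To make the "single framework" point concrete, the sketch below shows how a virtual machine can be requested programmatically through the OpenStack compute API with the openstacksdk Python library. The cloud, image, flavor and network names are invented placeholders for illustration, not values taken from the CERN setup.

    # Minimal VM-provisioning sketch against the OpenStack compute API.
    # The cloud name "mycloud" and the image/flavor/network names are
    # illustrative placeholders, not values from the CERN deployment.
    import openstack

    conn = openstack.connect(cloud="mycloud")

    image = conn.compute.find_image("CC7 - x86_64")      # hypothetical image name
    flavor = conn.compute.find_flavor("m2.medium")       # hypothetical flavor name
    network = conn.network.find_network("provider-net")  # hypothetical network name

    server = conn.compute.create_server(
        name="batch-worker-001",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )

    # Block until Nova reports the instance as ready.
    server = conn.compute.wait_for_server(server)
    print(server.name, server.status)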

Onwards the Federated Clouds
- Public cloud, such as Rackspace or IBM
- CERN Private Cloud: 150K cores
- ATLAS Trigger: 28K cores
- ALICE Trigger: 12K cores
- CMS Trigger
- INFN Italy
- Brookhaven National Labs
- NecTAR Australia
- Many others on their way

The trigger farms are the servers nearest the accelerator, which are not needed while the accelerator is shut down until 2015. Public clouds are interesting for burst load (such as in the run-up to a conference) or when prices drop, as on the spot market. Private clouds allow universities and other research labs to collaborate in processing the LHC data.
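
A practical consequence of federating around OpenStack is that the same client code can address the private cloud and compatible public or partner clouds. A minimal sketch, assuming two cloud entries (the names here are invented) configured in a local clouds.yaml:

    # Sketch: the same openstacksdk code can talk to several federated clouds.
    # The cloud names are placeholders assumed to exist in clouds.yaml.
    import openstack

    for cloud_name in ("cern-private", "public-provider"):
        conn = openstack.connect(cloud=cloud_name)
        servers = list(conn.compute.servers())
        print(f"{cloud_name}: {len(servers)} instances")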

Nova Cells
Top level cell runs:
- API service
- Top cell scheduler
Child cells run:
- Compute nodes
- Nova network
- Scheduler
- Conductor

Strategic Plan
- Establish multi-tenant, multi-provider cloud infrastructure
- Identify and adopt policies for trust, security and privacy
- Create governance structure
- Define funding schemes

Use cases:
- Support the computing capacity needs of the ATLAS experiment
- Set up a new service to simplify analysis of large genomes, for a deeper insight into evolution and biodiversity
- Create an Earth Observation platform, focusing on earthquake and volcano research
- Improve the speed and quality of research for finding surrogate biomarkers based on brain images

Plus additional users, adopters and suppliers.

Past, Ongoing and Future Commercial Activities at CERN
These activities support CERN’s scientific computing programme.
- HN - Helix Nebula: partnership between research organisations and European commercial cloud providers
* EC co-funded joint Pre-Commercial Procurement (PCP) project: https://indico.cern.ch/event/319753
** Other work has been conducted outside CERN, such as the Amazon pilot project at BNL for ATLAS

Containers on Clouds
- For the user: interactive, dynamic
- For IT (Magnum): secure, managed, integrated, scalable
- For industry and the EU: prepare for the next disruptive technology at scale

Managed means: project structure, a shared free resource pool, authorisation/identity, audit, resource lifecycle and a familiar end-user interface.

We have been running tests at scale, with 1,000 nodes delivering 7 million requests/second, for applications using OpenStack Magnum. A router firmware issue stopped us scaling further. We will work with the Magnum and Heat teams to make this even easier.
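
Once Magnum has provisioned a Kubernetes cluster (for example via the "openstack coe cluster create" command), the end-user interface stays familiar: the standard Kubernetes client works unchanged against it. A minimal sketch, assuming the official kubernetes Python client is installed and a kubeconfig for the cluster has already been fetched:

    # Sketch: driving a Magnum-built Kubernetes cluster with the standard
    # Kubernetes Python client. Assumes a kubeconfig is already in place
    # (e.g. fetched with the Magnum tooling) -- an assumption for illustration.
    from kubernetes import client, config

    config.load_kube_config()   # reads ~/.kube/config by default
    v1 = client.CoreV1Api()

    # List the worker nodes that Magnum/Heat created for us.
    for node in v1.list_node().items:
        print(node.metadata.name, node.status.node_info.kubelet_version)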

For Further Information
- Technical details: http://openstack-in-production.blogspot.fr
- Helix Nebula Initiative: http://www.helix-nebula.eu/
- Scientific Working Group: https://wiki.openstack.org/wiki/Scientific_working_group

Some history of scale…

Date / collaboration size / data volume and archive technology:
- Late 1950s: 2-3; kilobits, notebooks
- 1960s: 10-15; kB, punchcards
- 1970s: ~35; MB, tape
- 1980s: ~100; GB, tape, disk
- 1990s: ~750; TB, tape, disk
- 2010s: ~3000; PB, tape, disk

For comparison: in the 1990s the total LEP data set was a few TB and would fit on one tape today. Today, one year of LHC data is ~27 PB.

Innovation Dilemma
How can we avoid the sustainability trap?
- Define requirements
- No solution available that meets those requirements
- Develop our own new solution
- Accumulate technical debt

How can we learn from others and share?
- Find compatible open source communities
- Contribute back where there is missing functionality
- Stay mainstream

Are CERN computing needs really special?

We came up with a number of guiding principles. We took the approach that CERN was not special. Culturally, for a research organisation this is a big challenge; many continue to feel that our requirements would be best met by starting again from scratch, but with the modern requirements. In the past, we had extensive written procedures for sysadmins to execute, with lots of small tools to run. These were error prone, and people often did not read the latest version before performing the operation. We needed to find ways to scale the productivity of the team to match the additional servers, and one of the highest people costs was the tooling. We had previously been constructing requirements lists, with detailed must-have needs for acceptance. Instead, we asked ourselves how the other big centres could run using these open source tools when we claimed special requirements; often the root cause was that we did not understand the best way to use the tools, rather than that we were special. The maintenance load of our tools was high, and skilled, experienced staff were spending more and more of their time on custom code, so we took an approach of deploy rather than develop. This meant finding the open source tools that made sense for us and trying them out. Where we found something missing, we challenged it again and again; finally, we would develop, in collaboration with the community, generalised solutions that can be maintained by the community afterwards. Long-term forking is not sustainable.

CERN new data: 15 PB, 23 PB, 27 PB per year. CERN archive: >100 PB. [Chart: annual data volumes recorded at CERN]

OpenStack Collaborations
- Large Deployment Team: Walmart, Yahoo!, Rackspace, eBay, PayPal, …
- Containers: Rackspace, Red Hat
- OpenStack Scientific Working Group: not just academic; high performance and high throughput computing

The Worldwide LHC Computing Grid
- Nearly 170 sites in 40 countries
- ~350,000 cores
- 500 PB of storage
- >2 million jobs/day
- 10-100 Gb links

Tier-0 (CERN): data recording, reconstruction and distribution
Tier-1: permanent storage, re-processing, analysis
Tier-2: simulation, end-user analysis

LHC Data Growth
Expecting to record 400 PB/year by 2023 with the High Luminosity LHC upgrade. [Chart: PB per year, 2010-2023]

Where is the x3 improvement?
Compute growth needed: > x50, versus what we think is affordable unless we do something differently. [Chart: projected compute growth vs. affordable capacity]