Download presentation
Presentation is loading. Please wait.
2
CERN Computing Infrastructure
Tim Bell 20/10/2016 Tim Bell - CERN Computing Infrastructure
3
Computing Infrastructure
Diverse computing services Physics computing IT and Experiment Services Administrative Computing Target is for Standardised procedures Bulk purchasing 20/10/2016 Tim Bell - CERN Computing Infrastructure
4
THE CERN MEYRIN DATA CENTRE
Recording and analysing the data takes a lot of computing power. The CERN computer centre was built in the 1970s for mainframes and crays. Now running at 3.5MW of power, it houses 11,000 servers but is at the limit of cooling and electrical power. It is also a tourist attraction with over 80,000 visitors last year! As you can see, racks are only partially empty in view of the limits on cooling. 20/10/2016 Tim Bell - CERN Computing Infrastructure
5
Tim Bell - CERN Computing Infrastructure
20/10/2016 Tim Bell - CERN Computing Infrastructure
6
Public Procurement Cycle
Step Time (Days) Elapsed (Days) User expresses requirement Market Survey prepared 15 Market Survey for possible vendors 30 45 Specifications prepared 60 Vendor responses 90 Test systems evaluated 120 Offers adjudicated 10 130 Finance committee 160 Hardware delivered 250 Burn in and acceptance 30 days typical with 380 worst case 280 Total 280+ Days However, CERN is a publically funded body with strict purchasing rules to make sure that the contributions from our contributing countries are also provided back to the member states, our hardware purchases should be distributed to each of the countries in ratio of their contributions., So, we have a public procurement cycle that takes 280 days in the best case… we define the specifications 6 months before we actually have the h/w available and that is in the best case. Worst case, we find issues when the servers are delivered. We’ve had cases such as swapping out 7,000 disk drives where you stop tracking by the drive but measure it by the pallet of disks. With these constraints, we needed to find an approach that allows us to be flexible for the physicists while still being compliant with the rules. 20/10/2016 Tim Bell - CERN Computing Infrastructure
7
O’Reilly Consideration
So how did we choose our tools ? There were the technical requirements are a significant factor but there is also the need to look at the community ecosystem. Open source on its own is not enough.. Our fragile legacy tools were open source but were lacking a community. Typical example of this is the O’Reilly books.. Once the O’Reilly book is out, the tool is worth a good look. Furthermore, it greatly helps to train new staff… you can buy them a copy and let them work it through to learn rather than needing to be guru mentored. 20/10/2016 Tim Bell - CERN Computing Infrastructure
8
Job Trends Consideration
CERN staff are generally on short term contracts, 2-5 years and come from all over the member states. They come to CERN, often out of university or their 1st jobs. We look for potential rather than specific skills in the current tools. After a time at CERN, they leave with expert skills and experience in our tools which is a great help for finding future job opportunities and ensuring motivation to the end of their contracts. 20/10/2016 Tim Bell - CERN Computing Infrastructure
9
Tim Bell - CERN Computing Infrastructure
CERN Tool Chain 20/10/2016 Tim Bell - CERN Computing Infrastructure
10
Upstream OpenStack on its own does not give you a cloud service
Metering and chargeback Autoscaling Upstream OpenStack on its own does not give you a cloud service Packaging Integration Burn In SLA Monitoring … Monitoring and alerting Remediation Cloud is a service! High availability Config management Scale out Log processing Infra onboarding Capacity planning OpenStack APIs Metrics CI Network design Upgrades Cloud monitoring Builds Net/info sec SLA Alerting Incident resolution Customer support User experience Source: eBay 20/10/2016 Tim Bell - CERN Computing Infrastructure
11
CERN OpenStack Project
July 2013 CERN OpenStack Production Service February 2014 CERN OpenStack Havana Release October 2014 CERN OpenStack Icehouse Release March2015 CERN OpenStack Juno Release September 2015 CERN OpenStack Kilo Release August 2016 CERN OpenStack Liberty Release April 2012 September 2012 April 2013 October 2013 April 2014 October 2014 April 2015 October 2015 April 2016 ESSEX Nova Swift Glance Horizon Keystone FOLSOM Nova Swift Glance Horizon Keystone Quantum Cinder GRIZZLY Nova Swift Glance Horizon Keystone Quantum Cinder Ceilometer HAVANA Nova Swift Glance Horizon Keystone Neutron Cinder Ceilometer Heat ICEHOUSE Nova Swift Glance Horizon Keystone Neutron Cinder Ceilometer Heat JUNO Nova Swift Glance Horizon Keystone Neutron Cinder Ceilometer Heat Rally KILO Nova Swift Glance Horizon Keystone Neutron Cinder Ceilometer Heat Rally LIBERTY Nova Swift Glance Horizon Keystone Neutron (*) Cinder Ceilometer Heat Rally Magnum (*) Barbican (*) MITAKA Nova Swift Glance Horizon Keystone Neutron Cinder Ceilometer Heat Rally Magnum Barbican (*) Pilot 12/10/2016 Tim Bell - CHEP 2016
12
Tim Bell - OpenStack Summit Barcelona 2016
>190K cores in production under OpenStack >90% of CERN compute resources are virtualised >5,000 VMs migrated from old hardware in 2016 >100K cores to be added in the next 6 months Having started in 2012, we went into production in 2013 on Grizzly and have since then done online upgrades of 6 releases. We are currently running around 50 cells distributed across 2 data centres with around 7,500 hypervisors. Automation is key… a single framework for managing compute resources within limited staffing, many of whom are only at CERN for a few years and return to their home countries. 25/10/2016 Tim Bell - OpenStack Summit Barcelona 2016
13
Onwards the Federated Clouds
Public Cloud such as Rackspace or IBM CERN Private Cloud 150K cores ATLAS Trigger 28K cores ALICE Trigger 12K cores CMS Trigger INFN Italy Brookhaven National Labs NecTAR Australia Many Others on Their Way The trigger farms are those servers nearest the accelerator which are not needed while the accelerator is shut down till 2015 Public clouds are interesting for burst load (such as coming up to a conference) or when price drops such as spot market Private clouds allow universities and other research labs to collaborate in processing the LHC data 20/10/2016 Tim Bell - CERN Computing Infrastructure
14
Nova Cells Top level cell Child cells run: Runs API service
Top cell scheduler Child cells run: Compute nodes Nova network Scheduler Conductor 12/10/2016 Tim Bell - CHEP 2016
15
Tim Bell - CERN Computing Infrastructure
Strategic Plan Establish multi-tenant, multi-provider cloud infrastructure Identify and adopt policies for trust, security and privacy Create governance structure Define funding schemes To support the computing capacity needs for the ATLAS experiment Setting up a new service to simplify analysis of large genomes, for a deeper insight into evolution and biodiversity To create an Earth Observation platform, focusing on earthquake and volcano research To improve the speed and quality of research for finding surrogate biomarkers based on brain images Additional Users: Adopters Suppliers 20/10/2016 Tim Bell - CERN Computing Infrastructure
16
Past, ongoing & future commercial activities@CERN
Support CERN’s scientific computing programme HN - Helix Nebula Partnership between research organization and European commercial cloud providers * EC co-funded joint Pre-Commercial Procurement (PCP) project: ** Other work has been conducted outside CERN, such as the Amazon Pilot project at BNL for ATLAS
17
Tim Bell - OpenStack Summit Barcelona 2016
Containers on Clouds For the user Interactive Dynamic For IT - Magnum Secure Managed Integrated Scalable For industry and EU Prepare for the next disruptive technology at scale Managed means: Project structure Shared free resource pool Authorisation / Identity Audit Resource lifecycle Familiar end user interface We’ve been running tests at scale, 1000 nodes delivering 7 million requests/second, for appliations using OpenStack Magnum. A router firmware issue stopped us scaling further. We’ll work with the Magnm and Heat teams to make this even easier. 25/10/2016 Tim Bell - OpenStack Summit Barcelona 2016
18
For Further Information ?
Technical details at Helix Nebula Initiative at Scientific Working Group at 20/10/2016 Tim Bell - CERN Computing Infrastructure
19
Tim Bell - CERN Computing Infrastructure
Some history of scale… Date Collaboration sizes Data volume, archive technology Late 1950’s 2-3 Kilobits, notebooks 1960’s 10-15 kB, punchcards 1970’s ~35 MB, tape 1980’s ~100 GB, tape, disk 1990’s ~750 TB, tape, disk 2010’s ~3000 PB, tape, disk For comparison: 1990’s: Total LEP data set ~few TB Would fit on 1 tape today Today: 1 year of LHC data ~27 PB 20/10/2016 Tim Bell - CERN Computing Infrastructure
20
Innovation Dilemma How can we avoid the sustainability trap ?
Define requirements No solution available that meets those requirements Develop our own new solution Accumulate technical debt How can we learn from others and share ? Find compatible open source communities Contribute back where there is missing functionality Stay mainstream Are CERN computing needs really special ? We came up with a number of guiding principles… We took an approach that CERN was not special. Culturally, for a research organisation this is a big challenge. Many continue to feel that our requirements would be best met by starting again from scratch but with the modern requirements. In the past, we had extensive written procedures for sysadmins to execute with lots of small tools to run, These were error prone and often the guys did not read the latest ones before they performed the operation. We needed to find ways to scale the productivity the team to match the additional servers. One of the highest people cost items was the tooling. We had previously been constructing requirements lists, with detailed must-have needs for acceptance. Instead, we asked ourselves how come the other big centres could run using these open source tools yet we had special requirements. Often, the root cause was that we did not understand the best approach to use the tools rather than that we were special. The maintenance of our tools was high. The skills and experienced staff were taking up more and more of their time with the custom code so we took an approach of deploy rather than develop. This meant finding the open source tools that made sense for us, trying them out. Where we found something that was missing, we challenged it again and again. Finally, we would develop in collaboration with the community generalised solutikons for the problems that can eb maintained by the community afterwards. Long term forking is not sustainable. 20/10/2016 Tim Bell - CERN Computing Infrastructure
21
Tim Bell - CERN Computing Infrastructure
CERN New data 15 PB 23 PB 27 PB CERN Archive >100 PB 20/10/2016 Tim Bell - CERN Computing Infrastructure
22
OpenStack Collaborations
Large Deployment Team Walmart, Yahoo!, Rackspace, eBay, Paypal, … Containers Rackspace, Red Hat OpenStack Scientific Working Group Not just academic High Performance and High Throughput 20/10/2016 Tim Bell - CERN Computing Infrastructure
23
The Worldwide LHC Computing Grid
nearly 170 sites, 40 countries TIER-0 (CERN): data recording, reconstruction and distribution ~350’000 cores TIER-1: permanent storage, re-processing, analysis 500 PB of storage > 2 million jobs/day TIER-2: Simulation, end-user analysis Gb links 20/10/2016 Tim Bell - CERN Computing Infrastructure
24
Tim Bell - CERN Computing Infrastructure
LHC Data Growth Expecting to record 400PB/year by 2023 with the High Luminosity LHC upgrade PB per year 20/10/2016 Tim Bell - CERN Computing Infrastructure
25
Tim Bell - CERN Computing Infrastructure
Where is x3 improvement ? Compute: Growth > x50 What we think is affordable unless we do something differently 20/10/2016 Tim Bell - CERN Computing Infrastructure
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.