The LHC Computing Grid
CERN's Integration and Certification Services for a Multinational Computing Infrastructure with Independent Developers and Demanding User Communities
Dr. Andreas Unterkircher, Dr. Markus Schulz
EGEE SA3 & LCG Deployment
April 2009, CERN, IT Department
Outline
– CERN and the LHC: the computing challenge (data rates, computing, community)
– Grids at CERN: WLCG, EGEE
– gLite middleware: code base
– Experience: integration, certification
– Lessons learned
CERN stands for over 50 years of fundamental research and discoveries, technological innovation, training and education, and bringing the world together:
– 1954, Rebuilding Europe: first meeting of the CERN Council
– 1980, East meets West: visit of a delegation from Beijing
– 2004, Global collaboration: the Large Hadron Collider involves over 80 countries
CERN's mission in science
– Understand the fundamental laws of nature: we accelerate elementary particles, make them collide, and compare the results with theory
– Provide a world-class laboratory to researchers in Europe and beyond
A few numbers …
– 2500 employees: physicists, engineers, technicians, craftsmen, administrators, secretaries, … (shrinking)
– 6500 visiting scientists (half of the world's particle physicists), representing 500 universities and over 80 nationalities (increasing)
– Budget: ~1 billion Swiss Francs per year, with additional contributions by participating institutes
View of the LHC tunnel
CERN built the Large Hadron Collider (LHC), the world's largest particle accelerator: 27 km long, 100 m underground.
First beam in 2008; start of the physics program in autumn 2009.
View of the ATLAS detector (2005)
150 million sensors deliver data 40 million times per second.
View of the ATLAS detector (almost ready)
The LHC Computing Challenge
– Signal/noise: < 10^-9 (finding the needle in the haystack)
– Data volume: high rate × large number of channels × 4 experiments = 15 PetaBytes of new data each year (~20 million CDs)
– Compute power: event complexity × number of events × thousands of users = more than 100k of today's fastest CPUs
– Worldwide analysis & funding: computing is funded locally in major regions and countries, yet efficient analysis must be possible everywhere → GRID technology
LHC User Community
Europe: 267 institutes, 4603 users. Other: 208 institutes, 1632 users.
Over 6000 LHC scientists worldwide.
Data flow to the CERN Computer Center (10 Gbit links)
LHC Computing Grid project (LCG)
Tier-1 centres – 10 Gbit links to each of the 10 T1 centers, large facilities with mass storage capability:
– Canada: TRIUMF (Vancouver)
– France: IN2P3 (Lyon)
– Germany: Forschungszentrum Karlsruhe
– Italy: CNAF (Bologna)
– Netherlands: NIKHEF/SARA (Amsterdam)
– Nordic countries: distributed Tier-1
– Spain: PIC (Barcelona)
– Taiwan: Academia Sinica (Taipei)
– UK: CLRC (Oxford)
– US: FermiLab (Illinois), Brookhaven (NY)
Tier-2s: ~150 centres in ~35 countries
LHC Computing – multi-science
– MONARC project: first LHC computing architecture, a hierarchical distributed model
– 2000: growing interest in grid technology; the HEP community is a main driver in launching the DataGrid project
– EU DataGrid project: middleware and testbed for an operational grid
– LHC Computing Grid (LCG): deploying the results of DataGrid to provide a production facility for the LHC experiments
– EU EGEE project phase 1: starts from the LCG grid; shared production infrastructure; expanding to other communities and sciences
– EU EGEE project phase 2: scale and stability; interoperation and interoperability
– EU EGEE project phase 3: more communities; efficient operations; less central coordination
The EGEE project
– Started in April 2004, now in its third phase with 91 partners in 32 countries
– In 2010 the infrastructure transitions to egi.org
Objectives:
– Large-scale, production-quality grid infrastructure for e-Science
– Attracting new resources and users from industry as well as science
– Maintain and further improve the gLite grid middleware
Enabling Grids for E-sciencE (EGEE)
Application domains: archeology, astronomy, astrophysics, civil protection, computational chemistry, earth sciences, finance, fusion, geophysics, high energy physics, life sciences, multimedia, material sciences, …
>250 sites, 48 countries, >100,000 CPUs, >20 PetaBytes, >10,000 users, >200 communities, >350,000 jobs/day
A global multi-science infrastructure, mission critical for many communities. The number of jobs from 2004 to 2009 shows the rapid growth of the infrastructure.
gLite middleware
– Access services: User Interface, API
– Security services: Authentication, Authorization
– Information & monitoring services: Information System, Job Monitoring, Accounting
– Job management services: Computing Element, Worker Node, Workload Management, Job Provenance
– Data services: Storage Element, File and Replica Catalog, Metadata Catalog
Development effort from different projects: Condor, Globus, the Virtual Data Toolkit (VDT), EGEE, LCG and others.
The project relies on a collaborative, consensus-based process:
– No single architect; a Technical Director and a Technical Management Board
– Agree with stakeholders on next steps and on priorities
– Bi-weekly phone conference to coordinate short-term priorities and incidents (bugs)
– 2-3 all-hands meetings per year
– Mail, mail and mail …
gLite code base
gLite code details
(Figures: code size per component, with axis marks from ~1K to ~10K lines.)
Complex external and internal cross dependencies: integration and configuration management was always a challenge. The components are grouped together into ~30 services.
Complex Dependencies
Example: Data Management
Stability of the software
All components still see frequent changes. Many developments started in 2002, so why do we still need changes?
– The scale of the system increased rapidly (exponential growth)
– The number of users and use cases increased: deeper code coverage, new functional requirements
– Less tolerance to failures: implementation of fail-over
– Emerging standards: the project started when no standards were available; they are being introduced incrementally
Software stability: defects
– Most changes (81%) are triggered by defects
– ~40% of defects are found by users
– ~2000 open bugs at any time
– Drivers: increased production use, and developers use the same system
Software Process (since 2006)
Component based, frequent releases:
– Components are updated independently; no big-bang releases
– Updates (patches) are delivered on a weekly basis to the PPS and move to production after 2 weeks
– Clear prioritization by stakeholders
– Clear definition of roles and responsibilities
– Use of a common build system (ETICS)
Release model: pull
– Sites pick up updates when convenient
– Multiple versions are in production; retirement of old versions takes > 1 year
Component based process
Patch and Bug Lifecycle
State changes are tracked in Savannah; progress is monitored by dashboards.
Effort
Work areas: integration, configuration, testing & certification, release management.
Coordinated by CERN, with 10 partner institutes and ~30 FTEs.
Integration testing: deployment tests
– Check whether the rpms produced by developers conflict with existing rpms (gLite or system)
– Update the affected production node types with the produced rpms
– Deployment tests are available and can be launched by the developer before handing the rpms to certification (see the sketch below)
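As an illustration, the core of such a conflict check can be expressed with rpm's dry-run mode; this is a minimal sketch, assuming a directory of candidate rpms, not the actual SA3 tooling:

```bash
#!/bin/bash
# Hypothetical sketch of a deployment (conflict) test: dry-run install
# of the candidate rpms on an installed node and report conflicts.
# CANDIDATE_DIR and the log path are assumptions, not the SA3 tooling.
set -u

CANDIDATE_DIR=${CANDIDATE_DIR:-/tmp/candidate-rpms}  # rpms from the developer
LOG=${LOG:-/tmp/deploy-test.log}

# 'rpm -U --test' resolves dependencies and detects file conflicts
# without modifying the system.
if rpm -U --test "${CANDIDATE_DIR}"/*.rpm >"$LOG" 2>&1; then
    echo "OK: candidate rpms install cleanly on this node type"
else
    echo "FAIL: conflicts or missing dependencies detected:"
    cat "$LOG"
    exit 1
fi
```

Running this on an installed instance of each affected node type catches file conflicts and missing dependencies before certification effort is spent.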
Integration testing: deployment test issues
– We provide a repository, rpm lists and tarballs (for certain services)
– Sites install and update the middleware differently: yum, fabric management tools, … (a typical yum-based update is sketched below)
– It is difficult to "test" all deployment scenarios: sites and regions customize their install and configuration procedures
– The base OS version is updated frequently and independently
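For example, a yum-based site might pick up an update for a single service like this; the package glob is a placeholder:

```bash
# Illustrative only: one of many site-specific update paths.
# The glite-WMS* pattern is a placeholder package glob.
yum clean metadata          # refresh the repository metadata
yum update 'glite-WMS*'     # pull in the updated service rpms
```

Other sites drive the same step through fabric management tools, which is why no single deployment test covers every scenario.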
Integration testing: configuration tests
Grid services are configured with YAIM ("YAIM Ain't an Installation Manager"):
– A modular bash shell script, > 30 modules
– Test the configuration after changes to the middleware or to YAIM itself (a sketch follows)
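A configuration test can be as simple as re-running YAIM for the affected node type and checking the exit status; a minimal sketch, assuming the standard gLite YAIM location, with an example node type:

```bash
#!/bin/bash
# Re-run YAIM for a node type after a middleware or YAIM update and
# fail if configuration does not complete cleanly. The site-info.def
# location and the node type name are examples for your installation.
set -u

SITE_INFO=${SITE_INFO:-/root/site-info.def}
NODE_TYPE=${NODE_TYPE:-glite-WN}   # example: a worker node

if /opt/glite/yaim/bin/yaim -c -s "$SITE_INFO" -n "$NODE_TYPE"; then
    echo "OK: YAIM configured $NODE_TYPE"
else
    echo "FAIL: YAIM configuration of $NODE_TYPE failed"
    exit 1
fi
```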
System testing
Services have to be tested against a grid, but what version should we test against? The production service is not homogeneous.
– One patch may affect several node types
– For every node type we maintain a list of tests that have to be run (see the sketch below)
– Regression tests are available and evolving
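Conceptually, the per-node-type test lists drive a simple runner; this sketch assumes a hypothetical layout with one executable test per line in a tests.list file:

```bash
#!/bin/bash
# Run every test registered for a given node type and summarize.
# The tests/<node-type>/tests.list layout (one executable per line)
# is a hypothetical illustration of the per-node-type test lists.
set -u

NODE_TYPE=${1:?usage: $0 <node-type>}
TEST_LIST="tests/${NODE_TYPE}/tests.list"
failed=0

while read -r test; do
    [ -z "$test" ] && continue          # skip blank lines
    if "./tests/${NODE_TYPE}/${test}"; then
        echo "PASS $test"
    else
        echo "FAIL $test"
        failed=$((failed + 1))
    fi
done < "$TEST_LIST"

echo "$failed test(s) failed for node type $NODE_TYPE"
exit "$failed"
```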
Acceptance testing: Pre-Production Service (PPS)
– ~20 sites, several hundred nodes
– Provides interested users with preview access to grid services
– Evaluates deployment procedures, interoperability and basic functionality of the software against operational scenarios reflecting real production conditions
– After certification, patches go to the PPS before being released to production; time spent in the PPS: 1-2 weeks
Acceptance testing
It is difficult to convince users to try out services before they are released to production, and production grid conditions cannot be fully replicated:
– Size of the grid
– File catalogs with millions of entries
Early life support:
– Dedicated sites install certain services immediately after their release to production
– Well-defined rollback procedure in case of problems
Pilot services:
– Preview of a new (version of a) service
– Users can (stress) test it with typical production workloads
– Quick feedback to developers
Test process
Tailored to our environment:
– People in different locations are involved, independent in their work habits and infrastructure
– Open source tools
– Use the "least common denominator"
Test writing
Biggest challenge: getting tests written at all.
– The learning curve for grid services is steep; we maintain lists of expertise
– It is difficult to get realistic use cases
– Keep it simple, so that authors can focus on test writing
– Each test belongs to one defined test category (installation, functionality, etc.)
– Test scripts may use Bash, Python or Perl
– Tests can be executed as a command; this ensures integration into different frameworks
– Tests must be fully configurable
– Focus on the test script, not on integration into a framework (a minimal skeleton is sketched below)
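Under these conventions a test is just a self-contained executable that takes all its parameters from the environment; a minimal skeleton (the variable names are illustrative):

```bash
#!/bin/bash
# Minimal test skeleton following the conventions above: runnable as
# a plain command, configured entirely via environment variables,
# exit code 0 means pass. The variable names are illustrative.
set -u

CE_HOST=${CE_HOST:?CE_HOST must be set}   # service under test
TIMEOUT=${TIMEOUT:-60}                    # seconds

echo "Testing job submission to $CE_HOST (timeout ${TIMEOUT}s)"

# ... the actual test logic goes here ...

echo "TEST PASSED"
exit 0
```

Because the script runs as a plain command and reads only environment variables, any framework, or a certifier at a shell prompt, can execute it unchanged.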
Available tests and checklists are documented.
Test framework
Testing requires a grid. The ideal: bring up a complete grid with one click, with well-defined versions of the nodes according to test results. But installing grid nodes is non-trivial.
Pragmatic approach:
– CERN provides a certification testbed: a complete, self-contained grid providing all services
– Certifiers install the nodes they need to test and integrate them into the testbed
– Heavy use of virtualization: we developed our own tools to create customized images, plus a VM management framework (Xen based); see the sketch below
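As an illustration of the virtualization approach, instantiating a test node from a prepared image with Xen's command-line tools might look like this; the image layout and config paths are assumptions, not the actual in-house framework:

```bash
#!/bin/bash
# Hypothetical sketch: instantiate a certification node from a
# pre-built image with Xen's xm tool. The image and config layout
# are assumptions; the real in-house framework automates these steps.
set -u

NODE=${1:?usage: $0 <node-name>}   # e.g. a CE or WN instance
IMAGE_DIR=/srv/images              # customized node images

# Start each certification run from a pristine copy of the image.
cp "${IMAGE_DIR}/${NODE}.img.pristine" "${IMAGE_DIR}/${NODE}.img"

xm create "/etc/xen/${NODE}.cfg"   # boot the guest
xm list                            # show the running guests
```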
Test framework
Don't let the framework distract you from doing tests! We tried complex test frameworks that execute tests, store and display results, and record information about the test setup …
Pragmatic approach:
– Test data and results are stored with the patch in the patch & bug tracking tool (Savannah)
– Tests are simple scripts that can be used by anybody
Experience
We are victims of our own success:
– Prototypes were moved into production very early
– With production users we can evolve only slowly (→ standards)
Software life cycle management has to change with a project's maturity:
– Before 2006: focus on functionality; big-bang releases; large dedicated testbeds; central team
– Since 2006: manage diversity and scale, reactive; fast release cycles; deployment scenarios via the PPS; pilot services using production; strong central team & distributed teams
Future
Components will be developed more independently, and the process has to reflect this:
– Decentralized approach: tests follow an agreed process and can be run everywhere
– More problems are found at full scale in production: focus on pilots and staged rollout; improved "undo" (rollback)
– Deployment tests move to the sites: too many different setups to handle in one place
If we could start again …
– Expectation management: software developers and users have to better understand the limitations of testing
– Enforce unit and basic tests to be provided by the software producers: software is often rejected for trivial reasons, which is very inefficient
– Avoid an overambitious Pre-Production Service: limited gain
– Enforce control over dependencies from the start
– Add process monitoring earlier in the project