The LHC Computing Grid – February 2008
CERN’s Integration and Certification Services for a Multinational Computing Infrastructure with Independent Developers and Demanding User Communities
Dr. Andreas Unterkircher, Dr. Markus Schulz
EGEE SA3 & LCG Deployment
April 2009, CERN, IT Department

Outline
– CERN
– LHC: the computing challenge (data rates, computing, community)
– Grids at CERN: WLCG, EGEE
– gLite middleware: code base
– Experience: integration, certification
– Lessons learned

CERN stands for over 50 years of fundamental research and discoveries, technological innovation, training and education, and bringing the world together.
– 1954, Rebuilding Europe: first meeting of the CERN Council
– 1980, East meets West: visit of a delegation from Beijing
– 2004, Global Collaboration: the Large Hadron Collider involves over 80 countries

CERN’s mission in Science
Understand the fundamental laws of nature
– We accelerate elementary particles and make them collide.
– We then compare the results with theory.
Provide a world-class laboratory to researchers in Europe and beyond
A few numbers …
– 2500 employees: physicists, engineers, technicians, craftsmen, administrators, secretaries, … (shrinking)
– 6500 visiting scientists (half of the world’s particle physicists), representing 500 universities and over 80 nationalities (increasing)
– Budget: ~1 billion Swiss Francs per year, with additional contributions from participating institutes

View of the LHC tunnel
CERN built the Large Hadron Collider (LHC), the world’s largest particle accelerator (27 km long, 100 m underground).
– First beam in 2008
– Start of the physics programme in autumn 2009

View of the ATLAS detector (2005) 150 million sensors deliver data … … 40 million times per second

View of the ATLAS detector (almost ready)

The LHC Computing Challenge
– Signal/noise ratio < 10⁻⁹ (“The Needle”)
– Data volume: high rate × large number of channels × 4 experiments → 15 PetaBytes of new data each year (~20 million CDs)
– Compute power: event complexity × number of events × thousands of users → >100k of (today’s) fastest CPUs
– Worldwide analysis & funding: computing funded locally in major regions & countries, efficient analysis everywhere → GRID technology
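As a quick sanity check of the data-volume figure above, the CD comparison follows directly from the stated 15 PB per year, assuming the usual ~700 MB capacity per CD:

```latex
\[
\frac{15~\mathrm{PB/year}}{700~\mathrm{MB/CD}}
  = \frac{15\times 10^{15}~\mathrm{bytes}}{7\times 10^{8}~\mathrm{bytes/CD}}
  \approx 2.1\times 10^{7}~\mathrm{CDs/year}
  \;\approx\; 20~\text{million CDs per year}
\]
```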

LHC User Community
– Europe: 267 institutes, 4603 users
– Other: 208 institutes, 1632 users
Over 6000 LHC scientists worldwide

Flow to the CERN Computer Center – 10 Gbit links

LHC Computing Grid project (LCG)
Tier-1 centres (10 Gbit links to each of the 10 T1 centres; large facilities with mass storage capability):
– Canada: TRIUMF (Vancouver)
– France: IN2P3 (Lyon)
– Germany: Forschungszentrum Karlsruhe
– Italy: CNAF (Bologna)
– Netherlands: NIKHEF/SARA (Amsterdam)
– Nordic countries: distributed Tier-1
– Spain: PIC (Barcelona)
– Taiwan: Academia Sinica (Taipei)
– UK: CLRC (Oxford)
– US: FermiLab (Illinois) and Brookhaven (NY)
Tier-2s: ~150 centres in ~35 countries

LHC Computing → Multi-science
– MONARC project: first LHC computing architecture, a hierarchical distributed model
– 2000: growing interest in grid technology; the HEP community is a main driver in launching the DataGrid project
– EU DataGrid project: middleware & testbed for an operational grid
– LHC Computing Grid (LCG): deploying the results of DataGrid to provide a production facility for the LHC experiments
– EU EGEE project, phase 1: starts from the LCG grid; shared production infrastructure; expanding to other communities and sciences
– EU EGEE project, phase 2: expanding to other communities and sciences; scale and stability; interoperations/interoperability
– EU EGEE project, phase 3: more communities; efficient operations; less central coordination

The EGEE project
EGEE
– Started in April 2004, now in its third phase with 91 partners in 32 countries
– From 2010: egi.org
Objectives
– Large-scale, production-quality grid infrastructure for e-Science
– Attracting new resources and users from industry as well as science
– Maintain and further improve the gLite Grid middleware

Enabling Grids for E-sciencE: a global multi-science infrastructure, mission critical for many communities
Application domains: archeology, astronomy, astrophysics, civil protection, computational chemistry, earth sciences, finance, fusion, geophysics, high energy physics, life sciences, multimedia, material sciences, …
– >250 sites in 48 countries
– >100,000 CPUs
– >20 PetaBytes
– >10,000 users
– >200 communities
– >350,000 jobs/day
Number of jobs from 2004 to 2009: rapid growth of the infrastructure

gLite middleware
Service groups:
– Access Services: User Interface, API
– Security Services: Authentication, Authorization
– Information & Monitoring Services: Information System, Job Monitoring, Accounting
– Job Management Services: Computing Element, Worker Node, Workload Management, Job Provenance
– Data Services: Storage Element, File and Replica Catalog, Metadata Catalog
Development effort from different projects: Condor, Globus, Virtual Data Toolkit (VDT), EGEE, LCG, others …
The project relies on a collaborative, consensus-based process:
– No single architect
– Technical Director and Technical Management Board agree with the stakeholders on next steps and on priorities
– Bi-weekly phone conference to coordinate short-term priorities and incidents (bugs)
– 2-3 all-hands meetings per year
– Mail, mail and more mail …
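To make the interplay of these service groups concrete, the sketch below shows roughly how a user on a gLite User Interface touches the Security (VOMS proxy), Workload Management, monitoring and output-retrieval services for a trivial job. The VO name and file names are placeholders, and exact command options varied between gLite releases, so treat this as an illustration rather than a verified recipe.

```bash
#!/bin/bash
# Illustrative only: drive a trivial job through the gLite WMS from a User Interface.
# The VO "dteam" and all file names are placeholders.

# Security services: obtain a short-lived VOMS proxy certificate.
voms-proxy-init --voms dteam

# Describe the job in JDL (Job Description Language).
cat > hello.jdl <<'EOF'
Executable    = "/bin/hostname";
StdOutput     = "hello.out";
StdError      = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};
EOF

# Workload Management: submit with automatic proxy delegation, keep the job id.
glite-wms-job-submit -a -o jobid.txt hello.jdl

# Job monitoring: poll the status of the submitted job.
glite-wms-job-status -i jobid.txt

# Retrieve the output sandbox once the job has finished.
glite-wms-job-output -i jobid.txt --dir ./hello-output
```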

gLite code base

gLite code details
– Complex external and internal cross dependencies → integration and configuration management was always a challenge
– The components are grouped together into ~30 services

Complex Dependencies

Example: Data Management

Stability of the software
All components still see frequent changes, even though many developments started in 2002. Why do we still need changes?
– The scale of the system increased rapidly (exponential growth)
– The number of users and use cases increased: deeper code coverage, new functional requirements
– Less tolerance to failures: implementation of fail-over
– Emerging standards: the project started when no standards were available; incremental introduction

Software stability: defects
– Most changes (81%) are triggered by defects
– ~40% of the defects are found by users: increased production use, and the developers use the same system
– ~2000 open bugs at any time

Software Process (since 2006)
Component based, frequent releases:
– Components are updated independently; no big bang releases
– Updates (patches) are delivered on a weekly basis to the PPS and move after 2 weeks to production
– Clear prioritization by the stakeholders
– Clear definition of roles and responsibilities
– Use of a common build system (ETICS)
Release model: pull
– Sites pick up updates when convenient
– Multiple versions are in production; retirement of old versions takes > 1 year
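From a site administrator's point of view, the pull model amounts to pointing the package manager at the middleware repository and updating when it suits the site. The repository URL below is a placeholder, not the real gLite repository location; this is only a sketch of the idea on an RPM/yum based node:

```bash
#!/bin/bash
# Sketch of the "pull" release model from a site's perspective.
# The repository URL is a placeholder, not the real gLite repository.

cat > /etc/yum.repos.d/glite.repo <<'EOF'
[glite]
name=gLite middleware updates
baseurl=http://example.org/glite/updates/
enabled=1
gpgcheck=0
EOF

# See which middleware packages would change before committing to anything.
yum check-update 'glite-*'

# Pick up the published updates when convenient, e.g. during a maintenance slot.
yum -y update 'glite-*'
```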

Component based process

Patch and Bug Lifecycle
State changes are tracked by Savannah; progress is monitored by dashboards.

Effort
Work areas:
– Integration
– Configuration
– Testing & Certification
– Release Management
Coordinated by CERN, with 10 partner institutes and ~30 FTEs.

Integration testing: deployment tests
– Developers sometimes produce rpms that conflict with existing rpms (gLite or system).
– The deployment tests update the affected production node types with the produced rpms.
– Deployment tests are available and can be launched by the developer before handing the rpms over to certification.
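A minimal form of such a deployment test can be run with rpm's dry-run mode, which reports file conflicts and unmet dependencies without touching the installed node. This is only a sketch of the idea, not the actual SA3 test suite; the directory of candidate rpms is hypothetical.

```bash
#!/bin/bash
# Dry-run deployment check: would these candidate rpms install cleanly on this node?
# Illustrative sketch; "candidate-rpms/" is a hypothetical directory holding a patch's packages.

set -u
RPM_DIR=${1:-candidate-rpms}

# --test performs all dependency and file-conflict checks but installs nothing.
if rpm -U --test "$RPM_DIR"/*.rpm; then
    echo "OK: no conflicts with installed gLite or system packages."
else
    echo "FAILED: conflicts or missing dependencies detected, see messages above." >&2
    exit 1
fi
```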

Integration testing: deployment test issues
– We provide a repository, rpm lists and tarballs (for certain services).
– Sites install and update the middleware differently: yum, fabric management tools, …
– It is difficult to “test” all deployment scenarios: sites and regions customize their installation and configuration procedures.
– The base OS version is updated frequently and independently.

Integration testing: configuration tests
– Grid services are configured with YAIM (YAIM Ain’t an Installation Manager).
– YAIM is a large, modular bash shell script with >30 modules.
– The configuration is tested after every change to the middleware or to YAIM itself.
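A configuration test in this spirit simply reruns YAIM for the node type in question and checks that it completes without errors. The paths, node-type name and site-info file below follow common YAIM usage but should be treated as assumptions, not as a transcript of the actual certification scripts.

```bash
#!/bin/bash
# Re-run the YAIM configuration for one node type and fail if it reports errors.
# Illustrative sketch; paths and the node type are assumptions based on common YAIM usage.

SITE_INFO=/root/site-info.def     # site-wide configuration variables
NODE_TYPE=glite-WN                # e.g. a worker node

LOG=$(mktemp)
/opt/glite/yaim/bin/yaim -c -s "$SITE_INFO" -n "$NODE_TYPE" 2>&1 | tee "$LOG"
STATUS=${PIPESTATUS[0]}

# Fail the test if YAIM exited non-zero or printed ERROR lines.
if [ "$STATUS" -ne 0 ] || grep -q "ERROR" "$LOG"; then
    echo "Configuration test FAILED for $NODE_TYPE" >&2
    exit 1
fi
echo "Configuration test passed for $NODE_TYPE"
```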

System testing
– Services have to be tested against a grid. Which version should we test against? The production service is not homogeneous.
– One patch may affect several node types; for every node type we have a list of tests that have to be run.
– Regression tests are available and evolving.
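The per-node-type test lists can be driven by a runner as small as the one below: one plain-text list of test commands per node type, executed in order, with the overall result reduced to an exit code. The tests/&lt;node-type&gt;.list layout is a hypothetical convention used only to illustrate the idea.

```bash
#!/bin/bash
# Run the list of tests registered for one node type (e.g. CE, WN, SE).
# Illustrative sketch; the tests/<node-type>.list layout is a made-up convention.

NODE_TYPE=${1:?"usage: $0 node-type"}
LIST=tests/${NODE_TYPE}.list
FAILED=0

while read -r test_cmd; do
    # Skip blank lines and comments in the test list.
    case "$test_cmd" in ''|'#'*) continue ;; esac
    echo "=== running: $test_cmd"
    if ! $test_cmd; then
        echo "--- FAILED: $test_cmd"
        FAILED=$((FAILED + 1))
    fi
done < "$LIST"

echo "$FAILED test(s) failed for node type $NODE_TYPE"
exit $((FAILED > 0))
```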

Acceptance testing: Pre-Production Service (PPS)
– ~20 sites, several hundred nodes.
– Provides interested users with preview access to grid services.
– Evaluates deployment procedures, interoperability and basic functionality of the software against operational scenarios reflecting real production conditions.
– After certification, patches go to the PPS before being released to production; time spent in the PPS: 1-2 weeks.

Acceptance testing
– It is difficult to convince users to try out the services before they are released to production.
– Production Grid conditions cannot be fully replicated: the size of the Grid, file catalogs with millions of entries.
Early life support
– Dedicated sites install certain services immediately after release to production.
– Well defined rollback procedure in case of problems.
Pilot services
– Preview of a new (version of a) service.
– Users can (stress) test it with typical production workloads.
– Quick feedback to the developers.
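The “well defined rollback procedure” can be as simple as keeping the previously released rpms on the node and reinstalling them if the new version misbehaves. A rough sketch, assuming the site archives the old packages in a local directory (the directory name is made up):

```bash
#!/bin/bash
# Roll a node back to the previously deployed package versions.
# Sketch only; "previous-release/" is a hypothetical local archive of the old rpms.

PREV_DIR=${1:-previous-release}

# --oldpackage allows "upgrading" to an older version, i.e. a rollback.
rpm -Uvh --oldpackage "$PREV_DIR"/*.rpm && echo "Rollback completed."
```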

Test process
Tailored to our environment:
– People in different locations are involved, independent in their work habits and infrastructure.
– Open source tools.
– Use the “least common denominator”.

Test writing
– Biggest challenge: getting tests written at all. The learning curve for grid services is steep, so we maintain lists of expertise. It is difficult to get realistic use cases.
– Keep it simple, so that people can focus on the test script rather than on its integration into a framework.
– Each test belongs to one defined test category: installation, functionality, etc.
– Test scripts may use Bash, Python or Perl.
– Tests can be executed as a command; this ensures integration into different frameworks.
– Tests must be fully configurable.
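Following these conventions, a test can be a small self-contained script: configured entirely through environment variables (so any framework can set them), runnable as a single command, and reporting its verdict through the exit code. The skeleton below is illustrative; the variable names are invented.

```bash
#!/bin/bash
# Skeleton of a functionality test following the conventions above: fully configurable
# via environment variables (names are illustrative), runnable as a plain command,
# and reporting its result through the exit code.

: "${TEST_CE:?set TEST_CE to the computing element under test}"
: "${TEST_TIMEOUT:=300}"     # seconds to wait before declaring failure

echo "Testing job submission to $TEST_CE (timeout ${TEST_TIMEOUT}s)"

RESULT=0
# ... perform the actual test here and set RESULT=1 on failure ...

if [ "$RESULT" -eq 0 ]; then
    echo "TEST PASSED"
else
    echo "TEST FAILED"
fi
exit "$RESULT"
```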

Available tests and checklists are documented.

Test framework
Testing requires a grid. Ideally one could bring up a complete grid with one click, with well defined versions of the nodes, but installing grid nodes is non-trivial.
Pragmatic approach:
– CERN provides a certification testbed: a complete, self-contained grid providing all services. Certifiers install the nodes they need to test and integrate them into the testbed.
– Heavy use of virtualization: we developed our own tools to create customized images and a VM management framework (Xen based).
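The in-house image creation and VM management tools are not described on the slide, but the underlying mechanism on a Xen host of that era looked roughly like the snippet below: a small domU configuration file pointing at a prepared image, started with the xm toolstack. File names, sizes and the node name are made up for illustration.

```bash
#!/bin/bash
# Start a certification test node from a prepared disk image on a Xen host.
# Illustrative only; file names, sizes and the node name are invented, and the real
# setup used the in-house image and VM management tools mentioned above.

cat > /etc/xen/cert-wn01.cfg <<'EOF'
name    = "cert-wn01"
memory  = 1024
kernel  = "/boot/vmlinuz-xen"
ramdisk = "/boot/initrd-xen.img"
disk    = ['file:/var/lib/xen/images/sl4-glite-wn.img,xvda,w']
vif     = ['bridge=xenbr0']
EOF

xm create /etc/xen/cert-wn01.cfg   # boot the virtual worker node
xm list                            # confirm the domain is running
```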

Test framework
Don’t let the framework distract you from doing tests!
– We tried complex test frameworks that execute tests, store and display results, and record information about the test setup …
Pragmatic approach:
– Test data and results are stored with the patch.
– Patch & bug tracking tool: Savannah.
– Tests are simple scripts that can be used by anybody.

Experience
We are victims of our own success:
– We moved prototypes into production very early.
– With production users we can only evolve slowly (→ standards).
Software life cycle management has to change with the project’s maturity:
– Before 2006: focus on functionality; big bang releases; large dedicated testbeds; central team.
– Since then: manage diversity and scale, be reactive; fast release cycles; deployment scenarios via the PPS; pilot services using production; a strong central team plus distributed teams.

Future
Components will be developed more independently, and the process has to reflect this:
– Decentralized approach: tests follow an agreed process and can be run everywhere.
More problems are found at full scale in production:
– Focus on pilots and staged rollout.
– Improved “undo” (rollback).
– Deployment tests move to the sites: there are too many different setups to handle in one place.

If we could start again …
– Expectation management: software developers and users have to better understand the limitations of testing.
– Enforce unit and basic tests provided by the software producers: software is often rejected for trivial reasons, which is very inefficient.
– Avoid an overambitious Pre-Production Service: limited gain.
– Enforce control over dependencies from the start.
– Add process monitoring earlier in the project.