Challenges and Opportunities. Dr. Ian Bird, LHC Computing Grid Project Leader. Göttingen Tier 2 Inauguration, 13th May 2008.
The scales
High Energy Physics machines and detectors (detector labels: muon chambers, calorimeter). √s = 14 TeV; L: … /cm²/s and … /cm²/s.
2.5 million collisions per second; LVL1: 10 kHz, LVL3: … Hz; 25 MB/sec digitized recording.
40 million collisions per second; LVL1: 1 kHz, LVL3: 100 Hz; 0.1 to 1 GB/sec digitized recording.
LHC: 4 experiments … ready! First physics expected in autumn
The LHC Computing Challenge
Signal/Noise: …
Data volume: high rate × large number of channels × 4 experiments → 15 PetaBytes of new data each year
Compute power: event complexity × number of events × thousands of users → 100k of (today's) fastest CPUs
Worldwide analysis & funding: computing funded locally in major regions & countries; efficient analysis everywhere → GRID technology
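For scale, a back-of-the-envelope sketch of how a figure of order 15 PB/year arises. The recording rate, event size and live time below are illustrative assumptions, not the experiments' official numbers:

```python
# Back-of-the-envelope estimate of yearly raw-data volume.
# All numbers below are illustrative assumptions, not official experiment figures.

RECORD_RATE_HZ = 200          # events written to storage per second (assumed)
EVENT_SIZE_MB = 1.5           # raw event size in megabytes (assumed)
LIVE_SECONDS_PER_YEAR = 1e7   # typical accelerator "live" seconds per year (assumed)
N_EXPERIMENTS = 4

raw_pb_per_experiment = (RECORD_RATE_HZ * EVENT_SIZE_MB * LIVE_SECONDS_PER_YEAR) / 1e9
total_pb = raw_pb_per_experiment * N_EXPERIMENTS

print(f"~{raw_pb_per_experiment:.1f} PB raw data per experiment per year")
print(f"~{total_pb:.1f} PB across {N_EXPERIMENTS} experiments (before derived/simulated data)")
```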
A collision at LHC. Luminosity: 10³⁴ cm⁻² s⁻¹. Bunch crossings at 40 MHz (every 25 ns), with ~20 overlapping (pile-up) events per crossing.
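A quick consistency check of these rates, using only the 25 ns bunch spacing and the ~20 pile-up events quoted above:

```python
# Consistency check: bunch spacing vs. crossing rate and interactions per second.
BUNCH_SPACING_S = 25e-9            # 25 ns between bunch crossings (from the slide)
PILEUP_EVENTS_PER_CROSSING = 20    # overlapping proton-proton interactions (from the slide)

crossing_rate_hz = 1.0 / BUNCH_SPACING_S
interactions_per_s = crossing_rate_hz * PILEUP_EVENTS_PER_CROSSING

print(f"Crossing rate: {crossing_rate_hz / 1e6:.0f} MHz")                     # 40 MHz
print(f"pp interactions: ~{interactions_per_s / 1e6:.0f} million per second")
```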
The Data Acquisition
Tier 0 at CERN: acquisition, first-pass reconstruction, storage & distribution. 1.25 GB/sec (ions).
Tier 0 – Tier 1 – Tier 2
Tier-0 (CERN): data recording, first-pass reconstruction, data distribution
Tier-1 (11 centres): permanent storage, re-processing, analysis
Tier-2 (>200 centres): simulation, end-user analysis
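The division of labour can be captured in a small data structure; a purely illustrative summary using the site counts quoted above:

```python
# Illustrative summary of the WLCG tier roles described above.
WLCG_TIERS = {
    "Tier-0": {
        "sites": 1,       # CERN
        "roles": ["data recording", "first-pass reconstruction", "data distribution"],
    },
    "Tier-1": {
        "sites": 11,
        "roles": ["permanent storage", "re-processing", "analysis"],
    },
    "Tier-2": {
        "sites": 200,     # ">200 centres" on the slide
        "roles": ["simulation", "end-user analysis"],
    },
}

for tier, info in WLCG_TIERS.items():
    print(f"{tier}: ~{info['sites']} site(s); " + ", ".join(info["roles"]))
```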
Evolution of requirements (ATLAS or CMS, first year at design luminosity):
ATLAS & CMS CTP: 10⁷ MIPS, 100 TB disk
"Hoffmann" Review: 7×10⁷ MIPS, 1,900 TB disk
Computing TDRs: 55×10⁷ MIPS (140 MSi2K), 70,000 TB disk
(Timeline markers: LHC approved; ATLAS & CMS approved; ALICE approved; LHCb approved; LHC start.)
Evolution of CPU capacity at CERN, by accelerator era: SC (0.6 GeV), PS (28 GeV), ISR (300 GeV), SPS (400 GeV), ppbar (540 GeV), LEP (100 GeV), LEP II (200 GeV), LHC (14 TeV). Costs in 2007 Swiss Francs, including infrastructure (computer centre, power, cooling, ...) and physics tapes. Tape & disk requirements: >10 times what CERN alone could provide.
Evolution of Grids (timeline): EU DataGrid → EGEE 1 → EGEE 2 → EGEE; GriPhyN, iVDGL, PPDG → Grid3 → OSG; LCG 1 → LCG 2 → WLCG. Milestones: Data Challenges, Service Challenges, Cosmics, First physics.
The Worldwide LHC Computing Grid
Purpose: develop, build and maintain a distributed computing environment for the storage and analysis of data from the four LHC experiments; ensure the computing service … and common application libraries and tools.
Phase I – development & planning
Phase II – deployment & commissioning of the initial services
WLCG Collaboration
The Collaboration: 4 LHC experiments; ~250 computing centres; 12 large centres (Tier-0, Tier-1); 56 federations of smaller "Tier-2" centres; growing to ~40 countries. Grids: EGEE, OSG, Nordugrid.
Technical Design Reports: WLCG and the 4 experiments, June 2005.
Memorandum of Understanding: agreed in October 2005. Resources: 5-year forward look.
MoU signing status – Tier 1: all have now signed. Tier 2: Australia, Belgium, Canada *, China, Czech Rep. *, Denmark, Estonia, Finland, France, Germany (*), Hungary *, Italy, India, Israel, Japan, JINR, Korea, Netherlands, Norway *, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden *, Switzerland, Taipei, Turkey *, UK, Ukraine, USA. Still to sign: Austria, Brazil (under discussion). (* Recent additions)
WLCG Service Hierarchy
Tier-0 – the accelerator centre: data acquisition & initial processing; long-term data curation; distribution of data to Tier-1 centres.
Tier-1 – "online" to the data acquisition process, high availability: managed mass storage – grid-enabled data service; data-heavy analysis; national, regional support. Tier-1 centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF/SARA (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois) and Brookhaven (NY).
Tier-2 – ~130 centres in ~35 countries: end-user (physicist, research group) analysis – where the discoveries are made; simulation.
Recent grid use (across all grid infrastructures: EGEE, OSG, Nordugrid): CERN: 11%, Tier 1: 35%, Tier 2: 54%. The grid concept really works – all contributions, large & small, are essential!
Recent grid activity
WLCG ran ~44 M jobs in 2007, and the workload has continued to increase: 29 M jobs already in 2008, now at >300k jobs/day (up from ~230k/day). These workloads (reported across all WLCG centres) are at the level anticipated for 2008 data taking. The distribution of work across Tier-0/Tier-1/Tier-2 really illustrates the importance of the grid system: the Tier-2 contribution is around 50%, and >85% is external to CERN.
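A small sanity check on the quoted job rates, using only the numbers above:

```python
# Rough sanity check on the quoted WLCG job rates.
jobs_2007 = 44e6                 # ~44 million jobs run in 2007 (from the slide)
avg_jobs_per_day_2007 = jobs_2007 / 365

jobs_per_day_now = 300e3         # >300k jobs/day quoted for 2008
tier2_share = 0.50               # Tier-2 contribution ~50%
external_share = 0.85            # >85% of work outside CERN

print(f"2007 average: ~{avg_jobs_per_day_2007:,.0f} jobs/day")
print(f"Current Tier-2 jobs: ~{jobs_per_day_now * tier2_share:,.0f} per day")
print(f"Current non-CERN jobs: ~{jobs_per_day_now * external_share:,.0f} per day")
```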
LHCOPN Architecture
Data Transfer out of Tier-0 – Target for 2008: … GB/s
Production Grids
WLCG relies on a production-quality infrastructure. Requires standards of:
○ Availability/reliability
○ Performance
○ Manageability
Will be used 365 days a year... (has been for several years!)
Tier 1s must store the data for at least the lifetime of the LHC – ~20 years
○ Not passive – requires active migration to newer media
It is vital that we build a fault-tolerant and reliable system that can deal with individual sites being down and recover.
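As a sketch of the kind of fault tolerance meant here, the snippet below retries a transfer and fails over between replica sites. The site names and the copy_from function are hypothetical placeholders, not WLCG middleware:

```python
import random
import time

# Hypothetical list of sites holding replicas of the same dataset.
REPLICA_SITES = ["tier1-a.example.org", "tier1-b.example.org", "tier1-c.example.org"]

def copy_from(site: str, dataset: str) -> bytes:
    """Placeholder for a real transfer; randomly fails to simulate a site outage."""
    if random.random() < 0.4:
        raise ConnectionError(f"{site} unavailable")
    return f"contents of {dataset} from {site}".encode()

def fetch_with_failover(dataset: str, retries_per_site: int = 2) -> bytes:
    """Try each replica site in turn, with retries and backoff, before giving up."""
    for site in REPLICA_SITES:
        for attempt in range(retries_per_site):
            try:
                return copy_from(site, dataset)
            except ConnectionError as err:
                print(f"attempt {attempt + 1} at {site} failed: {err}")
                time.sleep(2 ** attempt)          # simple exponential backoff
    raise RuntimeError(f"all replicas of {dataset} unavailable")

print(fetch_with_failover("/grid/data/run123.root"))
```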
The EGEE Production Infrastructure
Test-beds & services: Production Service; Pre-production service; Certification test-beds (SA3).
Support structures & processes: Operations Coordination Centre; Regional Operations Centres; Global Grid User Support; EGEE Network Operations Centre (SA2); Operational Security Coordination Team; Operations Advisory Group (+NA4).
Security & policy groups: Joint Security Policy Group; EuGridPMA (& IGTF); Grid Security Vulnerability Group.
Training: Training infrastructure (NA4); Training activities (NA3).
Site Reliability, Sep 07 – Feb 08
All sites (monthly, Sep 07 → Feb 08): 89%, 86%, 92%, 87%, 89%, 84%
8 best sites: 93%, 95%, 96%, …
(Shown against the target and the >90% target.)
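To make the targets concrete, a reliability percentage can be translated into a monthly downtime budget; a small sketch using the >90% target from the table and plain arithmetic:

```python
# Convert monthly reliability figures into downtime budgets.
HOURS_PER_MONTH = 30 * 24

def downtime_hours(reliability: float) -> float:
    """Hours of downtime per month implied by a given reliability."""
    return (1.0 - reliability) * HOURS_PER_MONTH

for label, rel in [("90% target", 0.90), ("Feb 08 average (84%)", 0.84), ("8 best (96%)", 0.96)]:
    print(f"{label}: ~{downtime_hours(rel):.0f} hours downtime/month")
```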
Improving Reliability: monitoring; metrics; workshops; data challenges; experience; systematic problem analysis; priority from software developers.
Gridmap
Middleware: Baseline Services
The basic baseline services – from the TDR (2005):
Storage Element: Castor, dCache, DPM (StoRM added in 2007); SRM 2.2 deployed in production, Dec 2007
Basic transfer tools: GridFTP, ...
File Transfer Service (FTS)
LCG File Catalog (LFC)
LCG data management tools: lcg-utils
Posix I/O: Grid File Access Library (GFAL)
Synchronised databases T0 → T1s: 3D project
Information System: scalability improvements
Compute Elements: Globus/Condor-C; improvements to LCG-CE for scale/reliability; web services (CREAM); support for multi-user pilot jobs (glexec, SCAS)
gLite Workload Management in production
VO Management System (VOMS)
VO Boxes
Application software installation
Job monitoring tools
Focus now on continuing evolution of reliability, performance, functionality, requirements. For a production grid the middleware must allow us to build fault-tolerant and scalable services: this is more important than sophisticated functionality.
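To illustrate how the catalogue and storage services divide the work, here is a toy sketch of a client resolving a logical file name to replicas and reading from a storage element. The class and method names are illustrative only and do not mirror the real LFC, GFAL or FTS APIs:

```python
# Toy model of the catalogue / storage-element split described above.
# The interfaces are illustrative; they do not reproduce the real LFC/GFAL/FTS APIs.

class FileCatalog:
    """Maps logical file names (LFNs) to physical replica locations (SURLs)."""
    def __init__(self):
        self._replicas = {}

    def register(self, lfn: str, surl: str) -> None:
        self._replicas.setdefault(lfn, []).append(surl)

    def list_replicas(self, lfn: str) -> list[str]:
        return list(self._replicas.get(lfn, []))

class StorageElement:
    """Grid-enabled storage front end (Castor/dCache/DPM play this role in WLCG)."""
    def __init__(self, name: str, files: dict[str, bytes]):
        self.name, self._files = name, files

    def read(self, surl: str) -> bytes:
        return self._files[surl]        # raises KeyError if the replica is missing

# Wiring it together: one logical file, replicated at two hypothetical sites.
catalog = FileCatalog()
catalog.register("lfn:/grid/atlas/run1.root", "srm://t1-a/run1.root")
catalog.register("lfn:/grid/atlas/run1.root", "srm://t1-b/run1.root")

sites = {
    "srm://t1-a/run1.root": StorageElement("t1-a", {"srm://t1-a/run1.root": b"raw events"}),
    "srm://t1-b/run1.root": StorageElement("t1-b", {"srm://t1-b/run1.root": b"raw events"}),
}

for surl in catalog.list_replicas("lfn:/grid/atlas/run1.root"):
    print(surl, "->", sites[surl].read(surl))
```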
Database replication
In full production: several GB/day of user data can be sustained to all Tier 1s; ~100 DB nodes at CERN and several tens of nodes at Tier 1 sites – a very large distributed database deployment. Used for several applications: experiment calibration data; replicating (central, read-only) file catalogues.
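The pattern behind this is a single writable master at Tier-0 with read-only replicas at the Tier-1s. A minimal sketch, where a simple in-memory copy stands in for the real 3D replication machinery:

```python
# Minimal sketch of the master / read-only replica pattern used for conditions data.
# The "replication" is an in-memory copy standing in for the real 3D machinery.

class MasterDB:
    def __init__(self):
        self._rows = {}
        self._replicas = []

    def attach_replica(self, replica: "ReplicaDB") -> None:
        self._replicas.append(replica)

    def write(self, key: str, value: str) -> None:
        """Writes happen only at the Tier-0 master, then propagate outward."""
        self._rows[key] = value
        for replica in self._replicas:
            replica.apply(key, value)

class ReplicaDB:
    """Read-only copy at a Tier-1 site; local jobs query this, never the master."""
    def __init__(self, site: str):
        self.site, self._rows = site, {}

    def apply(self, key: str, value: str) -> None:
        self._rows[key] = value

    def read(self, key: str) -> str:
        return self._rows[key]

master = MasterDB()
lyon, karlsruhe = ReplicaDB("IN2P3"), ReplicaDB("FZK")
master.attach_replica(lyon)
master.attach_replica(karlsruhe)
master.write("calib/run1234", "pedestal=3.2")
print(lyon.read("calib/run1234"), karlsruhe.read("calib/run1234"))
```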
LCG depends on two major science grid infrastructures: EGEE – Enabling Grids for E-sciencE, and OSG – the US Open Science Grid. Interoperability & interoperation is vital: significant effort has gone into building the procedures to support it.
Enabling Grids for E-sciencE (EGEE-II, contract INFSO-RI-…)
… sites, 45 countries, 45,000 CPUs, 12 PetaBytes, >5000 users, >100 VOs, >100,000 jobs/day.
Disciplines: archaeology, astronomy, astrophysics, civil protection, computational chemistry, earth sciences, finance, fusion, geophysics, high energy physics, life sciences, multimedia, material sciences, …
Grid infrastructure project co-funded by the European Commission – now in its 2nd phase with 91 partners in 32 countries.
EGEE: Increasing workloads ⅓ non-LHC
Grid Applications: medical, seismology, chemistry, astronomy, fusion, particle physics.
Share of EGEE resources, 5/07 – 4/08: 45 million jobs; largest share: HEP.
HEP use of EGEE: May 07 – Apr 08
The next step
Sustainability: Beyond EGEE-II
Need to prepare a permanent, common Grid infrastructure: ensure the long-term sustainability of the European e-infrastructure independent of short project funding cycles; coordinate the integration and interaction between National Grid Infrastructures (NGIs); operate the European level of the production Grid infrastructure for a wide range of scientific disciplines, linking the NGIs.
EGI – European Grid Initiative
EGI Design Study: proposal to the European Commission (started Sept 07), supported by 37 National Grid Initiatives (NGIs). A 2-year project to prepare the setup and operation of a new organizational model for a sustainable pan-European grid infrastructure after the end of EGEE-3.
Summary
We have an operating production-quality grid infrastructure that:
Is in continuous use by all 4 experiments (and many other applications);
Is still growing in size – sites, resources (and still to finish the ramp-up for LHC start-up);
Demonstrates interoperability (and interoperation!) between 3 different grid infrastructures (EGEE, OSG, Nordugrid);
Is becoming more and more reliable;
Is ready for LHC start-up.
For the future we must:
Learn how to reduce the effort required for operation;
Tackle upcoming issues of infrastructure (e.g. power, cooling);
Manage the migration of the underlying infrastructures to longer-term models;
Be ready to adapt the WLCG service to new ways of doing distributed computing.