The Worldwide LHC Computing Grid BEGrid Seminar


The Worldwide LHC Computing Grid BEGrid Seminar Frédéric Hemmer CERN IT Deputy Department Head 23rd October 2008

Outline: The LHC and detectors; computing challenges; current grid usage; EGEE & OSG; site reliability & middleware.

LHC: a very large scientific instrument… An aerial view of the LHC footprint (26.7 km circumference) with Lake Geneva and Mont Blanc in the background

… based on advanced technology 23 km of superconducting magnets cooled in superfluid helium at 1.9 K A view of the interconnected string of superconducting magnets installed in the accelerator tunnel

CMS Closed & Ready for First Beam 3 Sept 2008

The ATLAS experiment: 7000 tons, 150 million sensors generating data 40 million times per second, i.e. about a petabyte per second.

A collision at LHC

The Data Acquisition

Tier 0 at CERN: Acquisition, First-pass processing, Storage & Distribution (1.25 GB/sec for heavy ions). The next two slides illustrate what happens to the data as it moves out from the experiments. CMS and ATLAS each produce data at the rate of one DVD-worth every 15 seconds or so, while the rates for LHCb and ALICE are somewhat lower. However, during the part of the year when the LHC accelerates lead ions rather than protons, ALICE (an experiment dedicated to this kind of physics) alone will produce data at over 1 gigabyte per second (one DVD every 4 seconds). Initially the data is sent to the CERN Computer Centre – the Tier 0 – for storage on tape. Storage also implies guardianship of the data for the long term – the lifetime of the LHC, at least 20 years. This is not passive guardianship: it requires migrating data to new technologies as they arrive. We need large-scale, sophisticated mass storage systems that not only manage the incoming data streams, but also allow for evolution of technology (tapes and disks) without hindering access to the data. The Tier 0 centre provides the initial level of data processing – calibration of the detectors and the first reconstruction of the data.
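To make these rates concrete, here is a small back-of-the-envelope check in Python of the figures quoted in the notes; the 4.7 GB single-layer DVD capacity is an assumption used only as a unit of comparison.

```python
# Illustrative check of the rates in the notes above, using a nominal
# 4.7 GB single-layer DVD purely as a unit of comparison (an assumption).
DVD_GB = 4.7

atlas_cms_rate_gb_s = DVD_GB / 15            # "one DVD-worth every ~15 s"
print(f"ATLAS/CMS to the Tier 0: ~{atlas_cms_rate_gb_s * 1000:.0f} MB/s each")

alice_ion_rate_gb_s = 1.25                   # heavy-ion rate quoted on the slide
print(f"ALICE (ions): one DVD every ~{DVD_GB / alice_ion_rate_gb_s:.1f} s")
```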

Data Handling and Computation for Physics Analysis. (Slide diagram: data flows from the detector through the event filter (selection & reconstruction) to raw data; reconstruction produces event summary data; batch physics analysis and event reprocessing produce analysis objects, extracted by physics topic, which feed interactive physics analysis; event simulation feeds in alongside the real data.)

The LHC Computing Challenge. Signal/noise: ~10^-9. Data volume: high rate × large number of channels × 4 experiments → 15 petabytes of new data each year. Compute power: event complexity × number of events × thousands of users → 100k of today's fastest CPUs and 45 PB of disk storage. Worldwide analysis & funding: computing is funded locally in major regions & countries, with efficient analysis everywhere → grid technology. The challenge faced by LHC computing is primarily one of data volume and data management. The scale and complexity of the detectors – the large number of "pixels", if one can envisage them as huge digital cameras – and the high rate of collisions – some 600 million per second – mean that we will need to store around 15 petabytes of new data each year. This is equivalent to about 3 million standard DVDs. Processing this volume of data requires large numbers of processors: about 100,000 processor cores are available in WLCG today, and this need will grow over the coming years, as will the 45 PB of disk currently required for data storage and analysis. This can only be achieved through a worldwide effort, with locally funded resources in each country being brought together into a virtual computing cloud through the use of grid technology.
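The headline numbers on this slide follow from simple arithmetic; a minimal Python sketch (using round figures as assumptions: 1 PB = 10^15 bytes, 4.7 GB per DVD) reproduces them.

```python
# Round-figure arithmetic behind the slide's numbers (a sketch, not official
# accounting): 1 PB taken as 1e15 bytes, a DVD as 4.7e9 bytes.
PB = 1e15
DVD_BYTES = 4.7e9

annual_volume = 15 * PB
print(f"15 PB/year ~ {annual_volume / DVD_BYTES / 1e6:.1f} million DVDs")

collisions_per_s = 600e6        # from the speaker notes
signal_to_noise = 1e-9          # from the slide
print(f"~{collisions_per_s * signal_to_noise:.1f} 'interesting' events per second")
```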

The Worldwide LHC Computing Grid. The LHC Grid Service is a worldwide collaboration between the 4 LHC experiments, the ~140 computer centres that contribute resources, and the international grid projects providing software and services. The collaboration is brought together by a MoU that commits resources for the coming years and agrees a certain level of service availability and reliability. As of today 33 countries have signed the MoU: CERN (Tier 0) plus 11 large Tier 1 sites, and 130 Tier 2 sites in 60 "federations". Other sites are expected to participate, but without formal commitment. The WLCG is a worldwide collaboration put together to first prototype, and then put into production, the computing environment for the LHC experiments. It is a collaboration between the 4 experiments and the computer centres, as well as with several national and international grid projects that have developed software and services. The collaboration has agreed a Memorandum of Understanding that commits certain levels of resources – processing power, storage, and networking – for the coming years, and agrees levels of availability and reliability for the services. In addition to sites whose funding agencies have made formal commitments via the MoU, we anticipate that many other universities and labs that collaborate in LHC will participate with resources, and eventually join the collaboration.

Tier 0 – Tier 1 – Tier 2. Tier-0 (CERN): data recording, initial data reconstruction, data distribution. Tier-1 (11 centres): permanent storage, re-processing, analysis. Tier-2 (~130 centres): simulation, end-user analysis. The Tier 0 centre at CERN stores the primary copy of all the data. A second copy is distributed between the 11 so-called Tier 1 centres. These are large computer centres in different geographical regions of the world that also have a responsibility for long-term guardianship of the data. The data is sent from CERN to the Tier 1s in real time over dedicated network connections. In order to keep up with the data coming from the experiments, this transfer must be capable of running at around 1.3 GB/s continuously – equivalent to a full DVD every 3 seconds. The Tier 1 sites also provide the second level of data processing and produce data sets which can be used to perform the physics analysis. These data sets are sent from the Tier 1 sites to the around 130 Tier 2 sites. A Tier 2 is typically a university department or physics laboratory; Tier 2s are located all over the world, in most of the countries that participate in the LHC experiments, and are often associated with a Tier 1 site in their region. It is at the Tier 2s that the real physics analysis is performed.
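As an illustration of what the ~1.3 GB/s continuous export from the Tier 0 means in practice, a short hedged calculation (assuming the rate really is sustained around the clock, DVD = 4.7 GB):

```python
# What a sustained 1.3 GB/s export from the Tier 0 amounts to (illustrative;
# assumes the rate is held around the clock, DVD taken as 4.7 GB).
rate_gb_s = 1.3
DVD_GB = 4.7
SECONDS_PER_DAY = 86400

print(f"one DVD-worth every ~{DVD_GB / rate_gb_s:.1f} s")
print(f"~{rate_gb_s * SECONDS_PER_DAY / 1000:.0f} TB shipped to the Tier 1s per day")
```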

Evolution of Grids. (Timeline, 1994–2008: in the US, GriPhyN, iVDGL and PPDG lead to Grid3 and then OSG; in Europe, the EU DataGrid leads to EGEE 1, 2 and 3; LCG 1 and LCG 2 evolve into WLCG, with data challenges, service challenges, cosmics and first physics marked along the way.) What I have just described is the working system as it is today, based on the "tiered" model first proposed in the 90s as a way to distribute the data and processing to the experiment collaborations. At that time it was not clear how such a distributed model would be implemented. However, at the end of the 90s the concept of grid computing was becoming popular and was seen as a way to actually implement the LHC distributed model. Many grid projects were initiated in the High Energy Physics community, both in Europe and in the US, to prototype these ideas in dedicated test set-ups. The LCG project started in 2002 and was the first to really try to integrate grid computing into the production environments of the computing centres that would be providing the computing power. The LCG service started in 2003 with a handful of sites, and in 2004 the Enabling Grids for E-sciencE (EGEE) project was funded in Europe to build on the developments made by LCG and to broaden the availability to other sciences. The Open Science Grid (OSG) in the US started in late 2005 with similar goals. These large science grid infrastructures (and other national grids) today provide the infrastructure on which the Worldwide LCG collaboration builds its service. As you see on the slide, since the beginning, in parallel to building up the infrastructure and service, we have been executing a series of data and service challenges.

Preparation for accelerator start-up. Since 2004 WLCG has been running a series of challenges to demonstrate aspects of the system, with increasing targets for data throughput, workloads, and service availability and reliability, culminating in a one-month challenge in May with all 4 experiments running realistic work (simulating what will happen in data taking). This demonstrated that we were ready for real data. In essence the LHC Grid service has been running for several years.

Recent grid use. The grid concept really works – contributions large & small all add to the overall effort! CERN: 11%, Tier 1: 35%, Tier 2: 54%. The next couple of slides illustrate the success of the grid concept for LHC. These graphics show the distribution of computer time used in the first part of this year. As you can see, CERN is providing around 10% of the processing, the Tier 1 sites together about 35%, and more than 50% comes from the Tier 2s. In real data-taking conditions this sharing will be more like CERN 20%, Tier 1s 40%, Tier 2s 40%. Note that the left-hand graphic shows that all of the Tier 2s can contribute: there are large contributions but also very small ones. Everyone is able to contribute their share and participate in the analysis. This shows that grid technology really is able to bring together all of these distributed resources in a usable way.

Recent grid activity. In readiness testing WLCG ran more than 10 million jobs/month (1 job is ~8 hours' use of a single processor), i.e. about 350k jobs/day. Here you can see the growth in the workload on the grid over the last 18 months, with May of this year as our final readiness challenge. The plot shows the number of jobs (individual pieces of analysis work, each running for about 8 to 12 hours on a single processor) run per month. In May we ran more than 10 million such jobs – about 350,000 per day – which is the scale required in real LHC running over the next year. The system is able to manage this level of complexity and load. More importantly, this is done in a way that is supportable by the staff in the computer centres that make up the grid. The goals of reliability and responsiveness to problems are achievable with the procedures and staff that we have. This is the result of significant efforts over the past few years to improve and harden what is still a very young technology. These workloads are at the level anticipated for 2009 data.
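A rough consistency check, purely illustrative, between the job counts quoted here and the ~100,000 processor cores mentioned earlier (assuming ~8 hours per job on a single core and a 30-day month):

```python
# Illustrative consistency check: 10 million jobs/month at ~8 CPU-hours each,
# versus the ~100k cores quoted earlier (assumptions: 30-day month, 1 core/job).
jobs_per_month = 10e6
hours_per_job = 8

jobs_per_day = jobs_per_month / 30
cpu_hours_per_day = jobs_per_day * hours_per_job
cores_kept_busy = cpu_hours_per_day / 24

print(f"~{jobs_per_day / 1000:.0f}k jobs/day, "
      f"keeping ~{cores_kept_busy / 1000:.0f}k cores busy around the clock")
```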

Data transfer out of Tier 0. The full experiment rate needed is 650 MB/s; we want the capability to sustain twice that, to allow Tier 1 sites to shut down and recover. We have demonstrated far in excess of that: all experiments exceeded the required rates for extended periods, and simultaneously, and all Tier 1s achieved (or exceeded) their target acceptance rates. The other aspect of performance vital to the success of LHC is our ability to move data: both at the rates needed to keep up with the experiments, and while ensuring the reliability of the transfers and the integrity of the data. Each of the Tier 1 sites has a dedicated 10 Gb/s primary connection to CERN plus backup links. The Tier 1s and Tier 2s are interconnected through the standard academic internet in each country. The two graphics here show the data throughput out of CERN to the Tier 1 sites. In data taking the experiments together push data at 650 MB/s; our target was twice that (1.3 GB/s) to allow for failures and recovery. As can be seen, the target is easily achieved and exceeded, and can be sustained for extended periods. In the testing, each of the experiments showed that its individual data rates can be achieved, and each of the Tier 1s demonstrated that it can accept data at its target rate or more.
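Why the factor-of-two headroom matters can be seen from a simplified model (a sketch under stated assumptions, not a WLCG specification): while a Tier 1 is unavailable, its share of the data backs up at the nominal rate, and only the spare capacity above nominal is available to drain the backlog afterwards.

```python
# Simplified outage model (an assumption-laden sketch, not a WLCG spec):
# during an outage, data for the affected Tier 1 backs up at the nominal rate;
# afterwards only the spare capacity (total minus nominal) drains the backlog.
nominal_mb_s = 650.0
total_mb_s = 2 * nominal_mb_s          # the "twice nominal" target on the slide
outage_hours = 12.0                    # hypothetical outage length

backlog_mb = nominal_mb_s * outage_hours * 3600
catchup_hours = backlog_mb / (total_mb_s - nominal_mb_s) / 3600

print(f"~{backlog_mb / 1e6:.1f} TB backlog, cleared in ~{catchup_hours:.0f} h")
```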

Production Grids. WLCG relies on a production-quality infrastructure. This requires standards of availability/reliability, performance and manageability, and it will be used 365 days a year... (it has been for several years!). The Tier 1s must store the data for at least the lifetime of the LHC – ~20 years. This is not passive: it requires active migration to newer media. It is vital that we build a fault-tolerant and reliable system that can deal with individual sites being down and recover. It is important to understand that this is a facility which will be in continuous use for many years to come. We have to run at high levels of service performance and reliability. However, the grid infrastructure allows us to avoid single points of failure and means that the service as a whole can continue even if parts of it are down for maintenance or because of the problems that will surely arise. The challenge over the coming years will be to integrate new technologies and services, and to evolve while maintaining an ongoing and reliable service for the experiments.

WLCG depends on two major science grid infrastructures – EGEE (Enabling Grids for E-sciencE) and OSG (the US Open Science Grid) – as well as many national grid projects. Interoperability & interoperation are vital, and significant effort has gone into building the procedures to support them. The LHC computing service is a truly worldwide collaboration which today relies on 2 major science grid infrastructure projects, as well as several other regional or national scale projects, that all work together towards a common goal. While LHC has very significant requirements and has been the driving force behind many of these projects, it has also been the binding that has fostered worldwide collaboration in computing. The potential of such large-scale infrastructures for other sciences is becoming clear, and we anticipate that this will be a direct benefit of the LHC computing needs.

Impact of the LHC Computing Grid in Europe. LCG has been the driving force for the European multi-science grid EGEE (Enabling Grids for E-sciencE). EGEE is now a global effort, and the largest grid infrastructure worldwide. It is co-funded by the European Commission (cost: ~130 M€ over 4 years, of which ~70 M€ is EU funding). EGEE is already used for >100 applications, including bio-informatics, education and training, and medical imaging.

EGEE: a grid infrastructure project co-funded by the European Commission – now in its 3rd phase, with partners in 45 countries. 240 sites, 45 countries, 45,000 CPUs, 12 petabytes, >5000 users, >100 VOs, >100,000 jobs/day. Application areas: archaeology, astronomy, astrophysics, civil protection, computational chemistry, earth sciences, finance, fusion, geophysics, high energy physics, life sciences, multimedia, material sciences, … In Europe, EGEE – which started in 2004 from the service that LCG had built over the preceding 18 months – has now grown to be the world's largest scientific grid infrastructure. While LHC is by far its biggest user, it supports a whole range of other scientific applications, from life sciences to physics and climate modelling. It is itself a worldwide infrastructure, with partners in around 45 countries.

OSG Project: supported by the Department of Energy & the National Science Foundation. Access to 45,000 cores, 6 petabytes of disk, 15 petabytes of tape; >15,000 CPU-days/day. ~85% physics (LHC, Tevatron Run II, LIGO), ~15% non-physics (biology, climate, text mining, …), including ~20% opportunistic use of others' resources. Virtual Data Toolkit: common software developed between computer science & applications, used by OSG and others. (Slide map: OSG sites across the US, from NERSC and LBL to BNL, FNAL, ANL and many university sites.) In the US, the Open Science Grid plays a similar role to EGEE. Again, its largest user is LHC, but it supports many other scientific disciplines and has partners outside of the US. Of course EGEE and OSG have very similar goals, and they also collaborate closely with each other – often driven by the common needs of supporting LHC. Partnering with: US LHC Tier-1s, Tier-2s and Tier-3s; campus grids (Clemson, Wisconsin, Fermilab, Purdue); regional & national grids (TeraGrid, New York State Grid, EGEE, UK NGS); international collaboration (South America, Central America, Taiwan, Korea, UK).

Sustainability. We need to prepare for a permanent grid infrastructure: ensure a high quality of service for all user communities, independent of short project funding cycles, with the infrastructure managed in collaboration with National Grid Initiatives (NGIs) and a European Grid Initiative (EGI).

Middleware: Baseline Services. The basic baseline services – from the TDR (2005): Storage Elements (Castor, dCache, DPM; StoRM added in 2007; SRM 2.2 deployed in production in Dec 2007); basic transfer tools (GridFTP, …); the File Transfer Service (FTS); the LCG File Catalog (LFC); LCG data management tools (lcg-utils); POSIX I/O via the Grid File Access Library (GFAL); synchronised databases between Tier 0 and Tier 1s (3D project); the Information System (with scalability improvements); Compute Elements (Globus/Condor-C, improvements to the LCG-CE for scale/reliability, web services via CREAM); support for multi-user pilot jobs (glexec, SCAS); gLite Workload Management in production; the VO Management System (VOMS); VO Boxes; application software installation; job monitoring tools; … with continuing evolution in reliability, performance, functionality and requirements.
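To give a flavour of what workload management looks like from a user's point of view, the sketch below writes out a minimal gLite-style JDL (Job Description Language) file of the kind submitted to the Workload Management System; the attribute values, file names and the submission command mentioned in the comment are illustrative assumptions, not taken from the slides.

```python
# Purely illustrative: write out a minimal gLite-style JDL job description.
# A real submission would then use something like
#     glite-wms-job-submit -a hello.jdl
# (shown only for flavour; the VO name and file names here are made up).
jdl = """\
Executable          = "/bin/hostname";
Arguments           = "-f";
StdOutput           = "hello.out";
StdError            = "hello.err";
OutputSandbox       = {"hello.out", "hello.err"};
VirtualOrganisation = "dteam";
"""

with open("hello.jdl", "w") as job_file:
    job_file.write(jdl)

print(jdl)
```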

Site Reliability – CERN + Tier-1s. "Site reliability" is a function of grid services, middleware, site operations, storage management systems, networks, … Targets for CERN + Tier-1s: before July 88%, July 07 91%, Dec 07 93% (each site); average over the last 3 months 89%; 8 best sites 95%.
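To translate these percentages into something tangible, a small illustrative calculation of the downtime a monthly reliability target allows (assuming a 30-day month and that reliability is a simple uptime fraction):

```python
# What a monthly reliability target allows in downtime (illustrative; assumes
# a 30-day month and that "reliability" is a plain uptime fraction).
hours_per_month = 30 * 24

for target in (0.88, 0.91, 0.93, 0.95):
    allowed_downtime_h = (1 - target) * hours_per_month
    print(f"{target:.0%} -> at most ~{allowed_downtime_h:.0f} h of downtime per month")
```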

Tier-2 Site Reliability: 83 Tier-2 sites being monitored.

Improving Reliability: monitoring, metrics, workshops, data challenges, experience, systematic problem analysis, and priority from software developers.

Future Middleware … an opinion. Today's middleware is complex, and the support effort is large (software & operational): we built a production service from "prototypes" that are often too complex for the need. The promises of Web Services have not been delivered – except in proprietary or tightly organised implementations; robust tools do not exist. Technology changes rapidly: cloud computing offers very (too) simple interfaces and virtualisation of environments, and global file systems now seem a realistic expectation. A reliable messaging system is a better "glue" for heterogeneous distributed services – a cursory analysis suggests that most interactions (even "synchronous" ones) can be implemented this way. We need to work with industry (and open source) to better integrate remote interfaces to storage and batch systems; some of today's complexity is due to the need to add external gateways. E.g. LSF, SGE and PBS could have a (full) BES interface – then we don't need a CE. SRM is too complex as an SE interface – what is a better way? A filesystem? We should aim to get as much as possible delivered with the OS, or as "standard" layered software packages.
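As a toy illustration of the claim that even "synchronous" interactions can be layered on a messaging system, the following self-contained Python sketch implements a blocking call as "publish a request, wait for the correlated reply"; all names are invented for the example and no real grid middleware is involved.

```python
# Toy sketch (standard library only, all names invented): a blocking call
# implemented as "publish a request message, wait for the correlated reply" --
# the pattern the slide suggests as glue between heterogeneous grid services.
import queue
import threading
import uuid

requests = queue.Queue()   # stands in for a messaging system's request topic
replies = {}               # correlation id -> per-call reply queue

def toy_storage_service():
    """Pretend 'remote' service: consumes requests, publishes correlated replies."""
    while True:
        message = requests.get()
        if message is None:            # shutdown signal for the toy service
            break
        corr_id, payload = message
        replies[corr_id].put(f"stored {payload!r} OK")

def synchronous_call(payload, timeout=5.0):
    """Looks synchronous to the caller, but rides on asynchronous messages."""
    corr_id = uuid.uuid4().hex
    replies[corr_id] = queue.Queue()
    requests.put((corr_id, payload))
    return replies[corr_id].get(timeout=timeout)

threading.Thread(target=toy_storage_service, daemon=True).start()
print(synchronous_call("/grid/atlas/event123.raw"))
requests.put(None)
```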

The LHC Grid Service: a worldwide collaboration. It has been in production for several years, is now being used for real data, and is ready to face the computing challenges as the LHC gets up to full speed. Many, many people have contributed to building up the grid service to today's levels. We have taken a new and very immature technology and worked to make it usable at a scale and at levels of reliability that are unprecedented in a distributed system. This has been a fantastic achievement and, like the LHC itself, is really the fruit of a truly worldwide effort. Such an effort in computing is also a first. We have relied upon these people to get to this point, and now we rely on them to support and evolve the service over the coming years. In fact the challenge is really only beginning now.
