1
The LHC Grid Service: A worldwide collaboration
Ian Bird, LHC Computing Grid Project Leader
LHC Grid Fest, 3rd October 2008
2
Introduction

The LHC Grid Service is a worldwide collaboration between:
- the 4 LHC experiments and the ~140 computer centres that contribute resources
- the international grid projects providing software and services

The collaboration is brought together by a MoU that:
- commits resources for the coming years
- agrees a certain level of service availability and reliability

As of today 33 countries have signed the MoU: CERN (Tier 0) + 11 large Tier 1 sites, and 130 Tier 2 sites in 60 "federations". Other sites are expected to participate, but without formal commitment.

The WLCG is a worldwide collaboration put together in order to initially prototype, and then to put into production, the computing environment for the LHC experiments. It is a collaboration between the 4 experiments and the computer centres, as well as with several national and international grid projects that have developed software and services. The collaboration has agreed a Memorandum of Understanding that commits certain levels of resources – processing power, storage, and networking – for the coming years, and agrees levels of availability and reliability for the services. In addition to sites whose funding agencies have made formal commitments via the MoU, we anticipate many other universities and labs that collaborate in LHC to participate with resources, and eventually to join the collaboration.
3
The LHC Computing Challenge
Signal/Noise: 10^-9

Data volume: high rate * large number of channels * 4 experiments = 15 PetaBytes of new data each year
Compute power: event complexity * number of events * thousands of users = 100k of (today's) fastest CPUs, plus 45 PB of disk storage
Worldwide analysis & funding: computing is funded locally in major regions & countries, yet efficient analysis is needed everywhere = GRID technology

The challenge faced by LHC computing is primarily one of data volume and data management. The scale and complexity of the detectors – the large number of "pixels" if one can envisage them as huge digital cameras, and the high rate of collisions – some 600 million per second – mean that we will need to store around 15 Petabytes of new data each year. This is equivalent to about 3 million standard DVDs. Processing this volume of data requires large numbers of processors: about 100,000 processor cores are available in WLCG today, and this need will grow over the coming years, as will the 45 PB of disk currently required for data storage and analysis. This can only be achieved through a worldwide effort, with locally funded resources in each country being brought together into a virtual computing cloud through the use of grid technology.
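As a quick, purely illustrative sanity check of the figures quoted above (assuming decimal units and a 4.7 GB single-layer DVD; this snippet is not part of any WLCG software):

# Back-of-envelope check of the slide's numbers (illustrative only)
PB = 1e15                    # bytes in a petabyte (decimal convention)
GB = 1e9                     # bytes in a gigabyte
annual_data = 15 * PB        # new data recorded each year
dvd_capacity = 4.7 * GB      # single-layer DVD
print(annual_data / dvd_capacity)   # ~3.2 million DVDs, matching the "about 3 million" comparison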
4
Tier 0 at CERN: Acquisition, First-pass processing, Storage & Distribution
1.25 GB/sec (ions)

The next two slides illustrate what happens to the data as it moves out from the experiments. Each of CMS and ATLAS produces data at the rate of 1 DVD-worth every 15 seconds or so, while the rates for LHCb and ALICE are somewhat lower. However, during the part of the year when LHC will accelerate lead ions rather than protons, ALICE (which is an experiment dedicated to this kind of physics) alone will produce data at a rate of over 1 Gigabyte per second (1 DVD every 4 seconds). Initially the data is sent to the CERN Computer Centre – the Tier 0 – for storage on tape. Storage also implies guardianship of the data for the long term – the lifetime of the LHC, at least 20 years. This is not passive guardianship but requires migrating data to new technologies as they arrive. We need large-scale, sophisticated mass storage systems that not only are able to manage the incoming data streams, but also allow for evolution of technology (tapes and disks) without hindering access to the data. The Tier 0 centre provides the initial level of data processing: calibration of the detectors and the first reconstruction of the data.
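The DVD analogies translate into byte rates roughly as follows (a minimal sketch, again assuming a 4.7 GB single-layer DVD; illustrative only):

dvd = 4.7e9                  # bytes on a single-layer DVD
print(dvd / 15 / 1e9)        # ~0.31 GB/s for "one DVD every 15 seconds" (ATLAS or CMS)
print(dvd / 1.25e9)          # ~3.8 s per DVD at the 1.25 GB/s heavy-ion rate quoted for ALICE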
5
Tier 0 – Tier 1 – Tier 2

Tier-0 (CERN): data recording, initial data reconstruction, data distribution
Tier-1 (11 centres): permanent storage, re-processing, analysis
Tier-2 (~130 centres): simulation, end-user analysis

The Tier 0 centre at CERN stores the primary copy of all the data. A second copy is distributed between the 11 so-called Tier 1 centres. These are large computer centres in different geographical regions of the world that also have a responsibility for long-term guardianship of the data. The data is sent from CERN to the Tier 1s in real time over dedicated network connections. In order to keep up with the data coming from the experiments, this transfer must be capable of running at around 1.3 GB/s continuously. This is equivalent to a full DVD every 3 seconds. The Tier 1 sites also provide the second level of data processing and produce data sets which can be used to perform the physics analysis. These data sets are sent from the Tier 1 sites to the around 130 Tier 2 sites. A Tier 2 is typically a university department or physics laboratory; Tier 2s are located all over the world in most of the countries that participate in the LHC experiments. Often, Tier 2s are associated with a Tier 1 site in their region. It is at the Tier 2s that the real physics analysis is performed.
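The division of labour between the tiers can be summarised in a small, purely hypothetical data structure (the names below are illustrative and not taken from any WLCG software):

# Hypothetical summary of the tier model described on this slide; not WLCG code.
tiers = {
    "Tier-0": {"sites": 1,   "roles": ["data recording", "initial reconstruction", "distribution to Tier-1s"]},
    "Tier-1": {"sites": 11,  "roles": ["permanent storage", "re-processing", "analysis"]},
    "Tier-2": {"sites": 130, "roles": ["simulation", "end-user analysis"]},
}
for name, tier in tiers.items():
    print(f"{name}: {tier['sites']} site(s); roles: {', '.join(tier['roles'])}")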
6
Evolution of Grids

Timeline (1994-2008): EU DataGrid; GriPhyN, iVDGL, PPDG; GRID 3; LCG 1; LCG 2; EGEE 1, 2, 3; OSG; WLCG. Milestones along the way: Data Challenges, Service Challenges, Cosmics, First physics.

What I have just described is the working system as it is today, based on the "tiered" model first proposed in the 90's as a way to distribute the data and processing to the experiment collaborations. At that time it was not clear how such a distributed model would be implemented. However, at the end of the 90's the concept of grid computing was becoming popular and was seen as a way to actually implement the LHC distributed model. Many grid projects were initiated in the High Energy Physics community, both in Europe and in the US, to prototype these ideas in dedicated test set-ups. The LCG project started in 2002 and was the first to really try to integrate grid computing into the production environments of the computing centres that would be providing the computing power. The LCG service started in 2003 with a handful of sites, and in 2004 the Enabling Grids for E-sciencE (EGEE) project was funded in Europe to build on the developments made by LCG and to broaden the availability to other sciences. The Open Science Grid (OSG) in the US started in late 2005 with similar goals. These large science grid infrastructures (and other national grids) today provide the infrastructure on which the Worldwide LCG collaboration builds its service. As you see on the slide, since the beginning, in parallel to building up the infrastructure and service, we have been executing a series of data and service challenges.
7
Preparation for accelerator start up
Since 2004 WLCG has been running a series of challenges to demonstrate aspects of the system, with increasing targets for:
- data throughput
- workloads
- service availability and reliability
These culminated in a 1-month challenge in May, with all 4 experiments running realistic work (simulating what will happen in data taking), which demonstrated that we were ready for real data. In essence the LHC Grid service has been running for several years.
8
Recent grid use

Share of CPU time delivered in the first part of 2008: CERN 11%, Tier 1s 35%, Tier 2s 54%.
The grid concept really works – all contributions, large & small, contribute to the overall effort!

The next couple of slides illustrate the success of the grid concept for LHC. These graphics show the distribution of computer time used in the first part of this year. As you can see, CERN is providing around 10% of the processing, the Tier 1 sites together about 35%, and more than 50% comes from the Tier 2s in this example. In real data-taking conditions this sharing will be more like CERN 20%, Tier 1s 40%, Tier 2s 40%. But note that the left-hand graphic shows that all of the Tier 2s can really contribute. There are large contributions but also very small ones. Everyone is able to contribute their share and participate in the analysis. This shows that this technology really is able to bring together all of these distributed resources in a usable way.
9
Recent grid activity

In readiness testing WLCG ran more than 10 million jobs/month (1 job is ~8 hours' use of a single processor), about 350k/day.

Here you can see the growth in the workload on the grid over the last 18 months, with May of this year as our final readiness challenge. The plot shows the number of jobs (individual pieces of analysis work, each running for about 8 to 12 hours on a single processor) run per month. In May we ran more than 10 million such jobs – or about 350,000 per day – which is the scale required in real LHC running over the next year. The system is able to manage this level of complexity and load. More importantly, this is done in a way that is supportable by the staff in the computer centres that make up the grid. The goals of reliability and responsiveness to problems are achievable with the procedures and staff that we have. This is the result of significant efforts over the past few years to improve and robustify this really very young technology.

These workloads are at the level anticipated for 2009 data.
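A rough consistency check (illustrative arithmetic only) shows how these job counts relate to the ~100,000 CPUs mentioned earlier:

jobs_per_day = 350_000
hours_per_job = 8                        # typical single-processor job length quoted on the slide
cpu_hours_per_day = jobs_per_day * hours_per_job
print(cpu_hours_per_day)                 # 2,800,000 CPU-hours of work per day
print(cpu_hours_per_day / 24)            # ~117,000 cores kept busy on average, the same order as the ~100k CPUs quoted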
10
Data transfer out of Tier 0
Full experiment rate needed is 650 MB/s
Desired capability: sustain twice that (1.3 GB/s), to allow Tier 1 sites to shut down and recover
Have demonstrated far in excess of that:
- all experiments exceeded the required rates for extended periods, and simultaneously
- all Tier 1s achieved (or exceeded) their target acceptance rates

The other aspect of performance that is vital to the success of LHC is our ability to move data: both at the rates needed to keep up with the experiments, and ensuring the reliability of the transfers and the integrity of the data. Each of the Tier 1 sites has a dedicated 10 Gb/s primary connection to CERN and also many backup links. The Tier 1s and Tier 2s are interconnected through the standard academic internet in each country. The two graphics here show the data throughput rate out of CERN to the Tier 1 sites. In data taking the experiments together push data at 650 MB/s; our target was twice that (1.3 GB/s) to allow for failures and recovery. As can be seen, the target is easily achieved and exceeded, and can be sustained for extended periods. In the testing, each of the experiments showed that their individual data rates can be achieved; each of the Tier 1s also demonstrated that they can accept data at their target rates or more.
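To put the target rate in context, a short illustrative calculation (assuming the 10 Gb primary links are 10 Gb/s and using decimal units):

target_rate = 1.3e9                      # bytes/s, twice the 650 MB/s combined experiment rate
print(target_rate * 86_400 / 1e12)       # ~112 TB exported from CERN per day if sustained around the clock
link_capacity = 10e9 / 8                 # a 10 Gb/s link expressed in bytes/s
print(link_capacity / 1e9)               # 1.25 GB/s raw capacity on each Tier 1 primary link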
11
Production Grids

WLCG relies on a production-quality infrastructure. This requires standards of:
- availability/reliability
- performance
- manageability
It will be used 365 days a year ... (and has been for several years!)
Tier 1s must store the data for at least the lifetime of the LHC, ~20 years. This is not passive: it requires active migration to newer media. It is vital that we build a fault-tolerant and reliable system that can deal with individual sites being down, and recover.

It is important to understand that this is a facility which will be in continuous use for many years to come. We have to run at high levels of service performance and reliability. However, the grid infrastructure allows us to avoid single points of failure and means that the service as a whole can continue even if parts of it are down for maintenance or because of the problems that will surely arise. The challenge over the coming years will be for us to integrate new technologies and services, and to evolve while maintaining an ongoing and reliable service for the experiments.
12
WLCG depends on two major science grid infrastructures ….
EGEE - Enabling Grids for E-sciencE
OSG - US Open Science Grid
... as well as many national grid projects

The LHC computing service is a truly worldwide collaboration which today relies on 2 major science grid infrastructure projects, as well as several other regional or national-scale projects, that all work together towards a common goal. While LHC has very significant requirements and has been the driving force behind many of these projects, it has also been the binding that has fostered worldwide collaboration in computing. The potential of such large-scale infrastructures for other sciences is becoming clear, and we anticipate that this will be a direct benefit of the LHC computing needs.

Interoperability & interoperation are vital: significant effort has gone into building the procedures to support them.
13
EGEE: grid infrastructure project co-funded by the European Commission, now in its 2nd phase with 91 partners in 32 countries.
240 sites, 45 countries, 45,000 CPUs, 12 PetaBytes, >5000 users, >100 VOs, >100,000 jobs/day
Application areas: Archaeology, Astronomy, Astrophysics, Civil Protection, Computational Chemistry, Earth Sciences, Finance, Fusion, Geophysics, High Energy Physics, Life Sciences, Multimedia, Material Sciences, …

In Europe, EGEE – which started in 2004 from the service that LCG had built over the preceding 18 months – has now grown to be the world's largest scientific grid infrastructure. While LHC is by far its biggest user, it supports a whole range of other scientific applications, from life sciences to physics, climate modelling, and many others. It is itself a worldwide infrastructure, with partners in around 45 countries.
14
Access to 45,000 Cores, 6 Petabytes Disk, 15 Petabytes Tape
OSG Project: supported by the Department of Energy & the National Science Foundation
>15,000 CPU-days/day; ~85% physics (LHC, Tevatron Run II, LIGO), ~15% non-physics (biology, climate, text mining), including ~20% opportunistic use of others' resources
Virtual Data Toolkit: common software developed between computer science & the applications, used by OSG and others

Map of OSG sites across the US, including national laboratories (FNAL, BNL, ANL, LBL, ORNL, NERSC) and many universities.

In the US, the Open Science Grid plays a similar role to EGEE. Again, its largest user is LHC, but it supports many other scientific disciplines and has partners outside of the US. Of course EGEE and OSG have very similar goals, and they also collaborate closely with each other – often driven by the common needs of supporting LHC.

Partnering with:
- US LHC: Tier-1s, Tier-2s, Tier-3s
- Campus Grids: Clemson, Wisconsin, Fermilab, Purdue
- Regional & National Grids: TeraGrid, New York State Grid, EGEE, UK NGS
- International Collaboration: South America, Central America, Taiwan, Korea, UK
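A small illustrative check of the utilisation figures (assumption: ">15,000 CPU-days/day" means 15,000 CPU-days of work delivered per calendar day):

cpu_days_per_day = 15_000            # CPU-days of work delivered per calendar day (per the slide)
avg_busy_cores = cpu_days_per_day    # 1 CPU-day per day corresponds to 1 core busy around the clock
total_cores = 45_000                 # cores accessible via OSG, per the slide
print(avg_busy_cores / total_cores)  # ~0.33, i.e. roughly a third of the accessible cores busy on average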
15
The LHC Grid Service: A worldwide collaboration
Has been in production for several years
Is now being used for real data
Is ready to face the computing challenges as LHC gets up to full speed

Many, many people have contributed to building up the grid service to today's levels. We have taken a new and very immature technology and worked to make it usable at a scale and with levels of reliability that are unprecedented for a distributed system. This has been a fantastic achievement, and like the LHC itself it is really the fruit of a truly worldwide effort. Such an effort in computing is also a first. We have relied upon these people to get to this point, and now we rely on them to support and evolve the service over the coming years. In fact the challenge is really only beginning now.