LCG Service Challenges Overview


1 LCG Service Challenges Overview
Based on SC conferences and meetings. Victor Zhiltsov, JINR

2 SC3 GOALS
Service Challenge 1 (end of 2004): Demonstrate throughput of 500 MByte/s to a Tier1 in the LCG environment.
Service Challenge 2 (spring 2005): Maintain a cumulative throughput of 500 MByte/s over all Tier1s for a prolonged period, and evaluate the data transfer environment at Tier0 and the Tier1s.
Service Challenge 3 (summer to end of 2005): Show reliable and stable data transfer to each Tier1: 150 MByte/s to disk, 60 MByte/s to tape. All Tier1s and some Tier2s involved.
Service Challenge 4 (spring 2006): Prove that the GRID infrastructure can handle LHC data at the proposed rates (from raw data transfer up to final analysis) with all Tier1s and the majority of Tier2s.
Final goal: Build the production GRID infrastructure across all Tier0, Tier1 and Tier2 sites according to the specific requirements of the LHC experiments.

3 Summary of Tier0/1/2 Roles
Tier0 (CERN): safe keeping of RAW data (first copy); first pass reconstruction; distribution of RAW data and reconstruction output to the Tier1s; reprocessing of data during LHC down-times.
Tier1: safe keeping of a proportional share of RAW and reconstructed data; large scale reprocessing and safe keeping of the corresponding output; distribution of data products to Tier2s and safe keeping of a share of the simulated data produced at those Tier2s.
Tier2: handling analysis requirements and a proportional share of simulated event production and reconstruction. No long term data storage.
N.B. there are differences in roles by experiment - it is essential to test using the complete production chain of each!

4 SC2 met its throughput targets
A daily average of >600 MB/s was sustained for 10 days, from midday 23rd March to midday 2nd April. This was not achieved without outages, but the system showed it could recover the rate after them. Load was reasonably evenly divided over the sites (given the network bandwidth constraints of the Tier-1 sites).

5 Division of Data between Sites (SC2, 10-day period)

Site     Average throughput (MB/s)   Data moved (TB)
BNL      61                          51
FNAL     ?                           ?
GridKA   133                         109
IN2P3    91                          75
INFN     81                          67
RAL      72                          58
SARA     106                         88
TOTAL    600                         500
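As a quick sanity check (not part of the original slides), the "Data moved" column is consistent with each site's average rate sustained over the 10-day window. A minimal Python sketch, using only numbers from the table above:

```python
# Sanity check: volume moved in 10 days at a given sustained average rate (MB/s).
SECONDS_PER_DAY = 86_400
DAYS = 10

def data_moved_tb(avg_rate_mb_s: float, days: int = DAYS) -> float:
    """Terabytes moved when avg_rate_mb_s is sustained for `days` days."""
    return avg_rate_mb_s * SECONDS_PER_DAY * days / 1_000_000  # MB -> TB

rates = {"BNL": 61, "GridKA": 133, "IN2P3": 91, "INFN": 81,
         "RAL": 72, "SARA": 106, "TOTAL": 600}
for site, rate in rates.items():
    print(f"{site:6s} {rate:4d} MB/s -> {data_moved_tb(rate):6.0f} TB")
# 600 MB/s sustained for 10 days gives ~518 TB, in line with the ~500 TB reported.
```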

6 Storage and Software used
Most sites ran Globus GridFTP servers: CCIN2P3, CNAF, GridKa, SARA.
The rest of the sites ran dCache: BNL, FNAL, RAL.
Most sites used local or system-attached disk; FZK used a SAN via GPFS; FNAL used its production CMS dCache, including tape.
Load balancing for the GridFTP sites was done by the RADIANT software running at CERN in push mode.
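The slides do not describe how RADIANT scheduled transfers internally; the sketch below is only a conceptual illustration of "push mode" load balancing, where a central dispatcher at CERN assigns queued files to destination sites. The site list, file names and transfer_file() helper are hypothetical, not the actual RADIANT interface:

```python
# Conceptual sketch of push-mode load balancing: a central dispatcher assigns
# queued files to destination sites in round-robin order. All names are
# illustrative; this is not the RADIANT implementation.
from itertools import cycle

GRIDFTP_SITES = ["CCIN2P3", "CNAF", "GridKa", "SARA"]

def transfer_file(filename: str, site: str) -> None:
    """Placeholder for a GridFTP transfer initiated (pushed) from CERN."""
    print(f"pushing {filename} to {site}")

def push_dispatch(files: list[str]) -> None:
    """Assign each queued file to the next destination site (round-robin)."""
    destinations = cycle(GRIDFTP_SITES)
    for filename in files:
        transfer_file(filename, next(destinations))

push_dispatch([f"sc2_file_{i:04d}.dat" for i in range(8)])
```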

7 Tier0/1 Network Topology
[Network topology diagram, April-July changes: CERN Tier-0 connected via GEANT, NetherLight, ESNet, StarLight and UKLight to the Tier-1 sites (TRIUMF, CNAF, IN2P3, GridKa, SARA, BNL, PIC, Nordic, RAL, ASCC, FNAL), with link capacities ranging from shared 1G and 2x1G up to 10G.]

8 GridPP Estimates of T2 Networking

Experiment  T1s      T2s  Total T2 CPU (kSI2K)  Total T2 Disk (TB)  Avg T2 CPU (kSI2K)  Avg T2 Disk (TB)  Network In (Gb/s)  Network Out (Gb/s)
ALICE       6        21   13700                 2600                652                 124               0.010              0.600
ATLAS       10       30   16200                 6900                540                 230               0.140              0.034
CMS         6 to 10  25   20725                 5450                829                 218               1.000              0.100
LHCb        ?        14   7600                  23                  543                 2                 0.008              ?

1 kSI2K corresponds to 1 Intel Xeon 2.8 GHz processor.

The CMS figure of 1 Gb/s into a T2 comes from the following: each T2 has ~10% of the current RECO data and 1/2 of the AOD (real + MC sample); these data are refreshed every 3 weeks, compatible with the frequency of a major selection pass at the T1s. See CMS Computing Model S-30 for more details.
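To make the 1 Gb/s figure concrete, a back-of-the-envelope check: if the volume refreshed every 3 weeks is taken to be roughly the average CMS T2 disk of 218 TB quoted in the table above (an assumption, not a number stated on the slide), the required inbound bandwidth comes out close to 1 Gb/s:

```python
# Back-of-the-envelope check of the CMS "1 Gb/s into a T2" figure.
# Assumption: the refreshed volume is ~ the average CMS T2 disk (218 TB above).
refresh_volume_tb = 218              # TB refreshed per cycle (assumed)
refresh_period_s = 3 * 7 * 86_400    # 3 weeks in seconds

required_gb_s = refresh_volume_tb * 1e12 * 8 / refresh_period_s / 1e9
print(f"required inbound bandwidth ~ {required_gb_s:.2f} Gb/s")  # ~0.96 Gb/s
```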

9 SC3 – Milestones

10 SC3 – Milestone Decomposition
File transfer goals:
- Build up disk-to-disk transfer speeds to 150 MB/s (SC2 was 100 MB/s, as agreed per site)
- Include tape, with transfer speeds of 60 MB/s
Tier1 goals:
- Bring in additional Tier1 sites with respect to SC2 (PIC and Nordic most likely added later: SC4? US-ALICE T1? Others?)
Tier2 goals:
- Start to bring Tier2 sites into the challenge
- Agree the services that T2s offer / require
- On-going plan (more later) to address this via GridPP, INFN etc.
Experiment goals:
- Address the main offline use cases except those related to analysis, i.e. real data flow out of T0-T1-T2, simulation in from T2-T1
Service goals:
- Include CPU (to generate files) and storage
- Start to add additional components: catalogs, VOs, experiment-specific solutions, 3D involvement, ...
- Choice of software components, validation, fallback, ...

11 Key dates for Connectivity
[Timeline: SC2, SC3, SC4, then LHC Service Operation; cosmics, first beams, first physics and the full physics run across 2005-2008.]
June 05 - Technical Design Report; credibility review by LHCC
Sep 05 - SC3 Service: 8-9 Tier-1s sustain 1 Gbps at the Tier-1s and 5 Gbps at CERN, with extended peaks at 10 Gbps at CERN and some Tier-1s
Jan 06 - SC4 Setup: all Tier-1s; 10 Gbps at >5 Tier-1s, 35 Gbps at CERN
July 06 - LHC Service: all Tier-1s; 10 Gbps at the Tier-1s, 70 Gbps at CERN

12 2005 Sep-Dec - SC4 preparation
Historical slides from Les / Ian.
In parallel with the SC3 model validation period, in preparation for the next service challenge (SC4):
- Using the 500 MByte/s test facility, test PIC and the Nordic T1 plus the T2s that are ready (Prague, LAL, UK, INFN, ...)
- Build up the production facility at CERN to 3.6 GBytes/s
- Expand the capability at all Tier-1s to the full nominal data rate

13 2006 Jan-Aug - SC4
Historical slides from Les / Ian.
SC4 - full computing model services: Tier-0, ALL Tier-1s and all major Tier-2s operational at full target data rates (~2 GB/sec at Tier-0) - acquisition, reconstruction, recording, distribution, PLUS ESD skimming and servicing of Tier-2s.
Goal: a stable test service for one month - April 2006.
100% Computing Model Validation Period (May-August 2006): Tier-0/1/2 full model test, all experiments, 100% nominal data rate, with processing load scaled to 2006 CPUs.

14 2006 Sep - LHC service available
Historical slides from Les / Ian.
The SC4 service becomes the permanent LHC service, available for experiments' testing, commissioning, processing of cosmic data, etc.
All centres ramp up to the capacity needed at LHC startup: TWICE nominal performance.
Milestone to demonstrate this 3 months before first physics data: April 2007.

15 Tier2 Roles
Tier2 roles vary by experiment, but include:
- Production of simulated data;
- Production of calibration constants;
- Active role in [end-user] analysis.
Must also consider services offered to T2s by T1s, e.g. safe-guarding of simulation output and delivery of analysis input.
No fixed dependency between a given T2 and T1 - but 'infinite flexibility' has a cost...

16 Tier2 Functionality
(At least) two distinct cases:
Simulation output: this is relatively straightforward to handle. The most simplistic case is to associate a T2 with a given T1 (this can be reconfigured). Logical unavailability of a T1 could eventually mean that T2 MC production stalls. More complex scenarios are possible - but why? Make it as simple as possible, but no simpler...
Analysis: much less well understood and likely much harder...

17 Tier2s are assumed to offer, in addition to the basic Grid functionality:
- Client services whereby reliable file transfers may be initiated to / from Tier1/0 sites, currently based on the gLite File Transfer software (gLite FTS);
- Managed disk storage with an agreed SRM interface, such as dCache or the LCG DPM.
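As an illustration of the client-side interaction this implies, the sketch below drives the historical gLite FTS command-line client from Python. The service endpoint and SURLs are placeholders, and the exact option names should be checked against the FTS client version actually deployed:

```python
# Minimal sketch: submit one transfer through the gLite FTS command-line client.
# The endpoint and SURLs are placeholders; verify options against your FTS client.
import subprocess

FTS_ENDPOINT = "https://fts-t1.example.org:8443/glite-data-transfer-fts/services/FileTransfer"

def submit_transfer(source_surl: str, dest_surl: str) -> str:
    """Submit a transfer job and return the job identifier printed by the client."""
    result = subprocess.run(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, source_surl, dest_surl],
        check=True, capture_output=True, text=True)
    return result.stdout.strip()

job_id = submit_transfer(
    "srm://t2-se.example.org/dpm/example.org/home/vo/mc/file.root",
    "srm://t1-se.example.org/castor/example.org/grid/vo/mc/file.root")
print("submitted FTS job", job_id)
```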

18 To participate in the Service Challenge, it is required that:
- Tier2 sites install the gLite FTS client and a disk storage manager (dCache?);
- For the throughput phase, no long-term storage of the transferred data is required, but the sites nevertheless need to agree with the corresponding Tier1 that the necessary storage area to which they upload data (analysis is not included in Service Challenge 3) and the gLite FTS backend service are provided.

19 Tier2 Model
As Tier2s do not typically provide archival storage, this is a primary service that must be provided to them, assumed to be via a Tier1. Although no fixed relationship between a Tier2 and a Tier1 should be assumed, a pragmatic approach for Monte Carlo data is nevertheless to associate each Tier2 with a 'preferred' Tier1 that is responsible for long-term storage of the Monte Carlo data produced at that Tier2. By default, it is assumed that data upload from the Tier2 will stall should the Tier1 be logically unavailable. This in turn could imply that Monte Carlo production eventually stalls, if local storage becomes exhausted, but it is assumed that such events are relatively rare and that the production manager of the experiment concerned may in any case reconfigure the transfers to an alternate site in case of a prolonged outage.

20 A Simple T2 Model (1/2)
N.B. this may vary from region to region.
- Each T2 is configured to upload MC data to, and download data via, a given T1.
- In case that T1 is logically unavailable, wait and retry; MC production might eventually stall.
- For data download, retrieve via an alternate route / T1, which may well be at lower speed, but this should hopefully be rare.
- Data residing at a T1 other than the 'preferred' T1 is transparently delivered through an appropriate network route; T1s are expected to have interconnectivity at least as good as that to the T0.
A minimal sketch of the upload logic is given below.
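A minimal sketch of that retry-and-fallback behaviour, assuming a hypothetical upload(file, site) helper; none of these names come from the actual Service Challenge software:

```python
# Conceptual sketch of the simple T2 upload model: try the preferred T1, retry
# while it is unavailable, and fall back to an alternate T1 only after a
# prolonged outage. upload() and the site names are hypothetical placeholders.
import time

PREFERRED_T1 = "t1-preferred.example.org"
ALTERNATE_T1 = "t1-alternate.example.org"

def upload(filename: str, site: str) -> bool:
    """Placeholder for a real SRM/FTS upload; returns True on success."""
    return False  # replace with an actual transfer

def upload_mc_file(filename: str, retry_interval_s: int = 600,
                   max_retries: int = 12) -> str:
    """Upload to the preferred T1, retrying; switch to the alternate on prolonged outage."""
    for _ in range(max_retries):
        if upload(filename, PREFERRED_T1):
            return PREFERRED_T1
        time.sleep(retry_interval_s)  # preferred T1 logically unavailable: wait and retry
    # Prolonged outage: reconfigure the transfer to an alternate site.
    if upload(filename, ALTERNATE_T1):
        return ALTERNATE_T1
    raise RuntimeError(f"could not upload {filename}; local MC production may stall")
```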

21 A Simple T2 Model (2/2)
- Each Tier-2 is associated with a Tier-1 that is responsible for getting it set up.
- Services at the T2 are managed storage and reliable file transfer (FTS: DB component at the T1, user agent also at the T2; DB for storage at the T2).
- 1 Gbit network connectivity, shared (less will suffice to start with, more may be needed!).
Tier1 responsibilities:
- Provide archival storage for (MC) data that is uploaded from T2s;
- Host the DB and the (gLite) File Transfer Server;
- (Later) also data download (eventually from 3rd parties) to T2s.
Tier2 responsibilities:
- Install / run dCache / DPM (managed storage software with the agreed SRM interface);
- Install the gLite FTS client;
- (Batch service to generate & process MC data);
- (Batch analysis service - SC4 and beyond).
Tier2s do not offer persistent (archival) storage!

22 Service Challenge in Russia
[Diagram: data flow between the T0 (CERN), the T1 and the Russian T2 sites.]

23 Tier1/2 Network Topology

24 Tier2 in Russia

Institute  Link                  CPUs  Disk     OS / Middleware
IHEP       100 Mb/s half-duplex  5+    1.6 TB   ...?
ITEP       60 Mb/s               20 ?  2 TB ?   SL (kernel ?)
JINR       40 Mb/s               10 ?  ?        SLC3.0.X, LCG-2_4_0, Castor, gridftp, gLite?
SINP       1 Gbit/s              30 ?  ?        gridftp, Castor
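As a rough illustration (not from the slides) of what these link speeds imply for the throughput phase, an estimate of how long one terabyte takes over each institute's link, ignoring protocol overhead and assuming the full link capacity is available:

```python
# Rough estimate: hours needed to move 1 TB over each institute's link.
LINKS_MBIT_S = {"IHEP": 100, "ITEP": 60, "JINR": 40, "SINP": 1000}

def transfer_hours(volume_tb: float, link_mbit_s: float) -> float:
    """Hours to move volume_tb terabytes over a link of link_mbit_s megabits/s."""
    return volume_tb * 1e12 * 8 / (link_mbit_s * 1e6) / 3600

for site, mbit in LINKS_MBIT_S.items():
    print(f"{site:5s} {mbit:5d} Mb/s -> {transfer_hours(1.0, mbit):6.1f} h per TB")
# ~22 h per TB at 100 Mb/s, ~2.2 h per TB at 1 Gb/s (before any overheads).
```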

25 Summary
- The first T2 sites need to be actively involved in the Service Challenges from Summer 2005.
- ~All T2 sites need to be successfully integrated just over one year later.
- Adding the T2s and integrating the experiments' software in the SCs will be a massive effort!
- Initial T2s for SC3 have been identified; a longer term plan is being executed.

26 Conclusions
- To be ready to fully exploit the LHC, significant resources need to be allocated to a series of Service Challenges by all concerned parties.
- These challenges should be seen as an essential, on-going and long-term commitment to achieving a production LCG.
- The countdown has started; we are already in (pre-)production mode.
- Next stop: 2020.

