The LCG Service Challenges: Experiment Participation Jamie Shiers, CERN-IT-GD 4 March 2005.

Slides:



Advertisements
Similar presentations
Bernd Panzer-Steindel, CERN/IT WAN RAW/ESD Data Distribution for LHC.
Advertisements

Exporting Raw/ESD data from Tier-0 Tier-1s Wrap-up.
HEPiX Edinburgh 28 May 2004 LCG les robertson - cern-it-1 Data Management Service Challenge Scope Networking, file transfer, data management Storage management.
Service Challenge Meeting “Service Challenge 2 update” James Casey, IT-GD, CERN IN2P3, Lyon, 15 March 2005.
Ian M. Fisk Fermilab February 23, Global Schedule External Items ➨ gLite 3.0 is released for pre-production in mid-April ➨ gLite 3.0 is rolled onto.
Stefano Belforte INFN Trieste 1 CMS SC4 etc. July 5, 2006 CMS Service Challenge 4 and beyond.
October 24, 2000Milestones, Funding of USCMS S&C Matthias Kasemann1 US CMS Software and Computing Milestones and Funding Profiles Matthias Kasemann Fermilab.
LCG Milestones for Deployment, Fabric, & Grid Technology Ian Bird LCG Deployment Area Manager PEB 3-Dec-2002.
LHCC Comprehensive Review – September WLCG Commissioning Schedule Still an ambitious programme ahead Still an ambitious programme ahead Timely testing.
CMS Report – GridPP Collaboration Meeting VI Peter Hobson, Brunel University30/1/2003 CMS Status and Plans Progress towards GridPP milestones Workload.
The LCG Service Challenges and GridPP Jamie Shiers, CERN-IT-GD 31 January 2005.
SC4 Workshop Outline (Strong overlap with POW!) 1.Get data rates at all Tier1s up to MoU Values Recent re-run shows the way! (More on next slides…) 2.Re-deploy.
CHEP – Mumbai, February 2006 The LCG Service Challenges Focus on SC3 Re-run; Outlook for 2006 Jamie Shiers, LCG Service Manager.
LHC Tier 2 Networking BOF Joe Metzger Joint Techs Vancouver 2005.
Computing Infrastructure Status. LHCb Computing Status LHCb LHCC mini-review, February The LHCb Computing Model: a reminder m Simulation is using.
HEPiX, CASPUR, April 3-7, 2006 – Steve McDonald Steven McDonald TRIUMF Network & Computing Services Canada’s National Laboratory.
LCG Service Challenge Phase 4: Piano di attività e impatto sulla infrastruttura di rete 1 Service Challenge Phase 4: Piano di attività e impatto sulla.
LHC Data Challenges and Physics Analysis Jim Shank Boston University VI DOSAR Workshop 16 Sept., 2005.
D0RACE: Testbed Session Lee Lueking D0 Remote Analysis Workshop February 12, 2002.
CCRC’08 Weekly Update Jamie Shiers ~~~ LCG MB, 1 st April 2008.
NL Service Challenge Plans Kors Bos, Sander Klous, Davide Salomoni (NIKHEF) Pieter de Boer, Mark van de Sanden, Huub Stoffers, Ron Trompert, Jules Wolfrat.
CSCS Status Peter Kunszt Manager Swiss Grid Initiative CHIPP, 21 April, 2006.
LCG Service Challenges: Planning for Tier2 Sites Update for HEPiX meeting Jamie Shiers IT-GD, CERN.
LCG Service Challenges: Planning for Tier2 Sites Update for HEPiX meeting Jamie Shiers IT-GD, CERN.
Ian Bird LHC Computing Grid Project Leader LHC Grid Fest 3 rd October 2008 A worldwide collaboration.
BNL Wide Area Data Transfer for RHIC & ATLAS: Experience and Plans Bruce G. Gibbard CHEP 2006 Mumbai, India.
CERN IT Department CH-1211 Genève 23 Switzerland t Frédéric Hemmer IT Department Head - CERN 23 rd August 2010 Status of LHC Computing from.
INFSO-RI Enabling Grids for E-sciencE Enabling Grids for E-sciencE Pre-GDB Storage Classes summary of discussions Flavia Donno Pre-GDB.
Ramping up the LCG Service The LCG Service Challenges: Ramping up the LCG Service Jamie Shiers, CERN-IT-GD-SC April 2005 LCG.
ATLAS WAN Requirements at BNL Slides Extracted From Presentation Given By Bruce G. Gibbard 13 December 2004.
SC4 Planning Planning for the Initial LCG Service September 2005.
BNL Service Challenge 3 Status Report Xin Zhao, Zhenping Liu, Wensheng Deng, Razvan Popescu, Dantong Yu and Bruce Gibbard USATLAS Computing Facility Brookhaven.
Oracle for Physics Services and Support Levels Maria Girone, IT-ADC 24 January 2005.
Plans for Service Challenge 3 Ian Bird LHCC Referees Meeting 27 th June 2005.
Data Transfer Service Challenge Infrastructure Ian Bird GDB 12 th January 2005.
The ATLAS Computing Model and USATLAS Tier-2/Tier-3 Meeting Shawn McKee University of Michigan Joint Techs, FNAL July 16 th, 2007.
Maria Girone CERN - IT Tier0 plans and security and backup policy proposals Maria Girone, CERN IT-PSS.
LCG Service Challenges SC2 Goals Jamie Shiers, CERN-IT-GD 24 February 2005.
LCG Storage Workshop “Service Challenge 2 Review” James Casey, IT-GD, CERN CERN, 5th April 2005.
Victoria, Sept WLCG Collaboration Workshop1 ATLAS Dress Rehersals Kors Bos NIKHEF, Amsterdam.
1 A Scalable Distributed Data Management System for ATLAS David Cameron CERN CHEP 2006 Mumbai, India.
Enabling Grids for E-sciencE INFSO-RI Enabling Grids for E-sciencE Gavin McCance GDB – 6 June 2007 FTS 2.0 deployment and testing.
Service Challenge Meeting “Review of Service Challenge 1” James Casey, IT-GD, CERN RAL, 26 January 2005.
GDB, 07/06/06 CMS Centre Roles à CMS data hierarchy: n RAW (1.5/2MB) -> RECO (0.2/0.4MB) -> AOD (50kB)-> TAG à Tier-0 role: n First-pass.
8 August 2006MB Report on Status and Progress of SC4 activities 1 MB (Snapshot) Report on Status and Progress of SC4 activities A weekly report is gathered.
Database Project Milestones (+ few status slides) Dirk Duellmann, CERN IT-PSS (
HEPiX Rome – April 2006 The High Energy Data Pump A Survey of State-of-the-Art Hardware & Software Solutions Martin Gasthuber / DESY Graeme Stewart / Glasgow.
LCG Storage Management Workshop - Goals and Timeline of SC3 Jamie Shiers, CERN-IT-GD April 2005.
Summary of SC4 Disk-Disk Transfers LCG MB, April Jamie Shiers, CERN.
LCG LHC Grid Deployment Board Regional Centers Phase II Resource Planning Service Challenges LHCC Comprehensive Review November 2004 Kors Bos, GDB.
INFSO-RI Enabling Grids for E-sciencE File Transfer Software and Service SC3 Gavin McCance – JRA1 Data Management Cluster Service.
Status of GSDC, KISTI Sang-Un Ahn, for the GSDC Tier-1 Team
Dominique Boutigny December 12, 2006 CC-IN2P3 a Tier-1 for W-LCG 1 st Chinese – French Workshop on LHC Physics and associated Grid Computing IHEP - Beijing.
Planning the Planning for the LCG Service Challenges Jamie Shiers, CERN-IT-GD 28 January 2005.
Top 5 Experiment Issues ExperimentALICEATLASCMSLHCb Issue #1xrootd- CASTOR2 functionality & performance Data Access from T1 MSS Issue.
Dissemination and User Feedback Castor deployment team Castor Readiness Review – June 2006.
ATLAS Computing Model Ghita Rahal CC-IN2P3 Tutorial Atlas CC, Lyon
Operations Workshop Introduction and Goals Markus Schulz, Ian Bird Bologna 24 th May 2005.
T0-T1 Networking Meeting 16th June Meeting
WLCG Tier-2 Asia Workshop TIFR, Mumbai 1-3 December 2006
“A Data Movement Service for the LHC”
The LHC Computing Environment
LCG Service Challenge: Planning and Milestones
James Casey, IT-GD, CERN CERN, 5th September 2005
Database Readiness Workshop Intro & Goals
Olof Bärring LCG-LHCC Review, 22nd September 2008
The LCG Service Challenges: Ramping up the LCG Service
LCG Service Challenges Overview
LHC Data Analysis using a worldwide computing grid
Presentation transcript:

The LCG Service Challenges: Experiment Participation Jamie Shiers, CERN-IT-GD 4 March 2005

LCG Project, Grid Deployment Group, CERN Agenda  Reminder of the Goals and Timelines of the LCG Service Challenges  Outline of Service Challenges  Very brief review of SC1 – did it work?  Status of SC2  Plans for SC3 and beyond Experiment involvement begins here

LCG Project, Grid Deployment Group, CERN Problem Statement  ‘Robust File Transfer Service’ often seen as the ‘goal’ of the LCG Service Challenges  Whilst it is clearly essential that we ramp up at CERN and the T1/T2 sites to meet the required data rates well in advance of LHC data taking, this is only one aspect  Getting all sites to acquire and run the infrastructure is non-trivial (managed disk storage, tape storage, agreed interfaces, 24 x 365 service aspect, including during conferences, vacation, illnesses etc.)  Need to understand networking requirements and plan early  But transferring ‘dummy files’ is not enough…  Still have to show that basic infrastructure works reliably and efficiently  Need to test experiments’ Use Cases  Check for bottlenecks and limits in s/w, disk and other caches etc.  We can presumably write some test scripts to ‘mock up’ the experiments’ Computing Models  But the real test will be to run your s/w…  Which requires strong involvement from production teams

LCG Project, Grid Deployment Group, CERN LCG Service Challenges - Overview  LHC will enter production (physics) in April 2007  Will generate an enormous volume of data  Will require huge amount of processing power  LCG ‘solution’ is a world-wide Grid  Many components understood, deployed, tested..  But…  Unprecedented scale  Humungous challenge of getting large numbers of institutes and individuals, all with existing, sometimes conflicting commitments, to work together  LCG must be ready at full production capacity, functionality and reliability in less than 2 years from now  Issues include h/w acquisition, personnel hiring and training, vendor rollout schedules etc.  Should not limit ability of physicist to exploit performance of detectors nor LHC’s physics potential  Whilst being stable, reliable and easy to use

LCG Project, Grid Deployment Group, CERN Key Principles  Service challenges results in a series of services that exist in parallel with baseline production service  Rapidly and successively approach production needs of LHC  Initial focus: core (data management) services  Swiftly expand out to cover full spectrum of production and analysis chain  Must be as realistic as possible, including end-end testing of key experiment use-cases over extended periods with recovery from glitches and longer-term outages  Necessary resources and commitment pre-requisite to success!  Should not be under-estimated!

LCG Project, Grid Deployment Group, CERN Initial Schedule (1/2)  Tentatively suggest quarterly schedule with monthly reporting  e.g. Service Challenge Meetings / GDB respectively  Less than 7 complete cycles to go!  Urgent to have detailed schedule for 2005 with at least an outline for remainder of period asap  e.g. end January 2005  Must be developed together with key partners  Experiments, other groups in IT, T1s, …  Will be regularly refined, ever increasing detail…  Detail must be such that partners can develop their own internal plans and to say what is and what is not possible  e.g. FIO group, T1s, …

LCG Project, Grid Deployment Group, CERN Initial Schedule (2/2)  Q1 / Q2: up to 5 T1s, writing to disk at 100MB/s per T1 (no expts)  Q3 / Q4: include two experiments, tape and a few selected T2s  2006: progressively add more T2s, more experiments, ramp up to twice nominal data rate  2006: production usage by all experiments at reduced rates (cosmics); validation of computing models  2007: delivery and contingency  N.B. there is more detail in Dec / Jan / Feb GDB presentations  Need to be re-worked now!

Service Challenge Meeting “Review of Service Challenge 1” James Casey, IT-GD, CERN RAL, 26 January 2005

LCG Project, Grid Deployment Group, CERN Overview  Reminder of targets for the Service Challenge  What we did  What can we learn for SC2?

LCG Project, Grid Deployment Group, CERN Milestone I & II Proposal From NIKHEF/SARA Service Challenge Meeting  Dec04 - Service Challenge I complete  mass store (disk) - mass store (disk)  3 T1s (Lyon, Amsterdam, Chicago)  500 MB/sec (individually and aggregate)  2 weeks sustained  Software; GridFTP plus some scripts  Mar05 - Service Challenge II complete  Software: reliable file transfer service  mass store (disk) - mass store (disk),  5 T1’s (also Karlsruhe, RAL,..)  500 MB/sec T0-T1 but also between T1’s  1 month sustained

LCG Project, Grid Deployment Group, CERN Service Challenge Schedule From FZK Dec Service Challenge Meeting:  Dec 04  SARA/NIKHEF challenge  Still some problems to work out with bandwidth to teras system at SARA  Fermilab  Over CERN shutdown – best effort support  Can try again in January in originally provisioned slot  Jan 04  FZK Karlsruhe

LCG Project, Grid Deployment Group, CERN SARA – Dec 04  Used a SARA specific solution  Gridftp running on 32 nodes of SGI supercomputer (teras)  3 x 1Gb network links direct to teras.sara.nl  3 gridftp servers, one for each link  Did load balancing from CERN side  3 oplapro machines transmitted down each 1Gb link  Used radiant-load-generator script to generate data transfers  Much efforts was put in from SARA personnel (~1-2 FTEs) before and during the challenge period  Tests ran from 6-20 th December  Much time spent debugging components

LCG Project, Grid Deployment Group, CERN Problems seen during SC1  Network Instability  Router electrical problem at CERN  Interruptions due to network upgrades on CERN test LAN  Hardware Instability  Crashes seen on teras 32-node partition used for challenges  Disk failure on CERN transfer node  Software Instability  Failed transfers from gridftp. Long timeouts resulted in significant reduction in throughput  Problems in gridftp with corrupted files  Often hard to isolate a problem to the right subsystem

LCG Project, Grid Deployment Group, CERN SARA SC1 Summary  Sustained run of 3 days at end  6 hosts at CERN side. single stream transfers, 12 files at a time  Average throughput was 54MB/s  Error rate on transfers was 2.7%  Could transfer down each individual network links at ~40MB/s  This did not translate into the expected 120MB/s speed  Load on teras and oplapro machines was never high (~6-7 for a 32 node teras, < 2 for 2-node oplapro) Load on oplapro machines  See Service Challenge wiki for logbook kept during Challenge

LCG Project, Grid Deployment Group, CERN Gridftp problems  64 bit compatibility problems  logs negative numbers for file size > 2 GB  logs erroneous buffer sizes to the logfile if the server is 64-bits  No checking of file length on transfer  No error message doing a third party transfer with corrupted files  Issues followed up with globus gridftp team  First two will be fixed in next version.  The issue of how to signal problems during transfers is logged as an enhancement request

LCG Project, Grid Deployment Group, CERN FermiLab & FZK – Dec 04/Jan 05  FermiLab declined to take part in Dec 04 sustained challenge  They had already demonstrated 500MB/s for 3 days in November  FZK started this week  Bruno Hoeft will give more details in his site report

LCG Project, Grid Deployment Group, CERN What can we learn ?  SC1 did not succeed with all goals  We did not meet the milestone of 500MB/s for 2 weeks  We need to do these challenges to see what actually goes wrong  A lot of things do, and did, go wrong  Running for a short period with ‘special effort’ is not the same as sustained, production operation  We need better test plans for validating the infrastructure before the challenges (network throughput, disk speeds, etc…)  Ron Trompert (SARA) has made a first version of this  Discussion at Feb SC meeting that all sites will run this < SC2  We need to proactively fix low-level components  Gridftp, etc…  SC2 and SC3 will be a lot of work !

LCG Project, Grid Deployment Group, CERN 2005 Q1 - SC2 SC2 - Robust Data Transfer Challenge Set up infrastructure for 6 sites  Fermi, NIKHEF/SARA, GridKa, RAL, CNAF, CCIN2P3 Test sites individually  Target 100MByte/s per site  at least two at 500 MByte/s with CERN Agree on sustained data rates for each participating centre Goal – by end March sustained 500 Mbytes/s aggregate at CERN SC2 Full physics run First physics First beams cosmics

LCG Project, Grid Deployment Group, CERN Prepare for the next service challenge (SC3) -- in parallel with SC2 (reliable file transfer) Build up 1 GByte/s challenge facility at CERN  The current 500 MByte/s facility used for SC2 will become the testbed from April onwards (10 ftp servers, 10 disk servers, network equipment) Build up infrastructure at each external centre  Average capability ~150 MB/sec at a Tier-1 (to be agreed with each T-1) Further develop reliable transfer framework software  Include catalogues, include VO’s 2005 Q1 - SC3 preparation Full physics run First physics First beams cosmics SC3 SC2 disk-network-disk bandwidths

LCG Project, Grid Deployment Group, CERN SC3 - 50% service infrastructure  Same T1s as in SC2 (Fermi, NIKHEF/SARA, GridKa, RAL, CNAF, CCIN2P3)  Add at least two T2s  “50%” means approximately 50% of the nominal rate of ATLAS+CMS Using the 1 GByte/s challenge facility at CERN -  Disk at T0 to tape at all T1 sites at 60 Mbyte/s  Data recording at T0 from same disk buffers  Moderate traffic disk-disk between T1s and T2s Use ATLAS and CMS files, reconstruction, ESD skimming codes (numbers to be worked out when the models are published) Goal - 1 month sustained service in July  500 MBytes/s aggregate at CERN, 60 MBytes/s at each T1  end-to-end data flow peaks at least a factor of two at T1s  network bandwidth peaks ?? 2005 Q2-3 - SC3 challenge SC3 SC2 Full physics run First physics First beams cosmics tape-network-disk bandwidths

LCG Project, Grid Deployment Group, CERN 2005 Q2-3 - SC3 additional centres In parallel with SC3 prepare additional centres using the 500 MByte/s test facility  Test Taipei, Vancouver, Brookhaven, additional Tier-2s Further develop framework software  Catalogues, VO’s, use experiment specific solutions SC2 SC3 Full physics run First physics First beams cosmics

LCG Project, Grid Deployment Group, CERN 2005 Sep-Dec - SC3 Service 50% Computing Model Validation Period The service exercised in SC3 is made available to experiments as a stable, permanent service for computing model tests Additional sites are added as they come up to speed End-to-end sustained data rates –  500 Mbytes/s at CERN (aggregate)  60 Mbytes/s at Tier-1s  Modest Tier-2 traffic SC2 SC3 Full physics run First physics First beams cosmics

LCG Project, Grid Deployment Group, CERN 2005 Sep-Dec - SC4 preparation In parallel with the SC3 model validation period, in preparation for the first 2006 service challenge (SC4) – Using 500 MByte/s test facility  test PIC and Nordic T1s  and T2’s that are ready (Prague, LAL, UK, INFN,.. Build up the production facility at CERN to 3.6 GBytes/s Expand the capability at all Tier-1s to full nominal data rate SC2 SC3 SC4 Full physics run First physics First beams cosmics

LCG Project, Grid Deployment Group, CERN 2006 Jan-Aug - SC4 SC4 – full computing model services - Tier-0, ALL Tier-1s, all major Tier-2s operational at full target data rates (~2 GB/sec at Tier-0) - acquisition - reconstruction - recording – distribution, PLUS ESD skimming, servicing Tier-2s Goal – stable test service for one month – April % Computing Model Validation Period (May-August 2006) Tier-0/1/2 full model test - All experiments - 100% nominal data rate, with processing load scaled to 2006 cpus SC2 SC3 SC4 Full physics run First physics First beams cosmics

LCG Project, Grid Deployment Group, CERN 2006 Sep – LHC service available The SC4 service becomes the permanent LHC service – available for experiments’ testing, commissioning, processing of cosmic data, etc. All centres ramp-up to capacity needed at LHC startup  TWICE nominal performance  Milestone to demonstrate this 3 months before first physics data  April 2007 SC2 SC3 SC4 LHC Service Operation Full physics run First physics First beams cosmics

LCG Project, Grid Deployment Group, CERN Key dates for Connectivity SC2 SC3 SC4 LHC Service Operation Full physics run First physics First beams cosmics June05 - Technical Design Report  Credibility Review by LHCC Sep05 - SC3 Service – 8-9 Tier-1s sustain - 1 Gbps at Tier-1s, 5 Gbps at CERN Extended peaks at 10 Gbps CERN and some Tier-1s Jan06 - SC4 Setup – AllTier-1s 10 Gbps at >5 Tier-1s, 35 Gbps at CERN July06 - LHC Service – All Tier-1s 10 Gbps at Tier-1s, 70 Gbps at CERN

LCG Project, Grid Deployment Group, CERN Key dates for Services SC2 SC3 SC4 LHC Service Operation Full physics run First physics First beams cosmics June05 - Technical Design Report Sep05 - SC3 Service Phase May06 –SC4 Service Phase Sep06 – Initial LHC Service in stable operation

LCG Project, Grid Deployment Group, CERN Some Comments on Tier2 Sites  Summer 2005: SC3 - include 2 Tier2s; progressively add more  Summer / Fall 2006: SC4 complete  SC4 – full computing model services  Tier-0, ALL Tier-1s, all major Tier-2s operational at full target data rates (~1.8 GB/sec at Tier-0)  acquisition - reconstruction - recording – distribution, PLUS ESD skimming, servicing Tier-2s  How many Tier2s?  ATLAS: already identified 29  CMS: some 25  With overlap, assume some 50 T2s total(?) 100(?)  This means that in the 12 months from ~August 2005 we have to add 2 T2s per week  Cannot possibly be done using the same model as for T1s  SC meeting at a T1 as it begins to come online for service challenges  Typically 2 day (lunchtime – lunchtime) meeting

LCG Project, Grid Deployment Group, CERN GDB / SC meetings / T1 visit Plan In addition to planned GDB, Service Challenge, Network Meetings etc:  Visits to all Tier1 sites (initially)  Goal is to meet as many of the players as possible  Not just GDB representatives! Equivalents of ADC/CS/FIO/GD people  Current Schedule:  Aim to complete many of European sites by Easter  (FZK), NIKHEF, RAL, CNAF, IN2P3, PIC, (Nordic)  “Round world” trip to BNL / FNAL / Triumf / ASCC in April  Need to address also Tier2s  Cannot be done in the same way!  Work through existing structures, e.g.  HEPiX, national and regional bodies etc.  e.g. GridPP, INFN, …  Talking of SC Update at May HEPiX (FZK) with more extensive programme at Fall HEPiX (SLAC)  Maybe some sort of North American T2-fest around this? Visits in progress

LCG Project, Grid Deployment Group, CERN Conclusions  To be ready to fully exploit LHC, significant resources need to be allocated to a series of service challenges by all concerned parties  These challenges should be seen as an essential on-going and long-term commitment to achieving production LCG  The countdown has started – we are already in (pre-)production mode  Next stop: 2020