Kors Bos NIKHEF, Amsterdam

WLCG Management Overview (I skip my first 10 slides)

WLCG Collaboration
The Collaboration
- 4 LHC experiments
- ~120 computing centres: 12 large centres (Tier-0, Tier-1) and 38 federations of smaller “Tier-2” centres
- Growing to ~40 countries
Memorandum of Understanding
- Agreed in October 2005, now being signed
Resources
- Commitment made each October for the coming year, with a 5-year forward look

Grid Deployment Board (GDB) – chair: Kors Bos (NIKHEF)
With a vote:
- One person from a major site in each country
- One person from each experiment
Without a vote:
- Experiment Computing Coordinators
- Site service management representatives
- Project Leader, Area Managers

Management Board – chair: Project Leader
- Experiment Computing Coordinators
- One person from Tier-0 and each Tier-1 site
- GDB chair
- Project Leader, Area Managers
- EGEE Technical Director
- Support for Experiments

All boards except the OB have open access to agendas, minutes and documents.
Planning page – MoU, resources, accounting, milestones, progress reports, …
http://www.cern.ch/lcg

WLCG depends on two major science grid infrastructures:
- EGEE – Enabling Grids for E-Science
- OSG – US Open Science Grid

EGEE (see talk by Bob Jones)
- EGEE-II Phase 2 approved just after last year’s comprehensive review
- Significant operations resources at CERN and Regional Centres
- HEP applications support (NA4) extended to cover the full range of activities that interface to the grid – production as well as analysis
- The experiments are represented directly in the Technical Coordination Group, where decisions on the evolution of middleware developments are prepared
- Former “LCG” middleware package merged into “gLite 3”
- Funding for Phase 2: April 2006 → March 2008
- Discussion started on a successor project – or a third phase of EGEE; prospects will be clearer in 6 months

Open Science Grid (see talk by Paul Avery)
- Operates an expanding Production Distributed Facility which provides core capabilities in operations, security, software releases, the Virtual Data Toolkit, trouble-shooting, … and supports the engagement of new communities to use the OSG
- Education, Training and Outreach, with a core focus on expanding the number, location and scope of the Grid Schools and web-based training
- User Support, Service and System Extensions for increased scale, performance and capability of the common services for the stakeholder VOs
- OSG proposal submitted to DOE and NSF in February/March, revised proposal in June
- Planning going ahead in expectation of a positive decision in the next few months – five-year funding cycle

The Network (I’ll give another talk on this on Wednesday)

The LHCOPN
- The Tier-0 and the Tier-1s (TRIUMF, Brookhaven, Fermilab, ASCC, Nordic, RAL, GridKa, IN2P3, CNAF, PIC, SARA) are connected by a dedicated 10 Gbit optical network
- Tier-2s and Tier-1s are inter-connected by the general purpose research networks
- Any Tier-2 may access data at any Tier-1
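A minimal sketch (not part of the slide) of the two-layer connectivity just described: Tier-0 to Tier-1 traffic rides the dedicated 10 Gbit LHCOPN, while any path involving a Tier-2 uses the general purpose research networks. The site names are taken from the list above, the Tier-2 name in the example and the helper function are hypothetical, and the real LHCOPN topology is simplified to the Tier-0 ↔ Tier-1 case.

```python
# Minimal sketch of the two-layer connectivity described above (illustrative only;
# it simplifies the real LHCOPN topology to the Tier-0 <-> Tier-1 case).
TIER1_SITES = {"TRIUMF", "Brookhaven", "Fermilab", "ASCC", "Nordic",
               "RAL", "GridKa", "IN2P3", "CNAF", "PIC", "SARA"}
TIER0 = "CERN"

def network_for(src: str, dst: str) -> str:
    """Which network a transfer between two sites is expected to travel over."""
    if TIER0 in (src, dst) and ({src, dst} - {TIER0}) <= TIER1_SITES:
        return "LHCOPN (dedicated 10 Gbit optical network)"
    return "general-purpose research networks"  # Tier-2 <-> Tier-1, Tier-2 <-> Tier-2, ...

print(network_for("CERN", "GridKa"))          # LHCOPN (dedicated 10 Gbit optical network)
print(network_for("NIKHEF-T2", "Fermilab"))   # general-purpose research networks
```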

The new European Network Backbone
- LCG working group with Tier-1s and national/regional research network organisations
- New GÉANT 2 research network backbone → strong correlation with major European LHC centres
- Swiss PoP at CERN

Recent Developments: I’ll show you some of the dirty laundry!

WLCG Services
- Baseline services from the TDR are in operation
- Agreement (after much discussion) on VO Boxes
gLite 3
- Basis for startup on the EGEE grid
- Introduced (just) on time for SC4
- New Workload Management System – now entering production
Metrics
- Accounting introduced for Tier-1s and CERN (CPU and storage)
- Site availability measurement system introduced – reporting for Tier-1s & CERN from May
- Job failure analysis
Grid operations
- All major LCG sites active
- Daily monitoring and operations are maturing
- Evolution of the EGEE regional operations support structure

WLCG Services at CERN
CERN Fabric
- Tier-0 testing has progressed well: artificial system tests, and ATLAS Tier-0 testing at full throughput
- Comfortable that target data rates and throughput can be met, including CASTOR 2
- But DAQ systems not yet integrated in these tests
CERN Analysis Facility (CAF)
- Testing of experiment approaches to this has started only in the past few months; much has still to be understood
- Essential to maintain Tier-0/CAF flexibility for hardware during the early years
CASTOR 2
- Performance is largely understood
- Stability and the ability to maintain a 24 x 365 service are now the main issues

Storage Resource Management (a flash from recent GDB discussions)
- September 2005 – agreement on requirements for SRM v2.1
- End of year – first implementations appear; first signs that the requirements (durable, permanent, volatile) were not so clear
- February 2006 – Mumbai workshop: agreement on storage classes and a way of expressing them
- May – workshop at FNAL: strong guideline to converge, agreement on SRM v2.2 – functionality and storage classes (tape1disk0, tape0disk1, tape1disk1)
- Long process, but a major advance in understanding the real storage management needs of the experiments
- Scheduled to be in production in 1Q07; task force looking into implementation issues at sites
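A small illustrative sketch, not from the slide, of how the three storage classes named above are commonly read: tape<N>disk<M> means N custodial copies on tape and M permanent copies on disk, which is what sites have to map onto their storage systems. The dictionary layout and the helper requires_mass_storage are assumptions made for the example.

```python
# Illustrative reading of the SRM v2.2 / WLCG storage classes named above:
# "tape<N>disk<M>" = N custodial copies on tape, M permanent copies on disk.
STORAGE_CLASSES = {
    "tape1disk0": {"tape_copies": 1, "disk_copies": 0},  # custodial; disk is only a cache
    "tape0disk1": {"tape_copies": 0, "disk_copies": 1},  # disk-only replica, no tape copy
    "tape1disk1": {"tape_copies": 1, "disk_copies": 1},  # tape copy plus a permanent disk copy
}

def requires_mass_storage(storage_class: str) -> bool:
    """True if a space of this class must be backed by a tape (mass-storage) system."""
    return STORAGE_CLASSES[storage_class]["tape_copies"] > 0

for name, layout in STORAGE_CLASSES.items():
    print(f"{name}: {layout} -> needs tape: {requires_mass_storage(name)}")
```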

VOMS Issues (a flash from recent GDB discussions)
- VOMS adds more characteristics (groups, roles) to a user
- The VOMS server works → proxies are given out; but the VOs have not yet defined all groups/roles
- When/how should VOMS be used for job priorities? GridPolicy service – do we need it, and if so, when?
- Usage of VOMS attributes in storage? DPM, dCache, CASTOR implementations? How is this implemented at sites?
- VOMS and job accounting – by what? Groups? How? With pilot jobs, no user-level accounting?
- VOMS and storage accounting – by what? Groups? How?
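To make the discussion concrete, here is a small sketch, illustrative only and not a real VOMS client, of the attribute format these questions revolve around: a VOMS proxy carries FQANs of the form /<vo>/<group>.../Role=<role>, and job priorities, storage permissions and accounting would all key off these parts. The function parse_fqan and the example FQAN are assumptions for the example.

```python
# Illustrative parser for a VOMS attribute (FQAN), e.g. "/atlas/production/Role=production".
# Not a real VOMS client; it just shows the group/role structure the slide refers to.
def parse_fqan(fqan: str) -> dict:
    groups, role, capability = [], None, None
    for part in (p for p in fqan.split("/") if p):
        if part.startswith("Role="):
            role = part.split("=", 1)[1]
        elif part.startswith("Capability="):
            capability = part.split("=", 1)[1]
        else:
            groups.append(part)
    return {"vo": groups[0] if groups else None,  # first component is the VO
            "groups": groups,                     # VO plus any sub-groups
            "role": role,                         # None if the VO has not assigned one
            "capability": capability}

print(parse_fqan("/atlas/production/Role=production"))
# {'vo': 'atlas', 'groups': ['atlas', 'production'], 'role': 'production', 'capability': None}
```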

Service Challenges

Service Challenge 4 (SC4)
Pilot LHC Service from June 2006
- A stable service on which experiments can make a full demonstration of their offline chain: DAQ → Tier-0 → Tier-1 data recording, calibration, reconstruction
- Offline analysis: Tier-1 ↔ Tier-2 data exchange, simulation, batch and end-user analysis
And sites can test their operational readiness
- WLCG services – monitoring → reliability
- Grid services
- Mass storage services, including magnetic tape
- Extension to most Tier-2 sites
Target for service by end September
- Service metrics ≥ 90% of MoU service levels
- Data distribution from CERN to tape at Tier-1s at nominal LHC rates

Data Distribution
Pre-SC4 April tests, CERN → Tier-1s:
- SC4 target of 1.6 GB/s reached – but only for one day
- Sustained data rate 1.3 GB/s, 80% of the target
- But experiment-driven transfers (ATLAS and CMS) sustained 0.8 GB/s, 50% of the SC4 target, under much more realistic conditions
- CMS transferred a steady 1 PB/month between Tier-1s & Tier-2s during a 90-day period
- ATLAS distributed 1.25 PB from CERN during a 6-week period
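A back-of-envelope check, assuming a 30-day month, a 42-day “six-week” period and decimal units, of how the experiment-driven transfers quoted above compare with the 1.6 GB/s SC4 target; the variable names are illustrative.

```python
# Back-of-envelope check of the sustained rates quoted above
# (assumes a 30-day month, 42-day "6-week" period, decimal units: 1 PB = 1e15 B).
PB, GB, MB = 1e15, 1e9, 1e6

cms_rate   = 1.00 * PB / (30 * 86400)   # CMS: ~1 PB/month between Tier-1s & Tier-2s
atlas_rate = 1.25 * PB / (42 * 86400)   # ATLAS: 1.25 PB from CERN over 6 weeks
sc4_target = 1.6 * GB                   # aggregate CERN -> Tier-1 SC4 target, bytes/s

print(f"CMS   ~{cms_rate / MB:.0f} MB/s")                                  # ~386 MB/s
print(f"ATLAS ~{atlas_rate / MB:.0f} MB/s")                                # ~344 MB/s
print(f"together ~{(cms_rate + atlas_rate) / sc4_target:.0%} of target")   # ~46%, roughly the 50% quoted
```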

Production Grids for LHC
EGEE Grid: ~50K jobs/day, ~14K simultaneous jobs during prolonged periods

OSG Production for LHC
OSG: ~15K jobs/day; the 3 big users are ATLAS, CDF and CMS. ~3K simultaneous jobs – at the moment the use is quite spiky.
(Plots: OSG-CMS data distribution and OSG-ATLAS running jobs over the past 3 months.)

SC4 Formal Reliability Targets
- 8 Tier-1s and 20 Tier-2s must have demonstrated availability better than 90% of the levels specified in the MoU
- Success rate of standard application test jobs > 90%, excluding failures due to the applications environment and non-availability of sites

Site availability, August 2006 (two sites not yet integrated in the measurement framework):
- Target: 88% availability
- 10-site average: 74%
- Best 8 sites average: 85%
- Reliability (excludes scheduled down time): ~1% higher

May-August availability: daily availability of the 10 sites plotted against the SC4 target of 88%.
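The following minimal sketch, with made-up numbers, spells out the distinction the measurements above rely on: availability counts all downtime, while reliability excludes scheduled downtime from the denominator, which is why it comes out about one percent higher here. The exact formulas and the example month are assumptions consistent with the slide’s wording.

```python
# Sketch of the availability vs. reliability reading used above (numbers are made up):
#   availability = time_up / total_time
#   reliability  = time_up / (total_time - scheduled_downtime)
def availability(time_up: float, total: float) -> float:
    return time_up / total

def reliability(time_up: float, total: float, scheduled_down: float) -> float:
    return time_up / (total - scheduled_down)

total_h, scheduled_h, unscheduled_h = 744.0, 8.0, 22.0   # one hypothetical month for one site
up_h = total_h - scheduled_h - unscheduled_h

print(f"availability: {availability(up_h, total_h):.1%}")               # ~96.0%
print(f"reliability : {reliability(up_h, total_h, scheduled_h):.1%}")   # ~97.0%, ~1% higher
```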

Job Reliability Monitoring (ongoing work)
- System to process and analyse job logs implemented for some of the major activities in ATLAS and CMS → errors identified, frequency reported to developers and the TCG
- Expect to see results feeding through from development to products in a fairly short time
- More impact expected when the new RB enters full production (the old RB is frozen)
- Daily report on the most important site problems allows the operations team to drill down from site, to computing elements, to worker nodes; in use since the end of August
- Intention is to report long-term trends by site and VO
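A sketch of the kind of aggregation the monitoring work above performs; the record format, site names and error names are hypothetical. Failure reasons are counted per site so the daily report can point operators at the worst problems and let them drill down.

```python
# Hypothetical job-log records: (site, computing_element, worker_node, exit_status).
# The aggregation mirrors the daily per-site failure report described above.
from collections import Counter, defaultdict

job_records = [
    ("SITE_A", "ce01", "wn113", "OK"),
    ("SITE_A", "ce01", "wn113", "StageOutFailure"),
    ("SITE_A", "ce02", "wn042", "StageOutFailure"),
    ("SITE_B", "ce01", "wn007", "ProxyExpired"),
]

errors_by_site = defaultdict(Counter)
for site, ce, wn, status in job_records:
    if status != "OK":                      # only failures enter the report
        errors_by_site[site][status] += 1

for site, counts in sorted(errors_by_site.items()):
    print(site, counts.most_common())       # most frequent failure classes per site
# SITE_A [('StageOutFailure', 2)]
# SITE_B [('ProxyExpired', 1)]
```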

Schedule and Challenges for 2007

Revised Estimates of Physics Beam Time
- TDR requirements were based on the assumption that there would be 50 days of physics data taking in 2007, with 10^7 seconds of proton beam time and 10^6 seconds of heavy ions in subsequent years
- As the planning for the start-up is refined it has become clear that this is unrealistic, and a better estimate of running time in the first years was needed to allow funding agencies and regional centres to improve their medium-term planning
New assumptions:
- 2007 – 0.7 x 10^6 seconds protons
- 2008 – 4 x 10^6 s protons, 0.2 x 10^6 s ions
- 2009 – 6 x 10^6 s protons, 10^6 s ions
- 2010 – 10^7 s protons, 2 x 10^6 s ions

Commissioning Schedule (2006-2008)
- SC4 becomes the initial service when reliability and performance goals are met
- Continued testing of computing models, basic services
- Testing DAQ → Tier-0 (??) & integrating into the DAQ → Tier-0 → Tier-1 data flow
- Building up end-user analysis support
- Exercising the computing systems, ramping up job rates, data management performance, …
- Introduce residual services: full FTS services; 3D; SRM v2.2; VOMS roles
- Initial service commissioning – increase reliability, performance, capacity to target levels; gain experience in monitoring, 24 x 7 operation, …
- 01 Jul 07 – service commissioned: full 2007 capacity and performance, ready for first physics
(Timeline tracks for Experiments and for Sites & Services.)

Challenges and Concerns
Site reliability
- Achieve MoU targets – with a more comprehensive set of tests – for Tier-0, Tier-1 and (major) Tier-2 sites
- Concerns on staffing levels at some sites
- 24 x 7 operation needs to be planned and tested – will be problematic at some sites, including CERN, during the first year when unexpected problems have to be resolved
Tier-1s and Tier-2s learning exactly how they will be used
- Mumbai workshop, several Tier-2 workshops
- Experiment computing model tests: storage, data distribution
Tier-1/Tier-2 interaction
- Test out data transfer services, network capability
- Build operational relationships
Mass storage
- Complex systems – difficult to configure
- CASTOR 2 not yet fully mature
- SRM v2.2 to be deployed – and storage classes, policies implemented by sites
- 3D Oracle – Phase 2 – sites not yet active/staffed

Challenges and Concerns
Experiment service operation
- Manpower intensive
- Interaction between experiments & Tier-1s, large Tier-2s
- Need sustained test load – to verify site and experiment readiness
Analysis on the Grid
- Very challenging
- Overall growth in usage very promising: CMS running over 13k analysis jobs/day submitted by ~100 users using ~75 sites (July 06)
- Understanding how the CERN Analysis Facility will be used
DAQ testing looks late
- The Tier-0 needs time to react to any unexpected requirements and problems

Conclusions
- WLCG is on track to be production-ready at LHC start-up
- Steep ramp-up ahead in resources and in their usage
- Ramp-up of data is more of a concern than computing
- Round-the-clock operation of the services is a challenge
- Stability is what is most needed
- Analysis patterns are still unknown
- Exciting times ahead!

The End – Kors Bos, NIKHEF, Amsterdam

Introduction to the LHC (when needed)

Large Hadron Collider


The LHC Accelerator
The accelerator generates 40 million particle collisions (events) every second at the centre of each of the four experiments’ detectors.

LHC Data
The 40 million collisions per second are reduced by online computers that filter out a few hundred “good” events per second, which are recorded on disk and magnetic tape at 100-1,000 MB/s – roughly 16 PB per year for all four experiments.
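A quick consistency check of the figures on this slide, using the nominal 10^7-second physics year quoted earlier in the talk: at the stated recording rates each experiment produces of order 1-10 PB per year, so ~16 PB/year for the four experiments together sits in the expected range. The loop and variable names are illustrative.

```python
# Back-of-envelope check: recording rate (100-1,000 MB/s per experiment) times a
# nominal 10**7-second physics year gives the per-experiment raw data volume.
SECONDS_PER_YEAR = 1e7                      # nominal physics running time
for rate_MB_per_s in (100, 1000):
    volume_PB = rate_MB_per_s * 1e6 * SECONDS_PER_YEAR / 1e15
    print(f"{rate_MB_per_s:4d} MB/s over 1e7 s -> {volume_PB:.0f} PB per experiment per year")
# 100 MB/s -> 1 PB; 1000 MB/s -> 10 PB; four experiments -> ~16 PB/year is in this range
```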

Data Handling and Computation for Physics Analysis (diagram, les.robertson@cern.ch): detector → event filter (selection & reconstruction) → raw data → reconstruction (Tier-0) → event summary data → batch physics analysis and event reprocessing (Tier-1) → processed data and analysis objects (extracted by physics topic) → interactive physics analysis and event simulation (Tier-2).

WLCG Service Hierarchy
Tier-0 – the accelerator centre
- Data acquisition & initial processing
- Archive one copy of the data
- Distribution of data → Tier-1 centres
Tier-1 – 11 computer centres
- Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF/SARA (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois) and Brookhaven (NY)
- Archive second copy of the data
- Re-processing
- National, regional support
Tier-2 – ~100 centres in ~40 countries
- End-user analysis
- Monte Carlo simulation
