ATLAS Grid Computing in the Real World. Taipei, 27th April 2005. Roger Jones, ATLAS International Computing Board Chair and GridPP Applications Co-ordinator

2 RWL Jones, Lancaster University Overview  The ATLAS Computing Model  Distribution of data and jobs, roles of sites  Required Resources  Testing the Model  See Gilbert Poulard’s Data Challenge talk next  Observations about the Real World (taken from ATLAS DC experiences and those of GridPP)  Matching the global view with local expectations  Sharing with other user communities  Working across several Grid deployments  Resources required > resources available?

3 RWL Jones, Lancaster University Complexity of the Problem: major challenges are associated with communication and collaboration at a distance (collaborative tools), distributed computing resources, and remote software development and physics analysis.

4 RWL Jones, Lancaster University ATLAS is not one experiment: Higgs, Extra Dimensions, Heavy Ion Physics, QCD, Electroweak, B physics, SUSY.

5 RWL Jones, Lancaster University Computing Resources  Computing Model well evolved  Externally reviewed in January 2005  There are (and will remain for some time) many unknowns  Calibration and alignment strategy is still evolving  Physics data access patterns may start to be exercised soon  Unlikely to know the real patterns until 2007/2008!  There are still uncertainties on the event sizes  If there is a problem with resources, e.g. disk, the model will have to change  Lesson from the previous round of experiments at CERN (LEP)  Reviews in 1988 underestimated the computing requirements by an order of magnitude!

6 RWL Jones, Lancaster University The Tiered System [diagram]: the Event Builder feeds the Event Filter (~7.5 MSI2k), which sends raw data at ~3 Gb/s to the Tier 0 at CERN (~5 MSI2k); the Tier 0 distributes data to 10 Tier-1 regional centres (e.g. the UK Regional Centre at RAL, ASCC, US and Dutch centres) at ~75 MB/s per T1 nominal for ATLAS. Each Tier-1 provides ~2 MSI2k and ~2 PB/year, reprocesses data, houses simulation output and hosts group analysis, but runs no simulation itself. Each of the ~30 Tier-2 centres (e.g. Lancaster, Liverpool, Manchester and Sheffield forming a Northern Tier) provides ~200 kSI2k and ~200 TB/year over 622 Mb/s links, holds the full AOD, TAG and relevant physics-group summary data, does the bulk of simulation, and serves ~20 physicists working on one or more channels. Some data for calibration and monitoring flows to the institutes and calibrations flow back; a 2004 desktop PC is ~1 kSpecInt2k.

7 RWL Jones, Lancaster University A Tiered Grid [diagram]: the LHC Computing Facility runs from the Tier 0 at CERN (plus a CERN analysis facility, T1AF), through 10 Tier-1 centres (e.g. Taipei ASCC, UK, France, Italy, Germany, NL, USA Brookhaven, ...), to ~30 Tier-2 centres (in the UK: NorthGrid, SouthGrid, LondonGrid, ScotGrid), down to physics department desktops.

8 RWL Jones, Lancaster University Processing Roles  Tier-0:  First pass processing on express/calibration physics stream  hours later, process full physics data stream with reasonable calibrations  Curate the full RAW data and curate the first pass processing  Getting good calibrations in a short time is a challenge  There will be large data movement from T0 to T1s  Tier-1s:  Reprocess 1-2 months after arrival with better calibrations  Curate and reprocess all resident RAW at year end with improved calibration and software  Curate and allow access to all the reprocessed data sets  There will be large data movement from T1 to T1 and significant data movement from T1 to T2  Tier-2s:  Simulation will take place at T2s  Simulated data stored at T1s  There will be significant data movement from T2s to T1s  Need partnerships to plan networking  Must have fail-over to other sites

9 RWL Jones, Lancaster University Analysis Roles Analysis model broken into two components  Scheduled central production of augmented AOD, tuples & TAG collections from ESD  Done at T1s  Derived files moved to other T1s and to T2s  Chaotic user analysis of (shared) augmented AOD streams, tuples, new selections etc and individual user simulation and CPU-bound tasks matching the official MC production  Done at T2s and local facility  Modest job traffic between T2s

10 RWL Jones, Lancaster University Required CERN Resources [resource table not reproduced]  Includes Heavy Ion data  No replacements included

11 RWL Jones, Lancaster University Required Non-CERN Resources [resource table not reproduced]  The growth is challenging!

12 RWL Jones, Lancaster University Networking – CERN to T1s  The ATLAS Tier 1s will be: ASCC, CCIN2P3-Lyon, RAL, NIKHEF, FZK-Karlsruhe, BNL, PIC, NorduGrid, CNAF, TRIUMF  They vary in size!  Traffic from T0 to the average Tier-1 is ~75 MB/s raw  With LCG headroom, efficiency and recovery factors for service challenges this is ~3.5 Gbps  Most ATLAS T1s are shared with other experiments, so aggregate bandwidth & contention are larger
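The scaling from nominal rate to provisioned link speed used here (and for the T1-to-T1 and T2-to-T1 figures on the following two slides) corresponds to a combined factor of roughly six over the raw rate. A minimal sketch of the arithmetic follows; the split of that factor into separate headroom, efficiency and recovery components is an illustrative assumption, since the slides only give the combined effect.

```python
# Back-of-envelope conversion from a nominal data rate (MB/s) to a provisioned
# link speed (Gbps). The individual factor values are assumptions for this
# sketch; the slides only imply their combined effect of roughly x6.
def provisioned_gbps(nominal_mb_per_s, headroom=1.5, efficiency=2.0, recovery=2.0):
    nominal_gbps = nominal_mb_per_s * 8 / 1000.0   # MB/s -> Gb/s
    return nominal_gbps * headroom * efficiency * recovery

print(provisioned_gbps(75.0))    # T0 -> average T1: ~3.6 Gbps (slide: ~3.5 Gbps)
print(provisioned_gbps(125.0))   # T1 -> T1:         ~6.0 Gbps (slide: 5-6 Gbps)
print(provisioned_gbps(17.5))    # T2 -> T1:         ~0.84 Gbps (slide: ~850 Mbps)
```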

13 RWL Jones, Lancaster University Networking – T1 to T1  Significant traffic of ESD (full reconstructed data) and AOD (analysis summary data) from reprocessing between T1s  ~125 MB/s raw for the average T1 (pp only)  5-6 Gbps for the average T1 after the usual factors

14 RWL Jones, Lancaster University Networking and Tier-2s  Tier-2 to Tier-1 networking requirements are far more uncertain  Without job traffic, ~17.5 MB/s for the 'average' T2  ~850 Mbps required?

15 RWL Jones, Lancaster University Issues  The following issues are mainly, but not exclusively, derived from LCG-2 experience  No-one should feel they are being criticised  Much help has been given and is appreciated  It is acknowledged that new tools are coming that address some of the issues  LCG is a major resource for ATLAS, doing serious work

16 RWL Jones, Lancaster University Grid Projects  Until the deployments provide interoperability, the experiments must provide it themselves  ATLAS must span 3 major Grid deployments

17 RWL Jones, Lancaster University Grid Deployments ATLAS must span at least three major Grid deployments:  LCG2  Common to all LHC experiments  LCG2 now rolling out  Much improved job success rate  Grid3/Open Science Grid  Demonstrated success of the grid computing model for HEP  Developing & deploying grid middleware and applications  Wrap layers around apps, simplify deployment  Very important tools for data management (MAGDA) and software installation (pacman)  Will evolve into a fully functioning, scalable, distributed, tiered grid  NorduGrid  A very successful regional test bed  Light-weight Grid user interface, middleware, working prototypes etc.  Now to be part of the Northern European Grid in EGEE

18 RWL Jones, Lancaster University ATLAS Production System [diagram]: a supervisor (Windmill) reads job definitions from the production database (ProdDB) and hands them, via jabber/SOAP messaging, to one executor per flavour: Lexor for LCG, Dulcinea for NorduGrid, Capone for Grid3, plus a legacy LSF executor; the data management system (Don Quijote) and the RLS catalogues handle file movement, and the AMI database holds metadata. A big problem is data management: we must cope with >= 3 Grid catalogues, and the demands are even greater for analysis.
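A rough sketch of that supervisor/executor split is below; the class and method names are invented for illustration and are not the actual Windmill, Lexor, Dulcinea or Capone interfaces.

```python
# Minimal sketch of the supervisor/executor pattern behind the production
# system: one supervisor pulls job definitions from the production database
# and dispatches each to a Grid-flavour-specific executor.
# All names are illustrative, not the real production-system APIs.
from abc import ABC, abstractmethod


class Executor(ABC):
    """One executor per Grid flavour (LCG, NorduGrid, Grid3, local batch)."""

    @abstractmethod
    def submit(self, job: dict) -> str:
        """Submit one job definition and return a flavour-specific handle."""


class LCGExecutor(Executor):
    def submit(self, job: dict) -> str:
        # A real executor would build JDL and call the LCG Resource Broker.
        return f"lcg-{job['id']}"


class NorduGridExecutor(Executor):
    def submit(self, job: dict) -> str:
        return f"ng-{job['id']}"


class Supervisor:
    """Reads pending jobs (here just a list) and routes them to executors."""

    def __init__(self, executors):
        self.executors = executors

    def dispatch(self, pending_jobs):
        return [self.executors[j["flavour"]].submit(j) for j in pending_jobs]


if __name__ == "__main__":
    sup = Supervisor({"lcg": LCGExecutor(), "ng": NorduGridExecutor()})
    print(sup.dispatch([{"id": 1, "flavour": "lcg"}, {"id": 2, "flavour": "ng"}]))
```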

19 RWL Jones, Lancaster University Data Management  Multiple deployments currently mean multiple file catalogues  ATLAS builds a meta-catalogue on top to join things up  Convergence would really help!  Existing data movement tools are not sophisticated in all three deployments  Anything beyond a single point-to-point movement has required user-written tools, e.g. Don Quijote, PhEDEx…  Data transfer scheduling is vital for large transfers  It should also take care of data integrity and retries  File transfer service tools are emerging from the deployments but…  Tools must exist in and/or span all three Grids  Other observations  Existing DM tools are too fragile: they don't have timeouts (so can hang a node) or retries and are not resilient against information system glitches  They must also handle race conditions if there are multiple writes to a file  Catalogue failures and inconsistencies need to be well handled
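A sketch of the kind of defensive transfer wrapper the slide is asking for (timeout, bounded retries, integrity check) is shown below; the transfer command and checksum scheme are placeholders, not any specific ATLAS or LCG tool.

```python
# Sketch of a defensive transfer wrapper with a timeout, bounded retries with
# back-off, and an integrity check: the behaviour the slide says existing data
# movement tools lacked. The transfer command and checksum are placeholders.
import hashlib
import subprocess
import time


def md5_of(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def transfer_with_retries(cmd, dest_path, expected_md5=None,
                          retries=3, timeout_s=600):
    for attempt in range(1, retries + 1):
        try:
            # The timeout stops a hung transfer from blocking the node forever.
            subprocess.run(cmd, check=True, timeout=timeout_s)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            time.sleep(30 * attempt)   # simple linear back-off before retrying
            continue
        if expected_md5 is None or md5_of(dest_path) == expected_md5:
            return True                # transfer succeeded and file is intact
    return False


# Hypothetical usage: any point-to-point copy command could go here.
# transfer_with_retries(["globus-url-copy", src_url, dest_url], "/data/file.root")
```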

20 RWL Jones, Lancaster University Workload Management  Resource brokerage is still an important idea when serving multiple VOs  ATLAS has tried to use the Resource Broker ‘as intended’  However, submission to Resource Broker in LCG is slow (although much improved thanks to work on both sides)  Bulk job submission is needed in cases of shared input  Fair usage policies need to be clearly defined and enforced  Example: using a dummy job to obtain slots then pulling in a real job is very smart and helps get around RB submission problems  But is it reasonable? Pool-table analogy…  Ranking is an art not a science!  Default LCG-2 ranking is not very good!  Matchmaking with input files is not optimal  Would want to add information on replicas in the site ranking  Currently, if input files are specified, sites with at least one replica of any of them are always preferred, even if they have much less CPU than sites with no replicas.  Important to know if file is on tape or disk
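A toy example of the replica-aware ranking the slide asks for is sketched below; the weights and formula are purely illustrative and are not the default LCG-2 rank expression.

```python
# Toy site-ranking function combining free CPU capacity with local replica
# counts, illustrating the matchmaking improvement requested above.
# The weighting scheme is an invented example, not the LCG-2 default rank.
def rank_site(free_cpus, waiting_jobs, local_replicas, input_files,
              replica_weight=0.5):
    cpu_term = free_cpus / (waiting_jobs + 1)          # favour idle capacity
    data_term = local_replicas / max(input_files, 1)   # fraction of inputs already local
    return (1 - replica_weight) * cpu_term + replica_weight * data_term


# A site with many free CPUs but no replicas can still outrank a heavily loaded
# site holding one replica -- unlike the behaviour criticised on the slide.
print(rank_site(free_cpus=200, waiting_jobs=10, local_replicas=0, input_files=4))
print(rank_site(free_cpus=5, waiting_jobs=50, local_replicas=1, input_files=4))
```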

21 RWL Jones, Lancaster University Workload Management (2)  At present, we are exploiting LCG resources both with and without using the RB  Condor-G job submission interfaced to LCG (see Rod Walker's talk): submission is faster, destination chosen by the VO  No central logging and bookkeeping services  Choosing suitable sites for the BDII is also an art  Requiring the test suite to succeed every day is too restrictive (3rd-party failures etc.)  White-listing helps, but loses responsiveness to genuine failures  The current approach is labour intensive and not scalable

22 RWL Jones, Lancaster University User Interaction  Running jobs  The user needs global query commands for all her submitted jobs  Really need to be able to tag jobs into groups  Needs access to STDOUT/STDERR while the job is running  Information system  Should span Grids!  Should be accurate (e.g. evaluate free CPUs correctly)  Should be robust  Should have a better user interface than ldapsearch!  Debugging  When a job fails, the information is not sufficient  Logging info is unfriendly and incomplete  Log files on the RB and CE are not accessible to the ordinary user and are difficult for the support team to access  There is huge scope for development here!
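As a small illustration of the "tag jobs into groups and query them globally" request, a possible bookkeeping structure is sketched below; the interface is entirely hypothetical and calls no real Grid middleware.

```python
# Toy illustration of tagging submitted jobs into named groups and issuing a
# single global status query across all of them, as requested above.
from collections import defaultdict


class JobBook:
    def __init__(self):
        self.groups = defaultdict(list)   # group name -> list of job IDs
        self.status = {}                  # job ID -> last known status

    def submit(self, group, job_id):
        self.groups[group].append(job_id)
        self.status[job_id] = "Submitted"

    def query(self, group=None):
        """Status of one tagged group, or of every submitted job if group is None."""
        ids = self.groups[group] if group else list(self.status)
        return {j: self.status[j] for j in ids}


book = JobBook()
book.submit("susy-scan", "job-001")
book.submit("susy-scan", "job-002")
book.submit("calibration", "job-003")
print(book.query("susy-scan"))   # only the tagged SUSY-scan jobs
print(book.query())              # global view of all submitted jobs
```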

23 RWL Jones, Lancaster University Site Issues  Sysadmin education and communication  Many people need to be trained  Example: storage element usage  The site default is 'permanent', but many sysadmins think the storage is effectively a scratch area  Need clear lines of communication to VOs and to Grid Operations  Unannounced down times!  Policies!  Site misconfiguration  It is the cause of most problems  Debugging site problems is very time-consuming  Common misconfigurations:  Wrong information published to the information system;  Worker Node disk space becomes full;  Incorrect firewall settings, GLOBUS_TCP_PORT_RANGE, ssh keys not synchronised, NFS stale mounts, etc.;  Incorrect configuration of the experiment software area;  Bad middleware installation.
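A sketch of the kind of automated pre-flight check a site or VO could run to catch some of the misconfigurations listed above; the paths and thresholds are illustrative assumptions, not an official ATLAS or LCG tool.

```python
# Illustrative worker-node sanity check covering a few of the common
# misconfigurations listed above. Paths and thresholds are assumptions.
import os
import shutil


def check_worker_node(software_area="/opt/exp_software/atlas", min_free_gb=10.0):
    problems = []

    # Worker Node disk space becomes full
    free_gb = shutil.disk_usage("/tmp").free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free in /tmp")

    # Incorrect firewall settings: port range for Globus callbacks not defined
    if not os.environ.get("GLOBUS_TCP_PORT_RANGE"):
        problems.append("GLOBUS_TCP_PORT_RANGE is not set")

    # Incorrect configuration of the experiment software area
    if not os.path.isdir(software_area) or not os.access(software_area, os.R_OK):
        problems.append(f"experiment software area {software_area} missing/unreadable")

    return problems


if __name__ == "__main__":
    for p in check_worker_node():
        print("WARNING:", p)
```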

24 RWL Jones, Lancaster University Reality check  The requirement is not completely matched by the current pledges  The table presents ATLAS's estimate of the lower bound on the Tier-1 resources available  Recent indications suggest the CPU shortage may be met, but the storage remains a problem  Recovery strategies reduce access to data  The disk shortage is serious  Snapshot of 2008 Tier-1 status (summed over Tier-1s, ATLAS split for 2008; most offered/required values not reproduced): CPU (kSI2k) balance -24%; Disk (TBytes) balance -35%; Tape (PBytes) required 10.1, balance 3%

25 RWL Jones, Lancaster University Conclusions  The Grid is the only practical way to function as a world-wide collaboration  First experiences have inevitably had problems, but we have real delivery  Slower than desirable  Problems of coherence  But a PhD student can submit 2500 simulation jobs at once and have them complete within 48 hours with a 95.3% success rate  (all failures were '/nfs not accessible')  Real tests of the computing models continue  Analysis on the Grid is still to be seriously demonstrated  Calibration and alignment procedures need to be brought in  Resources are an issue  There are shortfalls with respect to the initial requirements  The requirements grow with time  This may be difficult for Tier-2s especially

26 RWL Jones, Lancaster University A Final Thought  “It is amazing what can be achieved when you do not care who will get the credit”  Harry S. Truman