
ATLAS Report 14 April 2010 RWL Jones

The Good News  At the IoP meeting before Easter, Dave Charlton said the best thing about the Grid was that there was nothing to say about it  It is good that he thinks so!  But it is also a little optimistic!  Still, stick with the good news for now…  People are doing real work on the Grid, and in large numbers

Data throughput  [Chart: total ATLAS data throughput via the Grid (Tier-0, Tier-1s, Tier-2s), MB/s per day, annotated with beam splashes, first collisions, cosmics and end of data-taking]  ATLAS was pumping data out at a high rate, despite the low luminosity  We are limited by trigger rate and event size  We also increased the number of first-data replicas

900 GeV reprocessing  We are reprocessing swiftly and often  We need prompt and continuous response  This dataset is tiny compared with what is to come!

Reprocessed data to UK  Only one outstanding data-transfer issue for the UK by the end of the Christmas period

Tier 2 Nominal Shares 2010

Event data flow to the UK  Data distribution has been very fast despite the event size  Most datasets are available in Tier 2s within a couple of hours

High Energy Data Begins  [Charts: UK T2 throughput (MB/s), UK T2 transfer volume (GB), UK T1 throughput (MB/s)]

Analysis  This is going well in general  Previously recognised good sites are generally doing well  The workload is uneven, depending on data distribution  This is not yet in 'equilibrium' as the data is sparse  Remember, datasets for specific streams go to specific sites, so usage reflects those hosting minimum-bias samples  However, the fact that UK sites were favoured (e.g. Glasgow) also reflects good performance  The 7 TeV data will move things closer to equilibrium

Data Placement  There are issues in the current ATLAS distribution  There should be a full set in each cloud  This has not always happened because of a bug  We need to be more responsive to site performance & capacity  At the moment, the UK has been patching in extra copies 'manually'  ATLAS has followed UK ideas and has introduced 'primary' and 'secondary' copies of datasets  Secondary copies live only while space permits  This improves access – the UK typically has 1.6 copies
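The primary/secondary policy above can be sketched in a few lines. This is a hypothetical illustration, not the real DDM implementation: the function name, record fields and thresholds are all invented, and the only fixed rule taken from the slide is that secondaries exist on a space-permitting basis while primaries are never deleted.

```python
# Hypothetical sketch of 'secondary copies live only while space permits':
# evict secondary replicas, oldest first, until enough disk is free.
# Field names ('kind', 'size_tb', 'age_days') are illustrative only.

def evict_secondaries(replicas, free_tb, needed_tb):
    """replicas: list of dicts with 'name', 'kind', 'size_tb', 'age_days'.
    Returns (names_evicted, free_tb_after). Primaries are never touched."""
    evicted = []
    secondaries = sorted(
        (r for r in replicas if r["kind"] == "secondary"),
        key=lambda r: r["age_days"], reverse=True)  # oldest evicted first
    for r in secondaries:
        if free_tb >= needed_tb:
            break
        free_tb += r["size_tb"]
        evicted.append(r["name"])
    return evicted, free_tb
```

Run against a toy replica catalogue, only the secondaries are removed, which matches the slide's warning that second copies can disappear *at little notice*.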

The UK and Data Placement  The movement and placement of data must be managed  Overload of the data management system slows the system down for everyone  Unnecessary multiple copies waste disk space and will prevent a full set being available  Some multiple copies will be a good idea to balance loads  We have a group for deciding the data placement:  UK Physics Co-ordinator, UK deputy spokesman, Tony Doyle (UK data ops), Roger Jones (UK ops), aided by Stewart, Love & Brochu  The UK Physics Co-ordinator consults the institute physics reps  The initial data plan follows the matching of trigger type to site from previous exercises  We will make second copies until we run short of space, then the second copies will be removed *at little notice*

But dataset x is not in the UK  In general, this should not be the case, unless it is RAW  Access it elsewhere (unless it is RAW or less-popular ESD)  The job goes to the data, not the data to you  We can copy small amounts to the UK on request  E.g. my Higgs candidate in RAW or ESD  But we must manage it - specify  What the need for the data is (activity, which physics and performance group)  Why it is not already covered by a physics or performance group area  How big it will be *at a maximum*  How the data will be used (what sort of job will be run, database access etc.)  We are still surprised to see requests for datasets that are freely available on the Grid in the UK to be copied to 'their' Tier 2  Local requests should go to Tier 3 (non-GridPP) space

Site responsibilities  Sites are either  Supporting important datasets  Supporting physics groups  Reliability is vital  We need to be in the high 90s at least  This means paying close attention to ATLAS monitoring, not just to SAM tests – they are not enough  The switch to a SAM/Nagios-based system is potentially useful, but many bugs remain to be ironed out  Sites simply have to look pro-actively at the ATLAS dashboards (blame this on the infrastructure people again)  We are reviewing the middleware, but the sites must play their part  Local monitoring is important  It should not be users who spot site problems first!  ATLAS is working to help with this…
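The point that SAM tests alone are not enough can be made concrete with a tiny sketch. This is not real ATLAS or WLCG tooling: the function, thresholds and metric names are assumptions, chosen only to illustrate that a site should count as healthy only when both the SAM availability and an ATLAS-side job metric clear their targets.

```python
# Illustrative only (not real ATLAS monitoring code): a site is 'OK'
# only if BOTH the SAM availability and the ATLAS dashboard job
# efficiency pass. Thresholds reflect the "high 90s" goal on the slide.

def site_ok(sam_availability, atlas_job_efficiency,
            sam_min=0.95, atlas_min=0.90):
    """Both checks must pass: passing SAM alone is not enough."""
    return sam_availability >= sam_min and atlas_job_efficiency >= atlas_min
```

A site with perfect SAM results but failing user jobs would be flagged, which is exactly the gap the slide is warning about.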

Monitoring & Validation  ATLAS is working to improve the monitoring  Learn more from the user jobs:  We currently focus on "active" probing of the sites  But "passive" yet automatic observation of user jobs would give a better understanding of what is happening at the sites  The current ADC metrics for analysis are the HammerCloud tests using the GangaRobot  These tests are heavy but fairly reliable  They reflect the computing model and needs in the data-taking era  Reminder:  About 55% of CPU is for ATLAS-wide analysis  About 100% of disk is for ATLAS-wide analysis  About 0% of either is for local use!

GangaRobot Today  ~8 tests per site per day with a mix of:  A few different releases  Different access modes  MC and real data  Conditions DB access  All are defined/prepared by Johannes Elmsheuser  Results appear on the GR web pages and in SAM  Non-critical; sites usually ignore it  Auto-blacklisting on EGEE/WMS  A twice-daily report is sent to DAST containing:  Sites with failures  Sites with no job submitted (brokerage error, e.g. no release)
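The reporting logic above can be sketched as a small classifier. This is a hypothetical illustration, not GangaRobot's actual code: the function and the three labels are invented, and only the two reported conditions (sites with failures, sites with no job submitted) come from the slide.

```python
# Hypothetical sketch of the GangaRobot-style site classification:
# no jobs at all -> 'no-submission' (e.g. brokerage error, missing
# release); all tests failed -> 'blacklist'; otherwise 'online'.
# Labels and function name are invented for illustration.

def classify_site(test_results):
    """test_results: list of 'ok'/'failed' outcomes for one site's tests."""
    if not test_results:
        return "no-submission"  # nothing brokered to the site
    if all(r == "failed" for r in test_results):
        return "blacklist"      # every functional test failed
    return "online"
```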

ATLAS Validation – GR2  A new tool, GR2, is under development to validate sites  Lighter load on sites – GR2 is HC in 'gentle mode'  Concept of test templates (release, analysis, dataset pattern, [sites])  Defined by ADC  Still has bugs  Installations need to be clearly defined and installed  Test samples need to be in place  This will almost certainly be the framework for future metrics  The metrics themselves require more experience to define
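The test-template concept (release, analysis, dataset pattern, optional site list) can be sketched as a simple record. This is a minimal sketch under stated assumptions: the field names and the glob-style dataset matching are guesses for illustration, not GR2's real schema.

```python
# Minimal sketch of a GR2-style test template. Field names are
# assumptions; only the four ingredients (release, analysis,
# dataset pattern, optional site list) come from the slide.
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional, List

@dataclass
class TestTemplate:
    release: str
    analysis: str
    dataset_pattern: str            # glob-style pattern, assumed
    sites: Optional[List[str]] = None  # None = run at all sites

    def matches(self, dataset_name: str) -> bool:
        """True if this template's test should run on the dataset."""
        return fnmatch(dataset_name, self.dataset_pattern)
```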

Installations  Our sites have apparently been 'off' because of missing releases  ATLAS central is also slow in responding to problems with non-CERN set-ups  A major clean-up is underway  An auto-retry installer is under development

PanDA & WMS  There are now two distinct groups of users  Those who use the PanDA back-end  Those who use the WMS  There is less monitoring of the WMS, and less control  Some (e.g. Graeme) favour a tight focus on the PanDA approach  I am not sure this is possible  However, ATLAS clearly has more feedback and more control if that route is taken  Do not be surprised!

Middleware  Sites cannot be made 100% reliable with the current middleware  Many options are being considered  In particular, data management may be reduced from three layers to two  This would effectively remove the LFC  Radical options are also being considered  BUT ATLAS is involved in recognizing the limitations of today's system and making it work

Conclusion  We are now finally dealing with real data  We are still learning  We must all keep working hard to make things run smoothly  Many thanks for everyone's effort so far  But the work continues for 20 years!  The UK has been heavily used and involved in the first physics studies  This is partly because of data location  But also because we are a reliable cloud  We can all celebrate this at the dinner tonight  But please keep an eye on your sites on your smartphones!