DØ Computing Model and Operational Status
Gavin Davies, Imperial College London
Run II Computing Review, September 2005
Outline
- Operational status
  - Globally, we continue to do well
- DØ computing model & data flow
  - Ongoing, long-established plan
- Highlights from the last year
  - Algorithms
  - SAM-Grid & reprocessing of Run II data
    - 10^9 events reprocessed on the grid – the largest HEP grid effort
- Looking forward
  - Budget request
  - Manpower
- Conclusions
Snapshot of Current Status
- Reconstruction is keeping up with data taking
- Data handling is performing well
- Production computing is off-site and grid-based; it continues to grow and work well
  - Over 75 million Monte Carlo events produced in the last year
  - Run IIa data set being reprocessed on the grid – 10^9 events
- Analysis CPU power has been expanded
- Globally doing well
Computing Model
- Started with distributed computing, evolving to automated use of common tools/solutions on the grid (SAM-Grid) for all tasks
  - Scalable
  - Not alone – a joint effort with CD and others, LHC ...
- 1997 – original plan:
  - All Monte Carlo to be produced off-site
  - SAM to be used for all data handling, providing a 'data-grid'
- Now: Monte Carlo production and data reprocessing with SAM-Grid
- Next: other production tasks, e.g. fixing, and then user analysis (roughly in order of increasing complexity)
Reminder of Data Flow
- Data acquisition (raw data in evpack format)
  - Currently limited to a 50 Hz Level-3 accept rate
  - Request increase to 100 Hz, as planned for Run IIb – see later
- Reconstruction (tmb/DST in evpack format)
  - Additional information in tmb → tmb++ (DST format stopped)
  - Sufficient for 'complex' corrections, including track fitting
- Fixing (tmb in evpack format)
  - Improvements/corrections that come after the cut of a production release
  - Centrally performed
- Skimming (tmb in evpack format)
  - Centralised event streaming based on reconstructed physics objects
  - Selection procedures regularly improved
- Analysis (output: root histograms)
  - Common root-based Analysis Format (CAF) introduced in the last year
  - tmb format remains
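To make the ordering of the chain above concrete, here is a minimal, purely illustrative sketch of the stages and their output formats. The stage names follow the slide, but the function and record names are invented placeholders, not DØ code.

```python
# Illustrative only: a toy model of the DØ data-flow stages listed above.
# None of these names correspond to real DØ interfaces.

PIPELINE = [
    ("daq",            "raw (evpack)"),     # Level-3 accept, currently 50 Hz
    ("reconstruction", "tmb++ (evpack)"),   # DST dropped; tmb carries extra info
    ("fixing",         "tmb (evpack)"),     # central post-release corrections
    ("skimming",       "tmb (evpack)"),     # centralised physics-object streams
    ("analysis",       "root histograms"),  # CAF-based user analysis
]

def run_pipeline(event):
    """Pass a toy 'event' record through each stage in order."""
    for stage, output_format in PIPELINE:
        event = dict(event, stage=stage, format=output_format)
        print(f"{stage:15s} -> {output_format}")
    return event

if __name__ == "__main__":
    run_pipeline({"run": 1, "event": 1})
```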
Reconstruction
- Central farm
  - Processing, plus reprocessing (SAM-Grid) with spare cycles
  - Evolving to shared FNAL farms
- Reco timing
  - Significant improvement, especially at higher instantaneous luminosity
  - See Qizhong's talk
Highlights – "Algorithms"
- Algorithms reaching maturity
- P17 improvements include:
  - Reco speed-up
  - Full calorimeter calibration
  - Fuller description of detector material
- Common Analysis Format (CAF)
  - Limits development of different root-based formats
  - Common object-selection, trigger-selection and normalization tools
  - Simplify and accelerate analysis development
- First 1 fb⁻¹ analyses by Moriond
- See Qizhong's talk
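As an illustration of what a common ROOT-based analysis format buys, here is a minimal sketch of an analysis loop in PyROOT. The file, tree and branch names ("caf_skim.root", "caf", "em_pt") are invented for the example and are not the actual CAF schema.

```python
# Illustrative only: a minimal ROOT-based analysis loop in the spirit of CAF.
import ROOT

f = ROOT.TFile.Open("caf_skim.root")      # a skimmed CAF-style file (invented name)
tree = f.Get("caf")                       # common analysis tree (invented name)
hist = ROOT.TH1F("em_pt", "EM object p_{T}; p_{T} [GeV]; events", 50, 0.0, 100.0)

for event in tree:                        # PyROOT iteration over tree entries
    # common object selection would normally live in shared tools,
    # not be re-implemented in every analysis
    for pt in event.em_pt:                # hypothetical vector branch
        if pt > 20.0:
            hist.Fill(pt)

hist.SaveAs("em_pt.root")                 # output: a root histogram
```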
Data Handling – SAM
- SAM continues to perform well, providing a data-grid
  - 50 SAM sites worldwide
  - Over 2.5 PB (50 billion events) consumed in the last year
  - Up to 300 TB moved per month
  - Larger SAM cache solved tape-access issues
- Continued success of SAM shifters
  - Often remote collaborators
  - Form the first line of defense
- SAMTV monitors SAM & SAM stations
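The data-grid role of SAM is essentially metadata-driven file delivery: a job asks for a dataset by name, and SAM resolves it to files, stages them from cache or tape, and keeps track of what each consumer has processed. The sketch below shows that consumption pattern against a toy stand-in class written for illustration only; it is not the real SAM interface, and the dataset and file names are invented.

```python
# Illustrative only: the consumption pattern SAM provides to jobs,
# written against a toy stand-in class, NOT the real SAM API.

class ToySAMStation:
    """Resolves a dataset name to files and serves them one at a time,
    remembering what has already been delivered (simple book-keeping)."""

    def __init__(self, catalogue):
        self.catalogue = catalogue        # dataset name -> list of file names
        self.delivered = set()

    def get_next_file(self, dataset):
        for filename in self.catalogue.get(dataset, []):
            if filename not in self.delivered:
                self.delivered.add(filename)
                return filename           # in reality: staged from cache or tape
        return None                       # dataset exhausted

station = ToySAMStation({"p17-raw-example": ["raw_0001.evpack", "raw_0002.evpack"]})
while True:
    f = station.get_next_file("p17-raw-example")
    if f is None:
        break
    print("processing", f)               # a reconstruction job would run here
```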
Remote Production Activities / SAM-Grid
- Monte Carlo
  - Over 75M events produced in the last year, at more than 10 sites
    - More than double the previous year's production
  - Vast majority on shared sites (often national Tier-1 sites, primarily "LCG")
  - SAM-Grid introduced in spring 04, becoming the default
    - Consolidation of SAM-Grid / LCG co-existence
  - Over 17M events produced 'directly' on LCG via submission from Nikhef
- Data reprocessing
  - After significant improvements to reconstruction, reprocess old data
  - P14 – winter 03/04 – from DST – 500M events, 100M off-site
  - P17 – now – from raw – 1B events – SAM-Grid default – basically all off-site
  - Massive task – the largest HEP activity on the grid – GHz PIIIs for 6 months
  - Led to significant improvements to SAM-Grid
  - Collaborative effort
Reprocessing / SAM-Grid - I
- More than 10 DØ execution sites
- SAM – data handling
- JIM – job submission & monitoring
- SAM + JIM → SAM-Grid
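A minimal sketch of the SAM + JIM split, with invented class, method, dataset and site names purely to show the division of responsibility (data handling versus job submission and monitoring); it is not the real SAM-Grid code.

```python
# Illustrative only: the SAM-Grid division of labour, with invented names.

class DataHandling:
    """Stands in for SAM: knows where datasets live and stages their files."""
    def stage(self, dataset):
        print(f"[SAM] staging files for dataset '{dataset}'")
        return [f"{dataset}_{i:04d}.evpack" for i in range(3)]

class JobManagement:
    """Stands in for JIM: submits jobs to execution sites and tracks status."""
    def submit(self, site, executable, input_files):
        print(f"[JIM] submitting {executable} with {len(input_files)} files to {site}")
        return {"site": site, "status": "queued"}

class SAMGrid:
    """SAM + JIM together: a computational grid for DØ production tasks."""
    def __init__(self):
        self.sam, self.jim = DataHandling(), JobManagement()

    def run(self, dataset, site, executable="d0reco"):
        files = self.sam.stage(dataset)
        return self.jim.submit(site, executable, files)

print(SAMGrid().run("p17-raw-example", "example-execution-site"))
```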
Run II 2005 Computing Review11 ~825 Mevents done Reprocessing / SAM-Grid - II Speed ( = production speed in M events / day) SAM-Grid ‘enables’ a common environment & operation scripts as well as effective book-keeping JIM’s XML-DB used for monitoring / bug tracing SAM avoids data duplication + defines recovery jobs Monitor speed and efficiency – by site or overall ( Started end march Comment: Tough deploying a product, under evolution to new sites (we are a running experiment)
SAM-Grid Interoperability
- Need access to greater resources as data sets grow
- Ongoing programme on LCG and OSG interoperability
- Step 1 (co-existence) – use shared resources with a SAM-Grid head-node
  - Widely done for both reprocessing and MC
  - OSG co-existence shown for data reprocessing
- Step 2 – SAMGrid-LCG interface
  - SAM does the data handling and JIM the job submission; basically a forwarding mechanism
  - Prototype established at IN2P3/Wuppertal; extending to production level
  - OSG activity increasing – build on the LCG experience
  - Limited manpower
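A toy sketch of what "forwarding" means in Step 2: a SAM-Grid-style job description is translated into LCG-flavoured JDL text and handed to the other grid's submission layer, while SAM keeps doing the data handling. The dictionary keys, file names and the stubbed submission are examples, not the production SAMGrid-LCG interface.

```python
# Illustrative only: translating a simple job dictionary into LCG-flavoured
# JDL text. Not the actual SAMGrid-LCG forwarding code.

def to_jdl(job):
    """Render a job dictionary as a minimal JDL description."""
    return "\n".join([
        "[",
        f'  Executable    = "{job["executable"]}";',
        f'  Arguments     = "{job["arguments"]}";',
        '  StdOutput     = "job.out";',
        '  StdError      = "job.err";',
        '  OutputSandbox = {"job.out", "job.err"};',
        "]",
    ])

def forward(job):
    """Hand the translated description to the LCG side (stubbed as a print);
    data handling stays with SAM, so no input files appear in the sandbox."""
    print("would submit this JDL to an LCG resource broker:\n" + to_jdl(job))

forward({"executable": "d0_reprocess.sh", "arguments": "p17-example-dataset"})
```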
Looking Forward – Budget Request
- Long-planned increase to 100 Hz for Run IIb
- Experiment performing well
  - Run II average data-taking efficiency ~84%, now pushing 90%
- Making efficient use of data and resources
  - Many analyses published (have a complete analysis with 0.6 fb⁻¹ of data)
- "Core" physics program saturates the 50 Hz rate at 1 × 10^32 cm⁻²s⁻¹
  - Maintaining 50 Hz at 2 × 10^32 → an effective loss of 1-2 fb⁻¹
- Increase requires ~$1.5M in FY06/07, and <$1M after
- Details to come in Amber's talk
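A purely illustrative calculation of how a capped trigger rate turns into an effective luminosity loss. The delivered-luminosity figure below is an assumption for the example, not a number from this talk; it is chosen only to show how the quoted 1-2 fb⁻¹ range can arise.

```python
# Illustrative only: why a fixed 50 Hz cap costs data at higher luminosity.
# If the core physics program fills 50 Hz at luminosity 1x, then at 2x the
# same cap records only half of the core-physics events.
cap_hz = 50.0
saturation_lumi = 1.0     # luminosity (in units of 1x) where 50 Hz saturates
running_lumi = 2.0        # Run IIb-era luminosity (2x)

kept = min(1.0, (cap_hz * saturation_lumi / running_lumi) / cap_hz)
print(f"fraction of core-physics events kept: {kept:.0%}")           # 50%

delivered_fb = 3.0        # ASSUMED fb^-1 taken at high luminosity, for illustration
print(f"effective loss ~ {(1.0 - kept) * delivered_fb:.1f} fb^-1")   # ~1.5 fb^-1
```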
Looking Forward – Manpower - I
- From the directorate Taskforce report on manpower issues...
- Some vulnerability through the limited number of suitably qualified experts in either the collaboration or CD:
  - Databases – serve most of our needs, but concern w.r.t. the trigger and luminosity databases
  - Online system – key dependence on a single individual
  - Offline code management, build and distribution systems
- Additional areas where central consolidation of hardware support → reduced overall manpower needs
  - e.g. Level-3 trigger hardware support
  - Under discussion with CD
Looking Forward – Manpower - II
- Increased data sets require increased resources
  - The route to these is increased use of the grid and common tools
  - Have an ongoing joint program, but work to do...
- Need effort to:
  - Continue development of SAM-Grid
    - Automated production job submission by shifters
    - User analyses
  - Deployment team
    - Bring in new sites in a manpower-efficient manner
  - Full interoperability
    - Ability to access all shared resources efficiently
- Additional resources for the above recommended by the Taskforce
  - Support the recommendation that some of the additional manpower come via more guest scientists, postdocs and associate scientists
Conclusions
- Computing model continues to be successful
- Significant advances this year:
  - Reco speed-up, Common Analysis Format
  - Extension of grid capabilities
    - P17 reprocessing with SAM-Grid
    - Interoperability
- Want to maintain / build on this progress
- Potential issues / challenges being addressed
  - Short term – ongoing action on immediate vulnerabilities
  - Longer term – larger data sets
    - Continued development of common tools, increased use of the grid
    - Continued development of the above in collaboration with others
  - Manpower injection required to achieve reduced effort in steady state, with increased functionality – see Taskforce talk
- Globally doing well
Back-up
Terms
- Tevatron
  - Approximately equivalent challenge to the LHC in "today's" money
  - Running experiments
- SAM (Sequential Access to Metadata)
  - Well-developed metadata and distributed data-replication system
  - Originally developed by DØ & FNAL-CD
- JIM (Job Information and Monitoring)
  - Handles job submission and monitoring (all but data handling)
  - SAM + JIM → SAM-Grid – a computational grid
- Tools
  - Runjob – handles job workflow management
  - dØtools – user interface for job submission
  - dØrte – specification of runtime needs
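To show how the pieces in this glossary fit together conceptually, a toy workflow description is sketched below: a chained workflow (Runjob's role) declares its runtime needs (dØrte's role) and is submitted through a simple front end (dØtools' role). The structure, field names and executables are invented for illustration and do not reproduce the real tools.

```python
# Illustrative only: how the glossary's pieces relate conceptually.
# All names and fields here are invented.

workflow = {
    "name": "mc_production_example",
    "runtime": {                      # dØrte-like runtime specification
        "release": "p17",
        "os": "linux",
        "min_disk_gb": 10,
    },
    "stages": [                       # Runjob-like chained stages
        {"exe": "toy_generator", "events": 250_000},
        {"exe": "toy_simulation", "input": "previous"},
        {"exe": "toy_reco",       "input": "previous"},
    ],
}

def submit(wf):
    """dØtools-like front end: validate and hand the workflow to the grid layer."""
    assert wf["runtime"]["release"], "a release must be specified"
    for stage in wf["stages"]:
        print(f"queueing stage {stage['exe']} for workflow {wf['name']}")

submit(workflow)
```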
Reprocessing
The Good and Bad of the Grid
- The only viable way to go ...
  - Increase in resources (CPU and, potentially, manpower)
  - Work with, not against, the LHC
  - Still limited
- BUT
  - Need to conform to standards – dependence on others...
  - Long-term solutions must be favoured over short-term idiosyncratic convenience
    - Or we won't be able to maintain adequate resources
  - Must maintain a production-level service (papers) while increasing functionality
  - As transparent as possible to the non-expert