1
DØ Computing Model and Operational Status
Gavin Davies, Imperial College London
Run II Computing Review, September 2005
2
Outline
- Operational status
  - Globally – continue to do well
- DØ computing model & data flow
  - Ongoing, long-established plan
- Highlights from the last year
  - Algorithms
  - SAM-Grid & reprocessing of Run II data: 10^9 events reprocessed on the grid – largest HEP grid effort
- Looking forward
  - Budget request
  - Manpower
- Conclusions
3
Snapshot of Current Status
- Reconstruction keeping up with data taking
- Data handling is performing well
- Production computing is off-site and grid-based; it continues to grow & work well
  - Over 75 million Monte Carlo events produced in the last year
  - Run IIa data set being reprocessed on the grid – 10^9 events
- Analysis CPU power has been expanded
- Globally doing well
4
Computing Model
- Started with distributed computing, evolving to automated use of common tools/solutions on the grid (SAM-Grid) for all tasks
  - Scalable
  - Not alone – joint effort with CD and others, LHC …
- 1997 – original plan
  - All Monte Carlo to be produced off-site
  - SAM to be used for all data handling, providing a 'data-grid'
- Now: Monte Carlo and data reprocessing with SAM-Grid
- Next: other production tasks, e.g. fixing, and then user analysis (~in order of increasing complexity)
5
Reminder of Data Flow
- Data acquisition (raw data in evpack format)
  - Currently limited to a 50 Hz Level-3 accept rate
  - Request increase to 100 Hz, as planned for Run IIb – see later
- Reconstruction (tmb/DST in evpack format)
  - Additional information in tmb → tmb++ (DST format stopped)
  - Sufficient for 'complex' corrections, including track fitting
- Fixing (tmb in evpack format)
  - Improvements/corrections coming after the cut of a production release
  - Centrally performed
- Skimming (tmb in evpack format)
  - Centralised event streaming based on reconstructed physics objects
  - Selection procedures regularly improved
- Analysis (output: root histograms)
  - Common root-based Analysis Format (CAF) introduced in the last year
  - tmb format remains
(A schematic restatement of this chain is given below.)
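Purely as an illustration, the chain above can be restated as a small data structure. The stage and format names are taken from the slide; the Stage class and the listing itself are not DØ code, just a sketch.

```python
# Illustrative restatement of the DØ data-flow stages listed above.
# Stage and format names come from the slide; the Stage class and the
# "where" annotations are an assumption for illustration only.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str        # processing step
    in_format: str   # input data format
    out_format: str  # output data format
    where: str       # where the step runs

DATA_FLOW = [
    Stage("Data acquisition", "detector readout", "raw (evpack)",         "online, 50 Hz L3 accept"),
    Stage("Reconstruction",   "raw (evpack)",     "tmb++ (evpack)",       "central farm / SAM-Grid"),
    Stage("Fixing",           "tmb (evpack)",     "tmb (evpack)",         "central"),
    Stage("Skimming",         "tmb (evpack)",     "tmb streams (evpack)", "central"),
    Stage("Analysis",         "tmb / CAF",        "root histograms",      "analysis clusters"),
]

if __name__ == "__main__":
    for s in DATA_FLOW:
        print(f"{s.name:16s}: {s.in_format} -> {s.out_format}  [{s.where}]")
```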
6
Reconstruction
- Central farm
  - Processing & reprocessing (SAM-Grid) with spare cycles
  - Evolving to shared FNAL farms
- Reco timing
  - Significant improvement, especially at higher instantaneous luminosity
  - See Qizhong's talk
7
Highlights – "Algorithms"
- Algorithms reaching maturity
- P17 improvements include
  - Reco speed-up
  - Full calorimeter calibration
  - Fuller description of detector material
- Common Analysis Format (CAF)
  - Limits development of different root-based formats
  - Common object-selection, trigger-selection, normalization tools
  - Simplify, accelerate analysis development
- First 1 fb^-1 analyses by Moriond
- See Qizhong's talk
8
Data Handling - SAM
- SAM continues to perform well, providing a data-grid
  - 50 SAM sites worldwide
  - Over 2.5 PB (50B events) consumed in the last year
  - Up to 300 TB moved per month
  - Larger SAM cache solved tape access issues
- Continued success of SAM shifters
  - Often remote collaborators
  - Form the 1st line of defense
- SAMTV monitors SAM & SAM stations
- http://d0db-prd.fnal.gov/sm_local/SamAtAGlance/
(A toy illustration of the metadata-driven delivery idea follows below.)
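As a conceptual aid only: the core idea on this slide, datasets defined by metadata and files staged into a station cache, can be sketched with a toy catalogue. This is not the real SAM API; the function names, metadata fields and file names below are all invented.

```python
# Toy illustration of "datasets defined by metadata, files delivered to a
# local cache".  NOT the real SAM interface; everything here is hypothetical.

from typing import Dict, List

# Toy file catalogue: file name -> metadata
CATALOGUE: Dict[str, Dict[str, str]] = {
    "d0_run178000_raw.evpack": {"tier": "raw", "trigger": "EM_HI"},
    "d0_run178000_tmb.evpack": {"tier": "tmb", "trigger": "EM_HI"},
    "d0_run178001_tmb.evpack": {"tier": "tmb", "trigger": "MU_HI"},
}

def define_dataset(**query: str) -> List[str]:
    """Return the files whose metadata match the query (a toy 'dataset')."""
    return [f for f, meta in CATALOGUE.items()
            if all(meta.get(k) == v for k, v in query.items())]

def deliver(files: List[str], cache: List[str]) -> None:
    """Pretend to stage files into a local station cache."""
    for f in files:
        if f not in cache:
            cache.append(f)   # a real system would copy from tape or another cache
            print(f"staged {f}")

if __name__ == "__main__":
    station_cache: List[str] = []
    deliver(define_dataset(tier="tmb", trigger="EM_HI"), station_cache)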
9
Remote Production Activities / SAM-Grid
- Monte Carlo
  - Over 75M events produced in the last year, at more than 10 sites
  - More than double the previous year's production
  - Vast majority on shared sites (often national Tier 1 sites – primarily "LCG")
  - SAM-Grid introduced in spring 04, becoming the default
  - Consolidation of SAM-Grid / LCG co-existence
  - Over 17M events produced 'directly' on LCG via submission from Nikhef
- Data reprocessing
  - After significant improvements to reconstruction, reprocess old data
  - P14 – winter 03/04 – from DST – 500M events, 100M off-site
  - P17 – now – from raw – 1B events – SAM-Grid default – basically all off-site
  - Massive task – largest HEP activity on the grid
    - 3200 1 GHz PIIIs for 6 months (back-of-envelope below)
  - Led to significant improvements to SAM-Grid
  - Collaborative effort
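A back-of-envelope check of the quoted scale (the arithmetic is ours, not from the slide): 10^9 events on ~3200 1 GHz PIII-equivalents over ~6 months corresponds to roughly 50 s per event, assuming full utilisation.

```python
# Back-of-envelope check of the slide's reprocessing numbers.
# Inputs (events, CPUs, duration) are from the slide; the arithmetic and
# the assumption of ~180 days of continuous running are ours.

events = 1.0e9     # P17 reprocessing: ~1B events
cpus   = 3200      # ~3200 1 GHz PIII-equivalents
days   = 6 * 30    # ~6 months, assumed fully utilised

events_per_day     = events / days              # ~5.6M events/day overall
events_per_cpu_day = events / (cpus * days)     # ~1.7k events/CPU/day
sec_per_event      = 86400 / events_per_cpu_day # ~50 s/event on a 1 GHz PIII

print(f"overall rate : {events_per_day / 1e6:.1f} M events/day")
print(f"per CPU      : {events_per_cpu_day:.0f} events/day")
print(f"per event    : {sec_per_event:.0f} s on a 1 GHz PIII (ignores inefficiency)")
```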
10
Reprocessing / SAM-Grid - I
- More than 10 DØ execution sites
  - http://samgrid.fnal.gov:8080/
  - http://samgrid.fnal.gov:8080/list_of_schedulers.php
  - http://samgrid.fnal.gov:8080/list_of_resources.php
- SAM – data handling
- JIM – job submission & monitoring
- SAM + JIM → SAM-Grid
11
Reprocessing / SAM-Grid - II
- [Plot: production speed in M events/day, started end of March; ~825 M events done]
- SAM-Grid 'enables' a common environment & operation scripts, as well as effective book-keeping
  - JIM's XML-DB used for monitoring / bug tracing
  - SAM avoids data duplication and defines recovery jobs
- Monitor speed and efficiency – by site or overall (http://samgrid.fnal.gov:8080/cgi-bin/plot_efficiency.cgi); a toy version of such a speed calculation is sketched below
- Comment: tough deploying a product under evolution to new sites (we are a running experiment)
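A toy sketch of the kind of bookkeeping behind such speed plots, i.e. M events/day per site and overall. The site names and numbers are invented for illustration; this is not the actual monitoring code.

```python
# Toy production-speed bookkeeping: M events/day per site and overall.
# All records below are hypothetical.

from collections import defaultdict
from typing import Dict, List, Tuple

# (day, site, events produced that day) -- invented job records
RECORDS: List[Tuple[str, str, int]] = [
    ("2005-04-01", "IN2P3",     1_200_000),
    ("2005-04-01", "GridKa",      900_000),
    ("2005-04-02", "IN2P3",     1_500_000),
    ("2005-04-02", "Wuppertal",   400_000),
]

def speed_per_site(records) -> Dict[str, float]:
    """Average M events/day per site, over the days that site produced anything."""
    days_active: Dict[str, set] = defaultdict(set)
    events: Dict[str, int] = defaultdict(int)
    for day, site, n in records:
        days_active[site].add(day)
        events[site] += n
    return {s: events[s] / len(days_active[s]) / 1e6 for s in events}

if __name__ == "__main__":
    for site, speed in sorted(speed_per_site(RECORDS).items()):
        print(f"{site:10s}: {speed:.2f} M events/day")
    total_days = len({day for day, _, _ in RECORDS})
    overall = sum(n for _, _, n in RECORDS) / total_days / 1e6
    print(f"{'overall':10s}: {overall:.2f} M events/day")
```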
12
SAM-Grid Interoperability
- Need access to greater resources as data sets grow
- Ongoing programme on LCG and OSG interoperability
- Step 1 (co-existence) – use shared resources with a SAM-Grid head-node
  - Widely done for both reprocessing and MC
  - OSG co-existence shown for data reprocessing
- Step 2 – SAMGrid-LCG interface
  - SAM does data handling & JIM job submission
  - Basically a forwarding mechanism (conceptual sketch below)
  - Prototype established at IN2P3/Wuppertal
  - Extending to production level
- OSG activity increasing – build on LCG experience
- Limited manpower
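A conceptual sketch of the "forwarding mechanism" idea: SAM keeps the data handling, JIM keeps submission and monitoring, and execution is handed to a foreign grid (LCG or OSG). This is an illustration of the concept only, not the actual SAMGrid-LCG interface; every class and method name below is invented.

```python
# Conceptual sketch of a forwarding head-node.  All names are hypothetical.

class ToySAM:
    def define_dataset(self, query: str) -> list:
        # Stand-in for a metadata query against a SAM-like catalogue.
        return [f"file_{i}.evpack" for i in range(3)]

class ToyJIM:
    def __init__(self):
        self.jobs = {}
    def record(self, handle: str, info: dict) -> None:
        self.jobs[handle] = info          # monitoring / bookkeeping entry

class ToyForeignGrid:
    def submit(self, job: dict, files: list) -> str:
        # Stand-in for submission to an LCG/OSG compute element.
        return f"lcg-job-{abs(hash(job['name'])) % 10000}"

class ForwardingNode:
    """Forwards execution while keeping SAM/JIM in control of data and bookkeeping."""
    def __init__(self, sam, jim, grid):
        self.sam, self.jim, self.grid = sam, jim, grid
    def run(self, job: dict) -> str:
        files = self.sam.define_dataset(job["dataset_query"])
        handle = self.grid.submit(job, files)          # execution is forwarded
        self.jim.record(handle, {"job": job["name"], "files": len(files)})
        return handle

if __name__ == "__main__":
    node = ForwardingNode(ToySAM(), ToyJIM(), ToyForeignGrid())
    print(node.run({"name": "p17-reprocess", "dataset_query": "tier=raw"}))
```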
13
Looking Forward – Budget Request
- Long-planned increase to 100 Hz for Run IIb
- Experiment performing well
  - Run II average data-taking efficiency ~84%, now pushing 90%
  - Making efficient use of data and resources
  - Many analyses published (have a complete analysis with 0.6 fb^-1 of data)
- "Core" physics program saturates the 50 Hz rate at 1 × 10^32 cm^-2 s^-1
- Maintaining 50 Hz at 2 × 10^32 → an effective loss of 1-2 fb^-1 (rough scaling below)
- http://d0server1.fnal.gov/projects/Computing/Reviews/Sept2005/Index.html
- Increase requires ~$1.5M in FY06/07, and <$1M after
- Details to come in Amber's talk
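One way to see where a 1-2 fb^-1 effective loss could come from (our illustration, not the slide's derivation): if the core program wants ~100 Hz at 2 × 10^32 but only 50 Hz is recorded, roughly half of the core-physics exposure taken at saturated luminosities is lost. The assumed 2-4 fb^-1 taken in that regime is chosen purely so the result reproduces the quoted range.

```python
# Rough scaling behind the "1-2 fb^-1 effective loss" figure on the slide.
# The 50/100 Hz rates and the 1-2 fb^-1 number are from the slide; the
# assumed saturated-luminosity range is ours -- treat it as an illustration.

cap_hz    = 50.0    # current Level-3 accept cap
demand_hz = 100.0   # core-physics demand at 2e32 cm^-2 s^-1
kept_frac = cap_hz / demand_hz               # fraction of core triggers kept = 0.5

for lumi_saturated_fb in (2.0, 4.0):         # assumed fb^-1 taken while rate-saturated
    effective_loss = (1.0 - kept_frac) * lumi_saturated_fb
    print(f"{lumi_saturated_fb:.0f} fb^-1 at saturation -> "
          f"~{effective_loss:.0f} fb^-1 effectively lost")
```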
14
Looking Forward – Manpower - I
- From the directorate Taskforce report on manpower issues…
- Some vulnerability through the limited number of suitably qualified experts in either the collaboration or CD
  - Databases – serve most of our needs, but concern w.r.t. the trigger and luminosity databases
  - Online system – key dependence on a single individual
  - Offline code management, build and distribution systems
- Additional areas where central consolidation of hardware support → reduced overall manpower needs
  - e.g. Level-3 trigger hardware support
  - Under discussion with CD
15
Looking Forward – Manpower - II
- Increased data sets require increased resources
  - Route to these is increased use of the grid and common tools
  - Have an ongoing joint program, but work to do…
- Need effort to
  - Continue development of SAM-Grid
    - Automated production job submission by shifters
    - User analyses
  - Deployment team
    - Bring in new sites in a manpower-efficient manner
  - Full interoperability
    - Ability to access all shared resources efficiently
- Additional resources for the above recommended by the Taskforce
- Support recommendation that some of the additional manpower come via more guest scientists, postdocs and associate scientists
16
Conclusions
- Computing model continues to be successful
- Significant advances this year:
  - Reco speed-up, Common Analysis Format
  - Extension of grid capabilities
    - P17 reprocessing with SAM-Grid
    - Interoperability
- Want to maintain / build on this progress
- Potential issues / challenges being addressed
  - Short term – ongoing action on immediate vulnerabilities
  - Longer term – larger data sets
    - Continued development of common tools, increased use of the grid
    - Continued development of the above in collaboration with others
    - Manpower injection required to achieve reduced effort in steady state, with increased functionality – see Taskforce talk
- Globally doing well
17
Back-up
18
Terms
- Tevatron
  - Approximately equivalent challenge to the LHC in "today's" money
  - Running experiments
- SAM (Sequential Access to Metadata)
  - Well-developed metadata and distributed data replication system
  - Originally developed by DØ & FNAL-CD
- JIM (Job Information and Monitoring)
  - Handles job submission and monitoring (all but data handling)
- SAM + JIM → SAM-Grid – computational grid
- Tools
  - Runjob – handles job workflow management
  - dØtools – user interface for job submission
  - dØrte – specification of runtime needs
19
Reprocessing
20
The Good and Bad of the Grid
- Only viable way to go…
  - Increase in resources (CPU and potentially manpower)
  - Work with, not against, LHC
  - Still limited
- BUT
  - Need to conform to standards – dependence on others…
  - Long-term solutions must be favoured over short-term idiosyncratic convenience, or we won't be able to maintain adequate resources
  - Must maintain production-level service (papers), while increasing functionality
  - As transparent as possible to the non-expert