
1 DØ Computing Model and Operational Status
Gavin Davies, Imperial College London
Run II Computing Review, September 2005

2 Outline
- Operational status
  - Globally – continue to do well
- DØ computing model & data flow
  - Ongoing, long-established plan
- Highlights from the last year
  - Algorithms
  - SAM-Grid & reprocessing of Run II data
  - 10⁹ events reprocessed on the grid – the largest HEP grid effort
- Looking forward
  - Budget request
  - Manpower
- Conclusions

3 Snapshot of Current Status
- Reconstruction keeping up with data taking
- Data handling is performing well
- Production computing is off-site and grid-based; it continues to grow and work well
- Over 75 million Monte Carlo events produced in the last year
- Run IIa data set being reprocessed on the grid – 10⁹ events
- Analysis CPU power has been expanded
- Globally doing well

4 Computing Model
Started with distributed computing, evolving to automated use of common tools/solutions on the grid (SAM-Grid) for all tasks
- Scalable
- Not alone – joint effort with CD and others, LHC …
1997 – original plan:
- All Monte Carlo to be produced off-site
- SAM to be used for all data handling, providing a 'data-grid'
Now: Monte Carlo and data reprocessing with SAM-Grid
Next: other production tasks, e.g. fixing, and then user analysis (roughly in order of increasing complexity)

5 Reminder of Data Flow
- Data acquisition (raw data in evpack format)
  - Currently limited to a 50 Hz Level-3 accept rate
  - Requesting an increase to 100 Hz, as planned for Run IIb – see later
- Reconstruction (tmb/DST in evpack format)
  - Additional information in tmb → tmb++ (DST format stopped)
  - Sufficient for 'complex' corrections, including track fitting
- Fixing (tmb in evpack format)
  - Improvements / corrections coming after the cut of a production release
  - Performed centrally
- Skimming (tmb in evpack format)
  - Centralised event streaming based on reconstructed physics objects
  - Selection procedures regularly improved
- Analysis (output: ROOT histograms)
  - Common ROOT-based Analysis Format (CAF) introduced in the last year
  - tmb format remains
A conceptual sketch of this chain follows below.
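To make the chain above concrete, here is a minimal, purely illustrative Python sketch of the stages as a pipeline. The stage names follow the slide, but the Event type, field names and selection logic are invented for illustration; the real DØ chain runs as separate production systems (DAQ, reconstruction farms, central fixing and skimming, user analysis), not a single script.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """Toy stand-in for an event record in 'evpack' format (hypothetical)."""
    raw: dict                                   # detector readout (raw data)
    tmb: dict = field(default_factory=dict)     # reconstructed summary (tmb++)
    stream: str = ""                            # skim stream assigned later

def reconstruct(ev: Event) -> Event:
    # Reconstruction: fill the tmb++ record from the raw data.
    ev.tmb = {"tracks": [], "jets": [], "muons": []}    # placeholder objects
    return ev

def fix(ev: Event) -> Event:
    # Fixing: apply centrally produced corrections that arrive after the
    # production release is cut (e.g. improved calibrations).
    ev.tmb["calibrated"] = True
    return ev

def skim(ev: Event) -> Event:
    # Skimming: centralised streaming based on reconstructed physics objects.
    ev.stream = "dimuon" if len(ev.tmb.get("muons", [])) >= 2 else "inclusive"
    return ev

def analyze(events):
    # Analysis: count events per stream, standing in for filling histograms
    # in the common ROOT-based format.
    counts = {}
    for ev in events:
        counts[ev.stream] = counts.get(ev.stream, 0) + 1
    return counts

if __name__ == "__main__":
    raw_events = [Event(raw={"id": i}) for i in range(5)]        # toy DAQ output
    processed = [skim(fix(reconstruct(ev))) for ev in raw_events]
    print(analyze(processed))
```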

6 Reconstruction
- Central farm
  - Processing
  - Reprocessing (SAM-Grid) with spare cycles
  - Evolving to shared FNAL farms
- Reco timing
  - Significant improvement, especially at higher instantaneous luminosity
  - See Qizhong's talk

7 Highlights – "Algorithms"
- Algorithms reaching maturity
- P17 improvements include:
  - Reco speed-up
  - Full calorimeter calibration
  - Fuller description of detector material
- Common Analysis Format (CAF)
  - Limits proliferation of different ROOT-based formats
  - Common object-selection, trigger-selection and normalization tools
  - Simplifies and accelerates analysis development
- First 1 fb⁻¹ analyses by Moriond
- See Qizhong's talk
A generic sketch of analysis on a common ROOT-based format follows below.
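As a rough illustration of what analysis on a common ROOT-based format looks like, here is a generic PyROOT-style sketch. The file name, tree name and branches are entirely hypothetical and are not the actual CAF schema or DØ analysis tools.

```python
# Generic, hypothetical PyROOT sketch: all file, tree and branch names are
# invented for illustration and do not correspond to the real CAF schema.
import ROOT

f = ROOT.TFile.Open("caf_skim_example.root")     # hypothetical input file
tree = f.Get("events")                           # hypothetical tree name

h_pt = ROOT.TH1F("h_pt", "Leading jet pT;pT [GeV];events", 50, 0.0, 250.0)

for entry in tree:                               # PyROOT entry loop
    # In practice, common object-selection, trigger-selection and
    # normalization tools would be applied here; this is a placeholder
    # cut on invented branches.
    if entry.trigger_pass and entry.njets > 0:
        h_pt.Fill(entry.jet_pt[0])

out = ROOT.TFile("histograms.root", "RECREATE")
h_pt.Write()
out.Close()
```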

8 Data Handling – SAM
SAM continues to perform well, providing a data-grid
- 50 SAM sites worldwide
- Over 2.5 PB (50B events) consumed in the last year
- Up to 300 TB moved per month
- Larger SAM cache solved tape-access issues
Continued success of SAM shifters
- Often remote collaborators
- Form the first line of defense
SAMTV monitors SAM & SAM stations
http://d0db-prd.fnal.gov/sm_local/SamAtAGlance/
A quick consistency check on these throughput figures follows below.
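A quick back-of-the-envelope check on the figures above, assuming decimal units (1 PB = 10^15 bytes) and a 30-day month; these conventions are assumptions, not slide figures.

```python
# Back-of-the-envelope check on the SAM throughput figures quoted above.
consumed_bytes  = 2.5e15        # ~2.5 PB consumed in the last year
consumed_events = 50e9          # ~50 billion events
monthly_moved   = 300e12        # up to ~300 TB moved per month

avg_event_size = consumed_bytes / consumed_events     # bytes per event
sustained_rate = monthly_moved / (30 * 24 * 3600)     # bytes per second

print(f"average event size ~ {avg_event_size/1e3:.0f} kB")         # ~50 kB
print(f"peak sustained transfer ~ {sustained_rate/1e6:.0f} MB/s")  # ~116 MB/s
```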

9 Remote Production Activities / SAM-Grid
Monte Carlo
- Over 75M events produced in the last year, at more than 10 sites
- More than double last year's production
- Vast majority on shared sites (often national Tier-1 sites – primarily "LCG")
- SAM-Grid introduced in spring 2004, becoming the default
- Consolidation of SAM-Grid / LCG co-existence
- Over 17M events produced 'directly' on LCG via submission from Nikhef
Data reprocessing
- After significant improvements to reconstruction, reprocess the older data
- P14 – winter 2003/04 – from DST – 500M events, 100M off-site
- P17 – now – from raw – 1B events – SAM-Grid the default – essentially all off-site
- Massive task – the largest HEP activity on the grid
  - 3200 1 GHz PIIIs for 6 months (a rough scale estimate is sketched below)
- Led to significant improvements to SAM-Grid
- Collaborative effort
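For scale, the quoted numbers imply roughly the following per-CPU load, assuming full-time availability over a 180-day period (an idealisation rather than a slide figure).

```python
# Rough scale of the P17 reprocessing: 10^9 events on ~3200 1 GHz
# PIII-equivalent CPUs over ~6 months, assuming 100% availability.
events      = 1e9
cpus        = 3200
days        = 6 * 30
cpu_seconds = cpus * days * 86400

print(f"~{events/days/1e6:.1f} M events/day overall")          # ~5.6 M/day
print(f"~{cpu_seconds/events:.0f} s per event per 1 GHz CPU")  # ~50 s/event
```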

10 Reprocessing / SAM-Grid – I
- More than 10 DØ execution sites
- SAM – data handling
- JIM – job submission & monitoring
- SAM + JIM → SAM-Grid
http://samgrid.fnal.gov:8080/
http://samgrid.fnal.gov:8080/list_of_schedulers.php
http://samgrid.fnal.gov:8080/list_of_resources.php

11 Reprocessing / SAM-Grid – II
[Plot: production speed in M events/day – reprocessing started end of March, ~825M events done]
SAM-Grid enables a common environment and operation scripts, as well as effective book-keeping
- JIM's XML-DB used for monitoring / bug tracing
- SAM avoids data duplication and defines recovery jobs
- Speed and efficiency monitored – by site or overall (http://samgrid.fnal.gov:8080/cgi-bin/plot_efficiency.cgi); a minimal bookkeeping sketch follows below
Comment:
- Deploying a product that is still evolving to new sites is tough (we are a running experiment)
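A minimal sketch of the kind of bookkeeping behind the speed and efficiency monitoring mentioned above. The site names, numbers and data layout are invented for illustration; the real monitoring uses JIM's XML database and the plot_efficiency.cgi page linked on the slide.

```python
# Illustrative speed/efficiency bookkeeping; all values are invented.
daily_reports = [
    # (site, events_submitted, events_successfully_reprocessed) for one day
    ("IN2P3",     1_200_000, 1_100_000),
    ("Wuppertal",   800_000,   760_000),
    ("FNAL-farm", 2_500_000, 2_400_000),
]

submitted = sum(s for _, s, _ in daily_reports)
done      = sum(d for _, _, d in daily_reports)

print(f"overall speed: {done/1e6:.2f} M events/day")
print(f"overall efficiency: {100*done/submitted:.1f}%")
for site, s, d in daily_reports:
    print(f"  {site:10s} speed {d/1e6:.2f} M/day, efficiency {100*d/s:.1f}%")
```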

12 SAM-Grid Interoperability
- Need access to greater resources as data sets grow
- Ongoing programme on LCG and OSG interoperability
- Step 1 (co-existence) – use shared resources with a SAM-Grid head-node
  - Widely done for both reprocessing and MC
  - OSG co-existence demonstrated for data reprocessing
- Step 2 – SAMGrid-LCG interface
  - SAM does the data handling, JIM the job submission
  - Essentially a forwarding mechanism (see the sketch below)
  - Prototype established at IN2P3/Wuppertal
  - Being extended to production level
- OSG activity increasing – building on the LCG experience
- Limited manpower
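To illustrate the forwarding idea in Step 2 conceptually: SAM stays in charge of the data while the job itself is handed on to a shared LCG resource. Every class and function in this sketch is hypothetical; it is not the actual SAM-Grid, JIM or LCG interface.

```python
# Conceptual sketch only: hypothetical names, not real SAM-Grid/LCG calls.
from dataclasses import dataclass

@dataclass
class Job:
    dataset: str        # SAM dataset to be processed
    executable: str     # production executable (e.g. a reprocessing step)
    site: str = ""      # filled in when the job is forwarded

def stage_input_with_sam(dataset: str) -> list:
    # Stand-in for SAM delivering the files of a dataset to the worker node.
    return [f"/cache/{dataset}/file_{i:03d}.evpack" for i in range(3)]

def forward_to_lcg(job: Job, lcg_site: str) -> Job:
    # Stand-in for the forwarding mechanism: the job is submitted to a
    # shared LCG resource rather than a dedicated DØ farm.
    job.site = lcg_site
    return job

if __name__ == "__main__":
    job = forward_to_lcg(Job("p17_reprocess_example", "d0reco"), "LCG.IN2P3.fr")
    files = stage_input_with_sam(job.dataset)
    print(f"job '{job.executable}' forwarded to {job.site}, "
          f"{len(files)} input files staged by SAM")
```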

13 Looking Forward – Budget Request
- Long-planned increase to 100 Hz for Run IIb
- Experiment performing well
  - Run II average data-taking efficiency ~84%, now pushing 90%
  - Making efficient use of data and resources
  - Many analyses published (a complete analysis exists with 0.6 fb⁻¹ of data)
- "Core" physics programme saturates the 50 Hz rate at 1 × 10³² cm⁻²s⁻¹
- Maintaining 50 Hz at 2 × 10³² → an effective loss of 1-2 fb⁻¹ (see the estimate sketched below)
  - http://d0server1.fnal.gov/projects/Computing/Reviews/Sept2005/Index.html
- Increase requires ~$1.5M in FY06/07, and <$1M thereafter
- Details to come in Amber's talk
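A back-of-the-envelope reading of the 50 Hz versus 100 Hz argument, assuming the core-physics trigger rate scales roughly linearly with instantaneous luminosity (an assumption made here for illustration, not a number from the slide).

```python
# Illustrative estimate: fraction of core-physics events kept if the Level-3
# accept rate stays capped at 50 Hz while the luminosity doubles.
rate_at_1e32 = 50.0          # Hz: core physics saturates 50 Hz at 1e32 cm^-2 s^-1
lumi_ratio   = 2.0           # running at 2e32 instead of 1e32

wanted_rate = rate_at_1e32 * lumi_ratio   # ~100 Hz of core-physics candidates
capped_rate = 50.0                        # what a 50 Hz cap actually records

kept_fraction = capped_rate / wanted_rate
print(f"fraction of core-physics events kept at 2e32 with a 50 Hz cap: "
      f"{kept_fraction:.0%}")   # ~50%, i.e. an effective loss of integrated
                                # luminosity during high-luminosity running
```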

14 Looking Forward – Manpower – I
From the directorate Taskforce report on manpower issues:
- Some vulnerability through the limited number of suitably qualified experts in either the collaboration or CD
  - Databases – serve most of our needs, but concern with respect to the trigger and luminosity databases
  - Online system – key dependence on a single individual
  - Offline code management, build and distribution systems
- Additional areas where central consolidation of hardware support → reduced overall manpower needs
  - e.g. Level-3 trigger hardware support
- Under discussion with CD

15 Looking Forward – Manpower – II
- Increased data sets require increased resources
- The route to these is increased use of the grid and common tools
- Have an ongoing joint programme, but work to do
- Need effort to:
  - Continue development of SAM-Grid
    - Automated production job submission by shifters
    - User analyses
  - Deployment team
    - Bring in new sites in a manpower-efficient manner
  - Full interoperability
    - Ability to access all shared resources efficiently
- Additional resources for the above recommended by the Taskforce
  - Support the recommendation that some of the additional manpower come via more guest scientists, postdocs and associate scientists

16 Conclusions
- Computing model continues to be successful
- Significant advances this year:
  - Reco speed-up, Common Analysis Format
  - Extension of grid capabilities
    - P17 reprocessing with SAM-Grid
    - Interoperability
- Want to maintain and build on this progress
- Potential issues / challenges being addressed
  - Short term – ongoing action on immediate vulnerabilities
  - Longer term – larger data sets
    - Continued development of common tools, increased use of the grid
    - Continued development of the above in collaboration with others
  - Manpower injection required to achieve reduced effort in steady state, with increased functionality – see Taskforce talk
- Globally doing well

17 Back-up

18 Terms
- Tevatron
  - Approximately equivalent challenge to the LHC in "today's" money
  - Running experiments
- SAM (Sequential Access to Metadata)
  - Well-developed metadata and distributed data-replication system
  - Originally developed by DØ & FNAL-CD
- JIM (Job Information and Monitoring)
  - Handles job submission and monitoring (all but data handling)
  - SAM + JIM → SAM-Grid – a computational grid
- Tools
  - Runjob – handles job workflow management
  - dØtools – user interface for job submission
  - dØrte – specification of runtime needs

19 Reprocessing

20 The Good and Bad of the Grid
- The only viable way to go
- Increase in resources (CPU and potentially manpower)
  - Work with, not against, the LHC
  - Still limited
BUT
- Need to conform to standards – dependence on others
- Long-term solutions must be favoured over short-term idiosyncratic convenience
  - Or we won't be able to maintain adequate resources
- Must maintain production-level service (papers), while increasing functionality
  - As transparent as possible to the non-expert

