D0 Data Reprocessing on OSG
CHEP'07, September 2007
Author: Andrew Baranovski (Fermilab), for B. Abbott, M. Diesburg, G. Garzoglio, T. Kurca, P. Mhashilkar
Presenter: Amber Boehnlein
Outline: Computing Task, The Process, Issues, Summary

Data reprocessing
- Improved detector understanding and new algorithms require reprocessing of the raw detector data
  - Input: 90 TB of detector data, plus executables at the TB scale
  - Output: 60 TB of data, produced in 500 CPU-years
- DZero did not have enough dedicated resources to complete the task in the target 3 months
  - D0 asked OSG to provide 2000 CPUs for 4 months
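A quick consistency check of these figures (assuming the 2000 opportunistic CPUs were essentially fully available): 500 CPU-years / 2000 CPUs = 0.25 years, i.e. about 3 months of wall-clock time, so the 4-month request left roughly a month of margin for opportunistic inefficiencies and downtime.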

OSG
- Distributed computing infrastructure for large-scale scientific research
  - About 80 facilities, ranging from tens to thousands of nodes, that have agreed to share computing cycles with OSG users
  - No required minimum network bandwidth
  - Non-uniform configuration environments
- Opportunistic usage model
  - The exact amount of resources available at any given time cannot be guaranteed

SAMGrid
- SAM-Grid is an infrastructure that understands D0 processing needs and maps them onto the available resources (OSG)
- Implements job-to-resource mappings, for both computing and storage
- Uses SAM (Sequential Access via Metadata)
  - Automated management of storage elements
  - Metadata cataloguing
- Job submission and job status tracking
- Progress monitoring

Challenge
- Selecting OSG resources according to D0 requirements for availability, quality, and accessibility (of data)
- Optimization goal for SAMGrid: minimize the number of sites while maximizing the available CPUs and the I/O bandwidth to those CPUs
  - min(number of sites) && max(available CPUs) && max(I/O bandwidth to CPU)
- Treat the optimization as an iterative process to keep the problem manageable (a minimal scoring sketch follows below)
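The selection criterion above can be read as a simple scoring heuristic. A minimal illustrative sketch in Python follows; the site list, field names, weights, and cut-offs are invented for illustration and are not part of SAM-Grid or OSG tooling.

    # Toy illustration of the slide's criterion: prefer as few sites as possible,
    # each offering many free CPUs and good I/O bandwidth to those CPUs.
    # All numbers and field names below are invented for illustration.
    sites = [
        {"name": "site_a", "free_cpus": 800,  "io_mbps": 400},
        {"name": "site_b", "free_cpus": 1500, "io_mbps": 100},
        {"name": "site_c", "free_cpus": 300,  "io_mbps": 900},
    ]

    def score(site):
        # Reward both CPU count and I/O bandwidth; cap bandwidth so one
        # very fast link does not dominate the ranking.
        return site["free_cpus"] * min(site["io_mbps"], 500)

    target_cpus = 2000   # the request quoted earlier in the talk
    chosen, total = [], 0
    for site in sorted(sites, key=score, reverse=True):
        if total >= target_cpus:
            break        # stop adding sites once the CPU target is covered
        chosen.append(site["name"])
        total += site["free_cpus"]

    print(chosen, total)   # ['site_a', 'site_b'] 2300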

Process
- Add new resources iteratively
- Certification
  - Physics output should be site-invariant
- Data accessibility
  - Tune data queues
  - Exclude sites/storage elements that do not provide sufficient data throughput
- Enable monitoring views of site-specific issues

Certification
- Compare production at a new site with "standard" production at the D0 reference farm: if the outputs are "the same", the site is certified
- Note: experienced problems during certification on virtualized operating systems*
  *the default random seed in Python was set to the same value on all machines (a per-job seeding sketch follows below)
- [Diagram: output comparison between the reference cluster and an OSG cluster]
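One way to avoid the identical-seed problem mentioned in the note is to seed each job explicitly from an entropy source plus a per-job identifier. A minimal sketch, assuming a hypothetical SAMGRID_JOB_ID environment variable (not a real SAM-Grid interface):

    # Derive a per-job seed instead of relying on the interpreter's default
    # seeding, which can coincide on cloned virtual machines.
    import os
    import random

    job_id = os.environ.get("SAMGRID_JOB_ID", "0")          # hypothetical identifier
    seed = int.from_bytes(os.urandom(8), "big") ^ hash(job_id)
    random.seed(seed)

    print(random.random())  # differs across otherwise identical worker nodes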

Data accessibility
- Use dedicated data queues for each site and production activity type, if required (see below)
- Monitor data throughput in all directions
  - Decrease the maximum number of data streams to high-bandwidth / low-latency sites
  - Increase the maximum number of data streams to high-bandwidth / high-latency sites
  - Exclude sites that do not provide enough bandwidth (an illustrative rule of thumb follows below)
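The stream-count tuning above can be summarized as a simple rule of thumb; an illustrative sketch, with invented thresholds and site figures:

    # Rule of thumb from the slide: few streams for high-bandwidth/low-latency
    # sites, more streams for high-bandwidth/high-latency sites, and exclusion
    # of sites without enough bandwidth.  Thresholds are invented for illustration.
    def max_streams(bandwidth_mbps, latency_ms):
        if bandwidth_mbps < 50:
            return 0        # exclude: not enough bandwidth to keep CPUs busy
        if latency_ms < 20:
            return 5        # low latency: a few streams already saturate the link
        return 30           # high latency: more streams hide the round-trip time

    for site, bw, lat in [("nearby", 600, 5), ("remote", 600, 120), ("slow", 20, 40)]:
        print(site, max_streams(bw, lat))   # nearby 5, remote 30, slow 0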

Data accessibility tests
- [Plot] secs to transfer the test data (30 streams): acceptable
- [Plot] 2000 secs to transfer the test data (30 streams): NOT acceptable

Issue Tracking and Troubleshooting
- Support and tracking of site-specific issues is an issue in itself
  - A dozen sites involved at any one time
  - Initially, only a few people available to troubleshoot and follow up
  - Time-critical activity: as the production backlog grows, so does the trouble
- The OSG Grid Operations Center helped with tracking and triaging the issues

- We monitored the log file size distribution to classify and prioritize current failures (a small classification sketch follows below)
  - Failure-free production generates log files of similar size
- blue, 0-8 kB: worker node incompatibility, lost standard output, no OSG assignment
- aqua, 8-25 kB: forwarding node crash, service failure, could not start the bootstrap executable
- pink, 25-80 kB: SAM problem: could not get the RTE, possibly the raw files
- red, 80-160 kB: SAM problem: could not get the raw files, possibly the RTE
- gray, 160-250 kB: possible D0 runtime crash
- green, >250 kB: OK
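The same binning can be automated; a minimal sketch using the size boundaries listed above (diagnosis labels abbreviated, the example size is made up):

    # Bin a job's log-file size into the failure classes listed on this slide.
    BINS = [  # (upper bound in bytes, diagnosis)
        (8_000,   "worker node incompatibility / lost stdout / no OSG assignment"),
        (25_000,  "forwarding node crash, service failure, bootstrap not started"),
        (80_000,  "SAM problem: could not get RTE (possibly raw files)"),
        (160_000, "SAM problem: could not get raw files (possibly RTE)"),
        (250_000, "possible D0 runtime crash"),
    ]

    def classify(size_bytes):
        for upper, diagnosis in BINS:
            if size_bytes < upper:
                return diagnosis
        return "OK"   # large logs correspond to failure-free production

    print(classify(120_000))   # -> SAM problem: could not get raw files (possibly RTE)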

Troubleshooting
- The OSG Troubleshooting team was instrumental to the success of the project
- [Plots] OSG-related problems before the intervention of the Troubleshooting team (03/27/2007); most jobs succeed (04/17/2007)

- The success rate at a particular site is not an indicator of overall system health
  - Failures at different sites may not be correlated

Issue Summary: start of the effort
- 50% of all issues were related to OSG site setups
  - Local scratch and permission problems, Condor/job manager state mismatches, libc incompatibilities, etc.
  - Direct communication with site admins and help from the OSG Troubleshooting team addressed those
- 50% were data delivery failures, addressed by
  - Tuning the data handling infrastructure
  - Maximizing the use of local storage elements (SRMs)
  - Investigating site connectivity

Issue Summary: middle of the effort
- Coordination of resource usage: over-subscription / under-subscription of jobs at sites
  - OSG did not implement automatic resource selection at the time
  - Job submission operators had to learn manually how many jobs to submit to each resource
- "Unscheduled" downtimes (both computing and storage)
  - System experts had trouble keeping up with the issues by themselves; DZero had to ask the OSG Troubleshooting team for help
Issue Summary: end of the effort
- Some slow-down due to insufficient automation in the job failure recovery procedures

Our result
- …M collider events delivered to physicists
- Reconstructed in a fully distributed, opportunistic environment

Conclusion
- A successful and pioneering effort in data-intensive production in an opportunistic environment
- Challenges in support, coordination of resource usage, and reservation of the shared resources
- The iterative approach to enabling new resources helped make the computing problem more manageable

Acknowledgments
- This task required the assistance of many people beyond those listed, both at Fermilab and at the remote sites, and we thank them for helping make this project the accomplishment that it was
- Special thanks to OSG for supporting the DZero data reprocessing