DØ Data Handling Operational Experience
GridPP8, Sep 22-23, 2003
Rod Walker, Imperial College London

Roadmap of Talk
– Computing Architecture
– Operational Statistics
– Challenges and Future Plans
– Regional Analysis Centres
– Computing activities
– Summary

DØ computing / data handling / database architecture (slide diagram; recoverable labels summarised below)
– Remote centres, linked to fnal.gov via Startap/Chicago: Great Britain 200, Netherlands 50, France 100, Texas 64, Czech Republic 32 (all Monte Carlo production).
– Robotic mass storage: ENSTORE movers with ADIC AML/2 and STK 9310 Powderhorn libraries.
– Reconstruction farm: 300+ dual PIII/IV Linux nodes.
– Central Analysis Backend (CAB): 160 dual 2 GHz Linux nodes, 35 GB cache each.
– Central analysis server: SGI Origin, R12000 processors, 27 TB of fibre-channel disk.
– ClueDØ: Linux desktop user cluster, 227 nodes, in the experimental hall/office complex.
– Online and database systems: DEC4000 data loggers (d0ola,b,c), L3 nodes, collector/router, SUN 4500 and Linux hosts (d0ora1, d0lxac1, d0dbsrv1); a: production, c: development.
– Networking: CISCO switches, fibre to the experiment.

SAM Data Management System
– Flexible and scalable distributed model
– Field-hardened code; reliable and fault tolerant
– Adapters for mass storage systems: Enstore (HPSS and others planned)
– Adapters for transfer protocols: cp, rcp, scp, encp, bbftp, GridFTP
– Useful in many cluster computing environments: SMP with compute servers, desktop, private network (PN), NFS shared disk, …
– Ubiquitous for DØ users
SAM is Sequential data Access via Meta-data.
SAM Station: (1) the collection of SAM servers which manage data delivery and caching for a node or cluster; (2) the node or cluster hardware itself.
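The "adapter per transfer protocol" idea above can be illustrated with a short sketch. This is a hypothetical Python illustration, not SAM's actual API: it only shows how a station-like component might select a transfer tool (cp, scp, GridFTP, …) for a given source and destination.

```python
# Hypothetical sketch of the "adapter per transfer protocol" idea described above.
# The mapping and command templates are illustrative only; they are NOT SAM's real API.
import shlex

TRANSFER_ADAPTERS = {
    "cp":      "cp {src} {dst}",
    "rcp":     "rcp {src} {dst}",
    "scp":     "scp {src} {dst}",
    "gridftp": "globus-url-copy {src} {dst}",
}

def build_transfer(protocol, src, dst):
    """Return the argument list for the chosen transfer adapter."""
    try:
        template = TRANSFER_ADAPTERS[protocol]
    except KeyError:
        raise ValueError(f"no adapter for protocol {protocol!r}")
    return shlex.split(template.format(src=src, dst=dst))

if __name__ == "__main__":
    # Hypothetical file locations, just to show the call pattern.
    print(build_transfer("gridftp",
                         "gsiftp://some.site.example/pnfs/some/file",
                         "file:///cache/some/file"))
```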

Overview of DØ Data Handling
Summary of DØ Data Handling:
– Registered Users: 600
– Number of SAM Stations: 56
– Registered Nodes: 900
– Total Disk Cache: 40 TB
– Number of Files (physical): 1.2 M
– Number of Files (virtual): 0.5 M
– Robotic Tape Storage: 305 TB
(Slide also shows a world map of Regional Centres and Analysis sites.)
Plots of Integrated Files Consumed and Integrated GB Consumed vs Month, Mar 2002 to Mar 2003: 4.0 M files consumed, 1.2 PB consumed.
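A quick back-of-the-envelope check of these numbers (purely illustrative): 4.0 M files and 1.2 PB consumed over the year imply an average consumed file size of roughly 300 MB.

```python
# Rough average implied by the consumption figures quoted above (Mar 2002 - Mar 2003).
files_consumed = 4.0e6      # files
bytes_consumed = 1.2e15     # 1.2 PB, taking 1 PB = 10^15 bytes for simplicity

avg_file_size_mb = bytes_consumed / files_consumed / 1e6
print(f"average consumed file size ~ {avg_file_size_mb:.0f} MB")   # ~300 MB
```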

Data In and Out of Enstore (robotic tape storage)
Daily rates, Aug 16 to Sep 20: 5 TB outgoing, 1 TB incoming ("Shutdown starts" is marked on the plot).

Consumption
– Applications "consume" data. In the data handling system, consumers can be hungry or satisfied; by allowing for the consumption rate, the next course is delivered before it is asked for.
– 180 TB consumed per month
– 1.5 PB consumed in 1 year

Challenges
Getting SAM to meet the needs of DØ in its many configurations is, and has been, an enormous challenge. Some examples include:
– File corruption issues. Solved with CRC checks.
– Preemptive distributed caching is prone to race conditions and log jams, or "gridlock". These have been solved.
– Private networks sometimes require "border" services. This is understood.
– The NFS shared-cache configuration provides additional simplicity and generality, at the price of scalability (star configuration). This works.
– Global routing completed.
– Installation procedures for the station servers have been quite complex. They are improving and we plan to soon have "push button" and even "opportunistic deployment" installs.
– Lots of details with opening ports on firewalls, OS configurations, registration of new hardware, and so on.
– Username clashing issues. Moving to GSI and Grid certificates.
– Interoperability with many mass storage systems.
– Network-attached files. Sometimes the file does not need to move to the user.
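As an illustration of the CRC approach to the file-corruption problem above, here is a minimal Python sketch (mine, not SAM's code) that computes a CRC32 over a file in chunks so it can be compared against a value recorded in the file catalogue. The function names and the assumption that the catalogue stores a CRC32 are illustrative.

```python
# Minimal sketch of CRC-based corruption detection, assuming the catalogue stores
# a CRC32 per file; illustrative only, not SAM's actual implementation.
import zlib

def file_crc32(path, chunk_size=1 << 20):
    """Compute the CRC32 of a file, reading it in 1 MB chunks."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

def verify_transfer(path, expected_crc):
    """Return True if the delivered file matches the catalogued CRC."""
    return file_crc32(path) == expected_crc
```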

RAC: Why Regions are Important
1. Opportunistic use of ALL computing resources within the region
2. Management of resources within the region
3. Coordination of all processing efforts is easier
4. Security issues within the region are similar: CAs, policies, …
5. Increases the technical support base
6. Speak the same language
7. Share the same time zone
8. Frequent face-to-face meetings among players within the region
9. Physics collaboration at the regional level contributes to results at the global level
10. A little spirited competition among regions is good

Summary of Current & Soon-to-be RACs
(*Numbers in parentheses are totals for the centre or region; other numbers are DØ's current allocation.)
– Germany (IACs: Aachen, Bonn, Freiburg, Mainz, Munich, Wuppertal): CPU 52 GHz (518 GHz), Disk 5.2 TB (50 TB), Archive 10 TB (100 TB); Established as RAC
– Southern US (IACs: AZ, Cinvestav Mexico City, LA Tech, Oklahoma, Rice, KU, KSU): CPU 160 GHz (320 GHz), Disk 25 TB (50 TB); Summer 2003
– UK (IACs: Lancaster, Manchester, Imperial College, RAL): CPU 46 GHz (556 GHz), Disk 14 TB (170 TB), Archive 44 TB; Active, MC production
– France (IACs: CCin2p3, CEA-Saclay, CPPM-Marseille, IPNL-Lyon, IRES-Strasbourg, ISN-Grenoble, LAL-Orsay, LPNHE-Paris): CPU 100 GHz, Disk 12 TB, Archive 200 TB; Active, MC production
– Northern US, FNAL (farm, CAB, ClueD0, central-analysis): CPU 1800 GHz, Disk 25 TB, Archive 1 PB; Established as CAC
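To give a feel for the allocation column, an illustrative calculation (not from the slide itself): DØ's CPU share of each regional total follows directly from the numbers above, where a total is quoted.

```python
# Illustrative: DØ's CPU allocation as a fraction of each regional total,
# using only the numbers quoted in the table above.
allocations_ghz = {            # (DØ allocation, regional total)
    "Germany":     (52, 518),
    "Southern US": (160, 320),
    "UK":          (46, 556),
}

for rac, (d0, total) in allocations_ghz.items():
    print(f"{rac:12s}: {d0}/{total} GHz = {100 * d0 / total:.0f}% of the regional total")
```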

UK RAC (diagram)
Sites: Manchester, Lancaster, LeSC, Imperial (CMS), RAL (3.6 TB), routed to the FNAL MSS (25 TB).
Global File Routing:
– FNAL throttles transfers
– Direct access unnecessary (firewalls, policies, …)
– Configurable, with fail-overs
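A hedged sketch of what configurable routing with fail-overs might look like. The route table and helper below are hypothetical, purely to illustrate the idea of each site having an ordered list of routes with a fall-back; they do not reflect SAM's actual configuration format.

```python
# Hypothetical illustration of "configurable routing with fail-overs":
# each station tries its preferred route first and falls back down the list.
# Site names are taken from the slide; the route structure itself is invented.
ROUTES = {
    "lancaster":  ["ral-gateway", "fnal-direct"],
    "manchester": ["ral-gateway", "fnal-direct"],
    "imperial":   ["ral-gateway", "fnal-direct"],
    "ral":        ["fnal-direct"],
}

def pick_route(site, is_up):
    """Return the first route for `site` that the `is_up` probe reports as alive."""
    for route in ROUTES.get(site, []):
        if is_up(route):
            return route
    raise RuntimeError(f"no working route for {site}")

if __name__ == "__main__":
    # Pretend the RAL gateway is down, so Lancaster fails over to a direct FNAL route.
    print(pick_route("lancaster", is_up=lambda r: r != "ral-gateway"))
```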

From RACs to Riches: Summary and Future
– We feel that the RAC approach is important to use remote resources more effectively.
– Management and organisation in each region is as important as the hardware. However:
  – Physics-group collaboration will transcend regional boundaries.
  – Resources within each region will be used by the experiment at large (Grid computing model).
  – Our models of usage will be revisited frequently. Experience already indicates that the use of thumbnails differs from that in our RAC model (skims).
  – No RAC will be completely formed at birth.
– There are many challenges ahead. We are still learning…

Stay Tuned for SAM-Grid The best is yet to come…

CPU-intensive activities
– Primary reconstruction: on-site, with local help to keep up.
– MC production: anywhere; no input data.
– Re-reconstruction (reprocessing): must be fast to be useful; use all resources.
– Thumbnail skims: 1 per physics group.
– Common skim: the OR of the group skims.
  – End up with all events if the triggers are good.
  – Defeats the object, i.e. small datasets (see the sketch after this list).
– User analysis: not a priority (CAB can satisfy demand).
First on SAMGrid.
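A toy illustration of the "common skim is the OR of the group skims" point. The selection fractions and group names are invented; the sketch just shows why the union creeps back towards the full dataset when the triggers are broad.

```python
# Toy model: each physics-group skim selects events passing its own triggers.
# The common skim is the union (OR) of the group skims; with broad triggers the
# union approaches the full dataset, defeating the purpose of a small skim.
# All numbers below are invented for illustration.
import random

random.seed(0)
all_events = range(100_000)

group_skims = [
    {e for e in all_events if random.random() < 0.30},  # hypothetical group A skim
    {e for e in all_events if random.random() < 0.35},  # hypothetical group B skim
    {e for e in all_events if random.random() < 0.40},  # hypothetical group C skim
]

common_skim = set().union(*group_skims)
print(f"common skim keeps {100 * len(common_skim) / len(all_events):.0f}% of all events")
```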

Current Reprocessing of DØ Run II Data
Why now and fast?
– Improved tracking for the Spring conferences
– Tevatron shutdown: can include the reconstruction farm
Reprocess all Run II data:
– 40 TB of DST data
– 40k files (the basic unit of data handling)
– 80 million events
How:
– Many sites in the US and Europe, including the UK RAC
– qsub initially, but the UK will lead the move to SAMGrid
– NIKHEF (LCG)
– Will gather statistics and report
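The quoted totals imply, in a rough and purely illustrative calculation, about 1 GB per DST file, roughly 2000 events per file, and of order 0.5 MB per event:

```python
# Averages implied by the reprocessing numbers quoted above (rough, illustrative).
dst_bytes = 40e12      # 40 TB
n_files   = 40_000
n_events  = 80e6

print(f"~{dst_bytes / n_files / 1e9:.1f} GB per file")        # ~1.0 GB
print(f"~{n_events / n_files:.0f} events per file")           # ~2000
print(f"~{dst_bytes / n_events / 1e6:.2f} MB per event")      # ~0.50 MB
```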

Runjob and SAMGrid
Runjob workflow manager:
– Maintained by Lancaster; the mainstay of DØ MC production.
– No difference between MC production and data (re)processing.
SAMGrid integration:
– Was done for GT 2.0, e.g. the Tier-1A via an EDG 1.4 CE.
– "Job bomb": 1 grid job maps to many local batch-system jobs, i.e. the job has structure.
– Options: request a GT 2.0 gatekeeper (0 months), write custom Perl jobmanagers (2 months), or use DAGMan to absorb the structure (3 months).
– Pressure to use grid-submit; want GT 2.0 for now.
4 UK sites, 0.5 FTEs: need to use SAMGrid.
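To illustrate the "job bomb" structure (one grid job fanning out into many local batch jobs), here is a hedged sketch. The segment count, worker script name and qsub usage are assumptions for illustration; this is not the actual Runjob or SAMGrid implementation.

```python
# Hypothetical sketch of the 1-grid-job -> many-batch-jobs fan-out ("job bomb").
# In reality Runjob/SAMGrid handle this; the helper below only illustrates the shape.
import subprocess

def submit_job_bomb(n_segments, events_per_segment, worker_script="run_segment.sh"):
    """Split one logical (grid) job into n_segments local batch jobs via qsub."""
    batch_ids = []
    for seg in range(n_segments):
        cmd = ["qsub",
               "-N", f"d0_reprocess_seg{seg:04d}",
               "-v", f"FIRST_EVENT={seg * events_per_segment},"
                     f"N_EVENTS={events_per_segment}",
               worker_script]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        batch_ids.append(result.stdout.strip())   # qsub prints the batch job id
    return batch_ids
```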

Conclusions
– SAM enables PB-scale HEP computing today.
– Details are important in a production system: PNs, NFS, scaling, cache management (free space = zero, always), gridlock, …
– Official and semi-official tasks dominate the CPU requirements: reconstruction, reprocessing, MC production, skims. By definition these are structured and repeatable, which is good for the Grid.
– User analysis runs locally (it still needs data handling) or centrally. (Still a project goal, just not mine.)
– SAM experience is valuable; see the report on reprocessing.
– Have LCG seen how good it is?