Slide 1: DØ Data Handling Operational Experience
CHEP03, UCSD, March 24-28, 2003. Lee Lueking, Fermilab Computing Division. (d0db.fnal.gov/sam)
Roadmap of talk: DØ overview; computing architecture; operational statistics; operations strategy and experience; challenges and future plans; summary.

Slide 2: The DØ Experiment
– DØ collaboration: 18 countries, 80 institutions, >600 physicists.
– Detector data (Run 2a, ending mid-2004): 1,000,000 channels; event size 250 KB; event rate 25 Hz average; estimated 2-year data total (including processing and analysis): 1 × 10^9 events, ~1.2 PB.
– Monte Carlo data (Run 2a): 6 remote processing centers; estimated ~0.3 PB.
– Run 2b, starting 2005: >1 PB/year.
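As a back-of-envelope check of the Run 2a estimate above, here is a minimal sketch in Python; the 4.8× multiplier standing in for reconstruction output and analysis formats is an assumption, used only to show that the slide's numbers are mutually consistent.

```python
# Back-of-envelope check of the Run 2a data-volume estimate (illustrative only;
# the 4.8x factor for processed and analysis copies is an assumption).
event_size_bytes = 250e3     # 250 KB per event (from the slide)
events_total = 1e9           # ~1 x 10^9 events over two years (from the slide)

raw_bytes = event_size_bytes * events_total
print(f"raw detector data: {raw_bytes / 1e15:.2f} PB")                       # ~0.25 PB

total_bytes = raw_bytes * 4.8                                                 # assumed multiplier
print(f"including processing and analysis: {total_bytes / 1e15:.2f} PB")      # ~1.2 PB
```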

Slide 3: DØ Data Handling is a Successful Worldwide Effort
Thanks to the efforts of many people:
– The SAM team at FNAL: Andrew Baranovski, Diana Bonham, Lauri Carpenter, Lee Lueking, Wyatt Merritt*, Carmenita Moore, Igor Terekhov, Julie Trumbo, Sinisa Veseli, Matthew Vranicar, Stephen P. White, Kin Yip (BNL). (*Project co-lead with Jeff Tseng from CDF.)
– Major contributions from Amber Boehnlein (DØ Offline Computing Leader), Iain Bertram (Lancaster), Chip Brock (MSU), Jianming Qian (UM), Rod Walker (IC), Vicky White (FNAL).
– CD Computing and Communication Fabric Department (CCF), in particular the Enstore team and the Farms Support Group.
– CD Core Computing Support (CCS) Database Support Group (DSG) and Computing and Engineering for Physics Applications (CEPA) Database Applications Group (DBS) for database support.
– CD DØ Department and the DØ Operations Team at Fermilab.
– CAB and ClueDØ administrators and support teams.
– SAM station administrators and SAM shifters worldwide.
– MC production teams: Lancaster UK, Imperial College UK, Prague CZ, U. Texas Arlington, Lyon FR, Amsterdam NL.
– GridKa Regional Analysis Center at Karlsruhe, Germany: Daniel Wicke (Wuppertal), Christian Schmitt (Wuppertal), Christian Zeitnitz (Mainz).

Slide 4: DØ computing / data handling / database architecture (diagram)
The diagram shows the fnal.gov site: Enstore mass storage (ADIC AML/2 and STK 9310 Powderhorn robots, movers on UNIX hosts), a 300+ node Linux reconstruction farm (dual PIII/IV), the Central Analysis Backend (CAB, 160 dual 2 GHz Linux nodes, 35 GB cache each), a 128-processor SGI Origin SMP analysis server (R12000 processors, 27 TB of fibre-channel disk), the ClueDØ Linux desktop user cluster (227 nodes, fiber to the experiment), the DEC4000 data loggers (d0ola,b,c), L3 nodes, and collector/router in the experimental hall/office complex, and the database servers (SUN 4500 d0ora1, Linux quad d0lxac1, Linux d0dbsrv1), linked by Cisco switches (a: production, c: development) and the STARTAP link to Chicago. Remote Monte Carlo production sites: Great Britain (200 CPUs, all Monte Carlo production), Netherlands (50), France (100), Texas (64), Czech Republic (32).

Slide 5: SAM Data Management System
– Flexible and scalable distributed model.
– Field-hardened code; reliable and fault tolerant.
– Adapters for many batch systems: LSF, PBS, Condor, FBS.
– Adapters for mass storage systems: Enstore (HPSS and others planned).
– Adapters for transfer protocols: cp, rcp, scp, encp, bbftp, GridFTP (see the sketch below).
– Useful in many cluster computing environments: SMP with compute servers, desktop, private network (PN), NFS shared disk, …
– Ubiquitous for DØ users.
SAM is Sequential data Access via Meta-data. A SAM station is (1) the collection of SAM servers that manage data delivery and caching for a node or cluster, and (2) the node or cluster hardware itself.
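As a rough illustration of the adapter idea on this slide, a transfer layer can dispatch to whichever protocol is configured for a given route. The registry and function below are hypothetical, not the actual SAM code; bbftp and GridFTP adapters would be registered the same way with their site-specific options.

```python
# Illustrative protocol-adapter registry (hypothetical; not the real SAM code).
import subprocess

# Command builders keyed by protocol name; each returns an argv list.
TRANSFER_ADAPTERS = {
    "cp":  lambda src, dst, host: ["cp", src, dst],
    "rcp": lambda src, dst, host: ["rcp", src, f"{host}:{dst}"],
    "scp": lambda src, dst, host: ["scp", src, f"{host}:{dst}"],
}

def transfer(protocol, src, dst, host="localhost"):
    """Copy one file using the adapter configured for this station pair."""
    try:
        build_cmd = TRANSFER_ADAPTERS[protocol]
    except KeyError:
        raise ValueError(f"no adapter registered for protocol {protocol!r}")
    subprocess.run(build_cmd(src, dst, host), check=True)

# Example: a local cache-to-cache copy vs. a copy to another station over ssh.
# transfer("cp", "/cache/file.raw", "/scratch/file.raw")
# transfer("scp", "/cache/file.raw", "/cache/file.raw", host="station2.example.org")
```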

Slide 6: Overview of DØ Data Handling
Summary of DØ data handling:
– Registered users: 600
– Number of SAM stations: 56
– Registered nodes: 900
– Total disk cache: 40 TB
– Number of files (physical): 1.2 M
– Number of files (virtual): 0.5 M
– Robotic tape storage: 305 TB
(Plots, Mar 2002 to Mar 2003: integrated files consumed vs. month, 4.0 M files consumed; integrated GB consumed vs. month, 1.2 PB consumed. World map of regional centers and analysis sites.)

Slide 7: Data In and Out of Enstore (robotic tape storage)
Daily traffic, Feb 14 to Mar 15: … TB outgoing, 1.3 TB incoming. See Cat. 3, Tuesday at 4:50.

Slide 8: DØ SAM Station Summary
Name | Location | Nodes/CPU | Cache | Use/comments
Central-analysis | FNAL | 128-processor SMP* (SGI Origin) | … TB | Analysis and DØ code development
CAB (Central-Analysis Backend) | FNAL | 16 dual 1 GHz + … dual 1.8 GHz | 6.2 TB | Analysis and general purpose
FNAL-Farm | FNAL | 100 dual … GHz + 240 dual 1.8 GHz | 3.2 TB | Reconstruction
ClueD0 | FNAL | 50 mixed PIII, AMD (may grow >200) | 2 TB | User desktop, general analysis
D0karlsruhe (GridKa) | Karlsruhe, Germany | 1 dual 1.3 GHz gateway, >160 dual PIII and Xeon | 3 TB, NFS shared | General use; workers on private network; shared facility
D0umich (NPACI) | U. Michigan, Ann Arbor | 1 dual 1.8 GHz gateway, 100 dual AMD XP | … TB, NFS shared | Re-reconstruction; workers on private network; shared facility
Many others (>4 dozen) | Worldwide | Mostly dual PIII, Xeon, and AMD XP | | MC production, general analysis, testing
*IRIX; all other stations run Linux.

Slide 9: Station Stats: GB Consumed Daily, Feb 14 – Mar 15
(Four plots: Central-Analysis, FNAL-farm, ClueD0, CAB.) Annotated daily peaks: Central-Analysis 2.5 TB; FNAL-farm peak on Feb 17; ClueD0 peak on Mar 6; CAB >1.6 TB on Feb 28.

Slide 10: Station Stats: MB Delivered to / Sent from Each Station Daily, Feb 14 – Mar 15
(Four plots: Central-Analysis, FNAL-farm, ClueD0, CAB, showing data delivered to and sent from each station.) Annotated delivery peaks include 1 TB; consumption peaks: Central-Analysis on Feb 22, FNAL-farm 270 GB on Feb 17, ClueD0 1.1 TB on Mar 6, CAB 1.6 TB on Feb 28.

Slide 11: FNAL-Farm Station and CAB CPU Utilization, Feb 14 – Mar 15
(Plots of CPU utilization for the FNAL-farm reconstruction farm and the Central-Analysis Backend compute servers.) CAB is running at about 50% utilization; CAB usage will increase dramatically in the coming months. Also see the ClueD0 talk in Section 3, Monday, 5:10.

Slide 12: Data to and from Remote Sites: Data Forwarding and Routing
(Diagram: the central MSS and SAM stations 1-4, connected to several remote SAM stations.)
– Station configuration: replica locations can be marked "prefer" or "avoid".
– Forwarding: file stores can be forwarded through other stations.
– Routing: routes for file transfers are configurable; extra-domain transfers use bbftp or GridFTP (parallel transfer protocols). See the sketch below.
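A minimal sketch of how a station might pick a source replica under the prefer/avoid configuration described above; the data structures and function are illustrative, not the actual SAM station code.

```python
# Illustrative replica selection under a prefer/avoid station configuration
# (hypothetical data structures; not the actual SAM station code).

def choose_source(replica_locations, prefer, avoid):
    """Pick where to read a file from, honouring the station's preferences.

    replica_locations: stations (or the MSS) known to hold the file
    prefer / avoid:    sets of location names from the station configuration
    """
    candidates = [loc for loc in replica_locations if loc not in avoid]
    if not candidates:
        candidates = list(replica_locations)   # fall back rather than fail outright
    # Preferred locations (e.g. a nearby station's cache) win over the rest
    # (e.g. going all the way back to the central MSS).
    preferred = [loc for loc in candidates if loc in prefer]
    return (preferred or candidates)[0]

# Example: a remote station prefers its regional centre's cache over central tape.
print(choose_source(
    replica_locations=["enstore-mss", "gridka-cache", "fnal-farm-cache"],
    prefer={"gridka-cache"},
    avoid={"fnal-farm-cache"},
))  # -> "gridka-cache"
```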

Slide 13: DØ Karlsruhe Station at GridKa
This is our first Regional Analysis Center (RAC); see the DØ RAC Concept talk, Category 1, Tuesday 2:30. The GridKa SAM station uses a shared-cache configuration with workers on a private network (see the sketch below).
(Plots: monthly and cumulative thumbnail data moved to GridKa; 1.2 TB in November, … TB since June 2002.)
Resource overview:
– Compute: 95 dual PIII 1.2 GHz and 68 dual Xeon 2.2 GHz nodes; DØ requested 6% (updates in April).
– Storage: DØ has 5.2 TB of cache; use of …% of the ~100 TB MSS (updates in April).
– Network: 100 Mb connection available to users.
– Configuration: SAM with shared disk cache, private network, firewall restrictions, OpenPBS, RedHat 7.2, kernel 2.4.18, DØ software installed.
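A minimal sketch of the shared-cache pattern used at GridKa, assuming a hypothetical layout in which only the gateway node has outside connectivity and workers on the private network read files from the NFS-mounted cache; the paths and helper names are invented for illustration.

```python
# Sketch of the shared-NFS-cache pattern: the gateway fetches, workers only read.
# Paths and helper names are hypothetical, not the GridKa installation.
import os
import shutil

SHARED_CACHE = "/nfs/sam-cache"   # exported by the gateway, mounted by all workers

def gateway_fetch(src_path, filename):
    """Run on the gateway (the only node with outside connectivity):
    place one file into the shared cache if it is not already there."""
    dst = os.path.join(SHARED_CACHE, filename)
    if not os.path.exists(dst):
        shutil.copy(src_path, dst)   # stand-in for a bbftp/GridFTP transfer
    return dst

def worker_open(filename):
    """Run on any worker on the private network: no WAN access needed,
    the file is simply read from the NFS-shared cache."""
    return open(os.path.join(SHARED_CACHE, filename), "rb")
```

The trade-off, as noted on the Challenges slide, is that all file I/O funnels through a single shared file system (a star configuration), which limits scalability.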

Slide 14: SAM Shift Stats Overview (weekly)
– In operation since summer 2001; Kin Yip (BNL) is the current shift coordinator.
– 27 general shifters from 5 time zones; 7 expert shifters at FNAL. Experts still carry much of the load.
– Problems range from simple user questions to installation issues, hardware, network, and bugs.
(Plot: shifter time zones.)

Slide 16: Challenges
Getting SAM to meet the needs of DØ in its many configurations is and has been an enormous challenge. Some examples include:
– File corruption issues: solved with CRC checks (see the sketch below).
– Preemptive distributed caching is prone to race conditions and log jams; these have been solved.
– Private networks sometimes require "border" naming services; this is understood.
– The NFS shared-cache configuration provides additional simplicity and generality at the price of scalability (star configuration); this works.
– Global routing completed.
– Installation procedures for the station servers have been quite complex; they are improving, and we plan to soon have "push button" and even "opportunistic deployment" installs.
– Lots of details with opening ports on firewalls, OS configurations, registration of new hardware, and so on.
– Username clashing issues: moving to GSI and Grid certificates.
– Interoperability with many mass storage systems.
– Network-attached files: sometimes the file does not need to move to the user.
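To make the CRC point above concrete, here is a generic sketch of verifying a delivered file against its catalogued checksum; it is illustrative only and not the actual SAM/Enstore checksum implementation.

```python
# Generic sketch of CRC verification after a file transfer
# (illustrative only; not the actual SAM/Enstore checksum code).
import zlib

def crc32_of(path, chunk_size=1 << 20):
    """Compute a CRC-32 of a file, reading it in 1 MB chunks."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

def verify_transfer(local_path, expected_crc):
    """Reject a delivered file whose checksum does not match the catalogued value."""
    actual = crc32_of(local_path)
    if actual != expected_crc:
        raise IOError(f"{local_path}: CRC mismatch "
                      f"(got {actual:#010x}, expected {expected_crc:#010x})")
```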

Slide 17: Stay Tuned for SAM-Grid
The best is yet to come… See the JIM talk by Igor Terekhov in Category 1, Tuesday 1:50.

Slide 18: Summary
The DØ data handling operation is a complex system involving a worldwide network of infrastructure and support. SAM provides flexible data management solutions for many hardware configurations, including clusters on private networks, shared NFS cache, and distributed cache, and it provides configurable data routing throughout the installed base. The software is stable and delivers reliable data management to production systems at FNAL and worldwide; many challenging problems have been overcome to achieve this goal. Support is provided through a small group of experts at FNAL and a network of shifters throughout the world, with many tools to monitor the system and to detect and diagnose problems. The system is continually being improved, and additional features are planned as it moves beyond data handling to complete Grid functionality in the SAM-Grid project (a.k.a. SAM + JIM).