DØ Computing & Analysis Model

DØ Computing & Analysis Model
Tibor Kurča, IPN Lyon
Mar 15, 2007, Clermont-Ferrand (LCG France)

- Introduction
- DØ Computing Model
- SAM
- Analysis Farms – resources, capacity
- Data Model Evolution – where you can go wrong
- Summary

Computing Enables Physics

HEP computing, with data handling linking the stages:
- Online: data taking
- Offline: data reconstruction, MC data production, analysis → physics results, the final goal of the experiment

Data Flow Analysis Real Data Monte Carlo Data Beam collisions Event generation: software modelling beam particles interactions  production of new particles from those collisions Particles traverse detector Simulation: particles transport in the detectors Readout: Electronic detector signals written to tapes  raw data Digitization: Transformation of the particle drift times, energy deposits into the signals readout by electronics  the same format as real raw data Reconstruction: physics objects, i.e. particles produced in the beams collisions -- electrons, muons, jets… Physics Analysis Mar 15, 2007, Clermont-Ferrand Tibor Kurca, LCG France

DØ Computing Model

1997 – planning for Run II was formalized:
- critical look at Run I production and analysis use cases
- datacentric view – metadata (data about data)
- scalability with Run II data rates and anticipated budgets

Data volumes – intelligent file delivery:
- caching, buffering
- extensive bookkeeping about usage in a central DB

Access to the data – consistent interface for the anticipated global analysis:
- transport mechanisms and data stores transparent to the users
- replication and location services
- security, authentication and authorization

The centralization, in turn, required a client-server model for scalability, uptime and affordability; the same client-server model was applied to serving calibration data to remote sites.

Resulting project: Sequential Access via Metadata (SAM)
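As an illustration of the datacentric view, the sketch below shows what a per-file metadata record might contain. It is a minimal sketch: the field names are assumptions chosen for this example, not the actual SAM schema.

# Minimal sketch of a per-file metadata record in the spirit of the
# "datacentric" view above. Field names are illustrative assumptions,
# not the real SAM schema.
from dataclasses import dataclass, field

@dataclass
class FileMetadata:
    file_name: str       # unique file name registered in the catalogue
    data_tier: str       # e.g. "raw", "reconstructed", "thumbnail"
    run_number: int      # run in which the events were taken
    event_count: int     # number of events in the file
    size_bytes: int      # file size, used for cache and transfer planning
    locations: list = field(default_factory=list)  # tape/disk replicas

record = FileMetadata(
    file_name="d0_example_run123456_001.raw",   # hypothetical name
    data_tier="raw",
    run_number=123456,
    event_count=25000,
    size_bytes=900_000_000,
    locations=["enstore:tape-library", "cache:station-cab"],  # hypothetical
)
print(record.data_tier, record.event_count)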

SAM – Data Management System

Distributed data handling system for the Run II DØ and CDF experiments:
- set of servers (stations) communicating via CORBA
- central DB (ORACLE @ FNAL)
- designed for petabyte-sized datasets!

SAM functionalities:
- file storage from online and processing systems → MSS (FNAL Enstore, CCIN2P3 HPSS…) and disk caches around the world
- routed file delivery – the user does not care about file locations
- file metadata cataloguing → dataset creation based on file metadata
- analysis bookkeeping → which files were processed successfully, by which application, when and where
- user interfaces via command line, web and Python API
- user authentication – registration as a SAM user
- local and remote monitoring capabilities
  http://d0db-prd.fnal.gov/sam_local/SamAtAGlance/
  http://www-clued0.fnal.gov/%7Esam/samTV/current/
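The sketch below illustrates two of the SAM ideas listed above – building a dataset from a metadata query and bookkeeping which files an application consumed. It deliberately uses a made-up in-memory catalogue; it is not the real SAM command line or Python API.

# Illustrative sketch: dataset creation from file metadata and analysis
# bookkeeping. The catalogue and functions are hypothetical stand-ins,
# not the real SAM interfaces.
from datetime import datetime

catalogue = [
    {"file": "f1.tmb", "data_tier": "thumbnail", "run": 123456},
    {"file": "f2.tmb", "data_tier": "thumbnail", "run": 123457},
    {"file": "f3.raw", "data_tier": "raw",       "run": 123456},
]

def create_dataset(name, **constraints):
    """Dataset definition: all files whose metadata match the constraints."""
    return {"name": name,
            "files": [m["file"] for m in catalogue
                      if all(m.get(k) == v for k, v in constraints.items())]}

def record_consumption(log, dataset, application, station):
    """Bookkeeping: which files were processed, by what, when and where."""
    for f in dataset["files"]:
        log.append({"file": f, "application": application,
                    "station": station, "time": datetime.now()})

ds = create_dataset("thumb_run123456", data_tier="thumbnail", run=123456)
log = []
record_consumption(log, ds, application="my_analysis_v1", station="cab")
print(ds["files"], len(log))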

Computing Model I

DØ computing model built on SAM:
- first reconstruction done on FNAL farms
- all MC produced remotely
- all data centralized at FNAL (Enstore), even MC – no automatic replication
- remote Regional Analysis Centers (RACs): CCIN2P3, GridKa – usually prestaging data of interest
- data routed via central-analysis → RACs → smaller sites
- DØ native computing grid – SAMGrid
- SAMGrid/LCG, SAMGrid/OSG interoperability

Computing Model II

[Diagram: data flows through SAM and Enstore, connecting first reconstruction, MC production, reprocessing, fixing, analysis and individual production.]

Analysis Farm 2002 Central Analysis facility: D0mino SGI Origin 2000-176 300 MHz processors and 30 TB50 TB fibre channel disk - RAID disk for system needs and user home areas - centralized, interactive and batch services for on & off-site users - provided also data movement into a cluster of Linux compute nodes 500 GHz CAB (Central Analysis Backend) SAM enables “remote” analysis - user can run analysis jobs on remote sites with SAM services - 2 analysis farm stations were pulling the majority of their files from tape  large load user data access at FNAL was a bottleneck Mar 15, 2007, Clermont-Ferrand Tibor Kurca, LCG France

Central Analysis Farms 2003+

SGI Origin: starting to be phased out
D0mino0x (2004): new Linux-based interactive pool

Clued0: cluster of institutional desktops + rack-mounted nodes as large disk servers
- 1 Gb Ethernet connection, batch system, SAM access (station), local project disk
- appears as a single integrated cluster to the user; managed by the users
- used for development of analysis tools and small-sample tests

CAB (Central Analysis Backend): Linux file servers and worker nodes (pioneered by CDF with FNAL/CD)
- full-sample analysis jobs, production of common analysis samples

Central Analysis Farms – 2007

Home areas on NETAPP (Network Appliance)

CAB:
- Linux nodes
- 3 THz of CPU
- 400 TB SAM cache

Clued0:
- desktop cluster + disk servers
- 1+ THz
- SAM cache: 70 TB (nodes) + 160 TB (servers)

Before adding 100 TB of cache, 2/3 of transfers could be from tape (Enstore).
Practically all tape transfers occur within 5 min; intra-station, 60% of cached files are delivered within 20 sec.
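As a rough illustration of where delivery-time figures like these come from, the sketch below computes such fractions from a set of transfer records. The record format and numbers are a made-up example, not SAM's log format.

# Sketch: fraction of file deliveries completed within a time limit, split by
# source (tape vs. local SAM cache). Records are invented example data.
transfers = [
    # (source, delivery time in seconds)
    ("tape", 180), ("tape", 250), ("tape", 840),
    ("cache", 12), ("cache", 18), ("cache", 45), ("cache", 8),
]

def fraction_within(records, source, limit_seconds):
    """Fraction of transfers from `source` delivered within `limit_seconds`."""
    times = [t for s, t in records if s == source]
    return sum(t <= limit_seconds for t in times) / len(times) if times else 0.0

print(f"tape transfers within 5 min: {fraction_within(transfers, 'tape', 300):.0%}")
print(f"cache transfers within 20 s: {fraction_within(transfers, 'cache', 20):.0%}")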

Data Model in Retrospective

Initial data model:
- STA: raw data + all reconstructed objects (too big…)
- DST: reconstructed objects plus enough info to redo reconstruction
- TMB: compact format of selected reconstructed objects
- all catalogued and accessible via SAM
- formats supported by a standard C++ framework; physics groups would produce and maintain their specific tuples

Reality:
- STA was never implemented
- TMB wasn't ready when data started to come
- DST was ready, but initially people wanted the extra info in the raw data
- ROOT tuple output intended for debugging was available → many started to use it for analysis
- the threshold for using the standard framework and SAM was high (complex, with inadequate documentation)

Data Model in Retrospective 2

TMB:
- finalized too late (8 months after data taking began) → data kept disk-resident, duplication of algorithm developments
- slow for analysis (large unpacking times; changes required slow relinks)

Divergence between those using the standard framework and those using ROOT tuples:
- incompatibilities and complications, notably in standard object IDs
- the need for a common format was finally recognized (a difficult process)

TMBTree: an effort to introduce a new common analysis format
- compatibility issues and inertia still prevented most ROOT tuple users from using it
- it didn't have a clear support model → never caught on

TMB++: added calorimeter cell information & tracker hits

CAF – Common Analysis Format

2004: "CAF" project begins – Common Analysis Format:
- common ROOT tree format based on the existing TMB
- central production & storage in SAM
- efficiency gains: easier sharing of data and analysis algorithms between physics groups, reducing the development and maintenance effort required by the groups
- faster turn-around between data taking and publication

café – the CAF environment – has been developed:
- a single user-friendly, ROOT-based analysis system
- forms the basis for common tools development – standard analysis procedures such as trigger selection, object-ID selection, efficiency calculation
- benefits for all physics groups
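For orientation, this is roughly what analysing a common ROOT-tree format looks like from Python. The file, tree and branch names are invented for the example (they are not the actual CAF/café schema), and a working ROOT installation with the PyROOT bindings is assumed.

# Sketch: reading a common-format ROOT tree with PyROOT and applying a simple
# selection. File, tree and branch names are hypothetical.
import ROOT

f = ROOT.TFile.Open("caf_example.root")   # hypothetical common-format file
tree = f.Get("events")                    # hypothetical tree name

n_selected = 0
for event in tree:                        # PyROOT iterates over tree entries
    # hypothetical branch: muon transverse momenta in GeV
    if any(pt > 20.0 for pt in event.muon_pt):
        n_selected += 1

print(f"selected {n_selected} of {tree.GetEntries()} events")
f.Close()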

CAF Use Taking Off

- 2004: "CAF" begins; CAF commissioned in 2006
- Working to understand use cases; next focus is analysis

[Plot: events consumed per month, reaching ~10B events/month – red: TMB access, blue: CAF, black: physics group samples.]

CPU Usage – Efficiency

[Plot: CPU time / wall time on cabsrv2 (SAM_lo) – around 20% in Sept '05, around 70% in April '06.]

- Historical average is around 70% CPU/wall time; currently I/O-dominated
- Working to understand – multiple "problems" or limitations seem likely → ROOT bug
- Vitally important to understand analysis use cases/patterns, in discussion with the physics groups
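The efficiency metric quoted here is simply total CPU time divided by total wall-clock time over a set of jobs. A minimal sketch, assuming a made-up list of job accounting records rather than the actual CAB accounting format:

# Sketch: aggregate CPU / wall-time efficiency over a batch of jobs.
jobs = [
    # (cpu_seconds, wall_seconds)
    (7200, 10300),   # mostly CPU-bound job
    (1800,  9000),   # I/O-dominated job: most of the time spent waiting for data
    (5400,  7600),
]

def cpu_efficiency(records):
    """Total CPU time divided by total wall-clock time."""
    cpu = sum(c for c, _ in records)
    wall = sum(w for _, w in records)
    return cpu / wall

print(f"overall CPU/wall efficiency: {cpu_efficiency(jobs):.0%}")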

ROOT Bug

- Many jobs were getting only 20% CPU on CAB
- Reported to the experts (Paul Russo, Philippe Canal) and the problem was found: slow lookup of TRefs in ROOT
- Fixed in a new patch of ROOT v4.4.2b; the p21.04.00 release has the new ROOT patch

Where the time goes:
- 12% open the file, read TStreamerInfo
- 6% read the input tree from the file
- 7% clone the input tree (by café)
- 10% do the processing
- 32% unzip tree data
- 26% move tree data from the ROOT I/O buffer to the user buffer
- 7% miscellaneous

Next step: use the new fixed code and measure CPU performance to see whether any CPU issues remain.

Analysis over Time

[Plot: consumption by station (clued0, cabsrv1, …) since "the beginning of SAM time", 2002–2006, reaching the 1 PB scale; integrates to 450B events consumed.]

SAM Data Consumption/Month – 2007

[Plot: data consumed per month, Feb 2006 – Mar 2007: ~800 TB/month.]

SAM Cumulative Data Consumption – 2007

[Plots: cumulative data consumption and events consumed, Feb/Mar 2006 – Mar 2007: > 10 PB/year, ~250B events/year.]

Summary – Conclusions

Analysis – the final step in the whole computing chain of a physics experiment:
- the most unpredictable usage of computing resources
- by their nature, I/O-oriented jobs
- 2 phases in the analysis procedure:
  1. developing analysis tools, testing on small samples
  2. large-scale analysis production

User-friendly environment and suitable tools → short learning curve; missing user interfaces and a painful environment → user resistance.

Lessons: it's not only about hardware resources & architecture…
- Common data tiers (formats) are very important – need a format that meets the needs of all users and that all agree on from day one
- Simplicity of usage
- Documentation must be ready to use – use cases, surprises?

"Most basic user's needs in areas where they interact directly with the computing system should be an extremely high priority"