NOvA Computing
Gavin S. Davies, Iowa State University, on behalf of the NOvA Collaboration
FIFE Workshop, June 16-17th 2014


Slide 2

NOvA Near Detector
- Construction completed: Apr 25, 2014
- Filling status: 100%
- Electronics: 16% complete
- Estimated completion: July 2014

NOvA Far Detector
- Construction completed: Apr 25, 2014 (14 kt)
- Electronics: 80% complete (11.25 kt)
- Estimated completion: July 2014

NOvA is recording data from two detectors and gearing up for its first physics analysis results in late 2014, increasing our scientific computing needs going forward.

Slide 3
- Virtual Machines – user FNAL computing gateway
  - 10 virtual machines: novagpvm01–novagpvm10
  - Round-robin access through "ssh nova-offline.fnal.gov"
- BlueArc – interactive data storage: /nova/data (140 TB), /nova/prod (100 TB), /nova/ana (95 TB) (a usage-check sketch follows this slide)
- Tape – long-term data storage
  - 4 PB of cache disk available for IF experiments
- Batch – data processing:
  - Local batch cluster: ~40 nodes
  - Grid slots at Fermilab for NOvA: 1300
  - Remote batch slots: generation/simulation ready!
  - Off-site resources via novacfs and OSG oasis CVMFS servers
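As an illustration of keeping an eye on the BlueArc areas listed above, here is a minimal sketch (not an official NOvA tool) that reports how full each area is, assuming the /nova/* paths from the slide are mounted on the node where it runs:

```python
# Illustrative sketch: report usage of the BlueArc interactive-storage areas
# named on the slide, assuming they are mounted at the paths shown there.
import shutil

BLUEARC_AREAS = ["/nova/data", "/nova/prod", "/nova/ana"]

def report(path):
    total, used, free = shutil.disk_usage(path)  # values in bytes
    tb = 1024 ** 4
    print(f"{path}: {used / tb:.1f} TB used of {total / tb:.1f} TB "
          f"({free / tb:.1f} TB free)")

if __name__ == "__main__":
    for area in BLUEARC_AREAS:
        try:
            report(area)
        except FileNotFoundError:
            print(f"{area}: not mounted on this node")
```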

Slide 4
- ECL (Electronic Collaboration Logbook)
  - Two logbooks currently in use for NOvA:
    - Control Room – general DAQ and operations
    - Ash River Construction – assembly and outfitting
  - ECL is also used as the shift scheduler and for other collaboration tools
- Databases
  - Online and offline databases (development and production); improved monitoring tools requested (performance monitoring)
  - Offline conditions database access via web server
  - NOvA hardware databases and applications
  - IF Beams databases and application layers (beam-spill information)

Slide 5
- Off-site resources via novacfs and OSG oasis CVMFS servers
- NOvA can currently run batch jobs at multiple offsite farms:
  - SMU, OSC, Harvard, Prague, Nebraska, San Diego, Indiana and U. Chicago
  - We use the NOvA-dedicated sites for GENIE simulation generation
- Successfully ran a first round of ND cosmics generation on Amazon EC2 (Elastic Compute Cloud), with a lot of assistance from the FNAL OSG group
  - Amazon spot-price charges for 1000 ND cosmics jobs: cloud ~$40, data transfer ~$27 (~230 GB)
  - Jobs can access files using SAM and write output back to FNAL
- For undirected projects, FermiGrid will consume 75% of jobs
- We need a job-steering site-prioritisation system to maximise use of our dedicated sites (a conceptual sketch follows this slide)
- MC generation and reconstruction have been run off-site successfully, so this is a viable use of resources
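The requested site-prioritisation system did not exist at the time of this talk; the following is a purely hypothetical sketch of what such a policy could look like. The site names come from the slide, but the weights and the choose_site() policy are illustrative assumptions, not an existing tool:

```python
# Hypothetical sketch of a job-steering site-prioritisation policy: prefer
# NOvA-dedicated sites for GENIE generation, then opportunistic OSG sites,
# then paid cloud capacity. Weights and policy are illustrative only.
import random

SITE_WEIGHTS = {
    # dedicated sites get the highest weight
    "SMU": 10, "OSC": 10, "Harvard": 10, "Prague": 10,
    # opportunistic OSG sites
    "Nebraska": 3, "SanDiego": 3, "Indiana": 3, "UChicago": 3,
    # cloud as a last resort (it costs real money)
    "AmazonEC2": 1,
}

def choose_site(available):
    """Pick one available site, biased toward dedicated resources."""
    candidates = [(s, SITE_WEIGHTS[s]) for s in available if s in SITE_WEIGHTS]
    if not candidates:
        raise RuntimeError("no known site available")
    sites, weights = zip(*candidates)
    return random.choices(sites, weights=weights, k=1)[0]

print(choose_site(["Harvard", "Nebraska", "AmazonEC2"]))
```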

Slide 6
- NOvA uses ART as its underlying framework
  - We attend and provide input to the weekly ART stakeholders meetings
  - Fast turnaround for features required for SAM interfacing (rollout of new ART releases)
- Relocatable ups system (/nusoft/app/externals)
  - External packages (ROOT/GEANT4/GENIE etc.) and ART-dependent packages (SCD-provided binaries)
  - nutools – packages common to the IF experiments
- novasoft (/grid/fermiapp/nova/novaart/novasvn/)
  - Maintained by us; not a ups product
- Everything is distributed for SLF5 and SLF6 via CVMFS: oasis.opensciencegrid.org and novacfs.fnal.gov (an availability check is sketched below)
- Development environment based on the SRT build system and an svn repository
  - Proven to work with the cmake build system as well
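A minimal sketch of the kind of check an offsite job might make before setting up novasoft: confirm that one of the CVMFS repositories named on the slide is visible on the worker node. The /cvmfs/<repository> mount paths are an assumption based on the usual CVMFS convention, not something stated on the slide:

```python
# Sketch: verify that a CVMFS repository carrying the NOvA software is
# mounted before attempting to set up the release (mount paths assumed).
import os

CVMFS_REPOS = [
    "/cvmfs/oasis.opensciencegrid.org",   # OSG oasis server
    "/cvmfs/novacfs.fnal.gov",            # NOvA-specific server (assumed path)
]

def cvmfs_available():
    """Return the first mounted repository path, or None if none is visible."""
    for repo in CVMFS_REPOS:
        if os.path.isdir(repo) and os.listdir(repo):
            return repo
    return None

repo = cvmfs_available()
print(f"software will be taken from: {repo}" if repo else "no CVMFS repository mounted")
```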

Slide 7
- Initially all data and Monte Carlo files lived on BlueArc, handled by hand and archived manually and/or deleted
- All simulation, reconstruction and particle identification were previously processed onsite
  - With the exception of the custom Library Event Matching (LEM) algorithm
  - LEM is processed at Caltech: it requires a large amount of total memory per machine (O(100 GB)), which breaks Condor policy
- Everyone writes at will to all BlueArc disks
  - 100,000s of files, organised by directory tree
  - No user tools for archive storage and retrieval
- This was a short-term solution before the transition to SAM; hence the ongoing development of an updated system better suited to scalability

Slide 8
December 19th & March 27th!

Slide 9
- Our detector and MC data are more than can be managed with BlueArc local disk
- Solution: use SAM (which worked for D0/CDF) for dataset management, interfaced with tape and a large dCache disk
- Each file declared to SAM must have metadata associated with it that can be used to define datasets (see the sketch below)
- SAM alleviates the need to store large datasets on local disk and helps ensure that all datasets are archived to tape
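To make the "metadata that defines datasets" idea concrete, here is a hedged sketch of declaring a file and turning a metadata query into a named dataset, assuming the samweb command-line client is set up in the environment. The file name, metadata field values and dimension string are illustrative placeholders; the exact required fields are defined by the experiment's SAM schema:

```python
# Sketch only: declare a file to SAM with a metadata record, then define a
# dataset as a named query over that metadata. Requires a working samweb
# client and SAM credentials; values below are placeholders.
import json
import subprocess
import tempfile

metadata = {
    "file_name": "example_fardet_r00012345_s01_data.root",  # placeholder name
    "file_size": 123456789,
    "data_tier": "raw",           # tier used later to build dataset queries
    "runs": [[12345, 1, "physics"]],
}

# Write the metadata record and hand it to samweb.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(metadata, f)
    md_file = f.name

subprocess.run(["samweb", "declare-file", md_file], check=True)

# A dataset definition is then just a named metadata query, for example:
subprocess.run(["samweb", "create-definition", "example_raw_fardet",
                "data_tier raw and run_number 12345"], check=True)
```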

Slide 10
- Take advantage of off-site resources
- Move to SAM for dataset management and delivery, to reduce the dependence on local disk
  - Store only final-stage data files and CAFs (Common Analysis Files) on BlueArc
  - ~250 TB on tape, 70 TB on disk
- Monte Carlo throughput matching estimates
- Streamline database usage
- Develop a scalable and robust production framework
  - Define and document the workflow for each production task
  - Improve reporting of production effort and progress
- Need a fast turnaround of data and MC file processing for the collaboration in the lead-up to conferences and collaboration meetings
  - Understanding the lead time of each step is critical with first analysis results in mind
- We successfully implemented and demonstrated the above on NOvA! (Always room for improvement, of course.)

Slide 11
- Goal: first physics results late 2014
  - νe appearance analysis POT
  - νμ disappearance analysis POT
  - Normalization to the near detector
- Aggressive scale and schedule
  - Expect to run/rerun multiple versions of the production chain due to the immaturity of the code base
  - Final phases of analysis currently being tested
- Need development effort/support for the scale of data being processed, simulated and analyzed

Slide 12
- Perform production data processing and MC generation (matched to detector configurations) about twice per year
  - Productions scheduled in advance of NOvA collaboration meetings and/or summer/winter conferences
  - Full reprocessing of raw data once per year (current data volume 400+ TB)
- Need to store raw physics and calibration datasets from the Far Detector, as well as processed Far Detector data
- Need to store Monte Carlo sets corresponding to the Near and Far Detectors, matched to the current production
- Need to store datasets processed with multiple versions of the reconstruction
- Need slots dedicated to production efforts during production/reprocessing peaks to complete the simulation/analysis chains within a few weeks

Slide 13

System | Tool | Status
Tape storage | Enstore/dCache | Fully integrated with online/offline. Approaching 1 PB to tape.
Framework | ART | Fully integrated with offline, simulation and triggering.
Data handling w/ framework | SAM/IFDH-art | Fully integrated with offline. Used for project accounting, file delivery and copy-back.
Redesign of offline databases | SCD conditions DBI and art services | In use by the experiment. Scaling issues.
General data handling and file delivery | SAMweb/IFDH/dCache/xrootd | Fully integrated with production activities. Adoption by analysis users has started.
Grid resources | FermiGrid, OSG | Fully integrated with production activities. OSG on board. Additional dedicated sites available through OSG interfaces.

Slide 14

System | Tool | Status
Central storage | BlueArc | Exhausted capacity. Does not scale under production load. Widespread outages. Maintenance/cleanup nightmare.
Databases (performance) | Conditions, Hardware, IFBeams | Concurrency limits. Query response times. Caching issues (squids). Monitoring tools.
Job submission & workflows with SAM | Jobsub/IFDH/SAM | Complexity of workflows, wrappers, etc. Not fully implemented for all workflows. Not fully implemented for final analysis layers.
File staging (copy-back) | dCache & BlueArc, Fermi-FTS | Performance bottleneck for the entire production system. Access protocols available to validate files are buggy.

Slide 15: Production (FY14/FY15)
- Goal: updated physics results ~2 times per year
- Production with current software over the full data set + corresponding Monte Carlo
- Dominated by MC generation/simulation and off-beam (zero-bias) calibration data processing
- Average beam-event data size is tiny (calibration data is 100x beam):
  - Near Detector: ~6.7 KB per event
  - Far Detector: ~247 KB per event (7.4 TB/yr)
- Average Monte Carlo event size (Near Detector): 1 MB; average generation time: 10 s/event
- Estimated at ~1 M CPU hours / year (a rough cross-check of these numbers is sketched below)
- This motivated NOvA to move a large fraction of Monte Carlo production to offsite facilities (collaborating institutions, OSG and even the Amazon cloud)
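A back-of-the-envelope check of the quantities quoted above, using only the numbers on the slide. The slide's ~1 M CPU hours estimate also covers calibration and reconstruction passes, so the second figure is an upper bound on MC statistics rather than the actual plan:

```python
# Illustrative arithmetic only, based on the per-event sizes and times above.
KB, MB, TB = 1e3, 1e6, 1e12

far_det_event_size = 247 * KB           # per beam event
far_det_volume_per_year = 7.4 * TB
events_per_year = far_det_volume_per_year / far_det_event_size
print(f"implied Far Detector beam events/yr: {events_per_year:.2e}")  # ~3.0e7

mc_time_per_event = 10.0                # seconds per generated event
cpu_hours_per_year = 1e6
mc_events_if_all_cpu_went_to_mc = cpu_hours_per_year * 3600 / mc_time_per_event
print(f"MC events/yr if the full CPU budget were generation: "
      f"{mc_events_if_all_cpu_went_to_mc:.2e}")                       # ~3.6e8
```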

Slide 16: Production (FY14/FY15)
- The resources we have requested are designed to complete NOvA's move to offsite MC generation and to ensure the success of the transition to large-scale data handling and job management (with SAM)
- Storage services
  - The central disk (BlueArc) production area is sized to accommodate production + data-handling infrastructure:
    - Retain ~300 TB of space
    - Recover and reallocate space currently used by production for use as "project disks" for analysis
    - Utilize available space for final CAF ntuples and custom skims
  - Additional (non-tape-backed) disk storage (dCache) is required for production activities (current non-volatile: 266 TB, shared); this temporary storage is used for intermediate files, data validation and registration into the SAM data catalog
  - Tape storage will be significant for the coming year (~1.5 PB)

Slide 17: Production (FY14/FY15)
- CVMFS (production service)
  - Required for distribution of the code base and libraries from FNAL
  - Enables offsite running and OSG integration; the OSG-provided Oasis servers require additional engineering to meet NOvA requirements
- Batch
  - Require support and development of improved monitoring tools, expanding the FIFEmon suite to include user-specified reporting
  - Require support of the IF-JOBSUB suite and development effort for requested features to support offsite computing resources (an example offsite submission is sketched below)
  - Continued effort and development for integrating off-site MC generation with the FNAL grid and OSG infrastructure; SMU, Ohio Supercomputing, Harvard and many other sites have already been demonstrated
  - Recently generated MC on the Amazon cloud
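A hedged sketch of how an offsite MC job might be submitted with the FIFE jobsub client mentioned above. The option names follow the jobsub_client documentation as we understand it, and the wrapper-script path is hypothetical; treat both as assumptions to check against current documentation:

```python
# Sketch: submit 100 offsite MC job sections via jobsub_submit
# (flags assumed from FIFE jobsub documentation; script path is hypothetical).
import subprocess

cmd = [
    "jobsub_submit",
    "-G", "nova",                                   # experiment/group
    "-N", "100",                                    # number of job sections
    "--resource-provides=usage_model=OFFSITE",      # request offsite slots
    "file:///nova/app/users/example/run_genie.sh",  # hypothetical wrapper script
]
subprocess.run(cmd, check=True)
```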

Slide 18: Production (FY14/FY15)
- Data handling
  - Support for the SAM data management system is essential for NOvA production and analysis
  - Support for IF Data Handling (IFDH) and the IF File Transfer Service (IF-FTS) is essential for production and analysis
  - Continued support/development resources are requested for further integration of SAM, IFDH and IF-FTS with the NOvA computing suite
- Database
  - Support of a Database Interface (DBI) capable of handling NOvA peak usage and scaling
  - Support for development and integration of database resources with offsite Monte Carlo and production running
  - Additional server infrastructure and improved monitoring may be required
- Development/consulting effort
  - NOvA is in the early stages of the ramp-up to full physics production
  - Support for the batch system, and consulting effort to further integrate the batch computing model with the NOvA software, is desired
  - Dedicated SCD or PPD effort to manage production tasks is highly desired!

Slide 19
- Feedback from the collaboration on the utility of SAM is very positive
- Scalability is proven in the IF experiment environment
- Successfully processed the required full data and MC datasets in time for Neutrino 2014

[Figure: example SAM project processing NOvA calibration data – "File Busy Time by process", time vs process ID, where each green box represents a single "nova" job consuming one data file. The consumer pattern behind this plot is sketched below.]
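A conceptual sketch of the SAM project / consumer-process pattern that the plot illustrates: each grid job attaches to a shared project, pulls one file at a time, processes it, and reports its status back. The helper functions here are hypothetical stand-ins for the real samweb/ifdh calls, shown only to make the control flow concrete:

```python
# Sketch of a SAM consumer loop; get_next_file/release_file are placeholders
# for the real samweb/ifdh file-delivery calls, not actual API names.

def get_next_file(project, process_id):
    """Stand-in for the SAM 'get next file' call; returns None when the
    project has no more files for this process."""
    raise NotImplementedError("replace with the real samweb/ifdh call")

def release_file(project, process_id, filename, status):
    """Stand-in for reporting 'consumed' or 'skipped' back to SAM."""
    raise NotImplementedError("replace with the real samweb/ifdh call")

def run_consumer(project, process_id, process_one_file):
    """One 'nova' job = one green box in the plot: a loop over delivered files."""
    while True:
        filename = get_next_file(project, process_id)
        if filename is None:
            break                         # project exhausted for this process
        ok = process_one_file(filename)
        release_file(project, process_id, filename,
                     "consumed" if ok else "skipped")
```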

Slide 20
- The NOvA production group received a large amount of coordination, cooperation and support from the SCD Data Handling group in the transition to SAM
- In the lead-up to Neutrino 2014 we fully utilised the SAM system for production purposes, with great success and fast turnaround
- Off-site resources are performing well; CPU is not a limiting factor
  - A site-prioritisation scheme is requested
  - Improved feedback on the completion of software-update pushes is requested
- A NOvA SAM tutorial to help educate the collaboration was successful
  - Collaborators are utilising the SAM system with limited cases of difficulty
- Crucial efforts to streamline NOvA require future SCD personnel:
  - FTS + SAM/ART streamlining
  - Database development/support
  - Offsite MC
- NOvA looks forward to understanding the future role of FIFE in production job submission/monitoring

Slide 21

Slide 22: Production Overview
- Output from a workshop last fall to estimate our production needs resulted in the following goals:
  - The footprint for the final output of a production run should be less than 100 TB
  - A production run should be possible to complete in a two-week period (the implied throughput is sketched below)
- There was also a major effort to understand resources and to streamline production tools
  - (Caveat: we are still validating these numbers from the latest production round; initial accounting appears to match the estimates)
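The quick arithmetic behind those two goals: writing a 100 TB final footprint inside a two-week window implies roughly the following average sustained output rate (illustrative only; real production output is much burstier):

```python
# Average output rate implied by the 100 TB / two-week goals above.
footprint_bytes = 100e12
window_seconds = 14 * 24 * 3600
print(f"average output rate: {footprint_bytes / window_seconds / 1e6:.0f} MB/s")  # ~83 MB/s
```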

Slide 23: FY14 Resource Request Summary

Resource | Type | FY14 Addition | Total
Central Disk | BlueArc | 0 TB | 335 TB
Archival Storage | Enstore tape | 1.5 PB | > 2 PB
Cache Disk | dCache | 0.5 PB |
Batch Processing | Grid slots | -- | 1300*
 | Local batch | -- | 40
Interactive Login | SLF VMs | -- | 10/10

*Does not include opportunistic usage, offsite resources and requested "burst" capabilities.

Slide 24: FY15 Resource Request Summary

Resource | Type | FY15 Addition | Total
Central Disk | BlueArc | 0 TB | 335 TB
Archival Storage | Enstore tape | 1.5 PB | > 3.5 PB
Cache Disk | dCache | 0.5 PB | 1 PB
Batch Processing | Grid slots | 500* | 1800*
 | Local batch | -- | 40
Interactive Login | SLF VMs | -- | 10/10

*Assumes that we can demonstrate the need for an increased quota.
Note: does not include opportunistic usage, offsite resources and requested "burst" capabilities.

Slide 25
[Figure: FermiGrid CPU usage over the last month and year]

Slide 26
- While integrating the SAM system into our production model, the database was identified as a bottleneck
  - February 7th meeting with NOvA and SCD key personnel
- Several improvements implemented:
  - Replica server, more slots in queues, "try n times" retries (sketched below)
- Also many recent updates to the IFBeam DB
  - This is where we get our protons-on-target count
- Our stress tests have shown that >1500 jobs can now run concurrently
- Areas for improvement identified and implemented
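A hedged sketch of the client-side "try n times" mitigation mentioned above: retry a database query a fixed number of times with a growing delay, so that a transiently overloaded server does not kill the whole grid job. The query_db() callable is a placeholder for the real conditions-DB access:

```python
# Sketch: retry a DB query with exponential backoff ("try n times").
import time

def query_with_retries(query_db, attempts=5, initial_delay=10.0):
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return query_db()
        except Exception as err:            # real code would catch narrower errors
            if attempt == attempts:
                raise                       # give up after the last attempt
            print(f"attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2                      # back off between tries
```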

Slide 27
- Tests of the system are ongoing
- We require more resources, but we need to understand our needs and specify our requirements
  - A final assessment of the recent production run, and new database metrics in our jobs, will go a long way to help here
- We need improved database monitoring tools
  - E.g. we appear to have a short-lived memory spike that pushes some jobs over the Condor memory threshold (4 GB); debugging is proving very difficult (one possible in-job check is sketched below)
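One possible way (an illustrative sketch, not an existing NOvA tool) to chase such a short-lived spike from inside the job itself: sample the process's peak resident set size periodically and log a warning when it approaches the 4 GB threshold. On Linux, ru_maxrss is reported in kilobytes:

```python
# Sketch: background watcher that logs when peak RSS nears the 4 GB limit.
import resource
import threading
import time

CONDOR_LIMIT_KB = 4 * 1024 * 1024   # the 4 GB threshold mentioned above

def watch_memory(interval=5.0):
    def loop():
        while True:
            peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            if peak_kb > 0.9 * CONDOR_LIMIT_KB:
                print(f"warning: peak RSS {peak_kb / 1024 / 1024:.2f} GB, "
                      f"close to the 4 GB limit")
            time.sleep(interval)
    threading.Thread(target=loop, daemon=True).start()
```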

Slide 28
[Figure: database connections with ~1400 concurrent NuMI PCHit jobs, for the ifdbprod and ifdbrep servers]
- ifdbprod and ifdbrep are each configured to take 200 direct connections, so they handle 1400 concurrent jobs well, with headroom
- Improved monitoring of DB connections would be useful (under discussion)